fix(traces): cap captured body size to keep admin Traces UI responsive (#9946)

The trace middleware buffered the full request and response bodies for every JSON exchange. With a chatty agent-pool RAG workload, /embeddings responses (large vector arrays) accumulated to tens of MB in the in-memory buffer; the admin Traces page would then download and parse 40+ MB on every load and on every 5s auto-refresh, locking the UI in a loading state.

Add LOCALAI_TRACING_MAX_BODY_BYTES (default 64 KiB) that caps each captured body. The full payload still flows through to the real client; only the trace copy is bounded. Exchanges record body_truncated and original body_bytes so the dashboard can show that truncation happened. The cap is configurable via env, CLI, and runtime_settings.json.

Also unblock recovery: the Traces page now keeps the Clear button enabled while loading, since "buffer too large to render" is exactly when the user needs to clear it.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
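
The capping behaviour described in this message can be sketched roughly as below. This is a minimal illustration, not the middleware code from the commit: the `capturedBody` type and `capBody` function are hypothetical names, and treating a non-positive limit as "no cap" is an assumption. Only the 64 KiB default and the `body_truncated` / `body_bytes` field names come from the message itself.

```go
package main

import "fmt"

// capturedBody is a hypothetical trace record for one request or response
// body. The field names mirror body_truncated and body_bytes from the
// commit message; the real struct may look different.
type capturedBody struct {
	Body          []byte // bounded copy kept in the trace buffer
	BodyTruncated bool   // true when the copy was cut at the cap
	BodyBytes     int    // size of the original, uncapped body
}

// capBody returns a bounded copy of body for the trace buffer. The caller
// keeps streaming the original slice to the client; only the copy is cut.
// Assumption: maxBytes <= 0 means "no cap".
func capBody(body []byte, maxBytes int) capturedBody {
	out := capturedBody{BodyBytes: len(body)}
	if maxBytes > 0 && len(body) > maxBytes {
		out.Body = append([]byte(nil), body[:maxBytes]...)
		out.BodyTruncated = true
		return out
	}
	out.Body = append([]byte(nil), body...)
	return out
}

func main() {
	const maxBodyBytes = 64 * 1024 // default named in the commit message

	// A 200 KiB payload standing in for a large /embeddings response.
	payload := make([]byte, 200*1024)
	c := capBody(payload, maxBodyBytes)
	fmt.Println(len(c.Body), c.BodyTruncated, c.BodyBytes) // prints "65536 true 204800"
}
```

Recording the original size next to the truncated copy is what lets a dashboard report that truncation happened without holding the full payload.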
fix(openai): stream usage non-zero when tools are enabled (#9941)
2026-05-22 15:50:31 -04:00 · 2026-05-22 15:29:24 +02:00 · 2026-05-22 10:13:41 +02:00 · 2026-05-22 09:49:33 +02:00 · 2026-05-22 08:31:49 +02:00 · 2026-05-22 00:31:19 +02:00
128 changed files with 6054 additions and 765 deletions
--- a/.agents/adding-backends.md
+++ b/.agents/adding-backends.md
@@ -112,6 +112,8 @@ Add a YAML anchor definition in the `## metas` section (around line 2-300). Look

 Add image entries at the end of the file, following the pattern of similar backends such as `diffusers` or `chatterbox`. Include both `latest` (production) and `master` (development) tags.

+**Note on integrity:** OCI backends installed from a gallery whose `verification:` block is set are verified against a keyless-cosign policy before extraction; tarball/HTTP backends use the optional `sha256:` field. New backends do not need any extra YAML — the gallery-level `verification:` block covers every entry. See [.agents/backend-signing.md](backend-signing.md) for the producer-side CI step.
+
 ## 4. Update the Makefile

 The Makefile needs to be updated in several places to support building and testing the new backend:
--- a/.agents/backend-signing.md
+++ b/.agents/backend-signing.md
@@ -0,0 +1,120 @@
+# Backend image signing & verification
+
+LocalAI verifies backend OCI images against a per-gallery keyless-cosign
+policy. This page documents the trust model, the producer side
+(`.github/workflows/backend_merge.yml` in this repo), and the consumer
+side (`pkg/oci/cosignverify` plus the gallery YAML).
+
+## Trust model
+
+- **Producer:** `.github/workflows/backend_merge.yml` signs each pushed
+  manifest list with `cosign sign --recursive` in keyless mode after
+  `docker buildx imagetools create`. The signing cert is issued by
+  Fulcio bound to the workflow's OIDC identity. There is no long-lived
+  signing key. `--recursive` signs both the manifest list and every
+  per-arch entry — needed because our consumer resolves a tag to a
+  per-arch manifest before checking signatures.
+- **Storage:** Signatures are written as OCI 1.1 referrers
+  (`--registry-referrers-mode=oci-1-1`) in the new Sigstore bundle format
+  (`--new-bundle-format`). No `:sha256-<hex>.sig` tag clutter.
+- **Consumer:** `pkg/oci/cosignverify` discovers the bundle via the
+  referrers API, hands it to `sigstore-go`, and verifies it against the
+  policy declared in the gallery YAML (`Gallery.Verification`).
+- **Revocation:** Keyless cosign certs are ephemeral (10-minute Fulcio
+  validity), so revocation is policy-side, not CA-side. The gallery's
+  `verification.not_before` (RFC3339) is the kill-switch — advance it to
+  invalidate every signature produced before a known compromise window.
+
+## Producer setup
+
+`backend_merge.yml` is the workflow that joins per-arch digests into the
+multi-arch manifest list users actually pull, so it's also the right place
+to sign. The job needs:
+
+- `permissions: { id-token: write, contents: read }` at the job level so
+  the runner can exchange its GitHub OIDC token for a Fulcio cert.
+- `sigstore/cosign-installer@v3` step (cosign ≥ 2.2 for
+  `--new-bundle-format`).
+- After each `docker buildx imagetools create`, resolve the resulting
+  list digest with `docker buildx imagetools inspect <tag> --format
+  '{{.Manifest.Digest}}'` and sign:
+
+```sh
+cosign sign --yes --recursive \
+  --new-bundle-format \
+  --registry-referrers-mode=oci-1-1 \
+  "${REGISTRY_REPO}@${DIGEST}"
+```
+
+Sign by digest, never by tag — signing by tag binds the signature to
+whatever the tag points at *now*, and a subsequent tag push orphans it.
+
+`backend_build_darwin.yml` builds and pushes single-arch darwin images
+that bypass the manifest-list merge. If/when those entries get a gallery
+`verification:` policy, the equivalent cosign step has to land there
+too.
+
+## Consumer setup (in `mudler/LocalAI` gallery YAML)
+
+Once CI is signing, add a `verification:` block to the backend gallery
+entry (`backend/index.yaml`):
+
+```yaml
+- name: localai
+  url: github:mudler/LocalAI/backend/index.yaml@master
+  verification:
+    issuer: "https://token.actions.githubusercontent.com"
+    identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@refs/heads/master$"
+    # Optional revocation cutoff; advance during incident response.
+    # not_before: "2026-06-01T00:00:00Z"
+```
+
+Identity matching pins the OIDC subject Fulcio issued the signing cert
+to. Without this, any image signed by *anyone* with a Fulcio cert would
+pass — the regex is what makes a signature mean "produced by our CI".
+
+## Strict mode
+
+Default behaviour: OCI backends without a `verification:` block install
+with a warning (logs include `installing OCI backend without signature
+verification`). Tarball/HTTP backends without a `sha256` field log a
+similar warning.
+
+For production, set `LOCALAI_REQUIRE_BACKEND_INTEGRITY=1` (or pass
+`--require-backend-integrity` to `local-ai run` / `local-ai backends
+install` / `local-ai models install`). The warning becomes a hard error
+and unverifiable backends refuse to install.
+
+## Revocation playbook
+
+If `backend_merge.yml` (or any workflow with `id-token: write`) is
+compromised and we've shipped malicious signed images:
+
+1. **Identify the compromise window.** Find the earliest IntegratedTime
+   from the bad signatures (Rekor search by `subject` filter).
+2. **Set `verification.not_before`** in `backend/index.yaml` to a
+   timestamp just *after* the end of that window.
+3. **Push the YAML.** Deployed LocalAI instances pick it up on next
+   gallery refresh (1-hour cache in `core/gallery/gallery.go`).
+4. **Fix the underlying compromise** in the workflow and re-sign images
+   with the new build, which will have IntegratedTime > `not_before`.
+5. **Optional:** for absolute decisiveness, also rotate to a new
+   workflow path (`backend_merge_v2.yml`) and update `identity_regex`.
+
+## Where the code lives
+
+- `pkg/oci/cosignverify/` — verifier, policy, OCI referrer fetch, NotBefore enforcement.
+- `pkg/downloader/uri.go` — `WithImageVerifier` option threaded through `DownloadFileWithContext`.
+- `core/gallery/backends.go` — `backendDownloadOptions` builds the verifier from the gallery's policy.
+- `core/config/gallery.go` — `Gallery.Verification` YAML schema.
+- `core/cli/run.go`, `core/cli/backends.go`, `core/cli/models.go` — `--require-backend-integrity` flag propagation.
+- `.github/workflows/backend_merge.yml` — producer-side `cosign sign --recursive` after each multi-arch manifest list push.
+
+## Out of scope (follow-ups)
+
+- **Signing the gallery YAML itself.** The index is fetched over HTTPS
+  from GitHub; we trust the host. A cosign blob signature on the YAML
+  would close that gap but adds key-management overhead. Revisit this
+  page if/when added.
+- **Tarball/HTTP backend signing.** Cosign can sign arbitrary blobs, but
+  for now non-OCI backends keep using the `sha256:` field in YAML.
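
The policy checks described in backend-signing.md (issuer pin, identity regex, `not_before` cutoff) can be sketched as follows. This is only an illustration under stated assumptions: the `policy` and `signature` types and the `check` function are hypothetical and are not the `pkg/oci/cosignverify` API, and the sketch assumes the bundle's cryptographic verification has already succeeded and yielded the issuer, identity, and integrated time.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"time"
)

// policy is a hypothetical stand-in for the gallery `verification:` block.
type policy struct {
	Issuer        string
	IdentityRegex string
	NotBefore     string // RFC3339; empty means no revocation cutoff
}

// signature holds facts assumed to come from an already-verified bundle.
type signature struct {
	Issuer         string
	Identity       string
	IntegratedTime time.Time
}

// mustTime parses an RFC3339 timestamp and panics on malformed input.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// check applies the three policy rules in order and reports the first failure.
func check(p policy, s signature) error {
	if s.Issuer != p.Issuer {
		return errors.New("issuer mismatch")
	}
	re, err := regexp.Compile(p.IdentityRegex)
	if err != nil {
		return fmt.Errorf("bad identity_regex: %w", err)
	}
	if !re.MatchString(s.Identity) {
		return errors.New("identity does not match identity_regex")
	}
	if p.NotBefore != "" {
		cutoff, err := time.Parse(time.RFC3339, p.NotBefore)
		if err != nil {
			return fmt.Errorf("bad not_before: %w", err)
		}
		if s.IntegratedTime.Before(cutoff) {
			return errors.New("signature predates not_before")
		}
	}
	return nil
}

// examplePolicy mirrors the YAML example, with the optional cutoff enabled.
func examplePolicy() policy {
	return policy{
		Issuer:        "https://token.actions.githubusercontent.com",
		IdentityRegex: `^https://github\.com/mudler/LocalAI/\.github/workflows/backend_merge\.yml@refs/heads/master$`,
		NotBefore:     "2026-06-01T00:00:00Z",
	}
}

// exampleSignature is a signature from the pinned workflow, after the cutoff.
func exampleSignature() signature {
	return signature{
		Issuer:         "https://token.actions.githubusercontent.com",
		Identity:       "https://github.com/mudler/LocalAI/.github/workflows/backend_merge.yml@refs/heads/master",
		IntegratedTime: mustTime("2026-06-02T00:00:00Z"),
	}
}

func main() {
	p := examplePolicy()
	fmt.Println("pinned workflow:", check(p, exampleSignature()))

	fork := exampleSignature()
	fork.Identity = "https://github.com/someone/fork/.github/workflows/backend_merge.yml@refs/heads/master"
	fmt.Println("fork workflow:", check(p, fork))

	old := exampleSignature()
	old.IntegratedTime = mustTime("2026-05-01T00:00:00Z")
	fmt.Println("pre-cutoff signature:", check(p, old))
}
```

The anchored regex is what rejects a validly signed image from a fork, and advancing `NotBefore` rejects otherwise valid signatures issued before the cutoff.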
--- a/.agents/llama-cpp-backend.md
+++ b/.agents/llama-cpp-backend.md
@@ -61,6 +61,12 @@ Always check `llama.cpp` for new model configuration options that should be supp
   - `reasoning_format` - Reasoning format options
   - Any new flags or parameters

+### Speculative Decoding Types
+
+The `spec_type` option in `grpc-server.cpp` delegates to upstream's `common_speculative_types_from_names()`, so new speculative types added to the `common_speculative_type_from_name` map in `common/speculative.cpp` are picked up automatically with no code changes - only docs need an entry in `docs/content/advanced/model-configuration.md`. Current values: `none`, `draft-simple`, `draft-eagle3`, `draft-mtp`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, `ngram-cache`.
+
+`draft-mtp` (Multi-Token Prediction, [ggml-org/llama.cpp#22673](https://github.com/ggml-org/llama.cpp/pull/22673)) does not need a separate draft GGUF: when `spec_type` includes `draft-mtp` and `draftmodel` is empty, the upstream server creates an MTP context off the target model itself. LocalAI's gRPC layer needs no changes for this — it works through the existing `params.speculative.types` plumbing and the derived `cparams.n_rs_seq = params.speculative.need_n_rs_seq()` in `common_context_params_to_llama`.
+
 ### Implementation Guidelines

 1. **Feature Parity**: Always aim for feature parity with llama.cpp's implementation
--- a/.github/workflows/backend_merge.yml
+++ b/.github/workflows/backend_merge.yml
@@ -31,6 +31,13 @@ on:
 jobs:
  merge:
    runs-on: ubuntu-latest
+    # id-token: write is required for keyless cosign — the workflow
+    # exchanges the GitHub OIDC token for a short-lived Fulcio cert that
+    # signs each pushed manifest. Without this permission the runner
+    # cannot mint the token, and `cosign sign` fails with "no token".
+    permissions:
+      contents: read
+      id-token: write
    env:
      quay_username: ${{ secrets.quayUsername }}
    steps:
@@ -57,6 +64,15 @@ jobs:
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@master

+      # cosign signs each pushed manifest list with --recursive so the
+      # index and every per-arch entry get an attached Sigstore bundle.
+      # 2.2+ is required for --new-bundle-format.
+      - name: Install cosign
+        if: github.event_name != 'pull_request'
+        uses: sigstore/cosign-installer@v3
+        with:
+          cosign-release: 'v2.4.1'
+
      - name: Login to DockerHub
        if: github.event_name != 'pull_request'
        uses: docker/login-action@v4
@@ -120,11 +136,26 @@ jobs:
          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          if [ -z "$tags" ]; then
            echo "No quay.io tags from docker/metadata-action; skipping quay merge"
-          else
-            # shellcheck disable=SC2086
-            docker buildx imagetools create $tags \
-              $(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
+            exit 0
          fi
+          # shellcheck disable=SC2086
+          docker buildx imagetools create $tags \
+            $(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
+          # Resolve the manifest-list digest (any tag points at it) so
+          # cosign can sign by digest. Signing by tag would leave the
+          # signature orphaned the next time the tag moves.
+          first_tag=$(jq -cr '
+            .tags | map(select(startswith("quay.io/"))) | .[0]
+          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
+          digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
+          # --recursive walks the list and signs every per-arch entry
+          # too — clients that resolve a tag to a platform-specific
+          # manifest before checking signatures need the per-arch
+          # signatures, not just the list-level one.
+          cosign sign --yes --recursive \
+            --new-bundle-format \
+            --registry-referrers-mode=oci-1-1 \
+            "quay.io/go-skynet/local-ai-backends@${digest}"

      - name: Create manifest list and push (dockerhub)
        if: github.event_name != 'pull_request'
@@ -139,11 +170,19 @@ jobs:
          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          if [ -z "$tags" ]; then
            echo "No dockerhub tags from docker/metadata-action; skipping dockerhub merge"
-          else
-            # shellcheck disable=SC2086
-            docker buildx imagetools create $tags \
-              $(printf 'localai/localai-backends@sha256:%s ' *)
+            exit 0
          fi
+          # shellcheck disable=SC2086
+          docker buildx imagetools create $tags \
+            $(printf 'localai/localai-backends@sha256:%s ' *)
+          first_tag=$(jq -cr '
+            .tags | map(select(startswith("localai/"))) | .[0]
+          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
+          digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
+          cosign sign --yes --recursive \
+            --new-bundle-format \
+            --registry-referrers-mode=oci-1-1 \
+            "localai/localai-backends@${digest}"

      - name: Inspect manifest
        if: github.event_name != 'pull_request'
--- a/.github/workflows/image_build.yml
+++ b/.github/workflows/image_build.yml
@@ -106,6 +106,7 @@ jobs:
            type=ref,event=branch
            type=semver,pattern={{raw}}
            type=sha
+            type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
          flavor: |
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.tag-suffix }},onlatest=true
--- a/.github/workflows/image_merge.yml
+++ b/.github/workflows/image_merge.yml
@@ -80,6 +80,7 @@ jobs:
            type=ref,event=branch
            type=semver,pattern={{raw}}
            type=sha
+            type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
          flavor: |
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.tag-suffix }},onlatest=true
--- a/.gitignore
+++ b/.gitignore
@@ -77,3 +77,6 @@ local-backends/
 tests/e2e-ui/ui-test-server
 core/http/react-ui/playwright-report/
 core/http/react-ui/test-results/
+
+# Local worktrees
+.worktrees/
--- a/.golangci.yml
+++ b/.golangci.yml
@@ -46,8 +46,52 @@ linters:
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.Fail. See .agents/coding-style.md.'
        - pattern: '^t\.FailNow$'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.FailNow. See .agents/coding-style.md.'
+        # In-process config should flow through ApplicationConfig / kong-bound
+        # CLI flags, not via os.Getenv. The CLI layer is the legitimate
+        # env→struct boundary (kong's `env:"..."` tag); anything deeper that
+        # reads env directly leaks process state into business logic and
+        # makes flags impossible to test or override per-request. Backend
+        # subprocesses, the system/capabilities probe, and a few places that
+        # read non-LocalAI env vars (HOME, PATH, AUTH_TOKEN passed by parent)
+        # are exempt — see linters.exclusions.rules below.
+        - pattern: '^os\.(Getenv|LookupEnv|Environ)$'
+          msg: 'Plumb config through ApplicationConfig (or the relevant CLI struct) instead of reading env directly. CLI entry points (core/cli/) bind env vars via kong''s `env:` tag — that is the only sanctioned env→struct boundary. See .agents/coding-style.md.'
  exclusions:
    paths:
      # Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
      - 'backend/go/whisper/sources'
      - 'docs/'
+    rules:
+      # CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
+      # boundary, and a handful of subcommands legitimately propagate values
+      # to spawned subprocesses (LLAMACPP_GRPC_SERVERS, MLX hostfile, ...).
+      - path: ^core/cli/
+        text: 'os\.(Getenv|LookupEnv|Environ)'
+        linters: [forbidigo]
+      # Backend subprocesses are independent binaries with their own env
+      # surface; they're not "in-process config" of the LocalAI server.
+      - path: ^backend/
+        text: 'os\.(Getenv|LookupEnv|Environ)'
+        linters: [forbidigo]
+      # System capability probe reads HOME, PATH-style vars to discover
+      # GPUs, default paths, etc. — not LocalAI config.
+      - path: ^pkg/system/
+        text: 'os\.(Getenv|LookupEnv|Environ)'
+        linters: [forbidigo]
+      # gRPC server reads AUTH_TOKEN passed in by the parent process at spawn
+      # time; model.Loader sets/inherits env to communicate with subprocesses.
+      - path: ^pkg/grpc/
+        text: 'os\.(Getenv|LookupEnv|Environ)'
+        linters: [forbidigo]
+      - path: ^pkg/model/
+        text: 'os\.(Getenv|LookupEnv|Environ)'
+        linters: [forbidigo]
+      # Top-level main binaries (local-ai, launcher) are entry points.
+      - path: ^cmd/
+        text: 'os\.(Getenv|LookupEnv|Environ)'
+        linters: [forbidigo]
+      # Tests legitimately read $HOME, $TMPDIR, and gating env vars
+      # (LOCALAI_COSIGN_LIVE, etc.) to skip live-network specs.
+      - path: _test\.go$
+        text: 'os\.(Getenv|LookupEnv|Environ)'
+        linters: [forbidigo]
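
The pattern this lint rule pushes toward can be sketched as below: the value is set once at the CLI boundary and handed to business logic as a struct field, so deeper code never calls `os.Getenv`. The names here (`appConfig`, `appOption`, `withRequireBackendIntegrity`, `checkInstall`) are hypothetical and trimmed down for illustration; they only loosely mirror the `config.AppOption` / `options.RequireBackendIntegrity` usage visible elsewhere in this diff.

```go
package main

import (
	"errors"
	"fmt"
)

// appConfig is a hypothetical, reduced stand-in for ApplicationConfig.
type appConfig struct {
	RequireBackendIntegrity bool
}

// appOption mutates the config; the CLI layer would build these from flags
// (and their env bindings) exactly once.
type appOption func(*appConfig)

// withRequireBackendIntegrity is a hypothetical option constructor.
func withRequireBackendIntegrity(v bool) appOption {
	return func(c *appConfig) { c.RequireBackendIntegrity = v }
}

// newAppConfig applies the options over the defaults.
func newAppConfig(opts ...appOption) *appConfig {
	c := &appConfig{}
	for _, o := range opts {
		o(c)
	}
	return c
}

// checkInstall is business logic: it reads the setting from cfg rather than
// from the process environment, so a test can exercise both modes directly.
func checkInstall(cfg *appConfig, verifiable bool) error {
	if verifiable {
		return nil
	}
	if cfg.RequireBackendIntegrity {
		return errors.New("backend integrity required, but backend is unverifiable")
	}
	fmt.Println("warning: installing backend without integrity verification")
	return nil
}

func main() {
	lenient := newAppConfig()
	fmt.Println("lenient:", checkInstall(lenient, false))

	strict := newAppConfig(withRequireBackendIntegrity(true))
	fmt.Println("strict:", checkInstall(strict, false))
}
```

Because the setting arrives as a parameter, the same function can be tested in strict and lenient mode without touching the environment.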
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -31,6 +31,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 | [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
 | [.agents/adding-gallery-models.md](.agents/adding-gallery-models.md) | Adding GGUF models from HuggingFace to the model gallery |
 | [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) | LocalAI Assistant chat modality — adding admin tools to the in-process MCP server, editing skill prompts, keeping REST + MCP + skills in sync |
+| [.agents/backend-signing.md](.agents/backend-signing.md) | Backend OCI image signing (keyless cosign + sigstore-go) — producer-side CI setup, consumer-side gallery `verification:` block, strict mode (`LOCALAI_REQUIRE_BACKEND_INTEGRITY`), revocation via `not_before` |

 ## Quick Reference

--- a/backend/cpp/ds4/Makefile
+++ b/backend/cpp/ds4/Makefile
@@ -1,10 +1,10 @@
 # ds4 backend Makefile.
 #
-# Upstream pin lives below as DS4_VERSION?=950e8e6474a1c9fabe04e669d607606a7ef8824f
+# Upstream pin lives below as DS4_VERSION?=8d576642c39b9a2d782a80159ba84ef5a81c0b81
 # (.github/bump_deps.sh) can find and update it - matches the
 # llama-cpp / ik-llama-cpp / turboquant convention.

-DS4_VERSION?=950e8e6474a1c9fabe04e669d607606a7ef8824f
+DS4_VERSION?=8d576642c39b9a2d782a80159ba84ef5a81c0b81
 DS4_REPO?=https://github.com/antirez/ds4

 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=5cc0d86c760e9858e4bed4418400bb39dbe025f2
+IK_LLAMA_VERSION?=48a55f74e4c6e2aeda363dd386c1ac9170a0af71
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=1348f67c58f561808136e8a152a9eddec168f221
+LLAMA_VERSION?=bb28c1fe246b72276ee1d00ce89306be7b865766
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -517,16 +517,27 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
    params.warmup = true;
    // no_op_offload: disable host tensor op offload (default: false)
    params.no_op_offload = false;
-    // kv_unified: enable unified KV cache (default: false)
-    params.kv_unified = false;
-    // n_ctx_checkpoints: max context checkpoints per slot (default: 8)
-    params.n_ctx_checkpoints = 8;
-
-    // llama memory fit fails if we don't provide a buffer for tensor overrides
-    const size_t ntbo = llama_max_tensor_buft_overrides();
-    while (params.tensor_buft_overrides.size() < ntbo) {
-        params.tensor_buft_overrides.push_back({nullptr, nullptr});
-    }
+    // kv_unified: enable unified KV cache. Upstream's server auto-enables this
+    // when the slot count is auto (-np <0), bumping n_parallel to 4 alongside.
+    // LocalAI keeps n_parallel=1 by default, which would skip that auto path
+    // and leave kv_unified=false. We flip the default to true here so the
+    // server-side prompt cache (cache_idle_slots) is actually usable on the
+    // single-slot path that LocalAI ships with: without it, idle slots are
+    // never persisted across requests and the prompt cache is dead weight.
+    // Users can opt out with `options: [ "kv_unified:false" ]`.
+    params.kv_unified = true;
+    // n_ctx_checkpoints: max context checkpoints per slot. Match upstream's
+    // default (32); the previous LocalAI-specific 8 was unnecessarily tight
+    // and limits partial-prefix recovery without a clear memory rationale.
+    params.n_ctx_checkpoints = 32;
+    // cache_idle_slots: save and clear idle slot KV to the prompt cache on
+    // task switch. Upstream default is true; the server auto-disables it if
+    // kv_unified=false or cache_ram_mib=0, so flipping kv_unified above is
+    // what actually unlocks it.
+    params.cache_idle_slots = true;
+    // checkpoint_every_nt: create a context checkpoint every N tokens during
+    // prefill (-1 disables). Match upstream's default (8192).
+    params.checkpoint_every_nt = 8192;

     // decode options. Options are in form optname:optvale, or if booleans only optname.
    for (int i = 0; i < request->options_size(); i++) {
@@ -685,7 +696,29 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                try {
                    params.n_ctx_checkpoints = std::stoi(optval_str);
                } catch (const std::exception& e) {
-                    // If conversion fails, keep default value (8)
+                    // If conversion fails, keep default value (32)
+                }
+            }
+
+        // --- server-side idle-slot prompt cache toggle (upstream --cache-idle-slots) ---
+        // Saves the slot's KV state into the host-side prompt cache on task
+        // switch so a later request with the same prefix can warm-load it.
+        // Auto-disabled by the server if kv_unified=false or cache_ram=0.
+        } else if (!strcmp(optname, "cache_idle_slots") || !strcmp(optname, "idle_slots_cache")) {
+            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
+                params.cache_idle_slots = true;
+            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
+                params.cache_idle_slots = false;
+            }
+
+        // --- prefill checkpoint cadence (upstream -cpent / --checkpoint-every-n-tokens) ---
+        // -1 disables checkpointing during prefill.
+        } else if (!strcmp(optname, "checkpoint_every_nt") || !strcmp(optname, "checkpoint_every_n_tokens")) {
+            if (optval != NULL) {
+                try {
+                    params.checkpoint_every_nt = std::stoi(optval_str);
+                } catch (const std::exception& e) {
+                    // If conversion fails, keep default value (8192)
                }
            }

@@ -1081,6 +1114,20 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
        params.kv_overrides.back().key[0] = 0;
    }

+    // tensor_buft_overrides sentinel termination (mirrors upstream common/arg.cpp).
+    // Real entries are pushed during option parsing; here we pad/terminate so the
+    // model loader sees back().pattern == nullptr (GGML_ASSERT at common.cpp:1543)
+    // and so llama_params_fit has the placeholder slots it requires.
+    {
+        const size_t ntbo = llama_max_tensor_buft_overrides();
+        while (params.tensor_buft_overrides.size() < ntbo) {
+            params.tensor_buft_overrides.push_back({nullptr, nullptr});
+        }
+    }
+    if (!params.speculative.draft.tensor_buft_overrides.empty()) {
+        params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
+    }
+
    // TODO: Add yarn

    if (!request->tensorsplit().empty()) {
--- a/backend/go/acestep-cpp/Makefile
+++ b/backend/go/acestep-cpp/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # acestep.cpp version
 ACESTEP_REPO?=https://github.com/ace-step/acestep.cpp
-ACESTEP_CPP_VERSION?=e0c8d75a672fca5684c88c68dbf6d12f58754258
+ACESTEP_CPP_VERSION?=ed53caf164e4492a5620b2e3f2264629cf66da24
 SO_TARGET?=libgoacestepcpp.so

 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/go/acestep-cpp/cpp/goacestepcpp.cpp
+++ b/backend/go/acestep-cpp/cpp/goacestepcpp.cpp
@@ -22,12 +22,11 @@
 #include <vector>

 // Global model contexts (loaded once, reused across requests)
-static DiTGGML       g_dit       = {};
-static DiTGGMLConfig g_dit_cfg;
-static VAEGGML       g_vae       = {};
-static bool          g_dit_loaded = false;
-static bool          g_vae_loaded = false;
-static bool          g_is_turbo   = false;
+static DiTGGML g_dit        = {};
+static VAEGGML g_vae        = {};
+static bool    g_dit_loaded = false;
+static bool    g_vae_loaded = false;
+static bool    g_is_turbo   = false;

 // Silence latent [15000, 64] — read once from DiT GGUF
 static std::vector<float> g_silence_full;
@@ -72,10 +71,9 @@ int load_model(const char * lm_model_path, const char * text_encoder_path,
    g_text_enc_path = text_encoder_path;
    g_dit_path      = dit_model_path;

-    // Load DiT model
+    // Load DiT model (backend init + config are handled inside dit_ggml_load)
    fprintf(stderr, "[acestep-cpp] Loading DiT from %s\n", dit_model_path);
-    dit_ggml_init_backend(&g_dit);
-    if (!dit_ggml_load(&g_dit, dit_model_path, g_dit_cfg, nullptr, 0.0f)) {
+    if (!dit_ggml_load(&g_dit, dit_model_path)) {
        fprintf(stderr, "[acestep-cpp] FATAL: failed to load DiT from %s\n", dit_model_path);
        return 1;
    }
@@ -149,16 +147,16 @@ int generate_music(const char * caption, const char * lyrics, int bpm,

    // Compute T (latent frames at 25Hz)
    int T = (int)(duration * FRAMES_PER_SECOND);
-    T     = ((T + g_dit_cfg.patch_size - 1) / g_dit_cfg.patch_size) * g_dit_cfg.patch_size;
-    int S = T / g_dit_cfg.patch_size;
+    T     = ((T + g_dit.cfg.patch_size - 1) / g_dit.cfg.patch_size) * g_dit.cfg.patch_size;
+    int S = T / g_dit.cfg.patch_size;

    if (T > 15000) {
        fprintf(stderr, "[acestep-cpp] ERROR: T=%d exceeds max 15000\n", T);
        return 2;
    }

-    int Oc     = g_dit_cfg.out_channels;      // 64
-    int ctx_ch = g_dit_cfg.in_channels - Oc;  // 128
+    int Oc     = g_dit.cfg.out_channels;      // 64
+    int ctx_ch = g_dit.cfg.in_channels - Oc;  // 128

    fprintf(stderr, "[acestep-cpp] T=%d, S=%d, duration=%.1fs, seed=%d\n", T, S, duration, seed);

@@ -191,9 +189,8 @@ int generate_music(const char * caption, const char * lyrics, int bpm,

    fprintf(stderr, "[acestep-cpp] caption: %d tokens, lyrics: %d tokens\n", S_text, S_lyric);

-    // 4. Text encoder forward
+    // 4. Text encoder forward (backend init handled inside qwen3_load_text_encoder)
    Qwen3GGML text_enc = {};
-    qwen3_init_backend(&text_enc);
    if (!qwen3_load_text_encoder(&text_enc, g_text_enc_path.c_str())) {
        fprintf(stderr, "[acestep-cpp] FATAL: failed to load text encoder\n");
        return 4;
@@ -209,9 +206,8 @@ int generate_music(const char * caption, const char * lyrics, int bpm,
    std::vector<float> lyric_embed(H_text * S_lyric);
    qwen3_embed_lookup(&text_enc, lyric_ids.data(), S_lyric, lyric_embed.data());

-    // 6. Condition encoder
+    // 6. Condition encoder (backend init handled inside cond_ggml_load)
    CondGGML cond = {};
-    cond_ggml_init_backend(&cond);
    if (!cond_ggml_load(&cond, g_dit_path.c_str())) {
        fprintf(stderr, "[acestep-cpp] FATAL: failed to load condition encoder\n");
        qwen3_free(&text_enc);
--- a/backend/go/stablediffusion-ggml/Makefile
+++ b/backend/go/stablediffusion-ggml/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # stablediffusion.cpp (ggml)
 STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
-STABLEDIFFUSION_GGML_VERSION?=0b8296915c4094090cff6bd2e09a5e98288c3c7d
+STABLEDIFFUSION_GGML_VERSION?=3a8788cb7d74f185d6b18688e9563015524ecaf5

 CMAKE_ARGS+=-DGGML_MAX_NAME=128

--- a/backend/go/stablediffusion-ggml/cpp/gosd.cpp
+++ b/backend/go/stablediffusion-ggml/cpp/gosd.cpp
@@ -1188,6 +1188,9 @@ int gen_video(sd_vid_gen_params_t *p, int steps, char *dst, float cfg_scale, int
    p->high_noise_sample_params.scheduler                = scheduler;
    p->high_noise_sample_params.flow_shift               = flow_shift;

+    // Pin output fps in params; upstream uses it for audio sync (and we also mux at this rate).
+    p->fps = fps;
+
    // Load init/end reference images if provided (resized to output dims).
    uint8_t* init_buf = nullptr;
    uint8_t* end_buf  = nullptr;
@@ -1206,11 +1209,14 @@ int gen_video(sd_vid_gen_params_t *p, int steps, char *dst, float cfg_scale, int

    // Generate
    int num_frames_out = 0;
-    sd_image_t* frames = generate_video(sd_c, p, &num_frames_out);
+    sd_image_t* frames = nullptr;
+    sd_audio_t* audio = nullptr;
+    bool ok = generate_video(sd_c, p, &frames, &num_frames_out, &audio);
    std::free(p);

-    if (!frames || num_frames_out == 0) {
+    if (!ok || !frames || num_frames_out == 0) {
        fprintf(stderr, "generate_video produced no frames\n");
+        if (audio) free_sd_audio(audio);
        if (init_buf) free(init_buf);
        if (end_buf) free(end_buf);
        return 1;
@@ -1224,6 +1230,7 @@ int gen_video(sd_vid_gen_params_t *p, int steps, char *dst, float cfg_scale, int
        if (frames[i].data) free(frames[i].data);
    }
    free(frames);
+    if (audio) free_sd_audio(audio);
    if (init_buf) free(init_buf);
    if (end_buf) free(end_buf);

--- a/backend/go/whisper/Makefile
+++ b/backend/go/whisper/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # whisper.cpp version
 WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
-WHISPER_CPP_VERSION?=968eebe77225d25e57a3f981da7c696310f0e881
+WHISPER_CPP_VERSION?=8443cf05e3fa8ce1b32348e1bcbcf8fc31f7f3ae
 SO_TARGET?=libgowhisper.so

 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/python/transformers/requirements-cpu.txt
+++ b/backend/python/transformers/requirements-cpu.txt
@@ -2,9 +2,9 @@ torch==2.7.1
 llvmlite==0.43.0
 numba==0.60.0
 accelerate
-transformers>=5.8.0
+transformers>=5.8.1
 bitsandbytes
-sentence-transformers==5.4.0
+sentence-transformers==5.5.0
 diffusers
 soundfile
 protobuf==6.33.5
--- a/backend/python/transformers/requirements-cublas12.txt
+++ b/backend/python/transformers/requirements-cublas12.txt
@@ -2,9 +2,9 @@ torch==2.7.1
 accelerate
 llvmlite==0.43.0
 numba==0.60.0
-transformers>=5.8.0
+transformers>=5.8.1
 bitsandbytes
-sentence-transformers==5.4.0
+sentence-transformers==5.5.0
 diffusers
 soundfile
 protobuf==6.33.5
--- a/backend/python/transformers/requirements-cublas13.txt
+++ b/backend/python/transformers/requirements-cublas13.txt
@@ -2,9 +2,9 @@
 torch==2.9.0
 llvmlite==0.43.0
 numba==0.60.0
-transformers>=5.8.0
+transformers>=5.8.1
 bitsandbytes
-sentence-transformers==5.4.0
+sentence-transformers==5.5.0
 diffusers
 soundfile
 protobuf==6.33.5
--- a/backend/python/transformers/requirements-hipblas.txt
+++ b/backend/python/transformers/requirements-hipblas.txt
@@ -1,11 +1,11 @@
 --extra-index-url https://download.pytorch.org/whl/rocm7.0
 torch==2.10.0+rocm7.0
 accelerate
-transformers>=5.8.0
+transformers>=5.8.1
 llvmlite==0.43.0
 numba==0.60.0
 bitsandbytes
-sentence-transformers==5.4.0
+sentence-transformers==5.5.0
 diffusers
 soundfile
 protobuf==6.33.5
--- a/backend/python/transformers/requirements-intel.txt
+++ b/backend/python/transformers/requirements-intel.txt
@@ -3,9 +3,9 @@ torch
 optimum[openvino]
 llvmlite==0.43.0
 numba==0.60.0
-transformers>=5.8.0
+transformers>=5.8.1
 bitsandbytes
-sentence-transformers==5.4.0
+sentence-transformers==5.5.0
 diffusers
 soundfile
 protobuf==6.33.5
--- a/backend/python/transformers/requirements-mps.txt
+++ b/backend/python/transformers/requirements-mps.txt
@@ -2,9 +2,9 @@ torch==2.7.1
 llvmlite==0.43.0
 numba==0.60.0
 accelerate
-transformers>=5.8.0
+transformers>=5.8.1
 bitsandbytes
-sentence-transformers==5.4.0
+sentence-transformers==5.5.0
 diffusers
 soundfile
 protobuf==6.33.5
--- a/core/application/startup.go
+++ b/core/application/startup.go
@@ -212,12 +212,12 @@ func New(opts ...config.AppOption) (*Application, error) {
 		}
 	}

-	if err := coreStartup.InstallModels(options.Context, application.GalleryService(), options.Galleries, options.BackendGalleries, options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, nil, options.ModelsURL...); err != nil {
+	if err := coreStartup.InstallModels(options.Context, application.GalleryService(), options.Galleries, options.BackendGalleries, options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.RequireBackendIntegrity, nil, options.ModelsURL...); err != nil {
 		xlog.Error("error installing models", "error", err)
 	}

 	for _, backend := range options.ExternalBackends {
-		if err := galleryop.InstallExternalBackend(options.Context, options.BackendGalleries, options.SystemState, application.ModelLoader(), nil, backend, "", ""); err != nil {
+		if err := galleryop.InstallExternalBackend(options.Context, options.BackendGalleries, options.SystemState, application.ModelLoader(), nil, backend, "", "", options.RequireBackendIntegrity); err != nil {
 			xlog.Error("error installing external backend", "error", err)
 		}
 	}
@@ -267,13 +267,13 @@ func New(opts ...config.AppOption) (*Application, error) {
 	}

 	if options.PreloadJSONModels != "" {
-		if err := galleryop.ApplyGalleryFromString(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadJSONModels); err != nil {
+		if err := galleryop.ApplyGalleryFromString(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadJSONModels, options.RequireBackendIntegrity); err != nil {
 			return nil, err
 		}
 	}

 	if options.PreloadModelsFromPath != "" {
-		if err := galleryop.ApplyGalleryFromFile(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadModelsFromPath); err != nil {
+		if err := galleryop.ApplyGalleryFromFile(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadModelsFromPath, options.RequireBackendIntegrity); err != nil {
 			return nil, err
 		}
 	}
@@ -552,6 +552,13 @@ func loadRuntimeSettingsFromFile(options *config.ApplicationConfig) {
 			options.TracingMaxItems = *settings.TracingMaxItems
 		}
 	}
+	if settings.TracingMaxBodyBytes != nil {
+		// Allow the on-disk setting to override the CLI/env default. The
+		// startup default is non-zero (see NewApplicationConfig), so a plain
+		// `== 0` guard like the others would never trigger; we instead respect
+		// any value the file specifies. 0 in the file means "uncapped".
+		options.TracingMaxBodyBytes = *settings.TracingMaxBodyBytes
+	}

 	// Branding / whitelabeling. There are no env vars for these — the file is
 	// the only source — so apply unconditionally. Without this block a server
--- a/core/application/upgrade_checker.go
+++ b/core/application/upgrade_checker.go
@@ -217,7 +217,7 @@ func (uc *UpgradeChecker) runCheck(ctx context.Context) {
 				err = bm.UpgradeBackend(ctx, name, nil)
 			} else {
 				err = gallery.UpgradeBackend(ctx, uc.systemState, uc.modelLoader,
-					uc.galleries, name, nil)
+					uc.galleries, name, nil, uc.appConfig.RequireBackendIntegrity)
 			}
 			if err != nil {
 				xlog.Error("Failed to auto-upgrade backend",
--- a/core/backend/llm.go
+++ b/core/backend/llm.go
@@ -86,7 +86,7 @@ func ModelInference(ctx context.Context, s string, messages schema.Messages, ima
 		if !slices.Contains(modelNames, modelName) {
 			utils.ResetDownloadTimers()
 			// if we failed to load the model, we try to download it
-			err := gallery.InstallModelFromGallery(ctx, o.Galleries, o.BackendGalleries, o.SystemState, loader, modelName, gallery.GalleryModel{}, utils.DisplayDownloadFunction, o.EnforcePredownloadScans, o.AutoloadBackendGalleries)
+			err := gallery.InstallModelFromGallery(ctx, o.Galleries, o.BackendGalleries, o.SystemState, loader, modelName, gallery.GalleryModel{}, utils.DisplayDownloadFunction, o.EnforcePredownloadScans, o.AutoloadBackendGalleries, o.RequireBackendIntegrity)
 			if err != nil {
 				xlog.Error("failed to install model from gallery", "error", err, "model", modelFile)
 				//return nil, err
--- a/core/cli/backends.go
+++ b/core/cli/backends.go
@@ -17,9 +17,10 @@ import (
 )

 type BackendsCMDFlags struct {
-	BackendGalleries   string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
-	BackendsPath       string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"storage"`
-	BackendsSystemPath string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
+	BackendGalleries        string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
+	BackendsPath            string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"storage"`
+	BackendsSystemPath      string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
+	RequireBackendIntegrity bool   `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, reject backend installs without a configured signature verification policy (OCI URIs) or SHA256 (tarball/HTTP URIs)." group:"hardening" default:"false"`
 }

 type BackendsList struct {
@@ -126,7 +127,7 @@ func (bi *BackendsInstall) Run(ctx *cliContext.Context) error {
 	}

 	modelLoader := model.NewModelLoader(systemState)
-	err = galleryop.InstallExternalBackend(context.Background(), galleries, systemState, modelLoader, progressCallback, bi.BackendArgs, bi.Name, bi.Alias)
+	err = galleryop.InstallExternalBackend(context.Background(), galleries, systemState, modelLoader, progressCallback, bi.BackendArgs, bi.Name, bi.Alias, bi.RequireBackendIntegrity)
 	if err != nil {
 		return err
 	}
@@ -197,7 +198,7 @@ func (bu *BackendsUpgrade) Run(ctx *cliContext.Context) error {
 			}
 		}

-		if err := gallery.UpgradeBackend(context.Background(), systemState, modelLoader, galleries, name, progressCallback); err != nil {
+		if err := gallery.UpgradeBackend(context.Background(), systemState, modelLoader, galleries, name, progressCallback, bu.RequireBackendIntegrity); err != nil {
 			fmt.Printf("Failed to upgrade %s: %v\n", name, err)
 		} else {
 			fmt.Printf("Backend %s upgraded successfully\n", name)
--- a/core/cli/models.go
+++ b/core/cli/models.go
@@ -32,6 +32,7 @@ type ModelsList struct {

 type ModelsInstall struct {
 	DisablePredownloadScan   bool     `env:"LOCALAI_DISABLE_PREDOWNLOAD_SCAN" help:"If true, disables the best-effort security scanner before downloading any files." group:"hardening" default:"false"`
+	RequireBackendIntegrity  bool     `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, reject backend installs without a configured signature verification policy (OCI URIs) or SHA256 (tarball/HTTP URIs)." group:"hardening" default:"false"`
 	AutoloadBackendGalleries bool     `env:"LOCALAI_AUTOLOAD_BACKEND_GALLERIES" help:"If true, automatically loads backend galleries" group:"backends" default:"true"`
 	ModelArgs                []string `arg:"" optional:"" name:"models" help:"Model configuration URLs to load"`

@@ -71,7 +72,6 @@ func (ml *ModelsList) Run(ctx *cliContext.Context) error {
 }

 func (mi *ModelsInstall) Run(ctx *cliContext.Context) error {
-
 	systemState, err := system.GetSystemState(
 		system.WithModelPath(mi.ModelsPath),
 		system.WithBackendPath(mi.BackendsPath),
@@ -135,7 +135,7 @@ func (mi *ModelsInstall) Run(ctx *cliContext.Context) error {
 		}

 		modelLoader := model.NewModelLoader(systemState)
-		err = startup.InstallModels(context.Background(), galleryService, galleries, backendGalleries, systemState, modelLoader, !mi.DisablePredownloadScan, mi.AutoloadBackendGalleries, progressCallback, modelName)
+		err = startup.InstallModels(context.Background(), galleryService, galleries, backendGalleries, systemState, modelLoader, !mi.DisablePredownloadScan, mi.AutoloadBackendGalleries, mi.RequireBackendIntegrity, progressCallback, modelName)
 		if err != nil {
 			return err
 		}
--- a/core/cli/run.go
+++ b/core/cli/run.go
@@ -67,6 +67,7 @@ type RunCMD struct {
 	OllamaAPIRootEndpoint              bool     `env:"LOCALAI_OLLAMA_API_ROOT_ENDPOINT" default:"false" help:"Register Ollama-compatible health check on / (replaces web UI on root path). The /api/* Ollama endpoints are always available regardless of this flag" group:"api"`
 	DisableRuntimeSettings             bool     `env:"LOCALAI_DISABLE_RUNTIME_SETTINGS,DISABLE_RUNTIME_SETTINGS" default:"false" help:"Disables the runtime settings. When set to true, the server will not load the runtime settings from the runtime_settings.json file" group:"api"`
 	DisablePredownloadScan             bool     `env:"LOCALAI_DISABLE_PREDOWNLOAD_SCAN" help:"If true, disables the best-effort security scanner before downloading any files." group:"hardening" default:"false"`
+	RequireBackendIntegrity            bool     `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, backend installs without a configured signature verification policy (for OCI URIs) or SHA256 (for tarball/HTTP URIs) are rejected. Default is to warn and install. Set this in production once your gallery's verification: block is populated." group:"hardening" default:"false"`
 	OpaqueErrors                       bool     `env:"LOCALAI_OPAQUE_ERRORS" default:"false" help:"If true, all error responses are replaced with blank 500 errors. This is intended only for hardening against information leaks and is normally not recommended." group:"hardening"`
 	UseSubtleKeyComparison             bool     `env:"LOCALAI_SUBTLE_KEY_COMPARISON" default:"false" help:"If true, API Key validation comparisons will be performed using constant-time comparisons rather than simple equality. This trades off performance on each request for resiliency against timing attacks." group:"hardening"`
 	DisableApiKeyRequirementForHttpGet bool     `env:"LOCALAI_DISABLE_API_KEY_REQUIREMENT_FOR_HTTP_GET" default:"false" help:"If true, a valid API key is not required to issue GET requests to portions of the web ui. This should only be enabled in secure testing environments" group:"hardening"`
@@ -99,6 +100,7 @@ type RunCMD struct {
 	LoadToMemory                       []string `env:"LOCALAI_LOAD_TO_MEMORY,LOAD_TO_MEMORY" help:"A list of models to load into memory at startup" group:"models"`
 	EnableTracing                      bool     `env:"LOCALAI_ENABLE_TRACING,ENABLE_TRACING" help:"Enable API tracing" group:"api"`
 	TracingMaxItems                    int      `env:"LOCALAI_TRACING_MAX_ITEMS" default:"1024" help:"Maximum number of traces to keep" group:"api"`
+	TracingMaxBodyBytes                int      `env:"LOCALAI_TRACING_MAX_BODY_BYTES" default:"65536" help:"Maximum bytes captured per request/response body in the trace buffer (0 = uncapped). Caps memory growth from chatty endpoints like /embeddings." group:"api"`
 	AgentJobRetentionDays              int      `env:"LOCALAI_AGENT_JOB_RETENTION_DAYS,AGENT_JOB_RETENTION_DAYS" default:"30" help:"Number of days to keep agent job history (default: 30)" group:"api"`
 	OpenResponsesStoreTTL              string   `env:"LOCALAI_OPEN_RESPONSES_STORE_TTL,OPEN_RESPONSES_STORE_TTL" default:"0" help:"TTL for Open Responses store (e.g., 1h, 30m, 0 = no expiration)" group:"api"`

@@ -272,6 +274,7 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
 		opts = append(opts, config.EnableTracing)
 	}
 	opts = append(opts, config.WithTracingMaxItems(r.TracingMaxItems))
+	opts = append(opts, config.WithTracingMaxBodyBytes(r.TracingMaxBodyBytes))

 	token := ""
 	if r.Peer2Peer || r.Peer2PeerToken != "" {
@@ -503,6 +506,10 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
 		opts = append(opts, config.WithAutoUpgradeBackends(r.AutoUpgradeBackends))
 	}

+	if r.RequireBackendIntegrity {
+		opts = append(opts, config.WithRequireBackendIntegrity(r.RequireBackendIntegrity))
+	}
+
 	if r.PreferDevelopmentBackends {
 		opts = append(opts, config.WithPreferDevelopmentBackends(r.PreferDevelopmentBackends))
 	}
--- a/core/cli/worker/worker.go
+++ b/core/cli/worker/worker.go
@@ -1,10 +1,11 @@
 package worker

 type WorkerFlags struct {
-	BackendsPath       string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"backends"`
-	BackendGalleries   string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
-	BackendsSystemPath string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
-	ExtraLLamaCPPArgs  string `name:"llama-cpp-args" env:"LOCALAI_EXTRA_LLAMA_CPP_ARGS,EXTRA_LLAMA_CPP_ARGS" help:"Extra arguments to pass to llama-cpp-rpc-server"`
+	BackendsPath            string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"backends"`
+	BackendGalleries        string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
+	BackendsSystemPath      string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
+	RequireBackendIntegrity bool   `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, reject backend installs without a configured signature verification policy (OCI URIs) or SHA256 (tarball/HTTP URIs)." group:"hardening" default:"false"`
+	ExtraLLamaCPPArgs       string `name:"llama-cpp-args" env:"LOCALAI_EXTRA_LLAMA_CPP_ARGS,EXTRA_LLAMA_CPP_ARGS" help:"Extra arguments to pass to llama-cpp-rpc-server"`
 }

 type Worker struct {
--- a/core/cli/worker/worker_backend_common.go
+++ b/core/cli/worker/worker_backend_common.go
@@ -18,7 +18,7 @@ import (
 // installing the backend from the gallery if it isn't present.
 // `name` is the gallery entry name (for vLLM the meta entry "vllm"
 // resolves to a platform-specific package via capability lookup).
-func findBackendPath(name, galleries string, systemState *system.SystemState) (string, error) {
+func findBackendPath(name, galleries string, systemState *system.SystemState, requireIntegrity bool) (string, error) {
 	backends, err := gallery.ListSystemBackends(systemState)
 	if err != nil {
 		return "", err
@@ -33,7 +33,7 @@ func findBackendPath(name, galleries string, systemState *system.SystemState) (s
 		xlog.Error("failed loading galleries", "error", err)
 		return "", err
 	}
-	if err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, name, nil, true); err != nil {
+	if err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, name, nil, true, requireIntegrity); err != nil {
 		xlog.Error("backend not found, failed to install it", "name", name, "error", err)
 		return "", err
 	}
--- a/core/cli/worker/worker_llamacpp.go
+++ b/core/cli/worker/worker_llamacpp.go
@@ -27,7 +27,7 @@ const (
 	llamaCPPGalleryName   = "llama-cpp"
 )

-func findLLamaCPPBackend(galleries string, systemState *system.SystemState) (string, error) {
+func findLLamaCPPBackend(galleries string, systemState *system.SystemState, requireIntegrity bool) (string, error) {
 	backends, err := gallery.ListSystemBackends(systemState)
 	if err != nil {
 		xlog.Warn("Failed listing system backends", "error", err)
@@ -43,7 +43,7 @@ func findLLamaCPPBackend(galleries string, systemState *system.SystemState) (str
 			xlog.Error("failed loading galleries", "error", err)
 			return "", err
 		}
-		err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, llamaCPPGalleryName, nil, true)
+		err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, llamaCPPGalleryName, nil, true, requireIntegrity)
 		if err != nil {
 			xlog.Error("llama-cpp backend not found, failed to install it", "error", err)
 			return "", err
@@ -76,7 +76,7 @@ func (r *LLamaCPP) Run(ctx *cliContext.Context) error {
 	if err != nil {
 		return err
 	}
-	grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState)
+	grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
 	if err != nil {
 		return err
 	}
--- a/core/cli/worker/worker_mlx_common.go
+++ b/core/cli/worker/worker_mlx_common.go
@@ -9,8 +9,8 @@ import (

 const mlxDistributedGalleryName = "mlx-distributed"

-func findMLXDistributedBackendPath(galleries string, systemState *system.SystemState) (string, error) {
-	return findBackendPath(mlxDistributedGalleryName, galleries, systemState)
+func findMLXDistributedBackendPath(galleries string, systemState *system.SystemState, requireIntegrity bool) (string, error) {
+	return findBackendPath(mlxDistributedGalleryName, galleries, systemState, requireIntegrity)
 }

 // buildMLXCommand builds the exec.Cmd to launch the mlx-distributed backend.
--- a/core/cli/worker/worker_mlx_distributed.go
+++ b/core/cli/worker/worker_mlx_distributed.go
@@ -28,7 +28,7 @@ func (r *MLXDistributed) Run(ctx *cliContext.Context) error {
 		return err
 	}

-	backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState)
+	backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
 	if err != nil {
 		return fmt.Errorf("cannot find mlx-distributed backend: %w", err)
 	}
--- a/core/cli/worker/worker_p2p.go
+++ b/core/cli/worker/worker_p2p.go
@@ -73,7 +73,7 @@ func (r *P2P) Run(ctx *cliContext.Context) error {
 			for {
 				xlog.Info("Starting llama-cpp-rpc-server", "address", address, "port", port)

-				grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState)
+				grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
 				if err != nil {
 					xlog.Error("Failed to find llama-cpp-rpc-server", "error", err)
 					return
--- a/core/cli/worker/worker_p2p_mlx.go
+++ b/core/cli/worker/worker_p2p_mlx.go
@@ -48,7 +48,7 @@ func (r *P2PMLX) Run(ctx *cliContext.Context) error {
 	c, cancel := context.WithCancel(context.Background())
 	defer cancel()

-	backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState)
+	backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
 	if err != nil {
 		xlog.Warn("Could not find mlx-distributed backend from gallery, will try backend.py directly", "error", err)
 	}
--- a/core/cli/worker/worker_vllm.go
+++ b/core/cli/worker/worker_vllm.go
@@ -77,7 +77,7 @@ func (r *VLLMDistributed) Run(ctx *cliContext.Context) error {
 		return fmt.Errorf("getting system state: %w", err)
 	}

-	backendPath, err := findBackendPath("vllm", r.BackendGalleries, systemState)
+	backendPath, err := findBackendPath("vllm", r.BackendGalleries, systemState, r.RequireBackendIntegrity)
 	if err != nil {
 		return fmt.Errorf("cannot find vllm backend: %w", err)
 	}
--- a/core/config/application_config.go
+++ b/core/config/application_config.go
@@ -21,6 +21,7 @@ type ApplicationConfig struct {
 	Debug                               bool
 	EnableTracing                       bool
 	TracingMaxItems                     int
+	TracingMaxBodyBytes                 int // Per-body cap for captured request/response bodies; 0 disables the cap
 	EnableBackendLogging                bool
 	GeneratedContentDir                 string

@@ -60,6 +61,13 @@ type ApplicationConfig struct {
 	AutoUpgradeBackends                         bool
 	PreferDevelopmentBackends                   bool

+	// RequireBackendIntegrity promotes a missing SHA256 (tarball/HTTP URIs)
+	// or missing verification policy (OCI URIs) from a warning to a hard
+	// failure during backend install/upgrade. Off by default to keep
+	// upgrades non-breaking; operators opt in explicitly via
+	// --require-backend-integrity / LOCALAI_REQUIRE_BACKEND_INTEGRITY.
+	RequireBackendIntegrity bool
+
 	SingleBackend           bool // Deprecated: use MaxActiveBackends = 1 instead
 	MaxActiveBackends       int  // Maximum number of active backends (0 = unlimited, 1 = single backend mode)
 	WatchDogIdle bool
@@ -180,6 +188,7 @@ func NewApplicationConfig(o ...AppOption) *ApplicationConfig {
 		LRUEvictionRetryInterval: 1 * time.Second,        // Default: 1 second
 		WatchDogInterval:         500 * time.Millisecond, // Default: 500ms
 		TracingMaxItems:          1024,
+		TracingMaxBodyBytes:      64 * 1024, // 64 KiB - caps each request/response body in the trace buffer
 		AgentPool: AgentPoolConfig{
 			Enabled:         true,
 			Timeout:         "5m",
@@ -436,6 +445,10 @@ func WithAutoUpgradeBackends(v bool) AppOption {
 	return func(o *ApplicationConfig) { o.AutoUpgradeBackends = v }
 }

+func WithRequireBackendIntegrity(v bool) AppOption {
+	return func(o *ApplicationConfig) { o.RequireBackendIntegrity = v }
+}
+
 func WithPreferDevelopmentBackends(v bool) AppOption {
 	return func(o *ApplicationConfig) { o.PreferDevelopmentBackends = v }
 }
@@ -567,6 +580,12 @@ func WithTracingMaxItems(items int) AppOption {
 	}
 }

+func WithTracingMaxBodyBytes(bytes int) AppOption {
+	return func(o *ApplicationConfig) {
+		o.TracingMaxBodyBytes = bytes
+	}
+}
+
 func WithGeneratedContentDir(generatedContentDir string) AppOption {
 	return func(o *ApplicationConfig) {
 		o.GeneratedContentDir = generatedContentDir
@@ -909,6 +928,7 @@ func (o *ApplicationConfig) ToRuntimeSettings() RuntimeSettings {
 	f16 := o.F16
 	debug := o.Debug
 	tracingMaxItems := o.TracingMaxItems
+	tracingMaxBodyBytes := o.TracingMaxBodyBytes
 	enableTracing := o.EnableTracing
 	enableBackendLogging := o.EnableBackendLogging
 	cors := o.CORS
@@ -997,6 +1017,7 @@ func (o *ApplicationConfig) ToRuntimeSettings() RuntimeSettings {
 		F16:                       &f16,
 		Debug:                     &debug,
 		TracingMaxItems:           &tracingMaxItems,
+		TracingMaxBodyBytes:       &tracingMaxBodyBytes,
 		EnableTracing:             &enableTracing,
 		EnableBackendLogging:      &enableBackendLogging,
 		CORS:                      &cors,
@@ -1135,6 +1156,9 @@ func (o *ApplicationConfig) ApplyRuntimeSettings(settings *RuntimeSettings) (req
 	if settings.TracingMaxItems != nil {
 		o.TracingMaxItems = *settings.TracingMaxItems
 	}
+	if settings.TracingMaxBodyBytes != nil {
+		o.TracingMaxBodyBytes = *settings.TracingMaxBodyBytes
+	}
 	if settings.EnableBackendLogging != nil {
 		o.EnableBackendLogging = *settings.EnableBackendLogging
 	}
--- a/core/config/gallery.go
+++ b/core/config/gallery.go
@@ -1,6 +1,37 @@
 package config

-type Gallery struct {
-	URL  string `json:"url" yaml:"url"`
-	Name string `json:"name" yaml:"name"`
+// GalleryVerification declares the keyless-cosign signature policy that
+// every OCI backend image fetched from this gallery must satisfy.
+//
+// Verification is opt-in: galleries without a Verification block install
+// backends with no signature check (the downloader logs a warning when
+// LOCALAI_REQUIRE_BACKEND_INTEGRITY is unset; that flag turns the warning
+// into a hard error).
+//
+// Identity matching: set Issuer (exact) or IssuerRegex, AND Identity
+// (exact) or IdentityRegex. For GitHub Actions keyless signing the
+// typical shape is:
+//
+//	verification:
+//	  issuer: "https://token.actions.githubusercontent.com"
+//	  identity_regex: "^https://github\\.com/mudler/local-ai-backends/\\.github/workflows/build\\.yaml@refs/heads/master$"
+//	  not_before: "2026-05-01T00:00:00Z"
+//
+// NotBefore is the revocation lever: advance it to invalidate every
+// signature produced before a known compromise window. Keyless cosign
+// certs are ephemeral so there is no CA-side revocation.
+type GalleryVerification struct {
+	Issuer        string `json:"issuer,omitempty" yaml:"issuer,omitempty"`
+	IssuerRegex   string `json:"issuer_regex,omitempty" yaml:"issuer_regex,omitempty"`
+	Identity      string `json:"identity,omitempty" yaml:"identity,omitempty"`
+	IdentityRegex string `json:"identity_regex,omitempty" yaml:"identity_regex,omitempty"`
+
+	// NotBefore is an RFC3339 timestamp. Empty disables the time check.
+	NotBefore string `json:"not_before,omitempty" yaml:"not_before,omitempty"`
+}
+
+type Gallery struct {
+	URL          string               `json:"url" yaml:"url"`
+	Name         string               `json:"name" yaml:"name"`
+	Verification *GalleryVerification `json:"verification,omitempty" yaml:"verification,omitempty"`
 }
--- a/core/config/gguf.go
+++ b/core/config/gguf.go
@@ -54,6 +54,13 @@ func guessGGUFFromFile(cfg *ModelConfig, f *gguf.GGUFFile, defaultCtx int) {
 		cfg.modelTemplate = chatTemplate.ValueString()
 	}

+	// Auto-enable Multi-Token Prediction (ggml-org/llama.cpp#22673) when the
+	// GGUF carries an embedded MTP head. Skipped silently for non-MTP models
+	// and when the user already configured a spec_type.
+	if n, ok := HasEmbeddedMTPHead(f); ok {
+		ApplyMTPDefaults(cfg, n)
+	}
+
 	// Thinking support detection is done after model load via DetectThinkingSupportFromBackend

 	// template estimations
--- a/core/config/mtp.go
+++ b/core/config/mtp.go
@@ -0,0 +1,84 @@
+package config
+
+import (
+	"strings"
+
+	gguf "github.com/gpustack/gguf-parser-go"
+	"github.com/mudler/xlog"
+)
+
+// mtpSpecOptions lists the speculative-decoding option keys auto-applied when
+// an MTP head is detected on a llama-cpp GGUF. Defaults track the upstream
+// MTP PR (ggml-org/llama.cpp#22673):
+//
+//   - spec_type:draft-mtp      activates Multi-Token Prediction
+//   - spec_n_max:6             draft window
+//   - spec_p_min:0.75          pinned because upstream marked the 0.75 default
+//     with a "change to 0.0f" TODO; locking it here keeps acceptance
+//     thresholds stable across future bumps
+var mtpSpecOptions = []string{
+	"spec_type:draft-mtp",
+	"spec_n_max:6",
+	"spec_p_min:0.75",
+}
+
+// MTPSpecOptions returns a copy of the option keys auto-applied when an MTP
+// head is detected. Exported for testing and for the importer.
+func MTPSpecOptions() []string {
+	out := make([]string, len(mtpSpecOptions))
+	copy(out, mtpSpecOptions)
+	return out
+}
+
+// HasEmbeddedMTPHead reports whether the parsed GGUF declares a Multi-Token
+// Prediction head. Detection reads `<arch>.nextn_predict_layers`, which is
+// what `gguf_writer.add_nextn_predict_layers(n)` emits in upstream's
+// `conversion/qwen.py` MTP mixin. A positive layer count means the head is
+// present in the same GGUF as the trunk.
+func HasEmbeddedMTPHead(f *gguf.GGUFFile) (uint32, bool) {
+	if f == nil {
+		return 0, false
+	}
+	arch := f.Architecture().Architecture
+	if arch == "" {
+		return 0, false
+	}
+	v, ok := f.Header.MetadataKV.Get(arch + ".nextn_predict_layers")
+	if !ok {
+		return 0, false
+	}
+	n := gguf.ValueNumeric[uint32](v)
+	return n, n > 0
+}
+
+// hasSpecTypeOption returns true when the slice already contains a
+// user-configured `spec_type:` / `speculative_type:` entry. Used to avoid
+// clobbering an explicit choice with the MTP auto-defaults.
+func hasSpecTypeOption(opts []string) bool {
+	for _, o := range opts {
+		if strings.HasPrefix(o, "spec_type:") || strings.HasPrefix(o, "speculative_type:") {
+			return true
+		}
+	}
+	return false
+}
+
+// ApplyMTPDefaults appends the auto-MTP option keys to cfg.Options unless a
+// `spec_type` (or legacy `speculative_type`) entry is already set, whether via
+// YAML or via the importer's preferences flow; in that case it is a no-op.
+//
+// `layers` is the value read from `<arch>.nextn_predict_layers` and is only
+// used for the diagnostic log line.
+func ApplyMTPDefaults(cfg *ModelConfig, layers uint32) {
+	if cfg == nil {
+		return
+	}
+	if hasSpecTypeOption(cfg.Options) {
+		xlog.Debug("[mtp] embedded MTP head detected but spec_type already configured; leaving user choice intact",
+			"name", cfg.Name, "nextn_layers", layers)
+		return
+	}
+	cfg.Options = append(cfg.Options, mtpSpecOptions...)
+	xlog.Info("[mtp] embedded MTP head detected; enabling draft-mtp speculative decoding",
+		"name", cfg.Name, "nextn_layers", layers, "spec_n_max", 6, "spec_p_min", 0.75)
+}
--- a/core/config/mtp_test.go
+++ b/core/config/mtp_test.go
@@ -0,0 +1,86 @@
+package config_test
+
+import (
+	. "github.com/mudler/LocalAI/core/config"
+
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("MTP auto-defaults", func() {
+	Context("MTPSpecOptions", func() {
+		It("returns the upstream-recommended speculative tuple", func() {
+			Expect(MTPSpecOptions()).To(Equal([]string{
+				"spec_type:draft-mtp",
+				"spec_n_max:6",
+				"spec_p_min:0.75",
+			}))
+		})
+
+		It("returns a defensive copy so callers cannot mutate the package default", func() {
+			opts := MTPSpecOptions()
+			opts[0] = "spec_type:none"
+			Expect(MTPSpecOptions()[0]).To(Equal("spec_type:draft-mtp"))
+		})
+	})
+
+	Context("ApplyMTPDefaults", func() {
+		It("appends MTP options when nothing is configured", func() {
+			cfg := &ModelConfig{Name: "qwen-mtp"}
+			ApplyMTPDefaults(cfg, 1)
+			Expect(cfg.Options).To(Equal([]string{
+				"spec_type:draft-mtp",
+				"spec_n_max:6",
+				"spec_p_min:0.75",
+			}))
+		})
+
+		It("preserves unrelated options already on the config", func() {
+			cfg := &ModelConfig{
+				Name:    "qwen-mtp",
+				Options: []string{"use_jinja:true", "cache_reuse:256"},
+			}
+			ApplyMTPDefaults(cfg, 1)
+			Expect(cfg.Options).To(Equal([]string{
+				"use_jinja:true",
+				"cache_reuse:256",
+				"spec_type:draft-mtp",
+				"spec_n_max:6",
+				"spec_p_min:0.75",
+			}))
+		})
+
+		It("is a no-op when the user already configured spec_type", func() {
+			cfg := &ModelConfig{
+				Name:    "qwen-mtp",
+				Options: []string{"spec_type:ngram-simple", "use_jinja:true"},
+			}
+			ApplyMTPDefaults(cfg, 1)
+			Expect(cfg.Options).To(Equal([]string{
+				"spec_type:ngram-simple",
+				"use_jinja:true",
+			}))
+		})
+
+		It("also respects the legacy speculative_type alias", func() {
+			cfg := &ModelConfig{
+				Name:    "qwen-mtp",
+				Options: []string{"speculative_type:ngram-mod"},
+			}
+			ApplyMTPDefaults(cfg, 1)
+			Expect(cfg.Options).To(Equal([]string{"speculative_type:ngram-mod"}))
+		})
+
+		It("tolerates a nil config", func() {
+			Expect(func() { ApplyMTPDefaults(nil, 1) }).ToNot(Panic())
+		})
+	})
+
+	Context("HasEmbeddedMTPHead", func() {
+		It("returns false on a nil GGUF file", func() {
+			n, ok := HasEmbeddedMTPHead(nil)
+			Expect(ok).To(BeFalse())
+			Expect(n).To(BeZero())
+		})
+	})
+})
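The specs above pin down three behaviours: the defaults are appended after any existing options, nothing changes when `spec_type` or the legacy `speculative_type` alias is already present, and callers receive a copy of the default tuple. A minimal sketch of that contract follows. It assumes prefix matching on the option keys and works on a bare string slice; `specOptions`, `hasSpecType` and `applyDefaults` are hypothetical stand-ins, not the functions in core/config.

```go
package main

import (
	"fmt"
	"strings"
)

// mtpDefaults holds the option tuple the specs expect. It stays unexported so
// callers only ever see copies.
var mtpDefaults = []string{
	"spec_type:draft-mtp",
	"spec_n_max:6",
	"spec_p_min:0.75",
}

// specOptions returns a fresh copy of the defaults
// (hypothetical stand-in for MTPSpecOptions).
func specOptions() []string {
	out := make([]string, len(mtpDefaults))
	copy(out, mtpDefaults)
	return out
}

// hasSpecType reports whether a speculative decoding type is already set,
// under either the current key or the legacy alias. Prefix matching is an
// assumption of this sketch.
func hasSpecType(opts []string) bool {
	for _, o := range opts {
		if strings.HasPrefix(o, "spec_type:") || strings.HasPrefix(o, "speculative_type:") {
			return true
		}
	}
	return false
}

// applyDefaults appends the MTP tuple unless a speculative type is already
// configured (hypothetical stand-in for ApplyMTPDefaults).
func applyDefaults(opts []string) []string {
	if hasSpecType(opts) {
		return opts
	}
	return append(opts, specOptions()...)
}

func main() {
	fmt.Println(applyDefaults([]string{"use_jinja:true"}))
	fmt.Println(applyDefaults([]string{"speculative_type:ngram-mod"}))
}
```

Returning a copy from `specOptions` is what makes the "defensive copy" spec pass: mutating the returned slice cannot reach the package-level tuple.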
--- a/core/config/runtime_settings.go
+++ b/core/config/runtime_settings.go
@@ -38,6 +38,7 @@ type RuntimeSettings struct {
 	Debug                *bool `json:"debug,omitempty"`
 	EnableTracing        *bool `json:"enable_tracing,omitempty"`
 	TracingMaxItems      *int  `json:"tracing_max_items,omitempty"`
+	TracingMaxBodyBytes  *int  `json:"tracing_max_body_bytes,omitempty"` // Per-body cap in bytes; 0 disables the cap
 	EnableBackendLogging *bool `json:"enable_backend_logging,omitempty"`

 	// Security/CORS settings
--- a/core/gallery/backends.go
+++ b/core/gallery/backends.go
@@ -16,6 +16,7 @@ import (
 	"github.com/mudler/LocalAI/pkg/downloader"
 	"github.com/mudler/LocalAI/pkg/model"
 	"github.com/mudler/LocalAI/pkg/oci"
+	"github.com/mudler/LocalAI/pkg/oci/cosignverify"
 	"github.com/mudler/LocalAI/pkg/system"
 	"github.com/mudler/xlog"
 	cp "github.com/otiai10/copy"
@@ -102,8 +103,81 @@ func writeBackendMetadata(backendPath string, metadata *BackendMetadata) error {
 	return nil
 }

+// backendDownloadOptions translates the gallery's verification policy into
+// downloader options, and gates the call on strict-integrity mode. Both
+// InstallBackend and UpgradeBackend MUST route their download through these
+// options — without them, the corresponding code path silently downloads
+// and activates unverified backend bytes even when the gallery has a
+// verification: policy configured.
+//
+// For OCI URIs with a verification policy, returns a slice containing
+// downloader.WithImageVerifier(v) — the downloader will then run cosign
+// signature verification between fetching the manifest and extracting
+// layers (see pkg/downloader/uri.go OCI branch).
+//
+// For OCI URIs without a verification policy, or non-OCI URIs without a
+// SHA256, the function either returns no options so the download proceeds
+// unverified (requireIntegrity false) or returns an error (requireIntegrity true).
+func backendDownloadOptions(config *GalleryBackend, requireIntegrity bool) ([]downloader.DownloadOption, error) {
+	uri := downloader.URI(config.URI)
+	hasVerification := config.Gallery.Verification != nil
+	hasSHA := config.SHA256 != ""
+
+	switch {
+	case uri.LooksLikeOCI():
+		if !hasVerification {
+			if requireIntegrity {
+				return nil, fmt.Errorf("strict integrity: gallery %q has no verification policy for OCI backend %q (set verification: in the gallery YAML or disable --require-backend-integrity)",
+					config.Gallery.Name, config.Name)
+			}
+			xlog.Warn("installing OCI backend without signature verification",
+				"backend", config.Name, "gallery", config.Gallery.Name, "uri", config.URI)
+			return nil, nil
+		}
+		v, err := newGalleryVerifier(config.Gallery.Verification)
+		if err != nil {
+			return nil, fmt.Errorf("gallery %q verification policy: %w", config.Gallery.Name, err)
+		}
+		return []downloader.DownloadOption{downloader.WithImageVerifier(v)}, nil
+
+	case uri.LooksLikeDir():
+		// Local directory — out of scope for integrity checks.
+		return nil, nil
+
+	default:
+		if !hasSHA && requireIntegrity {
+			return nil, fmt.Errorf("strict integrity: backend %q has no SHA256 (gallery %q)",
+				config.Name, config.Gallery.Name)
+		}
+		// Non-strict: pkg/downloader already emits a warning when sha is empty.
+		return nil, nil
+	}
+}
+
+// newGalleryVerifier constructs a cosignverify.Verifier from the gallery
+// policy. Parses NotBefore (RFC3339) here so YAML errors surface at install
+// time rather than during signature verification.
+func newGalleryVerifier(p *config.GalleryVerification) (*cosignverify.Verifier, error) {
+	pol := cosignverify.Policy{
+		Issuer:        p.Issuer,
+		IssuerRegex:   p.IssuerRegex,
+		Identity:      p.Identity,
+		IdentityRegex: p.IdentityRegex,
+	}
+	if p.NotBefore != "" {
+		t, err := time.Parse(time.RFC3339, p.NotBefore)
+		if err != nil {
+			return nil, fmt.Errorf("not_before %q: %w", p.NotBefore, err)
+		}
+		pol.NotBefore = t
+	}
+	return cosignverify.NewVerifier(pol, nil, nil)
+}
+
 // InstallBackendFromGallery installs a backend from the gallery.
-func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, modelLoader *model.ModelLoader, name string, downloadStatus func(string, string, string, float64), force bool) error {
+// requireIntegrity escalates a missing SHA256 / verification policy from a
+// warning to a hard failure (see backendDownloadOptions).
+func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, modelLoader *model.ModelLoader, name string, downloadStatus func(string, string, string, float64), force, requireIntegrity bool) error {
 	if !force {
 		// check if we already have the backend installed
 		backends, err := ListSystemBackends(systemState)
@@ -149,7 +223,7 @@ func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery,
 		xlog.Debug("Installing backend from meta backend", "name", name, "bestBackend", bestBackend.Name)

 		// Then, let's install the best backend
-		if err := InstallBackend(ctx, systemState, modelLoader, bestBackend, downloadStatus); err != nil {
+		if err := InstallBackend(ctx, systemState, modelLoader, bestBackend, downloadStatus, requireIntegrity); err != nil {
 			return err
 		}

@@ -175,10 +249,10 @@ func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery,
 		return nil
 	}

-	return InstallBackend(ctx, systemState, modelLoader, backend, downloadStatus)
+	return InstallBackend(ctx, systemState, modelLoader, backend, downloadStatus, requireIntegrity)
 }

-func InstallBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, config *GalleryBackend, downloadStatus func(string, string, string, float64)) error {
+func InstallBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, config *GalleryBackend, downloadStatus func(string, string, string, float64), requireIntegrity bool) error {
 	// Get configurable fallback tag values from SystemState
 	latestTag, masterTag, devSuffix := getFallbackTagValues(systemState)

@@ -213,6 +287,14 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
 		return fmt.Errorf("failed to create base path: %v", err)
 	}

+	// Build the download options once and reuse for every retry path —
+	// mirrors and tag fallbacks must verify against the same gallery
+	// policy or we open a hole where a non-default URI bypasses the check.
+	downloadOpts, optsErr := backendDownloadOptions(config, requireIntegrity)
+	if optsErr != nil {
+		return fmt.Errorf("backend %q: %w", config.Name, optsErr)
+	}
+
 	uri := downloader.URI(config.URI)
 	// Check if it is a directory
 	if uri.LooksLikeDir() {
@@ -222,7 +304,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
 		}
 	} else {
 		xlog.Debug("Downloading backend", "uri", config.URI, "backendPath", backendPath)
-		if err := uri.DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err != nil {
+		if err := uri.DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err != nil {
 			xlog.Debug("Backend download failed, trying fallback", "backendPath", backendPath, "error", err)

 			// resetBackendPath cleans up partial state from a failed OCI extraction
@@ -243,7 +325,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
 				default:
 				}
 				resetBackendPath()
-				if err := downloader.URI(mirror).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err == nil {
+				if err := downloader.URI(mirror).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err == nil {
 					success = true
 					xlog.Debug("Downloaded backend from mirror", "uri", config.URI, "backendPath", backendPath)
 					break
@@ -256,7 +338,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
 				if fallbackURI != string(config.URI) {
 					resetBackendPath()
 					xlog.Info("Trying fallback URI", "original", config.URI, "fallback", fallbackURI)
-					if err := downloader.URI(fallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err == nil {
+					if err := downloader.URI(fallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err == nil {
 						xlog.Info("Downloaded backend using fallback URI", "uri", fallbackURI, "backendPath", backendPath)
 						success = true
 					} else {
@@ -265,7 +347,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
 							resetBackendPath()
 							devFallbackURI := fallbackURI + "-" + devSuffix
 							xlog.Info("Trying development fallback URI", "fallback", devFallbackURI)
-							if err := downloader.URI(devFallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err == nil {
+							if err := downloader.URI(devFallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err == nil {
 								xlog.Info("Downloaded backend using development fallback URI", "uri", devFallbackURI, "backendPath", backendPath)
 								success = true
 							} else {
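The branch structure of backendDownloadOptions in this file can be read as a small decision table over the URI kind, whether the gallery carries a verification policy, whether the entry has a SHA256, and whether strict mode is on. The sketch below restates that table as a pure function. All names in it (`uriKind`, `action`, `decide`, `errStrict`) are hypothetical; it does not call the downloader or cosignverify packages.

```go
package main

import (
	"errors"
	"fmt"
)

// uriKind is a hypothetical classification of a backend URI.
type uriKind int

const (
	kindOCI uriKind = iota
	kindDir
	kindOther // tarball / HTTP and anything else
)

// action is what the caller should do with the download.
type action int

const (
	noExtraOptions action = iota // nothing to attach; download as before
	attachVerifier               // attach a signature verifier to the download
)

var errStrict = errors.New("strict integrity")

// decide mirrors the switch in the diff: OCI needs a policy, local
// directories are out of scope, everything else needs a SHA256 in strict mode.
func decide(kind uriKind, hasPolicy, hasSHA, strict bool) (action, error) {
	switch kind {
	case kindOCI:
		if hasPolicy {
			return attachVerifier, nil
		}
		if strict {
			return noExtraOptions, fmt.Errorf("%w: OCI backend has no verification policy", errStrict)
		}
		return noExtraOptions, nil
	case kindDir:
		return noExtraOptions, nil
	default:
		if !hasSHA && strict {
			return noExtraOptions, fmt.Errorf("%w: backend has no SHA256", errStrict)
		}
		return noExtraOptions, nil
	}
}

func main() {
	_, err := decide(kindOCI, false, false, true)
	fmt.Println(err)
	a, _ := decide(kindOCI, true, false, true)
	fmt.Println(a == attachVerifier)
}
```

Computing the decision once and reusing it for mirrors and tag fallbacks, as InstallBackend does with downloadOpts, keeps every retry path under the same policy.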
--- a/core/gallery/backends_test.go
+++ b/core/gallery/backends_test.go
@@ -117,13 +117,13 @@ var _ = Describe("Gallery Backends", func() {

 	Describe("InstallBackendFromGallery", func() {
 		It("should return error when backend is not found", func() {
-			err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "non-existent", nil, true)
+			err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "non-existent", nil, true, false)
 			Expect(err).To(HaveOccurred())
 			Expect(err.Error()).To(ContainSubstring("no backend found with name \"non-existent\""))
 		})

 		It("should install backend from gallery", func() {
-			err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "test-backend", nil, true)
+			err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "test-backend", nil, true, false)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(filepath.Join(tempDir, "test-backend", "run.sh")).To(BeARegularFile())
 		})
@@ -545,7 +545,7 @@ var _ = Describe("Gallery Backends", func() {
 				VRAM:      1000000000000,
 				Backend:   system.Backend{BackendsPath: tempDir},
 			}
-			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true)
+			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true, false)
 			Expect(err).NotTo(HaveOccurred())

 			metaBackendPath := filepath.Join(tempDir, "meta-backend")
@@ -625,7 +625,7 @@ var _ = Describe("Gallery Backends", func() {
 				VRAM:      1000000000000,
 				Backend:   system.Backend{BackendsPath: tempDir},
 			}
-			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true)
+			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true, false)
 			Expect(err).NotTo(HaveOccurred())

 			metaBackendPath := filepath.Join(tempDir, "meta-backend")
@@ -709,7 +709,7 @@ var _ = Describe("Gallery Backends", func() {
 				VRAM:      1000000000000,
 				Backend:   system.Backend{BackendsPath: tempDir},
 			}
-			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true)
+			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true, false)
 			Expect(err).NotTo(HaveOccurred())

 			metaBackendPath := filepath.Join(tempDir, "meta-backend")
@@ -808,7 +808,7 @@ var _ = Describe("Gallery Backends", func() {
 				system.WithBackendPath(newPath),
 			)
 			Expect(err).NotTo(HaveOccurred())
-			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil)
+			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
 			Expect(newPath).To(BeADirectory())
 			Expect(err).To(HaveOccurred()) // Will fail due to invalid URI, but path should be created
 		})
@@ -840,7 +840,7 @@ var _ = Describe("Gallery Backends", func() {
 				system.WithBackendPath(tempDir),
 			)
 			Expect(err).NotTo(HaveOccurred())
-			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil)
+			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile())
 			dat, err := os.ReadFile(filepath.Join(tempDir, "test-backend", "metadata.json"))
@@ -873,7 +873,7 @@ var _ = Describe("Gallery Backends", func() {

 			Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).ToNot(BeARegularFile())

-			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil)
+			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile())
 		})
@@ -894,7 +894,7 @@ var _ = Describe("Gallery Backends", func() {
 				system.WithBackendPath(tempDir),
 			)
 			Expect(err).NotTo(HaveOccurred())
-			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil)
+			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile())

--- a/core/gallery/backends_version_test.go
+++ b/core/gallery/backends_version_test.go
@@ -47,7 +47,7 @@ var _ = Describe("Backend versioning", func() {
 		backend.URI = srcDir
 		backend.Version = "1.2.3"

-		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil)
+		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil, false)
 		Expect(err).NotTo(HaveOccurred())

 		// Read the metadata file and check version
@@ -74,7 +74,7 @@ var _ = Describe("Backend versioning", func() {
 		backend.URI = srcDir
 		backend.Version = "2.0.0"

-		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil)
+		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil, false)
 		Expect(err).NotTo(HaveOccurred())

 		metadataPath := filepath.Join(tempDir, "test-backend-uri", "metadata.json")
@@ -100,7 +100,7 @@ var _ = Describe("Backend versioning", func() {
 		backend.URI = srcDir
 		// Version intentionally left empty

-		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil)
+		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil, false)
 		Expect(err).NotTo(HaveOccurred())

 		metadataPath := filepath.Join(tempDir, "test-backend-noversion", "metadata.json")
--- a/core/gallery/importers/llama-cpp.go
+++ b/core/gallery/importers/llama-cpp.go
@@ -1,10 +1,13 @@
 package importers

 import (
+	"context"
 	"encoding/json"
 	"path/filepath"
 	"strings"
+	"time"

+	gguf "github.com/gpustack/gguf-parser-go"
 	"github.com/mudler/LocalAI/core/config"
 	"github.com/mudler/LocalAI/core/gallery"
 	"github.com/mudler/LocalAI/core/schema"
@@ -261,6 +264,13 @@ func (i *LlamaCPPImporter) Import(details Details) (gallery.ModelConfig, error)
 	// Apply per-model-family inference parameter defaults
 	config.ApplyInferenceDefaults(&modelConfig, details.URI)

+	// Auto-detect Multi-Token Prediction heads (ggml-org/llama.cpp#22673) and
+	// enable speculative decoding. Mirrors the load-time hook so freshly
+	// imported configs already carry spec_type:draft-mtp before the model is
+	// ever loaded - users see it in the YAML preview rather than discovering
+	// it after the first start.
+	maybeApplyMTPDefaults(&modelConfig, details, &cfg)
+
 	data, err := yaml.Marshal(modelConfig)
 	if err != nil {
 		return gallery.ModelConfig{}, err
@@ -291,6 +301,85 @@ func pickPreferredGroup(groups []hfapi.ShardGroup, prefs []string) *hfapi.ShardG
 	return &groups[len(groups)-1]
 }

+// maybeApplyMTPDefaults parses the picked GGUF header (range-fetched over
+// HTTP for HF/URL imports) and, if the file declares a Multi-Token Prediction
+// head, appends the auto-MTP option keys to modelConfig.Options. Failures
+// during the probe are non-fatal: the importer keeps the config without MTP
+// so an unrelated network blip or weird header doesn't break the import.
+//
+// OCI/Ollama URIs are skipped because the artifact isn't directly fetchable
+// as a GGUF byte stream - the load-time hook (core/config/gguf.go) covers
+// those once the model is materialised on disk.
+func maybeApplyMTPDefaults(modelConfig *config.ModelConfig, details Details, cfg *gallery.ModelConfig) {
+	probeURL := pickMTPProbeURL(details, cfg)
+	if probeURL == "" {
+		return
+	}
+
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
+
+	defer func() {
+		if r := recover(); r != nil {
+			xlog.Debug("[mtp-importer] panic while probing GGUF header", "uri", probeURL, "recover", r)
+		}
+	}()
+
+	f, err := gguf.ParseGGUFFileRemote(ctx, probeURL)
+	if err != nil {
+		xlog.Debug("[mtp-importer] failed to read remote GGUF header for MTP detection", "uri", probeURL, "error", err)
+		return
+	}
+
+	n, ok := config.HasEmbeddedMTPHead(f)
+	if !ok {
+		return
+	}
+	config.ApplyMTPDefaults(modelConfig, n)
+}
+
+// pickMTPProbeURL returns an HTTP(S) URL pointing at the main (non-mmproj)
+// GGUF shard that should be inspected for an MTP head, or "" when no
+// suitable URL is available. Custom URI schemes such as `huggingface://`
+// are run through `downloader.URI.ResolveURL` so the resulting URL is
+// something `gguf.ParseGGUFFileRemote` can actually open.
+// OCI/Ollama URIs are skipped because the artifact is not directly
+// streamable as a GGUF byte range.
+func pickMTPProbeURL(details Details, cfg *gallery.ModelConfig) string {
+	uri := downloader.URI(details.URI)
+
+	if uri.LooksLikeOCI() {
+		return ""
+	}
+
+	if strings.HasSuffix(strings.ToLower(details.URI), ".gguf") {
+		return resolveHTTPProbe(details.URI)
+	}
+
+	for _, f := range cfg.Files {
+		lower := strings.ToLower(f.Filename)
+		if strings.Contains(lower, "mmproj") {
+			continue
+		}
+		if !strings.HasSuffix(lower, ".gguf") {
+			continue
+		}
+		return resolveHTTPProbe(f.URI)
+	}
+	return ""
+}
+
+// resolveHTTPProbe resolves an importer-side URI to the HTTP(S) URL that
+// `gguf.ParseGGUFFileRemote` can range-fetch. Returns "" if the URI can't
+// be reduced to an HTTP(S) endpoint (e.g. local path, unsupported scheme).
+func resolveHTTPProbe(uri string) string {
+	resolved := downloader.URI(uri).ResolveURL()
+	if downloader.URI(resolved).LooksLikeHTTPURL() {
+		return resolved
+	}
+	return ""
+}
+
 // appendShardGroup copies every shard of group into cfg.Files under dest,
 // skipping any entry whose target filename is already present so repeated
 // calls (e.g. the rare case of mmproj + model picking the same group)
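The file-selection loop in pickMTPProbeURL skips multimodal projector files and anything that is not a GGUF, then takes the first remaining entry. The sketch below isolates that filter on plain filenames. `pickMainGGUF` is a hypothetical helper for illustration; the real function works on gallery file entries and resolves the chosen URI afterwards.

```go
package main

import (
	"fmt"
	"strings"
)

// pickMainGGUF returns the first filename that ends in .gguf and is not an
// mmproj (multimodal projector) file, or "" when none qualifies. Matching is
// case-insensitive, as in the diff.
func pickMainGGUF(filenames []string) string {
	for _, name := range filenames {
		lower := strings.ToLower(name)
		if strings.Contains(lower, "mmproj") {
			continue
		}
		if !strings.HasSuffix(lower, ".gguf") {
			continue
		}
		return name
	}
	return ""
}

func main() {
	files := []string{"mmproj-model-f16.gguf", "README.md", "model-Q4_K_M.gguf"}
	fmt.Println(pickMainGGUF(files))
}
```

For a sharded model this picks whichever shard is listed first; the sketch makes no claim about which shard carries the metadata.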
--- a/core/gallery/models.go
+++ b/core/gallery/models.go
@@ -77,7 +77,7 @@ func InstallModelFromGallery(
 	modelGalleries, backendGalleries []lconfig.Gallery,
 	systemState *system.SystemState,
 	modelLoader *model.ModelLoader,
-	name string, req GalleryModel, downloadStatus func(string, string, string, float64), enforceScan, automaticallyInstallBackend bool) error {
+	name string, req GalleryModel, downloadStatus func(string, string, string, float64), enforceScan, automaticallyInstallBackend, requireBackendIntegrity bool) error {

 	applyModel := func(model *GalleryModel) error {
 		name = strings.ReplaceAll(name, string(os.PathSeparator), "__")
@@ -137,7 +137,7 @@ func InstallModelFromGallery(
 		if automaticallyInstallBackend && installedModel.Backend != "" {
 			xlog.Debug("Installing backend", "backend", installedModel.Backend)

-			if err := InstallBackendFromGallery(ctx, backendGalleries, systemState, modelLoader, installedModel.Backend, downloadStatus, false); err != nil {
+			if err := InstallBackendFromGallery(ctx, backendGalleries, systemState, modelLoader, installedModel.Backend, downloadStatus, false, requireBackendIntegrity); err != nil {
 				return err
 			}
 		}
--- a/core/gallery/models_test.go
+++ b/core/gallery/models_test.go
@@ -89,7 +89,7 @@ var _ = Describe("Model test", func() {
 			Expect(models[0].URL).To(Equal(bertEmbeddingsURL))
 			Expect(models[0].Installed).To(BeFalse())

-			err = InstallModelFromGallery(context.TODO(), galleries, []config.Gallery{}, systemState, nil, "test@bert", GalleryModel{}, func(s1, s2, s3 string, f float64) {}, true, true)
+			err = InstallModelFromGallery(context.TODO(), galleries, []config.Gallery{}, systemState, nil, "test@bert", GalleryModel{}, func(s1, s2, s3 string, f float64) {}, true, true, false)
 			Expect(err).ToNot(HaveOccurred())

 			dat, err := os.ReadFile(filepath.Join(tempdir, "bert.yaml"))
--- a/core/gallery/upgrade.go
+++ b/core/gallery/upgrade.go
@@ -232,7 +232,7 @@ func summarizeNodeDrift(nodes []NodeBackendRef) (majority struct{ version, diges

 // UpgradeBackend upgrades a single backend to the latest gallery version using
 // an atomic swap with backup-based rollback on failure.
-func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, galleries []config.Gallery, backendName string, downloadStatus func(string, string, string, float64)) error {
+func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, galleries []config.Gallery, backendName string, downloadStatus func(string, string, string, float64), requireIntegrity bool) error {
 	// Look up the installed backend
 	installedBackends, err := ListSystemBackends(systemState)
 	if err != nil {
@@ -251,7 +251,7 @@ func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelL
 	// If this is a meta backend, recursively upgrade the concrete backend it points to
 	if installed.Metadata != nil && installed.Metadata.MetaBackendFor != "" {
 		xlog.Info("Meta backend detected, upgrading concrete backend", "meta", backendName, "concrete", installed.Metadata.MetaBackendFor)
-		return UpgradeBackend(ctx, systemState, modelLoader, galleries, installed.Metadata.MetaBackendFor, downloadStatus)
+		return UpgradeBackend(ctx, systemState, modelLoader, galleries, installed.Metadata.MetaBackendFor, downloadStatus, requireIntegrity)
 	}

 	// Find the gallery entry
@@ -265,6 +265,16 @@ func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelL
 		return fmt.Errorf("no gallery entry found for backend %q", backendName)
 	}

+	// Resolve integrity options (cosign verifier for OCI URIs, strict-mode
+	// gate for missing SHA256/policy) BEFORE writing anything to disk.
+	// Without this, the upgrade path would atomically swap in an
+	// unverified backend even when the gallery has a verification policy
+	// — see backendDownloadOptions in backends.go.
+	downloadOpts, err := backendDownloadOptions(galleryEntry, requireIntegrity)
+	if err != nil {
+		return fmt.Errorf("upgrade %q: %w", backendName, err)
+	}
+
 	backendPath := filepath.Join(systemState.Backend.BackendsPath, backendName)
 	tmpPath := backendPath + ".upgrade-tmp"
 	backupPath := backendPath + ".backup"
@@ -285,7 +295,7 @@ func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelL
 			return fmt.Errorf("failed to copy backend from directory: %w", err)
 		}
 	} else {
-		if err := uri.DownloadFileWithContext(ctx, tmpPath, "", 1, 1, downloadStatus); err != nil {
+		if err := uri.DownloadFileWithContext(ctx, tmpPath, galleryEntry.SHA256, 1, 1, downloadStatus, downloadOpts...); err != nil {
 			os.RemoveAll(tmpPath)
 			return fmt.Errorf("failed to download backend: %w", err)
 		}
--- a/core/gallery/upgrade_test.go
+++ b/core/gallery/upgrade_test.go
@@ -383,7 +383,7 @@ var _ = Describe("Upgrade Detection and Execution", func() {
 			})

 			ml := model.NewModelLoader(systemState)
-			err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil)
+			err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil, false)
 			Expect(err).NotTo(HaveOccurred())

 			// Verify run.sh was updated
@@ -417,7 +417,7 @@ var _ = Describe("Upgrade Detection and Execution", func() {
 			})

 			ml := model.NewModelLoader(systemState)
-			err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil)
+			err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil, false)
 			Expect(err).To(HaveOccurred())

 			// Verify v1 is still intact
@@ -432,5 +432,41 @@ var _ = Describe("Upgrade Detection and Execution", func() {
 			Expect(json.Unmarshal(metaData, &meta)).To(Succeed())
 			Expect(meta.Version).To(Equal("1.0.0"))
 		})
+
+		// Regression: an earlier version of UpgradeBackend wrote the
+		// downloaded bytes to disk without going through
+		// backendDownloadOptions, so the gallery's verification policy
+		// (and strict-integrity gate) didn't apply on upgrade. This test
+		// pins the upgrade path to the same integrity gate as installs:
+		// strict mode + an OCI URI without a verification: block must
+		// hard-fail *before* anything is downloaded or swapped in.
+		It("should refuse to upgrade an OCI backend that bypasses integrity in strict mode", func() {
+			installBackendWithVersion("my-backend", "1.0.0", "#!/bin/sh\necho v1")
+
+			// OCI URI, no Gallery.Verification → backendDownloadOptions
+			// returns a strict-integrity error before any network call.
+			writeGalleryYAML([]GalleryBackend{
+				{
+					Metadata: Metadata{
+						Name: "my-backend",
+					},
+					URI:     "oci://example.invalid/missing:never-fetched",
+					Version: "2.0.0",
+				},
+			})
+
+			ml := model.NewModelLoader(systemState)
+			err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil, true)
+			Expect(err).To(HaveOccurred())
+			Expect(err.Error()).To(ContainSubstring("strict integrity"))
+
+			// The installed v1 must be untouched — the upgrade should
+			// have aborted before writing anything.
+			content, err := os.ReadFile(filepath.Join(backendsPath, "my-backend", "run.sh"))
+			Expect(err).NotTo(HaveOccurred())
+			Expect(string(content)).To(Equal("#!/bin/sh\necho v1"))
+			Expect(filepath.Join(backendsPath, "my-backend.upgrade-tmp")).NotTo(BeAnExistingFile())
+			Expect(filepath.Join(backendsPath, "my-backend.backup")).NotTo(BeAnExistingFile())
+		})
 	})
 })
--- a/core/http/app.go
+++ b/core/http/app.go
@@ -28,6 +28,7 @@ import (
 	"github.com/mudler/LocalAI/core/services/monitoring"
 	"github.com/mudler/LocalAI/core/services/nodes"
 	"github.com/mudler/LocalAI/core/services/quantization"
+	"github.com/mudler/LocalAI/pkg/signals"

 	"github.com/mudler/xlog"
 )
@@ -267,9 +268,12 @@ func API(application *application.Application) (*echo.Echo, error) {
 		e.Static("/generated-videos", videoPath)
 	}

-	// Initialize usage recording when auth DB is available
+	// Initialize usage recording when auth DB is available, and ensure the
+	// batcher drains its in-memory queue on graceful shutdown so the last
+	// few seconds of usage don't disappear when the process exits.
 	if application.AuthDB() != nil {
 		httpMiddleware.InitUsageRecorder(application.AuthDB())
+		signals.RegisterGracefulTerminationHandler(httpMiddleware.ShutdownUsageRecorder)
 	}

 	// Auth is applied to _all_ endpoints. Filtering out endpoints to bypass is
@@ -403,7 +407,7 @@ func API(application *application.Application) (*echo.Echo, error) {
 		}
 	}
 	routes.RegisterNodeSelfServiceRoutes(e, registry, distCfg.RegistrationToken, distCfg.AutoApproveNodes, application.AuthDB(), application.ApplicationConfig().Auth.APIKeyHMACSecret)
-	routes.RegisterNodeAdminRoutes(e, registry, remoteUnloader, adminMiddleware, application.AuthDB(), application.ApplicationConfig().Auth.APIKeyHMACSecret, application.ApplicationConfig().Distributed.RegistrationToken)
+	routes.RegisterNodeAdminRoutes(e, registry, remoteUnloader, application.GalleryService(), opcache, application.ApplicationConfig(), adminMiddleware, application.AuthDB(), application.ApplicationConfig().Auth.APIKeyHMACSecret, application.ApplicationConfig().Distributed.RegistrationToken)

 	// Distributed SSE routes (job progress + agent events via NATS)
 	if d := application.Distributed(); d != nil {
--- a/core/http/auth/db.go
+++ b/core/http/auth/db.go
@@ -38,9 +38,15 @@ func InitDB(databaseURL string) (*gorm.DB, error) {
 	}

 	// Backfill: users created before the provider column existed have an empty
-	// provider — treat them as local accounts so the UI can identify them.
+	// provider - treat them as local accounts so the UI can identify them.
 	db.Exec("UPDATE users SET provider = ? WHERE provider = '' OR provider IS NULL", ProviderLocal)

+	// Backfill: usage_records written before the source column existed have no
+	// source value. Classify them so the new per-source aggregators include them.
+	if err := BackfillUsageSource(db); err != nil {
+		return nil, fmt.Errorf("failed to backfill usage source: %w", err)
+	}
+
 	// Create composite index on users(provider, subject) for fast OAuth lookups
 	if err := db.Exec("CREATE INDEX IF NOT EXISTS idx_users_provider_subject ON users(provider, subject)").Error; err != nil {
 		// Ignore error on postgres if index already exists
--- a/core/http/auth/middleware.go
+++ b/core/http/auth/middleware.go
@@ -16,8 +16,10 @@ import (
 )

 const (
-	contextKeyUser = "auth_user"
-	contextKeyRole = "auth_role"
+	contextKeyUser   = "auth_user"
+	contextKeyRole   = "auth_role"
+	contextKeyAPIKey = "auth_apikey"
+	contextKeySource = "auth_source"
 )

 // Middleware returns an Echo middleware that handles authentication.
@@ -75,6 +77,7 @@ func Middleware(db *gorm.DB, appConfig *config.ApplicationConfig) echo.Middlewar
 					}
 					c.Set(contextKeyUser, syntheticUser)
 					c.Set(contextKeyRole, RoleAdmin)
+					c.Set(contextKeySource, UsageSourceLegacy)
 					authenticated = true
 				}
 			}
@@ -213,6 +216,20 @@ func GetUserRole(c echo.Context) string {
 	return role
 }

+// GetAPIKey returns the resolved API key from the echo context, or nil.
+// Nil for session-cookie and legacy-env-key authentication.
+func GetAPIKey(c echo.Context) *UserAPIKey {
+	k, _ := c.Get(contextKeyAPIKey).(*UserAPIKey)
+	return k
+}
+
+// GetSource returns the request's authentication source: UsageSourceAPIKey,
+// UsageSourceWeb, UsageSourceLegacy, or empty if no authentication was performed.
+func GetSource(c echo.Context) string {
+	s, _ := c.Get(contextKeySource).(string)
+	return s
+}
+
 // RequireRouteFeature returns a global middleware that checks the user has access
 // to the feature required by the matched route. It uses the RouteFeatureRegistry
 // to look up the required feature for each route pattern + HTTP method.
@@ -421,47 +438,67 @@ func RequireQuota(db *gorm.DB) echo.MiddlewareFunc {
 }

 // tryAuthenticate attempts to authenticate the request using the database.
+//
+// On success it returns the user and, as a side effect, sets the following
+// values on the Echo context:
+//   - contextKeySource ("auth_source"): always set, one of UsageSourceWeb /
+//     UsageSourceAPIKey. UsageSourceLegacy is set elsewhere by the parent
+//     Middleware when a legacy env key matches.
+//   - contextKeyAPIKey ("auth_apikey"): set to the resolved *UserAPIKey for
+//     named-key branches (Bearer, x-api-key, xi-api-key, token cookie).
+//   - "_auth_session": session record, used by Middleware to drive cookie
+//     rotation. Only set on the session-cookie branch.
+//
+// contextKeyUser and contextKeyRole are populated by the parent Middleware
+// after this function returns.
 func tryAuthenticate(c echo.Context, db *gorm.DB, appConfig *config.ApplicationConfig) *User {
 	hmacSecret := appConfig.Auth.APIKeyHMACSecret

-	// a. Session cookie
+	// a. Session cookie -> web UI
 	if cookie, err := c.Cookie(sessionCookie); err == nil && cookie.Value != "" {
 		if user, session := ValidateSession(db, cookie.Value, hmacSecret); user != nil {
 			// Store session for rotation check in middleware
 			c.Set("_auth_session", session)
+			c.Set(contextKeySource, UsageSourceWeb)
 			return user
 		}
 	}

-	// b. Authorization: Bearer token
+	// b. Authorization: Bearer
 	authHeader := c.Request().Header.Get("Authorization")
 	if strings.HasPrefix(authHeader, "Bearer ") {
 		token := strings.TrimPrefix(authHeader, "Bearer ")

-		// Try as session ID first
+		// b1. Session token via Bearer -> still web UI
 		if user, _ := ValidateSession(db, token, hmacSecret); user != nil {
+			c.Set(contextKeySource, UsageSourceWeb)
 			return user
 		}

-		// Try as user API key
+		// b2. Named API key
 		if key, err := ValidateAPIKey(db, token, hmacSecret); err == nil {
+			c.Set(contextKeySource, UsageSourceAPIKey)
+			c.Set(contextKeyAPIKey, key)
 			return &key.User
 		}
 	}

-	// c. x-api-key / xi-api-key headers
+	// c. x-api-key / xi-api-key -> named API key
 	for _, header := range []string{"x-api-key", "xi-api-key"} {
-		if key := c.Request().Header.Get(header); key != "" {
-			if apiKey, err := ValidateAPIKey(db, key, hmacSecret); err == nil {
+		if k := c.Request().Header.Get(header); k != "" {
+			if apiKey, err := ValidateAPIKey(db, k, hmacSecret); err == nil {
+				c.Set(contextKeySource, UsageSourceAPIKey)
+				c.Set(contextKeyAPIKey, apiKey)
 				return &apiKey.User
 			}
 		}
 	}

-	// d. token cookie (legacy)
+	// d. token cookie -> named API key
 	if cookie, err := c.Cookie("token"); err == nil && cookie.Value != "" {
-		// Try as user API key
 		if key, err := ValidateAPIKey(db, cookie.Value, hmacSecret); err == nil {
+			c.Set(contextKeySource, UsageSourceAPIKey)
+			c.Set(contextKeyAPIKey, key)
 			return &key.User
 		}
 	}
--- a/core/http/auth/middleware_test.go
+++ b/core/http/auth/middleware_test.go
@@ -303,4 +303,122 @@ var _ = Describe("Auth Middleware", func() {
 			}
 		})
 	})
+
+	Describe("auth context plumbing for usage source", func() {
+		// probeApp builds a minimal echo app with the auth middleware and a single
+		// "/probe" route that captures the user, source, and apikey from context.
+		type probe struct {
+			user   *auth.User
+			source string
+			key    *auth.UserAPIKey
+		}
+		probeApp := func(db *gorm.DB, appConfig *config.ApplicationConfig, p *probe) *echo.Echo {
+			e := echo.New()
+			e.Use(auth.Middleware(db, appConfig))
+			e.GET("/probe", func(c echo.Context) error {
+				p.user = auth.GetUser(c)
+				p.source = auth.GetSource(c)
+				p.key = auth.GetAPIKey(c)
+				return c.NoContent(http.StatusOK)
+			})
+			return e
+		}
+
+		It("session cookie sets source=web, apikey=nil", func() {
+			db := testDB()
+			appConfig := config.NewApplicationConfig()
+			user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
+			token := createTestSession(db, user.ID)
+
+			var p probe
+			app := probeApp(db, appConfig, &p)
+			rec := doRequest(app, http.MethodGet, "/probe", withSessionCookie(token))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(p.user).ToNot(BeNil())
+			Expect(p.user.ID).To(Equal(user.ID))
+			Expect(p.source).To(Equal(auth.UsageSourceWeb))
+			Expect(p.key).To(BeNil())
+		})
+
+		It("Bearer session token sets source=web, apikey=nil", func() {
+			db := testDB()
+			appConfig := config.NewApplicationConfig()
+			user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
+			token := createTestSession(db, user.ID)
+
+			var p probe
+			app := probeApp(db, appConfig, &p)
+			rec := doRequest(app, http.MethodGet, "/probe", withBearerToken(token))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(p.user).ToNot(BeNil())
+			Expect(p.user.ID).To(Equal(user.ID))
+			Expect(p.source).To(Equal(auth.UsageSourceWeb))
+			Expect(p.key).To(BeNil())
+		})
+
+		It("Bearer API key sets source=apikey and exposes the resolved *UserAPIKey", func() {
+			db := testDB()
+			appConfig := config.NewApplicationConfig()
+			user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
+			plaintext, key, err := auth.CreateAPIKey(db, user.ID, "ci", auth.RoleUser, appConfig.Auth.APIKeyHMACSecret, nil)
+			Expect(err).ToNot(HaveOccurred())
+
+			var p probe
+			app := probeApp(db, appConfig, &p)
+			rec := doRequest(app, http.MethodGet, "/probe", withBearerToken(plaintext))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(p.source).To(Equal(auth.UsageSourceAPIKey))
+			Expect(p.key).ToNot(BeNil())
+			Expect(p.key.ID).To(Equal(key.ID))
+		})
+
+		It("x-api-key header sets source=apikey", func() {
+			db := testDB()
+			appConfig := config.NewApplicationConfig()
+			user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
+			plaintext, _, err := auth.CreateAPIKey(db, user.ID, "ci", auth.RoleUser, appConfig.Auth.APIKeyHMACSecret, nil)
+			Expect(err).ToNot(HaveOccurred())
+
+			var p probe
+			app := probeApp(db, appConfig, &p)
+			rec := doRequest(app, http.MethodGet, "/probe", withXApiKey(plaintext))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(p.source).To(Equal(auth.UsageSourceAPIKey))
+			Expect(p.key).ToNot(BeNil())
+		})
+
+		It("token cookie sets source=apikey", func() {
+			db := testDB()
+			appConfig := config.NewApplicationConfig()
+			user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
+			plaintext, _, err := auth.CreateAPIKey(db, user.ID, "ci", auth.RoleUser, appConfig.Auth.APIKeyHMACSecret, nil)
+			Expect(err).ToNot(HaveOccurred())
+
+			var p probe
+			app := probeApp(db, appConfig, &p)
+			rec := doRequest(app, http.MethodGet, "/probe", withTokenCookie(plaintext))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(p.source).To(Equal(auth.UsageSourceAPIKey))
+			Expect(p.key).ToNot(BeNil())
+		})
+
+		It("legacy env key sets source=legacy, apikey=nil", func() {
+			db := testDB()
+			appConfig := config.NewApplicationConfig()
+			appConfig.ApiKeys = []string{"legacy-secret"}
+
+			var p probe
+			app := probeApp(db, appConfig, &p)
+			rec := doRequest(app, http.MethodGet, "/probe", withBearerToken("legacy-secret"))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(p.source).To(Equal(auth.UsageSourceLegacy))
+			Expect(p.key).To(BeNil())
+		})
+	})
 })
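The branch order that tryAuthenticate applies, and that the tests above pin down, can be sketched without Echo or GORM. This is a minimal illustration under stated assumptions, not the middleware itself: `classifySource`, `sourceFor`, `sessions`, and `apiKeys` are hypothetical names, the two maps stand in for ValidateSession and ValidateAPIKey, and the cookie name "session" is assumed because the value of `sessionCookie` is not shown in this diff.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// Hypothetical stand-ins for the session and API-key tables.
var (
	sessions = map[string]bool{"sess-1": true}
	apiKeys  = map[string]bool{"key-1": true}
)

// classifySource follows the same branch order as tryAuthenticate:
// session cookie, Bearer (session first, then API key),
// x-api-key / xi-api-key, and finally the token cookie.
// It returns "" when no branch matches.
func classifySource(r *http.Request) string {
	// "session" is an assumed cookie name for this sketch.
	if c, err := r.Cookie("session"); err == nil && sessions[c.Value] {
		return "web"
	}
	if h := r.Header.Get("Authorization"); strings.HasPrefix(h, "Bearer ") {
		token := strings.TrimPrefix(h, "Bearer ")
		if sessions[token] {
			return "web" // a session token sent as Bearer still counts as web
		}
		if apiKeys[token] {
			return "apikey"
		}
	}
	for _, name := range []string{"x-api-key", "xi-api-key"} {
		if v := r.Header.Get(name); v != "" && apiKeys[v] {
			return "apikey"
		}
	}
	if c, err := r.Cookie("token"); err == nil && apiKeys[c.Value] {
		return "apikey"
	}
	return ""
}

// sourceFor builds a request carrying one header and classifies it.
func sourceFor(header, value string) string {
	r, _ := http.NewRequest(http.MethodGet, "/probe", nil)
	r.Header.Set(header, value)
	return classifySource(r)
}

func main() {
	fmt.Println(sourceFor("Cookie", "session=sess-1"))
	fmt.Println(sourceFor("Authorization", "Bearer sess-1"))
	fmt.Println(sourceFor("Authorization", "Bearer key-1"))
	fmt.Println(sourceFor("x-api-key", "key-1"))
	fmt.Println(sourceFor("Cookie", "token=key-1"))
}
```

The legacy env-key branch is left out on purpose: per the doc comment on tryAuthenticate, UsageSourceLegacy is set by the parent Middleware rather than by this function.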
--- a/core/http/auth/usage.go
+++ b/core/http/auth/usage.go
@@ -5,14 +5,31 @@ import (
 	"strings"
 	"time"

+	"github.com/mudler/xlog"
 	"gorm.io/gorm"
 )

+// Source classification for a UsageRecord.
+const (
+	UsageSourceAPIKey = "apikey" // request authenticated with a named UserAPIKey
+	UsageSourceWeb    = "web"    // request authenticated with a web UI session (cookie or Bearer session token)
+	UsageSourceLegacy = "legacy" // request authenticated with an env-configured legacy key
+)
+
 // UsageRecord represents a single API request's token usage.
 type UsageRecord struct {
-	ID               uint   `gorm:"primaryKey;autoIncrement"`
-	UserID           string `gorm:"size:36;index:idx_usage_user_time"`
-	UserName         string `gorm:"size:255"`
+	ID       uint   `gorm:"primaryKey;autoIncrement"`
+	UserID   string `gorm:"size:36;index:idx_usage_user_time"`
+	UserName string `gorm:"size:255"`
+
+	// Source classifies how the request authenticated. One of UsageSource* constants.
+	// Empty for pre-feature rows until the InitDB backfill runs.
+	Source string `gorm:"size:16;index:idx_usage_source"`
+	// APIKeyID is the UserAPIKey.ID when Source == UsageSourceAPIKey. Nil otherwise.
+	APIKeyID *string `gorm:"size:36;index:idx_usage_apikey"`
+	// APIKeyName is a snapshot of UserAPIKey.Name at write time. Survives key deletion.
+	APIKeyName string `gorm:"size:255"`
+
 	Model            string `gorm:"size:255;index"`
 	Endpoint         string `gorm:"size:255"`
 	PromptTokens     int64
@@ -30,9 +47,12 @@ func RecordUsage(db *gorm.DB, record *UsageRecord) error {
 // UsageBucket is an aggregated time bucket for the dashboard.
 type UsageBucket struct {
 	Bucket           string `json:"bucket"`
-	Model            string `json:"model"`
+	Model            string `json:"model,omitempty"`
 	UserID           string `json:"user_id,omitempty"`
 	UserName         string `json:"user_name,omitempty"`
+	Source           string `json:"source,omitempty"`
+	APIKeyID         string `json:"api_key_id,omitempty"`
+	APIKeyName       string `json:"api_key_name,omitempty"`
 	PromptTokens     int64  `json:"prompt_tokens"`
 	CompletionTokens int64  `json:"completion_tokens"`
 	TotalTokens      int64  `json:"total_tokens"`
@@ -119,6 +139,28 @@ func GetUserUsage(db *gorm.DB, userID, period string) ([]UsageBucket, error) {
 	return buckets, nil
 }

+// BackfillUsageSource sets the Source column on pre-feature usage rows.
+// Idempotent: only touches rows where source is NULL or empty.
+//   - rows whose user_id == "legacy-api-key" -> UsageSourceLegacy
+//   - everything else                        -> UsageSourceWeb
+func BackfillUsageSource(db *gorm.DB) error {
+	// Legacy first (more specific predicate)
+	if err := db.Exec(
+		`UPDATE usage_records SET source = ? WHERE (source IS NULL OR source = '') AND user_id = ?`,
+		UsageSourceLegacy, "legacy-api-key",
+	).Error; err != nil {
+		return fmt.Errorf("backfill legacy usage source: %w", err)
+	}
+	// Everything else -> web
+	if err := db.Exec(
+		`UPDATE usage_records SET source = ? WHERE (source IS NULL OR source = '')`,
+		UsageSourceWeb,
+	).Error; err != nil {
+		return fmt.Errorf("backfill web usage source: %w", err)
+	}
+	return nil
+}
+
 // GetAllUsage returns aggregated usage for all users (admin). Optional userID filter.
 func GetAllUsage(db *gorm.DB, period, userID string) ([]UsageBucket, error) {
 	sqlite := isSQLiteDB(db)
@@ -149,3 +191,257 @@ func GetAllUsage(db *gorm.DB, period, userID string) ([]UsageBucket, error) {
 	}
 	return buckets, nil
 }
+
+// TotalsEntry is a token+request roll-up.
+type TotalsEntry struct {
+	Tokens   int64 `json:"tokens"`
+	Requests int64 `json:"requests"`
+}
+
+// KeyTotal is the per-key roll-up returned by sources endpoints. UserID and
+// UserName are snapshotted from the UsageRecord so revoked-and-deleted keys
+// still carry their owner attribution in admin views.
+type KeyTotal struct {
+	APIKeyID   string    `json:"api_key_id"`
+	APIKeyName string    `json:"api_key_name"`
+	UserID     string    `json:"user_id"`
+	UserName   string    `json:"user_name"`
+	Tokens     int64     `json:"tokens"`
+	Requests   int64     `json:"requests"`
+	LastUsed   time.Time `json:"last_used"`
+}
+
+// UserSourceTotal is a per-(user, source) roll-up for sources that don't carry
+// a named API key identity (web, legacy). It exists so admin views can show
+// which user generated each block of Web UI / legacy traffic; the per-apikey
+// breakdown for source=apikey already lives in KeyTotal.
+type UserSourceTotal struct {
+	Source   string `json:"source"`
+	UserID   string `json:"user_id"`
+	UserName string `json:"user_name"`
+	Tokens   int64  `json:"tokens"`
+	Requests int64  `json:"requests"`
+}
+
+// SourceTotals summarises a per-source breakdown.
+type SourceTotals struct {
+	BySource     map[string]TotalsEntry `json:"by_source"`
+	ByKey        []KeyTotal             `json:"by_key"`                   // server-sorted desc by tokens, capped
+	ByUserSource []UserSourceTotal      `json:"by_user_source,omitempty"` // populated only when includeLegacy=true
+	GrandTotal   TotalsEntry            `json:"grand_total"`
+}
+
+const maxKeyTotals = 200
+
+// GetUserUsageBySource returns per-source aggregated usage for one user. Legacy
+// is excluded by design (visible to admins only via the admin variant).
+func GetUserUsageBySource(db *gorm.DB, userID, period string) ([]UsageBucket, SourceTotals, error) {
+	sqlite := isSQLiteDB(db)
+	since, dateFmt := periodToWindow(period, sqlite)
+	bucketExpr := fmt.Sprintf("%s as bucket", dateFmt)
+
+	query := db.Model(&UsageRecord{}).
+		Select(bucketExpr+", source, COALESCE(api_key_id, '') as api_key_id, api_key_name, "+
+			"SUM(prompt_tokens) as prompt_tokens, "+
+			"SUM(completion_tokens) as completion_tokens, "+
+			"SUM(total_tokens) as total_tokens, "+
+			"COUNT(*) as request_count").
+		Where("user_id = ?", userID).
+		Where("source <> ?", UsageSourceLegacy).
+		Group("bucket, source, api_key_id, api_key_name").
+		Order("bucket ASC")
+
+	if !since.IsZero() {
+		query = query.Where("created_at >= ?", since)
+	}
+
+	var buckets []UsageBucket
+	if err := query.Find(&buckets).Error; err != nil {
+		return nil, SourceTotals{}, err
+	}
+
+	totals := computeSourceTotals(db, userID, "", since, false)
+	return buckets, totals, nil
+}
+
+// computeSourceTotals rolls up by_source / by_key / grand_total and, when
+// includeLegacy is true, by_user_source. userID/apiKeyID are optional filters.
+// includeLegacy also controls whether the legacy bucket is exposed (admin-only).
+func computeSourceTotals(db *gorm.DB, userID, apiKeyID string, since time.Time, includeLegacy bool) SourceTotals {
+	totals := SourceTotals{BySource: map[string]TotalsEntry{}}
+
+	bySourceQ := db.Model(&UsageRecord{}).
+		Select("source, SUM(total_tokens) as tokens, COUNT(*) as requests").
+		Group("source")
+	bySourceQ = applyFilters(bySourceQ, userID, apiKeyID, since, includeLegacy)
+
+	var bySourceRows []struct {
+		Source   string
+		Tokens   int64
+		Requests int64
+	}
+	if err := bySourceQ.Scan(&bySourceRows).Error; err != nil {
+		xlog.Warn("computeSourceTotals: by-source Scan failed", "error", err)
+		return totals
+	}
+	for _, r := range bySourceRows {
+		totals.BySource[r.Source] = TotalsEntry{Tokens: r.Tokens, Requests: r.Requests}
+		totals.GrandTotal.Tokens += r.Tokens
+		totals.GrandTotal.Requests += r.Requests
+	}
+
+	byKeyQ := db.Model(&UsageRecord{}).
+		Select("COALESCE(api_key_id, '') as api_key_id, api_key_name, "+
+			"user_id, user_name, "+
+			"SUM(total_tokens) as tokens, COUNT(*) as requests, MAX(created_at) as last_used").
+		Where("api_key_id IS NOT NULL AND api_key_id <> ''").
+		Group("api_key_id, api_key_name, user_id, user_name").
+		Order("tokens DESC").
+		Limit(maxKeyTotals)
+	byKeyQ = applyFilters(byKeyQ, userID, apiKeyID, since, includeLegacy)
+
+	// Iterate Rows() manually because MAX(created_at) is returned as a string by
+	// the SQLite driver, and Go's database/sql refuses to scan that into
+	// *time.Time. Postgres returns a proper timestamp. We accept both shapes
+	// via a Rows.Scan into a string column, then parse uniformly.
+	rows, err := byKeyQ.Rows()
+	if err != nil {
+		xlog.Warn("computeSourceTotals: by-key Rows() failed", "error", err)
+	} else {
+		defer func() { _ = rows.Close() }()
+		out := make([]KeyTotal, 0)
+		for rows.Next() {
+			var (
+				apiKeyID, apiKeyName, userIDCol, userName, lastUsedRaw string
+				tokens, requests                                       int64
+			)
+			if scanErr := rows.Scan(&apiKeyID, &apiKeyName, &userIDCol, &userName, &tokens, &requests, &lastUsedRaw); scanErr != nil {
+				continue
+			}
+			out = append(out, KeyTotal{
+				APIKeyID:   apiKeyID,
+				APIKeyName: apiKeyName,
+				UserID:     userIDCol,
+				UserName:   userName,
+				Tokens:     tokens,
+				Requests:   requests,
+				LastUsed:   parseLastUsedString(lastUsedRaw),
+			})
+		}
+		if rerr := rows.Err(); rerr != nil {
+			xlog.Warn("computeSourceTotals: by-key rows iteration failed", "error", rerr)
+		}
+		totals.ByKey = out
+	}
+
+	// by_user_source: only populated for admin callers (includeLegacy=true) so
+	// they can attribute Web UI / legacy traffic to specific users. Per-apikey
+	// rows already carry user info via KeyTotal above, so this query only
+	// covers source != apikey.
+	if includeLegacy {
+		byUserSourceQ := db.Model(&UsageRecord{}).
+			Select("source, user_id, user_name, "+
+				"SUM(total_tokens) as tokens, COUNT(*) as requests").
+			Where("source <> ?", UsageSourceAPIKey).
+			Group("source, user_id, user_name").
+			Order("tokens DESC")
+		byUserSourceQ = applyFilters(byUserSourceQ, userID, apiKeyID, since, includeLegacy)
+
+		var byUserSourceRows []UserSourceTotal
+		if scanErr := byUserSourceQ.Scan(&byUserSourceRows).Error; scanErr != nil {
+			xlog.Warn("computeSourceTotals: by-user-source Scan failed", "error", scanErr)
+		} else {
+			totals.ByUserSource = byUserSourceRows
+		}
+	}
+
+	return totals
+}
+
+// parseLastUsedString converts the textual MAX(created_at) value returned by
+// SQLite (or any driver that surfaces the timestamp as a string) into a
+// time.Time. Returns the zero time on parse failure.
+func parseLastUsedString(s string) time.Time {
+	if s == "" {
+		return time.Time{}
+	}
+	// GORM's SQLite driver emits Go's default time formatting. Try the formats
+	// it commonly produces, falling back to RFC3339Nano.
+	layouts := []string{
+		"2006-01-02 15:04:05.999999999 -0700 MST",
+		"2006-01-02 15:04:05.999999999-07:00",
+		"2006-01-02 15:04:05.999999999",
+		"2006-01-02 15:04:05",
+		time.RFC3339Nano,
+		time.RFC3339,
+	}
+	for _, layout := range layouts {
+		if t, err := time.Parse(layout, s); err == nil {
+			return t
+		}
+	}
+	xlog.Warn("parseLastUsedString: unrecognised format", "value", s)
+	return time.Time{}
+}
+
+// GetAllUsageBySource is the admin variant of GetUserUsageBySource.
+// Optional filters: userID and apiKeyID. Legacy is included.
+// truncated is true iff more than maxKeyTotals distinct keys matched, i.e. by_key was capped.
+func GetAllUsageBySource(db *gorm.DB, period, userID, apiKeyID string) ([]UsageBucket, SourceTotals, bool, error) {
+	sqlite := isSQLiteDB(db)
+	since, dateFmt := periodToWindow(period, sqlite)
+	bucketExpr := fmt.Sprintf("%s as bucket", dateFmt)
+
+	query := db.Model(&UsageRecord{}).
+		Select(bucketExpr+", source, COALESCE(api_key_id, '') as api_key_id, api_key_name, "+
+			"user_id, user_name, "+
+			"SUM(prompt_tokens) as prompt_tokens, "+
+			"SUM(completion_tokens) as completion_tokens, "+
+			"SUM(total_tokens) as total_tokens, "+
+			"COUNT(*) as request_count").
+		Group("bucket, source, api_key_id, api_key_name, user_id, user_name").
+		Order("bucket ASC")
+
+	query = applyFilters(query, userID, apiKeyID, since, true)
+
+	var buckets []UsageBucket
+	if err := query.Find(&buckets).Error; err != nil {
+		return nil, SourceTotals{}, false, err
+	}
+
+	totals := computeSourceTotals(db, userID, apiKeyID, since, true)
+
+	// Count distinct api_key_ids matching the filters. If > maxKeyTotals,
+	// the by_key slice was capped and we signal truncation to the caller.
+	truncated := false
+	var distinct int64
+	countQ := applyFilters(
+		db.Model(&UsageRecord{}).
+			Distinct("api_key_id").
+			Where("api_key_id IS NOT NULL AND api_key_id <> ''"),
+		userID, apiKeyID, since, true,
+	)
+	if err := countQ.Count(&distinct).Error; err != nil {
+		xlog.Warn("GetAllUsageBySource: distinct api_key_id count failed", "error", err)
+	} else {
+		truncated = distinct > maxKeyTotals
+	}
+
+	return buckets, totals, truncated, nil
+}
+
+func applyFilters(q *gorm.DB, userID, apiKeyID string, since time.Time, includeLegacy bool) *gorm.DB {
+	if userID != "" {
+		q = q.Where("user_id = ?", userID)
+	}
+	if apiKeyID != "" {
+		q = q.Where("api_key_id = ?", apiKeyID)
+	}
+	if !since.IsZero() {
+		q = q.Where("created_at >= ?", since)
+	}
+	if !includeLegacy {
+		q = q.Where("source <> ?", UsageSourceLegacy)
+	}
+	return q
+}
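The layout fallback in parseLastUsedString can be exercised on its own with the standard library. The sketch below reuses the layout list from the diff; `parseTimestamp` and `yearOf` are hypothetical names introduced only for this illustration, and the sample inputs are made up rather than captured from a real driver.

```go
package main

import (
	"fmt"
	"time"
)

// layouts are tried in order; the first one that parses wins.
var layouts = []string{
	"2006-01-02 15:04:05.999999999 -0700 MST",
	"2006-01-02 15:04:05.999999999-07:00",
	"2006-01-02 15:04:05.999999999",
	"2006-01-02 15:04:05",
	time.RFC3339Nano,
	time.RFC3339,
}

// parseTimestamp returns the parsed time and true, or the zero time and
// false when no layout matches.
func parseTimestamp(s string) (time.Time, bool) {
	for _, layout := range layouts {
		if t, err := time.Parse(layout, s); err == nil {
			return t, true
		}
	}
	return time.Time{}, false
}

// yearOf is a demo helper: the parsed year, or -1 when parsing fails.
func yearOf(s string) int {
	t, ok := parseTimestamp(s)
	if !ok {
		return -1
	}
	return t.Year()
}

func main() {
	for _, s := range []string{
		"2026-05-22 15:50:31.5 +0000 UTC", // time.Time.String() style
		"2026-05-22 15:50:31",             // plain SQL-style datetime
		"2026-05-22T15:50:31Z",            // RFC 3339
		"not-a-time",
	} {
		fmt.Println(s, "->", yearOf(s))
	}
}
```

Returning an explicit ok flag is a choice made for the sketch; the function in the diff logs a warning and returns the zero time instead.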
--- a/core/http/auth/usage_test.go
+++ b/core/http/auth/usage_test.go
@@ -3,11 +3,13 @@
 package auth_test

 import (
+	"fmt"
 	"time"

 	"github.com/mudler/LocalAI/core/http/auth"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
+	"gorm.io/gorm"
 )

 var _ = Describe("Usage", func() {
@@ -158,4 +160,275 @@ var _ = Describe("Usage", func() {
 			}
 		})
 	})
+
+	Describe("Usage source backfill", func() {
+		It("backfills 'web' for pre-feature rows", func() {
+			db := testDB()
+
+			rawDB, err := db.DB()
+			Expect(err).ToNot(HaveOccurred())
+			_, err = rawDB.Exec(
+				`INSERT INTO usage_records (user_id, source, model, created_at, total_tokens, prompt_tokens, completion_tokens, duration) VALUES (?, '', ?, ?, 0, 0, 0, 0)`,
+				"user-x", "gpt-4", time.Now())
+			Expect(err).ToNot(HaveOccurred())
+
+			Expect(auth.BackfillUsageSource(db)).To(Succeed())
+
+			var loaded auth.UsageRecord
+			Expect(db.Where("user_id = ?", "user-x").First(&loaded).Error).To(Succeed())
+			Expect(loaded.Source).To(Equal(auth.UsageSourceWeb))
+		})
+
+		It("backfills 'legacy' for pre-feature rows with legacy-api-key user_id", func() {
+			db := testDB()
+
+			rawDB, err := db.DB()
+			Expect(err).ToNot(HaveOccurred())
+			_, err = rawDB.Exec(
+				`INSERT INTO usage_records (user_id, source, model, created_at, total_tokens, prompt_tokens, completion_tokens, duration) VALUES (?, '', ?, ?, 0, 0, 0, 0)`,
+				"legacy-api-key", "gpt-4", time.Now())
+			Expect(err).ToNot(HaveOccurred())
+
+			Expect(auth.BackfillUsageSource(db)).To(Succeed())
+
+			var loaded auth.UsageRecord
+			Expect(db.Where("user_id = ?", "legacy-api-key").First(&loaded).Error).To(Succeed())
+			Expect(loaded.Source).To(Equal(auth.UsageSourceLegacy))
+		})
+
+		It("is idempotent on re-run", func() {
+			db := testDB()
+			Expect(auth.BackfillUsageSource(db)).To(Succeed())
+			Expect(auth.BackfillUsageSource(db)).To(Succeed())
+		})
+	})
+
+	Describe("UsageRecord with source fields", func() {
+		It("persists Source, APIKeyID, APIKeyName", func() {
+			db := testDB()
+			keyID := "key-uuid-1"
+			record := &auth.UsageRecord{
+				UserID:      "user-1",
+				UserName:    "Test User",
+				Source:      auth.UsageSourceAPIKey,
+				APIKeyID:    &keyID,
+				APIKeyName:  "ci-runner",
+				Model:       "gpt-4",
+				Endpoint:    "/v1/chat/completions",
+				TotalTokens: 150,
+				CreatedAt:   time.Now(),
+			}
+			Expect(auth.RecordUsage(db, record)).To(Succeed())
+
+			var loaded auth.UsageRecord
+			Expect(db.First(&loaded, record.ID).Error).To(Succeed())
+			Expect(loaded.Source).To(Equal(auth.UsageSourceAPIKey))
+			Expect(loaded.APIKeyID).ToNot(BeNil())
+			Expect(*loaded.APIKeyID).To(Equal("key-uuid-1"))
+			Expect(loaded.APIKeyName).To(Equal("ci-runner"))
+		})
+
+		It("allows nil APIKeyID for web/legacy sources", func() {
+			db := testDB()
+			record := &auth.UsageRecord{
+				UserID:    "user-1",
+				Source:    auth.UsageSourceWeb,
+				Model:     "gpt-4",
+				CreatedAt: time.Now(),
+			}
+			Expect(auth.RecordUsage(db, record)).To(Succeed())
+
+			var loaded auth.UsageRecord
+			Expect(db.First(&loaded, record.ID).Error).To(Succeed())
+			Expect(loaded.Source).To(Equal(auth.UsageSourceWeb))
+			Expect(loaded.APIKeyID).To(BeNil())
+			Expect(loaded.APIKeyName).To(BeEmpty())
+		})
+	})
+
+	Describe("GetUserUsageBySource", func() {
+		insert := func(db *gorm.DB, userID, source, keyID, keyName string, tokens int64, when time.Time) {
+			rec := &auth.UsageRecord{
+				UserID:      userID,
+				Source:      source,
+				Model:       "gpt-4",
+				TotalTokens: tokens,
+				CreatedAt:   when,
+			}
+			if keyID != "" {
+				rec.APIKeyID = &keyID
+				rec.APIKeyName = keyName
+			}
+			Expect(auth.RecordUsage(db, rec)).To(Succeed())
+		}
+
+		It("returns only the caller's rows, never legacy", func() {
+			db := testDB()
+			now := time.Now()
+			insert(db, "alice", auth.UsageSourceAPIKey, "k1", "ci", 100, now)
+			insert(db, "alice", auth.UsageSourceWeb, "", "", 50, now)
+			insert(db, "alice", auth.UsageSourceLegacy, "", "", 30, now)
+			insert(db, "bob", auth.UsageSourceAPIKey, "k2", "bobk", 90, now)
+
+			buckets, totals, err := auth.GetUserUsageBySource(db, "alice", "month")
+			Expect(err).ToNot(HaveOccurred())
+
+			for _, b := range buckets {
+				Expect(b.UserID).To(Or(BeEmpty(), Equal("alice")))
+				Expect(b.Source).ToNot(Equal(auth.UsageSourceLegacy))
+			}
+
+			Expect(totals.GrandTotal.Tokens).To(Equal(int64(150)))
+			Expect(totals.BySource[auth.UsageSourceAPIKey].Tokens).To(Equal(int64(100)))
+			Expect(totals.BySource[auth.UsageSourceWeb].Tokens).To(Equal(int64(50)))
+			_, hasLegacy := totals.BySource[auth.UsageSourceLegacy]
+			Expect(hasLegacy).To(BeFalse())
+		})
+
+		It("snapshots survive key deletion", func() {
+			db := testDB()
+			now := time.Now()
+			insert(db, "alice", auth.UsageSourceAPIKey, "deleted-key", "old-name", 42, now)
+			_, totals, err := auth.GetUserUsageBySource(db, "alice", "month")
+			Expect(err).ToNot(HaveOccurred())
+			Expect(totals.ByKey).To(HaveLen(1))
+			Expect(totals.ByKey[0].APIKeyName).To(Equal("old-name"))
+			Expect(totals.ByKey[0].APIKeyID).To(Equal("deleted-key"))
+			Expect(totals.ByKey[0].LastUsed).ToNot(BeZero())
+			Expect(totals.ByKey[0].LastUsed).To(BeTemporally("~", now, 2*time.Second))
+		})
+	})
+
+	Describe("GetAllUsageBySource", func() {
+		insert := func(db *gorm.DB, userID, source, keyID string, tokens int64) {
+			rec := &auth.UsageRecord{
+				UserID:      userID,
+				Source:      source,
+				Model:       "gpt-4",
+				TotalTokens: tokens,
+				CreatedAt:   time.Now(),
+			}
+			if keyID != "" {
+				rec.APIKeyID = &keyID
+				rec.APIKeyName = "name-" + keyID
+			}
+			Expect(auth.RecordUsage(db, rec)).To(Succeed())
+		}
+
+		It("includes legacy for admins", func() {
+			db := testDB()
+			insert(db, "alice", auth.UsageSourceAPIKey, "k1", 10)
+			insert(db, "legacy-api-key", auth.UsageSourceLegacy, "", 5)
+
+			_, totals, _, err := auth.GetAllUsageBySource(db, "month", "", "")
+			Expect(err).ToNot(HaveOccurred())
+			Expect(totals.BySource).To(HaveKey(auth.UsageSourceLegacy))
+			Expect(totals.BySource[auth.UsageSourceLegacy].Tokens).To(Equal(int64(5)))
+		})
+
+		It("filters by user_id AND api_key_id", func() {
+			db := testDB()
+			insert(db, "alice", auth.UsageSourceAPIKey, "k1", 10)
+			insert(db, "alice", auth.UsageSourceAPIKey, "k2", 20)
+			insert(db, "bob", auth.UsageSourceAPIKey, "k3", 30)
+
+			_, totals, _, err := auth.GetAllUsageBySource(db, "month", "alice", "k2")
+			Expect(err).ToNot(HaveOccurred())
+			Expect(totals.GrandTotal.Tokens).To(Equal(int64(20)))
+		})
+
+		It("sets truncated=true when by_key exceeds the cap", func() {
+			db := testDB()
+			for i := 0; i < 210; i++ {
+				insert(db, "alice", auth.UsageSourceAPIKey, fmt.Sprintf("key-%03d", i), int64(210-i))
+			}
+
+			_, totals, truncated, err := auth.GetAllUsageBySource(db, "month", "", "")
+			Expect(err).ToNot(HaveOccurred())
+			Expect(truncated).To(BeTrue())
+			Expect(totals.ByKey).To(HaveLen(200))
+			Expect(totals.ByKey[0].Tokens).To(BeNumerically(">", totals.ByKey[199].Tokens))
+		})
+
+		// insertNamed records a row with explicit user_id, user_name, source,
+		// and optional api key snapshot. Used by the user-attribution tests
+		// below, which need a user_name that the older insert helper can't set.
+		insertNamed := func(db *gorm.DB, userID, userName, source, keyID, keyName string, tokens int64) {
+			rec := &auth.UsageRecord{
+				UserID:      userID,
+				UserName:    userName,
+				Source:      source,
+				Model:       "gpt-4",
+				TotalTokens: tokens,
+				CreatedAt:   time.Now(),
+			}
+			if keyID != "" {
+				rec.APIKeyID = &keyID
+				rec.APIKeyName = keyName
+			}
+			Expect(auth.RecordUsage(db, rec)).To(Succeed())
+		}
+
+		It("attributes each KeyTotal to its owner user", func() {
+			db := testDB()
+			insertNamed(db, "alice", "Alice", auth.UsageSourceAPIKey, "k1", "ci-runner", 100)
+			insertNamed(db, "bob", "Bob", auth.UsageSourceAPIKey, "k2", "lap", 50)
+
+			_, totals, _, err := auth.GetAllUsageBySource(db, "month", "", "")
+			Expect(err).ToNot(HaveOccurred())
+			Expect(totals.ByKey).To(HaveLen(2))
+
+			byID := map[string]auth.KeyTotal{}
+			for _, k := range totals.ByKey {
+				byID[k.APIKeyID] = k
+			}
+			Expect(byID["k1"].UserID).To(Equal("alice"))
+			Expect(byID["k1"].UserName).To(Equal("Alice"))
+			Expect(byID["k2"].UserID).To(Equal("bob"))
+			Expect(byID["k2"].UserName).To(Equal("Bob"))
+		})
+
+		It("breaks Web UI and legacy traffic out per user in by_user_source for admin", func() {
+			db := testDB()
+			// Alice and Bob both have Web UI traffic; a synthetic legacy user
+			// also contributes. ByUserSource should expose one row per
+			// (source, user) pair, never for source=apikey.
+			insertNamed(db, "alice", "Alice", auth.UsageSourceWeb, "", "", 30)
+			insertNamed(db, "bob", "Bob", auth.UsageSourceWeb, "", "", 70)
+			insertNamed(db, "legacy-api-key", "API Key User", auth.UsageSourceLegacy, "", "", 10)
+			insertNamed(db, "alice", "Alice", auth.UsageSourceAPIKey, "k1", "ci-runner", 5)
+
+			_, totals, _, err := auth.GetAllUsageBySource(db, "month", "", "")
+			Expect(err).ToNot(HaveOccurred())
+			Expect(totals.ByUserSource).ToNot(BeEmpty())
+
+			for _, r := range totals.ByUserSource {
+				Expect(r.Source).ToNot(Equal(auth.UsageSourceAPIKey))
+			}
+
+			webByUser := map[string]int64{}
+			legacyByUser := map[string]int64{}
+			for _, r := range totals.ByUserSource {
+				switch r.Source {
+				case auth.UsageSourceWeb:
+					webByUser[r.UserID] = r.Tokens
+				case auth.UsageSourceLegacy:
+					legacyByUser[r.UserID] = r.Tokens
+				}
+			}
+			Expect(webByUser["alice"]).To(Equal(int64(30)))
+			Expect(webByUser["bob"]).To(Equal(int64(70)))
+			Expect(legacyByUser["legacy-api-key"]).To(Equal(int64(10)))
+		})
+
+		It("does NOT populate by_user_source in the non-admin path", func() {
+			db := testDB()
+			insertNamed(db, "alice", "Alice", auth.UsageSourceWeb, "", "", 30)
+
+			_, totals, err := auth.GetUserUsageBySource(db, "alice", "month")
+			Expect(err).ToNot(HaveOccurred())
+			// Non-admin path uses includeLegacy=false, so by_user_source stays nil.
+			Expect(totals.ByUserSource).To(BeNil())
+		})
+	})
 })
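The cap-and-flag behaviour checked by the truncation test above can be illustrated in memory. This is a rough sketch, not the SQL path used by GetAllUsageBySource: `keyTotal`, `topKeys`, and `capDemo` are hypothetical names, and the sort plus slice stands in for ORDER BY tokens DESC with LIMIT and the separate distinct-count query.

```go
package main

import (
	"fmt"
	"sort"
)

// keyTotal is a trimmed-down stand-in for the per-key roll-up.
type keyTotal struct {
	ID     string
	Tokens int64
}

// topKeys sorts by tokens descending, keeps at most limit entries, and
// reports truncated=true only when more than limit keys were present.
func topKeys(all []keyTotal, limit int) ([]keyTotal, bool) {
	sorted := append([]keyTotal(nil), all...) // copy so the caller's slice keeps its order
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Tokens > sorted[j].Tokens })
	if len(sorted) > limit {
		return sorted[:limit], true
	}
	return sorted, false
}

// capDemo builds n keys with distinct token counts and applies the cap.
func capDemo(n, limit int) (int, bool) {
	all := make([]keyTotal, 0, n)
	for i := 0; i < n; i++ {
		all = append(all, keyTotal{ID: fmt.Sprintf("key-%03d", i), Tokens: int64(n - i)})
	}
	kept, truncated := topKeys(all, limit)
	return len(kept), truncated
}

func main() {
	kept, truncated := capDemo(210, 200)
	fmt.Println(kept, truncated)
}
```

Exactly `limit` keys do not count as truncated, which matches the `distinct > maxKeyTotals` comparison in the diff.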
--- a/core/http/endpoints/localai/nodes.go
+++ b/core/http/endpoints/localai/nodes.go
@@ -16,8 +16,11 @@ import (
 	"github.com/google/uuid"
 	"github.com/gorilla/websocket"
 	"github.com/labstack/echo/v4"
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/gallery"
 	"github.com/mudler/LocalAI/core/http/auth"
 	"github.com/mudler/LocalAI/core/schema"
+	"github.com/mudler/LocalAI/core/services/galleryop"
 	"github.com/mudler/LocalAI/core/services/nodes"
 	"github.com/mudler/xlog"
 	"gorm.io/gorm"
@@ -381,14 +384,24 @@ func ResumeNodeEndpoint(registry *nodes.NodeRegistry) echo.HandlerFunc {
 	}
 }

-// InstallBackendOnNodeEndpoint triggers backend installation on a worker node via NATS.
+// InstallBackendOnNodeEndpoint triggers backend installation on a worker node.
+// Async: enqueues a ManagementOp on the gallery service channel and returns a
+// jobID immediately. The gallery service worker goroutine drives the actual
+// install via DistributedBackendManager.InstallBackend, which honors the op's
+// TargetNodeID to scope the fan-out to one node. The UI polls /api/backends/job/:uid
+// for progress, mirroring /api/backends/install/:id.
+//
 // Backend can be either a gallery ID (resolved against BackendGalleries) or a
-// direct URI install (URI + Name + optional Alias) — same shape as the
+// direct URI install (URI + Name + optional Alias) - same shape as the
 // standalone /api/backends/install-external path, just scoped to one node.
-func InstallBackendOnNodeEndpoint(unloader nodes.NodeCommandSender) echo.HandlerFunc {
+//
+// The legacy unloader argument is retained for signature symmetry with
+// DeleteBackendOnNodeEndpoint / ListBackendsOnNodeEndpoint but is no longer
+// used here - the async path goes through galleryService.
+func InstallBackendOnNodeEndpoint(_ nodes.NodeCommandSender, galleryService *galleryop.GalleryService, opcache *galleryop.OpCache, appConfig *config.ApplicationConfig) echo.HandlerFunc {
 	return func(c echo.Context) error {
-		if unloader == nil {
-			return c.JSON(http.StatusServiceUnavailable, nodeError(http.StatusServiceUnavailable, "NATS not configured"))
+		if galleryService == nil {
+			return c.JSON(http.StatusServiceUnavailable, nodeError(http.StatusServiceUnavailable, "gallery service not configured"))
 		}
 		nodeID := c.Param("id")
 		var req struct {
@@ -401,25 +414,65 @@ func InstallBackendOnNodeEndpoint(unloader nodes.NodeCommandSender) echo.Handler
 		if err := c.Bind(&req); err != nil {
 			return c.JSON(http.StatusBadRequest, nodeError(http.StatusBadRequest, "invalid request body"))
 		}
-		// Either a gallery backend name or a direct URI must be supplied.
 		if req.Backend == "" && req.URI == "" {
 			return c.JSON(http.StatusBadRequest, nodeError(http.StatusBadRequest, "backend name or uri required"))
 		}
-		// Admin-driven backend install: not tied to a specific replica slot
-		// (no model is being loaded). Pass replica 0 to match the worker's
-		// admin process-key convention (`backend#0`). The worker's fast path
-		// takes over if the backend is already running — upgrades go through
-		// the dedicated /api/backends/upgrade path on backend.upgrade.
-		reply, err := unloader.InstallBackend(nodeID, req.Backend, "", req.BackendGalleries, req.URI, req.Name, req.Alias, 0)
+
+		jobUUID, err := uuid.NewUUID()
 		if err != nil {
-			xlog.Error("Failed to install backend on node", "node", nodeID, "backend", req.Backend, "uri", req.URI, "error", err)
-			return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "failed to install backend on node"))
+			return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "failed to generate job id"))
 		}
-		if !reply.Success {
-			xlog.Error("Backend install failed on node", "node", nodeID, "backend", req.Backend, "uri", req.URI, "error", reply.Error)
-			return c.JSON(http.StatusInternalServerError, nodeError(http.StatusInternalServerError, "backend installation failed"))
+		jobID := jobUUID.String()
+
+		// Cache key: for gallery installs, use the backend slug; for URI
+		// installs prefer the provided Name (falling back to URI). All keys
+		// are node-scoped so concurrent installs of the same backend on
+		// different nodes do not stomp each other in opcache.
+		backendKey := req.Backend
+		if backendKey == "" {
+			backendKey = req.Name
+			if backendKey == "" {
+				backendKey = req.URI
+			}
 		}
-		return c.JSON(http.StatusOK, map[string]string{"message": "backend installed"})
+		cacheKey := galleryop.NodeScopedKey(nodeID, backendKey)
+		opcache.SetBackend(cacheKey, jobID)
+
+		// Optional caller-supplied galleries override. Mirrors the standalone
+		// install path so an admin can point at a private gallery.
+		galleries := appConfig.BackendGalleries
+		if req.BackendGalleries != "" {
+			var custom []config.Gallery
+			if err := json.Unmarshal([]byte(req.BackendGalleries), &custom); err != nil {
+				xlog.Warn("Ignoring malformed backend_galleries override; falling back to configured galleries", "error", err, "nodeID", nodeID)
+			} else if len(custom) > 0 {
+				galleries = custom
+			}
+		}
+
+		ctx, cancelFunc := context.WithCancel(context.Background())
+		op := galleryop.ManagementOp[gallery.GalleryBackend, any]{
+			ID:                 jobID,
+			GalleryElementName: req.Backend,
+			Galleries:          galleries,
+			TargetNodeID:       nodeID,
+			ExternalURI:        req.URI,
+			ExternalName:       req.Name,
+			ExternalAlias:      req.Alias,
+			Context:            ctx,
+			CancelFunc:         cancelFunc,
+		}
+		galleryService.StoreCancellation(jobID, cancelFunc)
+		go func() {
+			galleryService.BackendGalleryChannel <- op
+		}()
+
+		xlog.Info("Node-scoped backend install dispatched", "node", nodeID, "backend", req.Backend, "uri", req.URI, "jobID", jobID)
+		return c.JSON(http.StatusAccepted, map[string]string{
+			"jobID":     jobID,
+			"statusUrl": "/api/backends/job/" + jobID,
+			"message":   "backend installation started",
+		})
 	}
 }

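Note on the cache key chosen in the hunk above: the fallback order (gallery slug, then Name, then URI) and the node scoping can be sketched in isolation. This is a minimal sketch, not the handler itself. `installRequest` and `nodeScopedKey` are hypothetical stand-ins introduced here; in particular, the real `galleryop.NodeScopedKey` format is not shown in this diff, so the `nodeID + "/" + key` layout is an assumption.

```go
package main

import "fmt"

// installRequest holds only the fields the handler consults when it
// picks a cache key. Hypothetical type for this sketch.
type installRequest struct {
	Backend string // gallery slug, preferred when present
	Name    string // caller-supplied name for a direct URI install
	URI     string // last-resort key for a direct URI install
}

// backendKey mirrors the fallback order used in the handler:
// Backend, then Name, then URI.
func backendKey(r installRequest) string {
	if r.Backend != "" {
		return r.Backend
	}
	if r.Name != "" {
		return r.Name
	}
	return r.URI
}

// nodeScopedKey is a hypothetical stand-in for galleryop.NodeScopedKey.
// The separator is an assumption; the point is only that the node ID is
// part of the key.
func nodeScopedKey(nodeID, key string) string {
	return nodeID + "/" + key
}

func main() {
	req := installRequest{Backend: "llama-cpp"}
	// The same backend on two nodes yields two distinct keys, so the
	// two installs cannot overwrite each other's job ID.
	fmt.Println(nodeScopedKey("node-a", backendKey(req)))
	fmt.Println(nodeScopedKey("node-b", backendKey(req)))
}
```

Scoping by node is what lets two admins install the same backend on different workers at the same time without one job ID replacing the other in the cache.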
--- a/core/http/endpoints/localai/nodes_install_async_test.go
+++ b/core/http/endpoints/localai/nodes_install_async_test.go
@@ -0,0 +1,123 @@
+package localai_test
+
+import (
+	"bytes"
+	"encoding/json"
+	"net/http"
+	"net/http/httptest"
+
+	"github.com/labstack/echo/v4"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/gallery"
+	"github.com/mudler/LocalAI/core/http/endpoints/localai"
+	"github.com/mudler/LocalAI/core/services/galleryop"
+)
+
+// InstallBackendOnNodeEndpoint became async to stop blocking the browser on
+// the 3-minute NATS reply timeout. These specs lock in the new contract:
+// HTTP 202 with a jobID, a ManagementOp enqueued on the gallery channel, and
+// an opcache entry keyed by NodeScopedKey so concurrent installs of the same
+// backend on different nodes do not stomp each other.
+var _ = Describe("InstallBackendOnNodeEndpoint async behavior", func() {
+	var (
+		e              *echo.Echo
+		galleryService *galleryop.GalleryService
+		opcache        *galleryop.OpCache
+		appCfg         *config.ApplicationConfig
+		dispatched     chan galleryop.ManagementOp[gallery.GalleryBackend, any]
+		done           chan struct{}
+		drainExited    chan struct{}
+	)
+
+	BeforeEach(func() {
+		e = echo.New()
+		appCfg = &config.ApplicationConfig{
+			BackendGalleries: []config.Gallery{{Name: "test-gallery", URL: "http://example.com"}},
+		}
+		galleryService = galleryop.NewGalleryService(appCfg, nil)
+		opcache = galleryop.NewOpCache(galleryService)
+		// Drain the gallery channel into a buffered side channel so the
+		// handler's `go func() { ch <- op }()` send does not block waiting
+		// for the real worker (which is not running in this unit test).
+		dispatched = make(chan galleryop.ManagementOp[gallery.GalleryBackend, any], 4)
+		done = make(chan struct{})
+		drainExited = make(chan struct{})
+		go func() {
+			defer close(drainExited)
+			for {
+				select {
+				case op := <-galleryService.BackendGalleryChannel:
+					dispatched <- op
+				case <-done:
+					return
+				}
+			}
+		}()
+	})
+
+	AfterEach(func() {
+		// Signal the drain goroutine to exit. We do NOT close
+		// BackendGalleryChannel: the handler's dispatch goroutine may still
+		// be pending (specs that don't Eventually-Receive), and a send on a
+		// closed channel panics. Signalling via `done` lets the drain
+		// goroutine return without touching the gallery channel.
+		close(done)
+		Eventually(drainExited, "2s").Should(BeClosed())
+	})
+
+	It("returns 202 with a jobID and dispatches a TargetNodeID-scoped op", func() {
+		body := `{"backend": "llama-cpp"}`
+		req := httptest.NewRequest(http.MethodPost, "/api/nodes/node-xyz/backends/install", bytes.NewBufferString(body))
+		req.Header.Set("Content-Type", "application/json")
+		rec := httptest.NewRecorder()
+		c := e.NewContext(req, rec)
+		c.SetParamNames("id")
+		c.SetParamValues("node-xyz")
+
+		handler := localai.InstallBackendOnNodeEndpoint(nil, galleryService, opcache, appCfg)
+		Expect(handler(c)).To(Succeed())
+		Expect(rec.Code).To(Equal(http.StatusAccepted))
+
+		var resp map[string]any
+		Expect(json.Unmarshal(rec.Body.Bytes(), &resp)).To(Succeed())
+		Expect(resp["jobID"]).To(BeAssignableToTypeOf(""))
+		Expect(resp["jobID"].(string)).ToNot(BeEmpty())
+		Expect(resp["message"]).To(Equal("backend installation started"))
+
+		Eventually(dispatched, "2s").Should(Receive())
+		Expect(opcache.Exists(galleryop.NodeScopedKey("node-xyz", "llama-cpp"))).To(BeTrue())
+		Expect(opcache.IsBackendOp(galleryop.NodeScopedKey("node-xyz", "llama-cpp"))).To(BeTrue())
+	})
+
+	It("returns 400 when neither backend nor uri is supplied", func() {
+		req := httptest.NewRequest(http.MethodPost, "/api/nodes/node-xyz/backends/install", bytes.NewBufferString(`{}`))
+		req.Header.Set("Content-Type", "application/json")
+		rec := httptest.NewRecorder()
+		c := e.NewContext(req, rec)
+		c.SetParamNames("id")
+		c.SetParamValues("node-xyz")
+
+		handler := localai.InstallBackendOnNodeEndpoint(nil, galleryService, opcache, appCfg)
+		Expect(handler(c)).To(Succeed())
+		Expect(rec.Code).To(Equal(http.StatusBadRequest))
+	})
+
+	It("accepts a direct URI install and uses the name as the cache key", func() {
+		body := `{"uri": "oci://example.com/custom-backend:v1", "name": "custom"}`
+		req := httptest.NewRequest(http.MethodPost, "/api/nodes/node-xyz/backends/install", bytes.NewBufferString(body))
+		req.Header.Set("Content-Type", "application/json")
+		rec := httptest.NewRecorder()
+		c := e.NewContext(req, rec)
+		c.SetParamNames("id")
+		c.SetParamValues("node-xyz")
+
+		handler := localai.InstallBackendOnNodeEndpoint(nil, galleryService, opcache, appCfg)
+		Expect(handler(c)).To(Succeed())
+		Expect(rec.Code).To(Equal(http.StatusAccepted))
+
+		Expect(opcache.Exists(galleryop.NodeScopedKey("node-xyz", "custom"))).To(BeTrue())
+	})
+})
--- a/core/http/endpoints/openai/chat.go
+++ b/core/http/endpoints/openai/chat.go
@@ -73,363 +73,6 @@ func mergeToolCallDeltas(existing []schema.ToolCall, deltas []schema.ToolCall) [
 // @Success 200 {object} schema.OpenAIResponse "Response"
 // @Router /v1/chat/completions [post]
 func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator *templates.Evaluator, startupOptions *config.ApplicationConfig, natsClient mcpTools.MCPNATSClient, assistantHolder *mcpTools.LocalAIAssistantHolder) echo.HandlerFunc {
-	process := func(s string, req *schema.OpenAIRequest, config *config.ModelConfig, loader *model.ModelLoader, responses chan schema.OpenAIResponse, extraUsage bool, id string, created int) error {
-		initialMessage := schema.OpenAIResponse{
-			ID:      id,
-			Created: created,
-			Model:   req.Model, // we have to return what the user sent here, due to OpenAI spec.
-			Choices: []schema.Choice{{Delta: &schema.Message{Role: "assistant"}, Index: 0, FinishReason: nil}},
-			Object:  "chat.completion.chunk",
-		}
-		responses <- initialMessage
-
-		// Detect if thinking token is already in prompt or template
-		// When UseTokenizerTemplate is enabled, predInput is empty, so we check the template
-		var template string
-		if config.TemplateConfig.UseTokenizerTemplate {
-			template = config.GetModelTemplate()
-		} else {
-			template = s
-		}
-		thinkingStartToken := reason.DetectThinkingStartToken(template, &config.ReasoningConfig)
-		extractor := reason.NewReasoningExtractor(thinkingStartToken, config.ReasoningConfig)
-
-		_, _, _, err := ComputeChoices(req, s, config, cl, startupOptions, loader, func(s string, c *[]schema.Choice) {}, func(s string, tokenUsage backend.TokenUsage) bool {
-			var reasoningDelta, contentDelta string
-
-			// Always keep the Go-side extractor in sync with raw tokens so it
-			// can serve as fallback for backends without an autoparser (e.g. vLLM).
-			goReasoning, goContent := extractor.ProcessToken(s)
-
-			// When C++ autoparser chat deltas are available, prefer them — they
-			// handle model-specific formats (Gemma 4, etc.) without Go-side tags.
-			// Otherwise fall back to Go-side extraction.
-			if tokenUsage.HasChatDeltaContent() {
-				rawReasoning, cd := tokenUsage.ChatDeltaReasoningAndContent()
-				contentDelta = cd
-				reasoningDelta = extractor.ProcessChatDeltaReasoning(rawReasoning)
-			} else {
-				reasoningDelta = goReasoning
-				contentDelta = goContent
-			}
-
-			usage := schema.OpenAIUsage{
-				PromptTokens:     tokenUsage.Prompt,
-				CompletionTokens: tokenUsage.Completion,
-				TotalTokens:      tokenUsage.Prompt + tokenUsage.Completion,
-			}
-			if extraUsage {
-				usage.TimingTokenGeneration = tokenUsage.TimingTokenGeneration
-				usage.TimingPromptProcessing = tokenUsage.TimingPromptProcessing
-			}
-
-			delta := &schema.Message{}
-			if contentDelta != "" {
-				delta.Content = &contentDelta
-			}
-			if reasoningDelta != "" {
-				delta.Reasoning = &reasoningDelta
-			}
-
-			// Usage rides as a struct field for the consumer to track the
-			// running cumulative — it is stripped before JSON marshal so the
-			// wire chunk stays spec-compliant (no `usage` on intermediate
-			// chunks). The dedicated trailer chunk (when include_usage=true)
-			// carries the final totals.
-			usageForChunk := usage
-			resp := schema.OpenAIResponse{
-				ID:      id,
-				Created: created,
-				Model:   req.Model, // we have to return what the user sent here, due to OpenAI spec.
-				Choices: []schema.Choice{{Delta: delta, Index: 0, FinishReason: nil}},
-				Object:  "chat.completion.chunk",
-				Usage:   &usageForChunk,
-			}
-
-			responses <- resp
-			return true
-		})
-		close(responses)
-		return err
-	}
-	processTools := func(noAction string, prompt string, req *schema.OpenAIRequest, config *config.ModelConfig, loader *model.ModelLoader, responses chan schema.OpenAIResponse, extraUsage bool, id string, created int, textContentToReturn *string) error {
-		// Detect if thinking token is already in prompt or template
-		var template string
-		if config.TemplateConfig.UseTokenizerTemplate {
-			template = config.GetModelTemplate()
-		} else {
-			template = prompt
-		}
-		thinkingStartToken := reason.DetectThinkingStartToken(template, &config.ReasoningConfig)
-		extractor := reason.NewReasoningExtractor(thinkingStartToken, config.ReasoningConfig)
-
-		result := ""
-		lastEmittedCount := 0
-		sentInitialRole := false
-		sentReasoning := false
-		hasChatDeltaToolCalls := false
-		hasChatDeltaContent := false
-
-		_, _, chatDeltas, err := ComputeChoices(req, prompt, config, cl, startupOptions, loader, func(s string, c *[]schema.Choice) {}, func(s string, usage backend.TokenUsage) bool {
-			result += s
-
-			// Track whether ChatDeltas from the C++ autoparser contain
-			// tool calls or content, so the retry decision can account for them.
-			for _, d := range usage.ChatDeltas {
-				if len(d.ToolCalls) > 0 {
-					hasChatDeltaToolCalls = true
-				}
-				if d.Content != "" {
-					hasChatDeltaContent = true
-				}
-			}
-
-			var reasoningDelta, contentDelta string
-
-			goReasoning, goContent := extractor.ProcessToken(s)
-
-			if usage.HasChatDeltaContent() {
-				rawReasoning, cd := usage.ChatDeltaReasoningAndContent()
-				contentDelta = cd
-				reasoningDelta = extractor.ProcessChatDeltaReasoning(rawReasoning)
-			} else {
-				reasoningDelta = goReasoning
-				contentDelta = goContent
-			}
-
-			// Emit reasoning deltas in their own SSE chunks before any tool-call chunks
-			// (OpenAI spec: reasoning and tool_calls never share a delta)
-			if reasoningDelta != "" {
-				responses <- schema.OpenAIResponse{
-					ID:      id,
-					Created: created,
-					Model:   req.Model,
-					Choices: []schema.Choice{{
-						Delta: &schema.Message{Reasoning: &reasoningDelta},
-						Index: 0,
-					}},
-					Object: "chat.completion.chunk",
-				}
-				sentReasoning = true
-			}
-
-			// Stream content deltas (cleaned of reasoning tags) while no tool calls
-			// have been detected. Once the incremental parser finds tool calls,
-			// content stops — per OpenAI spec, content and tool_calls don't mix.
-			if lastEmittedCount == 0 && contentDelta != "" {
-				if !sentInitialRole {
-					responses <- schema.OpenAIResponse{
-						ID: id, Created: created, Model: req.Model,
-						Choices: []schema.Choice{{Delta: &schema.Message{Role: "assistant"}, Index: 0}},
-						Object:  "chat.completion.chunk",
-					}
-					sentInitialRole = true
-				}
-				responses <- schema.OpenAIResponse{
-					ID: id, Created: created, Model: req.Model,
-					Choices: []schema.Choice{{
-						Delta: &schema.Message{Content: &contentDelta},
-						Index: 0,
-					}},
-					Object: "chat.completion.chunk",
-				}
-			}
-
-			// Try incremental XML parsing for streaming support using iterative parser
-			// This allows emitting partial tool calls as they're being generated
-			cleanedResult := functions.CleanupLLMResult(result, config.FunctionsConfig)
-
-			// Determine XML format from config
-			var xmlFormat *functions.XMLToolCallFormat
-			if config.FunctionsConfig.XMLFormat != nil {
-				xmlFormat = config.FunctionsConfig.XMLFormat
-			} else if config.FunctionsConfig.XMLFormatPreset != "" {
-				xmlFormat = functions.GetXMLFormatPreset(config.FunctionsConfig.XMLFormatPreset)
-			}
-
-			// Use iterative parser for streaming (partial parsing enabled)
-			// Try XML parsing first
-			partialResults, parseErr := functions.ParseXMLIterative(cleanedResult, xmlFormat, true)
-			if parseErr == nil && len(partialResults) > 0 {
-				// Emit new XML tool calls that weren't emitted before
-				if len(partialResults) > lastEmittedCount {
-					for i := lastEmittedCount; i < len(partialResults); i++ {
-						toolCall := partialResults[i]
-						initialMessage := schema.OpenAIResponse{
-							ID:      id,
-							Created: created,
-							Model:   req.Model,
-							Choices: []schema.Choice{{
-								Delta: &schema.Message{
-									Role: "assistant",
-									ToolCalls: []schema.ToolCall{
-										{
-											Index: i,
-											ID:    id,
-											Type:  "function",
-											FunctionCall: schema.FunctionCall{
-												Name: toolCall.Name,
-											},
-										},
-									},
-								},
-								Index:        0,
-								FinishReason: nil,
-							}},
-							Object: "chat.completion.chunk",
-						}
-						select {
-						case responses <- initialMessage:
-						default:
-						}
-					}
-					lastEmittedCount = len(partialResults)
-				}
-			} else {
-				// Try JSON tool call parsing for streaming.
-				// Only emit NEW tool calls (same guard as XML parser above).
-				jsonResults, jsonErr := functions.ParseJSONIterative(cleanedResult, true)
-				if jsonErr == nil && len(jsonResults) > lastEmittedCount {
-					for i := lastEmittedCount; i < len(jsonResults); i++ {
-						jsonObj := jsonResults[i]
-						name, ok := jsonObj["name"].(string)
-						if !ok || name == "" {
-							continue
-						}
-						args := "{}"
-						if argsVal, ok := jsonObj["arguments"]; ok {
-							if argsStr, ok := argsVal.(string); ok {
-								args = argsStr
-							} else {
-								argsBytes, _ := json.Marshal(argsVal)
-								args = string(argsBytes)
-							}
-						}
-						initialMessage := schema.OpenAIResponse{
-							ID:      id,
-							Created: created,
-							Model:   req.Model,
-							Choices: []schema.Choice{{
-								Delta: &schema.Message{
-									Role: "assistant",
-									ToolCalls: []schema.ToolCall{
-										{
-											Index: i,
-											ID:    id,
-											Type:  "function",
-											FunctionCall: schema.FunctionCall{
-												Name:      name,
-												Arguments: args,
-											},
-										},
-									},
-								},
-								Index:        0,
-								FinishReason: nil,
-							}},
-							Object: "chat.completion.chunk",
-						}
-						responses <- initialMessage
-					}
-					lastEmittedCount = len(jsonResults)
-				}
-			}
-			return true
-		},
-			func(attempt int) bool {
-				// After streaming completes: check if we got actionable content
-				cleaned := extractor.CleanedContent()
-				// Check for tool calls from chat deltas (will be re-checked after ComputeChoices,
-				// but we need to know here whether to retry).
-				// Also check ChatDelta flags — when the C++ autoparser is active,
-				// tool calls and content are delivered via ChatDeltas while the
-				// raw message is cleared. Without this check, we'd retry
-				// unnecessarily, losing valid results and concatenating output.
-				hasToolCalls := lastEmittedCount > 0 || hasChatDeltaToolCalls
-				hasContent := cleaned != "" || hasChatDeltaContent
-				if !hasContent && !hasToolCalls {
-					xlog.Warn("Streaming: backend produced only reasoning, retrying",
-						"reasoning_len", len(extractor.Reasoning()), "attempt", attempt+1)
-					extractor.ResetAndSuppressReasoning()
-					result = ""
-					lastEmittedCount = 0
-					sentInitialRole = false
-					hasChatDeltaToolCalls = false
-					hasChatDeltaContent = false
-					return true
-				}
-				return false
-			},
-		)
-		if err != nil {
-			return err
-		}
-		// Try using pre-parsed tool calls from C++ autoparser (chat deltas)
-		var functionResults []functions.FuncCallResults
-		var reasoning string
-
-		if deltaToolCalls := functions.ToolCallsFromChatDeltas(chatDeltas); len(deltaToolCalls) > 0 {
-			xlog.Debug("[ChatDeltas] Using pre-parsed tool calls from C++ autoparser", "count", len(deltaToolCalls))
-			functionResults = deltaToolCalls
-			// Use content/reasoning from deltas too
-			*textContentToReturn = functions.ContentFromChatDeltas(chatDeltas)
-			reasoning = functions.ReasoningFromChatDeltas(chatDeltas)
-		} else {
-			// Fallback: parse tool calls from raw text (no chat deltas from backend)
-			xlog.Debug("[ChatDeltas] no pre-parsed tool calls, falling back to Go-side text parsing")
-			reasoning = extractor.Reasoning()
-			cleanedResult := extractor.CleanedContent()
-			*textContentToReturn = functions.ParseTextContent(cleanedResult, config.FunctionsConfig)
-			cleanedResult = functions.CleanupLLMResult(cleanedResult, config.FunctionsConfig)
-			functionResults = functions.ParseFunctionCall(cleanedResult, config.FunctionsConfig)
-		}
-		xlog.Debug("[ChatDeltas] final tool call decision", "tool_calls", len(functionResults), "text_content", *textContentToReturn)
-		// noAction is a sentinel "just answer" pseudo-function — not a real
-		// tool call. Scan the whole slice rather than only index 0 so we
-		// don't drop a real tool call that happens to follow a noAction
-		// entry, and so the default branch isn't entered with only noAction
-		// entries to emit as tool_calls.
-		noActionToRun := !hasRealCall(functionResults, noAction)
-
-		switch {
-		case noActionToRun:
-			// Token-cumulative usage is communicated to the streaming
-			// consumer via the per-token callback's chunk struct (stripped
-			// before wire marshal). The final usage trailer — when the
-			// caller opted in with stream_options.include_usage — is built
-			// by the outer streaming loop, not here.
-			var result string
-			if !sentInitialRole {
-				var hqErr error
-				result, hqErr = handleQuestion(config, functionResults, extractor.CleanedContent(), prompt)
-				if hqErr != nil {
-					xlog.Error("error handling question", "error", hqErr)
-					return hqErr
-				}
-			}
-			for _, chunk := range buildNoActionFinalChunks(
-				id, req.Model, created,
-				sentInitialRole, sentReasoning,
-				result, reasoning,
-			) {
-				responses <- chunk
-			}
-
-		default:
-			for _, chunk := range buildDeferredToolCallChunks(
-				id, req.Model, created,
-				functionResults, lastEmittedCount,
-				sentInitialRole, *textContentToReturn,
-				sentReasoning, reasoning,
-			) {
-				responses <- chunk
-			}
-		}
-
-		close(responses)
-		return err
-	}
-
 	return func(c echo.Context) error {
 		var textContentToReturn string
 		id := uuid.New().String()
@@ -697,17 +340,19 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 				}

 				responses := make(chan schema.OpenAIResponse)
-				ended := make(chan error, 1)
+				ended := make(chan streamWorkerResult, 1)

 				go func() {
 					if !shouldUseFn {
-						ended <- process(predInput, input, config, ml, responses, extraUsage, id, created)
+						u, err := processStream(predInput, input, config, cl, startupOptions, ml, responses, id, created)
+						ended <- streamWorkerResult{usage: u, err: err}
 					} else {
-						ended <- processTools(noActionName, predInput, input, config, ml, responses, extraUsage, id, created, &textContentToReturn)
+						u, err := processStreamWithTools(noActionName, predInput, input, config, cl, startupOptions, ml, responses, id, created, &textContentToReturn)
+						ended <- streamWorkerResult{usage: u, err: err}
 					}
 				}()

-				usage := &schema.OpenAIUsage{}
+				var finalUsage backend.TokenUsage
 				toolsCalled := false
 				var collectedToolCalls []schema.ToolCall
 				var collectedContent string
@@ -725,13 +370,6 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 							xlog.Debug("No choices in the response, skipping")
 							continue
 						}
-						// Capture the running cumulative usage from this chunk
-						// (when present) so the include_usage trailer can carry
-						// the final totals. Usage is stripped before marshal
-						// below so the wire chunk stays spec-compliant.
-						if ev.Usage != nil {
-							usage = ev.Usage
-						}
 						if len(ev.Choices[0].Delta.ToolCalls) > 0 {
 							toolsCalled = true
 							// Collect and merge tool call deltas for MCP execution
@@ -747,11 +385,6 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 								collectedContent += *sp
 							}
 						}
-						// OpenAI streaming spec: intermediate chunks must NOT
-						// carry a `usage` field. Strip the tracking copy
-						// before marshalling — usage is delivered via the
-						// dedicated trailer chunk when include_usage=true.
-						ev.Usage = nil
 						respData, err := json.Marshal(ev)
 						if err != nil {
 							xlog.Debug("Failed to marshal response", "error", err)
@@ -766,15 +399,16 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 							return err
 						}
 						c.Response().Flush()
-					case err := <-ended:
-						if err == nil {
+					case res := <-ended:
+						if res.err == nil {
+							finalUsage = res.usage
 							break LOOP
 						}
-						xlog.Error("Stream ended with error", "error", err)
+						xlog.Error("Stream ended with error", "error", res.err)

 						errorResp := schema.ErrorResponse{
 							Error: &schema.APIError{
-								Message: err.Error(),
+								Message: res.err.Error(),
 								Type:    "server_error",
 								Code:    "server_error",
 							},
@@ -797,7 +431,10 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 				// still trying to send (e.g., after client disconnect). The goroutine
 				// calls close(responses) when done, which terminates the drain.
 				if input.Context.Err() != nil {
-					go func() { for range responses {} }()
+					go func() {
+						for range responses {
+						}
+					}()
 					<-ended
 				}

@@ -921,8 +558,16 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 				// Trailing usage chunk per OpenAI spec: emit only when the
 				// caller opted in via stream_options.include_usage. Shape:
 				// {"choices":[],"usage":{...},"object":"chat.completion.chunk",...}
-				if input.StreamOptions != nil && input.StreamOptions.IncludeUsage && usage != nil {
-					trailer := streamUsageTrailerJSON(id, input.Model, created, *usage)
+				//
+				// finalUsage is the authoritative TokenUsage returned by the
+				// worker function (processStream / processStreamWithTools) via
+				// the `ended` channel. The worker reads it from ComputeChoices'
+				// return value, which is the cumulative count produced by the
+				// backend over the whole prediction. Issue #9927 was caused by
+				// the tools-path worker not surfacing this value at all.
+				if input.StreamOptions != nil && input.StreamOptions.IncludeUsage {
+					trailerUsage := streamUsageFromTokenUsage(finalUsage, extraUsage)
+					trailer := streamUsageTrailerJSON(id, input.Model, created, trailerUsage)
 					_, _ = fmt.Fprintf(c.Response().Writer, "data: %s\n\n", trailer)
 				}

--- a/core/http/endpoints/openai/chat_emit.go
+++ b/core/http/endpoints/openai/chat_emit.go
@@ -4,10 +4,39 @@ import (
 	"encoding/json"
 	"fmt"

+	"github.com/mudler/LocalAI/core/backend"
 	"github.com/mudler/LocalAI/core/schema"
 	"github.com/mudler/LocalAI/pkg/functions"
 )

+// streamWorkerResult is what the streaming workers (processStream /
+// processStreamWithTools) return to the ChatEndpoint loop via `ended`.
+// Threading the final TokenUsage here, instead of piggy-backing it on the
+// `responses` SSE channel, keeps the SSE channel single-purpose (wire chunks)
+// and gives the trailer emitter a plain Go value to read after LOOP exits.
+// Fix for issue #9927: the previous tools-path worker never surfaced the
+// cumulative token counts at all, so the include_usage trailer reported zeros.
+type streamWorkerResult struct {
+	usage backend.TokenUsage
+	err   error
+}
+
+// streamUsageFromTokenUsage converts the backend's cumulative TokenUsage into
+// the OpenAI-spec OpenAIUsage shape used on the wire. `extraUsage` controls
+// whether the non-standard timing fields are forwarded.
+func streamUsageFromTokenUsage(usage backend.TokenUsage, extraUsage bool) schema.OpenAIUsage {
+	out := schema.OpenAIUsage{
+		PromptTokens:     usage.Prompt,
+		CompletionTokens: usage.Completion,
+		TotalTokens:      usage.Prompt + usage.Completion,
+	}
+	if extraUsage {
+		out.TimingTokenGeneration = usage.TimingTokenGeneration
+		out.TimingPromptProcessing = usage.TimingPromptProcessing
+	}
+	return out
+}
+
 // streamUsageTrailerJSON returns the bytes of the OpenAI-spec trailing usage
 // chunk emitted in streaming completions when the request opts in via
 // `stream_options.include_usage: true`. The shape is:
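Note on the trailer shape referenced above: a trailing usage chunk with an empty `choices` array can be produced with `encoding/json` roughly as below. This is a sketch, not the repository's `streamUsageTrailerJSON`. The `usage` and `trailer` types and the `usageTrailer` helper are hypothetical; the JSON field names follow the OpenAI chat completion chunk format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// usage carries the three standard token counters.
type usage struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
	TotalTokens      int `json:"total_tokens"`
}

// trailer is the final chunk: no choices, only usage.
type trailer struct {
	ID      string     `json:"id"`
	Object  string     `json:"object"`
	Created int        `json:"created"`
	Model   string     `json:"model"`
	Choices []struct{} `json:"choices"`
	Usage   usage      `json:"usage"`
}

// usageTrailer marshals the trailing chunk. Choices is set to an empty,
// non-nil slice so it is encoded as [] and not as null.
func usageTrailer(id, model string, created, prompt, completion int) ([]byte, error) {
	return json.Marshal(trailer{
		ID:      id,
		Object:  "chat.completion.chunk",
		Created: created,
		Model:   model,
		Choices: []struct{}{},
		Usage: usage{
			PromptTokens:     prompt,
			CompletionTokens: completion,
			TotalTokens:      prompt + completion,
		},
	})
}

func main() {
	b, err := usageTrailer("chatcmpl-1", "test-model", 1700000000, 18, 213)
	if err != nil {
		panic(err)
	}
	// Framed as a server-sent event, as the endpoint does for each chunk.
	fmt.Printf("data: %s\n\n", b)
}
```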
--- a/core/http/endpoints/openai/chat_stream_usage_test.go
+++ b/core/http/endpoints/openai/chat_stream_usage_test.go
@@ -1,10 +1,14 @@
 package openai

 import (
+	"context"
 	"encoding/json"

+	"github.com/mudler/LocalAI/core/backend"
+	"github.com/mudler/LocalAI/core/config"
 	"github.com/mudler/LocalAI/core/schema"
 	"github.com/mudler/LocalAI/pkg/functions"
+	"github.com/mudler/LocalAI/pkg/model"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
 )
@@ -152,6 +156,28 @@ var _ = Describe("streaming usage spec compliance", func() {
 		})
 	})

+	Describe("streamUsageFromTokenUsage", func() {
+		It("converts backend TokenUsage to schema OpenAIUsage", func() {
+			tu := backend.TokenUsage{Prompt: 18, Completion: 213}
+			u := streamUsageFromTokenUsage(tu, false)
+			Expect(u.PromptTokens).To(Equal(18))
+			Expect(u.CompletionTokens).To(Equal(213))
+			Expect(u.TotalTokens).To(Equal(231))
+			Expect(u.TimingTokenGeneration).To(BeZero())
+			Expect(u.TimingPromptProcessing).To(BeZero())
+		})
+		It("includes timings when extraUsage is true", func() {
+			tu := backend.TokenUsage{
+				Prompt: 10, Completion: 20,
+				TimingPromptProcessing: 0.5,
+				TimingTokenGeneration:  1.5,
+			}
+			u := streamUsageFromTokenUsage(tu, true)
+			Expect(u.TimingPromptProcessing).To(Equal(0.5))
+			Expect(u.TimingTokenGeneration).To(Equal(1.5))
+		})
+	})
+
 	Describe("OpenAIRequest.StreamOptions", func() {
 		It("parses stream_options.include_usage=true", func() {
 			body := []byte(`{
@@ -177,3 +203,160 @@ var _ = Describe("streaming usage spec compliance", func() {
 		})
 	})
 })
+
+// Functional regression coverage for issue #9927: the streaming workers
+// must surface the cumulative TokenUsage returned by ComputeChoices to
+// their caller. The earlier broken implementations discarded that value
+// (`_, _, chatDeltas, err := ComputeChoices(...)`) instead of returning
+// it, so the include_usage trailer always reported zeros when tools
+// were enabled.
+//
+// These tests stub backend.ModelInferenceFunc so the worker exercises the
+// real ComputeChoices → predFunc → LLMResponse pipeline. If a future change
+// drops the TokenUsage somewhere along that path, the assertions on the
+// returned value fail with a concrete count mismatch (e.g. 0 vs 213),
+// not with a "function undefined" compile error.
+var _ = Describe("streaming workers surface final TokenUsage (issue #9927)", func() {
+	var (
+		origInference modelInferenceFunc
+		appCfg        *config.ApplicationConfig
+	)
+
+	BeforeEach(func() {
+		origInference = backend.ModelInferenceFunc
+		appCfg = config.NewApplicationConfig()
+	})
+
+	AfterEach(func() {
+		backend.ModelInferenceFunc = origInference
+	})
+
+	// mockBackendUsage installs a stub backend that yields one LLMResponse
+	// carrying the supplied TokenUsage. ComputeChoices' single-attempt path
+	// copies these counts into the value it returns to the worker.
+	mockBackendUsage := func(usage backend.TokenUsage, response string) {
+		backend.ModelInferenceFunc = func(
+			ctx context.Context, s string, messages schema.Messages,
+			images, videos, audios []string,
+			loader *model.ModelLoader, c *config.ModelConfig, cl *config.ModelConfigLoader,
+			o *config.ApplicationConfig,
+			tokenCallback func(string, backend.TokenUsage) bool,
+			tools, toolChoice string,
+			logprobs, topLogprobs *int,
+			logitBias map[string]float64,
+			metadata map[string]string,
+		) (func() (backend.LLMResponse, error), error) {
+			return func() (backend.LLMResponse, error) {
+				return backend.LLMResponse{
+					Response: response,
+					Usage:    usage,
+				}, nil
+			}, nil
+		}
+	}
+
+	makeReq := func() *schema.OpenAIRequest {
+		ctx, cancel := context.WithCancel(context.Background())
+		req := &schema.OpenAIRequest{
+			Context: ctx,
+			Cancel:  cancel,
+		}
+		req.Model = "test-model" // promoted from BasicModelRequest
+		return req
+	}
+
+	// drainResponses consumes everything the worker pushes onto the channel
+	// so the worker is never blocked on its send. The channel is unbuffered
+	// (matching production), so the drain goroutine must be running before
+	// the worker is called.
+	drainResponses := func(ch <-chan schema.OpenAIResponse) <-chan struct{} {
+		done := make(chan struct{})
+		go func() {
+			for range ch {
+			}
+			close(done)
+		}()
+		return done
+	}
+
+	Describe("processStream (no-tools path)", func() {
+		It("returns the cumulative TokenUsage produced by the backend", func() {
+			mockBackendUsage(backend.TokenUsage{Prompt: 18, Completion: 213}, "Hello there")
+
+			req := makeReq()
+			cfg := &config.ModelConfig{}
+			responses := make(chan schema.OpenAIResponse)
+			done := drainResponses(responses)
+
+			actual, err := processStream("prompt", req, cfg, nil, appCfg, nil, responses, "req-1", 0)
+			<-done
+
+			Expect(err).ToNot(HaveOccurred())
+			Expect(actual.Prompt).To(Equal(18),
+				"prompt tokens must round-trip from backend through processStream")
+			Expect(actual.Completion).To(Equal(213),
+				"completion tokens must round-trip from backend through processStream")
+		})
+
+		It("returns zero TokenUsage when the backend reports zero (negative control)", func() {
+			mockBackendUsage(backend.TokenUsage{}, "x")
+
+			req := makeReq()
+			cfg := &config.ModelConfig{}
+			responses := make(chan schema.OpenAIResponse)
+			done := drainResponses(responses)
+
+			actual, err := processStream("prompt", req, cfg, nil, appCfg, nil, responses, "req-1", 0)
+			<-done
+
+			Expect(err).ToNot(HaveOccurred())
+			Expect(actual.Prompt).To(BeZero())
+			Expect(actual.Completion).To(BeZero())
+		})
+	})
+
+	Describe("processStreamWithTools (tools path)", func() {
+		It("returns the cumulative TokenUsage produced by the backend", func() {
+			// This is the direct regression check for issue #9927: with tools
+			// enabled, the trailer was reporting {0,0,0} because the worker
+			// discarded ComputeChoices' second return value.
+			mockBackendUsage(backend.TokenUsage{Prompt: 18, Completion: 213}, "answer")
+
+			req := makeReq()
+			cfg := &config.ModelConfig{}
+			responses := make(chan schema.OpenAIResponse)
+			done := drainResponses(responses)
+			var textContent string
+
+			actual, err := processStreamWithTools("none", "prompt", req, cfg, nil, appCfg, nil, responses, "req-1", 0, &textContent)
+			<-done
+
+			Expect(err).ToNot(HaveOccurred())
+			Expect(actual.Prompt).To(Equal(18),
+				"prompt tokens must round-trip from backend through processStreamWithTools (issue #9927)")
+			Expect(actual.Completion).To(Equal(213),
+				"completion tokens must round-trip from backend through processStreamWithTools (issue #9927)")
+		})
+
+		It("forwards timing fields when the backend supplies them", func() {
+			mockBackendUsage(backend.TokenUsage{
+				Prompt: 10, Completion: 20,
+				TimingPromptProcessing: 0.5,
+				TimingTokenGeneration:  1.5,
+			}, "answer")
+
+			req := makeReq()
+			cfg := &config.ModelConfig{}
+			responses := make(chan schema.OpenAIResponse)
+			done := drainResponses(responses)
+			var textContent string
+
+			actual, err := processStreamWithTools("none", "prompt", req, cfg, nil, appCfg, nil, responses, "req-1", 0, &textContent)
+			<-done
+
+			Expect(err).ToNot(HaveOccurred())
+			Expect(actual.TimingPromptProcessing).To(Equal(0.5))
+			Expect(actual.TimingTokenGeneration).To(Equal(1.5))
+		})
+	})
+})
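
The `streamUsageFromTokenUsage` specs above pin a small conversion contract: the total is prompt plus completion, and timings are copied only when extra usage is requested. The converter itself is not shown in this part of the diff, so the following is only a minimal sketch of that contract. The types `tokenUsage` and `openAIUsage` and the function `usageFromTokens` are hypothetical stand-ins, not the LocalAI definitions.

```go
package main

// tokenUsage is a hypothetical stand-in for backend.TokenUsage.
type tokenUsage struct {
	Prompt, Completion     int
	TimingPromptProcessing float64
	TimingTokenGeneration  float64
}

// openAIUsage is a hypothetical stand-in for the schema usage type.
type openAIUsage struct {
	PromptTokens, CompletionTokens, TotalTokens int
	TimingPromptProcessing                      float64
	TimingTokenGeneration                       float64
}

// usageFromTokens copies the counts, derives the total, and includes the
// timing fields only when the caller asked for extra usage.
func usageFromTokens(tu tokenUsage, extraUsage bool) openAIUsage {
	u := openAIUsage{
		PromptTokens:     tu.Prompt,
		CompletionTokens: tu.Completion,
		TotalTokens:      tu.Prompt + tu.Completion,
	}
	if extraUsage {
		u.TimingPromptProcessing = tu.TimingPromptProcessing
		u.TimingTokenGeneration = tu.TimingTokenGeneration
	}
	return u
}
```

Under these assumptions, timings stay at their zero values unless `extraUsage` is true, which is what the first spec checks with `BeZero()`.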
--- a/core/http/endpoints/openai/chat_stream_workers.go
+++ b/core/http/endpoints/openai/chat_stream_workers.go
@@ -0,0 +1,390 @@
+package openai
+
+import (
+	"encoding/json"
+
+	"github.com/mudler/LocalAI/core/backend"
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/schema"
+	"github.com/mudler/LocalAI/pkg/functions"
+	"github.com/mudler/LocalAI/pkg/model"
+	reason "github.com/mudler/LocalAI/pkg/reasoning"
+	"github.com/mudler/xlog"
+)
+
+// processStream is the streaming worker for chat completions with no
+// tool/function calling involved. It pushes SSE-shaped chunks onto
+// `responses` and returns the authoritative cumulative TokenUsage from
+// the prediction so the caller can populate the include_usage trailer
+// without having to peek inside the chunks.
+//
+// The caller owns the `responses` channel and is expected to read from
+// it while this function runs; processStream closes the channel before
+// returning.
+func processStream(
+	s string,
+	req *schema.OpenAIRequest,
+	cfg *config.ModelConfig,
+	cl *config.ModelConfigLoader,
+	startupOptions *config.ApplicationConfig,
+	loader *model.ModelLoader,
+	responses chan schema.OpenAIResponse,
+	id string,
+	created int,
+) (backend.TokenUsage, error) {
+	responses <- schema.OpenAIResponse{
+		ID:      id,
+		Created: created,
+		Model:   req.Model, // we have to return what the user sent here, due to OpenAI spec.
+		Choices: []schema.Choice{{Delta: &schema.Message{Role: "assistant"}, Index: 0, FinishReason: nil}},
+		Object:  "chat.completion.chunk",
+	}
+
+	// Detect whether the thinking start token is already present in the prompt or template.
+	// When UseTokenizerTemplate is enabled the prompt (s) is empty, so check the model template instead.
+	var template string
+	if cfg.TemplateConfig.UseTokenizerTemplate {
+		template = cfg.GetModelTemplate()
+	} else {
+		template = s
+	}
+	thinkingStartToken := reason.DetectThinkingStartToken(template, &cfg.ReasoningConfig)
+	extractor := reason.NewReasoningExtractor(thinkingStartToken, cfg.ReasoningConfig)
+
+	_, finalUsage, _, err := ComputeChoices(req, s, cfg, cl, startupOptions, loader, func(s string, c *[]schema.Choice) {}, func(s string, tokenUsage backend.TokenUsage) bool {
+		var reasoningDelta, contentDelta string
+
+		// Always keep the Go-side extractor in sync with raw tokens so it
+		// can serve as fallback for backends without an autoparser (e.g. vLLM).
+		goReasoning, goContent := extractor.ProcessToken(s)
+
+		// When C++ autoparser chat deltas are available, prefer them: they
+		// handle model-specific formats (Gemma 4, etc.) without Go-side tags.
+		// Otherwise fall back to Go-side extraction.
+		if tokenUsage.HasChatDeltaContent() {
+			rawReasoning, cd := tokenUsage.ChatDeltaReasoningAndContent()
+			contentDelta = cd
+			reasoningDelta = extractor.ProcessChatDeltaReasoning(rawReasoning)
+		} else {
+			reasoningDelta = goReasoning
+			contentDelta = goContent
+		}
+
+		delta := &schema.Message{}
+		if contentDelta != "" {
+			delta.Content = &contentDelta
+		}
+		if reasoningDelta != "" {
+			delta.Reasoning = &reasoningDelta
+		}
+
+		responses <- schema.OpenAIResponse{
+			ID:      id,
+			Created: created,
+			Model:   req.Model, // we have to return what the user sent here, due to OpenAI spec.
+			Choices: []schema.Choice{{Delta: delta, Index: 0, FinishReason: nil}},
+			Object:  "chat.completion.chunk",
+		}
+		return true
+	})
+	close(responses)
+	return finalUsage, err
+}
+
+// processStreamWithTools is the streaming worker for chat completions
+// with tools / function calling. Same contract as processStream: pushes
+// chunks onto `responses`, closes the channel, returns the cumulative
+// TokenUsage.
+//
+// Returning the TokenUsage as a normal Go value (rather than smuggling
+// it on a sentinel chunk) is the fix for issue #9927 — the previous
+// implementation discarded the value from ComputeChoices, so the
+// include_usage trailer reported zeros whenever `tools` was in play.
+func processStreamWithTools(
+	noAction string,
+	prompt string,
+	req *schema.OpenAIRequest,
+	cfg *config.ModelConfig,
+	cl *config.ModelConfigLoader,
+	startupOptions *config.ApplicationConfig,
+	loader *model.ModelLoader,
+	responses chan schema.OpenAIResponse,
+	id string,
+	created int,
+	textContentToReturn *string,
+) (backend.TokenUsage, error) {
+	// Detect if thinking token is already in prompt or template
+	var template string
+	if cfg.TemplateConfig.UseTokenizerTemplate {
+		template = cfg.GetModelTemplate()
+	} else {
+		template = prompt
+	}
+	thinkingStartToken := reason.DetectThinkingStartToken(template, &cfg.ReasoningConfig)
+	extractor := reason.NewReasoningExtractor(thinkingStartToken, cfg.ReasoningConfig)
+
+	result := ""
+	lastEmittedCount := 0
+	sentInitialRole := false
+	sentReasoning := false
+	hasChatDeltaToolCalls := false
+	hasChatDeltaContent := false
+
+	_, finalUsage, chatDeltas, err := ComputeChoices(req, prompt, cfg, cl, startupOptions, loader, func(s string, c *[]schema.Choice) {}, func(s string, usage backend.TokenUsage) bool {
+		result += s
+
+		// Track whether ChatDeltas from the C++ autoparser contain
+		// tool calls or content, so the retry decision can account for them.
+		for _, d := range usage.ChatDeltas {
+			if len(d.ToolCalls) > 0 {
+				hasChatDeltaToolCalls = true
+			}
+			if d.Content != "" {
+				hasChatDeltaContent = true
+			}
+		}
+
+		var reasoningDelta, contentDelta string
+
+		goReasoning, goContent := extractor.ProcessToken(s)
+
+		if usage.HasChatDeltaContent() {
+			rawReasoning, cd := usage.ChatDeltaReasoningAndContent()
+			contentDelta = cd
+			reasoningDelta = extractor.ProcessChatDeltaReasoning(rawReasoning)
+		} else {
+			reasoningDelta = goReasoning
+			contentDelta = goContent
+		}
+
+		// Emit reasoning deltas in their own SSE chunks before any tool-call chunks
+		// (OpenAI spec: reasoning and tool_calls never share a delta)
+		if reasoningDelta != "" {
+			responses <- schema.OpenAIResponse{
+				ID:      id,
+				Created: created,
+				Model:   req.Model,
+				Choices: []schema.Choice{{
+					Delta: &schema.Message{Reasoning: &reasoningDelta},
+					Index: 0,
+				}},
+				Object: "chat.completion.chunk",
+			}
+			sentReasoning = true
+		}
+
+		// Stream content deltas (cleaned of reasoning tags) while no tool calls
+		// have been detected. Once the incremental parser finds tool calls,
+		// content stops: per OpenAI spec, content and tool_calls don't mix.
+		if lastEmittedCount == 0 && contentDelta != "" {
+			if !sentInitialRole {
+				responses <- schema.OpenAIResponse{
+					ID: id, Created: created, Model: req.Model,
+					Choices: []schema.Choice{{Delta: &schema.Message{Role: "assistant"}, Index: 0}},
+					Object:  "chat.completion.chunk",
+				}
+				sentInitialRole = true
+			}
+			responses <- schema.OpenAIResponse{
+				ID: id, Created: created, Model: req.Model,
+				Choices: []schema.Choice{{
+					Delta: &schema.Message{Content: &contentDelta},
+					Index: 0,
+				}},
+				Object: "chat.completion.chunk",
+			}
+		}
+
+		// Try incremental XML parsing for streaming support using iterative parser
+		// This allows emitting partial tool calls as they're being generated
+		cleanedResult := functions.CleanupLLMResult(result, cfg.FunctionsConfig)
+
+		// Determine XML format from config
+		var xmlFormat *functions.XMLToolCallFormat
+		if cfg.FunctionsConfig.XMLFormat != nil {
+			xmlFormat = cfg.FunctionsConfig.XMLFormat
+		} else if cfg.FunctionsConfig.XMLFormatPreset != "" {
+			xmlFormat = functions.GetXMLFormatPreset(cfg.FunctionsConfig.XMLFormatPreset)
+		}
+
+		// Use iterative parser for streaming (partial parsing enabled)
+		// Try XML parsing first
+		partialResults, parseErr := functions.ParseXMLIterative(cleanedResult, xmlFormat, true)
+		if parseErr == nil && len(partialResults) > 0 {
+			// Emit new XML tool calls that weren't emitted before
+			if len(partialResults) > lastEmittedCount {
+				for i := lastEmittedCount; i < len(partialResults); i++ {
+					toolCall := partialResults[i]
+					initialMessage := schema.OpenAIResponse{
+						ID:      id,
+						Created: created,
+						Model:   req.Model,
+						Choices: []schema.Choice{{
+							Delta: &schema.Message{
+								Role: "assistant",
+								ToolCalls: []schema.ToolCall{
+									{
+										Index: i,
+										ID:    id,
+										Type:  "function",
+										FunctionCall: schema.FunctionCall{
+											Name: toolCall.Name,
+										},
+									},
+								},
+							},
+							Index:        0,
+							FinishReason: nil,
+						}},
+						Object: "chat.completion.chunk",
+					}
+					select {
+					case responses <- initialMessage:
+					default:
+					}
+				}
+				lastEmittedCount = len(partialResults)
+			}
+		} else {
+			// Try JSON tool call parsing for streaming.
+			// Only emit NEW tool calls (same guard as XML parser above).
+			jsonResults, jsonErr := functions.ParseJSONIterative(cleanedResult, true)
+			if jsonErr == nil && len(jsonResults) > lastEmittedCount {
+				for i := lastEmittedCount; i < len(jsonResults); i++ {
+					jsonObj := jsonResults[i]
+					name, ok := jsonObj["name"].(string)
+					if !ok || name == "" {
+						continue
+					}
+					args := "{}"
+					if argsVal, ok := jsonObj["arguments"]; ok {
+						if argsStr, ok := argsVal.(string); ok {
+							args = argsStr
+						} else {
+							argsBytes, _ := json.Marshal(argsVal)
+							args = string(argsBytes)
+						}
+					}
+					initialMessage := schema.OpenAIResponse{
+						ID:      id,
+						Created: created,
+						Model:   req.Model,
+						Choices: []schema.Choice{{
+							Delta: &schema.Message{
+								Role: "assistant",
+								ToolCalls: []schema.ToolCall{
+									{
+										Index: i,
+										ID:    id,
+										Type:  "function",
+										FunctionCall: schema.FunctionCall{
+											Name:      name,
+											Arguments: args,
+										},
+									},
+								},
+							},
+							Index:        0,
+							FinishReason: nil,
+						}},
+						Object: "chat.completion.chunk",
+					}
+					responses <- initialMessage
+				}
+				lastEmittedCount = len(jsonResults)
+			}
+		}
+		return true
+	},
+		func(attempt int) bool {
+			// After streaming completes: check if we got actionable content
+			cleaned := extractor.CleanedContent()
+			// Check for tool calls from chat deltas (will be re-checked after ComputeChoices,
+			// but we need to know here whether to retry).
+			// Also check ChatDelta flags: when the C++ autoparser is active,
+			// tool calls and content are delivered via ChatDeltas while the
+			// raw message is cleared. Without this check, we'd retry
+			// unnecessarily, losing valid results and concatenating output.
+			hasToolCalls := lastEmittedCount > 0 || hasChatDeltaToolCalls
+			hasContent := cleaned != "" || hasChatDeltaContent
+			if !hasContent && !hasToolCalls {
+				xlog.Warn("Streaming: backend produced only reasoning, retrying",
+					"reasoning_len", len(extractor.Reasoning()), "attempt", attempt+1)
+				extractor.ResetAndSuppressReasoning()
+				result = ""
+				lastEmittedCount = 0
+				sentInitialRole = false
+				hasChatDeltaToolCalls = false
+				hasChatDeltaContent = false
+				return true
+			}
+			return false
+		},
+	)
+	if err != nil {
+		return finalUsage, err
+	}
+	// Try using pre-parsed tool calls from C++ autoparser (chat deltas)
+	var functionResults []functions.FuncCallResults
+	var reasoning string
+
+	if deltaToolCalls := functions.ToolCallsFromChatDeltas(chatDeltas); len(deltaToolCalls) > 0 {
+		xlog.Debug("[ChatDeltas] Using pre-parsed tool calls from C++ autoparser", "count", len(deltaToolCalls))
+		functionResults = deltaToolCalls
+		// Use content/reasoning from deltas too
+		*textContentToReturn = functions.ContentFromChatDeltas(chatDeltas)
+		reasoning = functions.ReasoningFromChatDeltas(chatDeltas)
+	} else {
+		// Fallback: parse tool calls from raw text (no chat deltas from backend)
+		xlog.Debug("[ChatDeltas] no pre-parsed tool calls, falling back to Go-side text parsing")
+		reasoning = extractor.Reasoning()
+		cleanedResult := extractor.CleanedContent()
+		*textContentToReturn = functions.ParseTextContent(cleanedResult, cfg.FunctionsConfig)
+		cleanedResult = functions.CleanupLLMResult(cleanedResult, cfg.FunctionsConfig)
+		functionResults = functions.ParseFunctionCall(cleanedResult, cfg.FunctionsConfig)
+	}
+	xlog.Debug("[ChatDeltas] final tool call decision", "tool_calls", len(functionResults), "text_content", *textContentToReturn)
+	// noAction is a sentinel "just answer" pseudo-function: not a real
+	// tool call. Scan the whole slice rather than only index 0 so we
+	// don't drop a real tool call that happens to follow a noAction
+	// entry, and so the default branch isn't entered with only noAction
+	// entries to emit as tool_calls.
+	noActionToRun := !hasRealCall(functionResults, noAction)
+
+	switch {
+	case noActionToRun:
+		// The final usage trailer (when the caller opted in with
+		// stream_options.include_usage) is built by the outer streaming
+		// loop from the TokenUsage this function returns, not from any
+		// chunk on the responses channel.
+		var result string
+		if !sentInitialRole {
+			var hqErr error
+			result, hqErr = handleQuestion(cfg, functionResults, extractor.CleanedContent(), prompt)
+			if hqErr != nil {
+				xlog.Error("error handling question", "error", hqErr)
+				return finalUsage, hqErr
+			}
+		}
+		for _, chunk := range buildNoActionFinalChunks(
+			id, req.Model, created,
+			sentInitialRole, sentReasoning,
+			result, reasoning,
+		) {
+			responses <- chunk
+		}
+
+	default:
+		for _, chunk := range buildDeferredToolCallChunks(
+			id, req.Model, created,
+			functionResults, lastEmittedCount,
+			sentInitialRole, *textContentToReturn,
+			sentReasoning, reasoning,
+		) {
+			responses <- chunk
+		}
+	}
+
+	close(responses)
+	return finalUsage, err
+}
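
The contract both workers share (push chunks onto a caller-owned channel, close it, and hand back the cumulative usage as an ordinary return value) can be sketched in isolation. This is an illustration only, not the LocalAI code: `usage`, `streamWorker`, and `runStream` are hypothetical names, and the "backend" is just a slice of tokens.

```go
package main

// usage is a hypothetical stand-in for backend.TokenUsage.
type usage struct {
	Prompt, Completion int
}

// streamWorker pushes every token onto out, closes out when finished, and
// returns the cumulative usage as a plain return value instead of hiding
// it inside a sentinel chunk.
func streamWorker(tokens []string, promptTokens int, out chan<- string) usage {
	defer close(out)
	u := usage{Prompt: promptTokens}
	for _, t := range tokens {
		out <- t // blocks until the consumer reads: out is unbuffered
		u.Completion++
	}
	return u
}

// runStream starts the consumer before calling the worker (otherwise the
// first send would block forever), then waits for the consumer to drain
// the channel before returning the collected chunks and the usage.
func runStream(tokens []string, promptTokens int) ([]string, usage) {
	out := make(chan string)
	done := make(chan struct{})
	var got []string
	go func() {
		defer close(done)
		for s := range out {
			got = append(got, s)
		}
	}()
	u := streamWorker(tokens, promptTokens, out)
	<-done
	return got, u
}
```

Because the usage travels on the return path, the caller can build the `include_usage` trailer without inspecting any chunk, which is the property the regression tests for issue #9927 assert.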
--- a/core/http/middleware/trace.go
+++ b/core/http/middleware/trace.go
@@ -17,16 +17,20 @@ import (
 )

 type APIExchangeRequest struct {
-	Method  string       `json:"method"`
-	Path    string       `json:"path"`
-	Headers *http.Header `json:"headers"`
-	Body    *[]byte      `json:"body"`
+	Method        string       `json:"method"`
+	Path          string       `json:"path"`
+	Headers       *http.Header `json:"headers"`
+	Body          *[]byte      `json:"body"`
+	BodyTruncated bool         `json:"body_truncated,omitempty"`
+	BodyBytes     int          `json:"body_bytes,omitempty"` // original size before truncation
 }

 type APIExchangeResponse struct {
-	Status  int          `json:"status"`
-	Headers *http.Header `json:"headers"`
-	Body    *[]byte      `json:"body"`
+	Status        int          `json:"status"`
+	Headers       *http.Header `json:"headers"`
+	Body          *[]byte      `json:"body"`
+	BodyTruncated bool         `json:"body_truncated,omitempty"`
+	BodyBytes     int          `json:"body_bytes,omitempty"` // original size before truncation
 }

 type APIExchange struct {
@@ -66,11 +70,29 @@ var doInitializeTracing = sync.OnceFunc(func() {

 type bodyWriter struct {
 	http.ResponseWriter
-	body *bytes.Buffer
+	body       *bytes.Buffer
+	maxBytes   int // 0 = unlimited capture
+	truncated  bool
+	totalBytes int // bytes the upstream handler wrote, even past the cap
 }

 func (w *bodyWriter) Write(b []byte) (int, error) {
-	w.body.Write(b)
+	// Capture into the trace buffer up to maxBytes, then drop the overflow
+	// so a chatty endpoint can't grow the buffer without bound. The full
+	// payload still flows through to the real client below.
+	w.totalBytes += len(b)
+	if w.maxBytes <= 0 {
+		w.body.Write(b)
+	} else if remain := w.maxBytes - w.body.Len(); remain > 0 {
+		if remain >= len(b) {
+			w.body.Write(b)
+		} else {
+			w.body.Write(b[:remain])
+			w.truncated = true
+		}
+	} else {
+		w.truncated = true
+	}
 	return w.ResponseWriter.Write(b)
 }

@@ -80,6 +102,20 @@ func (w *bodyWriter) Flush() {
 	}
 }

+// truncateForTrace returns a defensive copy of body capped at maxBytes,
+// and a flag indicating whether the cap forced truncation. maxBytes <= 0
+// disables the cap.
+func truncateForTrace(body []byte, maxBytes int) ([]byte, bool) {
+	if maxBytes <= 0 || len(body) <= maxBytes {
+		out := make([]byte, len(body))
+		copy(out, body)
+		return out, false
+	}
+	out := make([]byte, maxBytes)
+	copy(out, body[:maxBytes])
+	return out, true
+}
+
 func initializeTracing(maxItems int) {
 	tracingMaxItems = maxItems
 	doInitializeTracing()
@@ -134,11 +170,18 @@ func TraceMiddleware(app *application.Application) echo.MiddlewareFunc {

 			startTime := time.Now()

+			// Cap captured payload size. Without this, /embeddings and
+			// streaming /chat/completions grow the in-memory buffer into the
+			// tens of MB, which locks the admin Traces UI: fetching and parsing
+			// the JSON dump takes longer than the 5s auto-refresh interval.
+			maxBodyBytes := app.ApplicationConfig().TracingMaxBodyBytes
+
 			// Wrap response writer to capture body
 			resBody := new(bytes.Buffer)
 			mw := &bodyWriter{
 				ResponseWriter: c.Response().Writer,
 				body:           resBody,
+				maxBytes:       maxBodyBytes,
 			}
 			c.Response().Writer = mw

@@ -159,8 +202,7 @@ func TraceMiddleware(app *application.Application) echo.MiddlewareFunc {
 			// via any heap-dump-style introspection, and tokens shouldn't
 			// outlive the request that carried them.
 			requestHeaders := redactSensitiveHeaders(c.Request().Header)
-			requestBody := make([]byte, len(body))
-			copy(requestBody, body)
+			requestBody, requestTruncated := truncateForTrace(body, maxBodyBytes)
 			responseHeaders := redactSensitiveHeaders(c.Response().Header())
 			responseBody := make([]byte, resBody.Len())
 			copy(responseBody, resBody.Bytes())
@@ -168,15 +210,19 @@ func TraceMiddleware(app *application.Application) echo.MiddlewareFunc {
 				Timestamp: startTime,
 				Duration:  time.Since(startTime),
 				Request: APIExchangeRequest{
-					Method:  c.Request().Method,
-					Path:    c.Path(),
-					Headers: &requestHeaders,
-					Body:    &requestBody,
+					Method:        c.Request().Method,
+					Path:          c.Path(),
+					Headers:       &requestHeaders,
+					Body:          &requestBody,
+					BodyTruncated: requestTruncated,
+					BodyBytes:     len(body),
 				},
 				Response: APIExchangeResponse{
-					Status:  status,
-					Headers: &responseHeaders,
-					Body:    &responseBody,
+					Status:        status,
+					Headers:       &responseHeaders,
+					Body:          &responseBody,
+					BodyTruncated: mw.truncated,
+					BodyBytes:     mw.totalBytes,
 				},
 			}
 			if handlerErr != nil {
--- a/core/http/middleware/trace_body_cap_test.go
+++ b/core/http/middleware/trace_body_cap_test.go
@@ -0,0 +1,116 @@
+package middleware
+
+import (
+	"bytes"
+	"net/http/httptest"
+	"strings"
+
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+// The trace middleware copies request and response bodies into an in-memory
+// buffer that backs the admin /api/traces endpoint. With no upper bound a
+// chatty workload (embeddings, large completions) trivially produces a
+// multi-MB response that locks the Traces UI in a loading state — fetching
+// and parsing the payload outruns the 5-second auto-refresh. These specs
+// pin the capping contract so future refactors keep both the cap and the
+// passthrough to the real client intact.
+
+var _ = Describe("bodyWriter capping", func() {
+	It("captures the full body when maxBytes is 0 (unlimited)", func() {
+		downstream := httptest.NewRecorder()
+		buf := &bytes.Buffer{}
+		bw := &bodyWriter{ResponseWriter: downstream, body: buf, maxBytes: 0}
+
+		payload := []byte(strings.Repeat("x", 4096))
+		n, err := bw.Write(payload)
+
+		Expect(err).ToNot(HaveOccurred())
+		Expect(n).To(Equal(len(payload)))
+		Expect(buf.Len()).To(Equal(len(payload)))
+		Expect(downstream.Body.Len()).To(Equal(len(payload)))
+		Expect(bw.truncated).To(BeFalse())
+	})
+
+	It("stops appending to the trace buffer once maxBytes is reached but still forwards to the client", func() {
+		downstream := httptest.NewRecorder()
+		buf := &bytes.Buffer{}
+		bw := &bodyWriter{ResponseWriter: downstream, body: buf, maxBytes: 100}
+
+		payload := []byte(strings.Repeat("a", 250))
+		n, err := bw.Write(payload)
+
+		Expect(err).ToNot(HaveOccurred())
+		Expect(n).To(Equal(len(payload)), "Write must return the full byte count so callers see no short write")
+		Expect(buf.Len()).To(Equal(100), "trace buffer should hold exactly maxBytes")
+		Expect(downstream.Body.Len()).To(Equal(len(payload)), "client must still receive every byte")
+		Expect(bw.truncated).To(BeTrue())
+	})
+
+	It("handles a write that straddles the cap by keeping only the leading slice", func() {
+		downstream := httptest.NewRecorder()
+		buf := &bytes.Buffer{}
+		bw := &bodyWriter{ResponseWriter: downstream, body: buf, maxBytes: 10}
+
+		_, err := bw.Write([]byte("12345"))
+		Expect(err).ToNot(HaveOccurred())
+		Expect(bw.truncated).To(BeFalse())
+
+		_, err = bw.Write([]byte("67890ABCDE"))
+		Expect(err).ToNot(HaveOccurred())
+
+		Expect(buf.String()).To(Equal("1234567890"))
+		Expect(downstream.Body.String()).To(Equal("1234567890ABCDE"))
+		Expect(bw.truncated).To(BeTrue())
+	})
+
+	It("ignores further writes after the cap was already hit", func() {
+		downstream := httptest.NewRecorder()
+		buf := &bytes.Buffer{}
+		bw := &bodyWriter{ResponseWriter: downstream, body: buf, maxBytes: 4}
+
+		_, _ = bw.Write([]byte("AAAA"))
+		_, _ = bw.Write([]byte("BBBB"))
+		_, _ = bw.Write([]byte("CCCC"))
+
+		Expect(buf.String()).To(Equal("AAAA"))
+		Expect(downstream.Body.String()).To(Equal("AAAABBBBCCCC"))
+		Expect(bw.truncated).To(BeTrue())
+	})
+})
+
+var _ = Describe("truncateForTrace", func() {
+	It("returns the input unchanged when below the cap", func() {
+		in := []byte("hello")
+		out, truncated := truncateForTrace(in, 1024)
+		Expect(truncated).To(BeFalse())
+		Expect(out).To(Equal(in))
+	})
+
+	It("truncates when the input exceeds the cap and signals truncation", func() {
+		in := []byte(strings.Repeat("z", 200))
+		out, truncated := truncateForTrace(in, 64)
+		Expect(truncated).To(BeTrue())
+		Expect(out).To(HaveLen(64))
+		Expect(string(out)).To(Equal(strings.Repeat("z", 64)))
+	})
+
+	It("treats maxBytes <= 0 as unlimited (cap disabled)", func() {
+		in := []byte(strings.Repeat("q", 10_000))
+		out, truncated := truncateForTrace(in, 0)
+		Expect(truncated).To(BeFalse())
+		Expect(out).To(HaveLen(len(in)))
+	})
+
+	It("does not retain the caller's backing array (defensive copy)", func() {
+		in := []byte("abcdefghij")
+		out, truncated := truncateForTrace(in, 4)
+		Expect(truncated).To(BeTrue())
+		Expect(string(out)).To(Equal("abcd"))
+
+		// Mutating the source must not corrupt the trace copy.
+		in[0] = 'Z'
+		Expect(string(out)).To(Equal("abcd"))
+	})
+})
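
The capping behaviour these specs pin down can be reproduced outside the middleware with a plain `io.Writer`. The sketch below is a simplified analogue under stated assumptions: `cappedTee` and `captureAll` are hypothetical names, and a `bytes.Buffer` plays the role of the real client connection.

```go
package main

import (
	"bytes"
	"io"
)

// cappedTee is a hypothetical, simplified analogue of bodyWriter: it
// forwards every byte to dst but keeps at most limit bytes for the trace.
type cappedTee struct {
	dst       io.Writer
	captured  bytes.Buffer
	limit     int // <= 0 means unlimited capture
	total     int // bytes written by the caller, even past the cap
	truncated bool
}

func (w *cappedTee) Write(p []byte) (int, error) {
	w.total += len(p)
	remain := w.limit - w.captured.Len()
	switch {
	case w.limit <= 0 || len(p) <= remain:
		w.captured.Write(p) // the whole chunk still fits
	default:
		w.captured.Write(p[:remain]) // keep only the leading slice (may be empty)
		w.truncated = true
	}
	// The real destination always receives the full payload.
	return w.dst.Write(p)
}

// captureAll writes each chunk through a cappedTee and reports what was
// captured for the trace, what the client received, and the bookkeeping.
func captureAll(limit int, chunks ...string) (captured, forwarded string, truncated bool, total int) {
	var client bytes.Buffer
	w := &cappedTee{dst: &client, limit: limit}
	for _, c := range chunks {
		if _, err := w.Write([]byte(c)); err != nil {
			break
		}
	}
	return w.captured.String(), client.String(), w.truncated, w.total
}
```

Returning the destination's byte count rather than the captured count is the important design choice: callers never observe a short write just because the trace copy was cut.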
--- a/core/http/middleware/usage.go
+++ b/core/http/middleware/usage.go
@@ -4,6 +4,7 @@ import (
 	"bytes"
 	"encoding/json"
 	"sync"
+	"sync/atomic"
 	"time"

 	"github.com/labstack/echo/v4"
@@ -14,18 +15,37 @@ import (

 const (
 	usageFlushInterval = 5 * time.Second
-	usageMaxPending    = 5000
+	// usageMaxPending bounds the in-memory queue. Sized for bursty inference
+	// traffic on a self-hosted instance with a slow or unavailable DB.
+	usageMaxPending = 50000
 )

 // usageBatcher accumulates usage records and flushes them to the DB periodically.
 type usageBatcher struct {
-	mu      sync.Mutex
-	pending []*auth.UsageRecord
-	db      *gorm.DB
+	mu       sync.Mutex
+	pending  []*auth.UsageRecord
+	db       *gorm.DB
+	stop     chan struct{}
+	done     chan struct{}
+	stopOnce sync.Once
 }

+// droppedRecords counts records discarded because the in-memory queue was full.
+// Used to rate-limit the warn log so a sustained outage doesn't flood it.
+var droppedRecords atomic.Uint64
+
 func (b *usageBatcher) add(r *auth.UsageRecord) {
 	b.mu.Lock()
+	if len(b.pending) >= usageMaxPending {
+		b.mu.Unlock()
+		// Rate-limit: one warn per 1024 drops keeps the log readable.
+		n := droppedRecords.Add(1)
+		if n&1023 == 1 {
+			xlog.Warn("usage batcher full, dropping record",
+				"cap", usageMaxPending, "total_dropped", n)
+		}
+		return
+	}
 	b.pending = append(b.pending, r)
 	b.mu.Unlock()
 }
@@ -42,31 +62,102 @@ func (b *usageBatcher) flush() {

 	if err := b.db.Create(&batch).Error; err != nil {
 		xlog.Error("Failed to flush usage batch", "count", len(batch), "error", err)
-		// Re-queue failed records with a cap to avoid unbounded growth
+		// Cap-aware re-queue: prepend as much of the failed batch as fits
+		// alongside any records added concurrently with the failed write.
 		b.mu.Lock()
-		if len(b.pending) < usageMaxPending {
-			b.pending = append(batch, b.pending...)
+		room := usageMaxPending - len(b.pending)
+		if room > 0 {
+			if room > len(batch) {
+				room = len(batch)
+			}
+			b.pending = append(batch[:room], b.pending...)
 		}
 		b.mu.Unlock()
 	}
 }

-var batcher *usageBatcher
+func (b *usageBatcher) run() {
+	defer close(b.done)
+	ticker := time.NewTicker(usageFlushInterval)
+	defer ticker.Stop()
+	for {
+		select {
+		case <-ticker.C:
+			b.flush()
+		case <-b.stop:
+			b.flush() // final drain
+			return
+		}
+	}
+}
+
+func (b *usageBatcher) shutdown() {
+	b.stopOnce.Do(func() {
+		close(b.stop)
+		<-b.done
+	})
+}
+
+// The package-level batcher is guarded by batcherMu so Init / Shutdown cycles
+// (the test pattern) don't race against UsageMiddleware reads.
+var (
+	batcherMu sync.RWMutex
+	batcher   *usageBatcher
+)
+
+func currentBatcher() *usageBatcher {
+	batcherMu.RLock()
+	defer batcherMu.RUnlock()
+	return batcher
+}

 // InitUsageRecorder starts a background goroutine that periodically flushes
-// accumulated usage records to the database.
+// accumulated usage records to the database. Calling it more than once
+// shuts down the previous batcher first so its goroutine doesn't leak.
 func InitUsageRecorder(db *gorm.DB) {
 	if db == nil {
 		return
 	}
-	batcher = &usageBatcher{db: db}
-	go func() {
-		ticker := time.NewTicker(usageFlushInterval)
-		defer ticker.Stop()
-		for range ticker.C {
-			batcher.flush()
-		}
-	}()
+
+	batcherMu.Lock()
+	old := batcher
+	batcher = nil
+	batcherMu.Unlock()
+	if old != nil {
+		old.shutdown()
+	}
+
+	b := &usageBatcher{
+		db:   db,
+		stop: make(chan struct{}),
+		done: make(chan struct{}),
+	}
+	batcherMu.Lock()
+	batcher = b
+	batcherMu.Unlock()
+
+	go b.run()
+}
+
+// ShutdownUsageRecorder stops the background flusher and synchronously drains
+// pending records once. Safe to call multiple times. Not yet wired into the
+// application lifecycle; intended for graceful process exit and tests.
+func ShutdownUsageRecorder() {
+	batcherMu.Lock()
+	b := batcher
+	batcher = nil
+	batcherMu.Unlock()
+	if b != nil {
+		b.shutdown()
+	}
+}
+
+// FlushNow synchronously flushes any pending usage records. Intended for tests
+// that need deterministic behaviour without waiting for the ticker.
+func FlushNow() {
+	if b := currentBatcher(); b != nil {
+		b.flush()
+	}
 }

 // usageResponseBody is the minimal structure we need from the response JSON.
@@ -84,7 +175,8 @@ type usageResponseBody struct {
 func UsageMiddleware(db *gorm.DB) echo.MiddlewareFunc {
 	return func(next echo.HandlerFunc) echo.HandlerFunc {
 		return func(c echo.Context) error {
-			if db == nil || batcher == nil {
+			b := currentBatcher()
+			if db == nil || b == nil {
 				return next(c)
 			}

@@ -149,9 +241,17 @@ func UsageMiddleware(db *gorm.DB) echo.MiddlewareFunc {
 				return handlerErr
 			}

+			source := auth.GetSource(c)
+			if source == "" {
+				// Auth disabled or unrecognised path: classify as web so the row is still
+				// bucketable rather than silently dropped from per-source aggregates.
+				source = auth.UsageSourceWeb
+			}
+
 			record := &auth.UsageRecord{
 				UserID:           user.ID,
 				UserName:         user.Name,
+				Source:           source,
 				Model:            resp.Model,
 				Endpoint:         c.Request().URL.Path,
 				PromptTokens:     resp.Usage.PromptTokens,
@@ -161,7 +261,13 @@ func UsageMiddleware(db *gorm.DB) echo.MiddlewareFunc {
 				CreatedAt:        startTime,
 			}

-			batcher.add(record)
+			if key := auth.GetAPIKey(c); key != nil {
+				id := key.ID
+				record.APIKeyID = &id
+				record.APIKeyName = key.Name
+			}
+
+			b.add(record)

 			return handlerErr
 		}
--- a/core/http/middleware/usage_test.go
+++ b/core/http/middleware/usage_test.go
@@ -0,0 +1,140 @@
+//go:build auth
+
+package middleware_test
+
+import (
+	"bytes"
+	"encoding/json"
+	"net/http"
+	"net/http/httptest"
+
+	"github.com/labstack/echo/v4"
+	"github.com/mudler/LocalAI/core/http/auth"
+	"github.com/mudler/LocalAI/core/http/middleware"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+	"gorm.io/gorm"
+)
+
+// testAuthDB returns a fresh in-memory SQLite auth DB.
+func testAuthDB() *gorm.DB {
+	db, err := auth.InitDB(":memory:")
+	if err != nil {
+		panic(err)
+	}
+	return db
+}
+
+var _ = Describe("UsageMiddleware", func() {
+	var (
+		e  *echo.Echo
+		db *gorm.DB
+	)
+
+	BeforeEach(func() {
+		db = testAuthDB()
+		e = echo.New()
+		middleware.InitUsageRecorder(db)
+	})
+
+	AfterEach(func() {
+		middleware.ShutdownUsageRecorder()
+	})
+
+	okHandler := func(c echo.Context) error {
+		body, _ := json.Marshal(map[string]any{
+			"model": "gpt-4",
+			"usage": map[string]int{
+				"prompt_tokens": 10, "completion_tokens": 5, "total_tokens": 15,
+			},
+		})
+		c.Response().Header().Set("Content-Type", "application/json")
+		c.Response().WriteHeader(http.StatusOK)
+		_, _ = c.Response().Write(body)
+		return nil
+	}
+
+	// FlushNow drains pending records synchronously, replacing the 6s sleep
+	// that was previously needed to wait for the batcher's ticker.
+	flush := middleware.FlushNow
+
+	It("records source=web when auth_source is web", func() {
+		e.POST("/v1/chat/completions", okHandler, func(next echo.HandlerFunc) echo.HandlerFunc {
+			return func(c echo.Context) error {
+				c.Set("auth_user", &auth.User{ID: "alice", Name: "Alice"})
+				c.Set("auth_source", auth.UsageSourceWeb)
+				return next(c)
+			}
+		}, middleware.UsageMiddleware(db))
+
+		req := httptest.NewRequest("POST", "/v1/chat/completions", bytes.NewReader([]byte(`{}`)))
+		e.ServeHTTP(httptest.NewRecorder(), req)
+		flush()
+
+		var rec auth.UsageRecord
+		Expect(db.Where("user_id = ?", "alice").First(&rec).Error).To(Succeed())
+		Expect(rec.Source).To(Equal(auth.UsageSourceWeb))
+		Expect(rec.APIKeyID).To(BeNil())
+		Expect(rec.APIKeyName).To(BeEmpty())
+	})
+
+	It("records source=apikey with snapshotted name when auth_apikey is set", func() {
+		e.POST("/v1/chat/completions", okHandler, func(next echo.HandlerFunc) echo.HandlerFunc {
+			return func(c echo.Context) error {
+				c.Set("auth_user", &auth.User{ID: "alice", Name: "Alice"})
+				c.Set("auth_source", auth.UsageSourceAPIKey)
+				c.Set("auth_apikey", &auth.UserAPIKey{ID: "key-1", Name: "ci-runner"})
+				return next(c)
+			}
+		}, middleware.UsageMiddleware(db))
+
+		req := httptest.NewRequest("POST", "/v1/chat/completions", bytes.NewReader([]byte(`{}`)))
+		e.ServeHTTP(httptest.NewRecorder(), req)
+		flush()
+
+		var rec auth.UsageRecord
+		Expect(db.Where("user_id = ?", "alice").First(&rec).Error).To(Succeed())
+		Expect(rec.Source).To(Equal(auth.UsageSourceAPIKey))
+		Expect(rec.APIKeyID).ToNot(BeNil())
+		Expect(*rec.APIKeyID).To(Equal("key-1"))
+		Expect(rec.APIKeyName).To(Equal("ci-runner"))
+	})
+
+	It("FlushNow drains pending records synchronously", func() {
+		e.POST("/v1/chat/completions", okHandler, func(next echo.HandlerFunc) echo.HandlerFunc {
+			return func(c echo.Context) error {
+				c.Set("auth_user", &auth.User{ID: "carol", Name: "Carol"})
+				c.Set("auth_source", auth.UsageSourceWeb)
+				return next(c)
+			}
+		}, middleware.UsageMiddleware(db))
+
+		req := httptest.NewRequest("POST", "/v1/chat/completions", bytes.NewReader([]byte(`{}`)))
+		e.ServeHTTP(httptest.NewRecorder(), req)
+
+		// No sleep: FlushNow should drain immediately.
+		middleware.FlushNow()
+
+		var rec auth.UsageRecord
+		Expect(db.Where("user_id = ?", "carol").First(&rec).Error).To(Succeed())
+		Expect(rec.Source).To(Equal(auth.UsageSourceWeb))
+	})
+
+	It("falls back to source=web when auth_source is empty", func() {
+		e.POST("/v1/chat/completions", okHandler, func(next echo.HandlerFunc) echo.HandlerFunc {
+			return func(c echo.Context) error {
+				c.Set("auth_user", &auth.User{ID: "alice", Name: "Alice"})
+				// no auth_source set
+				return next(c)
+			}
+		}, middleware.UsageMiddleware(db))
+
+		req := httptest.NewRequest("POST", "/v1/chat/completions", bytes.NewReader([]byte(`{}`)))
+		e.ServeHTTP(httptest.NewRecorder(), req)
+		flush()
+
+		var rec auth.UsageRecord
+		Expect(db.Where("user_id = ?", "alice").First(&rec).Error).To(Succeed())
+		Expect(rec.Source).To(Equal(auth.UsageSourceWeb))
+	})
+})
--- a/core/http/react-ui/e2e/chat-polling-selection.spec.js
+++ b/core/http/react-ui/e2e/chat-polling-selection.spec.js
@@ -0,0 +1,143 @@
+import { test, expect } from '@playwright/test'
+
+// Regression coverage for issue #9904:
+// - /api/operations was polled every 1s and *always* re-rendered the Chat
+//   page, even when the response was unchanged. The reconciliation would
+//   collapse any text selection inside an assistant message.
+// - The copy button next to each assistant message used navigator.clipboard
+//   without any fallback, which is undefined when the page is served over
+//   plain http (non-secure context) from a remote host.
+
+async function setupChatPage(page) {
+  await page.route('**/api/models/capabilities', (route) => {
+    route.fulfill({
+      contentType: 'application/json',
+      body: JSON.stringify({
+        data: [{ id: 'test-model', capabilities: ['FLAG_CHAT'] }],
+      }),
+    })
+  })
+
+  // Poll-tracking mock: count hits so the test can assert the hook really is
+  // polling /api/operations every ~1s, and always return an empty list so
+  // the response never changes.
+  let operationsHits = 0
+  await page.route('**/api/operations', (route) => {
+    operationsHits++
+    route.fulfill({
+      contentType: 'application/json',
+      body: JSON.stringify({ operations: [] }),
+    })
+  })
+
+  await page.route('**/v1/chat/completions', (route) => {
+    // One short SSE stream so the chat finishes streaming quickly and we
+    // can interact with a stable assistant message.
+    const body = [
+      'data: {"choices":[{"delta":{"content":"Hello world this is a long assistant reply that we can try to select."},"index":0}]}\n\n',
+      'data: {"choices":[{"delta":{},"index":0,"finish_reason":"stop"}],"usage":{"prompt_tokens":1,"completion_tokens":1,"total_tokens":2}}\n\n',
+      'data: [DONE]\n\n',
+    ].join('')
+    route.fulfill({
+      status: 200,
+      headers: { 'Content-Type': 'text/event-stream' },
+      body,
+    })
+  })
+
+  return { getOperationsHits: () => operationsHits }
+}
+
+test.describe('Chat - /api/operations polling (#9904)', () => {
+  test('text selection inside an assistant message survives polling', async ({ page }) => {
+    const { getOperationsHits } = await setupChatPage(page)
+
+    await page.goto('/app/chat')
+    await expect(page.getByRole('button', { name: 'test-model' })).toBeVisible({ timeout: 10_000 })
+
+    await page.locator('.chat-input').fill('Hi')
+    await page.locator('.chat-send-btn').click()
+
+    const assistantContent = page.locator('.chat-message-assistant .chat-message-content').first()
+    await expect(assistantContent).toContainText('Hello world', { timeout: 10_000 })
+
+    // Sanity check: the polling we're regressing against is actually firing.
+    await page.waitForTimeout(2_500)
+    expect(getOperationsHits()).toBeGreaterThan(1)
+
+    // Core regression check: count how many times the assistant content
+    // node gets *touched* by React (childList / characterData mutations)
+    // over a 3-second window. Before the fix, every poll re-rendered Chat
+    // and re-set dangerouslySetInnerHTML, triggering a mutation cascade
+    // that collapsed the user's text selection. After the fix, polling
+    // with identical contents must not mutate the DOM at all.
+    const mutationCount = await assistantContent.evaluate((el) => new Promise((resolve) => {
+      let count = 0
+      const obs = new MutationObserver((records) => { count += records.length })
+      obs.observe(el, { childList: true, subtree: true, characterData: true })
+      setTimeout(() => { obs.disconnect(); resolve(count) }, 3_000)
+    }))
+    expect(mutationCount).toBe(0)
+
+    // The same check expressed as a user-observable property: a
+    // programmatically created selection survives the polling window.
+    await assistantContent.evaluate((el) => {
+      const range = document.createRange()
+      range.selectNodeContents(el)
+      const sel = window.getSelection()
+      sel.removeAllRanges()
+      sel.addRange(range)
+    })
+
+    const initialSelection = await page.evaluate(() => window.getSelection().toString())
+    expect(initialSelection).toContain('Hello world')
+
+    await page.waitForTimeout(2_500)
+
+    const selectionAfterPolling = await page.evaluate(() => window.getSelection().toString())
+    expect(selectionAfterPolling).toBe(initialSelection)
+  })
+})
+
+test.describe('Chat - copy button (#9904)', () => {
+  test('copy button works when navigator.clipboard is unavailable (plain http)', async ({ page }) => {
+    await setupChatPage(page)
+
+    // Simulate a non-secure context: hide navigator.clipboard before any of
+    // our app code touches it. This mirrors what browsers do over plain
+    // http from a remote host.
+    await page.addInitScript(() => {
+      Object.defineProperty(window, 'isSecureContext', { value: false, configurable: true })
+      try {
+        Object.defineProperty(navigator, 'clipboard', { value: undefined, configurable: true })
+      } catch { /* some browsers refuse — the secure-context flag is enough */ }
+    })
+
+    await page.goto('/app/chat')
+    await expect(page.getByRole('button', { name: 'test-model' })).toBeVisible({ timeout: 10_000 })
+
+    await page.locator('.chat-input').fill('Hi')
+    await page.locator('.chat-send-btn').click()
+
+    const assistantBubble = page.locator('.chat-message-assistant .chat-message-bubble').first()
+    await expect(assistantBubble).toContainText('Hello world', { timeout: 10_000 })
+
+    // Spy on document.execCommand so we can confirm the fallback path ran.
+    await page.evaluate(() => {
+      window.__execCommandCalls = []
+      const original = document.execCommand?.bind(document)
+      document.execCommand = (cmd, ...rest) => {
+        window.__execCommandCalls.push(cmd)
+        // execCommand('copy') in a headless browser may return false because
+        // there is no real clipboard, but the fact that we tried is what we
+        // care about for this regression.
+        return original ? original(cmd, ...rest) : false
+      }
+    })
+
+    await assistantBubble.locator('.chat-message-actions button').first().click()
+
+    const execCommandCalls = await page.evaluate(() => window.__execCommandCalls)
+    expect(execCommandCalls).toContain('copy')
+  })
+})
--- a/core/http/react-ui/public/locales/de/chat.json
+++ b/core/http/react-ui/public/locales/de/chat.json
@@ -97,7 +97,8 @@
  },
  "toasts": {
    "selectModel": "Bitte wählen Sie ein Modell",
-    "copied": "In die Zwischenablage kopiert"
+    "copied": "In die Zwischenablage kopiert",
+    "copyFailed": "Kopieren in die Zwischenablage fehlgeschlagen"
  },
  "menu": {
    "trigger": "Chats",
--- a/core/http/react-ui/public/locales/en/admin.json
+++ b/core/http/react-ui/public/locales/en/admin.json
@@ -53,7 +53,30 @@
  },
  "usage": {
    "title": "Usage",
-    "subtitle": "API token usage statistics"
+    "subtitle": "API token usage statistics",
+    "sources": {
+      "tab": "Sources",
+      "mixTitle": "Source mix",
+      "ribbonAria": "{{apikey}}% API keys, {{web}}% Web UI, {{legacy}}% Legacy",
+      "topSources": "Top sources over time",
+      "searchPlaceholder": "Search by name or prefix",
+      "sortBy": "Sort",
+      "sortTokens": "Tokens",
+      "sortRequests": "Requests",
+      "sortLastUsed": "Last used",
+      "sortName": "Name",
+      "sortUser": "User",
+      "webUI": "Web UI",
+      "legacy": "Legacy",
+      "revoked": "revoked",
+      "filteredTo": "Filtered to: {{name}}",
+      "clearFilter": "Clear filter",
+      "other": "Other ({{count}})",
+      "noTrafficShort": "No requests in this period.",
+      "noKeysYet": "Once requests come in, you'll see them broken down here.",
+      "createKey": "Create your first API key",
+      "truncatedWarning": "Showing top 200 keys. Apply a filter to narrow further."
+    }
  },
  "explorer": {
    "title": "Explorer",
--- a/core/http/react-ui/public/locales/en/chat.json
+++ b/core/http/react-ui/public/locales/en/chat.json
@@ -97,7 +97,8 @@
  },
  "toasts": {
    "selectModel": "Please select a model",
-    "copied": "Copied to clipboard"
+    "copied": "Copied to clipboard",
+    "copyFailed": "Could not copy to clipboard"
  },
  "menu": {
    "trigger": "Chats",
--- a/core/http/react-ui/public/locales/es/chat.json
+++ b/core/http/react-ui/public/locales/es/chat.json
@@ -97,7 +97,8 @@
  },
  "toasts": {
    "selectModel": "Por favor selecciona un modelo",
-    "copied": "Copiado al portapapeles"
+    "copied": "Copiado al portapapeles",
+    "copyFailed": "No se pudo copiar al portapapeles"
  },
  "menu": {
    "trigger": "Chats",
--- a/core/http/react-ui/public/locales/it/chat.json
+++ b/core/http/react-ui/public/locales/it/chat.json
@@ -97,7 +97,8 @@
  },
  "toasts": {
    "selectModel": "Seleziona un modello",
-    "copied": "Copiato negli appunti"
+    "copied": "Copiato negli appunti",
+    "copyFailed": "Impossibile copiare negli appunti"
  },
  "menu": {
    "trigger": "Chat",
--- a/core/http/react-ui/public/locales/zh-CN/chat.json
+++ b/core/http/react-ui/public/locales/zh-CN/chat.json
@@ -97,7 +97,8 @@
  },
  "toasts": {
    "selectModel": "请选择一个模型",
-    "copied": "已复制到剪贴板"
+    "copied": "已复制到剪贴板",
+    "copyFailed": "无法复制到剪贴板"
  },
  "menu": {
    "trigger": "聊天",
--- a/core/http/react-ui/src/components/CanvasPanel.jsx
+++ b/core/http/react-ui/src/components/CanvasPanel.jsx
@@ -2,6 +2,7 @@ import { useState, useEffect, useRef } from 'react'
 import { renderMarkdown } from '../utils/markdown'
 import { getArtifactIcon } from '../utils/artifacts'
 import { safeHref } from '../utils/url'
+import { copyToClipboard } from '../utils/clipboard'
 import DOMPurify from 'dompurify'
 import hljs from 'highlight.js'

@@ -23,11 +24,13 @@ export default function CanvasPanel({ artifacts, selectedId, onSelect, onClose }
    }
  }, [current, showPreview])

-  const handleCopy = () => {
+  const handleCopy = async () => {
    const text = current.code || current.url || ''
-    navigator.clipboard.writeText(text)
-    setCopySuccess(true)
-    setTimeout(() => setCopySuccess(false), 2000)
+    const ok = await copyToClipboard(text)
+    if (ok) {
+      setCopySuccess(true)
+      setTimeout(() => setCopySuccess(false), 2000)
+    }
  }

  const handleDownload = () => {
--- a/core/http/react-ui/src/components/NodeInstallPicker.jsx
+++ b/core/http/react-ui/src/components/NodeInstallPicker.jsx
@@ -1,7 +1,7 @@
 import { useState, useMemo, useEffect, useRef } from 'react'
 import Modal from './Modal'
 import SearchableSelect from './SearchableSelect'
-import { nodesApi } from '../utils/api'
+import { nodesApi, backendsApi } from '../utils/api'

 // NodeInstallPicker is the single multi-node install surface used both from
 // the Backends gallery split-button and from the "Install on more nodes" `+`
@@ -240,6 +240,37 @@ export default function NodeInstallPicker({
  }
  const clearSelection = () => setSelected(new Set())

+  // pollJob resolves with { done: true, error?: string } once a single job
+  // completes, fails, or is cancelled. Bounded by a hard wall-clock cap so a
+  // stuck worker eventually surfaces in the UI as "Failed" instead of
+  // spinning forever.
+  const pollJob = (jobID) => new Promise((resolve) => {
+    const POLL_INTERVAL_MS = 1500
+    const HARD_CAP_MS = 6 * 60 * 1000 // 6 min - generous for a fresh worker download
+    const startedAt = Date.now()
+
+    const tick = async () => {
+      try {
+        const status = await backendsApi.getJob(jobID)
+        if (status?.completed) { resolve({ done: true }); return }
+        if (status?.error) { resolve({ done: true, error: status.error }); return }
+        if (status?.processed && !status?.completed) {
+          resolve({ done: true, error: status.error || 'install did not complete' })
+          return
+        }
+      } catch (err) {
+        resolve({ done: true, error: err?.message || 'polling failed' })
+        return
+      }
+      if (Date.now() - startedAt > HARD_CAP_MS) {
+        resolve({ done: true, error: 'timed out waiting for install to finish' })
+        return
+      }
+      setTimeout(tick, POLL_INTERVAL_MS)
+    }
+    tick()
+  })
+
  const submit = async () => {
    if (selected.size === 0 || submitting) return
    if (counts.overrides > 0 && !showMismatchConfirm) {
@@ -255,38 +286,68 @@ export default function NodeInstallPicker({
      return next
    })

-    const results = await Promise.allSettled(ids.map(id =>
+    // Phase 1: dispatch all installs in parallel. Each POST returns immediately
+    // with { jobID } now that the handler is async.
+    const dispatchResults = await Promise.allSettled(ids.map(id =>
      nodesApi.installBackend(id, effectiveBackendName)
-        .then(r => ({ id, ok: true, message: r?.message }))
-        .catch(err => ({ id, ok: false, error: err?.message || 'install failed' }))
+        .then(r => ({ id, ok: true, jobID: r?.jobID }))
+        .catch(err => ({ id, ok: false, error: err?.message || 'dispatch failed' }))
    ))

-    let successCount = 0, failCount = 0
-    setPerNode(prev => {
-      const next = { ...prev }
-      for (const r of results) {
-        if (r.status !== 'fulfilled') continue
-        const v = r.value
-        if (v.ok) {
-          next[v.id] = { status: 'done' }
-          successCount++
-        } else {
-          next[v.id] = { status: 'error', error: v.error }
-          failCount++
-        }
+    // Classify dispatch results synchronously OUTSIDE the setter. React may
+    // invoke a functional state updater more than once (StrictMode dev double
+    // invoke, concurrent rendering replay): building the jobs array inside
+    // the closure would duplicate entries and re-poll the same job.
+    const jobs = []
+    const dispatchPatch = {}
+    for (const r of dispatchResults) {
+      if (r.status !== 'fulfilled') continue
+      const v = r.value
+      if (v.ok && v.jobID) {
+        dispatchPatch[v.id] = { status: 'installing', jobID: v.jobID }
+        jobs.push({ nodeID: v.id, jobID: v.jobID })
+      } else {
+        dispatchPatch[v.id] = { status: 'error', error: v.error || 'dispatch failed' }
      }
-      return next
+    }
+    setPerNode(prev => ({ ...prev, ...dispatchPatch }))
+
+    // Phase 2: poll each job. Promise.all resolves when the last job settles;
+    // each row flips to done/error as soon as its own pollJob resolves, via
+    // the setPerNode call below.
+    await Promise.all(jobs.map(async ({ nodeID, jobID }) => {
+      const result = await pollJob(jobID)
+      setPerNode(prev => {
+        const next = { ...prev }
+        if (result.error) {
+          next[nodeID] = { status: 'error', error: result.error, jobID }
+        } else {
+          next[nodeID] = { status: 'done', jobID }
+        }
+        return next
+      })
+    }))
+
+    // Phase 3: summary toast + onComplete. Read latest state via functional setter.
+    let successCount = 0
+    let failCount = 0
+    setPerNode(prev => {
+      for (const v of Object.values(prev)) {
+        if (v.status === 'done') successCount++
+        else if (v.status === 'error') failCount++
+      }
+      return prev
    })
+
    setSubmitting(false)

    if (successCount > 0 && onComplete) onComplete()

-    if (failCount === 0) {
+    if (failCount === 0 && successCount > 0) {
      addToast?.(`Installed on ${successCount} node${successCount === 1 ? '' : 's'}`, 'success')
      setTimeout(() => onClose?.(), 800)
-    } else if (successCount === 0) {
+    } else if (successCount === 0 && failCount > 0) {
      addToast?.(`Install failed on all ${failCount} node${failCount === 1 ? '' : 's'}`, 'error')
-    } else {
+    } else if (successCount > 0 && failCount > 0) {
      addToast?.(`Installed on ${successCount}, failed on ${failCount}`, 'warning')
    }
  }
@@ -297,32 +358,58 @@ export default function NodeInstallPicker({
      .map(([id]) => id)
    if (failedIds.length === 0) return
    setSelected(new Set(failedIds))
-    // Replace state for failed rows so they show "installing" again, not stale errors.
    setPerNode(prev => {
      const next = { ...prev }
      failedIds.forEach(id => { next[id] = { status: 'installing' } })
      return next
    })
    setSubmitting(true)
-    const results = await Promise.allSettled(failedIds.map(id =>
+
+    const dispatchResults = await Promise.allSettled(failedIds.map(id =>
      nodesApi.installBackend(id, effectiveBackendName)
-        .then(r => ({ id, ok: true, message: r?.message }))
-        .catch(err => ({ id, ok: false, error: err?.message || 'install failed' }))
+        .then(r => ({ id, ok: true, jobID: r?.jobID }))
+        .catch(err => ({ id, ok: false, error: err?.message || 'dispatch failed' }))
    ))
+
+    // Same precaution as in submit(): classify outside the functional setter
+    // so a replayed updater can't push duplicate jobs into the polling list.
+    const jobs = []
+    const dispatchPatch = {}
+    for (const r of dispatchResults) {
+      if (r.status !== 'fulfilled') continue
+      const v = r.value
+      if (v.ok && v.jobID) {
+        dispatchPatch[v.id] = { status: 'installing', jobID: v.jobID }
+        jobs.push({ nodeID: v.id, jobID: v.jobID })
+      } else {
+        dispatchPatch[v.id] = { status: 'error', error: v.error || 'dispatch failed' }
+      }
+    }
+    setPerNode(prev => ({ ...prev, ...dispatchPatch }))
+
+    await Promise.all(jobs.map(async ({ nodeID, jobID }) => {
+      const result = await pollJob(jobID)
+      setPerNode(prev => {
+        const next = { ...prev }
+        if (result.error) next[nodeID] = { status: 'error', error: result.error, jobID }
+        else next[nodeID] = { status: 'done', jobID }
+        return next
+      })
+    }))
+
+    setSubmitting(false)
+
    let successCount = 0, failCount = 0
    setPerNode(prev => {
-      const next = { ...prev }
-      for (const r of results) {
-        if (r.status !== 'fulfilled') continue
-        const v = r.value
-        if (v.ok) { next[v.id] = { status: 'done' }; successCount++ }
-        else { next[v.id] = { status: 'error', error: v.error }; failCount++ }
+      for (const id of failedIds) {
+        const v = prev[id]
+        if (v?.status === 'done') successCount++
+        else if (v?.status === 'error') failCount++
      }
-      return next
+      return prev
    })
-    setSubmitting(false)
    if (successCount > 0 && onComplete) onComplete()
-    if (failCount === 0) {
+    if (failCount === 0 && successCount > 0) {
      addToast?.(`Installed on ${successCount} node${successCount === 1 ? '' : 's'}`, 'success')
      setTimeout(() => onClose?.(), 800)
    }
--- a/core/http/react-ui/src/hooks/useChat.js
+++ b/core/http/react-ui/src/hooks/useChat.js
@@ -218,9 +218,15 @@ export function useChat(initialModel = '') {
          })
          userFiles.push({ name: file.name, type: 'audio' })
        } else {
-          // Text/PDF files - append to content
-          userFiles.push({ name: file.name, type: 'file', content: file.textContent || '' })
-        }
+          // Text/PDF files - append to content
+          if (file.textContent) {
+            messageContent.push({
+              type: 'text',
+              text: `\n\n--- File: ${file.name} ---\n${file.textContent}\n--- End of ${file.name} ---`,
+            })
+          }
+          userFiles.push({ name: file.name, type: 'file', content: file.textContent || '' })
+        }
      }
    } else {
      messageContent = content
--- a/core/http/react-ui/src/hooks/useOperations.js
+++ b/core/http/react-ui/src/hooks/useOperations.js
@@ -2,6 +2,14 @@ import { useState, useEffect, useCallback, useRef } from 'react'
 import { operationsApi } from '../utils/api'
 import { useAuth } from '../context/AuthContext'

+// Serialize ops into a stable comparison key. Each op is a flat map of
+// primitives, so JSON.stringify is good enough and stable as long as the
+// server emits keys in the same order (it does: Go's encoding/json sorts
+// map keys when marshalling, and we build an explicit map[string]any).
+function serializeOps(ops) {
+  return JSON.stringify(ops)
+}
+
 export function useOperations(pollInterval = 1000) {
  const [operations, setOperations] = useState([])
  const [loading, setLoading] = useState(true)
@@ -11,16 +19,26 @@ export function useOperations(pollInterval = 1000) {

  const previousCountRef = useRef(0)
  const onAllCompleteRef = useRef(null)
+  // Track the last payload we wrote into state. Each poll otherwise produces
+  // a fresh array reference even when nothing changed, and that re-render
+  // ripples into the Chat page — wiping the user's text selection mid-read
+  // (#9904).
+  const lastSerializedRef = useRef('[]')

  const fetchOperations = useCallback(async () => {
    if (!isAdmin) {
-      setLoading(false)
+      setLoading((prev) => (prev ? false : prev))
      return
    }
    try {
      const data = await operationsApi.list()
      const ops = data?.operations || (Array.isArray(data) ? data : [])
-      setOperations(ops)
+
+      const serialized = serializeOps(ops)
+      if (serialized !== lastSerializedRef.current) {
+        lastSerializedRef.current = serialized
+        setOperations(ops)
+      }

      // Separate active (non-failed) operations from failed ones
      const activeOps = ops.filter(op => !op.error)
@@ -32,11 +50,11 @@ export function useOperations(pollInterval = 1000) {
      }
      previousCountRef.current = activeOps.length

-      setError(null)
+      setError((prev) => (prev === null ? prev : null))
    } catch (err) {
-      setError(err.message)
+      setError((prev) => (prev === err.message ? prev : err.message))
    } finally {
-      setLoading(false)
+      setLoading((prev) => (prev ? false : prev))
    }
  }, [isAdmin])

--- a/core/http/react-ui/src/pages/AgentChat.jsx
+++ b/core/http/react-ui/src/pages/AgentChat.jsx
@@ -9,6 +9,7 @@ import ResourceCards from '../components/ResourceCards'
 import ConfirmDialog from '../components/ConfirmDialog'
 import { useAgentChat } from '../hooks/useAgentChat'
 import { relativeTime } from '../utils/format'
+import { copyToClipboard } from '../utils/clipboard'

 function getLastMessagePreview(conv) {
  if (!conv.messages || conv.messages.length === 0) return ''
@@ -390,9 +391,13 @@ export default function AgentChat() {
    }
  }

-  const copyMessage = (content) => {
-    navigator.clipboard.writeText(content)
-    addToast('Copied to clipboard', 'success', 2000)
+  const copyMessage = async (content) => {
+    const ok = await copyToClipboard(content)
+    addToast(
+      ok ? 'Copied to clipboard' : 'Could not copy to clipboard',
+      ok ? 'success' : 'error',
+      ok ? 2000 : 3000,
+    )
  }

  const senderToRole = (sender) => {
--- a/core/http/react-ui/src/pages/Backends.jsx
+++ b/core/http/react-ui/src/pages/Backends.jsx
@@ -179,16 +179,19 @@ export default function Backends() {

  // Install a single gallery backend on a specific node, used in target-node
  // mode (the URL has ?target=<node-id> set from the Nodes page entry point).
+  // The handler is async - we dispatch and let the global Operations panel
+  // surface progress; no need to await completion here.
  const handleInstallOnTarget = async (id) => {
    if (!targetNode) return
    try {
      await nodesApi.installBackend(targetNode.id, id)
-      addToast(`Installing ${id} on ${targetNode.name}…`, 'info')
-      // Per-node install is request-reply, not part of the global jobs feed —
-      // refetch to reflect the new Nodes column state.
-      setTimeout(() => { fetchBackends(); refetchNodes() }, 600)
+      addToast(`Installing ${id} on ${targetNode.name}...`, 'info')
+      // The install runs async via the gallery job queue. Refetch shortly so
+      // the Nodes column reflects "installing" state; the Operations panel
+      // tracks the actual progress until completion.
+      setTimeout(() => { fetchBackends(); refetchNodes() }, 1200)
    } catch (err) {
-      addToast(`Install failed on ${targetNode.name}: ${err.message}`, 'error')
+      addToast(`Install dispatch failed on ${targetNode.name}: ${err.message}`, 'error')
    }
  }

--- a/core/http/react-ui/src/pages/Chat.jsx
+++ b/core/http/react-ui/src/pages/Chat.jsx
@@ -17,6 +17,7 @@ import ChatsMenu from '../components/ChatsMenu'
 import { useAuth } from '../context/AuthContext'
 import { useOperations } from '../hooks/useOperations'
 import { relativeTime } from '../utils/format'
+import { copyToClipboard } from '../utils/clipboard'

 function getLastMessagePreview(chat) {
  if (!chat.history || chat.history.length === 0) return ''
@@ -798,10 +799,14 @@ export default function Chat() {
    }
  }

-  const copyMessage = (content) => {
+  const copyMessage = async (content) => {
    const text = typeof content === 'string' ? content : content?.[0]?.text || ''
-    navigator.clipboard.writeText(text)
-    addToast(t('toasts.copied'), 'success', 2000)
+    const ok = await copyToClipboard(text)
+    if (ok) {
+      addToast(t('toasts.copied'), 'success', 2000)
+    } else {
+      addToast(t('toasts.copyFailed'), 'error', 3000)
+    }
  }

  const contextPercent = getContextUsagePercent()
--- a/core/http/react-ui/src/pages/Home.jsx
+++ b/core/http/react-ui/src/pages/Home.jsx
@@ -161,7 +161,11 @@ export default function Home() {
    const newFiles = []
    for (const file of fileList) {
      const base64 = await fileToBase64(file)
-      newFiles.push({ name: file.name, type: file.type, base64 })
+      const entry = { name: file.name, type: file.type, base64 }
+      if (!file.type.startsWith('image/') && !file.type.startsWith('audio/')) {
+        entry.textContent = await file.text().catch(() => '')
+      }
+      newFiles.push(entry)
    }
    setter(prev => [...prev, ...newFiles])
  }, [])
--- a/core/http/react-ui/src/pages/Traces.jsx
+++ b/core/http/react-ui/src/pages/Traces.jsx
@@ -406,7 +406,15 @@ export default function Traces() {
        <button className="btn btn-secondary btn-sm" onClick={fetchTraces}><i className="fas fa-rotate" /> Refresh</button>
        <button className="btn btn-secondary btn-sm" onClick={handleExport} disabled={traces.length === 0}><i className="fas fa-download" /> Export</button>
        <div style={{ flex: 1 }} />
-        <button className="btn btn-danger btn-sm" onClick={handleClear} disabled={traces.length === 0}><i className="fas fa-trash" /> Clear</button>
+        <button
+          className="btn btn-danger btn-sm"
+          onClick={handleClear}
+          /* Stay enabled while loading: a massive in-memory trace buffer is
+             precisely the case where the user can't see the table yet and
+             needs Clear to recover. Clearing an already-empty server-side
+             buffer is a harmless no-op. */
+          disabled={!loading && traces.length === 0}
+        ><i className="fas fa-trash" /> Clear</button>
      </div>

      {settings && (() => {
--- a/core/http/react-ui/src/pages/Usage.jsx
+++ b/core/http/react-ui/src/pages/Usage.jsx
@@ -4,6 +4,7 @@ import { useTranslation } from 'react-i18next'
 import { useAuth } from '../context/AuthContext'
 import { apiUrl } from '../utils/basePath'
 import LoadingSpinner from '../components/LoadingSpinner'
+import SourcesTab from './Usage/SourcesTab'

 const PERIODS = [
  { key: 'day', label: 'Day' },
@@ -724,23 +725,27 @@ export default function Usage() {
            {p.label}
          </button>
        ))}
+        <div style={{ width: 1, height: 20, background: 'var(--color-border-subtle)', margin: '0 var(--spacing-xs)' }} />
+        <button
+          className={`btn btn-sm ${activeTab === 'models' ? 'btn-primary' : 'btn-secondary'}`}
+          onClick={() => setActiveTab('models')}
+        >
+          <i className="fas fa-cube" style={{ fontSize: '0.7rem' }} /> Models
+        </button>
        {isAdmin && (
-          <>
-            <div style={{ width: 1, height: 20, background: 'var(--color-border-subtle)', margin: '0 var(--spacing-xs)' }} />
-            <button
-              className={`btn btn-sm ${activeTab === 'models' ? 'btn-primary' : 'btn-secondary'}`}
-              onClick={() => setActiveTab('models')}
-            >
-              <i className="fas fa-cube" style={{ fontSize: '0.7rem' }} /> Models
-            </button>
-            <button
-              className={`btn btn-sm ${activeTab === 'users' ? 'btn-primary' : 'btn-secondary'}`}
-              onClick={() => setActiveTab('users')}
-            >
-              <i className="fas fa-users" style={{ fontSize: '0.7rem' }} /> Users
-            </button>
-          </>
+          <button
+            className={`btn btn-sm ${activeTab === 'users' ? 'btn-primary' : 'btn-secondary'}`}
+            onClick={() => setActiveTab('users')}
+          >
+            <i className="fas fa-users" style={{ fontSize: '0.7rem' }} /> Users
+          </button>
        )}
+        <button
+          className={`btn btn-sm ${activeTab === 'sources' ? 'btn-primary' : 'btn-secondary'}`}
+          onClick={() => setActiveTab('sources')}
+        >
+          <i className="fas fa-key" style={{ fontSize: '0.7rem' }} /> {t('usage.sources.tab')}
+        </button>
        <div style={{ flex: 1 }} />
        <button className="btn btn-secondary btn-sm" onClick={fetchUsage} disabled={loading} style={{ gap: 4 }}>
          <i className={`fas fa-rotate${loading ? ' fa-spin' : ''}`} /> Refresh
@@ -884,6 +889,10 @@ export default function Usage() {
              </div>
            )
          )}
+
+          {activeTab === 'sources' && (
+            <SourcesTab period={period} adminUserId={selectedUserId} />
+          )}
        </>
      )}
    </div>
--- a/core/http/react-ui/src/pages/Usage/SourceMixRibbon.jsx
+++ b/core/http/react-ui/src/pages/Usage/SourceMixRibbon.jsx
@@ -0,0 +1,83 @@
+import { useTranslation } from 'react-i18next'
+
+const SEGMENT_COLORS = {
+  apikey: 'var(--color-primary)',
+  web: 'var(--color-info, #3b82f6)',
+  legacy: 'var(--color-warning, #f59e0b)',
+}
+
+// SourceMixRibbon renders one segmented horizontal bar showing the share of
+// tokens by source class (apikey / web / legacy). Clicking a segment invokes
+// onSelectSourceClass with the segment key so the parent can filter the view.
+//
+// Props:
+//   bySource: { apikey?: {tokens, requests}, web?: {...}, legacy?: {...} }
+//   keyCount: number of distinct API keys in the dataset (for the legend)
+//   onSelectSourceClass: (cls: 'apikey'|'web'|'legacy') => void (optional)
+export default function SourceMixRibbon({ bySource = {}, keyCount = 0, onSelectSourceClass }) {
+  const { t } = useTranslation('admin')
+
+  const apikey = (bySource.apikey?.tokens) || 0
+  const web = (bySource.web?.tokens) || 0
+  const legacy = (bySource.legacy?.tokens) || 0
+  const total = apikey + web + legacy || 1
+
+  const pct = (n) => Math.round((n / total) * 100)
+  const apiPct = pct(apikey)
+  const webPct = pct(web)
+  const legacyPct = pct(legacy)
+
+  const segments = [
+    { key: 'apikey', label: `${apiPct}% API keys (${keyCount})`, pct: apiPct, color: SEGMENT_COLORS.apikey },
+    { key: 'web', label: `${webPct}% ${t('usage.sources.webUI')}`, pct: webPct, color: SEGMENT_COLORS.web },
+    { key: 'legacy', label: `${legacyPct}% ${t('usage.sources.legacy')}`, pct: legacyPct, color: SEGMENT_COLORS.legacy },
+  ].filter((s) => s.pct > 0)
+
+  return (
+    <div
+      role="group"
+      aria-label={t('usage.sources.ribbonAria', { apikey: apiPct, web: webPct, legacy: legacyPct })}
+      style={{ display: 'flex', flexDirection: 'column', gap: 'var(--spacing-xs)' }}
+    >
+      <div style={{ fontSize: '0.875rem', fontWeight: 600, color: 'var(--color-text-primary)' }}>
+        {t('usage.sources.mixTitle')}
+      </div>
+      <div
+        style={{
+          display: 'flex',
+          height: 12,
+          borderRadius: 'var(--radius-sm)',
+          overflow: 'hidden',
+          border: '1px solid var(--color-border-subtle)',
+        }}
+      >
+        {segments.map((s) => (
+          <button
+            key={s.key}
+            type="button"
+            onClick={() => onSelectSourceClass?.(s.key)}
+            aria-label={s.label}
+            style={{
+              width: `${s.pct}%`,
+              background: s.color,
+              border: 'none',
+              padding: 0,
+              cursor: onSelectSourceClass ? 'pointer' : 'default',
+            }}
+          />
+        ))}
+      </div>
+      <div style={{ display: 'flex', flexWrap: 'wrap', gap: 'var(--spacing-sm)', fontSize: '0.75rem' }}>
+        {segments.map((s) => (
+          <span key={s.key} style={{ display: 'inline-flex', alignItems: 'center', gap: 6 }}>
+            <span
+              style={{ width: 10, height: 10, borderRadius: 2, background: s.color, display: 'inline-block' }}
+              aria-hidden
+            />
+            {s.label}
+          </span>
+        ))}
+      </div>
+    </div>
+  )
+}
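
Review note: the share computation in SourceMixRibbon can be read in isolation as a pure function. The sketch below is illustrative only; `sourceMixPercentages` is a hypothetical name that does not appear in the diff, and it assumes the same `bySource` shape described in the component's props comment.

```javascript
// Hypothetical helper mirroring the ribbon's arithmetic; not part of the diff.
function sourceMixPercentages(bySource = {}) {
  const apikey = bySource.apikey?.tokens || 0
  const web = bySource.web?.tokens || 0
  const legacy = bySource.legacy?.tokens || 0
  // Fall back to 1 so an empty dataset yields 0% everywhere instead of NaN.
  const total = apikey + web + legacy || 1
  const pct = (n) => Math.round((n / total) * 100)
  return { apikey: pct(apikey), web: pct(web), legacy: pct(legacy) }
}
```

Because each share is rounded independently, the three values are not guaranteed to sum to exactly 100 (three equal shares give 33 each), so the rendered segments may leave a sliver of the bar unfilled.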
--- a/core/http/react-ui/src/pages/Usage/SourceTimeChart.jsx
+++ b/core/http/react-ui/src/pages/Usage/SourceTimeChart.jsx
@@ -0,0 +1,147 @@
+import { useMemo } from 'react'
+import { useTranslation } from 'react-i18next'
+
+const TOP_N = 7
+// Distinct, accessible-ish series colors that read on both light and dark themes.
+const SERIES_COLORS = [
+  'var(--color-primary)',
+  'var(--color-success, #10b981)',
+  'var(--color-warning, #f59e0b)',
+  'var(--color-info, #3b82f6)',
+  'var(--color-danger, #ef4444)',
+  '#a855f7',
+  '#ec4899',
+]
+const OTHER_COLOR = 'var(--color-text-muted, #94a3b8)'
+
+function identityFor(bucket) {
+  return bucket.api_key_id || bucket.source || 'unknown'
+}
+
+// buckets: UsageBucket[] from /api/auth/usage/sources (server-sorted ASC by bucket)
+// selectedKey: 'web' | 'legacy' | api_key_id | null
+// totals: SourceTotals (for the "Other (count)" legend label)
+export default function SourceTimeChart({ buckets = [], selectedKey, totals }) {
+  const { t } = useTranslation('admin')
+
+  // Find the top-N identities by total tokens across the period.
+  const topIds = useMemo(() => {
+    const sums = new Map()
+    for (const b of buckets) {
+      const id = identityFor(b)
+      sums.set(id, (sums.get(id) || 0) + (b.total_tokens || 0))
+    }
+    return [...sums.entries()]
+      .sort((a, b) => b[1] - a[1])
+      .slice(0, TOP_N)
+      .map(([id]) => id)
+  }, [buckets])
+
+  const topSet = useMemo(() => new Set(topIds), [topIds])
+
+  // Resolve a display label for an identity (api_key_id -> snapshotted name, or source name).
+  const labelByIdentity = useMemo(() => {
+    const m = new Map()
+    for (const b of buckets) {
+      const id = identityFor(b)
+      if (m.has(id)) continue
+      if (b.source === 'web')    { m.set(id, t('usage.sources.webUI')); continue }
+      if (b.source === 'legacy') { m.set(id, t('usage.sources.legacy')); continue }
+      m.set(id, b.api_key_name || b.api_key_id || id)
+    }
+    return m
+  }, [buckets, t])
+
+  // Build a dense per-bucket row, splitting top-N vs Other.
+  const series = useMemo(() => {
+    const byBucket = new Map()
+    for (const b of buckets) {
+      const id = identityFor(b)
+      const seriesId = topSet.has(id) ? id : '__other__'
+      const row = byBucket.get(b.bucket) || { bucket: b.bucket, total: 0 }
+      row[seriesId] = (row[seriesId] || 0) + (b.total_tokens || 0)
+      row.total += b.total_tokens || 0
+      byBucket.set(b.bucket, row)
+    }
+    return [...byBucket.values()]
+  }, [buckets, topSet])
+
+  const max = useMemo(
+    () => series.reduce((m, r) => Math.max(m, r.total), 0) || 1,
+    [series]
+  )
+
+  const seriesIds = [...topIds, '__other__']
+  const colorOf = (id) =>
+    id === '__other__'
+      ? OTHER_COLOR
+      : SERIES_COLORS[topIds.indexOf(id) % SERIES_COLORS.length]
+
+  const labelOfId = (id) => {
+    if (id === '__other__') return null // computed inline (need count)
+    return labelByIdentity.get(id) || id
+  }
+
+  const otherCount = Math.max(0, (totals?.by_key?.length || 0) - TOP_N)
+
+  // SVG geometry: each bar occupies a 24px slot (20px bar + 4px gap, 2px per side), 100px tall; the viewBox widens with the bar count.
+  const barWidth = 20
+  const barGap = 4
+  const slotWidth = barWidth + barGap
+  const height = 100
+  const width = Math.max(series.length * slotWidth, 200)
+
+  return (
+    <div style={{ display: 'flex', flexDirection: 'column', gap: 'var(--spacing-xs)' }}>
+      <div style={{ fontSize: '0.875rem', fontWeight: 600, color: 'var(--color-text-primary)' }}>
+        {t('usage.sources.topSources')}
+      </div>
+
+      <svg
+        viewBox={`0 0 ${width} ${height}`}
+        preserveAspectRatio="none"
+        style={{ width: '100%', height: 160, display: 'block' }}
+        aria-hidden
+      >
+        {series.map((row, i) => {
+          let y = height
+          return (
+            <g key={row.bucket} transform={`translate(${i * slotWidth}, 0)`}>
+              {seriesIds.map(id => {
+                const v = row[id] || 0
+                if (!v) return null
+                const h = (v / max) * height
+                y -= h
+                const dim = selectedKey && selectedKey !== id ? 0.25 : 1
+                const title = id === '__other__'
+                  ? t('usage.sources.other', { count: otherCount })
+                  : labelOfId(id)
+                return (
+                  <rect
+                    key={id}
+                    x={barGap / 2} y={y}
+                    width={barWidth} height={h}
+                    fill={colorOf(id)} opacity={dim}
+                  >
+                    <title>{`${row.bucket} - ${title}: ${v.toLocaleString()}`}</title>
+                  </rect>
+                )
+              })}
+            </g>
+          )
+        })}
+      </svg>
+
+      <div style={{ display: 'flex', flexWrap: 'wrap', gap: 'var(--spacing-sm)', fontSize: '0.75rem' }}>
+        {seriesIds.map(id => (
+          <span key={id} style={{ display: 'inline-flex', alignItems: 'center', gap: 6 }}>
+            <span style={{ width: 10, height: 10, borderRadius: 2, background: colorOf(id), display: 'inline-block' }} aria-hidden />
+            {id === '__other__'
+              ? t('usage.sources.other', { count: otherCount })
+              : labelOfId(id)}
+          </span>
+        ))}
+      </div>
+    </div>
+  )
+}
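
Review note: the top-N / Other split in SourceTimeChart is spread across three `useMemo` hooks. A minimal sketch of the same data flow as one pure function follows; `stackBuckets` and the `OTHER` constant are hypothetical names, and the bucket fields (`bucket`, `api_key_id`, `source`, `total_tokens`) are assumed to match the ones the component reads.

```javascript
// Hypothetical, framework-free restatement of the chart's bucketing logic.
const OTHER = '__other__'

function identityFor(bucket) {
  return bucket.api_key_id || bucket.source || 'unknown'
}

function stackBuckets(buckets, topN) {
  // Pass 1: total tokens per identity across the whole period.
  const sums = new Map()
  for (const b of buckets) {
    const id = identityFor(b)
    sums.set(id, (sums.get(id) || 0) + (b.total_tokens || 0))
  }
  const topIds = [...sums.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN)
    .map(([id]) => id)
  const topSet = new Set(topIds)

  // Pass 2: one row per time bucket; identities outside the top N fold into OTHER.
  const byBucket = new Map()
  for (const b of buckets) {
    const id = identityFor(b)
    const seriesId = topSet.has(id) ? id : OTHER
    const row = byBucket.get(b.bucket) || { bucket: b.bucket, total: 0 }
    row[seriesId] = (row[seriesId] || 0) + (b.total_tokens || 0)
    row.total += b.total_tokens || 0
    byBucket.set(b.bucket, row)
  }
  return { topIds, rows: [...byBucket.values()] }
}
```

Rows come out in first-seen bucket order, so the result is chronological only if the input is already sorted by bucket, which the component's comment says the server guarantees.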
--- a/core/http/react-ui/src/pages/Usage/SourcesTab.jsx
+++ b/core/http/react-ui/src/pages/Usage/SourcesTab.jsx
@@ -0,0 +1,176 @@
+import { useEffect, useState } from 'react'
+import { useTranslation } from 'react-i18next'
+import { usageApi, apiKeysApi } from '../../utils/api'
+import { useAuth } from '../../context/AuthContext'
+import LoadingSpinner from '../../components/LoadingSpinner'
+import SourceMixRibbon from './SourceMixRibbon'
+import SourcesTable from './SourcesTable'
+import SourceTimeChart from './SourceTimeChart'
+
+const EMPTY_DATA = {
+  buckets: [],
+  totals: { by_source: {}, by_key: [], grand_total: { tokens: 0, requests: 0 } },
+  truncated: false,
+}
+
+// Resolve a human label for the currently selected key (web/legacy class or api_key_id).
+function labelForSelected(totals, selectedKey, t) {
+  if (!selectedKey) return ''
+  if (selectedKey === 'web')    return t('usage.sources.webUI')
+  if (selectedKey === 'legacy') return t('usage.sources.legacy')
+  const row = (totals?.by_key || []).find(k => k.api_key_id === selectedKey)
+  return row ? (row.api_key_name || selectedKey) : selectedKey
+}
+
+// SourcesTab fetches and renders the per-source / per-API-key usage breakdown:
+// the SourceMixRibbon, a filter chip for the current selection, the
+// SourceTimeChart, and the SourcesTable.
+export default function SourcesTab({ period, adminUserId }) {
+  const { t } = useTranslation('admin')
+  const { isAdmin } = useAuth()
+
+  const [data, setData] = useState(EMPTY_DATA)
+  const [loading, setLoading] = useState(false)
+  const [error, setError] = useState(null)
+
+  const [selectedKey, setSelectedKey] = useState(null)
+  const [search, setSearch] = useState('')
+  const [sortKey, setSortKey] = useState('tokens')
+
+  // Pull the current set of API key ids so the table can mark unknown keys as
+  // revoked. null = "don't know yet" so the table won't dim live keys during
+  // the fetch or after a failure.
+  const [existingKeyIds, setExistingKeyIds] = useState(null)
+  useEffect(() => {
+    apiKeysApi
+      .list()
+      .then((resp) => {
+        const list = Array.isArray(resp) ? resp : (resp?.keys || [])
+        setExistingKeyIds(new Set(list.map((k) => k.id)))
+      })
+      .catch(() => { /* leave existingKeyIds null so revoked detection is skipped */ })
+  }, [])
+
+  useEffect(() => {
+    let cancelled = false
+    setLoading(true)
+    setError(null)
+    const p = isAdmin
+      ? usageApi.getAdminSources(period, adminUserId)
+      : usageApi.getMySources(period)
+    p
+      .then((d) => { if (!cancelled) setData(d || EMPTY_DATA) })
+      .catch((e) => { if (!cancelled) setError(e) })
+      .finally(() => { if (!cancelled) setLoading(false) })
+    return () => { cancelled = true }
+  }, [isAdmin, period, adminUserId])
+
+  const totals = data.totals || EMPTY_DATA.totals
+  const buckets = data.buckets || EMPTY_DATA.buckets
+  const grandT = totals.grand_total || { tokens: 0, requests: 0 }
+  const truncated = data.truncated || false
+
+  const isEmpty = !loading && (grandT.tokens || 0) === 0 && (grandT.requests || 0) === 0
+
+  if (loading) {
+    return (
+      <div style={{ display: 'flex', justifyContent: 'center', padding: 'var(--spacing-xl)' }}>
+        <LoadingSpinner size="lg" />
+      </div>
+    )
+  }
+
+  if (error) {
+    return (
+      <div className="empty-state">
+        <div className="empty-state-icon"><i className="fas fa-triangle-exclamation" /></div>
+        <h2 className="empty-state-title">Failed to load</h2>
+        <p className="empty-state-text">{String(error.message || error)}</p>
+      </div>
+    )
+  }
+
+  if (isEmpty) {
+    return (
+      <div className="empty-state">
+        <div className="empty-state-icon"><i className="fas fa-key" /></div>
+        <h2 className="empty-state-title">{t('usage.sources.noTrafficShort')}</h2>
+        <p className="empty-state-text">{t('usage.sources.noKeysYet')}</p>
+      </div>
+    )
+  }
+
+  return (
+    <div style={{ display: 'flex', flexDirection: 'column', gap: 'var(--spacing-md)' }}>
+      <div className="card" style={{ padding: 'var(--spacing-md)' }}>
+        <SourceMixRibbon
+          bySource={totals.by_source}
+          keyCount={(totals.by_key || []).length}
+          onSelectSourceClass={(cls) => setSelectedKey(cls)}
+        />
+      </div>
+
+      {selectedKey && (
+        <div style={{ display: 'flex', alignItems: 'center', gap: 'var(--spacing-xs)' }}>
+          <span
+            style={{
+              display: 'inline-flex',
+              alignItems: 'center',
+              gap: 'var(--spacing-xs)',
+              padding: 'calc(var(--spacing-xs) / 2) var(--spacing-sm)',
+              background: 'var(--color-bg-secondary)',
+              color: 'var(--color-text-primary)',
+              fontSize: '0.75rem',
+              borderRadius: 'var(--radius-sm)',
+              border: '1px solid var(--color-border-subtle)',
+            }}
+          >
+            <i className="fas fa-filter" style={{ fontSize: '0.6875rem', color: 'var(--color-text-muted)' }} aria-hidden />
+            {t('usage.sources.filteredTo', { name: labelForSelected(totals, selectedKey, t) })}
+            <button
+              type="button"
+              onClick={() => setSelectedKey(null)}
+              aria-label={t('usage.sources.clearFilter')}
+              style={{
+                appearance: 'none',
+                background: 'transparent',
+                border: 'none',
+                color: 'var(--color-text-muted)',
+                cursor: 'pointer',
+                padding: 0,
+                fontSize: '0.875rem',
+                lineHeight: 1,
+              }}
+            >
+              <i className="fas fa-xmark" />
+            </button>
+          </span>
+        </div>
+      )}
+
+      <div className="card" style={{ padding: 'var(--spacing-md)' }}>
+        <SourceTimeChart buckets={buckets} selectedKey={selectedKey} totals={totals} />
+      </div>
+
+      <div className="card" style={{ padding: 'var(--spacing-md)' }}>
+        <SourcesTable
+          totals={totals}
+          selectedKey={selectedKey}
+          onSelectKey={setSelectedKey}
+          search={search}
+          setSearch={setSearch}
+          sortKey={sortKey}
+          setSortKey={setSortKey}
+          existingKeyIds={existingKeyIds}
+          showUserColumn={isAdmin}
+        />
+      </div>
+
+      {truncated && (
+        <div style={{ fontSize: '0.75rem', color: 'var(--color-warning)' }}>
+          {t('usage.sources.truncatedWarning')}
+        </div>
+      )}
+    </div>
+  )
+}
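
Review note: SourcesTab defends against a missing or partial response before deciding whether to show the empty state. The sketch below restates that defaulting as a single function; `normalizeSources` is a hypothetical name, and the response shape (`buckets`, `totals.grand_total`, `truncated`) is assumed from the component and the route handlers in this diff.

```javascript
// Hypothetical helper; mirrors the defaults SourcesTab applies inline.
const EMPTY_DATA = {
  buckets: [],
  totals: { by_source: {}, by_key: [], grand_total: { tokens: 0, requests: 0 } },
  truncated: false,
}

function normalizeSources(d) {
  const data = d || EMPTY_DATA
  const totals = data.totals || EMPTY_DATA.totals
  const grand = totals.grand_total || { tokens: 0, requests: 0 }
  return {
    buckets: data.buckets || EMPTY_DATA.buckets,
    totals,
    truncated: Boolean(data.truncated),
    // Empty means no tokens and no requests were recorded in the period.
    isEmpty: (grand.tokens || 0) === 0 && (grand.requests || 0) === 0,
  }
}
```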
--- a/core/http/react-ui/src/pages/Usage/SourcesTable.jsx
+++ b/core/http/react-ui/src/pages/Usage/SourcesTable.jsx
@@ -0,0 +1,245 @@
+import { useMemo } from 'react'
+import { useTranslation } from 'react-i18next'
+
+const SORT_FNS = {
+  tokens: (a, b) => (b.tokens || 0) - (a.tokens || 0),
+  requests: (a, b) => (b.requests || 0) - (a.requests || 0),
+  last_used: (a, b) => new Date(b.last_used || 0).getTime() - new Date(a.last_used || 0).getTime(),
+  name: (a, b) => (a.name || '').localeCompare(b.name || ''),
+  user: (a, b) => (a.userName || '').localeCompare(b.userName || ''),
+}
+
+function formatTokens(n) {
+  if (!n) return '0'
+  if (n >= 1_000_000) return (n / 1_000_000).toFixed(1) + 'M'
+  if (n >= 1_000) return (n / 1_000).toFixed(1) + 'k'
+  return String(n)
+}
+
+function formatRelative(iso) {
+  if (!iso) return '-'
+  const t = new Date(iso).getTime()
+  if (Number.isNaN(t) || t <= 0) return '-'
+  const diff = Date.now() - t
+  if (diff < 60_000) return 'just now'
+  if (diff < 3_600_000) return Math.round(diff / 60_000) + 'm ago'
+  if (diff < 86_400_000) return Math.round(diff / 3_600_000) + 'h ago'
+  return Math.round(diff / 86_400_000) + 'd ago'
+}
+
+// SourcesTable is the searchable, sortable list of key totals plus pseudo-rows
+// for the web UI and legacy (unkeyed) source classes. Clicking a row selects
+// it (clicking the selected row again clears it); the parent decides what to
+// do with the selection.
+//
+// Props:
+//   totals: SourceTotals payload (from /api/auth/usage/sources)
+//   selectedKey: currently-selected row id (api_key_id | 'web' | 'legacy' | null)
+//   onSelectKey: (id|null) => void
+//   search / setSearch: free-text filter state lifted to the parent
+//   sortKey / setSortKey: sort column state lifted to the parent
+//   existingKeyIds: Set<string> of current (non-revoked) api key ids, or null
+//     when the parent hasn't yet learned which keys exist. Null suppresses the
+//     revoked badge entirely so live keys aren't dimmed during the fetch or
+//     after a failure.
+//   showUserColumn: render the User column. Admin views set this true so the
+//     reader can attribute each key (and each Web UI row) to its owner.
+export default function SourcesTable({
+  totals,
+  selectedKey,
+  onSelectKey,
+  search,
+  setSearch,
+  sortKey,
+  setSortKey,
+  existingKeyIds = null,
+  showUserColumn = false,
+}) {
+  const { t } = useTranslation('admin')
+
+  const rows = useMemo(() => {
+    const named = (totals?.by_key || []).map((k) => ({
+      kind: 'apikey',
+      id: k.api_key_id,
+      name: k.api_key_name || k.api_key_id,
+      userID: k.user_id || '',
+      userName: k.user_name || '',
+      prefix: '',
+      tokens: k.tokens,
+      requests: k.requests,
+      last_used: k.last_used,
+      revoked: existingKeyIds != null && !existingKeyIds.has(k.api_key_id),
+    }))
+
+    // Pseudo-rows for sources that don't have a named key identity.
+    // In admin view (showUserColumn=true), prefer the per-user breakdown
+    // from totals.by_user_source so each user's Web UI / legacy traffic
+    // gets its own row. Otherwise fall back to the global by_source aggregate.
+    let unkeyed = []
+    if (showUserColumn && Array.isArray(totals?.by_user_source) && totals.by_user_source.length > 0) {
+      unkeyed = totals.by_user_source.map((r) => ({
+        kind: r.source,
+        id: r.source + ':' + (r.user_id || ''),
+        name: r.source === 'legacy' ? t('usage.sources.legacy') : t('usage.sources.webUI'),
+        userID: r.user_id || '',
+        userName: r.user_name || '',
+        prefix: '-',
+        tokens: r.tokens,
+        requests: r.requests,
+      }))
+    } else {
+      if (totals?.by_source?.web) {
+        unkeyed.push({
+          kind: 'web',
+          id: 'web',
+          name: t('usage.sources.webUI'),
+          userID: '',
+          userName: '',
+          prefix: '-',
+          tokens: totals.by_source.web.tokens,
+          requests: totals.by_source.web.requests,
+        })
+      }
+      if (totals?.by_source?.legacy) {
+        unkeyed.push({
+          kind: 'legacy',
+          id: 'legacy',
+          name: t('usage.sources.legacy'),
+          userID: '',
+          userName: '',
+          prefix: '-',
+          tokens: totals.by_source.legacy.tokens,
+          requests: totals.by_source.legacy.requests,
+        })
+      }
+    }
+
+    return [...named, ...unkeyed]
+  }, [totals, existingKeyIds, showUserColumn, t])
+
+  const filtered = useMemo(() => {
+    const q = (search || '').trim().toLowerCase()
+    const list = q
+      ? rows.filter((r) =>
+          (r.name || '').toLowerCase().includes(q) ||
+          (r.prefix || '').toLowerCase().includes(q) ||
+          (r.userName || '').toLowerCase().includes(q) ||
+          (r.userID || '').toLowerCase().includes(q)
+        )
+      : rows
+    return [...list].sort(SORT_FNS[sortKey] || SORT_FNS.tokens)
+  }, [rows, search, sortKey])
+
+  const iconFor = (kind) =>
+    kind === 'apikey' ? 'fas fa-key' : kind === 'web' ? 'fas fa-globe' : 'fas fa-gear'
+
+  return (
+    <div style={{ display: 'flex', flexDirection: 'column', gap: 'var(--spacing-sm)' }}>
+      <div style={{ display: 'flex', alignItems: 'center', gap: 'var(--spacing-sm)', flexWrap: 'wrap' }}>
+        <input
+          type="search"
+          value={search}
+          onChange={(e) => setSearch(e.target.value)}
+          placeholder={t('usage.sources.searchPlaceholder')}
+          aria-label={t('usage.sources.searchPlaceholder')}
+          style={{
+            flex: '1 1 12rem',
+            minWidth: 160,
+            padding: 'var(--spacing-xs) var(--spacing-sm)',
+            border: '1px solid var(--color-border-subtle)',
+            borderRadius: 'var(--radius-sm)',
+            background: 'var(--color-bg-primary)',
+            color: 'var(--color-text-primary)',
+          }}
+        />
+        <label style={{ display: 'inline-flex', alignItems: 'center', gap: 6, fontSize: '0.75rem' }}>
+          {t('usage.sources.sortBy')}:
+          <select
+            value={sortKey}
+            onChange={(e) => setSortKey(e.target.value)}
+            style={{
+              padding: 'calc(var(--spacing-xs) / 2) var(--spacing-xs)',
+              border: '1px solid var(--color-border-subtle)',
+              borderRadius: 'var(--radius-sm)',
+              background: 'var(--color-bg-primary)',
+              color: 'var(--color-text-primary)',
+            }}
+          >
+            <option value="tokens">{t('usage.sources.sortTokens')}</option>
+            <option value="requests">{t('usage.sources.sortRequests')}</option>
+            <option value="last_used">{t('usage.sources.sortLastUsed')}</option>
+            <option value="name">{t('usage.sources.sortName')}</option>
+            {showUserColumn && <option value="user">{t('usage.sources.sortUser')}</option>}
+          </select>
+        </label>
+      </div>
+
+      <div className="table-container">
+        <table className="table">
+          <thead>
+            <tr>
+              <th>{t('usage.sources.sortName')}</th>
+              {showUserColumn && <th style={{ width: 180 }}>{t('usage.sources.sortUser')}</th>}
+              <th style={{ width: 110 }}>Prefix</th>
+              <th style={{ width: 100, textAlign: 'right' }}>{t('usage.sources.sortRequests')}</th>
+              <th style={{ width: 100, textAlign: 'right' }}>{t('usage.sources.sortTokens')}</th>
+              <th style={{ width: 120, textAlign: 'right' }}>{t('usage.sources.sortLastUsed')}</th>
+            </tr>
+          </thead>
+          <tbody>
+            {filtered.map((r) => {
+              const isSel = selectedKey === r.id
+              return (
+                <tr
+                  key={r.id}
+                  onClick={() => onSelectKey?.(isSel ? null : r.id)}
+                  style={{
+                    cursor: 'pointer',
+                    background: isSel ? 'var(--color-bg-secondary)' : undefined,
+                    opacity: r.revoked ? 0.5 : 1,
+                  }}
+                >
+                  <td>
+                    <span style={{ display: 'inline-flex', alignItems: 'center', gap: 8 }}>
+                      <i
+                        className={iconFor(r.kind)}
+                        style={{ color: 'var(--color-text-muted)', fontSize: '0.8125rem' }}
+                      />
+                      <span>{r.name}</span>
+                      {r.revoked && (
+                        <span
+                          style={{
+                            fontSize: '0.6875rem',
+                            textTransform: 'uppercase',
+                            color: 'var(--color-text-muted)',
+                          }}
+                        >
+                          ({t('usage.sources.revoked')})
+                        </span>
+                      )}
+                    </span>
+                  </td>
+                  {showUserColumn && (
+                    <td style={{ color: 'var(--color-text-secondary)', fontSize: '0.8125rem' }}>
+                      {r.userName || r.userID || '-'}
+                    </td>
+                  )}
+                  <td style={{ color: 'var(--color-text-muted)', fontSize: '0.75rem' }}>{r.prefix || '-'}</td>
+                  <td style={{ textAlign: 'right', fontFamily: 'var(--font-mono)' }}>
+                    {Number(r.requests || 0).toLocaleString()}
+                  </td>
+                  <td style={{ textAlign: 'right', fontFamily: 'var(--font-mono)' }}>
+                    {formatTokens(r.tokens || 0)}
+                  </td>
+                  <td style={{ textAlign: 'right', fontSize: '0.75rem', color: 'var(--color-text-muted)' }}>
+                    {formatRelative(r.last_used)}
+                  </td>
+                </tr>
+              )
+            })}
+          </tbody>
+        </table>
+      </div>
+    </div>
+  )
+}
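
Review note: the table's search and sort step can be exercised without React. This is a sketch under stated assumptions: `filterAndSortRows` is a hypothetical name, the row fields match the ones built in the `rows` memo, and only two of the five comparators are reproduced.

```javascript
// Hypothetical extraction of the `filtered` memo; not part of the diff.
const SORT_FNS = {
  tokens: (a, b) => (b.tokens || 0) - (a.tokens || 0),
  requests: (a, b) => (b.requests || 0) - (a.requests || 0),
}

function filterAndSortRows(rows, search, sortKey) {
  const q = (search || '').trim().toLowerCase()
  const list = q
    ? rows.filter((r) =>
        [r.name, r.prefix, r.userName, r.userID].some((f) =>
          (f || '').toLowerCase().includes(q)
        )
      )
    : rows
  // Copy before sorting so the caller's array is never mutated;
  // unknown sort keys fall back to the token comparator.
  return [...list].sort(SORT_FNS[sortKey] || SORT_FNS.tokens)
}
```

For example, searching for a user name keeps every row owned by that user and orders them by the chosen column, descending for the numeric comparators.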
--- a/core/http/react-ui/src/utils/api.js
+++ b/core/http/react-ui/src/utils/api.js
@@ -422,6 +422,14 @@ export const usageApi = {
    if (userId) url += `&user_id=${encodeURIComponent(userId)}`
    return fetchJSON(url)
  },
+  getMySources: (period) =>
+    fetchJSON(`/api/auth/usage/sources?period=${period || 'month'}`),
+  getAdminSources: (period, userId, apiKeyId) => {
+    let url = `/api/auth/admin/usage/sources?period=${period || 'month'}`
+    if (userId) url += `&user_id=${encodeURIComponent(userId)}`
+    if (apiKeyId) url += `&api_key_id=${encodeURIComponent(apiKeyId)}`
+    return fetchJSON(url)
+  },
  getMyQuotas: () => fetchJSON('/api/auth/quota'),
 }
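
Review note: `getAdminSources` assembles its query string by concatenation. One possible alternative, shown only as a sketch, is `URLSearchParams`; `buildAdminSourcesUrl` is a hypothetical name and this is not what the diff does. Note that `URLSearchParams` uses form encoding, so a space becomes `+` rather than the `%20` that `encodeURIComponent` produces.

```javascript
// Hypothetical alternative to the string concatenation in getAdminSources.
function buildAdminSourcesUrl(period, userId, apiKeyId) {
  const params = new URLSearchParams({ period: period || 'month' })
  // Optional filters are appended only when the caller supplied them.
  if (userId) params.set('user_id', userId)
  if (apiKeyId) params.set('api_key_id', apiKeyId)
  return `/api/auth/admin/usage/sources?${params}`
}
```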

--- a/core/http/react-ui/src/utils/clipboard.js
+++ b/core/http/react-ui/src/utils/clipboard.js
@@ -0,0 +1,81 @@
+// Clipboard helper that works in non-secure contexts.
+//
+// navigator.clipboard is only defined on https:// origins and on
+// http://localhost. When LocalAI is served over plain http from a remote
+// host (LXC + Docker is a common deployment), every page that called
+// `navigator.clipboard.writeText` silently failed (#9904). This helper
+// transparently falls back to a hidden-textarea + execCommand('copy')
+// trick that browsers still honour when the page is not a secure context.
+//
+// Returns true on success, false on failure. Callers should use the return
+// value to drive the success/failure toast — the old code always claimed
+// success regardless of what actually happened.
+export async function copyToClipboard(text) {
+  if (text == null) return false
+  const value = typeof text === 'string' ? text : String(text)
+
+  if (typeof navigator !== 'undefined' && navigator.clipboard?.writeText && window.isSecureContext) {
+    try {
+      await navigator.clipboard.writeText(value)
+      return true
+    } catch {
+      // Permissions denied, browser refused, etc. — try the fallback.
+    }
+  }
+
+  return legacyCopy(value)
+}
+
+function legacyCopy(value) {
+  if (typeof document === 'undefined') return false
+  const ta = document.createElement('textarea')
+  ta.value = value
+  // Keep the textarea out of the viewport and out of layout reads. Using
+  // `position: fixed` + a negative offset avoids scrolling the page when
+  // we call .select() below.
+  ta.setAttribute('readonly', '')
+  ta.style.position = 'fixed'
+  ta.style.top = '0'
+  ta.style.left = '-9999px'
+  ta.style.opacity = '0'
+  document.body.appendChild(ta)
+  // Preserve the current selection so triggering execCommand doesn't blow
+  // away whatever the user had highlighted on the page.
+  const previousSelection = saveSelection()
+  let ok = false
+  try {
+    ta.select()
+    ta.setSelectionRange(0, value.length)
+    ok = document.execCommand('copy')
+  } catch {
+    ok = false
+  } finally {
+    document.body.removeChild(ta)
+    restoreSelection(previousSelection)
+  }
+  return ok
+}
+
+function saveSelection() {
+  try {
+    const sel = window.getSelection()
+    if (!sel || sel.rangeCount === 0) return null
+    const ranges = []
+    for (let i = 0; i < sel.rangeCount; i++) ranges.push(sel.getRangeAt(i).cloneRange())
+    return ranges
+  } catch {
+    return null
+  }
+}
+
+function restoreSelection(ranges) {
+  if (!ranges) return
+  try {
+    const sel = window.getSelection()
+    if (!sel) return
+    sel.removeAllRanges()
+    for (const r of ranges) sel.addRange(r)
+  } catch {
+    // best-effort
+  }
+}
--- a/core/http/routes/auth.go
+++ b/core/http/routes/auth.go
@@ -789,6 +789,30 @@ func RegisterAuthRoutes(e *echo.Echo, app *application.Application) {
 		})
 	})

+	// GET /api/auth/usage/sources - caller's per-source breakdown (no legacy)
+	e.GET("/api/auth/usage/sources", func(c echo.Context) error {
+		user := auth.GetUser(c)
+		if user == nil {
+			return c.JSON(http.StatusUnauthorized, map[string]string{"error": "not authenticated"})
+		}
+
+		period := c.QueryParam("period")
+		if period == "" {
+			period = "month"
+		}
+
+		buckets, totals, err := auth.GetUserUsageBySource(db, user.ID, period)
+		if err != nil {
+			return c.JSON(http.StatusInternalServerError, map[string]string{"error": "failed to get usage"})
+		}
+
+		return c.JSON(http.StatusOK, map[string]any{
+			"buckets":   buckets,
+			"totals":    totals,
+			"truncated": false,
+		})
+	})
+
 	// Admin endpoints
 	adminMw := auth.RequireAdmin()

@@ -1104,6 +1128,27 @@ func RegisterAuthRoutes(e *echo.Echo, app *application.Application) {
 		})
 	}, adminMw)

+	// GET /api/auth/admin/usage/sources - all users' per-source breakdown (admin only)
+	e.GET("/api/auth/admin/usage/sources", func(c echo.Context) error {
+		period := c.QueryParam("period")
+		if period == "" {
+			period = "month"
+		}
+		userID := c.QueryParam("user_id")
+		apiKeyID := c.QueryParam("api_key_id")
+
+		buckets, totals, truncated, err := auth.GetAllUsageBySource(db, period, userID, apiKeyID)
+		if err != nil {
+			return c.JSON(http.StatusInternalServerError, map[string]string{"error": "failed to get usage"})
+		}
+
+		return c.JSON(http.StatusOK, map[string]any{
+			"buckets":   buckets,
+			"totals":    totals,
+			"truncated": truncated,
+		})
+	}, adminMw)
+
 	// --- Invite management endpoints ---

 	// POST /api/auth/admin/invites - create invite (admin only)
--- a/core/http/routes/auth_test.go
+++ b/core/http/routes/auth_test.go
@@ -286,6 +286,45 @@ func newTestAuthApp(db *gorm.DB, appConfig *config.ApplicationConfig) *echo.Echo
 		return c.JSON(http.StatusOK, map[string]string{"message": "user deleted"})
 	}, adminMw)

+	// Mirror of production handler in routes/auth.go GET /api/auth/usage/sources.
+	// Keep this body in sync with the real handler; this test app cannot call
+	// RegisterAuthRoutes because it needs a *application.Application.
+	e.GET("/api/auth/usage/sources", func(c echo.Context) error {
+		user := auth.GetUser(c)
+		if user == nil {
+			return c.JSON(http.StatusUnauthorized, map[string]string{"error": "not authenticated"})
+		}
+		period := c.QueryParam("period")
+		if period == "" {
+			period = "month"
+		}
+		buckets, totals, err := auth.GetUserUsageBySource(db, user.ID, period)
+		if err != nil {
+			return c.JSON(http.StatusInternalServerError, map[string]string{"error": "failed to get usage"})
+		}
+		return c.JSON(http.StatusOK, map[string]any{
+			"buckets": buckets, "totals": totals, "truncated": false,
+		})
+	})
+
+	// Mirror of production handler in routes/auth.go GET /api/auth/admin/usage/sources.
+	// Keep this body in sync with the real handler.
+	e.GET("/api/auth/admin/usage/sources", func(c echo.Context) error {
+		period := c.QueryParam("period")
+		if period == "" {
+			period = "month"
+		}
+		userID := c.QueryParam("user_id")
+		apiKeyID := c.QueryParam("api_key_id")
+		buckets, totals, truncated, err := auth.GetAllUsageBySource(db, period, userID, apiKeyID)
+		if err != nil {
+			return c.JSON(http.StatusInternalServerError, map[string]string{"error": "failed to get usage"})
+		}
+		return c.JSON(http.StatusOK, map[string]any{
+			"buckets": buckets, "totals": totals, "truncated": truncated,
+		})
+	}, adminMw)
+
 	// Regular API endpoint for testing
 	e.POST("/v1/chat/completions", func(c echo.Context) error {
 		return c.String(http.StatusOK, "ok")
@@ -931,4 +970,110 @@ var _ = Describe("Auth Routes", Label("auth"), func() {
 			Expect(providers).To(ContainElement(auth.ProviderGitHub))
 		})
 	})
+
+	Describe("GET /api/auth/usage/sources", func() {
+		It("returns only the caller's data, never legacy", func() {
+			app := newTestAuthApp(db, appConfig)
+
+			alice := createRouteTestUser(db, "alice@example.com", auth.RoleUser)
+			aliceToken, err := auth.CreateSession(db, alice.ID, "")
+			Expect(err).ToNot(HaveOccurred())
+
+			keyID := "k-alice"
+			now := time.Now()
+			Expect(auth.RecordUsage(db, &auth.UsageRecord{
+				UserID: alice.ID, Source: auth.UsageSourceAPIKey,
+				APIKeyID: &keyID, APIKeyName: "alice-key",
+				Model: "gpt-4", TotalTokens: 100, CreatedAt: now,
+			})).To(Succeed())
+			Expect(auth.RecordUsage(db, &auth.UsageRecord{
+				UserID: alice.ID, Source: auth.UsageSourceWeb,
+				Model: "gpt-4", TotalTokens: 50, CreatedAt: now,
+			})).To(Succeed())
+			Expect(auth.RecordUsage(db, &auth.UsageRecord{
+				UserID: "legacy-api-key", Source: auth.UsageSourceLegacy,
+				Model: "gpt-4", TotalTokens: 30, CreatedAt: now,
+			})).To(Succeed())
+
+			rec := doAuthRequest(app, http.MethodGet, "/api/auth/usage/sources?period=month", nil, withSession(aliceToken))
+			Expect(rec.Code).To(Equal(http.StatusOK))
+
+			var resp struct {
+				Buckets   []auth.UsageBucket `json:"buckets"`
+				Totals    auth.SourceTotals  `json:"totals"`
+				Truncated bool               `json:"truncated"`
+			}
+			Expect(json.Unmarshal(rec.Body.Bytes(), &resp)).To(Succeed())
+			_, hasLegacy := resp.Totals.BySource[auth.UsageSourceLegacy]
+			Expect(hasLegacy).To(BeFalse())
+			Expect(resp.Totals.GrandTotal.Tokens).To(Equal(int64(150)))
+			Expect(resp.Truncated).To(BeFalse())
+		})
+
+		It("returns 401 when unauthenticated", func() {
+			app := newTestAuthApp(db, appConfig)
+
+			// Without a session cookie or bearer token, the global auth middleware
+			// should refuse the request before our handler runs.
+			rec := doAuthRequest(app, http.MethodGet, "/api/auth/usage/sources?period=month", nil)
+			Expect(rec.Code).To(Equal(http.StatusUnauthorized))
+		})
+	})
+
+	Describe("GET /api/auth/admin/usage/sources", func() {
+		It("returns 403 for non-admin", func() {
+			app := newTestAuthApp(db, appConfig)
+
+			alice := createRouteTestUser(db, "alice@example.com", auth.RoleUser)
+			aliceToken, _ := auth.CreateSession(db, alice.ID, "")
+
+			rec := doAuthRequest(app, http.MethodGet, "/api/auth/admin/usage/sources?period=month", nil, withSession(aliceToken))
+			Expect(rec.Code).To(Equal(http.StatusForbidden))
+		})
+
+		It("applies the api_key_id filter for admin, excluding other keys and legacy usage", func() {
+			app := newTestAuthApp(db, appConfig)
+
+			admin := createRouteTestUser(db, "admin@example.com", auth.RoleAdmin)
+			adminToken, _ := auth.CreateSession(db, admin.ID, "")
+
+			k1 := "k1"
+			k2 := "k2"
+			now := time.Now()
+			Expect(auth.RecordUsage(db, &auth.UsageRecord{UserID: "alice", Source: auth.UsageSourceAPIKey, APIKeyID: &k1, APIKeyName: "ci", Model: "gpt-4", TotalTokens: 10, CreatedAt: now})).To(Succeed())
+			Expect(auth.RecordUsage(db, &auth.UsageRecord{UserID: "alice", Source: auth.UsageSourceAPIKey, APIKeyID: &k2, APIKeyName: "lap", Model: "gpt-4", TotalTokens: 20, CreatedAt: now})).To(Succeed())
+			Expect(auth.RecordUsage(db, &auth.UsageRecord{UserID: "legacy-api-key", Source: auth.UsageSourceLegacy, Model: "gpt-4", TotalTokens: 5, CreatedAt: now})).To(Succeed())
+
+			rec := doAuthRequest(app, http.MethodGet,
+				"/api/auth/admin/usage/sources?period=month&api_key_id=k2", nil, withSession(adminToken))
+			Expect(rec.Code).To(Equal(http.StatusOK))
+
+			var resp struct {
+				Totals    auth.SourceTotals `json:"totals"`
+				Truncated bool              `json:"truncated"`
+			}
+			Expect(json.Unmarshal(rec.Body.Bytes(), &resp)).To(Succeed())
+			Expect(resp.Totals.GrandTotal.Tokens).To(Equal(int64(20)))
+		})
+
+		It("includes legacy in by_source for admin with no filter", func() {
+			app := newTestAuthApp(db, appConfig)
+
+			admin := createRouteTestUser(db, "admin@example.com", auth.RoleAdmin)
+			adminToken, _ := auth.CreateSession(db, admin.ID, "")
+
+			now := time.Now()
+			Expect(auth.RecordUsage(db, &auth.UsageRecord{UserID: "legacy-api-key", Source: auth.UsageSourceLegacy, Model: "gpt-4", TotalTokens: 7, CreatedAt: now})).To(Succeed())
+
+			rec := doAuthRequest(app, http.MethodGet, "/api/auth/admin/usage/sources?period=month", nil, withSession(adminToken))
+			Expect(rec.Code).To(Equal(http.StatusOK))
+
+			var resp struct {
+				Totals auth.SourceTotals `json:"totals"`
+			}
+			Expect(json.Unmarshal(rec.Body.Bytes(), &resp)).To(Succeed())
+			Expect(resp.Totals.BySource).To(HaveKey(auth.UsageSourceLegacy))
+			Expect(resp.Totals.BySource[auth.UsageSourceLegacy].Tokens).To(Equal(int64(7)))
+		})
+	})
 })
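The two handlers above share one response envelope (`buckets`, `totals`, `truncated`) and default `period` to `month` when the query parameter is empty. A minimal client-side sketch of that contract follows. It assumes the envelope keys are exactly as shown in the diff; `fetchUsageSources`, `usageSourcesResponse`, and `newStubServer` are hypothetical names introduced only for illustration, authentication is left out, and the inner shapes of `buckets` and `totals` are kept as raw JSON because this diff does not show them.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// usageSourcesResponse mirrors the envelope keys returned by the handlers.
// Buckets and Totals stay raw: their inner layout is not visible in this diff.
type usageSourcesResponse struct {
	Buckets   json.RawMessage `json:"buckets"`
	Totals    json.RawMessage `json:"totals"`
	Truncated bool            `json:"truncated"`
}

// fetchUsageSources is a hypothetical helper (not part of LocalAI). It sends
// the period only when one is given, relying on the server-side default.
func fetchUsageSources(client *http.Client, baseURL, period string) (*usageSourcesResponse, error) {
	endpoint := baseURL + "/api/auth/usage/sources"
	if period != "" {
		q := url.Values{}
		q.Set("period", period)
		endpoint += "?" + q.Encode()
	}
	resp, err := client.Get(endpoint)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	var out usageSourcesResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}

// newStubServer stands in for the real server so the sketch runs on its own.
// It returns an empty envelope; real responses carry actual usage data.
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(map[string]any{
			"buckets":   []any{},
			"totals":    map[string]any{},
			"truncated": false,
		})
	}))
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	resp, err := fetchUsageSources(srv.Client(), srv.URL, "month")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("truncated:", resp.Truncated)
}
```

A caller that needs typed totals would replace the `json.RawMessage` fields with the real `auth` types once their JSON layout is known.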
--- a/core/http/routes/nodes.go
+++ b/core/http/routes/nodes.go
@@ -6,7 +6,9 @@ import (
 	"strings"

 	"github.com/labstack/echo/v4"
+	"github.com/mudler/LocalAI/core/config"
 	"github.com/mudler/LocalAI/core/http/endpoints/localai"
+	"github.com/mudler/LocalAI/core/services/galleryop"
 	"github.com/mudler/LocalAI/core/services/nodes"
 	"gorm.io/gorm"
 )
@@ -53,7 +55,12 @@ func RegisterNodeSelfServiceRoutes(e *echo.Echo, registry *nodes.NodeRegistry, r

 // RegisterNodeAdminRoutes registers /api/nodes/ endpoints used by admins
 // (list, get, get models, drain, delete, approve, backend management). Protected by admin middleware.
-func RegisterNodeAdminRoutes(e *echo.Echo, registry *nodes.NodeRegistry, unloader nodes.NodeCommandSender, adminMw echo.MiddlewareFunc, authDB *gorm.DB, hmacSecret string, registrationToken string) {
+//
+// galleryService/opcache/appConfig are threaded in for the async node-scoped
+// backend install path (POST /:id/backends/install). That handler enqueues a
+// ManagementOp on the gallery channel rather than blocking on a NATS reply, so
+// the browser gets HTTP 202 + jobID immediately instead of waiting up to 3 minutes.
+func RegisterNodeAdminRoutes(e *echo.Echo, registry *nodes.NodeRegistry, unloader nodes.NodeCommandSender, galleryService *galleryop.GalleryService, opcache *galleryop.OpCache, appConfig *config.ApplicationConfig, adminMw echo.MiddlewareFunc, authDB *gorm.DB, hmacSecret string, registrationToken string) {
 	if registry == nil {
 		return
 	}
@@ -78,7 +85,7 @@ func RegisterNodeAdminRoutes(e *echo.Echo, registry *nodes.NodeRegistry, unloade

 	// Backend management on workers
 	admin.GET("/:id/backends", localai.ListBackendsOnNodeEndpoint(unloader))
-	admin.POST("/:id/backends/install", localai.InstallBackendOnNodeEndpoint(unloader))
+	admin.POST("/:id/backends/install", localai.InstallBackendOnNodeEndpoint(unloader, galleryService, opcache, appConfig))
 	admin.POST("/:id/backends/delete", localai.DeleteBackendOnNodeEndpoint(unloader))

 	// Model management on workers
--- a/core/http/routes/ui_api.go
+++ b/core/http/routes/ui_api.go
@@ -214,6 +214,17 @@ func RegisterUIAPIRoutes(app *echo.Echo, cl *config.ModelConfigLoader, ml *model
 				}
 			}

+			// Node-scoped backend ops (from /api/nodes/:id/backends/install)
+			// carry the nodeID inside the opcache key as "node:<nodeID>:<backend>".
+			// Pull it back out so the operations panel can label which node the
+			// install is targeting, and so the display name is just the backend
+			// slug instead of the full prefixed key.
+			scopedNodeID := ""
+			if nodeID, backend, ok := galleryop.ParseNodeScopedKey(galleryID); ok {
+				scopedNodeID = nodeID
+				galleryID = backend
+			}
+
 			// Extract display name (remove repo prefix if exists)
 			displayName := galleryID
 			if strings.Contains(galleryID, "@") {
@@ -237,6 +248,12 @@ func RegisterUIAPIRoutes(app *echo.Echo, cl *config.ModelConfigLoader, ml *model
 				"cancellable": isCancellable,
 				"message":     message,
 			}
+			// Only attach nodeID when this op was node-scoped: an empty string
+			// would mislead the UI into rendering a node attribution that never
+			// existed in the first place.
+			if scopedNodeID != "" {
+				opData["nodeID"] = scopedNodeID
+			}
 			if status != nil && status.Error != nil {
 				opData["error"] = status.Error.Error()
 			}
--- a/core/http/routes/ui_api_operations_test.go
+++ b/core/http/routes/ui_api_operations_test.go
@@ -0,0 +1,98 @@
+package routes_test
+
+import (
+	"encoding/json"
+	"net/http"
+	"net/http/httptest"
+
+	"github.com/labstack/echo/v4"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+
+	"github.com/mudler/LocalAI/core/application"
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/http/routes"
+	"github.com/mudler/LocalAI/core/services/galleryop"
+)
+
+// These specs guard the contract between the opcache (which stores
+// node-scoped backend installs under a "node:<nodeID>:<backend>" key) and the
+// /api/operations response surface the React UI polls. Without nodeID
+// extraction the panel would show the raw prefixed key and have no way to
+// label which worker an install is targeting.
+var _ = Describe("/api/operations with node-scoped backend ops", func() {
+	// We pass a zero-value *application.Application because the handler's
+	// distributed-services branch guards on a nil check on the returned
+	// *DistributedServices, which is nil for a fresh Application{}.
+	noopMw := func(next echo.HandlerFunc) echo.HandlerFunc { return next }
+
+	It("emits nodeID and the un-prefixed backend name for keys built by NodeScopedKey", func() {
+		appCfg := &config.ApplicationConfig{}
+		galleryService := galleryop.NewGalleryService(appCfg, nil)
+		opcache := galleryop.NewOpCache(galleryService)
+
+		key := galleryop.NodeScopedKey("worker-7", "llama-cpp")
+		opcache.SetBackend(key, "job-uuid-123")
+
+		e := echo.New()
+		routes.RegisterUIAPIRoutes(e, nil, nil, appCfg, galleryService, opcache, &application.Application{}, noopMw)
+
+		req := httptest.NewRequest(http.MethodGet, "/api/operations", nil)
+		rec := httptest.NewRecorder()
+		e.ServeHTTP(rec, req)
+
+		Expect(rec.Code).To(Equal(http.StatusOK))
+
+		// The handler wraps operations in {"operations": [...]}.
+		var envelope struct {
+			Operations []map[string]any `json:"operations"`
+		}
+		Expect(json.Unmarshal(rec.Body.Bytes(), &envelope)).To(Succeed())
+
+		var found map[string]any
+		for _, op := range envelope.Operations {
+			if op["jobID"] == "job-uuid-123" {
+				found = op
+				break
+			}
+		}
+		Expect(found).ToNot(BeNil(), "node-scoped op should appear in /api/operations")
+		Expect(found["nodeID"]).To(Equal("worker-7"))
+		Expect(found["name"]).To(Equal("llama-cpp"))
+		Expect(found["isBackend"]).To(Equal(true))
+	})
+
+	It("does not emit nodeID for non-node-scoped backend ops", func() {
+		appCfg := &config.ApplicationConfig{}
+		galleryService := galleryop.NewGalleryService(appCfg, nil)
+		opcache := galleryop.NewOpCache(galleryService)
+
+		// Legacy/global install path: bare backend name as the opcache key.
+		opcache.SetBackend("llama-cpp", "job-uuid-456")
+
+		e := echo.New()
+		routes.RegisterUIAPIRoutes(e, nil, nil, appCfg, galleryService, opcache, &application.Application{}, noopMw)
+
+		req := httptest.NewRequest(http.MethodGet, "/api/operations", nil)
+		rec := httptest.NewRecorder()
+		e.ServeHTTP(rec, req)
+
+		Expect(rec.Code).To(Equal(http.StatusOK))
+		var envelope struct {
+			Operations []map[string]any `json:"operations"`
+		}
+		Expect(json.Unmarshal(rec.Body.Bytes(), &envelope)).To(Succeed())
+
+		var found map[string]any
+		for _, op := range envelope.Operations {
+			if op["jobID"] == "job-uuid-456" {
+				found = op
+				break
+			}
+		}
+		Expect(found).ToNot(BeNil())
+		// Critical: bare ops must NOT gain a misleading empty nodeID field.
+		Expect(found).ToNot(HaveKey("nodeID"), "non-node-scoped ops must NOT carry a nodeID field")
+		Expect(found["name"]).To(Equal("llama-cpp"))
+	})
+})
--- a/core/services/galleryop/backends.go
+++ b/core/services/galleryop/backends.go
@@ -113,7 +113,7 @@ func (g *GalleryService) backendHandler(op *ManagementOp[gallery.GalleryBackend,
 // InstallExternalBackend installs a backend from an external source (OCI image, URL, or path).
 // This method contains the logic to detect the input type and call the appropriate installation function.
 // It can be used by both CLI and Web UI for installing backends from external sources.
-func InstallExternalBackend(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, modelLoader *model.ModelLoader, downloadStatus func(string, string, string, float64), backend, name, alias string) error {
+func InstallExternalBackend(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, modelLoader *model.ModelLoader, downloadStatus func(string, string, string, float64), backend, name, alias string, requireIntegrity bool) error {
 	uri := downloader.URI(backend)
 	switch {
 	case uri.LooksLikeDir():
@@ -127,7 +127,7 @@ func InstallExternalBackend(ctx context.Context, galleries []config.Gallery, sys
 			},
 			Alias: alias,
 			URI:   backend,
-		}, downloadStatus); err != nil {
+		}, downloadStatus, requireIntegrity); err != nil {
 			return fmt.Errorf("error installing backend %s: %w", backend, err)
 		}
 	case uri.LooksLikeOCI() && !uri.LooksLikeOCIFile():
@@ -141,7 +141,7 @@ func InstallExternalBackend(ctx context.Context, galleries []config.Gallery, sys
 			},
 			Alias: alias,
 			URI:   backend,
-		}, downloadStatus); err != nil {
+		}, downloadStatus, requireIntegrity); err != nil {
 			return fmt.Errorf("error installing backend %s: %w", backend, err)
 		}
 	case uri.LooksLikeOCIFile():
@@ -163,7 +163,7 @@ func InstallExternalBackend(ctx context.Context, galleries []config.Gallery, sys
 			},
 			Alias: alias,
 			URI:   backend,
-		}, downloadStatus); err != nil {
+		}, downloadStatus, requireIntegrity); err != nil {
 			return fmt.Errorf("error installing backend %s: %w", backend, err)
 		}
 	default:
@@ -171,7 +171,7 @@ func InstallExternalBackend(ctx context.Context, galleries []config.Gallery, sys
 		if name != "" || alias != "" {
 			return fmt.Errorf("specifying a name or alias is not supported for gallery backends")
 		}
-		err := gallery.InstallBackendFromGallery(ctx, galleries, systemState, modelLoader, backend, downloadStatus, true)
+		err := gallery.InstallBackendFromGallery(ctx, galleries, systemState, modelLoader, backend, downloadStatus, true, requireIntegrity)
 		if err != nil {
 			return fmt.Errorf("error installing backend %s: %w", backend, err)
 		}
--- a/core/services/galleryop/backends_test.go
+++ b/core/services/galleryop/backends_test.go
@@ -70,6 +70,7 @@ var _ = Describe("InstallExternalBackend", func() {
 				"test-backend", // gallery name
 				"custom-name",  // name should not be allowed
 				"",
+				false,
 			)
 			Expect(err).To(HaveOccurred())
 			Expect(err.Error()).To(ContainSubstring("specifying a name or alias is not supported for gallery backends"))
@@ -85,6 +86,7 @@ var _ = Describe("InstallExternalBackend", func() {
 				"non-existent-backend",
 				"",
 				"",
+				false,
 			)
 			Expect(err).To(HaveOccurred())
 		})
@@ -101,6 +103,7 @@ var _ = Describe("InstallExternalBackend", func() {
 				"oci://quay.io/mudler/tests:localai-backend-test",
 				"", // name is required for OCI images
 				"",
+				false,
 			)
 			Expect(err).To(HaveOccurred())
 			Expect(err.Error()).To(ContainSubstring("specifying a name is required for OCI images"))
@@ -133,6 +136,7 @@ var _ = Describe("InstallExternalBackend", func() {
 				testBackendPath,
 				"", // name should be inferred as "source-backend"
 				"",
+				false,
 			)
 			// The function should at least attempt to install with the inferred name
 			// Even if it fails for other reasons, it shouldn't fail due to missing name
@@ -151,6 +155,7 @@ var _ = Describe("InstallExternalBackend", func() {
 				testBackendPath,
 				"custom-backend-name",
 				"",
+				false,
 			)
 			// The function should use the provided name
 			if err != nil {
@@ -168,6 +173,7 @@ var _ = Describe("InstallExternalBackend", func() {
 				testBackendPath,
 				"custom-backend-name",
 				"custom-alias",
+				false,
 			)
 			// The function should accept alias for directory paths
 			if err != nil {
@@ -190,4 +196,60 @@ var _ = Describe("ManagementOp with External Backend", func() {
 		Expect(op.ExternalName).To(Equal("test-backend"))
 		Expect(op.ExternalAlias).To(Equal("test-alias"))
 	})
+
+	Context("TargetNodeID field", func() {
+		It("defaults to empty string", func() {
+			op := galleryop.ManagementOp[string, string]{
+				ExternalURI: "oci://example.com/backend:latest",
+			}
+			Expect(op.TargetNodeID).To(BeEmpty())
+		})
+
+		It("preserves TargetNodeID across a channel send", func() {
+			ch := make(chan galleryop.ManagementOp[string, string], 1)
+			ch <- galleryop.ManagementOp[string, string]{
+				GalleryElementName: "llama-cpp",
+				TargetNodeID:       "node-abc-123",
+			}
+			received := <-ch
+			Expect(received.TargetNodeID).To(Equal("node-abc-123"))
+			Expect(received.GalleryElementName).To(Equal("llama-cpp"))
+		})
+	})
+
+	Describe("NodeScopedKey", func() {
+		It("builds a unique key per (nodeID, backend) pair", func() {
+			Expect(galleryop.NodeScopedKey("node-a", "llama-cpp")).To(Equal("node:node-a:llama-cpp"))
+			Expect(galleryop.NodeScopedKey("node-b", "llama-cpp")).To(Equal("node:node-b:llama-cpp"))
+			Expect(galleryop.NodeScopedKey("node-a", "vllm")).To(Equal("node:node-a:vllm"))
+		})
+
+		It("handles backend names containing colons", func() {
+			// Gallery IDs sometimes look like "official@llama-cpp"; nodeIDs are UUIDs
+			// without colons, but the backend slug may contain anything. Splitting on
+			// the first colon after the prefix MUST yield the full backend back.
+			key := galleryop.NodeScopedKey("node-1", "official@llama-cpp:v2")
+			node, backend, ok := galleryop.ParseNodeScopedKey(key)
+			Expect(ok).To(BeTrue())
+			Expect(node).To(Equal("node-1"))
+			Expect(backend).To(Equal("official@llama-cpp:v2"))
+		})
+
+		It("rejects keys without the node prefix", func() {
+			_, _, ok := galleryop.ParseNodeScopedKey("llama-cpp")
+			Expect(ok).To(BeFalse())
+			_, _, ok = galleryop.ParseNodeScopedKey("official@llama-cpp")
+			Expect(ok).To(BeFalse())
+		})
+
+		It("rejects malformed node-prefixed keys", func() {
+			_, _, ok := galleryop.ParseNodeScopedKey("node:only-one-segment")
+			Expect(ok).To(BeFalse())
+		})
+
+		It("rejects keys with an empty nodeID segment", func() {
+			_, _, ok := galleryop.ParseNodeScopedKey("node::llama-cpp")
+			Expect(ok).To(BeFalse())
+		})
+	})
 })
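The specs above pin down the key contract without showing the helpers themselves. The following is a minimal sketch of one implementation that would satisfy them, not the code from this change: the `nodeKeyPrefix` constant is an assumption, and rejecting an empty backend segment is an extra guard that the specs do not require.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeKeyPrefix is an assumed constant; the specs only show the resulting
// "node:<nodeID>:<backend>" shape.
const nodeKeyPrefix = "node:"

// NodeScopedKey builds the opcache key for a backend install on one node.
func NodeScopedKey(nodeID, backend string) string {
	return nodeKeyPrefix + nodeID + ":" + backend
}

// ParseNodeScopedKey splits a key built by NodeScopedKey. It cuts on the
// first colon after the prefix, so colons inside the backend name survive.
func ParseNodeScopedKey(key string) (nodeID, backend string, ok bool) {
	if !strings.HasPrefix(key, nodeKeyPrefix) {
		return "", "", false
	}
	rest := strings.TrimPrefix(key, nodeKeyPrefix)
	nodeID, backend, found := strings.Cut(rest, ":")
	// Assumption: an empty backend is rejected too; the specs only cover
	// the missing-separator and empty-nodeID cases.
	if !found || nodeID == "" || backend == "" {
		return "", "", false
	}
	return nodeID, backend, true
}

func main() {
	key := NodeScopedKey("worker-7", "llama-cpp")
	fmt.Println(key) // prints "node:worker-7:llama-cpp"

	nodeID, backend, ok := ParseNodeScopedKey(key)
	fmt.Println(nodeID, backend, ok)
}
```

Cutting on the first colon relies on nodeIDs never containing one, which matches the comment in the spec about nodeIDs being UUIDs.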
--- a/core/services/galleryop/managers_local.go
+++ b/core/services/galleryop/managers_local.go
@@ -16,6 +16,7 @@ type LocalModelManager struct {
 	modelLoader                 *model.ModelLoader
 	enforcePredownloadScans     bool
 	automaticallyInstallBackend bool
+	requireBackendIntegrity     bool
 }

 // NewLocalModelManager creates a LocalModelManager from the application config.
@@ -25,6 +26,7 @@ func NewLocalModelManager(appConfig *config.ApplicationConfig, ml *model.ModelLo
 		modelLoader:                 ml,
 		enforcePredownloadScans:     appConfig.EnforcePredownloadScans,
 		automaticallyInstallBackend: appConfig.AutoloadBackendGalleries,
+		requireBackendIntegrity:     appConfig.RequireBackendIntegrity,
 	}
 }

@@ -53,32 +55,34 @@ func (m *LocalModelManager) InstallModel(ctx context.Context, op *ManagementOp[g
 		if m.automaticallyInstallBackend && installedModel.Backend != "" {
 			xlog.Debug("Installing backend", "backend", installedModel.Backend)
 			return gallery.InstallBackendFromGallery(ctx, op.BackendGalleries, m.systemState,
-				m.modelLoader, installedModel.Backend, progressCb, false)
+				m.modelLoader, installedModel.Backend, progressCb, false, m.requireBackendIntegrity)
 		}
 		return nil
 	case op.GalleryElementName != "":
 		return gallery.InstallModelFromGallery(ctx, op.Galleries, op.BackendGalleries,
 			m.systemState, m.modelLoader, op.GalleryElementName, op.Req, progressCb,
-			m.enforcePredownloadScans, m.automaticallyInstallBackend)
+			m.enforcePredownloadScans, m.automaticallyInstallBackend, m.requireBackendIntegrity)
 	default:
 		return installModelFromRemoteConfig(ctx, m.systemState, m.modelLoader, op.Req,
-			progressCb, m.enforcePredownloadScans, m.automaticallyInstallBackend, op.BackendGalleries)
+			progressCb, m.enforcePredownloadScans, m.automaticallyInstallBackend, op.BackendGalleries, m.requireBackendIntegrity)
 	}
 }

 // LocalBackendManager handles backend install/delete on the local instance.
 type LocalBackendManager struct {
-	systemState      *system.SystemState
-	modelLoader      *model.ModelLoader
-	backendGalleries []config.Gallery
+	systemState             *system.SystemState
+	modelLoader             *model.ModelLoader
+	backendGalleries        []config.Gallery
+	requireBackendIntegrity bool
 }

 // NewLocalBackendManager creates a LocalBackendManager from the application config.
 func NewLocalBackendManager(appConfig *config.ApplicationConfig, ml *model.ModelLoader) *LocalBackendManager {
 	return &LocalBackendManager{
-		systemState:      appConfig.SystemState,
-		modelLoader:      ml,
-		backendGalleries: appConfig.BackendGalleries,
+		systemState:             appConfig.SystemState,
+		modelLoader:             ml,
+		backendGalleries:        appConfig.BackendGalleries,
+		requireBackendIntegrity: appConfig.RequireBackendIntegrity,
 	}
 }

@@ -93,7 +97,7 @@ func (b *LocalBackendManager) ListBackends() (gallery.SystemBackends, error) {
 }

 func (b *LocalBackendManager) UpgradeBackend(ctx context.Context, name string, progressCb ProgressCallback) error {
-	return gallery.UpgradeBackend(ctx, b.systemState, b.modelLoader, b.backendGalleries, name, progressCb)
+	return gallery.UpgradeBackend(ctx, b.systemState, b.modelLoader, b.backendGalleries, name, progressCb, b.requireBackendIntegrity)
 }

 func (b *LocalBackendManager) CheckUpgrades(ctx context.Context) (map[string]gallery.UpgradeInfo, error) {
@@ -103,10 +107,10 @@ func (b *LocalBackendManager) CheckUpgrades(ctx context.Context) (map[string]gal
 func (b *LocalBackendManager) InstallBackend(ctx context.Context, op *ManagementOp[gallery.GalleryBackend, any], progressCb ProgressCallback) error {
 	if op.ExternalURI != "" {
 		return InstallExternalBackend(ctx, b.backendGalleries, b.systemState, b.modelLoader,
-			progressCb, op.ExternalURI, op.ExternalName, op.ExternalAlias)
+			progressCb, op.ExternalURI, op.ExternalName, op.ExternalAlias, b.requireBackendIntegrity)
 	}
 	return gallery.InstallBackendFromGallery(ctx, b.backendGalleries, b.systemState,
-		b.modelLoader, op.GalleryElementName, progressCb, true)
+		b.modelLoader, op.GalleryElementName, progressCb, true, b.requireBackendIntegrity)
 }

 func (b *LocalBackendManager) IsDistributed() bool { return false }
--- a/core/services/galleryop/models.go
+++ b/core/services/galleryop/models.go
@@ -123,7 +123,7 @@ func (g *GalleryService) modelHandler(op *ManagementOp[gallery.GalleryModel, gal
 	return nil
 }

-func installModelFromRemoteConfig(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, req gallery.GalleryModel, downloadStatus func(string, string, string, float64), enforceScan, automaticallyInstallBackend bool, backendGalleries []config.Gallery) error {
+func installModelFromRemoteConfig(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, req gallery.GalleryModel, downloadStatus func(string, string, string, float64), enforceScan, automaticallyInstallBackend bool, backendGalleries []config.Gallery, requireBackendIntegrity bool) error {
 	config, err := gallery.GetGalleryConfigFromURLWithContext[gallery.ModelConfig](ctx, req.URL, systemState.Model.ModelsPath)
 	if err != nil {
 		return err
@@ -137,7 +137,7 @@ func installModelFromRemoteConfig(ctx context.Context, systemState *system.Syste
 	}

 	if automaticallyInstallBackend && installedModel.Backend != "" {
-		if err := gallery.InstallBackendFromGallery(ctx, backendGalleries, systemState, modelLoader, installedModel.Backend, downloadStatus, false); err != nil {
+		if err := gallery.InstallBackendFromGallery(ctx, backendGalleries, systemState, modelLoader, installedModel.Backend, downloadStatus, false, requireBackendIntegrity); err != nil {
 			return err
 		}
 	}
@@ -150,23 +150,23 @@ type galleryModel struct {
 	ID                   string           `json:"id"`
 }

-func processRequests(systemState *system.SystemState, modelLoader *model.ModelLoader, enforceScan, automaticallyInstallBackend bool, galleries []config.Gallery, backendGalleries []config.Gallery, requests []galleryModel) error {
+func processRequests(systemState *system.SystemState, modelLoader *model.ModelLoader, enforceScan, automaticallyInstallBackend bool, galleries []config.Gallery, backendGalleries []config.Gallery, requests []galleryModel, requireBackendIntegrity bool) error {
 	ctx := context.Background()
 	var err error
 	for _, r := range requests {
 		utils.ResetDownloadTimers()
 		if r.ID == "" {
-			err = installModelFromRemoteConfig(ctx, systemState, modelLoader, r.GalleryModel, utils.DisplayDownloadFunction, enforceScan, automaticallyInstallBackend, backendGalleries)
+			err = installModelFromRemoteConfig(ctx, systemState, modelLoader, r.GalleryModel, utils.DisplayDownloadFunction, enforceScan, automaticallyInstallBackend, backendGalleries, requireBackendIntegrity)

 		} else {
 			err = gallery.InstallModelFromGallery(
-				ctx, galleries, backendGalleries, systemState, modelLoader, r.ID, r.GalleryModel, utils.DisplayDownloadFunction, enforceScan, automaticallyInstallBackend)
+				ctx, galleries, backendGalleries, systemState, modelLoader, r.ID, r.GalleryModel, utils.DisplayDownloadFunction, enforceScan, automaticallyInstallBackend, requireBackendIntegrity)
 		}
 	}
 	return err
 }

-func ApplyGalleryFromFile(systemState *system.SystemState, modelLoader *model.ModelLoader, enforceScan, automaticallyInstallBackend bool, galleries []config.Gallery, backendGalleries []config.Gallery, s string) error {
+func ApplyGalleryFromFile(systemState *system.SystemState, modelLoader *model.ModelLoader, enforceScan, automaticallyInstallBackend bool, galleries []config.Gallery, backendGalleries []config.Gallery, s string, requireBackendIntegrity bool) error {
 	dat, err := os.ReadFile(s)
 	if err != nil {
 		return err
@@ -177,15 +177,15 @@ func ApplyGalleryFromFile(systemState *system.SystemState, modelLoader *model.Mo
 		return err
 	}

-	return processRequests(systemState, modelLoader, enforceScan, automaticallyInstallBackend, galleries, backendGalleries, requests)
+	return processRequests(systemState, modelLoader, enforceScan, automaticallyInstallBackend, galleries, backendGalleries, requests, requireBackendIntegrity)
 }

-func ApplyGalleryFromString(systemState *system.SystemState, modelLoader *model.ModelLoader, enforceScan, automaticallyInstallBackend bool, galleries []config.Gallery, backendGalleries []config.Gallery, s string) error {
+func ApplyGalleryFromString(systemState *system.SystemState, modelLoader *model.ModelLoader, enforceScan, automaticallyInstallBackend bool, galleries []config.Gallery, backendGalleries []config.Gallery, s string, requireBackendIntegrity bool) error {
 	var requests []galleryModel
 	err := json.Unmarshal([]byte(s), &requests)
 	if err != nil {
 		return err
 	}

-	return processRequests(systemState, modelLoader, enforceScan, automaticallyInstallBackend, galleries, backendGalleries, requests)
+	return processRequests(systemState, modelLoader, enforceScan, automaticallyInstallBackend, galleries, backendGalleries, requests, requireBackendIntegrity)
 }