chore: ⬆️ Update ikawrakow/ik_llama.cpp to 77413bc900f9a2bfd8a5407f184427bcc0825f6c (#9899)

⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
chore: ⬆️ Update ggml-org/whisper.cpp to afa2ea544fb4b0448916b4a31ecd33c8685bd482 (#9898)
2026-05-20 06:35:41 -04:00 · 2026-05-20 01:02:53 +02:00 · 2026-05-20 01:02:25 +02:00 · 2026-05-20 01:01:45 +02:00 · 2026-05-20 01:01:20 +02:00 · 2026-05-20 01:00:32 +02:00
154 changed files with 7346 additions and 564 deletions
--- a/.agents/adding-backends.md
+++ b/.agents/adding-backends.md
@@ -112,6 +112,8 @@ Add a YAML anchor definition in the `## metas` section (around line 2-300). Look

 Add image entries at the end of the file, following the pattern of similar backends such as `diffusers` or `chatterbox`. Include both `latest` (production) and `master` (development) tags.

+**Note on integrity:** OCI backends installed from a gallery whose `verification:` block is set are verified against a keyless-cosign policy before extraction; tarball/HTTP backends use the optional `sha256:` field. New backends do not need any extra YAML — the gallery-level `verification:` block covers every entry. See [.agents/backend-signing.md](backend-signing.md) for the producer-side CI step.
+
 ## 4. Update the Makefile

 The Makefile needs to be updated in several places to support building and testing the new backend:
--- a/.agents/api-endpoints-and-auth.md
+++ b/.agents/api-endpoints-and-auth.md
@@ -284,7 +284,17 @@ Also bump the expected-length count in `api_instructions_test.go` and add the na

 ### 3. `capabilities.js` symbol (for new model-config FLAG_* flags)

-If your feature needs a new `FLAG_*` usecase flag in `core/config/model_config.go` (so users can filter gallery models by it, and so `/v1/models` surfaces it), also declare the matching symbol in `core/http/react-ui/src/utils/capabilities.js`:
+If your feature needs a new `FLAG_*` usecase flag in `core/config/model_config.go` (so users can filter gallery models by it, and so `/v1/models` surfaces it), you need to update **all** of:
+
+- `Usecase<Name>` string constant in `core/config/backend_capabilities.go`
+- `UsecaseInfoMap` entry mapping the string to its flag + gRPC method
+- `FLAG_<NAME>` bitmask in `core/config/model_config.go`
+- `GetAllModelConfigUsecases()` map entry (otherwise the YAML loader silently ignores the string)
+- `ModalityGroups` membership if the flag should affect `IsMultimodal()` (e.g. realtime_audio is in both speech-input and audio-output groups so a lone flag still reads as multimodal)
+- `GuessUsecases()` branch listing the backends that own this capability
+- `usecaseFilters` in `core/http/routes/ui_api.go` (drives the gallery filter dropdown)
+- `Models.jsx` `FILTERS` array + matching `filters.<camelCase>` i18n key in `core/http/react-ui/public/locales/en/models.json`
+- `core/http/react-ui/src/utils/capabilities.js`:

 ```js
 export const CAP_MY_CAPABILITY = 'FLAG_MY_CAPABILITY'
--- a/.agents/backend-signing.md
+++ b/.agents/backend-signing.md
@@ -0,0 +1,120 @@
+# Backend image signing & verification
+
+LocalAI verifies backend OCI images against a per-gallery keyless-cosign
+policy. This page documents the trust model, the producer side
+(`.github/workflows/backend_merge.yml` in this repo), and the consumer
+side (`pkg/oci/cosignverify` plus the gallery YAML).
+
+## Trust model
+
+- **Producer:** `.github/workflows/backend_merge.yml` signs each pushed
+  manifest list with `cosign sign --recursive` in keyless mode after
+  `docker buildx imagetools create`. The signing cert is issued by
+  Fulcio bound to the workflow's OIDC identity. There is no long-lived
+  signing key. `--recursive` signs both the manifest list and every
+  per-arch entry — needed because our consumer resolves a tag to a
+  per-arch manifest before checking signatures.
+- **Storage:** Signatures are written as OCI 1.1 referrers
+  (`--registry-referrers-mode=oci-1-1`) in the new Sigstore bundle format
+  (`--new-bundle-format`). No `:sha256-<hex>.sig` tag clutter.
+- **Consumer:** `pkg/oci/cosignverify` discovers the bundle via the
+  referrers API, hands it to `sigstore-go`, and verifies it against the
+  policy declared in the gallery YAML (`Gallery.Verification`).
+- **Revocation:** Keyless cosign certs are ephemeral (10-minute Fulcio
+  validity), so revocation is policy-side, not CA-side. The gallery's
+  `verification.not_before` (RFC3339) is the kill-switch — advance it to
+  invalidate every signature produced before a known compromise window.
+
+## Producer setup
+
+`backend_merge.yml` is the workflow that joins per-arch digests into the
+multi-arch manifest list users actually pull, so it's also the right place
+to sign. The job needs:
+
+- `permissions: { id-token: write, contents: read }` at the job level so
+  the runner can exchange its GitHub OIDC token for a Fulcio cert.
+- `sigstore/cosign-installer@v3` step (cosign ≥ 2.2 for
+  `--new-bundle-format`).
+- After each `docker buildx imagetools create`, resolve the resulting
+  list digest with `docker buildx imagetools inspect <tag> --format
+  '{{.Manifest.Digest}}'` and sign:
+
+```sh
+cosign sign --yes --recursive \
+  --new-bundle-format \
+  --registry-referrers-mode=oci-1-1 \
+  "${REGISTRY_REPO}@${DIGEST}"
+```
+
+Sign by digest, never by tag — signing by tag binds the signature to
+whatever the tag points at *now*, and a subsequent tag push orphans it.
+
+`backend_build_darwin.yml` builds and pushes single-arch darwin images
+that bypass the manifest-list merge. If/when those entries get a gallery
+`verification:` policy, the equivalent cosign step has to land there
+too.
+
+## Consumer setup (in `mudler/LocalAI` gallery YAML)
+
+Once CI is signing, add a `verification:` block to the backend gallery
+entry (`backend/index.yaml`):
+
+```yaml
+- name: localai
+  url: github:mudler/LocalAI/backend/index.yaml@master
+  verification:
+    issuer: "https://token.actions.githubusercontent.com"
+    identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@refs/heads/master$"
+    # Optional revocation cutoff; advance during incident response.
+    # not_before: "2026-06-01T00:00:00Z"
+```
+
+Identity matching pins the OIDC subject Fulcio issued the signing cert
+to. Without this, any image signed by *anyone* with a Fulcio cert would
+pass — the regex is what makes a signature mean "produced by our CI".
+
+## Strict mode
+
+Default behaviour: OCI backends without a `verification:` block install
+with a warning (logs include `installing OCI backend without signature
+verification`). Tarball/HTTP backends without a `sha256` field log a
+similar warning.
+
+For production, set `LOCALAI_REQUIRE_BACKEND_INTEGRITY=1` (or pass
+`--require-backend-integrity` to `local-ai run` / `local-ai backends
+install` / `local-ai models install`). The warning becomes a hard error
+and unverifiable backends refuse to install.
+
+## Revocation playbook
+
+If `backend_merge.yml` (or any workflow with `id-token: write`) is
+compromised and we've shipped malicious signed images:
+
+1. **Identify the compromise window.** Find the earliest IntegratedTime
+   from the bad signatures (Rekor search by `subject` filter).
+2. **Set `verification.not_before`** in `backend/index.yaml` to a
+   timestamp just *after* that window's start.
+3. **Push the YAML.** Deployed LocalAI instances pick it up on next
+   gallery refresh (1-hour cache in `core/gallery/gallery.go`).
+4. **Fix the underlying compromise** in the workflow and re-sign images
+   with the new build, which will have IntegratedTime > `not_before`.
+5. **Optional:** for absolute decisiveness, also rotate to a new
+   workflow path (`backend_merge_v2.yml`) and update `identity_regex`.
+
+## Where the code lives
+
+- `pkg/oci/cosignverify/` — verifier, policy, OCI referrer fetch, NotBefore enforcement.
+- `pkg/downloader/uri.go` — `WithImageVerifier` option threaded through `DownloadFileWithContext`.
+- `core/gallery/backends.go` — `backendDownloadOptions` builds the verifier from the gallery's policy.
+- `core/config/gallery.go` — `Gallery.Verification` YAML schema.
+- `core/cli/run.go`, `core/cli/backends.go`, `core/cli/models.go` — `--require-backend-integrity` flag propagation.
+- `.github/workflows/backend_merge.yml` — producer-side `cosign sign --recursive` after each multi-arch manifest list push.
+
+## Out of scope (follow-ups)
+
+- **Signing the gallery YAML itself.** The index is fetched over HTTPS
+  from GitHub; we trust the host. A cosign blob signature on the YAML
+  would close that gap but adds key-management overhead. Revisit this
+  page if/when added.
+- **Tarball/HTTP backend signing.** Cosign can sign arbitrary blobs, but
+  for now non-OCI backends keep using the `sha256:` field in YAML.
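
Reviewer sketch for the policy described in `backend-signing.md`: the consumer-side checks (issuer match, identity regex, `not_before` cutoff) can be approximated in a few lines of shell. This is a minimal illustration, not the `pkg/oci/cosignverify` implementation. `policy_accepts` is a hypothetical helper, and it assumes the issuer, identity, and integrated time have already been extracted from a cryptographically verified bundle. It also assumes a signature is rejected when its integrated time is earlier than `not_before`, and that timestamps are RFC3339 in UTC with a `Z` suffix so plain string comparison orders them correctly.

```sh
#!/bin/sh
# Policy values copied from the gallery YAML example in the diff above.
ISSUER='https://token.actions.githubusercontent.com'
IDENTITY_REGEX='^https://github\.com/mudler/LocalAI/\.github/workflows/backend_merge\.yml@refs/heads/master$'

# policy_accepts ISSUER IDENTITY INTEGRATED_TIME [NOT_BEFORE]
# Hypothetical helper: returns 0 when the signature metadata satisfies
# the policy, 1 otherwise.
policy_accepts() {
  issuer=$1 identity=$2 signed_at=$3 not_before=${4:-}

  # The OIDC issuer must match exactly.
  [ "$issuer" = "$ISSUER" ] || return 1

  # The identity must match the pinned workflow path and ref.
  printf '%s\n' "$identity" | grep -Eq "$IDENTITY_REGEX" || return 1

  # Optional cutoff: reject anything signed before not_before.
  # Both values are non-numeric strings, so awk compares them as strings.
  if [ -n "$not_before" ]; then
    awk -v a="$signed_at" -v b="$not_before" 'BEGIN { exit (a < b) ? 1 : 0 }' || return 1
  fi
  return 0
}
```

Calling `policy_accepts` with the expected issuer and the `backend_merge.yml@refs/heads/master` identity succeeds. The same call with a fork's workflow path, or with a `not_before` later than the integrated time, fails. This shows why the regex and the cutoff carry the trust decision rather than the mere presence of a Fulcio certificate.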
--- a/.agents/llama-cpp-backend.md
+++ b/.agents/llama-cpp-backend.md
@@ -61,6 +61,12 @@ Always check `llama.cpp` for new model configuration options that should be supp
   - `reasoning_format` - Reasoning format options
   - Any new flags or parameters

+### Speculative Decoding Types
+
+The `spec_type` option in `grpc-server.cpp` delegates to upstream's `common_speculative_types_from_names()`, so new speculative types added to the `common_speculative_type_from_name` map in `common/speculative.cpp` are picked up automatically with no code changes - only docs need an entry in `docs/content/advanced/model-configuration.md`. Current values: `none`, `draft-simple`, `draft-eagle3`, `draft-mtp`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, `ngram-cache`.
+
+`draft-mtp` (Multi-Token Prediction, [ggml-org/llama.cpp#22673](https://github.com/ggml-org/llama.cpp/pull/22673)) does not need a separate draft GGUF: when `spec_type` includes `draft-mtp` and `draftmodel` is empty, the upstream server creates an MTP context off the target model itself. LocalAI's gRPC layer needs no changes for this — it works through the existing `params.speculative.types` plumbing and the derived `cparams.n_rs_seq = params.speculative.need_n_rs_seq()` in `common_context_params_to_llama`.
+
 ### Implementation Guidelines

 1. **Feature Parity**: Always aim for feature parity with llama.cpp's implementation
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -278,6 +278,19 @@ include:
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "12"
+    cuda-minor-version: "8"
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-12-liquid-audio'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "liquid-audio"
+    dockerfile: "./backend/Dockerfile.python"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "8"
@@ -808,6 +821,19 @@ include:
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-13-liquid-audio'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "liquid-audio"
+    dockerfile: "./backend/Dockerfile.python"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -1088,6 +1114,19 @@ include:
    backend: "vibevoice"
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
+  - build-type: 'l4t'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/arm64'
+    tag-latest: 'auto'
+    tag-suffix: '-nvidia-l4t-cuda-13-arm64-liquid-audio'
+    runs-on: 'ubuntu-24.04-arm'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    ubuntu-version: '2404'
+    backend: "liquid-audio"
+    dockerfile: "./backend/Dockerfile.python"
+    context: "./"
  - build-type: 'l4t'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -1729,6 +1768,19 @@ include:
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'hipblas'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-rocm-hipblas-liquid-audio'
+    runs-on: 'ubuntu-latest'
+    base-image: "rocm/dev-ubuntu-24.04:7.2.1"
+    skip-drivers: 'false'
+    backend: "liquid-audio"
+    dockerfile: "./backend/Dockerfile.python"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'hipblas'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2177,6 +2229,19 @@ include:
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'intel'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-intel-liquid-audio'
+    runs-on: 'ubuntu-latest'
+    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+    skip-drivers: 'false'
+    backend: "liquid-audio"
+    dockerfile: "./backend/Dockerfile.python"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'intel'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -3503,6 +3568,20 @@ include:
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    platform-tag: 'amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-liquid-audio'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "liquid-audio"
+    dockerfile: "./backend/Dockerfile.python"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
--- a/.github/workflows/backend_merge.yml
+++ b/.github/workflows/backend_merge.yml
@@ -31,6 +31,13 @@ on:
 jobs:
  merge:
    runs-on: ubuntu-latest
+    # id-token: write is required for keyless cosign — the workflow
+    # exchanges the GitHub OIDC token for a short-lived Fulcio cert that
+    # signs each pushed manifest. Without this permission the runner
+    # cannot mint the token, and `cosign sign` fails with "no token".
+    permissions:
+      contents: read
+      id-token: write
    env:
      quay_username: ${{ secrets.quayUsername }}
    steps:
@@ -57,6 +64,15 @@ jobs:
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@master

+      # cosign signs each pushed manifest list with --recursive so the
+      # index and every per-arch entry get an attached Sigstore bundle.
+      # 2.2+ is required for --new-bundle-format.
+      - name: Install cosign
+        if: github.event_name != 'pull_request'
+        uses: sigstore/cosign-installer@v3
+        with:
+          cosign-release: 'v2.4.1'
+
      - name: Login to DockerHub
        if: github.event_name != 'pull_request'
        uses: docker/login-action@v4
@@ -88,6 +104,25 @@ jobs:
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.tag-suffix }},onlatest=true

+      # Source from ci-cache, not local-ai-backends.
+      #
+      # The build job pushes per-arch manifests to local-ai-backends with
+      # push-by-digest=true (no tag), then anchors a tagged copy into
+      # ci-cache so the manifest can be retrieved hours later when this
+      # merge runs. Quay's manifest GC, however, is per-repository: the
+      # anchor tag in ci-cache protects the manifest there, but the same
+      # digest in local-ai-backends has no tag in *that* repo and gets
+      # reaped independently. Sourcing local-ai-backends@<digest> here
+      # then fails with "manifest not found" — exactly the regression
+      # we hit on v4.2.2 (19/37 multiarch merges failed).
+      #
+      # ci-cache@<digest> resolves because we anchored it there. buildx
+      # imagetools create copies the manifest into local-ai-backends
+      # (cross-repo within the same registry, blobs already cross-mounted
+      # from the original push so no transfer needed) and publishes the
+      # manifest list with the user-facing tags. The resulting manifest
+      # list is fully self-contained in local-ai-backends — child digests
+      # only, no embedded references to ci-cache.
      - name: Create manifest list and push (quay)
        if: github.event_name != 'pull_request'
        working-directory: /tmp/digests
@@ -101,11 +136,26 @@ jobs:
          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          if [ -z "$tags" ]; then
            echo "No quay.io tags from docker/metadata-action; skipping quay merge"
-          else
-            # shellcheck disable=SC2086
-            docker buildx imagetools create $tags \
-              $(printf 'quay.io/go-skynet/local-ai-backends@sha256:%s ' *)
+            exit 0
          fi
+          # shellcheck disable=SC2086
+          docker buildx imagetools create $tags \
+            $(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
+          # Resolve the manifest-list digest (any tag points at it) so
+          # cosign can sign by digest. Signing by tag would leave the
+          # signature orphaned the next time the tag moves.
+          first_tag=$(jq -cr '
+            .tags | map(select(startswith("quay.io/"))) | .[0]
+          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
+          digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
+          # --recursive walks the list and signs every per-arch entry
+          # too — clients that resolve a tag to a platform-specific
+          # manifest before checking signatures need the per-arch
+          # signatures, not just the list-level one.
+          cosign sign --yes --recursive \
+            --new-bundle-format \
+            --registry-referrers-mode=oci-1-1 \
+            "quay.io/go-skynet/local-ai-backends@${digest}"

      - name: Create manifest list and push (dockerhub)
        if: github.event_name != 'pull_request'
@@ -120,11 +170,19 @@ jobs:
          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          if [ -z "$tags" ]; then
            echo "No dockerhub tags from docker/metadata-action; skipping dockerhub merge"
-          else
-            # shellcheck disable=SC2086
-            docker buildx imagetools create $tags \
-              $(printf 'localai/localai-backends@sha256:%s ' *)
+            exit 0
          fi
+          # shellcheck disable=SC2086
+          docker buildx imagetools create $tags \
+            $(printf 'localai/localai-backends@sha256:%s ' *)
+          first_tag=$(jq -cr '
+            .tags | map(select(startswith("localai/"))) | .[0]
+          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
+          digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
+          cosign sign --yes --recursive \
+            --new-bundle-format \
+            --registry-referrers-mode=oci-1-1 \
+            "localai/localai-backends@${digest}"

      - name: Inspect manifest
        if: github.event_name != 'pull_request'
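
Reviewer sketch for the merge steps in `backend_merge.yml`: both steps rely on a small shell idiom. Each build job records its digest as an empty file named after the hex part of the digest, and the merge job turns those file names back into `repo@sha256:<hex>` references with `printf`. The round trip below uses a temporary directory and made-up digests. The `build_refs` helper and the digest values are illustrative assumptions, not part of the workflow.

```sh
#!/bin/sh
# build_refs DIR
# Hypothetical helper: prints one "repo@sha256:<hex> " reference per
# file found in DIR, mirroring the printf call in the merge step.
build_refs() {
  # printf reuses its format string for every remaining argument,
  # so the glob expands to one reference per digest file.
  (cd "$1" && printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
}

# Simulate two per-arch build jobs recording their (made-up) digests.
digests_dir=$(mktemp -d)
for digest in sha256:aaaa1111 sha256:bbbb2222; do
  # ${digest#sha256:} strips the algorithm prefix, leaving only the hex.
  touch "${digests_dir}/${digest#sha256:}"
done

refs=$(build_refs "$digests_dir")
echo "$refs"
rm -rf "$digests_dir"
```

If the directory were empty, the unmatched glob would stay a literal `*` and produce a bogus reference, so the idiom depends on at least one digest file being present.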
--- a/.github/workflows/image.yml
+++ b/.github/workflows/image.yml
@@ -151,7 +151,11 @@
              ubuntu-codename: 'noble'

    core-image-merge:
-      if: github.repository == 'mudler/LocalAI'
+      # !cancelled(): without it, GHA's default `needs:` cascade skips the
+      # merge whenever any matrix cell of the parent build fails or is
+      # cancelled. Same fix as backend.yml's merge jobs — we still want to
+      # publish the manifest list for tag-suffixes whose legs all succeeded.
+      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: core-image-build
      uses: ./.github/workflows/image_merge.yml
      with:
@@ -164,7 +168,7 @@
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}

    gpu-vulkan-image-merge:
-      if: github.repository == 'mudler/LocalAI'
+      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: core-image-build
      uses: ./.github/workflows/image_merge.yml
      with:
@@ -175,7 +179,91 @@
        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-  
+
+    # Single-arch server-image merges. Same conceptual fix as the backend
+    # singletons in PR #9781: image_build.yml pushes by canonical digest
+    # only, so without a downstream merge step there's no tag for consumers
+    # (no :latest-gpu-nvidia-cuda-12, no :v<X>-gpu-nvidia-cuda-12, etc.).
+    # Each merge job needs only its parent build matrix and is filtered by
+    # tag-suffix in image_merge.yml's artifact-download pattern.
+    gpu-nvidia-cuda-12-image-merge:
+      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
+      needs: core-image-build
+      uses: ./.github/workflows/image_merge.yml
+      with:
+        tag-latest: 'auto'
+        tag-suffix: '-gpu-nvidia-cuda-12'
+      secrets:
+        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
+        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
+        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+
+    gpu-nvidia-cuda-13-image-merge:
+      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
+      needs: core-image-build
+      uses: ./.github/workflows/image_merge.yml
+      with:
+        tag-latest: 'auto'
+        tag-suffix: '-gpu-nvidia-cuda-13'
+      secrets:
+        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
+        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
+        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+
+    gpu-intel-image-merge:
+      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
+      needs: core-image-build
+      uses: ./.github/workflows/image_merge.yml
+      with:
+        tag-latest: 'auto'
+        tag-suffix: '-gpu-intel'
+      secrets:
+        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
+        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
+        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+
+    gpu-hipblas-image-merge:
+      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
+      needs: hipblas-jobs
+      uses: ./.github/workflows/image_merge.yml
+      with:
+        tag-latest: 'auto'
+        tag-suffix: '-gpu-hipblas'
+      secrets:
+        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
+        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
+        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+
+    nvidia-l4t-arm64-image-merge:
+      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
+      needs: gh-runner
+      uses: ./.github/workflows/image_merge.yml
+      with:
+        tag-latest: 'auto'
+        tag-suffix: '-nvidia-l4t-arm64'
+      secrets:
+        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
+        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
+        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+
+    nvidia-l4t-arm64-cuda-13-image-merge:
+      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
+      needs: gh-runner
+      uses: ./.github/workflows/image_merge.yml
+      with:
+        tag-latest: 'auto'
+        tag-suffix: '-nvidia-l4t-arm64-cuda-13'
+      secrets:
+        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
+        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
+        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+
    gh-runner:
      if: github.repository == 'mudler/LocalAI'
      uses: ./.github/workflows/image_build.yml
--- a/.github/workflows/image_build.yml
+++ b/.github/workflows/image_build.yml
@@ -185,11 +185,28 @@ jobs:
          digest="${{ steps.build.outputs.digest }}"
          touch "/tmp/digests/${digest#sha256:}"

+      # See .github/scripts/anchor-digest-in-cache.sh for why this is needed
+      # and how it interacts with image_merge.yml's cleanup step. Mirrors the
+      # same anchor in backend_build.yml — quay's per-repo manifest GC reaps
+      # untagged manifests in local-ai before the merge runs.
+      - name: Anchor digest in ci-cache so quay GC won't reap before merge
+        if: github.event_name != 'pull_request'
+        env:
+          TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
+          PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
+          DIGEST: ${{ steps.build.outputs.digest }}
+          SOURCE_IMAGE: quay.io/go-skynet/local-ai
+        run: .github/scripts/anchor-digest-in-cache.sh
+
      - name: Upload digest artifact
        if: github.event_name != 'pull_request'
        uses: actions/upload-artifact@v7
        with:
-          name: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}-${{ inputs.platform-tag }}
+          # `--` separator + 'single' placeholder for empty platform-tag —
+          # same pattern as backend_build.yml. Prevents prefix collisions
+          # in the merge-side glob (e.g. -nvidia-l4t-arm64 is a prefix of
+          # -nvidia-l4t-arm64-cuda-13).
+          name: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
          path: /tmp/digests/*
          if-no-files-found: error
          retention-days: 1
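
Reviewer sketch for the `--` separator in the artifact name: the prefix collision mentioned in the comment can be reproduced with shell globs. This assumes that shell `case` globbing behaves like the `download-artifact` `pattern` for a simple trailing `*`. The `matches` helper is hypothetical, and the artifact names are built by hand from the naming expressions in the diff.

```sh
#!/bin/sh
# matches PATTERN NAME: succeeds when NAME matches the shell glob PATTERN.
matches() {
  # $1 is left unquoted on purpose so it is treated as a glob pattern.
  case $2 in
    $1) return 0 ;;
    *)  return 1 ;;
  esac
}

# Old scheme: "<suffix>-<platform-tag>", with an empty platform tag.
old_l4t='digests-localai-nvidia-l4t-arm64-'
old_l4t_cuda13='digests-localai-nvidia-l4t-arm64-cuda-13-'
# New scheme: "<suffix>--<platform-tag>", with 'single' as the placeholder.
new_l4t='digests-localai-nvidia-l4t-arm64--single'
new_l4t_cuda13='digests-localai-nvidia-l4t-arm64-cuda-13--single'

# The old pattern for -nvidia-l4t-arm64 also swallows the cuda-13 sibling.
for name in "$old_l4t" "$old_l4t_cuda13"; do
  matches 'digests-localai-nvidia-l4t-arm64-*' "$name" && echo "old pattern matches: $name"
done

# The new pattern only matches its own tag-suffix.
for name in "$new_l4t" "$new_l4t_cuda13"; do
  if matches 'digests-localai-nvidia-l4t-arm64--*' "$name"; then
    echo "new pattern matches: $name"
  else
    echo "new pattern skips:   $name"
  fi
done
```

The double dash works because no tag-suffix in the matrix contains `--`, so a pattern ending in `--*` cannot be a prefix of a sibling suffix.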
--- a/.github/workflows/image_merge.yml
+++ b/.github/workflows/image_merge.yml
@@ -33,10 +33,22 @@ jobs:
    env:
      quay_username: ${{ secrets.quayUsername }}
    steps:
+      # Sparse checkout: needed for .github/scripts/ (the keepalive cleanup
+      # script). Skips the rest of the source tree.
+      - name: Checkout (.github/scripts only)
+        uses: actions/checkout@v6
+        with:
+          sparse-checkout: |
+            .github/scripts
+          sparse-checkout-cone-mode: false
+
      - name: Download digests
        uses: actions/download-artifact@v8
        with:
-          pattern: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}-*
+          # `--` separator anchors the glob so we don't over-match sibling
+          # tag-suffixes (e.g. -nvidia-l4t-arm64 vs -nvidia-l4t-arm64-cuda-13).
+          # Must stay in sync with image_build.yml's upload-artifact name.
+          pattern: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--*
          merge-multiple: true
          path: /tmp/digests

@@ -72,6 +84,13 @@ jobs:
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.tag-suffix }},onlatest=true

+      # Source from ci-cache, not local-ai. See backend_merge.yml for the
+      # detailed rationale — quay's manifest GC is per-repository, so the
+      # untagged digest in local-ai gets reaped while the same content lives
+      # tagged under ci-cache (anchored by image_build.yml). buildx imagetools
+      # create copies the manifest into local-ai (blobs already cross-mounted)
+      # and publishes the manifest list with user-facing tags. End state in
+      # local-ai is self-contained; no embedded reference to ci-cache.
      - name: Create manifest list and push (quay)
        working-directory: /tmp/digests
        run: |
@@ -82,7 +101,7 @@ jobs:
          else
            # shellcheck disable=SC2086
            docker buildx imagetools create $tags \
-              $(printf 'quay.io/go-skynet/local-ai@sha256:%s ' *)
+              $(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
          fi

      - name: Create manifest list and push (dockerhub)
@@ -107,6 +126,15 @@ jobs:
            docker buildx imagetools inspect "$first_tag"
          fi

+      # See .github/scripts/cleanup-keepalive-tags.sh for the best-effort
+      # semantics — fails soft when the registry credential isn't OAuth-scoped.
+      - name: Cleanup keepalive tags in ci-cache
+        if: github.event_name != 'pull_request' && success()
+        env:
+          TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
+          QUAY_TOKEN: ${{ secrets.quayPassword }}
+        run: .github/scripts/cleanup-keepalive-tags.sh
+
      - name: Job summary
        run: |
          set -euo pipefail
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
@@ -28,6 +28,7 @@ jobs:
      qwen-asr: ${{ steps.detect.outputs.qwen-asr }}
      nemo: ${{ steps.detect.outputs.nemo }}
      voxcpm: ${{ steps.detect.outputs.voxcpm }}
+      liquid-audio: ${{ steps.detect.outputs.liquid-audio }}
      llama-cpp-quantization: ${{ steps.detect.outputs.llama-cpp-quantization }}
      llama-cpp: ${{ steps.detect.outputs.llama-cpp }}
      ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
@@ -447,6 +448,32 @@ jobs:
        run: |
          make --jobs=5 --output-sync=target -C backend/python/voxcpm
          make --jobs=5 --output-sync=target -C backend/python/voxcpm test
+  # liquid-audio: LFM2.5-Audio any-to-any backend. The CI smoke test
+  # exercises Health() and LoadModel(mode:finetune) — fine-tune mode
+  # short-circuits before pulling weights (backend.py:192), so no
+  # HuggingFace download or GPU is needed. The full-inference path is
+  # gated on LIQUID_AUDIO_MODEL_ID, which we don't set here.
+  tests-liquid-audio:
+    needs: detect-changes
+    if: needs.detect-changes.outputs.liquid-audio == 'true' || needs.detect-changes.outputs.run-all == 'true'
+    runs-on: ubuntu-latest
+    steps:
+      - name: Clone
+        uses: actions/checkout@v6
+        with:
+          submodules: true
+      - name: Dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y build-essential ffmpeg
+          sudo apt-get install -y ca-certificates cmake curl patch python3-pip
+          # Install UV
+          curl -LsSf https://astral.sh/uv/install.sh | sh
+          pip install --user --no-cache-dir grpcio-tools==1.64.1
+      - name: Test liquid-audio
+        run: |
+          make --jobs=5 --output-sync=target -C backend/python/liquid-audio
+          make --jobs=5 --output-sync=target -C backend/python/liquid-audio test
  tests-llama-cpp-quantization:
    needs: detect-changes
    if: needs.detect-changes.outputs.llama-cpp-quantization == 'true' || needs.detect-changes.outputs.run-all == 'true'
--- a/.golangci.yml
+++ b/.golangci.yml
@@ -46,8 +46,52 @@ linters:
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.Fail. See .agents/coding-style.md.'
        - pattern: '^t\.FailNow$'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.FailNow. See .agents/coding-style.md.'
+        # In-process config should flow through ApplicationConfig / kong-bound
+        # CLI flags, not via os.Getenv. The CLI layer is the legitimate
+        # env→struct boundary (kong's `env:"..."` tag); anything deeper that
+        # reads env directly leaks process state into business logic and
+        # makes flags impossible to test or override per-request. Backend
+        # subprocesses, the system/capabilities probe, and a few places that
+        # read non-LocalAI env vars (HOME, PATH, AUTH_TOKEN passed by parent)
+        # are exempt — see linters.exclusions.rules below.
+        - pattern: '^os\.(Getenv|LookupEnv|Environ)$'
+          msg: 'Plumb config through ApplicationConfig (or the relevant CLI struct) instead of reading env directly. CLI entry points (core/cli/) bind env vars via kong''s `env:` tag — that is the only sanctioned env→struct boundary. See .agents/coding-style.md.'
  exclusions:
    paths:
      # Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
      - 'backend/go/whisper/sources'
      - 'docs/'
+    rules:
+      # CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
+      # boundary, and a handful of subcommands legitimately propagate values
+      # to spawned subprocesses (LLAMACPP_GRPC_SERVERS, MLX hostfile, ...).
+      - path: ^core/cli/
+        text: 'os\.(Getenv|LookupEnv|Environ)'
+        linters: [forbidigo]
+      # Backend subprocesses are independent binaries with their own env
+      # surface; they're not "in-process config" of the LocalAI server.
+      - path: ^backend/
+        text: 'os\.(Getenv|LookupEnv|Environ)'
+        linters: [forbidigo]
+      # System capability probe reads HOME, PATH-style vars to discover
+      # GPUs, default paths, etc. — not LocalAI config.
+      - path: ^pkg/system/
+        text: 'os\.(Getenv|LookupEnv|Environ)'
+        linters: [forbidigo]
+      # gRPC server reads AUTH_TOKEN passed in by the parent process at spawn
+      # time; model.Loader sets/inherits env to communicate with subprocesses.
+      - path: ^pkg/grpc/
+        text: 'os\.(Getenv|LookupEnv|Environ)'
+        linters: [forbidigo]
+      - path: ^pkg/model/
+        text: 'os\.(Getenv|LookupEnv|Environ)'
+        linters: [forbidigo]
+      # Top-level main binaries (local-ai, launcher) are entry points.
+      - path: ^cmd/
+        text: 'os\.(Getenv|LookupEnv|Environ)'
+        linters: [forbidigo]
+      # Tests legitimately read $HOME, $TMPDIR, and gating env vars
+      # (LOCALAI_COSIGN_LIVE, etc.) to skip live-network specs.
+      - path: _test\.go$
+        text: 'os\.(Getenv|LookupEnv|Environ)'
+        linters: [forbidigo]
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -31,6 +31,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 | [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
 | [.agents/adding-gallery-models.md](.agents/adding-gallery-models.md) | Adding GGUF models from HuggingFace to the model gallery |
 | [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) | LocalAI Assistant chat modality — adding admin tools to the in-process MCP server, editing skill prompts, keeping REST + MCP + skills in sync |
+| [.agents/backend-signing.md](.agents/backend-signing.md) | Backend OCI image signing (keyless cosign + sigstore-go) — producer-side CI setup, consumer-side gallery `verification:` block, strict mode (`LOCALAI_REQUIRE_BACKEND_INTEGRITY`), revocation via `not_before` |

 ## Quick Reference

--- a/Makefile
+++ b/Makefile
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio

 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -463,6 +463,7 @@ prepare-test-extra: protogen-python
 	$(MAKE) -C backend/python/vllm-omni
 	$(MAKE) -C backend/python/sglang
 	$(MAKE) -C backend/python/vibevoice
+	$(MAKE) -C backend/python/liquid-audio
 	$(MAKE) -C backend/python/moonshine
 	$(MAKE) -C backend/python/pocket-tts
 	$(MAKE) -C backend/python/qwen-tts
@@ -488,6 +489,7 @@ test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/vllm test
 	$(MAKE) -C backend/python/vllm-omni test
 	$(MAKE) -C backend/python/vibevoice test
+	$(MAKE) -C backend/python/liquid-audio test
 	$(MAKE) -C backend/python/moonshine test
 	$(MAKE) -C backend/python/pocket-tts test
 	$(MAKE) -C backend/python/qwen-tts test
@@ -1092,6 +1094,7 @@ BACKEND_SGLANG = sglang|python|.|false|true
 BACKEND_DIFFUSERS = diffusers|python|.|--progress=plain|true
 BACKEND_CHATTERBOX = chatterbox|python|.|false|true
 BACKEND_VIBEVOICE = vibevoice|python|.|--progress=plain|true
+BACKEND_LIQUID_AUDIO = liquid-audio|python|.|--progress=plain|true
 BACKEND_MOONSHINE = moonshine|python|.|false|true
 BACKEND_POCKET_TTS = pocket-tts|python|.|false|true
 BACKEND_QWEN_TTS = qwen-tts|python|.|false|true
@@ -1169,6 +1172,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SGLANG)))
 $(eval $(call generate-docker-build-target,$(BACKEND_DIFFUSERS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_CHATTERBOX)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE)))
+$(eval $(call generate-docker-build-target,$(BACKEND_LIQUID_AUDIO)))
 $(eval $(call generate-docker-build-target,$(BACKEND_MOONSHINE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_POCKET_TTS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_QWEN_TTS)))
@@ -1197,7 +1201,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar

-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx

 ########################################################
 ### Mock Backend for E2E Tests
--- a/backend/backend.proto
+++ b/backend/backend.proto
@@ -48,6 +48,11 @@ service Backend {

  rpc AudioTransform(AudioTransformRequest) returns (AudioTransformResult) {}
  rpc AudioTransformStream(stream AudioTransformFrameRequest) returns (stream AudioTransformFrameResponse) {}
+  // AudioToAudioStream is the bidirectional any-to-any S2S RPC. Backends
+  // that load a speech-to-speech model consume input audio frames and emit
+  // interleaved audio + transcript + tool-call deltas as typed events.
+  // Backends without S2S support return UNIMPLEMENTED.
+  rpc AudioToAudioStream(stream AudioToAudioRequest) returns (stream AudioToAudioResponse) {}

  rpc ModelMetadata(ModelOptions) returns (ModelMetadataResponse) {}

@@ -768,6 +773,93 @@ message AudioTransformFrameResponse {
  int64 frame_index = 2;
 }

+// === AudioToAudioStream messages =========================================
+//
+// Bidirectional stream between the LocalAI core and an any-to-any audio
+// model. The client opens the stream with a Config payload, then alternates
+// Frame (input audio) and Control (turn boundaries, function-call results,
+// session updates) payloads. The server streams back typed events: audio
+// frames carry PCM in `pcm`; transcript / tool-call deltas carry JSON in
+// `meta`; the stream ends with a `response.done` (success) or `error` event.
+
+message AudioToAudioRequest {
+  oneof payload {
+    AudioToAudioConfig  config  = 1;
+    AudioToAudioFrame   frame   = 2;
+    AudioToAudioControl control = 3;
+  }
+}
+
+message AudioToAudioConfig {
+  // Sample rate in Hz of client→server PCM audio. 0 => backend default
+  // (16 kHz for the LFM2-Audio Conformer encoder).
+  int32 input_sample_rate = 1;
+  // Preferred server→client audio rate. 0 => backend default
+  // (24 kHz for the LFM2-Audio vocoder).
+  int32 output_sample_rate = 2;
+  // Optional system prompt override. Empty => backend chooses based on
+  // mode (e.g. "Respond with interleaved text and audio.").
+  string system_prompt = 3;
+  // Optional baked-voice id. Models that only ship a fixed set of
+  // voices (e.g. LFM2-Audio: us_male/us_female/uk_male/uk_female) match
+  // this against their voice table; an empty string keeps the default.
+  string voice = 4;
+  // JSON-encoded array of tool definitions in OpenAI Chat Completions
+  // format. Empty => no tools.
+  string tools = 5;
+  // Free-form sampling / decoding parameters (temperature, top_k,
+  // max_new_tokens, audio_top_k, etc).
+  map<string, string> params = 6;
+  // True => reset any session-scoped state before processing further
+  // frames on this stream. The first Config implicitly resets.
+  bool reset = 7;
+}
+
+message AudioToAudioFrame {
+  // Raw PCM s16le mono at config.input_sample_rate. Empty pcm + end_of_input
+  // is a valid "user finished speaking" marker without trailing audio.
+  bytes pcm = 1;
+  // Marks the last frame of a user turn. The backend may begin emitting
+  // a response immediately after seeing this.
+  bool end_of_input = 2;
+}
+
+message AudioToAudioControl {
+  // Free-form control event names. Initial set:
+  //   "input_audio_buffer.commit"     — user finished speaking
+  //   "response.cancel"               — abort in-flight generation
+  //   "conversation.item.create"      — inject a non-audio item (e.g.
+  //                                     function_call_output as JSON in
+  //                                     `payload`)
+  //   "session.update"                — re-configure mid-stream
+  string event = 1;
+  // Event-specific JSON payload.
+  bytes payload = 2;
+}
+
+message AudioToAudioResponse {
+  // Event identifies what this frame carries. Mirrors the OpenAI Realtime
+  // API server-event names where applicable. Initial set:
+  //   "response.audio.delta"
+  //   "response.audio_transcript.delta"
+  //   "response.function_call_arguments.delta"
+  //   "response.function_call_arguments.done"
+  //   "response.done"
+  //   "error"
+  string event = 1;
+  // Populated when event = response.audio.delta.
+  bytes pcm = 2;
+  // Populated alongside pcm to identify its rate. 0 => same as the
+  // session's negotiated output_sample_rate.
+  int32 sample_rate = 3;
+  // JSON payload for non-PCM events (transcript chunk, tool args, error
+  // body).
+  bytes meta = 4;
+  // Monotonic per-stream counter, useful for client reordering and
+  // debugging.
+  int64 sequence = 5;
+}
+
 message ModelMetadataResponse {
  bool supports_thinking = 1;
  string rendered_template = 2;  // The rendered chat template with enable_thinking=true (empty if not applicable)
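
The message ordering described in the AudioToAudioStream comments above (Config first, then Frame payloads, with an empty-PCM `end_of_input` frame closing the user turn) can be sketched in Python. This is only an illustration under stated assumptions: the dataclasses `Config`, `Frame` and `Request` are hypothetical stand-ins for the generated `backend_pb2` messages, and `request_stream` and `pcm_duration_ms` are helper names invented here, not part of LocalAI.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Union


@dataclass
class Config:
    """Hypothetical stand-in for AudioToAudioConfig (subset of fields)."""
    input_sample_rate: int = 0   # 0 => backend default
    output_sample_rate: int = 0  # 0 => backend default
    voice: str = ""              # empty => default voice


@dataclass
class Frame:
    """Hypothetical stand-in for AudioToAudioFrame."""
    pcm: bytes = b""             # raw s16le mono PCM
    end_of_input: bool = False   # marks the last frame of a user turn


@dataclass
class Request:
    """Hypothetical stand-in for AudioToAudioRequest (the oneof payload)."""
    payload: Union[Config, Frame]


def request_stream(config: Config, chunks: Iterable[bytes]) -> Iterator[Request]:
    """Yield one user turn: Config, then audio frames, then the end marker."""
    yield Request(config)
    for chunk in chunks:
        if chunk:  # drop empty chunks so only the final marker has empty pcm
            yield Request(Frame(pcm=chunk))
    # Empty pcm + end_of_input is the "user finished speaking" marker.
    yield Request(Frame(pcm=b"", end_of_input=True))


def pcm_duration_ms(pcm: bytes, sample_rate: int) -> float:
    """Duration of s16le mono PCM in milliseconds (2 bytes per sample)."""
    return len(pcm) * 1000.0 / (2 * sample_rate)
```

With real generated stubs, a generator of this shape would typically be passed as the request iterator of the bidirectional call; the exact stub name and call signature depend on the generated code and are not shown here.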
--- a/backend/cpp/ds4/Makefile
+++ b/backend/cpp/ds4/Makefile
@@ -1,10 +1,10 @@
 # ds4 backend Makefile.
 #
 # Upstream pin lives below as DS4_VERSION?= so the bump-deps bot
 # (.github/bump_deps.sh) can find and update it - matches the
 # llama-cpp / ik-llama-cpp / turboquant convention.

-DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
+DS4_VERSION?=599e49d253971451f710cb8323344e789906ed6c
 DS4_REPO?=https://github.com/antirez/ds4

 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=eb570eb96689c235933b813693ca28ab9d3d26de
+IK_LLAMA_VERSION?=77413bc900f9a2bfd8a5407f184427bcc0825f6c
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=1ec7ba0c14f33f17e980daeeda5f35b225d41994
+LLAMA_VERSION?=5cbaa5e69e09bde3334cd8c355570553a0dca027
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -32,6 +32,7 @@
 #include <grpcpp/health_check_service_interface.h>
 #include <grpcpp/security/server_credentials.h>
 #include <regex>
+#include <algorithm>
 #include <atomic>
 #include <cstdlib>
 #include <fstream>
@@ -450,6 +451,8 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
        // vector; the turboquant fork still uses the legacy scalar. The
        // LOCALAI_LEGACY_LLAMA_CPP_SPEC macro is injected by
        // backend/cpp/turboquant/patch-grpc-server.sh for fork builds only.
+        // Upstream renamed COMMON_SPECULATIVE_TYPE_DRAFT -> ..._DRAFT_SIMPLE
+        // in ggml-org/llama.cpp#22964; the fork still uses the old name.
 #ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
        if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
            params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
@@ -458,7 +461,7 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
        const bool no_spec_type = params.speculative.types.empty() ||
            (params.speculative.types.size() == 1 && params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE);
        if (no_spec_type) {
-            params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT };
+            params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE };
        }
 #endif
    }
@@ -685,6 +688,136 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                    // If conversion fails, keep default value (8)
                }
            }
+
+        // --- physical batch size (upstream -ub / --ubatch-size) ---
+        // Note: line ~482 already aliases n_ubatch to n_batch as a default; this
+        // option lets users decouple the two (useful for embeddings/rerank).
+        } else if (!strcmp(optname, "n_ubatch") || !strcmp(optname, "ubatch")) {
+            if (optval != NULL) {
+                try { params.n_ubatch = std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- main-model batch threads (upstream -tb / --threads-batch) ---
+        } else if (!strcmp(optname, "threads_batch") || !strcmp(optname, "n_threads_batch")) {
+            if (optval != NULL) {
+                try {
+                    int n = std::stoi(optval_str);
+                    if (n <= 0) n = (int)std::thread::hardware_concurrency();
+                    params.cpuparams_batch.n_threads = n;
+                } catch (...) {}
+            }
+
+        // --- pooling type for embeddings (upstream --pooling) ---
+        } else if (!strcmp(optname, "pooling_type") || !strcmp(optname, "pooling")) {
+            if (optval != NULL) {
+                if      (optval_str == "none") params.pooling_type = LLAMA_POOLING_TYPE_NONE;
+                else if (optval_str == "mean") params.pooling_type = LLAMA_POOLING_TYPE_MEAN;
+                else if (optval_str == "cls")  params.pooling_type = LLAMA_POOLING_TYPE_CLS;
+                else if (optval_str == "last") params.pooling_type = LLAMA_POOLING_TYPE_LAST;
+                else if (optval_str == "rank") params.pooling_type = LLAMA_POOLING_TYPE_RANK;
+                // unknown values silently leave UNSPECIFIED (auto-detect)
+            }
+
+        // --- llama log verbosity threshold (upstream -lv / --verbosity) ---
+        } else if (!strcmp(optname, "verbosity")) {
+            if (optval != NULL) {
+                try { params.verbosity = std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- O_DIRECT model loading (upstream --direct-io) ---
+        } else if (!strcmp(optname, "direct_io") || !strcmp(optname, "use_direct_io")) {
+            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
+                params.use_direct_io = true;
+            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
+                params.use_direct_io = false;
+            }
+
+        // --- embedding normalization (upstream --embd-normalize) ---
+        // -1 none, 0 max-abs, 1 taxicab, 2 L2 (default), >2 p-norm
+        } else if (!strcmp(optname, "embd_normalize") || !strcmp(optname, "embedding_normalize")) {
+            if (optval != NULL) {
+                try { params.embd_normalize = std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- reasoning parser (upstream --reasoning-format) ---
+        // Picks the parser for <think> blocks emitted by reasoning models.
+        // none / auto / deepseek / deepseek-legacy
+        } else if (!strcmp(optname, "reasoning_format")) {
+            if (optval != NULL) {
+                if      (optval_str == "none")             params.reasoning_format = COMMON_REASONING_FORMAT_NONE;
+                else if (optval_str == "auto")             params.reasoning_format = COMMON_REASONING_FORMAT_AUTO;
+                else if (optval_str == "deepseek")         params.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
+                else if (optval_str == "deepseek-legacy" || optval_str == "deepseek_legacy")
+                                                            params.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY;
+                // unknown values silently keep the upstream default (DEEPSEEK)
+            }
+
+        // --- reasoning budget (upstream --reasoning-budget) ---
+        // -1 unlimited, 0 disabled, >0 token budget for thinking blocks.
+        // Distinct from per-request `enable_thinking` (chat_template_kwargs).
+        } else if (!strcmp(optname, "enable_reasoning") || !strcmp(optname, "reasoning_budget")) {
+            if (optval != NULL) {
+                try { params.enable_reasoning = std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- prefill assistant turn (upstream --no-prefill-assistant) ---
+        } else if (!strcmp(optname, "prefill_assistant")) {
+            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
+                params.prefill_assistant = true;
+            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
+                params.prefill_assistant = false;
+            }
+
+        // --- mmproj GPU offload (upstream --no-mmproj-offload, inverted) ---
+        } else if (!strcmp(optname, "mmproj_use_gpu") || !strcmp(optname, "mmproj_offload")) {
+            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
+                params.mmproj_use_gpu = true;
+            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
+                params.mmproj_use_gpu = false;
+            }
+
+        // --- per-image vision token budget (upstream --image-min/max-tokens) ---
+        } else if (!strcmp(optname, "image_min_tokens")) {
+            if (optval != NULL) {
+                try { params.image_min_tokens = std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "image_max_tokens")) {
+            if (optval != NULL) {
+                try { params.image_max_tokens = std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- main-model tensor buffer overrides (upstream --override-tensor) ---
+        // Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...
+        // Mirrors the existing `draft_override_tensor` parser below.
+        } else if (!strcmp(optname, "override_tensor") || !strcmp(optname, "tensor_buft_overrides")) {
+            ggml_backend_load_all();
+            std::map<std::string, ggml_backend_buffer_type_t> buft_list;
+            for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
+                auto * dev = ggml_backend_dev_get(i);
+                auto * buft = ggml_backend_dev_buffer_type(dev);
+                if (buft) {
+                    buft_list[ggml_backend_buft_name(buft)] = buft;
+                }
+            }
+            static std::list<std::string> override_names;
+            std::string cur;
+            auto flush = [&](const std::string & spec) {
+                auto pos = spec.find('=');
+                if (pos == std::string::npos) return;
+                const std::string name = spec.substr(0, pos);
+                const std::string type = spec.substr(pos + 1);
+                auto it = buft_list.find(type);
+                if (it == buft_list.end()) return; // unknown buffer type: ignore
+                override_names.push_back(name);
+                params.tensor_buft_overrides.push_back(
+                    {override_names.back().c_str(), it->second});
+            };
+            for (char c : optval_str) {
+                if (c == ',') { if (!cur.empty()) { flush(cur); cur.clear(); } }
+                else { cur.push_back(c); }
+            }
+            if (!cur.empty()) flush(cur);
+
        // Speculative decoding options
        } else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
 #ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
@@ -701,16 +834,27 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            // Upstream switched to a vector of types (comma-separated for multi-type
            // chaining via common_speculative_types_from_names). We keep accepting a
            // single value here, but also tolerate comma-separated lists.
+            //
+            // ggml-org/llama.cpp#22964 also renamed the registered names from
+            // underscore- to dash-separated form, and replaced the bare
+            // `draft`/`eagle3` aliases with `draft-simple`/`draft-eagle3`. We
+            // normalize each token here so existing model configs keep working.
+            auto normalize_spec_name = [](std::string s) -> std::string {
+                std::replace(s.begin(), s.end(), '_', '-');
+                if (s == "draft")  return "draft-simple";
+                if (s == "eagle3") return "draft-eagle3";
+                return s;
+            };
            std::vector<std::string> names;
            std::string item;
            for (char c : optval_str) {
                if (c == ',') {
-                    if (!item.empty()) { names.push_back(item); item.clear(); }
+                    if (!item.empty()) { names.push_back(normalize_spec_name(item)); item.clear(); }
                } else {
                    item.push_back(c);
                }
            }
-            if (!item.empty()) names.push_back(item);
+            if (!item.empty()) names.push_back(normalize_spec_name(item));
            auto parsed = common_speculative_types_from_names(names);
            if (!parsed.empty()) {
                params.speculative.types = parsed;
@@ -2794,7 +2938,9 @@ public:
            }
        }

-        int embd_normalize = 2; // default to Euclidean/L2 norm
+        // Honor the load-time embd_normalize set via options:embd_normalize.
+        // -1 none, 0 max-abs, 1 taxicab, 2 L2 (default), >2 p-norm.
+        int embd_normalize = params_base.embd_normalize;
        // create and queue the task
        auto rd = ctx_server.get_response_reader();
        {
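
For readers checking how an existing `spec_type` value maps to the renamed upstream names, the normalization in the hunk above can be restated as a small Python sketch. It is a port for illustration only, not code from the backend: `parse_spec_types` is a hypothetical helper name, and `foo_bar` in the usage below is a made-up token, not a real speculative type.

```python
from typing import List


def normalize_spec_name(name: str) -> str:
    """Mirror of the C++ lambda: underscores become dashes, bare aliases expand."""
    name = name.replace("_", "-")
    if name == "draft":
        return "draft-simple"
    if name == "eagle3":
        return "draft-eagle3"
    return name


def parse_spec_types(value: str) -> List[str]:
    """Split a comma-separated option value, dropping empty tokens."""
    return [normalize_spec_name(token) for token in value.split(",") if token]
```

Under this sketch, `parse_spec_types("draft")` yields `["draft-simple"]` and `parse_spec_types("eagle3,,foo_bar")` yields `["draft-eagle3", "foo-bar"]`. As in the C++ loop, tokens are not trimmed, so surrounding whitespace would be kept as part of the name.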
--- a/backend/cpp/turboquant/Makefile
+++ b/backend/cpp/turboquant/Makefile
@@ -1,7 +1,7 @@

 # Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
 # Auto-bumped nightly by .github/workflows/bump_deps.yaml.
-TURBOQUANT_VERSION?=69d8e4be47243e83b3d0d71e932bc7aa61c644dc
+TURBOQUANT_VERSION?=5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403
 LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant

 CMAKE_ARGS?=
--- a/backend/go/stablediffusion-ggml/Makefile
+++ b/backend/go/stablediffusion-ggml/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # stablediffusion.cpp (ggml)
 STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
-STABLEDIFFUSION_GGML_VERSION?=90e87bc846f17059771efb8aaa31e9ef0cab6f78
+STABLEDIFFUSION_GGML_VERSION?=bd17f53b7386fb5f60e8587b75e73c4b2fed3426

 CMAKE_ARGS+=-DGGML_MAX_NAME=128

--- a/backend/go/whisper/Makefile
+++ b/backend/go/whisper/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # whisper.cpp version
 WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
-WHISPER_CPP_VERSION?=c33c5618b72bb345df029b730b36bc0e369845a3
+WHISPER_CPP_VERSION?=afa2ea544fb4b0448916b4a31ecd33c8685bd482
 SO_TARGET?=libgowhisper.so

 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -847,6 +847,35 @@
    nvidia-l4t-cuda-12: "nvidia-l4t-vibevoice"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vibevoice"
  icon: https://avatars.githubusercontent.com/u/6154722?s=200&v=4
+- &liquid-audio
+  urls:
+    - https://github.com/Liquid4All/liquid-audio
+    - https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B
+  description: |
+    LiquidAI LFM2 / LFM2.5 Audio Python backend. End-to-end speech-to-speech, ASR,
+    TTS (4 baked voices), and text chat from a single 1.5B model. Wraps the
+    upstream `liquid-audio` package; supports fine-tuning via LocalAI's
+    /v1/fine-tuning/jobs endpoint.
+  tags:
+    - speech-to-speech
+    - any-to-any
+    - text-to-speech
+    - speech-to-text
+    - TTS
+    - ASR
+    - realtime
+  license: LFM-Open-License-v1.0
+  name: "liquid-audio"
+  alias: "liquid-audio"
+  capabilities:
+    nvidia: "cuda12-liquid-audio"
+    intel: "intel-liquid-audio"
+    amd: "rocm-liquid-audio"
+    default: "cpu-liquid-audio"
+    nvidia-cuda-13: "cuda13-liquid-audio"
+    nvidia-cuda-12: "cuda12-liquid-audio"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-liquid-audio"
+  icon: https://cdn-avatars.huggingface.co/v1/production/uploads/61b8e2ba285851687028d395/7_6D7rWrLxp2hb6OHSV1p.png
 - &qwen-tts
  urls:
    - https://github.com/QwenLM/Qwen3-TTS
@@ -3437,6 +3466,77 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-vibevoice"
  mirrors:
    - localai/localai-backends:master-metal-darwin-arm64-vibevoice
+## liquid-audio
+- !!merge <<: *liquid-audio
+  name: "liquid-audio-development"
+  capabilities:
+    nvidia: "cuda12-liquid-audio-development"
+    intel: "intel-liquid-audio-development"
+    amd: "rocm-liquid-audio-development"
+    default: "cpu-liquid-audio-development"
+    nvidia-cuda-13: "cuda13-liquid-audio-development"
+    nvidia-cuda-12: "cuda12-liquid-audio-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-liquid-audio-development"
+- !!merge <<: *liquid-audio
+  name: "cpu-liquid-audio"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-liquid-audio"
+  mirrors:
+    - localai/localai-backends:latest-cpu-liquid-audio
+- !!merge <<: *liquid-audio
+  name: "cpu-liquid-audio-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-liquid-audio"
+  mirrors:
+    - localai/localai-backends:master-cpu-liquid-audio
+- !!merge <<: *liquid-audio
+  name: "cuda12-liquid-audio"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-liquid-audio"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-12-liquid-audio
+- !!merge <<: *liquid-audio
+  name: "cuda12-liquid-audio-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-liquid-audio"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-12-liquid-audio
+- !!merge <<: *liquid-audio
+  name: "cuda13-liquid-audio"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-liquid-audio"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-13-liquid-audio
+- !!merge <<: *liquid-audio
+  name: "cuda13-liquid-audio-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-liquid-audio"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-13-liquid-audio
+- !!merge <<: *liquid-audio
+  name: "intel-liquid-audio"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-liquid-audio"
+  mirrors:
+    - localai/localai-backends:latest-gpu-intel-liquid-audio
+- !!merge <<: *liquid-audio
+  name: "intel-liquid-audio-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-liquid-audio"
+  mirrors:
+    - localai/localai-backends:master-gpu-intel-liquid-audio
+- !!merge <<: *liquid-audio
+  name: "rocm-liquid-audio"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-liquid-audio"
+  mirrors:
+    - localai/localai-backends:latest-gpu-rocm-hipblas-liquid-audio
+- !!merge <<: *liquid-audio
+  name: "rocm-liquid-audio-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-liquid-audio"
+  mirrors:
+    - localai/localai-backends:master-gpu-rocm-hipblas-liquid-audio
+- !!merge <<: *liquid-audio
+  name: "cuda13-nvidia-l4t-arm64-liquid-audio"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-liquid-audio"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-liquid-audio
+- !!merge <<: *liquid-audio
+  name: "cuda13-nvidia-l4t-arm64-liquid-audio-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-liquid-audio"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-liquid-audio
 ## qwen-tts
 - !!merge <<: *qwen-tts
  name: "qwen-tts-development"
--- a/backend/python/liquid-audio/Makefile
+++ b/backend/python/liquid-audio/Makefile
@@ -0,0 +1,23 @@
+.PHONY: liquid-audio
+liquid-audio:
+	bash install.sh
+
+.PHONY: run
+run: liquid-audio
+	@echo "Running liquid-audio..."
+	bash run.sh
+	@echo "liquid-audio run."
+
+.PHONY: test
+test: liquid-audio
+	@echo "Testing liquid-audio..."
+	bash test.sh
+	@echo "liquid-audio tested."
+
+.PHONY: protogen-clean
+protogen-clean:
+	$(RM) backend_pb2_grpc.py backend_pb2.py
+
+.PHONY: clean
+clean: protogen-clean
+	rm -rf venv __pycache__
--- a/backend/python/liquid-audio/backend.py
+++ b/backend/python/liquid-audio/backend.py
@@ -0,0 +1,871 @@
+#!/usr/bin/env python3
+"""
+Liquid Audio backend for LocalAI.
+
+Wraps LiquidAI's `liquid-audio` Python package (https://github.com/Liquid4All/liquid-audio).
+The same model serves four roles, selected by the `mode` option at load time:
+chat, asr, tts, s2s. Fine-tuning is exposed via StartFineTune; loading with
+mode=finetune skips the in-memory inference model.
+"""
+from concurrent import futures
+import argparse
+import json
+import os
+import queue
+import signal
+import sys
+import threading
+import time
+import traceback
+import uuid
+
+import grpc
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
+from grpc_auth import get_auth_interceptors  # noqa: E402
+from python_utils import parse_options  # noqa: E402
+
+import backend_pb2  # noqa: E402
+import backend_pb2_grpc  # noqa: E402
+
+_ONE_DAY_IN_SECONDS = 60 * 60 * 24
+MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))
+
+# Voice id → system prompt. The model only ships these four voices.
+VOICE_PROMPTS = {
+    "us_male":   "Perform TTS. Use the US male voice.",
+    "us_female": "Perform TTS. Use the US female voice.",
+    "uk_male":   "Perform TTS. Use the UK male voice.",
+    "uk_female": "Perform TTS. Use the UK female voice.",
+}
+DEFAULT_VOICE = "us_female"
+
+# Special-token IDs that LFM2-Audio emits to delimit modality boundaries.
+# Sourced from liquid_audio/model/lfm2_audio.py (see generate_sequential/_sample_*).
+TEXT_END_TOKEN = 130        # <|text_end|>
+AUDIO_START_TOKEN = 128     # <|audio_start|>
+IM_END_TOKEN = 7            # <|im_end|>
+AUDIO_EOS_CODE = 2048       # signals end-of-audio in any codebook position
+
+_PATCHED_LOCAL_PATHS = False
+
+
+def _patch_liquid_audio_local_paths():
+    """Make liquid_audio.utils.get_model_dir() tolerate local directories.
+
+    Upstream always passes its argument to huggingface_hub.snapshot_download,
+    which only accepts `owner/repo` ids. LocalAI's gallery hands us absolute
+    paths under <ModelPath>/<owner>/<repo>, so we intercept snapshot_download
+    in the liquid_audio.utils namespace and return the directory as-is when
+    it already exists on disk. Idempotent.
+    """
+    global _PATCHED_LOCAL_PATHS
+    if _PATCHED_LOCAL_PATHS:
+        return
+    import liquid_audio.utils as _la_utils
+    _orig_snapshot_download = _la_utils.snapshot_download
+
+    def _local_first_snapshot_download(repo_id, revision=None, **kwargs):
+        if isinstance(repo_id, (str, os.PathLike)) and os.path.isdir(str(repo_id)):
+            return str(repo_id)
+        return _orig_snapshot_download(repo_id, revision=revision, **kwargs)
+
+    _la_utils.snapshot_download = _local_first_snapshot_download
+    _PATCHED_LOCAL_PATHS = True
+
+
+def _select_device():
+    import torch
+    if torch.cuda.is_available():
+        return "cuda"
+    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
+        return "mps"
+    return "cpu"
+
+
+class ActiveJob:
+    """Tracks an in-flight fine-tune so FineTuneProgress can stream from its queue."""
+
+    def __init__(self, job_id):
+        self.job_id = job_id
+        self.progress_queue = queue.Queue()
+        self.thread = None
+        self.stopped = False
+        self.completed = False
+        self.error = None
+
+
+class BackendServicer(backend_pb2_grpc.BackendServicer):
+    def __init__(self):
+        self.processor = None
+        self.model = None
+        self.device = "cpu"
+        self.dtype = None
+        self.options = {}
+        self.model_id = None
+        self.active_job = None
+
+    @property
+    def mode(self):
+        return str(self.options.get("mode", "chat")).lower()
+
+    @property
+    def voice(self):
+        v = str(self.options.get("voice", DEFAULT_VOICE)).lower()
+        return v if v in VOICE_PROMPTS else DEFAULT_VOICE
+
+
+    def Free(self, request, context):
+        # Called by LocalAI when unloading the model. Drop GPU tensors so the
+        # next load starts from a clean state instead of bumping into OOM.
+        try:
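+            for attr in ("model", "processor"):
+                # Reset to None rather than delattr so the `is None` guards in
+                # TTS, AudioTranscription and the s2s stream still work after
+                # an unload instead of raising AttributeError.
+                setattr(self, attr, None)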
+            import gc
+            gc.collect()
+            try:
+                import torch
+                if torch.cuda.is_available():
+                    torch.cuda.empty_cache()
+            except Exception:
+                pass
+            return backend_pb2.Result(success=True, message="OK")
+        except Exception as exc:
+            print(f"Free failed: {exc}", file=sys.stderr)
+            return backend_pb2.Result(success=False, message=str(exc))
+
+
+    def Health(self, request, context):
+        return backend_pb2.Reply(message=bytes("OK", 'utf-8'))
+
+
+    def LoadModel(self, request, context):
+        try:
+            import torch
+
+            self.options = parse_options(request.Options)
+            if self.options.get("voice") and str(self.options["voice"]).lower() not in VOICE_PROMPTS:
+                print(f"Warning: unknown voice '{self.options['voice']}'; defaulting to '{DEFAULT_VOICE}'",
+                      file=sys.stderr)
+
+            requested_device = self.options.get("device")
+            self.device = requested_device or _select_device()
+            if self.device == "cuda" and not torch.cuda.is_available():
+                return backend_pb2.Result(success=False, message="CUDA requested but not available")
+            if self.device == "mps" and not (hasattr(torch.backends, "mps") and
+                                             torch.backends.mps.is_available()):
+                print("MPS not available; falling back to CPU", file=sys.stderr)
+                self.device = "cpu"
+
+            dtype_name = str(self.options.get("dtype", "bfloat16")).lower()
+            self.dtype = {
+                "bfloat16": torch.bfloat16,
+                "bf16":     torch.bfloat16,
+                "float16":  torch.float16,
+                "fp16":     torch.float16,
+                "half":     torch.float16,
+                "float32":  torch.float32,
+                "fp32":     torch.float32,
+            }.get(dtype_name, torch.bfloat16)
+
+            # request.Model holds the raw `parameters.model` value (an HF
+            # repo id like "LiquidAI/LFM2.5-Audio-1.5B"); request.ModelFile
+            # is LocalAI's ModelPath-prefixed local copy that exists only
+            # when the gallery supplied a `files:` list. Mirror the
+            # transformers/vibevoice convention: prefer the repo id and
+            # only switch to the local path if it's been staged on disk.
+            model_id = request.Model
+            if not model_id:
+                model_id = request.ModelFile
+            if not model_id:
+                return backend_pb2.Result(success=False, message="No model identifier provided")
+            if request.ModelFile and os.path.isdir(request.ModelFile):
+                model_id = request.ModelFile
+            self.model_id = model_id
+
+            # Pure fine-tune jobs don't need an in-memory inference model — the
+            # Trainer instantiates its own copy at StartFineTune time.
+            if self.mode == "finetune":
+                print(f"Loaded liquid-audio backend in fine-tune mode (model id: {model_id})",
+                      file=sys.stderr)
+                return backend_pb2.Result(success=True, message="OK")
+
+            from liquid_audio import LFM2AudioModel, LFM2AudioProcessor
+
+            # liquid_audio's from_pretrained unconditionally routes through
+            # huggingface_hub.snapshot_download, which rejects local paths
+            # (HFValidationError on `/models/LiquidAI/LFM2.5-Audio-1.5B`).
+            # When LocalAI's gallery has already staged the weights on disk,
+            # short-circuit the download to return the local directory.
+            _patch_liquid_audio_local_paths()
+
+            print(f"Loading liquid-audio model '{model_id}' on {self.device} ({self.dtype})",
+                  file=sys.stderr)
+            self.processor = LFM2AudioProcessor.from_pretrained(model_id, device=self.device).eval()
+            self.model = LFM2AudioModel.from_pretrained(
+                model_id, device=self.device, dtype=self.dtype
+            ).eval()
+
+            print(f"Liquid-audio mode={self.mode}, voice={self.voice}", file=sys.stderr)
+            return backend_pb2.Result(success=True, message="OK")
+
+        except Exception as exc:
+            print(f"LoadModel failed: {exc}", file=sys.stderr)
+            print(traceback.format_exc(), file=sys.stderr)
+            return backend_pb2.Result(success=False, message=str(exc))
+
+
+    def Predict(self, request, context):
+        try:
+            text = "".join(self._generate_text_stream(request))
+            return backend_pb2.Reply(message=text.encode("utf-8"))
+        except Exception as exc:
+            print(f"Predict failed: {exc}", file=sys.stderr)
+            print(traceback.format_exc(), file=sys.stderr)
+            context.set_code(grpc.StatusCode.INTERNAL)
+            context.set_details(str(exc))
+            return backend_pb2.Reply()
+
+    def PredictStream(self, request, context):
+        try:
+            for delta in self._generate_text_stream(request):
+                yield backend_pb2.Reply(message=delta.encode("utf-8"))
+        except Exception as exc:
+            print(f"PredictStream failed: {exc}", file=sys.stderr)
+            print(traceback.format_exc(), file=sys.stderr)
+            context.set_code(grpc.StatusCode.INTERNAL)
+            context.set_details(str(exc))
+
+
+    def VAD(self, request, context):
+        # Stub voice-activity detector: RMS-energy threshold over 30ms frames at
+        # 16 kHz. Good enough for the realtime endpoint's handleVAD loop, which
+        # only inspects segment presence + last segment end. The proper signal
+        # would come from the model's audio encoder, but that ride-along is a
+        # PR-D scope item — until then this keeps the legacy pipeline path
+        # working without forcing the operator to install a separate VAD model.
+        import numpy as np
+        try:
+            audio = np.asarray(request.audio, dtype=np.float32)
+            if audio.size == 0:
+                return backend_pb2.VADResponse(segments=[])
+
+            sample_rate = 16000
+            frame_size = sample_rate * 30 // 1000  # 30ms → 480 samples
+            threshold = float(self.options.get("vad_rms_threshold", 0.01))
+            min_speech_frames = int(self.options.get("vad_min_speech_frames", 2))  # ≥60ms
+            # handleVAD ticks every 300 ms and only inspects segment presence
+            # + last segment end relative to silence_threshold (~500 ms). Cap
+            # the analysed window to the tail of the buffer so we don't redo
+            # the entire growing utterance every tick.
+            window_s = float(self.options.get("vad_window_s", 5.0))
+            window_samples = int(window_s * sample_rate)
+            time_offset_s = 0.0
+            if audio.size > window_samples:
+                time_offset_s = (audio.size - window_samples) / sample_rate
+                audio = audio[-window_samples:]
+
+            n_frames = audio.size // frame_size
+            if n_frames == 0:
+                return backend_pb2.VADResponse(segments=[])
+            frames = audio[: n_frames * frame_size].reshape(n_frames, frame_size)
+            rms = np.sqrt(np.mean(frames ** 2, axis=1))
+            speech = rms > threshold
+
+            def _emit(start_idx, end_idx, out):
+                if end_idx - start_idx >= min_speech_frames:
+                    out.append(backend_pb2.VADSegment(
+                        start=time_offset_s + start_idx * frame_size / sample_rate,
+                        end=time_offset_s + end_idx * frame_size / sample_rate,
+                    ))
+
+            segments = []
+            start_idx = None
+            for i, is_speech in enumerate(speech):
+                if is_speech and start_idx is None:
+                    start_idx = i
+                elif not is_speech and start_idx is not None:
+                    _emit(start_idx, i, segments)
+                    start_idx = None
+            if start_idx is not None:
+                _emit(start_idx, n_frames, segments)
+            return backend_pb2.VADResponse(segments=segments)
+        except Exception as exc:
+            print(f"VAD failed: {exc}", file=sys.stderr)
+            print(traceback.format_exc(), file=sys.stderr)
+            context.set_code(grpc.StatusCode.INTERNAL)
+            context.set_details(str(exc))
+            return backend_pb2.VADResponse(segments=[])
+
+
+    def TTS(self, request, context):
+        try:
+            if self.model is None or self.processor is None:
+                return backend_pb2.Result(success=False, message="Model not loaded")
+
+            import torch
+            import torchaudio
+            from liquid_audio import ChatState
+
+            voice = request.voice.lower() if request.voice else self.voice
+            voice = voice.removeprefix("lfm2:").removeprefix("lfm:")
+            if voice not in VOICE_PROMPTS:
+                voice = self.voice
+            system_prompt = VOICE_PROMPTS[voice]
+
+            chat = ChatState(self.processor)
+            chat.new_turn("system")
+            chat.add_text(system_prompt)
+            chat.end_turn()
+            chat.new_turn("user")
+            chat.add_text(request.text or "")
+            chat.end_turn()
+            chat.new_turn("assistant")
+
+            audio_top_k = int(self.options.get("audio_top_k", 64))
+            audio_temp = float(self.options.get("audio_temperature", 0.8))
+            max_new = int(self.options.get("max_new_tokens", 2048))
+
+            audio_out = []
+            for tok in self.model.generate_sequential(
+                **chat,
+                max_new_tokens=max_new,
+                audio_temperature=audio_temp,
+                audio_top_k=audio_top_k,
+            ):
+                if tok.numel() > 1:
+                    audio_out.append(tok)
+
+            if len(audio_out) <= 1:
+                return backend_pb2.Result(success=False, message="No audio frames generated")
+
+            # Drop the trailing end-of-audio frame, matching the package's examples.
+            audio_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
+            waveform = self.processor.decode(audio_codes)
+
+            out_path = request.dst
+            if not out_path:
+                return backend_pb2.Result(success=False, message="dst path is required")
+            os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
+            # soundfile in preference to torchaudio.save — the latter routes
+            # through torchcodec, whose native libs need NVIDIA NPP that we
+            # don't bundle in the cuda13 image.
+            import soundfile as _sf
+            _sf.write(out_path, waveform.cpu().numpy().squeeze(0).T, 24_000)
+
+            return backend_pb2.Result(success=True)
+        except Exception as exc:
+            print(f"TTS failed: {exc}", file=sys.stderr)
+            print(traceback.format_exc(), file=sys.stderr)
+            return backend_pb2.Result(success=False, message=str(exc))
+
+
+    def AudioToAudioStream(self, request_iterator, context):
+        """Bidirectional any-to-any speech-to-speech stream.
+
+        See `backend.proto` AudioToAudioStream for the wire protocol. Audio
+        is decoded once per turn here; chunked detokenization for sub-second
+        TTFB is left to a future iteration once the LFM2AudioDetokenizer
+        gains a streaming entry point.
+        """
+        try:
+            yield from self._audio_to_audio_stream(request_iterator, context)
+        except Exception as exc:
+            print(f"AudioToAudioStream failed: {exc}", file=sys.stderr)
+            print(traceback.format_exc(), file=sys.stderr)
+            yield backend_pb2.AudioToAudioResponse(
+                event="error",
+                meta=json.dumps({"message": str(exc)}).encode("utf-8"),
+            )
+
+    def _audio_to_audio_stream(self, request_iterator, context):
+        if self.model is None or self.processor is None:
+            raise RuntimeError("Model not loaded")
+
+        import torch
+        import torchaudio
+        from liquid_audio import ChatState
+
+        cfg = None
+        chat = None
+        input_sample_rate = 16000
+        output_sample_rate = 24000
+        sequence = 0
+
+        def _new_event(event, **kwargs):
+            nonlocal sequence
+            sequence += 1
+            kwargs.setdefault("sequence", sequence)
+            return backend_pb2.AudioToAudioResponse(event=event, **kwargs)
+
+        def _ensure_chat():
+            """Build a fresh ChatState seeded with the system prompt."""
+            nonlocal chat
+            chat = ChatState(self.processor)
+            system_prompt = (cfg.system_prompt if cfg and cfg.system_prompt
+                             else "Respond with interleaved text and audio.")
+            chat.new_turn("system")
+            chat.add_text(system_prompt)
+            chat.end_turn()
+
+        # Buffers for the in-flight user turn
+        pcm_buffer = bytearray()
+
+        def _consume_user_turn():
+            nonlocal pcm_buffer
+            if not pcm_buffer:
+                return
+            # Avoid the bytes(pcm_buffer) copy: numpy view → torch view →
+            # one float32 copy (.to) → in-place divide on that copy.
+            import numpy as np
+            arr = np.frombuffer(memoryview(pcm_buffer), dtype=np.int16)
+            wav = torch.from_numpy(arr).to(torch.float32).div_(32768.0).unsqueeze(0)
+            chat.new_turn("user")
+            chat.add_audio(wav, input_sample_rate)
+            chat.end_turn()
+            pcm_buffer = bytearray()
+
+        def _run_generation():
+            """Run generate_interleaved; yield response events as we go."""
+            chat.new_turn("assistant")
+            audio_top_k = int(self.options.get("audio_top_k", 4))
+            audio_temp = float(self.options.get("audio_temperature", 1.0))
+            text_top_k = int(self.options.get("text_top_k", 0)) or None
+            text_temp = float(self.options.get("text_temperature", 0)) or None
+            max_new = int(self.options.get("max_new_tokens", 512))
+
+            audio_tokens = []
+            for tok in self.model.generate_interleaved(
+                **chat,
+                max_new_tokens=max_new,
+                text_temperature=text_temp,
+                text_top_k=text_top_k,
+                audio_temperature=audio_temp,
+                audio_top_k=audio_top_k,
+            ):
+                if tok.numel() == 1:
+                    if tok.item() == IM_END_TOKEN:
+                        break
+                    text = self.processor.text.decode(tok)
+                    if not text:
+                        continue
+                    yield _new_event(
+                        "response.audio_transcript.delta",
+                        meta=json.dumps({"delta": text}).encode("utf-8"),
+                    )
+                else:
+                    audio_tokens.append(tok)
+
+            # Detokenize the accumulated audio at end-of-turn — the
+            # LFM2AudioDetokenizer is non-streaming today.
+            if len(audio_tokens) > 1:
+                audio_codes = torch.stack(audio_tokens[:-1], 1).unsqueeze(0)
+                waveform = self.processor.decode(audio_codes)
+                # Convert to s16le PCM bytes at output_sample_rate
+                if output_sample_rate != 24000:
+                    waveform = torchaudio.functional.resample(
+                        waveform.cpu(), 24000, output_sample_rate
+                    )
+                pcm = (waveform.cpu().squeeze(0).clamp(-1, 1) * 32767.0).to(
+                    torch.int16
+                ).numpy().tobytes()
+                yield _new_event(
+                    "response.audio.delta",
+                    pcm=pcm,
+                    sample_rate=output_sample_rate,
+                )
+
+            yield _new_event("response.done", meta=b"{}")
+
+        for req in request_iterator:
+            if not context.is_active():
+                return
+            payload = req.WhichOneof("payload")
+            if payload == "config":
+                cfg = req.config
+                if cfg.input_sample_rate > 0:
+                    input_sample_rate = cfg.input_sample_rate
+                if cfg.output_sample_rate > 0:
+                    output_sample_rate = cfg.output_sample_rate
+                # Every config message resets the chat state and PCM buffer.
+                _ensure_chat()
+                pcm_buffer = bytearray()
+            elif payload == "frame":
+                if chat is None:
+                    _ensure_chat()
+                if req.frame.pcm:
+                    pcm_buffer.extend(req.frame.pcm)
+                if req.frame.end_of_input:
+                    _consume_user_turn()
+                    yield from _run_generation()
+            elif payload == "control":
+                event = req.control.event
+                if event == "input_audio_buffer.commit":
+                    _consume_user_turn()
+                    yield from _run_generation()
+                elif event == "response.cancel":
+                    # Synchronous generation here means cancel can only
+                    # take effect between turns; we ack so the client unblocks.
+                    yield _new_event("response.done", meta=b'{"cancelled":true}')
+                elif event == "session.update":
+                    # Free-form session re-config; treat as a soft reset.
+                    _ensure_chat()
+                    pcm_buffer = bytearray()
+                # Unknown events are ignored — forward-compatible.
+
+
+    def AudioTranscription(self, request, context):
+        try:
+            if self.model is None or self.processor is None:
+                return backend_pb2.TranscriptResult(segments=[], text="")
+
+            import torchaudio
+            from liquid_audio import ChatState
+
+            audio_path = request.dst
+            if not audio_path:
+                return backend_pb2.TranscriptResult(segments=[], text="")
+
+            chat = ChatState(self.processor)
+            chat.new_turn("system")
+            chat.add_text("Perform ASR.")
+            chat.end_turn()
+            chat.new_turn("user")
+            # soundfile in preference to torchaudio.load — the latter routes
+            # through torchcodec which needs NVIDIA NPP libs we don't bundle.
+            import soundfile as _sf
+            import torch
+            audio_np, sr = _sf.read(audio_path, dtype="float32", always_2d=True)
+            wav = torch.from_numpy(audio_np.T)  # (channels, samples)
+            if wav.shape[0] > 1:
+                # Down-mix to mono — the processor expects a single channel
+                wav = wav.mean(dim=0, keepdim=True)
+            chat.add_audio(wav, sr)
+            chat.end_turn()
+            chat.new_turn("assistant")
+
+            max_new = int(self.options.get("max_new_tokens", 1024))
+
+            pieces = []
+            for tok in self.model.generate_sequential(**chat, max_new_tokens=max_new):
+                if tok.numel() == 1:
+                    if tok.item() == IM_END_TOKEN:
+                        break
+                    pieces.append(self.processor.text.decode(tok))
+
+            text = "".join(pieces).strip()
+            duration_ms = int((wav.shape[1] / sr) * 1000)
+            segment = backend_pb2.TranscriptSegment(
+                id=0, start=0, end=duration_ms, text=text, tokens=[],
+            )
+            return backend_pb2.TranscriptResult(segments=[segment], text=text)
+        except Exception as exc:
+            print(f"AudioTranscription failed: {exc}", file=sys.stderr)
+            print(traceback.format_exc(), file=sys.stderr)
+            return backend_pb2.TranscriptResult(segments=[], text="")
+
+
+    def StartFineTune(self, request, context):
+        if self.active_job is not None and not self.active_job.completed:
+            return backend_pb2.FineTuneJobResult(
+                job_id="", success=False,
+                message="A fine-tuning job is already running",
+            )
+
+        job_id = request.job_id or str(uuid.uuid4())
+        job = ActiveJob(job_id)
+        self.active_job = job
+
+        thread = threading.Thread(target=self._run_training, args=(request, job), daemon=True)
+        job.thread = thread
+        thread.start()
+
+        return backend_pb2.FineTuneJobResult(
+            job_id=job_id, success=True, message="Training started",
+        )
+
+    def FineTuneProgress(self, request, context):
+        if self.active_job is None or self.active_job.job_id != request.job_id:
+            context.set_code(grpc.StatusCode.NOT_FOUND)
+            context.set_details(f"Job {request.job_id} not found")
+            return
+
+        job = self.active_job
+        while True:
+            try:
+                update = job.progress_queue.get(timeout=1.0)
+            except queue.Empty:
+                if job.completed or job.stopped:
+                    break
+                if not context.is_active():
+                    break
+                continue
+            if update is None:
+                break
+            yield update
+            if update.status in ("completed", "failed", "stopped"):
+                break
+
+    def StopFineTune(self, request, context):
+        # We can't kill the Accelerate training loop mid-step cleanly from here;
+        # LocalAI's job manager kills the backend process on stop. The flag below
+        # at least lets the progress stream terminate quickly.
+        if self.active_job is not None and self.active_job.job_id == request.job_id:
+            self.active_job.stopped = True
+            self.active_job.progress_queue.put(None)
+        return backend_pb2.Result(success=True, message="OK")
+
+    def _run_training(self, request, job):
+        try:
+            self._do_train(request, job)
+            job.completed = True
+            job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
+                job_id=job.job_id, status="completed", message="Training completed",
+                progress_percent=100.0,
+            ))
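+        except KeyboardInterrupt:
+            # QueuedTrainer.log raises KeyboardInterrupt on a stop request. It is a
+            # BaseException, so the `except Exception` below would not catch it and
+            # job.completed would stay False, blocking later StartFineTune calls.
+            job.completed = True
+            job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
+                job_id=job.job_id, status="stopped", message="Training stopped",
+            ))
+        except Exception as exc: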
+            job.error = str(exc)
+            job.completed = True
+            print(f"Training failed: {exc}", file=sys.stderr)
+            print(traceback.format_exc(), file=sys.stderr)
+            job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
+                job_id=job.job_id, status="failed", message=str(exc),
+            ))
+        finally:
+            job.progress_queue.put(None)
+
+    def _do_train(self, request, job):
+        from liquid_audio import LFM2AudioModel  # noqa: F401  (sanity import)
+        from liquid_audio.data.dataloader import LFM2DataLoader
+        from liquid_audio.trainer import Trainer
+
+        model_id = request.model or self.model_id or "LiquidAI/LFM2.5-Audio-1.5B"
+
+        dataset_path = request.dataset_source
+        if not dataset_path:
+            raise ValueError("dataset_source is required (path to a preprocessed dataset)")
+
+        extras = dict(request.extra_options) if request.extra_options else {}
+        val_path = extras.get("val_dataset")
+
+        # Map FineTuneRequest hyperparameters to liquid_audio.Trainer constructor args
+        lr = request.learning_rate or 3e-5
+        max_steps = request.max_steps or 1000
+        warmup_steps = request.warmup_steps or min(100, max_steps // 10)
+        batch_size = request.batch_size or 16
+        save_interval = request.save_steps or max(1, max_steps // 4)
+
+        output_dir = request.output_dir or os.path.join(
+            os.environ.get("LIQUID_AUDIO_OUTPUT_DIR", "/tmp"),
+            f"liquid-audio-{job.job_id}",
+        )
+        os.makedirs(output_dir, exist_ok=True)
+
+        job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
+            job_id=job.job_id, status="loading_dataset",
+            message=f"Loading preprocessed dataset from {dataset_path}",
+        ))
+        train_data = LFM2DataLoader(dataset_path)
+        val_data = LFM2DataLoader(val_path) if val_path else None
+
+        job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
+            job_id=job.job_id, status="loading_model",
+            message=f"Loading base model {model_id}",
+        ))
+
+        # The Liquid Trainer logs via self.accelerator.print; we subclass it to
+        # also push progress events onto the queue every logging_interval steps.
+        progress_q = job.progress_queue
+
+        class QueuedTrainer(Trainer):
+            def log(self_, model_output):
+                if self_.step > 0 and self_.step % self_.logging_interval == 0:
+                    try:
+                        loss = self_.accelerator.reduce(
+                            model_output.loss.detach(), reduction="mean"
+                        ).item()
+                    except Exception:
+                        loss = float("nan")
+                    lr_now = self_.optimizer.param_groups[0]["lr"]
+                    pct = (self_.step / self_.max_steps * 100.0) if self_.max_steps else 0.0
+                    progress_q.put(backend_pb2.FineTuneProgressUpdate(
+                        job_id=job.job_id,
+                        current_step=int(self_.step),
+                        total_steps=int(self_.max_steps),
+                        current_epoch=float(self_.epoch),
+                        loss=float(loss),
+                        learning_rate=float(lr_now),
+                        progress_percent=float(pct),
+                        status="training",
+                    ))
+                # Honour stop requests: raising here terminates the loop cleanly
+                if job.stopped:
+                    raise KeyboardInterrupt("stop requested")
+                return super().log(model_output)
+
+            def validate(self_):
+                progress_q.put(backend_pb2.FineTuneProgressUpdate(
+                    job_id=job.job_id, current_step=int(self_.step),
+                    total_steps=int(self_.max_steps), status="training",
+                    message=f"Running validation at step {self_.step}",
+                ))
+                return super().validate()
+
+        trainer = QueuedTrainer(
+            model_id=model_id,
+            train_data=train_data,
+            val_data=val_data,
+            lr=lr,
+            max_steps=max_steps,
+            warmup_steps=warmup_steps,
+            batch_size=batch_size,
+            save_interval=save_interval,
+            output_dir=output_dir,
+            weight_decay=request.weight_decay or 0.1,
+        )
+
+        job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
+            job_id=job.job_id, status="training", message="Training started",
+            total_steps=int(max_steps),
+        ))
+        trainer.train()
+
+        job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
+            job_id=job.job_id, status="saving",
+            message=f"Saved final model to {output_dir}",
+            checkpoint_path=os.path.join(output_dir, "final"),
+        ))
+
+
+    def _build_chat_state(self, messages, user_prompt, tools_prelude=None):
+        """Build a ChatState from a list of (role, content) tuples plus an optional final user turn.
+
+        tools_prelude, when non-empty, is prepended as an extra system turn carrying
+        the LFM2 tool-list block — mirrors gallery/lfm.yaml's `function:` template
+        so the model sees the same prompt shape whether served via llama-cpp or here.
+        """
+        from liquid_audio import ChatState
+        chat = ChatState(self.processor)
+        if tools_prelude:
+            chat.new_turn("system")
+            chat.add_text(tools_prelude)
+            chat.end_turn()
+        for role, content in messages:
+            chat.new_turn(role)
+            chat.add_text(content)
+            chat.end_turn()
+        if user_prompt:
+            chat.new_turn("user")
+            chat.add_text(user_prompt)
+            chat.end_turn()
+        chat.new_turn("assistant")
+        return chat
+
+    def _collect_messages(self, request):
+        """Translate PredictOptions.Messages into (role, content) tuples."""
+        out = []
+        for m in request.Messages:
+            role = (m.role or "user").lower()
+            if role not in ("system", "user", "assistant"):
+                role = "user"
+            out.append((role, m.content or ""))
+        return out
+
+    def _render_tools_prelude(self, request):
+        """Build the LFM2 `<|tool_list_start|>…<|tool_list_end|>` system prelude
+        from request.Tools (OpenAI Chat-Completions tool JSON). Returns "" when
+        no tools are attached. Output mirrors gallery/lfm.yaml's `function:`
+        template so the model sees the same prompt whether routed via llama-cpp
+        or this backend."""
+        tools_raw = getattr(request, "Tools", "") or ""
+        if not tools_raw:
+            return ""
+        try:
+            tools = json.loads(tools_raw)
+        except json.JSONDecodeError:
+            print(f"liquid-audio: ignoring malformed Tools JSON: {tools_raw[:200]!r}",
+                  file=sys.stderr)
+            return ""
+        if not isinstance(tools, list) or not tools:
+            return ""
+        # The LFM2 chat template uses single-quoted Python-dict-ish syntax in
+        # examples, but the tokenizer treats this whole block as opaque text;
+        # JSON works fine and is what other backends emit.
+        return (
+            "You are a function calling AI model. You are provided with functions to "
+            "execute. You may call one or more functions to assist with the user query. "
+            "Don't make assumptions about what values to plug into functions.\n"
+            "List of tools: <|tool_list_start|>"
+            + json.dumps(tools, separators=(",", ":"))
+            + "<|tool_list_end|>"
+        )
+
+    def _generate_text_stream(self, request):
+        """Yield text-only deltas from generate_sequential. Caller joins for unary Predict."""
+        if self.model is None or self.processor is None:
+            raise RuntimeError("Model not loaded")
+        messages = self._collect_messages(request)
+        user_prompt = request.Prompt or None
+        tools_prelude = self._render_tools_prelude(request)
+        # If the request already carries Messages, Prompt is the templated form
+        # of the same content — don't append a duplicate user turn.
+        chat = self._build_chat_state(
+            messages,
+            user_prompt if not messages else None,
+            tools_prelude=tools_prelude,
+        )
+
+        max_new = request.Tokens if request.Tokens > 0 else int(self.options.get("max_new_tokens", 512))
+        temperature = request.Temperature if request.Temperature > 0 else None
+        top_k = request.TopK if request.TopK > 0 else None
+
+        for tok in self.model.generate_sequential(
+            **chat,
+            max_new_tokens=max_new,
+            text_temperature=temperature,
+            text_top_k=top_k,
+        ):
+            if tok.numel() == 1:
+                if tok.item() == IM_END_TOKEN:
+                    break
+                yield self.processor.text.decode(tok)
+
+
+def serve(address):
+    server = grpc.server(
+        futures.ThreadPoolExecutor(max_workers=MAX_WORKERS),
+        options=[
+            ('grpc.max_message_length', 50 * 1024 * 1024),
+            ('grpc.max_send_message_length', 50 * 1024 * 1024),
+            ('grpc.max_receive_message_length', 50 * 1024 * 1024),
+        ],
+        interceptors=get_auth_interceptors(),
+    )
+    backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
+    server.add_insecure_port(address)
+    server.start()
+    print(f"Liquid-audio backend listening on {address}", file=sys.stderr, flush=True)
+
+    def stop(_signum, _frame):
+        server.stop(0)
+        sys.exit(0)
+
+    signal.signal(signal.SIGTERM, stop)
+    signal.signal(signal.SIGINT, stop)
+
+    try:
+        while True:
+            time.sleep(_ONE_DAY_IN_SECONDS)
+    except KeyboardInterrupt:
+        server.stop(0)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Liquid Audio gRPC backend")
+    parser.add_argument("--addr", default="localhost:50051", help="gRPC server address")
+    args = parser.parse_args()
+    serve(args.addr)
--- a/backend/python/liquid-audio/install.sh
+++ b/backend/python/liquid-audio/install.sh
@@ -0,0 +1,18 @@
+#!/bin/bash
+set -e
+
+# liquid-audio requires Python ≥ 3.12 (per its pyproject.toml); the default
+# portable Python in libbackend.sh is 3.10. Override before sourcing.
+export PYTHON_VERSION="${PYTHON_VERSION:-3.12}"
+export PYTHON_PATCH="${PYTHON_PATCH:-11}"
+
+backend_dir=$(dirname "$0")
+if [ -d "$backend_dir/common" ]; then
+    source "$backend_dir/common/libbackend.sh"
+else
+    source "$backend_dir/../common/libbackend.sh"
+fi
+
+# Allow upgrades to satisfy transitive pins; unsafe-first-match lets uv move on to later indexes once the first has no compatible version
+EXTRA_PIP_INSTALL_FLAGS+=" --upgrade --index-strategy=unsafe-first-match"
+installRequirements
--- a/backend/python/liquid-audio/protogen.sh
+++ b/backend/python/liquid-audio/protogen.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+set -e
+
+backend_dir=$(dirname "$0")
+if [ -d "$backend_dir/common" ]; then
+    source "$backend_dir/common/libbackend.sh"
+else
+    source "$backend_dir/../common/libbackend.sh"
+fi
+
+runProtogen
--- a/backend/python/liquid-audio/requirements-cpu.txt
+++ b/backend/python/liquid-audio/requirements-cpu.txt
@@ -0,0 +1,13 @@
+--extra-index-url https://download.pytorch.org/whl/cpu
+torch>=2.8.0
+torchaudio>=2.8.0
+torchcodec>=0.9.1
+transformers>=4.55.4
+accelerate>=1.10.1
+datasets>=4.8.4
+einops>=0.8.1
+librosa>=0.11.0
+soundfile>=0.12.1
+sentencepiece>=0.2.1
+huggingface-hub>=1.3.0
+liquid-audio>=1.2.0
--- a/backend/python/liquid-audio/requirements-cublas12.txt
+++ b/backend/python/liquid-audio/requirements-cublas12.txt
@@ -0,0 +1,13 @@
+--extra-index-url https://download.pytorch.org/whl/cu121
+torch>=2.8.0
+torchaudio>=2.8.0
+torchcodec>=0.9.1
+transformers>=4.55.4
+accelerate>=1.10.1
+datasets>=4.8.4
+einops>=0.8.1
+librosa>=0.11.0
+soundfile>=0.12.1
+sentencepiece>=0.2.1
+huggingface-hub>=1.3.0
+liquid-audio>=1.2.0
--- a/backend/python/liquid-audio/requirements-cublas13.txt
+++ b/backend/python/liquid-audio/requirements-cublas13.txt
@@ -0,0 +1,13 @@
+--extra-index-url https://download.pytorch.org/whl/cu130
+torch>=2.8.0
+torchaudio>=2.8.0
+torchcodec>=0.9.1
+transformers>=4.55.4
+accelerate>=1.10.1
+datasets>=4.8.4
+einops>=0.8.1
+librosa>=0.11.0
+soundfile>=0.12.1
+sentencepiece>=0.2.1
+huggingface-hub>=1.3.0
+liquid-audio>=1.2.0
--- a/backend/python/liquid-audio/requirements-hipblas.txt
+++ b/backend/python/liquid-audio/requirements-hipblas.txt
@@ -0,0 +1,13 @@
+--extra-index-url https://download.pytorch.org/whl/rocm7.0
+torch>=2.8.0
+torchaudio>=2.8.0
+torchcodec>=0.9.1
+transformers>=4.55.4
+accelerate>=1.10.1
+datasets>=4.8.4
+einops>=0.8.1
+librosa>=0.11.0
+soundfile>=0.12.1
+sentencepiece>=0.2.1
+huggingface-hub>=1.3.0
+liquid-audio>=1.2.0
--- a/backend/python/liquid-audio/requirements-l4t13.txt
+++ b/backend/python/liquid-audio/requirements-l4t13.txt
@@ -0,0 +1,13 @@
+--extra-index-url https://pypi.jetson-ai-lab.io/jp7/cu130
+torch>=2.8.0
+torchaudio>=2.8.0
+torchcodec>=0.9.1
+transformers>=4.55.4
+accelerate>=1.10.1
+datasets>=4.8.4
+einops>=0.8.1
+librosa>=0.11.0
+soundfile>=0.12.1
+sentencepiece>=0.2.1
+huggingface-hub>=1.3.0
+liquid-audio>=1.2.0
--- a/backend/python/liquid-audio/requirements-mps.txt
+++ b/backend/python/liquid-audio/requirements-mps.txt
@@ -0,0 +1,12 @@
+torch>=2.8.0
+torchaudio>=2.8.0
+torchcodec>=0.9.1
+transformers>=4.55.4
+accelerate>=1.10.1
+datasets>=4.8.4
+einops>=0.8.1
+librosa>=0.11.0
+soundfile>=0.12.1
+sentencepiece>=0.2.1
+huggingface-hub>=1.3.0
+liquid-audio>=1.2.0
--- a/backend/python/liquid-audio/requirements.txt
+++ b/backend/python/liquid-audio/requirements.txt
@@ -0,0 +1,3 @@
+grpcio==1.78.1
+protobuf
+certifi
--- a/backend/python/liquid-audio/run.sh
+++ b/backend/python/liquid-audio/run.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+
+backend_dir=$(dirname "$0")
+if [ -d "$backend_dir/common" ]; then
+    source "$backend_dir/common/libbackend.sh"
+else
+    source "$backend_dir/../common/libbackend.sh"
+fi
+
+startBackend "$@"
--- a/backend/python/liquid-audio/test.py
+++ b/backend/python/liquid-audio/test.py
@@ -0,0 +1,89 @@
+"""Smoke tests for the liquid-audio backend.
+
+These run without contacting HuggingFace or loading model weights: they verify
+that the gRPC service starts, Health() responds, and fine-tune-mode LoadModel succeeds.
+
+To run an end-to-end inference test, set LIQUID_AUDIO_MODEL_ID
+(e.g. "LiquidAI/LFM2.5-Audio-1.5B") in the environment — see test_inference().
+"""
+import os
+import subprocess
+import sys
+import time
+import unittest
+
+import grpc
+
+# Ensure generated protobuf stubs are importable
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+
+import backend_pb2
+import backend_pb2_grpc
+
+
+class TestBackend(unittest.TestCase):
+    @classmethod
+    def setUpClass(cls):
+        addr = os.environ.get("LIQUID_AUDIO_TEST_ADDR", "localhost:50053")
+        cls.addr = addr
+        cls.server = subprocess.Popen(
+            [sys.executable, os.path.join(os.path.dirname(__file__), "backend.py"), "--addr", addr],
+        )
+        time.sleep(2)  # Give the server a moment to bind
+
+    @classmethod
+    def tearDownClass(cls):
+        cls.server.terminate()
+        try:
+            cls.server.wait(timeout=5)
+        except subprocess.TimeoutExpired:
+            cls.server.kill()
+
+    def _stub(self):
+        channel = grpc.insecure_channel(self.addr)
+        return backend_pb2_grpc.BackendStub(channel)
+
+    def test_health(self):
+        stub = self._stub()
+        reply = stub.Health(backend_pb2.HealthMessage(), timeout=5)
+        self.assertEqual(reply.message, b"OK")
+
+    def test_load_finetune_mode_without_weights(self):
+        """Loading in fine-tune mode should succeed without pulling model weights."""
+        stub = self._stub()
+        result = stub.LoadModel(
+            backend_pb2.ModelOptions(
+                Model="LiquidAI/LFM2.5-Audio-1.5B",
+                Options=["mode:finetune"],
+            ),
+            timeout=10,
+        )
+        self.assertTrue(result.success, msg=result.message)
+
+    @unittest.skipUnless(os.environ.get("LIQUID_AUDIO_MODEL_ID"),
+                         "Set LIQUID_AUDIO_MODEL_ID to run an end-to-end inference smoke test")
+    def test_inference(self):
+        """End-to-end: load a real LFM2-Audio model and run one short prediction."""
+        stub = self._stub()
+        model_id = os.environ["LIQUID_AUDIO_MODEL_ID"]
+        result = stub.LoadModel(
+            backend_pb2.ModelOptions(
+                Model=model_id,
+                Options=["mode:chat"],
+            ),
+            timeout=600,
+        )
+        self.assertTrue(result.success, msg=result.message)
+        reply = stub.Predict(
+            backend_pb2.PredictOptions(
+                Prompt="Hello!",
+                Tokens=8,
+                Temperature=0.0,
+            ),
+            timeout=120,
+        )
+        self.assertGreater(len(reply.message), 0)
+
+
+if __name__ == "__main__":
+    unittest.main()
--- a/backend/python/liquid-audio/test.sh
+++ b/backend/python/liquid-audio/test.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+set -e
+
+backend_dir=$(dirname "$0")
+if [ -d "$backend_dir/common" ]; then
+    source "$backend_dir/common/libbackend.sh"
+else
+    source "$backend_dir/../common/libbackend.sh"
+fi
+
+runUnittests
--- a/backend/python/vllm/requirements-cublas13-after.txt
+++ b/backend/python/vllm/requirements-cublas13-after.txt
@@ -3,5 +3,5 @@
 # on a cu130 host. Pull the cu130-flavoured wheel from vLLM's per-tag index
 # instead — the cublas13 case in install.sh adds --index-strategy=unsafe-best-match
 # so uv consults this index alongside PyPI.
--extra-index-url https://wheels.vllm.ai/0.20.2/cu130
-vllm==0.20.2
+--extra-index-url https://wheels.vllm.ai/0.21.0/cu130
+vllm==0.21.0
--- a/core/application/distributed.go
+++ b/core/application/distributed.go
@@ -169,7 +169,7 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB, configLoade
 		cfg.Distributed.HealthCheckIntervalOrDefault(),
 		cfg.Distributed.StaleNodeThresholdOrDefault(),
 		routerAuthToken,
-		cfg.Distributed.PerModelHealthCheck,
+		!cfg.Distributed.DisablePerModelHealthCheck,
 	)

 	// Initialize job store
--- a/core/application/startup.go
+++ b/core/application/startup.go
@@ -212,12 +212,12 @@ func New(opts ...config.AppOption) (*Application, error) {
 		}
 	}

-	if err := coreStartup.InstallModels(options.Context, application.GalleryService(), options.Galleries, options.BackendGalleries, options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, nil, options.ModelsURL...); err != nil {
+	if err := coreStartup.InstallModels(options.Context, application.GalleryService(), options.Galleries, options.BackendGalleries, options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.RequireBackendIntegrity, nil, options.ModelsURL...); err != nil {
 		xlog.Error("error installing models", "error", err)
 	}

 	for _, backend := range options.ExternalBackends {
-		if err := galleryop.InstallExternalBackend(options.Context, options.BackendGalleries, options.SystemState, application.ModelLoader(), nil, backend, "", ""); err != nil {
+		if err := galleryop.InstallExternalBackend(options.Context, options.BackendGalleries, options.SystemState, application.ModelLoader(), nil, backend, "", "", options.RequireBackendIntegrity); err != nil {
 			xlog.Error("error installing external backend", "error", err)
 		}
 	}
@@ -267,13 +267,13 @@ func New(opts ...config.AppOption) (*Application, error) {
 	}

 	if options.PreloadJSONModels != "" {
-		if err := galleryop.ApplyGalleryFromString(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadJSONModels); err != nil {
+		if err := galleryop.ApplyGalleryFromString(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadJSONModels, options.RequireBackendIntegrity); err != nil {
 			return nil, err
 		}
 	}

 	if options.PreloadModelsFromPath != "" {
-		if err := galleryop.ApplyGalleryFromFile(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadModelsFromPath); err != nil {
+		if err := galleryop.ApplyGalleryFromFile(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadModelsFromPath, options.RequireBackendIntegrity); err != nil {
 			return nil, err
 		}
 	}
--- a/core/application/upgrade_checker.go
+++ b/core/application/upgrade_checker.go
@@ -217,7 +217,7 @@ func (uc *UpgradeChecker) runCheck(ctx context.Context) {
 				err = bm.UpgradeBackend(ctx, name, nil)
 			} else {
 				err = gallery.UpgradeBackend(ctx, uc.systemState, uc.modelLoader,
-					uc.galleries, name, nil)
+					uc.galleries, name, nil, uc.appConfig.RequireBackendIntegrity)
 			}
 			if err != nil {
 				xlog.Error("Failed to auto-upgrade backend",
--- a/core/backend/llm.go
+++ b/core/backend/llm.go
@@ -86,7 +86,7 @@ func ModelInference(ctx context.Context, s string, messages schema.Messages, ima
 		if !slices.Contains(modelNames, modelName) {
 			utils.ResetDownloadTimers()
 			// if we failed to load the model, we try to download it
-			err := gallery.InstallModelFromGallery(ctx, o.Galleries, o.BackendGalleries, o.SystemState, loader, modelName, gallery.GalleryModel{}, utils.DisplayDownloadFunction, o.EnforcePredownloadScans, o.AutoloadBackendGalleries)
+			err := gallery.InstallModelFromGallery(ctx, o.Galleries, o.BackendGalleries, o.SystemState, loader, modelName, gallery.GalleryModel{}, utils.DisplayDownloadFunction, o.EnforcePredownloadScans, o.AutoloadBackendGalleries, o.RequireBackendIntegrity)
 			if err != nil {
 				xlog.Error("failed to install model from gallery", "error", err, "model", modelFile)
 				//return nil, err
--- a/core/cli/backends.go
+++ b/core/cli/backends.go
@@ -17,9 +17,10 @@ import (
 )

 type BackendsCMDFlags struct {
-	BackendGalleries   string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
-	BackendsPath       string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"storage"`
-	BackendsSystemPath string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
+	BackendGalleries        string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
+	BackendsPath            string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"storage"`
+	BackendsSystemPath      string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
+	RequireBackendIntegrity bool   `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, reject backend installs without a configured signature verification policy (OCI URIs) or SHA256 (tarball/HTTP URIs)." group:"hardening" default:"false"`
 }

 type BackendsList struct {
@@ -126,7 +127,7 @@ func (bi *BackendsInstall) Run(ctx *cliContext.Context) error {
 	}

 	modelLoader := model.NewModelLoader(systemState)
-	err = galleryop.InstallExternalBackend(context.Background(), galleries, systemState, modelLoader, progressCallback, bi.BackendArgs, bi.Name, bi.Alias)
+	err = galleryop.InstallExternalBackend(context.Background(), galleries, systemState, modelLoader, progressCallback, bi.BackendArgs, bi.Name, bi.Alias, bi.RequireBackendIntegrity)
 	if err != nil {
 		return err
 	}
@@ -197,7 +198,7 @@ func (bu *BackendsUpgrade) Run(ctx *cliContext.Context) error {
 			}
 		}

-		if err := gallery.UpgradeBackend(context.Background(), systemState, modelLoader, galleries, name, progressCallback); err != nil {
+		if err := gallery.UpgradeBackend(context.Background(), systemState, modelLoader, galleries, name, progressCallback, bu.RequireBackendIntegrity); err != nil {
 			fmt.Printf("Failed to upgrade %s: %v\n", name, err)
 		} else {
 			fmt.Printf("Backend %s upgraded successfully\n", name)
--- a/core/cli/models.go
+++ b/core/cli/models.go
@@ -32,6 +32,7 @@ type ModelsList struct {

 type ModelsInstall struct {
 	DisablePredownloadScan   bool     `env:"LOCALAI_DISABLE_PREDOWNLOAD_SCAN" help:"If true, disables the best-effort security scanner before downloading any files." group:"hardening" default:"false"`
+	RequireBackendIntegrity  bool     `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, reject backend installs without a configured signature verification policy (OCI URIs) or SHA256 (tarball/HTTP URIs)." group:"hardening" default:"false"`
 	AutoloadBackendGalleries bool     `env:"LOCALAI_AUTOLOAD_BACKEND_GALLERIES" help:"If true, automatically loads backend galleries" group:"backends" default:"true"`
 	ModelArgs                []string `arg:"" optional:"" name:"models" help:"Model configuration URLs to load"`

@@ -71,7 +72,6 @@ func (ml *ModelsList) Run(ctx *cliContext.Context) error {
 }

 func (mi *ModelsInstall) Run(ctx *cliContext.Context) error {
-
 	systemState, err := system.GetSystemState(
 		system.WithModelPath(mi.ModelsPath),
 		system.WithBackendPath(mi.BackendsPath),
@@ -135,7 +135,7 @@ func (mi *ModelsInstall) Run(ctx *cliContext.Context) error {
 		}

 		modelLoader := model.NewModelLoader(systemState)
-		err = startup.InstallModels(context.Background(), galleryService, galleries, backendGalleries, systemState, modelLoader, !mi.DisablePredownloadScan, mi.AutoloadBackendGalleries, progressCallback, modelName)
+		err = startup.InstallModels(context.Background(), galleryService, galleries, backendGalleries, systemState, modelLoader, !mi.DisablePredownloadScan, mi.AutoloadBackendGalleries, mi.RequireBackendIntegrity, progressCallback, modelName)
 		if err != nil {
 			return err
 		}
--- a/core/cli/run.go
+++ b/core/cli/run.go
@@ -67,6 +67,7 @@ type RunCMD struct {
 	OllamaAPIRootEndpoint              bool     `env:"LOCALAI_OLLAMA_API_ROOT_ENDPOINT" default:"false" help:"Register Ollama-compatible health check on / (replaces web UI on root path). The /api/* Ollama endpoints are always available regardless of this flag" group:"api"`
 	DisableRuntimeSettings             bool     `env:"LOCALAI_DISABLE_RUNTIME_SETTINGS,DISABLE_RUNTIME_SETTINGS" default:"false" help:"Disables the runtime settings. When set to true, the server will not load the runtime settings from the runtime_settings.json file" group:"api"`
 	DisablePredownloadScan             bool     `env:"LOCALAI_DISABLE_PREDOWNLOAD_SCAN" help:"If true, disables the best-effort security scanner before downloading any files." group:"hardening" default:"false"`
+	RequireBackendIntegrity            bool     `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, backend installs without a configured signature verification policy (for OCI URIs) or SHA256 (for tarball/HTTP URIs) are rejected. Default is to warn and install. Set this in production once your gallery's verification: block is populated." group:"hardening" default:"false"`
 	OpaqueErrors                       bool     `env:"LOCALAI_OPAQUE_ERRORS" default:"false" help:"If true, all error responses are replaced with blank 500 errors. This is intended only for hardening against information leaks and is normally not recommended." group:"hardening"`
 	UseSubtleKeyComparison             bool     `env:"LOCALAI_SUBTLE_KEY_COMPARISON" default:"false" help:"If true, API Key validation comparisons will be performed using constant-time comparisons rather than simple equality. This trades off performance on each request for resiliancy against timing attacks." group:"hardening"`
 	DisableApiKeyRequirementForHttpGet bool     `env:"LOCALAI_DISABLE_API_KEY_REQUIREMENT_FOR_HTTP_GET" default:"false" help:"If true, a valid API key is not required to issue GET requests to portions of the web ui. This should only be enabled in secure testing environments" group:"hardening"`
@@ -503,6 +504,10 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
 		opts = append(opts, config.WithAutoUpgradeBackends(r.AutoUpgradeBackends))
 	}

+	if r.RequireBackendIntegrity {
+		opts = append(opts, config.WithRequireBackendIntegrity(r.RequireBackendIntegrity))
+	}
+
 	if r.PreferDevelopmentBackends {
 		opts = append(opts, config.WithPreferDevelopmentBackends(r.PreferDevelopmentBackends))
 	}
--- a/core/cli/worker/worker.go
+++ b/core/cli/worker/worker.go
@@ -1,10 +1,11 @@
 package worker

 type WorkerFlags struct {
-	BackendsPath       string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"backends"`
-	BackendGalleries   string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
-	BackendsSystemPath string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
-	ExtraLLamaCPPArgs  string `name:"llama-cpp-args" env:"LOCALAI_EXTRA_LLAMA_CPP_ARGS,EXTRA_LLAMA_CPP_ARGS" help:"Extra arguments to pass to llama-cpp-rpc-server"`
+	BackendsPath            string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"backends"`
+	BackendGalleries        string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
+	BackendsSystemPath      string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
+	RequireBackendIntegrity bool   `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, reject backend installs without a configured signature verification policy (OCI URIs) or SHA256 (tarball/HTTP URIs)." group:"hardening" default:"false"`
+	ExtraLLamaCPPArgs       string `name:"llama-cpp-args" env:"LOCALAI_EXTRA_LLAMA_CPP_ARGS,EXTRA_LLAMA_CPP_ARGS" help:"Extra arguments to pass to llama-cpp-rpc-server"`
 }

 type Worker struct {
--- a/core/cli/worker/worker_backend_common.go
+++ b/core/cli/worker/worker_backend_common.go
@@ -18,7 +18,7 @@ import (
 // installing the backend from the gallery if it isn't present.
 // `name` is the gallery entry name (for vLLM the meta entry "vllm"
 // resolves to a platform-specific package via capability lookup).
-func findBackendPath(name, galleries string, systemState *system.SystemState) (string, error) {
+func findBackendPath(name, galleries string, systemState *system.SystemState, requireIntegrity bool) (string, error) {
 	backends, err := gallery.ListSystemBackends(systemState)
 	if err != nil {
 		return "", err
@@ -33,7 +33,7 @@ func findBackendPath(name, galleries string, systemState *system.SystemState) (s
 		xlog.Error("failed loading galleries", "error", err)
 		return "", err
 	}
-	if err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, name, nil, true); err != nil {
+	if err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, name, nil, true, requireIntegrity); err != nil {
 		xlog.Error("backend not found, failed to install it", "name", name, "error", err)
 		return "", err
 	}
--- a/core/cli/worker/worker_llamacpp.go
+++ b/core/cli/worker/worker_llamacpp.go
@@ -27,7 +27,7 @@ const (
 	llamaCPPGalleryName   = "llama-cpp"
 )

-func findLLamaCPPBackend(galleries string, systemState *system.SystemState) (string, error) {
+func findLLamaCPPBackend(galleries string, systemState *system.SystemState, requireIntegrity bool) (string, error) {
 	backends, err := gallery.ListSystemBackends(systemState)
 	if err != nil {
 		xlog.Warn("Failed listing system backends", "error", err)
@@ -43,7 +43,7 @@ func findLLamaCPPBackend(galleries string, systemState *system.SystemState) (str
 			xlog.Error("failed loading galleries", "error", err)
 			return "", err
 		}
-		err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, llamaCPPGalleryName, nil, true)
+		err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, llamaCPPGalleryName, nil, true, requireIntegrity)
 		if err != nil {
 			xlog.Error("llama-cpp backend not found, failed to install it", "error", err)
 			return "", err
@@ -76,7 +76,7 @@ func (r *LLamaCPP) Run(ctx *cliContext.Context) error {
 	if err != nil {
 		return err
 	}
-	grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState)
+	grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
 	if err != nil {
 		return err
 	}
--- a/core/cli/worker/worker_mlx_common.go
+++ b/core/cli/worker/worker_mlx_common.go
@@ -9,8 +9,8 @@ import (

 const mlxDistributedGalleryName = "mlx-distributed"

-func findMLXDistributedBackendPath(galleries string, systemState *system.SystemState) (string, error) {
-	return findBackendPath(mlxDistributedGalleryName, galleries, systemState)
+func findMLXDistributedBackendPath(galleries string, systemState *system.SystemState, requireIntegrity bool) (string, error) {
+	return findBackendPath(mlxDistributedGalleryName, galleries, systemState, requireIntegrity)
 }

 // buildMLXCommand builds the exec.Cmd to launch the mlx-distributed backend.
--- a/core/cli/worker/worker_mlx_distributed.go
+++ b/core/cli/worker/worker_mlx_distributed.go
@@ -28,7 +28,7 @@ func (r *MLXDistributed) Run(ctx *cliContext.Context) error {
 		return err
 	}

-	backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState)
+	backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
 	if err != nil {
 		return fmt.Errorf("cannot find mlx-distributed backend: %w", err)
 	}
--- a/core/cli/worker/worker_p2p.go
+++ b/core/cli/worker/worker_p2p.go
@@ -73,7 +73,7 @@ func (r *P2P) Run(ctx *cliContext.Context) error {
 			for {
 				xlog.Info("Starting llama-cpp-rpc-server", "address", address, "port", port)

-				grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState)
+				grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
 				if err != nil {
 					xlog.Error("Failed to find llama-cpp-rpc-server", "error", err)
 					return
--- a/core/cli/worker/worker_p2p_mlx.go
+++ b/core/cli/worker/worker_p2p_mlx.go
@@ -48,7 +48,7 @@ func (r *P2PMLX) Run(ctx *cliContext.Context) error {
 	c, cancel := context.WithCancel(context.Background())
 	defer cancel()

-	backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState)
+	backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
 	if err != nil {
 		xlog.Warn("Could not find mlx-distributed backend from gallery, will try backend.py directly", "error", err)
 	}
--- a/core/cli/worker/worker_vllm.go
+++ b/core/cli/worker/worker_vllm.go
@@ -77,7 +77,7 @@ func (r *VLLMDistributed) Run(ctx *cliContext.Context) error {
 		return fmt.Errorf("getting system state: %w", err)
 	}

-	backendPath, err := findBackendPath("vllm", r.BackendGalleries, systemState)
+	backendPath, err := findBackendPath("vllm", r.BackendGalleries, systemState, r.RequireBackendIntegrity)
 	if err != nil {
 		return fmt.Errorf("cannot find vllm backend: %w", err)
 	}
--- a/core/config/application_config.go
+++ b/core/config/application_config.go
@@ -60,6 +60,13 @@ type ApplicationConfig struct {
 	AutoUpgradeBackends                         bool
 	PreferDevelopmentBackends                   bool

+	// RequireBackendIntegrity promotes a missing SHA256 (tarball/HTTP URIs)
+	// or missing verification policy (OCI URIs) from a warning to a hard
+	// failure during backend install/upgrade. Off by default to keep
+	// upgrades non-breaking; operators opt in explicitly via
+	// --require-backend-integrity / LOCALAI_REQUIRE_BACKEND_INTEGRITY.
+	RequireBackendIntegrity bool
+
 	SingleBackend           bool // Deprecated: use MaxActiveBackends = 1 instead
 	MaxActiveBackends       int  // Maximum number of active backends (0 = unlimited, 1 = single backend mode)
 	WatchDogIdle bool
@@ -436,6 +443,10 @@ func WithAutoUpgradeBackends(v bool) AppOption {
 	return func(o *ApplicationConfig) { o.AutoUpgradeBackends = v }
 }

+func WithRequireBackendIntegrity(v bool) AppOption {
+	return func(o *ApplicationConfig) { o.RequireBackendIntegrity = v }
+}
+
 func WithPreferDevelopmentBackends(v bool) AppOption {
 	return func(o *ApplicationConfig) { o.PreferDevelopmentBackends = v }
 }
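
The `WithRequireBackendIntegrity` option above follows the functional-options pattern already used by `WithAutoUpgradeBackends`. A minimal sketch of how such options compose, assuming a trimmed-down config: `appConfig`, `appOption` and `newAppConfig` are hypothetical stand-ins for illustration, not the real `ApplicationConfig` constructor.

```go
package main

import "fmt"

// appConfig is a hypothetical, trimmed-down stand-in for ApplicationConfig.
type appConfig struct {
	AutoUpgradeBackends     bool
	RequireBackendIntegrity bool
}

// appOption mirrors the shape of AppOption in the diff.
type appOption func(*appConfig)

func withAutoUpgradeBackends(v bool) appOption {
	return func(o *appConfig) { o.AutoUpgradeBackends = v }
}

func withRequireBackendIntegrity(v bool) appOption {
	return func(o *appConfig) { o.RequireBackendIntegrity = v }
}

// newAppConfig applies the options in order over zero-value defaults, so
// strict integrity stays off unless a caller opts in explicitly.
func newAppConfig(opts ...appOption) *appConfig {
	cfg := &appConfig{}
	for _, opt := range opts {
		opt(cfg)
	}
	return cfg
}

func main() {
	fmt.Println(newAppConfig().RequireBackendIntegrity)
	fmt.Println(newAppConfig(withAutoUpgradeBackends(true), withRequireBackendIntegrity(true)).RequireBackendIntegrity)
}
```

Because the zero value of a `bool` is `false`, the "off by default" behaviour described in the field comment falls out of the pattern without extra code.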
--- a/core/config/backend_capabilities.go
+++ b/core/config/backend_capabilities.go
@@ -24,6 +24,7 @@ const (
 	UsecaseVAD             = "vad"
 	UsecaseAudioTransform  = "audio_transform"
 	UsecaseDiarization     = "diarization"
+	UsecaseRealtimeAudio   = "realtime_audio"
 )

 // GRPCMethod identifies a Backend service RPC from backend.proto.
@@ -45,6 +46,7 @@ const (
 	MethodVAD                GRPCMethod = "VAD"
 	MethodAudioTransform     GRPCMethod = "AudioTransform"
 	MethodDiarize            GRPCMethod = "Diarize"
+	MethodAudioToAudioStream GRPCMethod = "AudioToAudioStream"
 )

 // UsecaseInfo describes a single known_usecase value and how it maps
@@ -147,6 +149,11 @@ var UsecaseInfoMap = map[string]UsecaseInfo{
 		GRPCMethod:  MethodDiarize,
 		Description: "Speaker diarization (who-spoke-when, per-speaker segments) via the Diarize RPC.",
 	},
+	UsecaseRealtimeAudio: {
+		Flag:        FLAG_REALTIME_AUDIO,
+		GRPCMethod:  MethodAudioToAudioStream,
+		Description: "Self-contained any-to-any audio model for the Realtime API — accepts microphone audio and emits speech + transcript (+ optional function calls) from a single backend via the AudioToAudioStream RPC.",
+	},
 }

 // BackendCapability describes which gRPC methods and usecases a backend supports.
@@ -397,6 +404,15 @@ var BackendCapabilities = map[string]BackendCapability{
 		Description:      "Meta MusicGen via transformers — music generation from text",
 	},

+	// --- Any-to-any audio backends ---
+	"liquid-audio": {
+		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodAudioTranscription, MethodTTS, MethodAudioToAudioStream, MethodVAD},
+		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTranscript, UsecaseTTS, UsecaseRealtimeAudio, UsecaseVAD},
+		DefaultUsecases:  []string{UsecaseRealtimeAudio, UsecaseChat, UsecaseTranscript, UsecaseTTS, UsecaseVAD},
+		AcceptsAudios:    true,
+		Description:      "LFM2 / LFM2.5-Audio — self-contained any-to-any audio model for the Realtime API; also exposes chat, transcription, TTS and a stub energy-based VAD endpoint",
+	},
+
 	// --- Audio transform backends ---
 	"localvqe": {
 		GRPCMethods:      []GRPCMethod{MethodAudioTransform},
--- a/core/config/distributed_config.go
+++ b/core/config/distributed_config.go
@@ -31,7 +31,15 @@ type DistributedConfig struct {
 	DrainTimeout        time.Duration // Time to wait for in-flight requests during drain (default 30s)
 	HealthCheckInterval time.Duration // Health monitor check interval (default 15s)
 	StaleNodeThreshold  time.Duration // Time before a node is considered stale (default 60s)
-	PerModelHealthCheck bool          // Enable per-model backend health checking (default false)
+	// DisablePerModelHealthCheck turns off the health monitor's per-model
+	// gRPC probe. With the probe on (the default), the monitor pings each model's
+	// gRPC address and removes stale node_models rows whose backend has
+	// crashed even though the worker's node-level heartbeat is still arriving.
+	// Without per-model probing, /embeddings and /completions can be dispatched
+	// to a backend that silently returns garbage (see also the cascading
+	// model-row cleanup on MarkUnhealthy / MarkDraining).
+	DisablePerModelHealthCheck bool
+
 	MCPCIJobTimeout     time.Duration // MCP CI job execution timeout (default 10m)

 	MaxUploadSize int64 // Maximum upload body size in bytes (default 50 GB)
--- a/core/config/gallery.go
+++ b/core/config/gallery.go
@@ -1,6 +1,37 @@
 package config

-type Gallery struct {
-	URL  string `json:"url" yaml:"url"`
-	Name string `json:"name" yaml:"name"`
+// GalleryVerification declares the keyless-cosign signature policy that
+// every OCI backend image fetched from this gallery must satisfy.
+//
+// Verification is opt-in: galleries without a Verification block install
+// backends with no signature check (a warning is logged when
+// LOCALAI_REQUIRE_BACKEND_INTEGRITY is unset; that flag turns the warning
+// into a hard error).
+//
+// Identity matching: set Issuer (exact) or IssuerRegex, AND Identity
+// (exact) or IdentityRegex. For GitHub Actions keyless signing the
+// typical shape is:
+//
+//	verification:
+//	  issuer: "https://token.actions.githubusercontent.com"
+//	  identity_regex: "^https://github\\.com/mudler/local-ai-backends/\\.github/workflows/build\\.yaml@refs/heads/master$"
+//	  not_before: "2026-05-01T00:00:00Z"
+//
+// NotBefore is the revocation lever: advance it to invalidate every
+// signature produced before a known compromise window. Keyless cosign
+// certs are ephemeral so there is no CA-side revocation.
+type GalleryVerification struct {
+	Issuer        string `json:"issuer,omitempty" yaml:"issuer,omitempty"`
+	IssuerRegex   string `json:"issuer_regex,omitempty" yaml:"issuer_regex,omitempty"`
+	Identity      string `json:"identity,omitempty" yaml:"identity,omitempty"`
+	IdentityRegex string `json:"identity_regex,omitempty" yaml:"identity_regex,omitempty"`
+
+	// NotBefore is an RFC3339 timestamp. Empty disables the time check.
+	NotBefore string `json:"not_before,omitempty" yaml:"not_before,omitempty"`
+}
+
+type Gallery struct {
+	URL          string               `json:"url" yaml:"url"`
+	Name         string               `json:"name" yaml:"name"`
+	Verification *GalleryVerification `json:"verification,omitempty" yaml:"verification,omitempty"`
 }
--- a/core/config/gguf.go
+++ b/core/config/gguf.go
@@ -54,6 +54,13 @@ func guessGGUFFromFile(cfg *ModelConfig, f *gguf.GGUFFile, defaultCtx int) {
 		cfg.modelTemplate = chatTemplate.ValueString()
 	}

+	// Auto-enable Multi-Token Prediction (ggml-org/llama.cpp#22673) when the
+	// GGUF carries an embedded MTP head. No-op for non-MTP models and, with
+	// a debug log line, when the user already configured a spec_type.
+	if n, ok := HasEmbeddedMTPHead(f); ok {
+		ApplyMTPDefaults(cfg, n)
+	}
+
 	// Thinking support detection is done after model load via DetectThinkingSupportFromBackend

 	// template estimations
--- a/core/config/model_config.go
+++ b/core/config/model_config.go
@@ -636,6 +636,7 @@ const (
 	FLAG_SPEAKER_RECOGNITION ModelConfigUsecase = 0b1000000000000000
 	FLAG_AUDIO_TRANSFORM     ModelConfigUsecase = 0b10000000000000000
 	FLAG_DIARIZATION         ModelConfigUsecase = 0b100000000000000000
+	FLAG_REALTIME_AUDIO      ModelConfigUsecase = 0b1000000000000000000

 	// Common Subsets
 	FLAG_LLM ModelConfigUsecase = FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT
@@ -645,12 +646,12 @@ const (
 // Flags within the same group are NOT orthogonal (e.g., chat and completion are
 // both text/language). A model is multimodal when its usecases span 2+ groups.
 var ModalityGroups = []ModelConfigUsecase{
-	FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT, // text/language
-	FLAG_VISION | FLAG_DETECTION,            // visual understanding
-	FLAG_TRANSCRIPT,                         // speech input
-	FLAG_TTS | FLAG_SOUND_GENERATION,        // audio output
-	FLAG_AUDIO_TRANSFORM,                    // audio in/out transforms
-	FLAG_IMAGE | FLAG_VIDEO,                 // visual generation
+	FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT,    // text/language
+	FLAG_VISION | FLAG_DETECTION,               // visual understanding
+	FLAG_TRANSCRIPT | FLAG_REALTIME_AUDIO,      // speech input — realtime_audio is any-to-any, so it counts here too
+	FLAG_TTS | FLAG_SOUND_GENERATION | FLAG_REALTIME_AUDIO, // audio output — and here, so a lone realtime_audio flag still reads as multimodal
+	FLAG_AUDIO_TRANSFORM,                       // audio in/out transforms
+	FLAG_IMAGE | FLAG_VIDEO,                    // visual generation
 }

 // IsMultimodal returns true if the given usecases span two or more orthogonal
@@ -692,6 +693,7 @@ func GetAllModelConfigUsecases() map[string]ModelConfigUsecase {
 		"FLAG_SPEAKER_RECOGNITION": FLAG_SPEAKER_RECOGNITION,
 		"FLAG_AUDIO_TRANSFORM":     FLAG_AUDIO_TRANSFORM,
 		"FLAG_DIARIZATION":         FLAG_DIARIZATION,
+		"FLAG_REALTIME_AUDIO":      FLAG_REALTIME_AUDIO,
 	}
 }

@@ -866,6 +868,16 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
 		}
 	}

+	if (u & FLAG_REALTIME_AUDIO) == FLAG_REALTIME_AUDIO {
+		// Backends that own a single any-to-any loop and implement
+		// AudioToAudioStream — listed here so models without an explicit
+		// known_usecases still surface on the Talk page.
+		realtimeAudioBackends := []string{"liquid-audio"}
+		if !slices.Contains(realtimeAudioBackends, c.Backend) {
+			return false
+		}
+	}
+
 	return true
 }

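
The `ModalityGroups` change places `FLAG_REALTIME_AUDIO` in both the speech-input and audio-output groups so that a lone flag still spans two groups. A minimal sketch of that counting rule, assuming small illustrative bit values: `usecase`, the `flag*` constants and `isMultimodal` here are hypothetical and do not reproduce the real constants or the body of `IsMultimodal`, which is not shown in this diff.

```go
package main

import "fmt"

type usecase uint32

// Illustrative bit values only; the real FLAG_* constants differ.
const (
	flagChat usecase = 1 << iota
	flagTranscript
	flagTTS
	flagRealtimeAudio
)

// groups mirrors the idea of ModalityGroups: flags inside one group are not
// orthogonal, and realtime audio is listed under both input and output.
var groups = []usecase{
	flagChat,                           // text/language
	flagTranscript | flagRealtimeAudio, // speech input
	flagTTS | flagRealtimeAudio,        // audio output
}

// isMultimodal reports whether u touches two or more groups.
func isMultimodal(u usecase) bool {
	hits := 0
	for _, g := range groups {
		if u&g != 0 {
			hits++
		}
	}
	return hits >= 2
}

func main() {
	fmt.Println(isMultimodal(flagChat))
	fmt.Println(isMultimodal(flagRealtimeAudio))
	fmt.Println(isMultimodal(flagChat | flagTTS))
}
```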
--- a/core/config/mtp.go
+++ b/core/config/mtp.go
@@ -0,0 +1,84 @@
+package config
+
+import (
+	"strings"
+
+	gguf "github.com/gpustack/gguf-parser-go"
+	"github.com/mudler/xlog"
+)
+
+// mtpSpecOptions lists the speculative-decoding option keys auto-applied when
+// an MTP head is detected on a llama-cpp GGUF. Defaults track the upstream
+// MTP PR (ggml-org/llama.cpp#22673):
+//
+//   - spec_type:draft-mtp      activates Multi-Token Prediction
+//   - spec_n_max:6             draft window
+//   - spec_p_min:0.75          pinned because upstream marked the 0.75 default
+//     with a "change to 0.0f" TODO; locking it here keeps acceptance
+//     thresholds stable across future bumps
+var mtpSpecOptions = []string{
+	"spec_type:draft-mtp",
+	"spec_n_max:6",
+	"spec_p_min:0.75",
+}
+
+// MTPSpecOptions returns a copy of the option keys auto-applied when an MTP
+// head is detected. Exported for testing and for the importer.
+func MTPSpecOptions() []string {
+	out := make([]string, len(mtpSpecOptions))
+	copy(out, mtpSpecOptions)
+	return out
+}
+
+// HasEmbeddedMTPHead reports whether the parsed GGUF declares a Multi-Token
+// Prediction head. Detection reads `<arch>.nextn_predict_layers`, which is
+// what `gguf_writer.add_nextn_predict_layers(n)` emits in upstream's
+// `conversion/qwen.py` MTP mixin. A positive layer count means the head is
+// present in the same GGUF as the trunk.
+func HasEmbeddedMTPHead(f *gguf.GGUFFile) (uint32, bool) {
+	if f == nil {
+		return 0, false
+	}
+	arch := f.Architecture().Architecture
+	if arch == "" {
+		return 0, false
+	}
+	v, ok := f.Header.MetadataKV.Get(arch + ".nextn_predict_layers")
+	if !ok {
+		return 0, false
+	}
+	n := gguf.ValueNumeric[uint32](v)
+	return n, n > 0
+}
+
+// hasSpecTypeOption returns true when the slice already contains a
+// user-configured `spec_type:` / `speculative_type:` entry. Used to avoid
+// clobbering an explicit choice with the MTP auto-defaults.
+func hasSpecTypeOption(opts []string) bool {
+	for _, o := range opts {
+		if strings.HasPrefix(o, "spec_type:") || strings.HasPrefix(o, "speculative_type:") {
+			return true
+		}
+	}
+	return false
+}
+
+// ApplyMTPDefaults appends the auto-MTP option keys to cfg.Options when none
+// is already configured. It is a no-op when the user already picked a
+// `spec_type` (either via YAML or via the importer's preferences flow).
+//
+// `layers` is the value read from `<arch>.nextn_predict_layers` and is only
+// used for the diagnostic log line.
+func ApplyMTPDefaults(cfg *ModelConfig, layers uint32) {
+	if cfg == nil {
+		return
+	}
+	if hasSpecTypeOption(cfg.Options) {
+		xlog.Debug("[mtp] embedded MTP head detected but spec_type already configured; leaving user choice intact",
+			"name", cfg.Name, "nextn_layers", layers)
+		return
+	}
+	cfg.Options = append(cfg.Options, mtpSpecOptions...)
+	xlog.Info("[mtp] embedded MTP head detected; enabling draft-mtp speculative decoding",
+		"name", cfg.Name, "nextn_layers", layers, "spec_n_max", 6, "spec_p_min", 0.75)
+}
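
One consequence of the `hasSpecTypeOption` guard is that applying the defaults twice is harmless: the first call adds `spec_type:draft-mtp`, which the second call then detects. A standalone sketch of that property, using local copies of the logic: `defaults`, `hasSpecType` and `applyDefaults` are hypothetical names operating on a plain slice rather than on `ModelConfig`.

```go
package main

import (
	"fmt"
	"strings"
)

// defaults copies the option keys listed in mtpSpecOptions.
var defaults = []string{"spec_type:draft-mtp", "spec_n_max:6", "spec_p_min:0.75"}

// hasSpecType mirrors the prefix check for spec_type / speculative_type.
func hasSpecType(opts []string) bool {
	for _, o := range opts {
		if strings.HasPrefix(o, "spec_type:") || strings.HasPrefix(o, "speculative_type:") {
			return true
		}
	}
	return false
}

// applyDefaults appends the MTP defaults unless a speculative type is
// already present, in which case the input is returned unchanged.
func applyDefaults(opts []string) []string {
	if hasSpecType(opts) {
		return opts
	}
	return append(opts, defaults...)
}

func main() {
	once := applyDefaults([]string{"use_jinja:true"})
	twice := applyDefaults(once)
	fmt.Println(len(once), len(twice))
	fmt.Println(twice)
}
```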
--- a/core/config/mtp_test.go
+++ b/core/config/mtp_test.go
@@ -0,0 +1,86 @@
+package config_test
+
+import (
+	. "github.com/mudler/LocalAI/core/config"
+
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("MTP auto-defaults", func() {
+	Context("MTPSpecOptions", func() {
+		It("returns the upstream-recommended speculative tuple", func() {
+			Expect(MTPSpecOptions()).To(Equal([]string{
+				"spec_type:draft-mtp",
+				"spec_n_max:6",
+				"spec_p_min:0.75",
+			}))
+		})
+
+		It("returns a defensive copy so callers cannot mutate the package default", func() {
+			opts := MTPSpecOptions()
+			opts[0] = "spec_type:none"
+			Expect(MTPSpecOptions()[0]).To(Equal("spec_type:draft-mtp"))
+		})
+	})
+
+	Context("ApplyMTPDefaults", func() {
+		It("appends MTP options when nothing is configured", func() {
+			cfg := &ModelConfig{Name: "qwen-mtp"}
+			ApplyMTPDefaults(cfg, 1)
+			Expect(cfg.Options).To(Equal([]string{
+				"spec_type:draft-mtp",
+				"spec_n_max:6",
+				"spec_p_min:0.75",
+			}))
+		})
+
+		It("preserves unrelated options already on the config", func() {
+			cfg := &ModelConfig{
+				Name:    "qwen-mtp",
+				Options: []string{"use_jinja:true", "cache_reuse:256"},
+			}
+			ApplyMTPDefaults(cfg, 1)
+			Expect(cfg.Options).To(Equal([]string{
+				"use_jinja:true",
+				"cache_reuse:256",
+				"spec_type:draft-mtp",
+				"spec_n_max:6",
+				"spec_p_min:0.75",
+			}))
+		})
+
+		It("is a no-op when the user already configured spec_type", func() {
+			cfg := &ModelConfig{
+				Name:    "qwen-mtp",
+				Options: []string{"spec_type:ngram-simple", "use_jinja:true"},
+			}
+			ApplyMTPDefaults(cfg, 1)
+			Expect(cfg.Options).To(Equal([]string{
+				"spec_type:ngram-simple",
+				"use_jinja:true",
+			}))
+		})
+
+		It("also respects the legacy speculative_type alias", func() {
+			cfg := &ModelConfig{
+				Name:    "qwen-mtp",
+				Options: []string{"speculative_type:ngram-mod"},
+			}
+			ApplyMTPDefaults(cfg, 1)
+			Expect(cfg.Options).To(Equal([]string{"speculative_type:ngram-mod"}))
+		})
+
+		It("tolerates a nil config", func() {
+			Expect(func() { ApplyMTPDefaults(nil, 1) }).ToNot(Panic())
+		})
+	})
+
+	Context("HasEmbeddedMTPHead", func() {
+		It("returns false on a nil GGUF file", func() {
+			n, ok := HasEmbeddedMTPHead(nil)
+			Expect(ok).To(BeFalse())
+			Expect(n).To(BeZero())
+		})
+	})
+})
--- a/core/gallery/backends.go
+++ b/core/gallery/backends.go
@@ -16,6 +16,7 @@ import (
 	"github.com/mudler/LocalAI/pkg/downloader"
 	"github.com/mudler/LocalAI/pkg/model"
 	"github.com/mudler/LocalAI/pkg/oci"
+	"github.com/mudler/LocalAI/pkg/oci/cosignverify"
 	"github.com/mudler/LocalAI/pkg/system"
 	"github.com/mudler/xlog"
 	cp "github.com/otiai10/copy"
@@ -102,8 +103,81 @@ func writeBackendMetadata(backendPath string, metadata *BackendMetadata) error {
 	return nil
 }

+// backendDownloadOptions translates the gallery's verification policy into
+// downloader options, and gates the call on strict-integrity mode. Both
+// InstallBackend and UpgradeBackend MUST route their download through these
+// options — without them, the corresponding code path silently downloads
+// and activates unverified backend bytes even when the gallery has a
+// verification: policy configured.
+//
+// For OCI URIs with a verification policy, returns a slice containing
+// downloader.WithImageVerifier(v) — the downloader will then run cosign
+// signature verification between fetching the manifest and extracting
+// layers (see pkg/downloader/uri.go OCI branch).
+//
+// For OCI URIs without a verification policy, or non-OCI URIs without a
+// SHA256, the function either lets the install proceed with a warning and no
+// extra options (requireIntegrity false) or returns an error (requireIntegrity true).
+func backendDownloadOptions(config *GalleryBackend, requireIntegrity bool) ([]downloader.DownloadOption, error) {
+	uri := downloader.URI(config.URI)
+	hasVerification := config.Gallery.Verification != nil
+	hasSHA := config.SHA256 != ""
+
+	switch {
+	case uri.LooksLikeOCI():
+		if !hasVerification {
+			if requireIntegrity {
+				return nil, fmt.Errorf("strict integrity: gallery %q has no verification policy for OCI backend %q (set verification: in the gallery YAML or disable --require-backend-integrity)",
+					config.Gallery.Name, config.Name)
+			}
+			xlog.Warn("installing OCI backend without signature verification",
+				"backend", config.Name, "gallery", config.Gallery.Name, "uri", config.URI)
+			return nil, nil
+		}
+		v, err := newGalleryVerifier(config.Gallery.Verification)
+		if err != nil {
+			return nil, fmt.Errorf("gallery %q verification policy: %w", config.Gallery.Name, err)
+		}
+		return []downloader.DownloadOption{downloader.WithImageVerifier(v)}, nil
+
+	case uri.LooksLikeDir():
+		// Local directory — out of scope for integrity checks.
+		return nil, nil
+
+	default:
+		if !hasSHA && requireIntegrity {
+			return nil, fmt.Errorf("strict integrity: backend %q has no SHA256 (gallery %q)",
+				config.Name, config.Gallery.Name)
+		}
+		// Non-strict: pkg/downloader already emits a warning when sha is empty.
+		return nil, nil
+	}
+}
+
+// newGalleryVerifier constructs a cosignverify.Verifier from the gallery
+// policy. Parses NotBefore (RFC3339) here so YAML errors surface at install
+// time rather than during signature verification.
+func newGalleryVerifier(p *config.GalleryVerification) (*cosignverify.Verifier, error) {
+	pol := cosignverify.Policy{
+		Issuer:        p.Issuer,
+		IssuerRegex:   p.IssuerRegex,
+		Identity:      p.Identity,
+		IdentityRegex: p.IdentityRegex,
+	}
+	if p.NotBefore != "" {
+		t, err := time.Parse(time.RFC3339, p.NotBefore)
+		if err != nil {
+			return nil, fmt.Errorf("not_before %q: %w", p.NotBefore, err)
+		}
+		pol.NotBefore = t
+	}
+	return cosignverify.NewVerifier(pol, nil, nil)
+}
+
 // InstallBackendFromGallery installs a backend from the gallery.
-func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, modelLoader *model.ModelLoader, name string, downloadStatus func(string, string, string, float64), force bool) error {
+// requireIntegrity escalates a missing SHA256 / verification policy from a
+// warning to a hard failure (see backendDownloadOptions).
+func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, modelLoader *model.ModelLoader, name string, downloadStatus func(string, string, string, float64), force, requireIntegrity bool) error {
 	if !force {
 		// check if we already have the backend installed
 		backends, err := ListSystemBackends(systemState)
@@ -149,7 +223,7 @@ func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery,
 		xlog.Debug("Installing backend from meta backend", "name", name, "bestBackend", bestBackend.Name)

 		// Then, let's install the best backend
-		if err := InstallBackend(ctx, systemState, modelLoader, bestBackend, downloadStatus); err != nil {
+		if err := InstallBackend(ctx, systemState, modelLoader, bestBackend, downloadStatus, requireIntegrity); err != nil {
 			return err
 		}

@@ -175,10 +249,10 @@ func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery,
 		return nil
 	}

-	return InstallBackend(ctx, systemState, modelLoader, backend, downloadStatus)
+	return InstallBackend(ctx, systemState, modelLoader, backend, downloadStatus, requireIntegrity)
 }

-func InstallBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, config *GalleryBackend, downloadStatus func(string, string, string, float64)) error {
+func InstallBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, config *GalleryBackend, downloadStatus func(string, string, string, float64), requireIntegrity bool) error {
 	// Get configurable fallback tag values from SystemState
 	latestTag, masterTag, devSuffix := getFallbackTagValues(systemState)

@@ -213,6 +287,14 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
 		return fmt.Errorf("failed to create base path: %v", err)
 	}

+	// Build the download options once and reuse for every retry path —
+	// mirrors and tag fallbacks must verify against the same gallery
+	// policy or we open a hole where a non-default URI bypasses the check.
+	downloadOpts, optsErr := backendDownloadOptions(config, requireIntegrity)
+	if optsErr != nil {
+		return fmt.Errorf("backend %q: %w", config.Name, optsErr)
+	}
+
 	uri := downloader.URI(config.URI)
 	// Check if it is a directory
 	if uri.LooksLikeDir() {
@@ -222,7 +304,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
 		}
 	} else {
 		xlog.Debug("Downloading backend", "uri", config.URI, "backendPath", backendPath)
-		if err := uri.DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err != nil {
+		if err := uri.DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err != nil {
 			xlog.Debug("Backend download failed, trying fallback", "backendPath", backendPath, "error", err)

 			// resetBackendPath cleans up partial state from a failed OCI extraction
@@ -243,7 +325,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
 				default:
 				}
 				resetBackendPath()
-				if err := downloader.URI(mirror).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err == nil {
+				if err := downloader.URI(mirror).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err == nil {
 					success = true
 					xlog.Debug("Downloaded backend from mirror", "uri", config.URI, "backendPath", backendPath)
 					break
@@ -256,7 +338,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
 				if fallbackURI != string(config.URI) {
 					resetBackendPath()
 					xlog.Info("Trying fallback URI", "original", config.URI, "fallback", fallbackURI)
-					if err := downloader.URI(fallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err == nil {
+					if err := downloader.URI(fallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err == nil {
 						xlog.Info("Downloaded backend using fallback URI", "uri", fallbackURI, "backendPath", backendPath)
 						success = true
 					} else {
@@ -265,7 +347,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
 							resetBackendPath()
 							devFallbackURI := fallbackURI + "-" + devSuffix
 							xlog.Info("Trying development fallback URI", "fallback", devFallbackURI)
-							if err := downloader.URI(devFallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err == nil {
+							if err := downloader.URI(devFallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err == nil {
 								xlog.Info("Downloaded backend using development fallback URI", "uri", devFallbackURI, "backendPath", backendPath)
 								success = true
 							} else {
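
The branching in `backendDownloadOptions` reduces to a small decision table over the URI kind, the presence of a policy or checksum, and the strict flag. The sketch below restates that table as a pure function so the cases can be read side by side. `uriKind`, `outcome`, `decide` and `errStrict` are hypothetical names introduced for illustration; the real function returns downloader options rather than an enum.

```go
package main

import (
	"errors"
	"fmt"
)

type uriKind int

const (
	kindOCI   uriKind = iota
	kindDir           // local directory, out of scope for integrity checks
	kindOther         // tarball / HTTP
)

type outcome int

const (
	attachVerifier outcome = iota // OCI with a policy: verify the signature
	noExtraOptions                // proceed; any SHA256 is checked by the downloader
)

var errStrict = errors.New("strict integrity: missing verification policy or SHA256")

// decide restates the switch in backendDownloadOptions as a pure function.
func decide(kind uriKind, hasVerification, hasSHA, requireIntegrity bool) (outcome, error) {
	switch kind {
	case kindOCI:
		if hasVerification {
			return attachVerifier, nil
		}
		if requireIntegrity {
			return 0, errStrict
		}
		return noExtraOptions, nil
	case kindDir:
		return noExtraOptions, nil
	default:
		if !hasSHA && requireIntegrity {
			return 0, errStrict
		}
		return noExtraOptions, nil
	}
}

func main() {
	fmt.Println(decide(kindOCI, true, false, true))
	fmt.Println(decide(kindOCI, false, false, true))
	fmt.Println(decide(kindOther, false, false, false))
}
```

Computing the decision once and reusing it for mirrors and tag fallbacks, as the diff does with `downloadOpts`, keeps every retry path under the same policy.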
--- a/core/gallery/backends_test.go
+++ b/core/gallery/backends_test.go
@@ -117,13 +117,13 @@ var _ = Describe("Gallery Backends", func() {

 	Describe("InstallBackendFromGallery", func() {
 		It("should return error when backend is not found", func() {
-			err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "non-existent", nil, true)
+			err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "non-existent", nil, true, false)
 			Expect(err).To(HaveOccurred())
 			Expect(err.Error()).To(ContainSubstring("no backend found with name \"non-existent\""))
 		})

 		It("should install backend from gallery", func() {
-			err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "test-backend", nil, true)
+			err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "test-backend", nil, true, false)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(filepath.Join(tempDir, "test-backend", "run.sh")).To(BeARegularFile())
 		})
@@ -545,7 +545,7 @@ var _ = Describe("Gallery Backends", func() {
 				VRAM:      1000000000000,
 				Backend:   system.Backend{BackendsPath: tempDir},
 			}
-			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true)
+			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true, false)
 			Expect(err).NotTo(HaveOccurred())

 			metaBackendPath := filepath.Join(tempDir, "meta-backend")
@@ -625,7 +625,7 @@ var _ = Describe("Gallery Backends", func() {
 				VRAM:      1000000000000,
 				Backend:   system.Backend{BackendsPath: tempDir},
 			}
-			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true)
+			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true, false)
 			Expect(err).NotTo(HaveOccurred())

 			metaBackendPath := filepath.Join(tempDir, "meta-backend")
@@ -709,7 +709,7 @@ var _ = Describe("Gallery Backends", func() {
 				VRAM:      1000000000000,
 				Backend:   system.Backend{BackendsPath: tempDir},
 			}
-			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true)
+			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true, false)
 			Expect(err).NotTo(HaveOccurred())

 			metaBackendPath := filepath.Join(tempDir, "meta-backend")
@@ -808,7 +808,7 @@ var _ = Describe("Gallery Backends", func() {
 				system.WithBackendPath(newPath),
 			)
 			Expect(err).NotTo(HaveOccurred())
-			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil)
+			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
 			Expect(newPath).To(BeADirectory())
 			Expect(err).To(HaveOccurred()) // Will fail due to invalid URI, but path should be created
 		})
@@ -840,7 +840,7 @@ var _ = Describe("Gallery Backends", func() {
 				system.WithBackendPath(tempDir),
 			)
 			Expect(err).NotTo(HaveOccurred())
-			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil)
+			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile())
 			dat, err := os.ReadFile(filepath.Join(tempDir, "test-backend", "metadata.json"))
@@ -873,7 +873,7 @@ var _ = Describe("Gallery Backends", func() {

 			Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).ToNot(BeARegularFile())

-			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil)
+			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile())
 		})
@@ -894,7 +894,7 @@ var _ = Describe("Gallery Backends", func() {
 				system.WithBackendPath(tempDir),
 			)
 			Expect(err).NotTo(HaveOccurred())
-			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil)
+			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile())

--- a/core/gallery/backends_version_test.go
+++ b/core/gallery/backends_version_test.go
@@ -47,7 +47,7 @@ var _ = Describe("Backend versioning", func() {
 		backend.URI = srcDir
 		backend.Version = "1.2.3"

-		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil)
+		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil, false)
 		Expect(err).NotTo(HaveOccurred())

 		// Read the metadata file and check version
@@ -74,7 +74,7 @@ var _ = Describe("Backend versioning", func() {
 		backend.URI = srcDir
 		backend.Version = "2.0.0"

-		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil)
+		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil, false)
 		Expect(err).NotTo(HaveOccurred())

 		metadataPath := filepath.Join(tempDir, "test-backend-uri", "metadata.json")
@@ -100,7 +100,7 @@ var _ = Describe("Backend versioning", func() {
 		backend.URI = srcDir
 		// Version intentionally left empty

-		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil)
+		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil, false)
 		Expect(err).NotTo(HaveOccurred())

 		metadataPath := filepath.Join(tempDir, "test-backend-noversion", "metadata.json")
--- a/core/gallery/importers/importers.go
+++ b/core/gallery/importers/importers.go
@@ -130,6 +130,8 @@ var defaultImporters = []Importer{
 	// and would otherwise swallow the C++ port's GGUF bundles.
 	&VibeVoiceCppImporter{},
 	&VibeVoiceImporter{},
+	// LiquidAudio (Python) — keep before LlamaCPP so non-GGUF LFM2-Audio repos route here.
+	&LiquidAudioImporter{},
 	&CoquiImporter{},
 	// Image/Video (Batch 3)
 	&StableDiffusionGGMLImporter{},
--- a/core/gallery/importers/liquid-audio.go
+++ b/core/gallery/importers/liquid-audio.go
@@ -0,0 +1,145 @@
+package importers
+
+import (
+	"encoding/json"
+	"path/filepath"
+	"strings"
+
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/gallery"
+	"github.com/mudler/LocalAI/core/schema"
+	"go.yaml.in/yaml/v2"
+)
+
+var _ Importer = &LiquidAudioImporter{}
+
+// LiquidAudioImporter recognises LiquidAI's LFM2-Audio family (LFM2-Audio-1.5B,
+// LFM2.5-Audio-1.5B, community finetunes) and routes them to the Python
+// `liquid-audio` backend. Detection is by repo-name substring so third-party
+// mirrors still match. preferences.backend="liquid-audio" overrides detection.
+//
+// Once upstream llama.cpp PR #18641 lands and the GGUF gallery entries are
+// added, GGUF mirrors of these models should route to llama-cpp; that's
+// handled by ordering LlamaCPPImporter after this one and by the explicit
+// "-gguf" exclusion below.
+type LiquidAudioImporter struct{}
+
+func (i *LiquidAudioImporter) Name() string      { return "liquid-audio" }
+func (i *LiquidAudioImporter) Modality() string  { return "tts" }
+func (i *LiquidAudioImporter) AutoDetects() bool { return true }
+
+func (i *LiquidAudioImporter) Match(details Details) bool {
+	preferences, err := details.Preferences.MarshalJSON()
+	if err != nil {
+		return false
+	}
+	preferencesMap := make(map[string]any)
+	if len(preferences) > 0 {
+		if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
+			return false
+		}
+	}
+
+	if b, ok := preferencesMap["backend"].(string); ok && b == "liquid-audio" {
+		return true
+	}
+
+	matchRepo := func(repo string) bool {
+		r := strings.ToLower(repo)
+		// Cede GGUF mirrors to the (later-ordered) llama-cpp importer.
+		if strings.HasSuffix(r, "-gguf") {
+			return false
+		}
+		return strings.Contains(r, "lfm2-audio") || strings.Contains(r, "lfm2.5-audio")
+	}
+
+	if details.HuggingFace != nil {
+		repoName := details.HuggingFace.ModelID
+		if idx := strings.Index(repoName, "/"); idx >= 0 {
+			repoName = repoName[idx+1:]
+		}
+		if matchRepo(repoName) {
+			return true
+		}
+	}
+
+	if _, repo, ok := HFOwnerRepoFromURI(details.URI); ok {
+		return matchRepo(repo)
+	}
+	return false
+}
+
+func (i *LiquidAudioImporter) Import(details Details) (gallery.ModelConfig, error) {
+	preferences, err := details.Preferences.MarshalJSON()
+	if err != nil {
+		return gallery.ModelConfig{}, err
+	}
+	preferencesMap := make(map[string]any)
+	if len(preferences) > 0 {
+		if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
+			return gallery.ModelConfig{}, err
+		}
+	}
+
+	name, ok := preferencesMap["name"].(string)
+	if !ok {
+		name = filepath.Base(details.URI)
+	}
+
+	description, ok := preferencesMap["description"].(string)
+	if !ok {
+		description = "Imported from " + details.URI
+	}
+
+	model := details.URI
+	if details.HuggingFace != nil && details.HuggingFace.ModelID != "" {
+		model = details.HuggingFace.ModelID
+	}
+
+	// Preferences may pin the mode (chat / asr / tts / s2s / finetune).
+	// Default to s2s — the headline any-to-any use case.
+	mode, _ := preferencesMap["mode"].(string)
+	if mode == "" {
+		mode = "s2s"
+	}
+
+	options := []string{"mode:" + mode}
+	if voice, ok := preferencesMap["voice"].(string); ok && voice != "" {
+		options = append(options, "voice:"+voice)
+	}
+
+	usecases := []string{"chat"}
+	switch mode {
+	case "asr":
+		usecases = []string{"transcript"}
+	case "tts":
+		usecases = []string{"tts"}
+	case "s2s":
+		// realtime_audio surfaces the model on the Talk page; chat/tts/
+		// transcript/vad keep the standalone OpenAI-compatible endpoints
+		// working since liquid-audio implements all of them.
+		usecases = []string{"realtime_audio", "chat", "tts", "transcript", "vad"}
+	}
+
+	modelConfig := config.ModelConfig{
+		Name:                name,
+		Description:         description,
+		Backend:             "liquid-audio",
+		KnownUsecaseStrings: usecases,
+		Options:             options,
+		PredictionOptions: schema.PredictionOptions{
+			BasicModelRequest: schema.BasicModelRequest{Model: model},
+		},
+	}
+
+	data, err := yaml.Marshal(modelConfig)
+	if err != nil {
+		return gallery.ModelConfig{}, err
+	}
+
+	return gallery.ModelConfig{
+		Name:        name,
+		Description: description,
+		ConfigFile:  string(data),
+	}, nil
+}
--- a/core/gallery/importers/liquid-audio_test.go
+++ b/core/gallery/importers/liquid-audio_test.go
@@ -0,0 +1,91 @@
+package importers_test
+
+import (
+	"encoding/json"
+	"fmt"
+
+	"github.com/mudler/LocalAI/core/gallery/importers"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("LiquidAudioImporter", func() {
+	Context("detection from HuggingFace", func() {
+		It("matches LiquidAI/LFM2.5-Audio-1.5B", func() {
+			uri := "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B"
+			preferences := json.RawMessage(`{}`)
+
+			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
+
+			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
+			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: liquid-audio"))
+			Expect(modelConfig.ConfigFile).To(ContainSubstring("LiquidAI/LFM2.5-Audio-1.5B"))
+		})
+
+		It("matches LiquidAI/LFM2-Audio-1.5B (older variant)", func() {
+			uri := "https://huggingface.co/LiquidAI/LFM2-Audio-1.5B"
+			preferences := json.RawMessage(`{}`)
+
+			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
+
+			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
+			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: liquid-audio"))
+		})
+
+		It("cedes -GGUF mirrors to the llama-cpp importer", func() {
+			// LiquidAI/LFM2.5-Audio-1.5B-GGUF should NOT route to liquid-audio.
+			// Once upstream PR #18641 lands and the GGUF gallery entry exists,
+			// this is the path that lets users opt into the C++ runtime.
+			uri := "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B-GGUF"
+			preferences := json.RawMessage(`{}`)
+
+			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
+
+			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
+			Expect(modelConfig.ConfigFile).ToNot(ContainSubstring("backend: liquid-audio"),
+				fmt.Sprintf("GGUF repo should not match Python importer; got: %s", modelConfig.ConfigFile))
+		})
+	})
+
+	Context("preference override", func() {
+		It("honours preferences.backend=liquid-audio for arbitrary URIs", func() {
+			uri := "https://example.com/some-unrelated-model"
+			preferences := json.RawMessage(`{"backend": "liquid-audio"}`)
+
+			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
+
+			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
+			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: liquid-audio"))
+		})
+
+		It("picks up the mode preference", func() {
+			uri := "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B"
+			preferences := json.RawMessage(`{"mode": "asr"}`)
+
+			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
+
+			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
+			Expect(modelConfig.ConfigFile).To(ContainSubstring("mode:asr"))
+			Expect(modelConfig.ConfigFile).To(ContainSubstring("transcript"))
+		})
+
+		It("picks up the voice preference", func() {
+			uri := "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B"
+			preferences := json.RawMessage(`{"mode": "tts", "voice": "uk_male"}`)
+
+			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
+
+			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
+			Expect(modelConfig.ConfigFile).To(ContainSubstring("voice:uk_male"))
+		})
+	})
+
+	Context("Importer interface metadata", func() {
+		It("exposes name/modality/autodetect", func() {
+			imp := &importers.LiquidAudioImporter{}
+			Expect(imp.Name()).To(Equal("liquid-audio"))
+			Expect(imp.Modality()).To(Equal("tts"))
+			Expect(imp.AutoDetects()).To(BeTrue())
+		})
+	})
+})
--- a/core/gallery/importers/llama-cpp.go
+++ b/core/gallery/importers/llama-cpp.go
@@ -1,10 +1,13 @@
 package importers

 import (
+	"context"
 	"encoding/json"
 	"path/filepath"
 	"strings"
+	"time"

+	gguf "github.com/gpustack/gguf-parser-go"
 	"github.com/mudler/LocalAI/core/config"
 	"github.com/mudler/LocalAI/core/gallery"
 	"github.com/mudler/LocalAI/core/schema"
@@ -261,6 +264,13 @@ func (i *LlamaCPPImporter) Import(details Details) (gallery.ModelConfig, error)
 	// Apply per-model-family inference parameter defaults
 	config.ApplyInferenceDefaults(&modelConfig, details.URI)

+	// Auto-detect Multi-Token Prediction heads (ggml-org/llama.cpp#22673) and
+	// enable speculative decoding. Mirrors the load-time hook so freshly
+	// imported configs already carry spec_type:draft-mtp before the model is
+	// ever loaded - users see it in the YAML preview rather than discovering
+	// it after the first start.
+	maybeApplyMTPDefaults(&modelConfig, details, &cfg)
+
 	data, err := yaml.Marshal(modelConfig)
 	if err != nil {
 		return gallery.ModelConfig{}, err
@@ -291,6 +301,85 @@ func pickPreferredGroup(groups []hfapi.ShardGroup, prefs []string) *hfapi.ShardG
 	return &groups[len(groups)-1]
 }

+// maybeApplyMTPDefaults parses the picked GGUF header (range-fetched over
+// HTTP for HF/URL imports) and, if the file declares a Multi-Token Prediction
+// head, appends the auto-MTP option keys to modelConfig.Options. Failures
+// during the probe are non-fatal: the importer keeps the config without MTP
+// so an unrelated network blip or weird header doesn't break the import.
+//
+// OCI/Ollama URIs are skipped because the artifact isn't directly fetchable
+// as a GGUF byte stream - the load-time hook (core/config/gguf.go) covers
+// those once the model is materialised on disk.
+func maybeApplyMTPDefaults(modelConfig *config.ModelConfig, details Details, cfg *gallery.ModelConfig) {
+	probeURL := pickMTPProbeURL(details, cfg)
+	if probeURL == "" {
+		return
+	}
+
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
+
+	defer func() {
+		if r := recover(); r != nil {
+			xlog.Debug("[mtp-importer] panic while probing GGUF header", "uri", probeURL, "recover", r)
+		}
+	}()
+
+	f, err := gguf.ParseGGUFFileRemote(ctx, probeURL)
+	if err != nil {
+		xlog.Debug("[mtp-importer] failed to read remote GGUF header for MTP detection", "uri", probeURL, "error", err)
+		return
+	}
+
+	n, ok := config.HasEmbeddedMTPHead(f)
+	if !ok {
+		return
+	}
+	config.ApplyMTPDefaults(modelConfig, n)
+}
+
+// pickMTPProbeURL returns an HTTP(S) URL pointing at the main (non-mmproj)
+// GGUF shard that should be inspected for an MTP head, or "" when no
+// suitable URL is available. Custom URI schemes (e.g. `huggingface://`)
+// are run through `downloader.URI.ResolveURL` so the resulting URL is
+// something `gguf.ParseGGUFFileRemote` can actually open. OCI/Ollama URIs
+// are skipped because the artifact is not directly streamable as a GGUF
+// byte range.
+func pickMTPProbeURL(details Details, cfg *gallery.ModelConfig) string {
+	uri := downloader.URI(details.URI)
+
+	if uri.LooksLikeOCI() {
+		return ""
+	}
+
+	if strings.HasSuffix(strings.ToLower(details.URI), ".gguf") {
+		return resolveHTTPProbe(details.URI)
+	}
+
+	for _, f := range cfg.Files {
+		lower := strings.ToLower(f.Filename)
+		if strings.Contains(lower, "mmproj") {
+			continue
+		}
+		if !strings.HasSuffix(lower, ".gguf") {
+			continue
+		}
+		return resolveHTTPProbe(f.URI)
+	}
+	return ""
+}
+
+// resolveHTTPProbe resolves an importer-side URI to the HTTP(S) URL that
+// `gguf.ParseGGUFFileRemote` can range-fetch. Returns "" if the URI can't
+// be reduced to an HTTP(S) endpoint (e.g. local path, unsupported scheme).
+func resolveHTTPProbe(uri string) string {
+	resolved := downloader.URI(uri).ResolveURL()
+	if downloader.URI(resolved).LooksLikeHTTPURL() {
+		return resolved
+	}
+	return ""
+}
+
 // appendShardGroup copies every shard of group into cfg.Files under dest,
 // skipping any entry whose target filename is already present so repeated
 // calls (e.g. the rare case of mmproj + model picking the same group)
--- a/core/gallery/models.go
+++ b/core/gallery/models.go
@@ -77,7 +77,7 @@ func InstallModelFromGallery(
 	modelGalleries, backendGalleries []lconfig.Gallery,
 	systemState *system.SystemState,
 	modelLoader *model.ModelLoader,
-	name string, req GalleryModel, downloadStatus func(string, string, string, float64), enforceScan, automaticallyInstallBackend bool) error {
+	name string, req GalleryModel, downloadStatus func(string, string, string, float64), enforceScan, automaticallyInstallBackend, requireBackendIntegrity bool) error {

 	applyModel := func(model *GalleryModel) error {
 		name = strings.ReplaceAll(name, string(os.PathSeparator), "__")
@@ -137,7 +137,7 @@ func InstallModelFromGallery(
 		if automaticallyInstallBackend && installedModel.Backend != "" {
 			xlog.Debug("Installing backend", "backend", installedModel.Backend)

-			if err := InstallBackendFromGallery(ctx, backendGalleries, systemState, modelLoader, installedModel.Backend, downloadStatus, false); err != nil {
+			if err := InstallBackendFromGallery(ctx, backendGalleries, systemState, modelLoader, installedModel.Backend, downloadStatus, false, requireBackendIntegrity); err != nil {
 				return err
 			}
 		}
--- a/core/gallery/models_test.go
+++ b/core/gallery/models_test.go
@@ -89,7 +89,7 @@ var _ = Describe("Model test", func() {
 			Expect(models[0].URL).To(Equal(bertEmbeddingsURL))
 			Expect(models[0].Installed).To(BeFalse())

-			err = InstallModelFromGallery(context.TODO(), galleries, []config.Gallery{}, systemState, nil, "test@bert", GalleryModel{}, func(s1, s2, s3 string, f float64) {}, true, true)
+			err = InstallModelFromGallery(context.TODO(), galleries, []config.Gallery{}, systemState, nil, "test@bert", GalleryModel{}, func(s1, s2, s3 string, f float64) {}, true, true, false)
 			Expect(err).ToNot(HaveOccurred())

 			dat, err := os.ReadFile(filepath.Join(tempdir, "bert.yaml"))
--- a/core/gallery/upgrade.go
+++ b/core/gallery/upgrade.go
@@ -232,7 +232,7 @@ func summarizeNodeDrift(nodes []NodeBackendRef) (majority struct{ version, diges

 // UpgradeBackend upgrades a single backend to the latest gallery version using
 // an atomic swap with backup-based rollback on failure.
-func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, galleries []config.Gallery, backendName string, downloadStatus func(string, string, string, float64)) error {
+func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, galleries []config.Gallery, backendName string, downloadStatus func(string, string, string, float64), requireIntegrity bool) error {
 	// Look up the installed backend
 	installedBackends, err := ListSystemBackends(systemState)
 	if err != nil {
@@ -251,7 +251,7 @@ func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelL
 	// If this is a meta backend, recursively upgrade the concrete backend it points to
 	if installed.Metadata != nil && installed.Metadata.MetaBackendFor != "" {
 		xlog.Info("Meta backend detected, upgrading concrete backend", "meta", backendName, "concrete", installed.Metadata.MetaBackendFor)
-		return UpgradeBackend(ctx, systemState, modelLoader, galleries, installed.Metadata.MetaBackendFor, downloadStatus)
+		return UpgradeBackend(ctx, systemState, modelLoader, galleries, installed.Metadata.MetaBackendFor, downloadStatus, requireIntegrity)
 	}

 	// Find the gallery entry
@@ -265,6 +265,16 @@ func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelL
 		return fmt.Errorf("no gallery entry found for backend %q", backendName)
 	}

+	// Resolve integrity options (cosign verifier for OCI URIs, strict-mode
+	// gate for missing SHA256/policy) BEFORE writing anything to disk.
+	// Without this, the upgrade path would atomically swap in an
+	// unverified backend even when the gallery has a verification policy
+	// — see backendDownloadOptions in backends.go.
+	downloadOpts, err := backendDownloadOptions(galleryEntry, requireIntegrity)
+	if err != nil {
+		return fmt.Errorf("upgrade %q: %w", backendName, err)
+	}
+
 	backendPath := filepath.Join(systemState.Backend.BackendsPath, backendName)
 	tmpPath := backendPath + ".upgrade-tmp"
 	backupPath := backendPath + ".backup"
@@ -285,7 +295,7 @@ func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelL
 			return fmt.Errorf("failed to copy backend from directory: %w", err)
 		}
 	} else {
-		if err := uri.DownloadFileWithContext(ctx, tmpPath, "", 1, 1, downloadStatus); err != nil {
+		if err := uri.DownloadFileWithContext(ctx, tmpPath, galleryEntry.SHA256, 1, 1, downloadStatus, downloadOpts...); err != nil {
 			os.RemoveAll(tmpPath)
 			return fmt.Errorf("failed to download backend: %w", err)
 		}
--- a/core/gallery/upgrade_test.go
+++ b/core/gallery/upgrade_test.go
@@ -383,7 +383,7 @@ var _ = Describe("Upgrade Detection and Execution", func() {
 			})

 			ml := model.NewModelLoader(systemState)
-			err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil)
+			err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil, false)
 			Expect(err).NotTo(HaveOccurred())

 			// Verify run.sh was updated
@@ -417,7 +417,7 @@ var _ = Describe("Upgrade Detection and Execution", func() {
 			})

 			ml := model.NewModelLoader(systemState)
-			err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil)
+			err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil, false)
 			Expect(err).To(HaveOccurred())

 			// Verify v1 is still intact
@@ -432,5 +432,41 @@ var _ = Describe("Upgrade Detection and Execution", func() {
 			Expect(json.Unmarshal(metaData, &meta)).To(Succeed())
 			Expect(meta.Version).To(Equal("1.0.0"))
 		})
+
+		// Regression: an earlier version of UpgradeBackend wrote the
+		// downloaded bytes to disk without going through
+		// backendDownloadOptions, so the gallery's verification policy
+		// (and strict-integrity gate) didn't apply on upgrade. This test
+		// pins the upgrade path to the same integrity gate as installs:
+		// strict mode + an OCI URI without a verification: block must
+		// hard-fail *before* anything is downloaded or swapped in.
+		It("should refuse to upgrade an OCI backend that bypasses integrity in strict mode", func() {
+			installBackendWithVersion("my-backend", "1.0.0", "#!/bin/sh\necho v1")
+
+			// OCI URI, no Gallery.Verification → backendDownloadOptions
+			// returns a strict-integrity error before any network call.
+			writeGalleryYAML([]GalleryBackend{
+				{
+					Metadata: Metadata{
+						Name: "my-backend",
+					},
+					URI:     "oci://example.invalid/missing:never-fetched",
+					Version: "2.0.0",
+				},
+			})
+
+			ml := model.NewModelLoader(systemState)
+			err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil, true)
+			Expect(err).To(HaveOccurred())
+			Expect(err.Error()).To(ContainSubstring("strict integrity"))
+
+			// The installed v1 must be untouched — the upgrade should
+			// have aborted before writing anything.
+			content, err := os.ReadFile(filepath.Join(backendsPath, "my-backend", "run.sh"))
+			Expect(err).NotTo(HaveOccurred())
+			Expect(string(content)).To(Equal("#!/bin/sh\necho v1"))
+			Expect(filepath.Join(backendsPath, "my-backend.upgrade-tmp")).NotTo(BeAnExistingFile())
+			Expect(filepath.Join(backendsPath, "my-backend.backup")).NotTo(BeAnExistingFile())
+		})
 	})
 })
--- a/core/http/app.go
+++ b/core/http/app.go
@@ -443,6 +443,25 @@ func API(application *application.Application) (*echo.Echo, error) {
 					baseTag := `<base href="` + httpMiddleware.SecureBaseHref(baseURL) + `" />`
 					indexHTML = []byte(strings.Replace(string(indexHTML), "<head>", "<head>\n  "+baseTag, 1))
 				}
+				// <base href> only changes how relative URLs resolve; path-absolute
+				// URLs (those starting with `/`) still resolve against the origin
+				// and would bypass the reverse-proxy prefix. Rewrite the internal
+				// path-absolute references emitted by the build so the browser
+				// requests them through the proxy under the prefix.
+				//
+				// HTML-escape the prefix before interpolating it into attributes:
+				// BasePathPrefix already gates X-Forwarded-Prefix via
+				// SafeForwardedPrefix, but the validator only blocks open-redirect
+				// shapes (// prefix, backslashes, control chars), not attribute
+				// breakout characters like `"`. Escaping makes this resilient
+				// even if the validator ever loosens.
+				if prefix := httpMiddleware.BasePathPrefix(c); prefix != "/" {
+					safePrefix := httpMiddleware.SecureBaseHref(prefix)
+					html := string(indexHTML)
+					html = strings.ReplaceAll(html, `="/assets/`, `="`+safePrefix+`assets/`)
+					html = strings.ReplaceAll(html, `="/favicon.svg"`, `="`+safePrefix+`favicon.svg"`)
+					indexHTML = []byte(html)
+				}
 				return c.HTMLBlob(http.StatusOK, indexHTML)
 			}

--- a/core/http/app_test.go
+++ b/core/http/app_test.go
@@ -446,6 +446,42 @@ var _ = Describe("API test", func() {
 				Expect(sc).To(Equal(200), "status code")
 				Expect(string(body)).To(ContainSubstring(`<base href="https://example.org/myprefix/" />`), "body")
 			})
+
+			// Caddy's `handle_path` (and similar directives) strip the matched
+			// prefix before forwarding upstream, so LocalAI receives the
+			// already-stripped path together with X-Forwarded-Prefix. The base
+			// href and asset URLs must still include the prefix so the browser
+			// requests them through the proxy.
+			It("Should support reverse-proxy when prefix is stripped by the proxy", func() {
+
+				err, sc, body := getRequest("http://127.0.0.1:9090/app", http.Header{
+					"X-Forwarded-Proto":  {"https"},
+					"X-Forwarded-Host":   {"example.org"},
+					"X-Forwarded-Prefix": {"/myprefix"},
+				})
+				Expect(err).To(BeNil(), "error")
+				Expect(sc).To(Equal(200), "status code")
+				Expect(string(body)).To(ContainSubstring(`<base href="https://example.org/myprefix/" />`), "body")
+				Expect(string(body)).ToNot(ContainSubstring(`="/assets/`), "asset URLs must include the prefix")
+				Expect(string(body)).ToNot(ContainSubstring(`="/favicon.svg"`), "favicon URL must include the prefix")
+			})
+
+			// X-Forwarded-Prefix is attacker-controllable on misconfigured
+			// proxy chains. A value like "//evil.com" would otherwise turn the
+			// asset URL rewrite into a protocol-relative URL that loads JS
+			// from a foreign origin. BasePathPrefix must reject these via
+			// SafeForwardedPrefix and fall back to "/".
+			It("Should ignore an unsafe X-Forwarded-Prefix and not poison asset URLs", func() {
+				err, sc, body := getRequest("http://127.0.0.1:9090/app", http.Header{
+					"X-Forwarded-Proto":  {"https"},
+					"X-Forwarded-Host":   {"example.org"},
+					"X-Forwarded-Prefix": {"//evil.com"},
+				})
+				Expect(err).To(BeNil(), "error")
+				Expect(sc).To(Equal(200), "status code")
+				Expect(string(body)).ToNot(ContainSubstring("evil.com"), "unsafe prefix must not leak into the response")
+				Expect(string(body)).ToNot(ContainSubstring(`="//`), "asset URLs must not become protocol-relative")
+			})
 		})

 		Context("Applying models", func() {
--- a/core/http/endpoints/localai/video.go
+++ b/core/http/endpoints/localai/video.go
@@ -22,12 +22,19 @@ import (
 	"github.com/mudler/LocalAI/core/backend"

 	model "github.com/mudler/LocalAI/pkg/model"
+	"github.com/mudler/LocalAI/pkg/utils"
 	"github.com/mudler/xlog"
 )

+var videoDownloadClient = http.Client{Timeout: 30 * time.Second}
+
 func downloadFile(url string) (string, error) {
+	if err := utils.ValidateExternalURL(url); err != nil {
+		return "", fmt.Errorf("URL validation failed: %w", err)
+	}
+
 	// Get the data
-	resp, err := http.Get(url)
+	resp, err := videoDownloadClient.Get(url)
 	if err != nil {
 		return "", err
 	}
--- a/core/http/endpoints/openai/chat.go
+++ b/core/http/endpoints/openai/chat.go
@@ -131,13 +131,19 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 				delta.Reasoning = &reasoningDelta
 			}

+			// Usage rides as a struct field so the consumer can track the
+			// running cumulative total. It is stripped before JSON marshal so
+			// the wire chunk stays spec-compliant (no `usage` on intermediate
+			// chunks). The dedicated trailer chunk (when include_usage=true)
+			// carries the final totals.
+			usageForChunk := usage
 			resp := schema.OpenAIResponse{
 				ID:      id,
 				Created: created,
 				Model:   req.Model, // we have to return what the user sent here, due to OpenAI spec.
 				Choices: []schema.Choice{{Delta: delta, Index: 0, FinishReason: nil}},
 				Object:  "chat.completion.chunk",
-				Usage:   usage,
+				Usage:   &usageForChunk,
 			}

 			responses <- resp
@@ -164,7 +170,7 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 		hasChatDeltaToolCalls := false
 		hasChatDeltaContent := false

-		_, tokenUsage, chatDeltas, err := ComputeChoices(req, prompt, config, cl, startupOptions, loader, func(s string, c *[]schema.Choice) {}, func(s string, usage backend.TokenUsage) bool {
+		_, _, chatDeltas, err := ComputeChoices(req, prompt, config, cl, startupOptions, loader, func(s string, c *[]schema.Choice) {}, func(s string, usage backend.TokenUsage) bool {
 			result += s

 			// Track whether ChatDeltas from the C++ autoparser contain
@@ -387,16 +393,11 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator

 		switch {
 		case noActionToRun:
-			usage := schema.OpenAIUsage{
-				PromptTokens:     tokenUsage.Prompt,
-				CompletionTokens: tokenUsage.Completion,
-				TotalTokens:      tokenUsage.Prompt + tokenUsage.Completion,
-			}
-			if extraUsage {
-				usage.TimingTokenGeneration = tokenUsage.TimingTokenGeneration
-				usage.TimingPromptProcessing = tokenUsage.TimingPromptProcessing
-			}
-
+			// Token-cumulative usage is communicated to the streaming
+			// consumer via the per-token callback's chunk struct (stripped
+			// before wire marshal). The final usage trailer — when the
+			// caller opted in with stream_options.include_usage — is built
+			// by the outer streaming loop, not here.
 			var result string
 			if !sentInitialRole {
 				var hqErr error
@@ -409,7 +410,7 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 			for _, chunk := range buildNoActionFinalChunks(
 				id, req.Model, created,
 				sentInitialRole, sentReasoning,
-				result, reasoning, usage,
+				result, reasoning,
 			) {
 				responses <- chunk
 			}
@@ -724,7 +725,13 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 							xlog.Debug("No choices in the response, skipping")
 							continue
 						}
-						usage = &ev.Usage // Copy a pointer to the latest usage chunk so that the stop message can reference it
+						// Capture the running cumulative usage from this chunk
+						// (when present) so the include_usage trailer can carry
+						// the final totals. Usage is stripped before marshal
+						// below so the wire chunk stays spec-compliant.
+						if ev.Usage != nil {
+							usage = ev.Usage
+						}
 						if len(ev.Choices[0].Delta.ToolCalls) > 0 {
 							toolsCalled = true
 							// Collect and merge tool call deltas for MCP execution
@@ -740,6 +747,11 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 								collectedContent += *sp
 							}
 						}
+						// OpenAI streaming spec: intermediate chunks must NOT
+						// carry a `usage` field. Strip the tracking copy
+						// before marshalling — usage is delivered via the
+						// dedicated trailer chunk when include_usage=true.
+						ev.Usage = nil
 						respData, err := json.Marshal(ev)
 						if err != nil {
 							xlog.Debug("Failed to marshal response", "error", err)
@@ -888,6 +900,9 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 					finishReason = FinishReasonFunctionCall
 				}

+				// Final delta chunk: empty delta with finish_reason set. Per
+				// OpenAI streaming spec this chunk does NOT carry usage —
+				// the optional trailer (below) does, gated on include_usage.
 				resp := &schema.OpenAIResponse{
 					ID:      id,
 					Created: created,
@@ -899,11 +914,18 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 							Delta:        &schema.Message{},
 						}},
 					Object: "chat.completion.chunk",
-					Usage:  *usage,
 				}
 				respData, _ := json.Marshal(resp)
-
 				fmt.Fprintf(c.Response().Writer, "data: %s\n\n", respData)
+
+				// Trailing usage chunk per OpenAI spec: emit only when the
+				// caller opted in via stream_options.include_usage. Shape:
+				// {"choices":[],"usage":{...},"object":"chat.completion.chunk",...}
+				if input.StreamOptions != nil && input.StreamOptions.IncludeUsage && usage != nil {
+					trailer := streamUsageTrailerJSON(id, input.Model, created, *usage)
+					_, _ = fmt.Fprintf(c.Response().Writer, "data: %s\n\n", trailer)
+				}
+
 				fmt.Fprintf(c.Response().Writer, "data: [DONE]\n\n")
 				c.Response().Flush()
 				xlog.Debug("Stream ended")
@@ -1263,7 +1285,7 @@ func ChatEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 					Model:   input.Model, // we have to return what the user sent here, due to OpenAI spec.
 					Choices: result,
 					Object:  "chat.completion",
-					Usage:   usage,
+					Usage:   &usage,
 				}
 				respData, _ := json.Marshal(resp)
 				xlog.Debug("Response", "response", string(respData))
--- a/core/http/endpoints/openai/chat_emit.go
+++ b/core/http/endpoints/openai/chat_emit.go
@@ -1,12 +1,45 @@
 package openai

 import (
+	"encoding/json"
 	"fmt"

 	"github.com/mudler/LocalAI/core/schema"
 	"github.com/mudler/LocalAI/pkg/functions"
 )

+// streamUsageTrailerJSON returns the bytes of the OpenAI-spec trailing usage
+// chunk emitted in streaming completions when the request opts in via
+// `stream_options.include_usage: true`. The shape is:
+//
+//	{"id":"...","object":"chat.completion.chunk","created":N,
+//	 "model":"...","choices":[],"usage":{...}}
+//
+// `choices` is intentionally an empty array (not absent, not null) — that is
+// what the OpenAI spec mandates, and what consumers like the official OpenAI
+// SDK and Continue's openai-adapter look for to recognise this as the usage
+// chunk rather than a content chunk. schema.OpenAIResponse has `omitempty`
+// on Choices, so we cannot reuse it for the trailer.
+func streamUsageTrailerJSON(id, model string, created int, usage schema.OpenAIUsage) []byte {
+	trailer := struct {
+		ID      string             `json:"id"`
+		Created int                `json:"created"`
+		Model   string             `json:"model"`
+		Object  string             `json:"object"`
+		Choices []schema.Choice    `json:"choices"`
+		Usage   schema.OpenAIUsage `json:"usage"`
+	}{
+		ID:      id,
+		Created: created,
+		Model:   model,
+		Object:  "chat.completion.chunk",
+		Choices: []schema.Choice{},
+		Usage:   usage,
+	}
+	b, _ := json.Marshal(trailer)
+	return b
+}
+
 // hasRealCall reports whether functionResults contains at least one
 // entry whose Name is something other than the noAction sentinel.
 // Used by processTools to decide between the "answer the question"
@@ -25,10 +58,10 @@ func hasRealCall(functionResults []functions.FuncCallResults, noAction string) b
 // pseudo-function or emitted no tool calls at all).
 //
 // When content was already streamed (contentAlreadyStreamed=true) the
-// helper emits a single trailing usage chunk, optionally carrying
-// reasoning that was produced but not streamed incrementally. When
-// content was not streamed it emits a role chunk followed by a
-// content+reasoning+usage chunk — the "send everything at once" fallback.
+// helper emits a trailing reasoning chunk if any non-streamed reasoning
+// remains, else nothing. When content was not streamed it emits a role
+// chunk followed by a content (+reasoning) chunk — the "send everything
+// at once" fallback.
 //
 // Reasoning re-emission is guarded by reasoningAlreadyStreamed, not by
 // probing the extractor's Go-side state: the C++ autoparser delivers
@@ -36,6 +69,10 @@ func hasRealCall(functionResults []functions.FuncCallResults, noAction string) b
 // separate accumulator that extractor.Reasoning() does not expose.
 // Without this guard the callback would stream reasoning incrementally
 // and the final chunk would duplicate it.
+//
+// The returned chunks intentionally do NOT carry a `usage` field. The
+// usage trailer is emitted separately by the streaming handler when
+// `stream_options.include_usage` is true, per OpenAI spec.
 func buildNoActionFinalChunks(
 	id, model string,
 	created int,
@@ -43,26 +80,26 @@ func buildNoActionFinalChunks(
 	reasoningAlreadyStreamed bool,
 	content string,
 	reasoning string,
-	usage schema.OpenAIUsage,
 ) []schema.OpenAIResponse {
 	var out []schema.OpenAIResponse

 	if contentAlreadyStreamed {
-		delta := &schema.Message{}
-		if reasoning != "" && !reasoningAlreadyStreamed {
-			r := reasoning
-			delta.Reasoning = &r
+		if reasoning == "" || reasoningAlreadyStreamed {
+			return nil
 		}
+		r := reasoning
 		out = append(out, schema.OpenAIResponse{
 			ID: id, Created: created, Model: model,
-			Choices: []schema.Choice{{Delta: delta, Index: 0}},
-			Object:  "chat.completion.chunk",
-			Usage:   usage,
+			Choices: []schema.Choice{{
+				Delta: &schema.Message{Reasoning: &r},
+				Index: 0,
+			}},
+			Object: "chat.completion.chunk",
 		})
 		return out
 	}

-	// Content was not streamed — send role, then content (+reasoning) + usage.
+	// Content was not streamed — send role, then content (+reasoning).
 	out = append(out, schema.OpenAIResponse{
 		ID: id, Created: created, Model: model,
 		Choices: []schema.Choice{{
@@ -82,7 +119,6 @@ func buildNoActionFinalChunks(
 		ID: id, Created: created, Model: model,
 		Choices: []schema.Choice{{Delta: delta, Index: 0}},
 		Object:  "chat.completion.chunk",
-		Usage:   usage,
 	})
 	return out
 }
--- a/core/http/endpoints/openai/chat_emit_test.go
+++ b/core/http/endpoints/openai/chat_emit_test.go
@@ -609,54 +609,52 @@ var _ = Describe("buildNoActionFinalChunks", func() {
 		testModel   = "test-model"
 		testCreated = 1700000000
 	)
-	usage := schema.OpenAIUsage{PromptTokens: 5, CompletionTokens: 7, TotalTokens: 12}

-	Describe("Content streamed — trailing usage chunk", func() {
-		It("emits just one chunk with usage, no content, no reasoning when reasoning was streamed", func() {
+	Describe("Content streamed — trailing reasoning only", func() {
+		It("emits nothing when content and reasoning were already streamed", func() {
+			// Before the streaming-usage-spec fix this branch emitted a
+			// content-less chunk solely to carry `usage`. Per the OpenAI
+			// spec usage no longer rides on delta chunks; the dedicated
+			// trailer (when include_usage=true) carries it instead — so
+			// with nothing to deliver the helper returns no chunks.
 			chunks := buildNoActionFinalChunks(
 				testID, testModel, testCreated,
 				true, true,
-				"", "already-streamed-reasoning", usage,
+				"", "already-streamed-reasoning",
 			)
-
-			Expect(chunks).To(HaveLen(1))
-			Expect(chunks[0].Usage.TotalTokens).To(Equal(12))
-			Expect(contentOf(chunks[0])).To(BeEmpty())
-			Expect(reasoningOf(chunks[0])).To(BeEmpty(),
-				"reasoning must not be re-emitted once it was streamed via the callback")
+			Expect(chunks).To(BeEmpty())
 		})

 		It("emits a trailing reasoning delivery when reasoning came only at end", func() {
 			chunks := buildNoActionFinalChunks(
 				testID, testModel, testCreated,
 				true, false,
-				"", "autoparser final reasoning", usage,
+				"", "autoparser final reasoning",
 			)

 			Expect(chunks).To(HaveLen(1))
 			Expect(reasoningOf(chunks[0])).To(Equal("autoparser final reasoning"))
 			Expect(contentOf(chunks[0])).To(BeEmpty())
-			Expect(chunks[0].Usage.TotalTokens).To(Equal(12))
+			Expect(chunks[0].Usage).To(BeNil(),
+				"intermediate chunks must not carry usage per OpenAI spec")
 		})

-		It("omits reasoning when it's empty regardless of streamed flag", func() {
+		It("returns no chunks when reasoning is empty and content was streamed", func() {
 			chunks := buildNoActionFinalChunks(
 				testID, testModel, testCreated,
 				true, false,
-				"", "", usage,
+				"", "",
 			)
-
-			Expect(chunks).To(HaveLen(1))
-			Expect(reasoningOf(chunks[0])).To(BeEmpty())
+			Expect(chunks).To(BeEmpty())
 		})
 	})

-	Describe("Content not streamed — role, then content+usage", func() {
+	Describe("Content not streamed — role, then content", func() {
 		It("emits role chunk then content chunk without reasoning when reasoning was streamed", func() {
 			chunks := buildNoActionFinalChunks(
 				testID, testModel, testCreated,
 				false, true,
-				"the answer", "already-streamed-reasoning", usage,
+				"the answer", "already-streamed-reasoning",
 			)

 			Expect(chunks).To(HaveLen(2))
@@ -666,14 +664,14 @@ var _ = Describe("buildNoActionFinalChunks", func() {
 			Expect(contentOf(chunks[1])).To(Equal("the answer"))
 			Expect(reasoningOf(chunks[1])).To(BeEmpty(),
 				"reasoning must not be re-emitted if it was streamed earlier")
-			Expect(chunks[1].Usage.TotalTokens).To(Equal(12))
+			Expect(chunks[1].Usage).To(BeNil())
 		})

 		It("emits role, then content+reasoning when reasoning was not streamed", func() {
 			chunks := buildNoActionFinalChunks(
 				testID, testModel, testCreated,
 				false, false,
-				"the answer", "autoparser final reasoning", usage,
+				"the answer", "autoparser final reasoning",
 			)

 			Expect(chunks).To(HaveLen(2))
@@ -681,14 +679,14 @@ var _ = Describe("buildNoActionFinalChunks", func() {

 			Expect(contentOf(chunks[1])).To(Equal("the answer"))
 			Expect(reasoningOf(chunks[1])).To(Equal("autoparser final reasoning"))
-			Expect(chunks[1].Usage.TotalTokens).To(Equal(12))
+			Expect(chunks[1].Usage).To(BeNil())
 		})

 		It("still emits content even when reasoning is empty", func() {
 			chunks := buildNoActionFinalChunks(
 				testID, testModel, testCreated,
 				false, false,
-				"just an answer", "", usage,
+				"just an answer", "",
 			)

 			Expect(chunks).To(HaveLen(2))
@@ -702,7 +700,7 @@ var _ = Describe("buildNoActionFinalChunks", func() {
 			chunks := buildNoActionFinalChunks(
 				testID, testModel, testCreated,
 				false, false,
-				"hi", "reasoning", usage,
+				"hi", "reasoning",
 			)
 			for i, ch := range chunks {
 				Expect(ch.ID).To(Equal(testID), "chunk[%d] ID", i)
--- a/core/http/endpoints/openai/chat_stream_usage_test.go
+++ b/core/http/endpoints/openai/chat_stream_usage_test.go
@@ -0,0 +1,179 @@
+package openai
+
+import (
+	"encoding/json"
+
+	"github.com/mudler/LocalAI/core/schema"
+	"github.com/mudler/LocalAI/pkg/functions"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+// These tests pin LocalAI's streaming chunks to the OpenAI spec for the
+// `usage` field. The regression that motivated them (issue #8546) was that
+// LocalAI emitted `"usage":{...zeros...}` on every chunk, which made the
+// official OpenAI Node SDK consumers (Continue, Kilo Code, Roo Code, Zed,
+// IntelliJ Continue) drop every content chunk via the filter at
+// continuedev/continue packages/openai-adapters/src/apis/OpenAI.ts:275-288.
+//
+// Per OpenAI's chat-completion streaming contract:
+//   - intermediate chunks MUST NOT carry a `usage` field
+//   - usage is only delivered when the request opts in via
+//     `stream_options.include_usage: true`, on a final extra chunk whose
+//     `choices` is an empty array.
+
+var _ = Describe("streaming usage spec compliance", func() {
+	Describe("OpenAIResponse JSON shape", func() {
+		It("does not emit a 'usage' key when Usage is unset", func() {
+			// A typical intermediate token chunk: no Usage populated.
+			content := "hello"
+			resp := schema.OpenAIResponse{
+				ID:      "req-1",
+				Created: 1,
+				Model:   "m",
+				Object:  "chat.completion.chunk",
+				Choices: []schema.Choice{{
+					Index: 0,
+					Delta: &schema.Message{Content: &content},
+				}},
+			}
+			data, err := json.Marshal(resp)
+			Expect(err).ToNot(HaveOccurred())
+
+			var raw map[string]any
+			Expect(json.Unmarshal(data, &raw)).To(Succeed())
+			_, present := raw["usage"]
+			Expect(present).To(BeFalse(),
+				"intermediate chunk must not include a 'usage' key; got: %s", string(data))
+		})
+
+		It("emits the usage object when Usage is explicitly set", func() {
+			usage := &schema.OpenAIUsage{PromptTokens: 11, CompletionTokens: 22, TotalTokens: 33}
+			resp := schema.OpenAIResponse{
+				ID:      "req-1",
+				Created: 1,
+				Model:   "m",
+				Object:  "chat.completion.chunk",
+				Usage:   usage,
+			}
+			data, err := json.Marshal(resp)
+			Expect(err).ToNot(HaveOccurred())
+
+			var raw map[string]any
+			Expect(json.Unmarshal(data, &raw)).To(Succeed())
+			u, ok := raw["usage"].(map[string]any)
+			Expect(ok).To(BeTrue(), "expected 'usage' object, got: %s", string(data))
+			Expect(u["prompt_tokens"]).To(BeNumerically("==", 11))
+			Expect(u["completion_tokens"]).To(BeNumerically("==", 22))
+			Expect(u["total_tokens"]).To(BeNumerically("==", 33))
+		})
+	})
+
+	Describe("buildNoActionFinalChunks", func() {
+		It("returns chunks with no Usage embedded", func() {
+			// Whatever the caller is doing, helpers must not bake usage
+			// into intermediate or final delta chunks. The usage trailer
+			// (when requested via include_usage) is emitted separately.
+			chunks := buildNoActionFinalChunks(
+				"req-1", "m", 1,
+				false, false,
+				"hi", "",
+			)
+			Expect(chunks).ToNot(BeEmpty())
+			for i, ch := range chunks {
+				Expect(ch.Usage).To(BeNil(),
+					"chunk[%d] must not carry Usage; got %+v", i, ch.Usage)
+			}
+		})
+
+		It("returns chunks with no Usage when only trailing reasoning needs delivery", func() {
+			chunks := buildNoActionFinalChunks(
+				"req-1", "m", 1,
+				true, false,
+				"", "autoparser late reasoning",
+			)
+			Expect(chunks).ToNot(BeEmpty())
+			for i, ch := range chunks {
+				Expect(ch.Usage).To(BeNil(),
+					"chunk[%d] must not carry Usage; got %+v", i, ch.Usage)
+			}
+		})
+	})
+
+	Describe("buildDeferredToolCallChunks", func() {
+		It("returns chunks with no Usage embedded", func() {
+			calls := []functions.FuncCallResults{{
+				Name: "do_thing", Arguments: `{"x":1}`,
+			}}
+			chunks := buildDeferredToolCallChunks(
+				"req-1", "m", 1, calls, 0,
+				false, "", false, "",
+			)
+			Expect(chunks).ToNot(BeEmpty())
+			for i, ch := range chunks {
+				Expect(ch.Usage).To(BeNil(),
+					"chunk[%d] must not carry Usage; got %+v", i, ch.Usage)
+			}
+		})
+	})
+
+	Describe("streamUsageTrailerJSON", func() {
+		It("produces JSON matching the OpenAI spec for the trailer chunk", func() {
+			// Trailing usage chunk shape (OpenAI streaming spec):
+			//   {"id":"...","object":"chat.completion.chunk","created":...,
+			//    "model":"...","choices":[],"usage":{...}}
+			usage := schema.OpenAIUsage{
+				PromptTokens: 18, CompletionTokens: 14, TotalTokens: 32,
+			}
+			data := streamUsageTrailerJSON("req-1", "m", 1, usage)
+
+			var raw map[string]any
+			Expect(json.Unmarshal(data, &raw)).To(Succeed(),
+				"trailer must be valid JSON, got: %s", string(data))
+
+			Expect(raw["id"]).To(Equal("req-1"))
+			Expect(raw["model"]).To(Equal("m"))
+			Expect(raw["object"]).To(Equal("chat.completion.chunk"))
+			Expect(raw["created"]).To(BeNumerically("==", 1))
+
+			// `choices` MUST be present as an empty array (not absent, not null).
+			rawChoices, present := raw["choices"]
+			Expect(present).To(BeTrue(), "choices key must be present, got: %s", string(data))
+			choicesArr, ok := rawChoices.([]any)
+			Expect(ok).To(BeTrue(), "choices must serialize as an array, got: %s", string(data))
+			Expect(choicesArr).To(BeEmpty(), "choices must be empty in usage trailer, got: %s", string(data))
+
+			// `usage` MUST be present and non-null with the populated counts.
+			u, ok := raw["usage"].(map[string]any)
+			Expect(ok).To(BeTrue(), "usage object must be present, got: %s", string(data))
+			Expect(u["prompt_tokens"]).To(BeNumerically("==", 18))
+			Expect(u["completion_tokens"]).To(BeNumerically("==", 14))
+			Expect(u["total_tokens"]).To(BeNumerically("==", 32))
+		})
+	})
+
+	Describe("OpenAIRequest.StreamOptions", func() {
+		It("parses stream_options.include_usage=true", func() {
+			body := []byte(`{
+                "model": "m",
+                "stream": true,
+                "stream_options": {"include_usage": true},
+                "messages": []
+            }`)
+			var req schema.OpenAIRequest
+			Expect(json.Unmarshal(body, &req)).To(Succeed())
+			Expect(req.StreamOptions).ToNot(BeNil())
+			Expect(req.StreamOptions.IncludeUsage).To(BeTrue())
+		})
+
+		It("defaults IncludeUsage to false when stream_options is absent", func() {
+			body := []byte(`{"model":"m","stream":true,"messages":[]}`)
+			var req schema.OpenAIRequest
+			Expect(json.Unmarshal(body, &req)).To(Succeed())
+			// Either a nil StreamOptions or one with IncludeUsage=false is acceptable.
+			if req.StreamOptions != nil {
+				Expect(req.StreamOptions.IncludeUsage).To(BeFalse())
+			}
+		})
+	})
+})
--- a/core/http/endpoints/openai/completion.go
+++ b/core/http/endpoints/openai/completion.go
@@ -39,6 +39,10 @@ func CompletionEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, eva
 				usage.TimingTokenGeneration = tokenUsage.TimingTokenGeneration
 				usage.TimingPromptProcessing = tokenUsage.TimingPromptProcessing
 			}
+			// Usage rides on the struct for the consumer to track the
+			// running cumulative; the consumer strips it before marshalling
+			// so intermediate chunks stay OpenAI-spec compliant.
+			usageForChunk := usage
 			resp := schema.OpenAIResponse{
 				ID:      id,
 				Created: created,
@@ -51,7 +55,7 @@ func CompletionEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, eva
 					},
 				},
 				Object: "text_completion",
-				Usage:  usage,
+				Usage:  &usageForChunk,
 			}
 			xlog.Debug("Sending goroutine", "text", s)

@@ -127,6 +131,8 @@ func CompletionEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, eva
 				ended <- process(id, predInput, input, config, ml, responses, extraUsage)
 			}()

+			var latestUsage *schema.OpenAIUsage
+
 		LOOP:
 			for {
 				select {
@@ -135,6 +141,14 @@ func CompletionEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, eva
 						xlog.Debug("No choices in the response, skipping")
 						continue
 					}
+					// Capture running cumulative usage for the optional trailer
+					// emitted after the final stop chunk when include_usage=true.
+					if ev.Usage != nil {
+						latestUsage = ev.Usage
+					}
+					// OpenAI streaming spec: intermediate chunks must NOT
+					// carry a `usage` field. Strip the tracking copy now.
+					ev.Usage = nil
 					respData, err := json.Marshal(ev)
 					if err != nil {
 						xlog.Debug("Failed to marshal response", "error", err)
@@ -194,8 +208,15 @@ func CompletionEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, eva
 				Object: "text_completion",
 			}
 			respData, _ := json.Marshal(resp)
-
 			fmt.Fprintf(c.Response().Writer, "data: %s\n\n", respData)
+
+			// Trailing usage chunk per OpenAI spec: emit only when the caller
+			// opted in via stream_options.include_usage.
+			if input.StreamOptions != nil && input.StreamOptions.IncludeUsage && latestUsage != nil {
+				trailer := streamUsageTrailerJSON(id, input.Model, created, *latestUsage)
+				_, _ = fmt.Fprintf(c.Response().Writer, "data: %s\n\n", trailer)
+			}
+
 			fmt.Fprintf(c.Response().Writer, "data: [DONE]\n\n")
 			c.Response().Flush()
 			return nil
@@ -247,7 +268,7 @@ func CompletionEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, eva
 			Model:   input.Model, // we have to return what the user sent here, due to OpenAI spec.
 			Choices: result,
 			Object:  "text_completion",
-			Usage:   usage,
+			Usage:   &usage,
 		}

 		jsonResult, _ := json.Marshal(resp)
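
The streaming changes in this file follow a capture, strip, then trail pattern: each event's usage is remembered, removed before marshalling, and sent once at the end only when the caller opted in. A simplified sketch of that flow is below. The `Event` and `Usage` types and the `writeStream` and `render` helpers are hypothetical, and the final stop chunk that the real handler writes before the trailer is left out for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Usage and Event are hypothetical, trimmed-down stand-ins for the schema types.
type Usage struct {
	TotalTokens int `json:"total_tokens"`
}

type Event struct {
	Text  string `json:"text"`
	Usage *Usage `json:"usage,omitempty"`
}

// writeStream writes events as SSE frames. It remembers the most recent
// usage, strips it from every intermediate frame, and appends one
// usage-only trailer when includeUsage is true and usage was seen.
func writeStream(w io.Writer, events []Event, includeUsage bool) {
	var latest *Usage
	for _, ev := range events {
		if ev.Usage != nil {
			latest = ev.Usage // capture the running cumulative
		}
		ev.Usage = nil // intermediate frames carry no usage key
		data, err := json.Marshal(ev)
		if err != nil {
			continue
		}
		fmt.Fprintf(w, "data: %s\n\n", data)
	}
	if includeUsage && latest != nil {
		trailer, err := json.Marshal(struct {
			Choices []any  `json:"choices"`
			Usage   *Usage `json:"usage"`
		}{Choices: []any{}, Usage: latest})
		if err == nil {
			fmt.Fprintf(w, "data: %s\n\n", trailer)
		}
	}
	fmt.Fprint(w, "data: [DONE]\n\n")
}

// render runs writeStream against an in-memory buffer and returns the output.
func render(events []Event, includeUsage bool) string {
	var sb strings.Builder
	writeStream(&sb, events, includeUsage)
	return sb.String()
}

func main() {
	events := []Event{
		{Text: "a", Usage: &Usage{TotalTokens: 1}},
		{Text: "b", Usage: &Usage{TotalTokens: 2}},
	}
	fmt.Print(render(events, true))
}
```

Because `ev` is a copy of the slice element, clearing `ev.Usage` in the loop does not alter the caller's events; the captured pointer still refers to the original usage value.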
--- a/core/http/endpoints/openai/edit.go
+++ b/core/http/endpoints/openai/edit.go
@@ -92,7 +92,7 @@ func EditEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator
 			Model:   input.Model, // we have to return what the user sent here, due to OpenAI spec.
 			Choices: result,
 			Object:  "edit",
-			Usage:   usage,
+			Usage:   &usage,
 		}

 		jsonResult, _ := json.Marshal(resp)
--- a/core/http/endpoints/openai/image.go
+++ b/core/http/endpoints/openai/image.go
@@ -233,7 +233,7 @@ func ImageEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfi
 			ID:      id,
 			Created: created,
 			Data:    result,
-			Usage: schema.OpenAIUsage{
+			Usage: &schema.OpenAIUsage{
 				PromptTokens:     0,
 				CompletionTokens: 0,
 				TotalTokens:      0,
--- a/core/http/endpoints/openai/inpainting.go
+++ b/core/http/endpoints/openai/inpainting.go
@@ -258,7 +258,7 @@ func InpaintingEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, app
 			Data: []schema.Item{{
 				URL: imgPath,
 			}},
-			Usage: schema.OpenAIUsage{
+			Usage: &schema.OpenAIUsage{
 				PromptTokens:     0,
 				CompletionTokens: 0,
 				TotalTokens:      0,
--- a/core/http/endpoints/openai/realtime.go
+++ b/core/http/endpoints/openai/realtime.go
@@ -8,6 +8,7 @@ import (
 	"fmt"
 	"math"
 	"os"
+	"strconv"
 	"sync"
 	"time"

@@ -20,6 +21,8 @@ import (
 	"github.com/mudler/LocalAI/core/backend"

 	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/http/auth"
+	mcpTools "github.com/mudler/LocalAI/core/http/endpoints/mcp"
 	"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
 	"github.com/mudler/LocalAI/core/schema"
 	"github.com/mudler/LocalAI/core/templates"
@@ -51,6 +54,30 @@ const (
 		"Avoid parenthetical asides, URLs, and anything that cannot be clearly vocalized."
 )

+// resolveOutputModalities returns the effective output modalities for a
+// response: response-level overrides session-level, and the OpenAI Realtime
+// spec default is ["audio"] when neither is set.
+func resolveOutputModalities(session, response []types.Modality) []types.Modality {
+	if len(response) > 0 {
+		return response
+	}
+	if len(session) > 0 {
+		return session
+	}
+	return []types.Modality{types.ModalityAudio}
+}
+
+// modalitiesContainAudio reports whether the resolved modalities include audio
+// output.
+func modalitiesContainAudio(m []types.Modality) bool {
+	for _, x := range m {
+		if x == types.ModalityAudio {
+			return true
+		}
+	}
+	return false
+}
+
 // A model can be "emulated" that is: transcribe audio to text -> feed text to the LLM -> generate audio as result
 // If the model support instead audio-to-audio, we will use the specific gRPC calls instead

@@ -79,6 +106,30 @@ type Session struct {
 	InputSampleRate  int
 	OutputSampleRate int
 	MaxOutputTokens  types.IntOrInf
+	// OutputModalities mirrors the OpenAI Realtime spec field of the same
+	// name. Empty means "use the spec default" (audio). ["text"] suppresses
+	// TTS so the client receives only response.output_text.* events.
+	OutputModalities []types.Modality
+	// MaxHistoryItems caps the number of MessageItems passed to the LLM each
+	// turn (0 = unlimited). Small models — especially the LFM2.5-Audio 1.5B
+	// served via the liquid-audio backend — degrade quickly past a handful
+	// of turns. Counted from the tail; FunctionCall + FunctionCallOutput
+	// pairs are kept together so we never feed an orphaned tool result.
+	MaxHistoryItems int
+
+	// AssistantExecutor is non-nil when the session opted into the in-process
+	// LocalAI Assistant tool surface. Tool calls whose name matches this
+	// executor's catalog are run in-process and their output is fed back to the
+	// model server-side; the client never sees a function_call_arguments
+	// event for those. Mirrors the chat handler's metadata.localai_assistant
+	// path.
+	AssistantExecutor mcpTools.ToolExecutor
+
+	// AssistantTools is the cached ToolUnion slice we injected at session
+	// creation. Re-applied after every client session.update so a
+	// client-driven tool refresh (e.g. toggling a client MCP server) doesn't
+	// silently strip Manage Mode's tools.
+	AssistantTools []types.ToolUnion

 	// Response cancellation: protects activeResponseCancel/activeResponseDone
 	responseMu           sync.Mutex
@@ -139,13 +190,14 @@ func (s *Session) ToServer() types.SessionUnion {
 	} else {
 		return types.SessionUnion{
 			Realtime: &types.RealtimeSession{
-				ID:              s.ID,
-				Object:          "realtime.session",
-				Model:           s.Model,
-				Instructions:    s.Instructions,
-				Tools:           s.Tools,
-				ToolChoice:      s.ToolChoice,
-				MaxOutputTokens: s.MaxOutputTokens,
+				ID:               s.ID,
+				Object:           "realtime.session",
+				Model:            s.Model,
+				Instructions:     s.Instructions,
+				Tools:            s.Tools,
+				ToolChoice:       s.ToolChoice,
+				MaxOutputTokens:  s.MaxOutputTokens,
+				OutputModalities: s.OutputModalities,
 				Audio: &types.RealtimeSessionAudio{
 					Input: &types.SessionAudioInput{
 						TurnDetection: s.TurnDetection,
@@ -205,6 +257,19 @@ func RealtimeTranscriptionSession(application *application.Application) echo.Han
 	}
 }

+// RealtimeSessionOptions bundles per-session knobs decoded from the WS query
+// string (or the WebRTC handshake body). Mirrors what chat.go pulls off
+// `metadata.localai_assistant` — admin-only opt-in to the in-process
+// management tool surface.
+type RealtimeSessionOptions struct {
+	LocalAIAssistant bool
+	// IsAdmin mirrors chat.go's requireAssistantAccess gate. We resolve
+	// admin role at handshake time (where the echo.Context has the auth
+	// cookie/Bearer) and drop the result here so runRealtimeSession can
+	// decide without holding onto the request.
+	IsAdmin bool
+}
+
 func Realtime(application *application.Application) echo.HandlerFunc {
 	return func(c echo.Context) error {
 		ws, err := upgrader.Upgrade(c.Response(), c.Request(), nil)
@@ -218,25 +283,105 @@ func Realtime(application *application.Application) echo.HandlerFunc {

 		// Extract query parameters from Echo context before passing to websocket handler
 		model := c.QueryParam("model")
+		assistantFlag, _ := strconv.ParseBool(c.QueryParam("localai_assistant"))
+		opts := RealtimeSessionOptions{
+			LocalAIAssistant: assistantFlag,
+			IsAdmin:          isCurrentUserAdmin(c, application),
+		}

-		registerRealtime(application, model)(ws)
+		registerRealtime(application, model, opts)(ws)
 		return nil
 	}
 }

-func registerRealtime(application *application.Application, model string) func(c *websocket.Conn) {
+// isCurrentUserAdmin replicates the chat-side admin check at the realtime
+// handshake. When auth is disabled, every caller is treated as admin (same
+// as chat's requireAssistantAccess).
+func isCurrentUserAdmin(c echo.Context, application *application.Application) bool {
+	if application == nil || application.ApplicationConfig() == nil || !application.ApplicationConfig().Auth.Enabled {
+		return true
+	}
+	user := auth.GetUser(c)
+	return user != nil && user.Role == auth.RoleAdmin
+}
+
+func registerRealtime(application *application.Application, model string, opts RealtimeSessionOptions) func(c *websocket.Conn) {
 	return func(conn *websocket.Conn) {
 		t := NewWebSocketTransport(conn)
 		evaluator := application.TemplatesEvaluator()
 		xlog.Debug("Realtime WebSocket connection established", "address", conn.RemoteAddr().String(), "model", model)
-		runRealtimeSession(application, t, model, evaluator)
+		runRealtimeSession(application, t, model, evaluator, opts)
 	}
 }

+// defaultMaxHistoryItems picks a sensible default cap for the session.
+// Small any-to-any audio models degrade quickly past a handful of turns;
+// legacy pipelines composing larger LLMs keep the historical "unlimited"
+// default and rely on the LLM's own context window.
+func defaultMaxHistoryItems(cfg *config.ModelConfig) int {
+	if cfg != nil && cfg.HasUsecases(config.FLAG_REALTIME_AUDIO) {
+		return 6
+	}
+	return 0
+}
+
+// trimRealtimeItems returns the tail of items capped at maxItems (0 = no cap).
+// Walks backwards keeping function_call + function_call_output pairs together
+// so we never feed the LLM an orphaned tool result that references a call it
+// can't see.
+func trimRealtimeItems(items []*types.MessageItemUnion, maxItems int) []*types.MessageItemUnion {
+	if maxItems <= 0 || len(items) <= maxItems {
+		return items
+	}
+	// Find the cut point starting from len-maxItems and pull it left until
+	// we're not in the middle of a tool-call pair.
+	cut := len(items) - maxItems
+	for cut > 0 && items[cut] != nil && items[cut].FunctionCallOutput != nil {
+		cut--
+	}
+	return items[cut:]
+}
+
+// prepareRealtimeConfig validates a model config for use in a realtime session
+// and fills in pipeline slots for self-contained any-to-any models. It returns
+// an error code + message pair suitable for sendError; the bool indicates
+// whether the caller should proceed. Extracted from runRealtimeSession so the
+// gate logic can be exercised in unit tests without a full Application.
+func prepareRealtimeConfig(cfg *config.ModelConfig) (errCode, errMsg string, ok bool) {
+	if cfg == nil {
+		return "invalid_model", "Model is not a pipeline model", false
+	}
+
+	// Self-contained any-to-any models (e.g. liquid-audio) own the whole
+	// loop in one engine — surface them by populating empty pipeline slots
+	// with the model's own name so newModel can resolve a config for each
+	// role. The user can still pin individual slots (e.g. Pipeline.VAD =
+	// silero-vad) and those win.
+	if cfg.HasUsecases(config.FLAG_REALTIME_AUDIO) {
+		if cfg.Pipeline.VAD == "" {
+			cfg.Pipeline.VAD = cfg.Name
+		}
+		if cfg.Pipeline.Transcription == "" {
+			cfg.Pipeline.Transcription = cfg.Name
+		}
+		if cfg.Pipeline.LLM == "" {
+			cfg.Pipeline.LLM = cfg.Name
+		}
+		if cfg.Pipeline.TTS == "" {
+			cfg.Pipeline.TTS = cfg.Name
+		}
+		return "", "", true
+	}
+
+	if cfg.Pipeline.VAD == "" && cfg.Pipeline.Transcription == "" && cfg.Pipeline.TTS == "" && cfg.Pipeline.LLM == "" {
+		return "invalid_model", "Model is not a pipeline model", false
+	}
+	return "", "", true
+}
+
 // runRealtimeSession runs the main event loop for a realtime session.
 // It is transport-agnostic and works with both WebSocket and WebRTC.
-func runRealtimeSession(application *application.Application, t Transport, model string, evaluator *templates.Evaluator) {
-	// TODO: Allow any-to-any model to be specified
+func runRealtimeSession(application *application.Application, t Transport, model string, evaluator *templates.Evaluator, opts RealtimeSessionOptions) {
 	cl := application.ModelConfigLoader()
 	cfg, err := cl.LoadModelConfigFileByNameDefaultOptions(model, application.ApplicationConfig())
 	if err != nil {
@@ -245,22 +390,79 @@ func runRealtimeSession(application *application.Application, t Transport, model
 		return
 	}

-	if cfg == nil || (cfg.Pipeline.VAD == "" && cfg.Pipeline.Transcription == "" && cfg.Pipeline.TTS == "" && cfg.Pipeline.LLM == "") {
+	if code, msg, ok := prepareRealtimeConfig(cfg); !ok {
 		xlog.Error("model is not a pipeline", "model", model)
-		sendError(t, "invalid_model", "Model is not a pipeline model", "", "")
+		sendError(t, code, msg, "", "")
 		return
 	}

+	// LocalAI Assistant opt-in: gate on admin (same rule as chat.go's
+	// requireAssistantAccess) and grab the process-wide holder's executor.
+	// We collect tools + system prompt here and merge them into the session
+	// below so they're live from the first response.create.
+	var assistantTools []types.ToolUnion
+	var assistantSystemPrompt string
+	var assistantExecutor mcpTools.ToolExecutor
+	if opts.LocalAIAssistant {
+		if !opts.IsAdmin {
+			sendError(t, "forbidden", "localai_assistant requires admin", "", "")
+			return
+		}
+		appCfg := application.ApplicationConfig()
+		if appCfg != nil && appCfg.DisableLocalAIAssistant {
+			sendError(t, "unavailable", "LocalAI Assistant is disabled on this server", "", "")
+			return
+		}
+		holder := application.LocalAIAssistant()
+		if holder == nil || !holder.HasTools() {
+			sendError(t, "unavailable", "LocalAI Assistant is not available on this server", "", "")
+			return
+		}
+		exec := holder.Executor()
+		fns, discErr := exec.DiscoverTools(context.Background())
+		if discErr != nil {
+			xlog.Error("realtime: failed to discover LocalAI Assistant tools", "error", discErr)
+			sendError(t, "tool_discovery_failed", "failed to discover assistant tools: "+discErr.Error(), "", "")
+			return
+		}
+		assistantExecutor = exec
+		assistantSystemPrompt = holder.SystemPrompt()
+		assistantTools = make([]types.ToolUnion, 0, len(fns))
+		for _, fn := range fns {
+			fnCopy := fn
+			assistantTools = append(assistantTools, types.ToolUnion{
+				Function: &types.ToolFunction{
+					Name:        fnCopy.Name,
+					Description: fnCopy.Description,
+					Parameters:  fnCopy.Parameters,
+				},
+			})
+		}
+		xlog.Debug("realtime: LocalAI Assistant tools injected", "count", len(fns))
+	}
+
 	sttModel := cfg.Pipeline.Transcription

+	// Compose the system prompt: prepend the assistant prompt when we have
+	// one (it teaches the model the safety rules and tool recipes), then the
+	// session's default voice instructions. Order matches chat.go's
+	// hasSystemMessage check — assistant prompt comes first.
+	instructions := defaultInstructions
+	if assistantSystemPrompt != "" {
+		instructions = assistantSystemPrompt + "\n\n" + defaultInstructions
+	}
+
 	sessionID := generateSessionID()
 	session := &Session{
 		ID:                sessionID,
 		TranscriptionOnly: false,
 		Model:             model,
 		Voice:             cfg.TTSConfig.Voice,
-		Instructions:      defaultInstructions,
+		Instructions:      instructions,
 		ModelConfig:       cfg,
+		Tools:             assistantTools,
+		AssistantTools:    assistantTools,
+		AssistantExecutor: assistantExecutor,
 		TurnDetection: &types.TurnDetectionUnion{
 			ServerVad: &types.ServerVad{
 				Threshold:         0.5,
@@ -275,6 +477,7 @@ func runRealtimeSession(application *application.Application, t Transport, model
 		Conversations:    make(map[string]*Conversation),
 		InputSampleRate:  defaultRemoteSampleRate,
 		OutputSampleRate: defaultRemoteSampleRate,
+		MaxHistoryItems:  defaultMaxHistoryItems(cfg),
 	}

 	// Create a default conversation
@@ -810,7 +1013,28 @@ func updateSession(session *Session, update *types.SessionUnion, cl *config.Mode
 	}

 	if rt.Tools != nil {
-		session.Tools = rt.Tools
+		// Manage Mode tools survive a client-driven session.update — the
+		// alternative is silently dropping them whenever the user toggles
+		// a client MCP server, which would break the modality mid-session.
+		// Names from rt.Tools win on collision (the client is explicit;
+		// we preserve, we don't override).
+		merged := append([]types.ToolUnion(nil), rt.Tools...)
+		seen := make(map[string]struct{}, len(merged))
+		for _, t := range merged {
+			if t.Function != nil {
+				seen[t.Function.Name] = struct{}{}
+			}
+		}
+		for _, t := range session.AssistantTools {
+			if t.Function == nil {
+				continue
+			}
+			if _, ok := seen[t.Function.Name]; ok {
+				continue
+			}
+			merged = append(merged, t)
+		}
+		session.Tools = merged
 	}
 	if rt.ToolChoice != nil {
 		session.ToolChoice = rt.ToolChoice
@@ -820,6 +1044,10 @@ func updateSession(session *Session, update *types.SessionUnion, cl *config.Mode
 		session.MaxOutputTokens = rt.MaxOutputTokens
 	}

+	if len(rt.OutputModalities) > 0 {
+		session.OutputModalities = rt.OutputModalities
+	}
+
 	return nil
 }

@@ -1104,7 +1332,17 @@ func generateResponse(ctx context.Context, session *Session, utt []byte, transcr
 	triggerResponse(ctx, session, conv, t, nil)
 }

+// maxAssistantToolTurns caps the server-side agentic loop. Mirrors the
+// chat-page maxToolTurns:10 from useChat.js — the model gets up to this
+// many consecutive tool round-trips before we return control to the user
+// without another response cycle.
+const maxAssistantToolTurns = 10
+
 func triggerResponse(ctx context.Context, session *Session, conv *Conversation, t Transport, overrides *types.ResponseCreateParams) {
+	triggerResponseAtTurn(ctx, session, conv, t, overrides, 0)
+}
+
+func triggerResponseAtTurn(ctx context.Context, session *Session, conv *Conversation, t Transport, overrides *types.ResponseCreateParams, toolTurn int) {
 	config := session.ModelInterface.PredictConfig()

 	// Default values
@@ -1155,7 +1393,8 @@ func triggerResponse(ctx context.Context, session *Session, conv *Conversation,

 	imgIndex := 0
 	conv.Lock.Lock()
-	for _, item := range conv.Items {
+	items := trimRealtimeItems(conv.Items, session.MaxHistoryItems)
+	for _, item := range items {
 		if item.User != nil {
 			msg := schema.Message{
 				Role: string(types.MessageRoleUser),
@@ -1448,106 +1687,130 @@ func triggerResponse(ctx context.Context, session *Session, conv *Conversation,
 			})
 		}

-		// Check for cancellation before TTS
-		if ctx.Err() != nil {
-			xlog.Debug("Response cancelled before TTS (barge-in)")
-			sendCancelledResponse()
-			return
-		}
-
-		audioFilePath, res, err := session.ModelInterface.TTS(ctx, finalSpeech, session.Voice, session.InputAudioTranscription.Language)
-		if err != nil {
-			if ctx.Err() != nil {
-				xlog.Debug("TTS cancelled (barge-in)")
-				sendCancelledResponse()
-				return
-			}
-			xlog.Error("TTS failed", "error", err)
-			sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %v", err), "", item.Assistant.ID)
-			return
-		}
-		if !res.Success {
-			xlog.Error("TTS failed", "message", res.Message)
-			sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %s", res.Message), "", item.Assistant.ID)
-			return
-		}
-		defer os.Remove(audioFilePath)
-
-		audioBytes, err := os.ReadFile(audioFilePath)
-		if err != nil {
-			xlog.Error("failed to read TTS file", "error", err)
-			sendError(t, "tts_error", fmt.Sprintf("Failed to read TTS audio: %v", err), "", item.Assistant.ID)
-			return
-		}
-
-		// Parse WAV header to get raw PCM and the actual sample rate from the TTS backend.
-		pcmData, ttsSampleRate := laudio.ParseWAV(audioBytes)
-		if ttsSampleRate == 0 {
-			ttsSampleRate = localSampleRate
-		}
-		xlog.Debug("TTS audio parsed", "raw_bytes", len(audioBytes), "pcm_bytes", len(pcmData), "sample_rate", ttsSampleRate)
-
-		// SendAudio (WebRTC) passes PCM at the TTS sample rate directly to the
-		// Opus encoder, which resamples to 48kHz internally. This avoids a
-		// lossy intermediate resample through 16kHz.
-		// XXX: This is a noop in websocket mode; it's included in the JSON instead
-		if err := t.SendAudio(ctx, pcmData, ttsSampleRate); err != nil {
-			if ctx.Err() != nil {
-				xlog.Debug("Audio playback cancelled (barge-in)")
-				sendCancelledResponse()
-				return
-			}
-			xlog.Error("failed to send audio via transport", "error", err)
-		}
-
-		_, isWebRTC := t.(*WebRTCTransport)
-
-		// For WebSocket clients, resample to the session's output rate and
-		// deliver audio as base64 in JSON events. WebRTC clients already
-		// received audio over the RTP track, so skip the base64 payload.
 		var audioString string
-		if !isWebRTC {
-			wsPCM := pcmData
-			if ttsSampleRate != session.OutputSampleRate {
-				samples := sound.BytesToInt16sLE(pcmData)
-				resampled := sound.ResampleInt16(samples, ttsSampleRate, session.OutputSampleRate)
-				wsPCM = sound.Int16toBytesLE(resampled)
-			}
-			audioString = base64.StdEncoding.EncodeToString(wsPCM)
+		_, isWebRTC := t.(*WebRTCTransport)
+		var respMods []types.Modality
+		if overrides != nil {
+			respMods = overrides.OutputModalities
 		}
+		modalities := resolveOutputModalities(session.OutputModalities, respMods)
+		if modalitiesContainAudio(modalities) {
+			// Check for cancellation before TTS
+			if ctx.Err() != nil {
+				xlog.Debug("Response cancelled before TTS (barge-in)")
+				sendCancelledResponse()
+				return
+			}

-		sendEvent(t, types.ResponseOutputAudioTranscriptDeltaEvent{
-			ServerEventBase: types.ServerEventBase{},
-			ResponseID:      responseID,
-			ItemID:          item.Assistant.ID,
-			OutputIndex:     0,
-			ContentIndex:    0,
-			Delta:           finalSpeech,
-		})
-		sendEvent(t, types.ResponseOutputAudioTranscriptDoneEvent{
-			ServerEventBase: types.ServerEventBase{},
-			ResponseID:      responseID,
-			ItemID:          item.Assistant.ID,
-			OutputIndex:     0,
-			ContentIndex:    0,
-			Transcript:      finalSpeech,
-		})
+			audioFilePath, res, err := session.ModelInterface.TTS(ctx, finalSpeech, session.Voice, session.InputAudioTranscription.Language)
+			if err != nil {
+				if ctx.Err() != nil {
+					xlog.Debug("TTS cancelled (barge-in)")
+					sendCancelledResponse()
+					return
+				}
+				xlog.Error("TTS failed", "error", err)
+				sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %v", err), "", item.Assistant.ID)
+				return
+			}
+			if !res.Success {
+				xlog.Error("TTS failed", "message", res.Message)
+				sendError(t, "tts_error", fmt.Sprintf("TTS generation failed: %s", res.Message), "", item.Assistant.ID)
+				return
+			}
+			defer func() { _ = os.Remove(audioFilePath) }()

-		if !isWebRTC {
-			sendEvent(t, types.ResponseOutputAudioDeltaEvent{
+			audioBytes, err := os.ReadFile(audioFilePath)
+			if err != nil {
+				xlog.Error("failed to read TTS file", "error", err)
+				sendError(t, "tts_error", fmt.Sprintf("Failed to read TTS audio: %v", err), "", item.Assistant.ID)
+				return
+			}
+
+			// Parse WAV header to get raw PCM and the actual sample rate from the TTS backend.
+			pcmData, ttsSampleRate := laudio.ParseWAV(audioBytes)
+			if ttsSampleRate == 0 {
+				ttsSampleRate = localSampleRate
+			}
+			xlog.Debug("TTS audio parsed", "raw_bytes", len(audioBytes), "pcm_bytes", len(pcmData), "sample_rate", ttsSampleRate)
+
+			// SendAudio (WebRTC) passes PCM at the TTS sample rate directly to the
+			// Opus encoder, which resamples to 48kHz internally. This avoids a
+			// lossy intermediate resample through 16kHz.
+			// XXX: This is a noop in websocket mode; it's included in the JSON instead
+			if err := t.SendAudio(ctx, pcmData, ttsSampleRate); err != nil {
+				if ctx.Err() != nil {
+					xlog.Debug("Audio playback cancelled (barge-in)")
+					sendCancelledResponse()
+					return
+				}
+				xlog.Error("failed to send audio via transport", "error", err)
+			}
+
+			// For WebSocket clients, resample to the session's output rate and
+			// deliver audio as base64 in JSON events. WebRTC clients already
+			// received audio over the RTP track, so skip the base64 payload.
+			if !isWebRTC {
+				wsPCM := pcmData
+				if ttsSampleRate != session.OutputSampleRate {
+					samples := sound.BytesToInt16sLE(pcmData)
+					resampled := sound.ResampleInt16(samples, ttsSampleRate, session.OutputSampleRate)
+					wsPCM = sound.Int16toBytesLE(resampled)
+				}
+				audioString = base64.StdEncoding.EncodeToString(wsPCM)
+			}
+
+			sendEvent(t, types.ResponseOutputAudioTranscriptDeltaEvent{
 				ServerEventBase: types.ServerEventBase{},
 				ResponseID:      responseID,
 				ItemID:          item.Assistant.ID,
 				OutputIndex:     0,
 				ContentIndex:    0,
-				Delta:           audioString,
+				Delta:           finalSpeech,
 			})
-			sendEvent(t, types.ResponseOutputAudioDoneEvent{
+			sendEvent(t, types.ResponseOutputAudioTranscriptDoneEvent{
 				ServerEventBase: types.ServerEventBase{},
 				ResponseID:      responseID,
 				ItemID:          item.Assistant.ID,
 				OutputIndex:     0,
 				ContentIndex:    0,
+				Transcript:      finalSpeech,
+			})
+
+			if !isWebRTC {
+				sendEvent(t, types.ResponseOutputAudioDeltaEvent{
+					ServerEventBase: types.ServerEventBase{},
+					ResponseID:      responseID,
+					ItemID:          item.Assistant.ID,
+					OutputIndex:     0,
+					ContentIndex:    0,
+					Delta:           audioString,
+				})
+				sendEvent(t, types.ResponseOutputAudioDoneEvent{
+					ServerEventBase: types.ServerEventBase{},
+					ResponseID:      responseID,
+					ItemID:          item.Assistant.ID,
+					OutputIndex:     0,
+					ContentIndex:    0,
+				})
+			}
+		} else {
+			// Text-only mode: skip TTS, emit only the text events.
+			sendEvent(t, types.ResponseOutputTextDeltaEvent{
+				ServerEventBase: types.ServerEventBase{},
+				ResponseID:      responseID,
+				ItemID:          item.Assistant.ID,
+				OutputIndex:     0,
+				ContentIndex:    0,
+				Delta:           finalSpeech,
+			})
+			sendEvent(t, types.ResponseOutputTextDoneEvent{
+				ServerEventBase: types.ServerEventBase{},
+				ResponseID:      responseID,
+				ItemID:          item.Assistant.ID,
+				OutputIndex:     0,
+				ContentIndex:    0,
+				Text:            finalSpeech,
 			})
 		}

@@ -1575,8 +1838,16 @@ func triggerResponse(ctx context.Context, session *Session, conv *Conversation,
 		})
 	}

-	// Handle Tool Calls
+	// Handle Tool Calls. Two paths:
+	//   - LocalAI Assistant tools (session.AssistantExecutor.IsTool) run
+	//     server-side; we append both the call and its output to conv.Items
+	//     and re-trigger a follow-up response so the model can speak the
+	//     result. The client only sees observability events.
+	//   - All other tools follow the standard OpenAI flow: emit
+	//     function_call_arguments.done and wait for the client to send
+	//     conversation.item.create back.
 	xlog.Debug("About to handle tool calls", "finalToolCallsCount", len(finalToolCalls))
+	executedAssistantTool := false
 	for i, tc := range finalToolCalls {
 		toolCallID := generateItemID()
 		callID := "call_" + generateUniqueID() // OpenAI uses call_xyz
@@ -1608,6 +1879,51 @@ func triggerResponse(ctx context.Context, session *Session, conv *Conversation,
 			Item:            fcItem,
 		})

+		serverSide := session.AssistantExecutor != nil && session.AssistantExecutor.IsTool(tc.Name)
+		if serverSide {
+			output, execErr := session.AssistantExecutor.ExecuteTool(ctx, tc.Name, tc.Arguments)
+			if execErr != nil {
+				output = "Error: " + execErr.Error()
+				xlog.Error("realtime: assistant tool execution failed", "tool", tc.Name, "error", execErr)
+			}
+			foItem := types.MessageItemUnion{
+				FunctionCallOutput: &types.MessageItemFunctionCallOutput{
+					ID:     generateItemID(),
+					CallID: callID,
+					Output: output,
+					Status: types.ItemStatusCompleted,
+				},
+			}
+			conv.Lock.Lock()
+			conv.Items = append(conv.Items, &foItem)
+			conv.Lock.Unlock()
+			// Close the call out and emit the output as its own paired
+			// added/done — the OpenAI spec pairs every item-done with a
+			// preceding item-added, so we re-pair here for the output.
+			// The UI renders the transcript entry on item.done for both
+			// shapes (FunctionCall + FunctionCallOutput).
+			sendEvent(t, types.ResponseOutputItemDoneEvent{
+				ServerEventBase: types.ServerEventBase{},
+				ResponseID:      responseID,
+				OutputIndex:     outputIndex,
+				Item:            fcItem,
+			})
+			sendEvent(t, types.ResponseOutputItemAddedEvent{
+				ServerEventBase: types.ServerEventBase{},
+				ResponseID:      responseID,
+				OutputIndex:     outputIndex,
+				Item:            foItem,
+			})
+			sendEvent(t, types.ResponseOutputItemDoneEvent{
+				ServerEventBase: types.ServerEventBase{},
+				ResponseID:      responseID,
+				OutputIndex:     outputIndex,
+				Item:            foItem,
+			})
+			executedAssistantTool = true
+			continue
+		}
+
 		sendEvent(t, types.ResponseFunctionCallArgumentsDeltaEvent{
 			ServerEventBase: types.ServerEventBase{},
 			ResponseID:      responseID,
@@ -1643,6 +1959,19 @@ func triggerResponse(ctx context.Context, session *Session, conv *Conversation,
 			Status: types.ResponseStatusCompleted,
 		},
 	})
+
+	// If we executed any assistant tools in-process, run another response cycle
+	// so the model can speak the result. Mirrors the chat-side agentic loop
+	// but driven server-side rather than by client round-trip. Bounded so a
+	// degenerate "model keeps calling tools" doesn't blow the stack.
+	if executedAssistantTool {
+		if toolTurn+1 >= maxAssistantToolTurns {
+			xlog.Warn("realtime: assistant tool-turn limit reached, stopping the agentic loop",
+				"limit", maxAssistantToolTurns, "model", session.Model)
+			return
+		}
+		triggerResponseAtTurn(ctx, session, conv, t, nil, toolTurn+1)
+	}
 }

 // Helper functions to generate unique IDs
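The turn limit above can be illustrated in isolation. The following is a minimal sketch, not the code from this change: `runAtTurn` and `wantsTool` are hypothetical names, and the response cycle is reduced to a callback that reports whether the model asked for a server-side tool. It assumes the same comparison as the diff (`toolTurn+1 >= maxAssistantToolTurns`) and counts how many response cycles run.

```go
package main

import "fmt"

// maxToolTurns mirrors the value of maxAssistantToolTurns in the diff.
const maxToolTurns = 10

// runAtTurn simulates one response cycle at the given turn and returns the
// total number of cycles executed from this turn onward. wantsTool is a
// hypothetical stand-in for "the model requested a server-side tool".
func runAtTurn(turn int, wantsTool func(turn int) bool) int {
	if !wantsTool(turn) {
		return 1 // plain answer: no follow-up cycle
	}
	if turn+1 >= maxToolTurns {
		return 1 // limit reached: the tool ran, but no follow-up is triggered
	}
	return 1 + runAtTurn(turn+1, wantsTool)
}

func main() {
	always := func(int) bool { return true }
	twice := func(turn int) bool { return turn < 2 }
	fmt.Println(runAtTurn(0, always)) // bounded by maxToolTurns
	fmt.Println(runAtTurn(0, twice))  // two tool turns plus the final answer
}
```

With a model that always requests a tool, the recursion stops after `maxToolTurns` cycles; with a model that requests a tool twice, three cycles run.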
--- a/core/http/endpoints/openai/realtime_gate_test.go
+++ b/core/http/endpoints/openai/realtime_gate_test.go
@@ -0,0 +1,153 @@
+package openai
+
+import (
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+// withUsecases returns a *ModelConfigUsecase pointing at the OR of the given flags.
+// Helper so each spec keeps its intent obvious.
+func withUsecases(flags ...config.ModelConfigUsecase) *config.ModelConfigUsecase {
+	var u config.ModelConfigUsecase
+	for _, f := range flags {
+		u |= f
+	}
+	return &u
+}
+
+var _ = Describe("prepareRealtimeConfig", func() {
+	It("rejects a nil config", func() {
+		code, msg, ok := prepareRealtimeConfig(nil)
+		Expect(ok).To(BeFalse())
+		Expect(code).To(Equal("invalid_model"))
+		Expect(msg).To(ContainSubstring("not a pipeline model"))
+	})
+
+	It("rejects a model with no pipeline slots and no realtime_audio usecase", func() {
+		cfg := &config.ModelConfig{Name: "plain-chat"}
+		code, msg, ok := prepareRealtimeConfig(cfg)
+		Expect(ok).To(BeFalse())
+		Expect(code).To(Equal("invalid_model"))
+		Expect(msg).To(ContainSubstring("not a pipeline model"))
+	})
+
+	It("accepts a model with a fully populated legacy pipeline", func() {
+		cfg := &config.ModelConfig{
+			Name: "legacy",
+			Pipeline: config.Pipeline{
+				VAD:           "silero",
+				Transcription: "whisper",
+				LLM:           "llama",
+				TTS:           "piper",
+			},
+		}
+		_, _, ok := prepareRealtimeConfig(cfg)
+		Expect(ok).To(BeTrue())
+		Expect(cfg.Pipeline.LLM).To(Equal("llama"), "user-supplied pipeline slot must not be overwritten")
+	})
+
+	It("accepts a self-contained realtime_audio model and self-pipelines empty slots", func() {
+		cfg := &config.ModelConfig{
+			Name:          "lfm2.5-audio-realtime",
+			KnownUsecases: withUsecases(config.FLAG_REALTIME_AUDIO),
+		}
+		_, _, ok := prepareRealtimeConfig(cfg)
+		Expect(ok).To(BeTrue())
+		Expect(cfg.Pipeline.VAD).To(Equal("lfm2.5-audio-realtime"))
+		Expect(cfg.Pipeline.Transcription).To(Equal("lfm2.5-audio-realtime"))
+		Expect(cfg.Pipeline.LLM).To(Equal("lfm2.5-audio-realtime"))
+		Expect(cfg.Pipeline.TTS).To(Equal("lfm2.5-audio-realtime"))
+	})
+
+	It("preserves user-pinned pipeline slots on a realtime_audio model", func() {
+		// A user might want a dedicated silero-vad and let the realtime_audio
+		// model own only STT/LLM/TTS.
+		cfg := &config.ModelConfig{
+			Name:          "lfm-with-external-vad",
+			KnownUsecases: withUsecases(config.FLAG_REALTIME_AUDIO),
+			Pipeline: config.Pipeline{
+				VAD: "silero-vad",
+			},
+		}
+		_, _, ok := prepareRealtimeConfig(cfg)
+		Expect(ok).To(BeTrue())
+		Expect(cfg.Pipeline.VAD).To(Equal("silero-vad"))
+		Expect(cfg.Pipeline.Transcription).To(Equal("lfm-with-external-vad"))
+		Expect(cfg.Pipeline.LLM).To(Equal("lfm-with-external-vad"))
+		Expect(cfg.Pipeline.TTS).To(Equal("lfm-with-external-vad"))
+	})
+
+	It("accepts a model with at least one legacy pipeline slot set", func() {
+		// Pre-existing behaviour: the gate only rejected when ALL four slots
+		// were empty. Lock that in so the change doesn't tighten the gate.
+		cfg := &config.ModelConfig{
+			Name: "partial",
+			Pipeline: config.Pipeline{
+				LLM: "llama",
+			},
+		}
+		_, _, ok := prepareRealtimeConfig(cfg)
+		Expect(ok).To(BeTrue())
+	})
+})
+
+var _ = Describe("defaultMaxHistoryItems", func() {
+	It("caps realtime_audio sessions at 6", func() {
+		cfg := &config.ModelConfig{KnownUsecases: withUsecases(config.FLAG_REALTIME_AUDIO)}
+		Expect(defaultMaxHistoryItems(cfg)).To(Equal(6))
+	})
+	It("leaves legacy pipelines unlimited", func() {
+		cfg := &config.ModelConfig{Pipeline: config.Pipeline{LLM: "llama"}}
+		Expect(defaultMaxHistoryItems(cfg)).To(Equal(0))
+	})
+	It("tolerates nil", func() {
+		Expect(defaultMaxHistoryItems(nil)).To(Equal(0))
+	})
+})
+
+var _ = Describe("trimRealtimeItems", func() {
+	user := func(id string) *types.MessageItemUnion {
+		return &types.MessageItemUnion{User: &types.MessageItemUser{ID: id}}
+	}
+	assistant := func(id string) *types.MessageItemUnion {
+		return &types.MessageItemUnion{Assistant: &types.MessageItemAssistant{ID: id}}
+	}
+	fnCall := func(id, callID string) *types.MessageItemUnion {
+		return &types.MessageItemUnion{FunctionCall: &types.MessageItemFunctionCall{ID: id, CallID: callID}}
+	}
+	fnOut := func(id, callID string) *types.MessageItemUnion {
+		return &types.MessageItemUnion{FunctionCallOutput: &types.MessageItemFunctionCallOutput{ID: id, CallID: callID}}
+	}
+
+	It("returns the input unchanged when cap is zero", func() {
+		in := []*types.MessageItemUnion{user("u1"), assistant("a1")}
+		Expect(trimRealtimeItems(in, 0)).To(Equal(in))
+	})
+
+	It("returns the input unchanged when under the cap", func() {
+		in := []*types.MessageItemUnion{user("u1"), assistant("a1")}
+		Expect(trimRealtimeItems(in, 4)).To(Equal(in))
+	})
+
+	It("keeps the tail when over the cap", func() {
+		in := []*types.MessageItemUnion{user("u1"), assistant("a1"), user("u2"), assistant("a2"), user("u3")}
+		out := trimRealtimeItems(in, 3)
+		Expect(out).To(HaveLen(3))
+		Expect(out[0].User.ID).To(Equal("u2"))
+		Expect(out[2].User.ID).To(Equal("u3"))
+	})
+
+	It("pulls the cut left to keep a function_call paired with its output", func() {
+		// 0:user 1:fc 2:fc_out 3:assistant — cap=2 would otherwise start at
+		// index 2 (orphan fc_out). Helper must roll back to include 1.
+		in := []*types.MessageItemUnion{user("u1"), fnCall("fc1", "c1"), fnOut("fo1", "c1"), assistant("a1")}
+		out := trimRealtimeItems(in, 2)
+		// Expect at least the fc + fc_out + assistant (3 items, cap was 2)
+		// — the rollback prefers correctness over the cap.
+		Expect(len(out)).To(BeNumerically(">=", 3))
+		Expect(out[0].FunctionCall).NotTo(BeNil())
+		Expect(out[1].FunctionCallOutput).NotTo(BeNil())
+	})
+})
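The specs above pin down the observable behaviour of `trimRealtimeItems` without showing its body in this part of the diff. A minimal sketch of a helper that would satisfy them follows. It is an assumption, not the implementation from this change: `item` is a hypothetical stand-in for `types.MessageItemUnion`, and `trimItems` only handles the simple case where the first kept item is an orphaned output.

```go
package main

import "fmt"

// item is a hypothetical, simplified stand-in for types.MessageItemUnion.
type item struct {
	Kind   string // "user", "assistant", "function_call", "function_call_output"
	CallID string
}

// trimItems keeps the last maxItems entries. A cap of zero or less means
// unlimited. If the first kept item would be a function_call_output, the cut
// moves left until it no longer starts on an output, so a call followed
// directly by its output stays paired even if that exceeds the cap.
func trimItems(items []item, maxItems int) []item {
	if maxItems <= 0 || len(items) <= maxItems {
		return items
	}
	start := len(items) - maxItems
	for start > 0 && items[start].Kind == "function_call_output" {
		start--
	}
	return items[start:]
}

func main() {
	in := []item{
		{Kind: "user"},
		{Kind: "function_call", CallID: "c1"},
		{Kind: "function_call_output", CallID: "c1"},
		{Kind: "assistant"},
	}
	out := trimItems(in, 2)
	fmt.Println(len(out), out[0].Kind)
}
```

The sketch deliberately prefers a well-formed history over a strict cap, which is the trade-off the last spec asserts with its `>= 3` expectation.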
--- a/core/http/endpoints/openai/realtime_modality_test.go
+++ b/core/http/endpoints/openai/realtime_modality_test.go
@@ -0,0 +1,39 @@
+package openai
+
+import (
+	"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("resolveOutputModalities", func() {
+	It("defaults to audio when neither session nor response specify", func() {
+		got := resolveOutputModalities(nil, nil)
+		Expect(got).To(ConsistOf(types.ModalityAudio))
+	})
+
+	It("uses session modalities when response omits them", func() {
+		sess := []types.Modality{types.ModalityText}
+		got := resolveOutputModalities(sess, nil)
+		Expect(got).To(ConsistOf(types.ModalityText))
+	})
+
+	It("response modalities override session", func() {
+		sess := []types.Modality{types.ModalityAudio}
+		resp := []types.Modality{types.ModalityText}
+		got := resolveOutputModalities(sess, resp)
+		Expect(got).To(ConsistOf(types.ModalityText))
+	})
+
+	It("returns false from modalitiesContainAudio for text-only", func() {
+		Expect(modalitiesContainAudio([]types.Modality{types.ModalityText})).To(BeFalse())
+	})
+
+	It("returns true from modalitiesContainAudio for audio (default)", func() {
+		Expect(modalitiesContainAudio([]types.Modality{types.ModalityAudio})).To(BeTrue())
+	})
+
+	It("returns true when both audio and text are present", func() {
+		Expect(modalitiesContainAudio([]types.Modality{types.ModalityText, types.ModalityAudio})).To(BeTrue())
+	})
+})
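The precedence these specs describe (response overrides session, audio is the default) can be sketched as below. This is an illustration under assumptions: `resolveModalities` and `containsAudio` are hypothetical names, and the string values of the modality constants are guesses rather than the values defined in the `types` package.

```go
package main

import "fmt"

// Modality and its constants are hypothetical stand-ins for types.Modality.
type Modality string

const (
	ModalityText  Modality = "text"
	ModalityAudio Modality = "audio"
)

// resolveModalities applies the precedence asserted by the specs:
// response-level modalities win, then session-level, then audio.
func resolveModalities(session, response []Modality) []Modality {
	if len(response) > 0 {
		return response
	}
	if len(session) > 0 {
		return session
	}
	return []Modality{ModalityAudio}
}

// containsAudio reports whether audio output was requested.
func containsAudio(mods []Modality) bool {
	for _, m := range mods {
		if m == ModalityAudio {
			return true
		}
	}
	return false
}

func main() {
	mods := resolveModalities([]Modality{ModalityAudio}, []Modality{ModalityText})
	fmt.Println(mods, containsAudio(mods))
}
```

In the handler, a false result from the audio check is what selects the text-only branch that skips TTS.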
--- a/core/http/endpoints/openai/realtime_webrtc.go
+++ b/core/http/endpoints/openai/realtime_webrtc.go
@@ -15,6 +15,10 @@ import (
 type RealtimeCallRequest struct {
 	SDP   string `json:"sdp"`
 	Model string `json:"model"`
+	// LocalAIAssistant opts the session into the in-process admin tool
+	// surface (the same feature as the chat page's "Manage Mode"). Admin-only;
+	// the realtime entry point gates it the same way the chat handler does.
+	LocalAIAssistant bool `json:"localai_assistant,omitempty"`
 }

 // RealtimeCallResponse is the JSON response for POST /v1/realtime/calls.
@@ -165,9 +169,13 @@ func RealtimeCalls(application *application.Application) echo.HandlerFunc {

 		// Start the realtime session in a goroutine
 		evaluator := application.TemplatesEvaluator()
+		opts := RealtimeSessionOptions{
+			LocalAIAssistant: req.LocalAIAssistant,
+			IsAdmin:          isCurrentUserAdmin(c, application),
+		}
 		go func() {
 			defer transport.Close()
-			runRealtimeSession(application, transport, req.Model, evaluator)
+			runRealtimeSession(application, transport, req.Model, evaluator, opts)
 		}()

 		return c.JSON(http.StatusCreated, RealtimeCallResponse{
--- a/core/http/middleware/baseurl.go
+++ b/core/http/middleware/baseurl.go
@@ -6,20 +6,55 @@ import (
 	"github.com/labstack/echo/v4"
 )

+// BasePathPrefix returns the URL path prefix that the request was reached
+// under (e.g. "/myprefix/"). It always returns a value that starts and ends
+// with `/`, defaulting to "/" when the app is not behind a path prefix.
+//
+// It first looks at the path StripPathPrefix removed (when the proxy forwards
+// the prefix in the URL), then falls back to the X-Forwarded-Prefix header
+// (when the proxy strips the prefix before forwarding, e.g. Caddy's
+// handle_path).
+//
+// The header fallback is gated through SafeForwardedPrefix because the value
+// flows into the SPA HTML response (both <base href> and the path-absolute
+// asset URL rewrite in serveIndex). X-Forwarded-Prefix is
+// attacker-controllable on misconfigured proxy chains; without that gate a value like
+// "//evil.com" turns the asset rewrite into a protocol-relative URL that
+// loads JS from a foreign origin.
+func BasePathPrefix(c echo.Context) string {
+	path := c.Path()
+	origPath := c.Request().URL.Path
+
+	if storedPath, ok := c.Get("_original_path").(string); ok && storedPath != "" {
+		origPath = storedPath
+	}
+
+	if path != origPath && strings.HasSuffix(origPath, path) && len(path) > 0 {
+		prefixLen := len(origPath) - len(path)
+		if prefixLen > 0 {
+			pathPrefix := origPath[:prefixLen]
+			if !strings.HasSuffix(pathPrefix, "/") {
+				pathPrefix += "/"
+			}
+			return pathPrefix
+		}
+	}
+
+	if validated, ok := SafeForwardedPrefix(c.Request().Header.Get("X-Forwarded-Prefix")); ok {
+		if !strings.HasSuffix(validated, "/") {
+			validated += "/"
+		}
+		return validated
+	}
+
+	return "/"
+}
+
 // BaseURL returns the base URL for the given HTTP request context.
 // It takes into account that the app may be exposed by a reverse-proxy under a different protocol, host and path.
 // The returned URL is guaranteed to end with `/`.
 // The method should be used in conjunction with the StripPathPrefix middleware.
 func BaseURL(c echo.Context) string {
-	path := c.Path()
-	origPath := c.Request().URL.Path
-
-	// Check if StripPathPrefix middleware stored the original path
-	if storedPath, ok := c.Get("_original_path").(string); ok && storedPath != "" {
-		origPath = storedPath
-	}
-
-	// Check X-Forwarded-Proto for scheme
 	scheme := "http"
 	if c.Request().Header.Get("X-Forwarded-Proto") == "https" {
 		scheme = "https"
@@ -27,22 +62,10 @@ func BaseURL(c echo.Context) string {
 		scheme = "https"
 	}

-	// Check X-Forwarded-Host for host
 	host := c.Request().Host
 	if forwardedHost := c.Request().Header.Get("X-Forwarded-Host"); forwardedHost != "" {
 		host = forwardedHost
 	}

-	if path != origPath && strings.HasSuffix(origPath, path) && len(path) > 0 {
-		prefixLen := len(origPath) - len(path)
-		if prefixLen > 0 && prefixLen <= len(origPath) {
-			pathPrefix := origPath[:prefixLen]
-			if !strings.HasSuffix(pathPrefix, "/") {
-				pathPrefix += "/"
-			}
-			return scheme + "://" + host + pathPrefix
-		}
-	}
-
-	return scheme + "://" + host + "/"
+	return scheme + "://" + host + BasePathPrefix(c)
 }
--- a/core/http/middleware/baseurl_test.go
+++ b/core/http/middleware/baseurl_test.go
@@ -55,4 +55,84 @@ var _ = Describe("BaseURL", func() {
 			Expect(actualURL).To(Equal("http://example.com/myprefix/"), "base URL")
 		})
 	})
+
+	// Caddy's handle_path (and similar reverse-proxy directives) strips the
+	// matched prefix before forwarding upstream, so LocalAI receives the
+	// already-stripped path together with X-Forwarded-Prefix. In that case
+	// StripPathPrefix never stores _original_path, but BaseURL must still
+	// honor the header so that <base href> and asset URLs include the prefix.
+	Context("with X-Forwarded-Prefix header but pre-stripped path", func() {
+		It("should return base URL with prefix from header", func() {
+			app := echo.New()
+			actualURL := ""
+
+			routePath := "/app"
+			app.GET(routePath, func(c echo.Context) error {
+				actualURL = BaseURL(c)
+				return nil
+			})
+
+			req := httptest.NewRequest("GET", "/app", nil)
+			req.Header.Set("X-Forwarded-Prefix", "/localai")
+			rec := httptest.NewRecorder()
+			app.ServeHTTP(rec, req)
+
+			Expect(rec.Code).To(Equal(200), "response status code")
+			Expect(actualURL).To(Equal("http://example.com/localai/"), "base URL")
+		})
+
+		It("should normalize a prefix that already ends with a slash", func() {
+			app := echo.New()
+			actualURL := ""
+
+			routePath := "/app"
+			app.GET(routePath, func(c echo.Context) error {
+				actualURL = BaseURL(c)
+				return nil
+			})
+
+			req := httptest.NewRequest("GET", "/app", nil)
+			req.Header.Set("X-Forwarded-Prefix", "/localai/")
+			rec := httptest.NewRecorder()
+			app.ServeHTTP(rec, req)
+
+			Expect(rec.Code).To(Equal(200), "response status code")
+			Expect(actualURL).To(Equal("http://example.com/localai/"), "base URL")
+		})
+	})
+
+	// X-Forwarded-Prefix is attacker-controllable on misconfigured proxy
+	// chains, and the value flows into the SPA HTML response (<base href>
+	// and asset URLs). BasePathPrefix must gate the header through
+	// SafeForwardedPrefix so values that turn the prefix into an open
+	// redirect or a protocol-relative URL are ignored and the base falls
+	// back to "/".
+	Context("with unsafe X-Forwarded-Prefix header", func() {
+		DescribeTable("falls back to / when the header is unsafe",
+			func(header string) {
+				app := echo.New()
+				actualURL := ""
+
+				app.GET("/app", func(c echo.Context) error {
+					actualURL = BaseURL(c)
+					return nil
+				})
+
+				req := httptest.NewRequest("GET", "/app", nil)
+				req.Header.Set("X-Forwarded-Prefix", header)
+				rec := httptest.NewRecorder()
+				app.ServeHTTP(rec, req)
+
+				Expect(rec.Code).To(Equal(200), "response status code")
+				Expect(actualURL).To(Equal("http://example.com/"), "base URL")
+			},
+			Entry("protocol-relative URL", "//evil.com"),
+			Entry("protocol-relative URL with path", "//evil.com/assets"),
+			Entry("backslash path", `/foo\bar`),
+			Entry("embedded NUL", "/foo\x00bar"),
+			Entry("CR injection", "/foo\rbar"),
+			Entry("LF injection", "/foo\nbar"),
+			Entry("missing leading slash", "evil"),
+		)
+	})
 })
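The table entries above suggest what a header validator has to reject. The body of `SafeForwardedPrefix` is not part of this section, so the following is only a sketch of one validator that would pass the listed cases. `safePrefix` and `basePath` are hypothetical names and the rules are inferred from the test entries alone.

```go
package main

import (
	"fmt"
	"strings"
)

// safePrefix is a hypothetical validator inferred from the test table:
// the value must be path-absolute, must not be protocol-relative, and must
// not contain a backslash, NUL, CR or LF.
func safePrefix(raw string) (string, bool) {
	if !strings.HasPrefix(raw, "/") || strings.HasPrefix(raw, "//") {
		return "", false
	}
	if strings.ContainsAny(raw, "\\\x00\r\n") {
		return "", false
	}
	return raw, true
}

// basePath normalises a validated prefix to end with "/" and falls back
// to "/" when the header is empty or unsafe.
func basePath(header string) string {
	p, ok := safePrefix(header)
	if !ok {
		return "/"
	}
	if !strings.HasSuffix(p, "/") {
		p += "/"
	}
	return p
}

func main() {
	for _, h := range []string{"/localai", "/localai/", "//evil.com", "evil"} {
		fmt.Printf("%q -> %q\n", h, basePath(h))
	}
}
```

Rejecting the value outright, rather than trying to sanitise it, keeps the fallback predictable: an unsafe header behaves exactly like a missing one.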
--- a/core/http/middleware/request.go
+++ b/core/http/middleware/request.go
@@ -14,7 +14,6 @@ import (
 	"github.com/mudler/LocalAI/core/schema"
 	"github.com/mudler/LocalAI/core/services/galleryop"
 	"github.com/mudler/LocalAI/core/templates"
-	"github.com/mudler/LocalAI/pkg/functions"
 	"github.com/mudler/LocalAI/pkg/model"
 	"github.com/mudler/LocalAI/pkg/utils"
 	"github.com/mudler/xlog"
@@ -241,6 +240,28 @@ func (re *RequestExtractor) SetOpenAIRequest(c echo.Context) error {
 	return nil
 }

+// extractToolChoiceFunctionName parses a tool_choice map and returns the
+// specific function name. Accepts both the OpenAI-spec nested shape
+// ({type:function, function:{name:...}}) and the legacy/Anthropic-compat
+// flat shape ({type:function, name:...}); the nested form wins when both
+// are present. Returns "" for malformed input or when the shape names a
+// mode rather than a specific tool.
+func extractToolChoiceFunctionName(m map[string]any) string {
+	tcType, ok := m["type"].(string)
+	if !ok || tcType != "function" {
+		return ""
+	}
+	if fn, ok := m["function"].(map[string]any); ok {
+		if n, ok := fn["name"].(string); ok && n != "" {
+			return n
+		}
+	}
+	if n, ok := m["name"].(string); ok {
+		return n
+	}
+	return ""
+}
+
 func mergeOpenAIRequestAndModelConfig(config *config.ModelConfig, input *schema.OpenAIRequest) error {
 	if input.Echo {
 		config.Echo = input.Echo
@@ -320,17 +341,55 @@ func mergeOpenAIRequestAndModelConfig(config *config.ModelConfig, input *schema.
 	}

 	if input.ToolsChoice != nil {
-		var toolChoice functions.Tool
-
+		// OpenAI tool_choice has three valid shapes plus one tolerated
+		// non-spec form seen in the wild:
+		//
+		//   1. string mode:    "auto" | "none" | "required"
+		//   2. specific tool:  {"type":"function", "function":{"name":"..."}}  (current spec)
+		//   3. legacy:         {"type":"function", "name":"..."}                (older / Anthropic-compat)
+		//   4. double-encoded: "{\"type\":\"function\", ...}"                   (some clients serialize the object)
+		//
+		// The pre-#9559 code unmarshalled the string case through
+		// json.Unmarshal([]byte(content), &functions.Tool{}), which:
+		//   - failed for plain string modes (so "required" / "none" were
+		//     silently ignored and tools stayed enabled regardless), but
+		//   - happened to handle shape 4 by accident.
+		// It also could not parse shape 3 because functions.Tool has no
+		// flat top-level Name field.
+		//
+		// Mirror the parsing pattern from MergeOpenResponsesConfig (#9509),
+		// route results through the existing input.FunctionCall string/map
+		// dispatch downstream (see the switch on input.FunctionCall in this
+		// same function), and preserve the shape-4 fallback so non-spec
+		// clients don't silently break. Tracked in #9508; sibling fix in #9526.
 		switch content := input.ToolsChoice.(type) {
 		case string:
-			_ = json.Unmarshal([]byte(content), &toolChoice)
+			// "auto" is the default and needs no override. "none" and "required"
+			// both reach SetFunctionCallString via the input.FunctionCall string
+			// branch below; ShouldUseFunctions() then returns false for "none"
+			// (tools disabled) and true for "required" (mode engaged).
+			//
+			// If the string looks like a JSON object, try shape 4 first: parse
+			// it as a tool_choice map and use the resulting name. Falling back
+			// to mode-string handling when the parse yields no usable name keeps
+			// genuinely-malformed input from accidentally engaging a mode.
+			if content == "" || content == "auto" {
+				break
+			}
+			if strings.HasPrefix(strings.TrimSpace(content), "{") {
+				var nested map[string]any
+				if err := json.Unmarshal([]byte(content), &nested); err == nil {
+					if name := extractToolChoiceFunctionName(nested); name != "" {
+						input.FunctionCall = map[string]any{"name": name}
+						break
+					}
+				}
+			}
+			input.FunctionCall = content
 		case map[string]any:
-			dat, _ := json.Marshal(content)
-			_ = json.Unmarshal(dat, &toolChoice)
-		}
-		input.FunctionCall = map[string]any{
-			"name": toolChoice.Function.Name,
+			if name := extractToolChoiceFunctionName(content); name != "" {
+				input.FunctionCall = map[string]any{"name": name}
+			}
 		}
 	}

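The four `tool_choice` shapes handled above can be exercised outside the middleware. The sketch below copies the name-extraction logic from the diff into a standalone program. `normalizeToolChoice` is a hypothetical wrapper that returns either a mode string or a function name instead of writing to `input.FunctionCall`, so it illustrates the dispatch rather than reproducing the handler.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// functionName follows extractToolChoiceFunctionName from the diff:
// the nested spec shape wins over the flat legacy shape.
func functionName(m map[string]any) string {
	if t, ok := m["type"].(string); !ok || t != "function" {
		return ""
	}
	if fn, ok := m["function"].(map[string]any); ok {
		if n, ok := fn["name"].(string); ok && n != "" {
			return n
		}
	}
	if n, ok := m["name"].(string); ok {
		return n
	}
	return ""
}

// normalizeToolChoice is a hypothetical wrapper: it returns a mode string
// ("none", "required", ...) or a specific function name, never both.
func normalizeToolChoice(raw any) (mode, name string) {
	switch v := raw.(type) {
	case string:
		if v == "" || v == "auto" {
			return "", "" // default behaviour, nothing to override
		}
		if strings.HasPrefix(strings.TrimSpace(v), "{") {
			// Shape 4: a JSON object that the client serialized to a string.
			var nested map[string]any
			if err := json.Unmarshal([]byte(v), &nested); err == nil {
				if n := functionName(nested); n != "" {
					return "", n
				}
			}
		}
		return v, ""
	case map[string]any:
		return "", functionName(v)
	}
	return "", ""
}

func main() {
	fmt.Println(normalizeToolChoice("required"))
	fmt.Println(normalizeToolChoice(`{"type":"function","function":{"name":"get_weather"}}`))
	fmt.Println(normalizeToolChoice(map[string]any{"type": "function", "name": "get_weather"}))
}
```

Returning the mode and the name separately makes the two outcomes explicit; the handler in the diff encodes the same distinction by storing either a string or a `{"name": ...}` map.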
--- a/core/http/middleware/request_test.go
+++ b/core/http/middleware/request_test.go
@@ -306,3 +306,248 @@ var _ = Describe("MergeOpenResponsesConfig tool_choice parsing", func() {
 		})
 	})
 })
+
+// ---------------------------------------------------------------------------
+// SetModelAndConfig + SetOpenAIRequest - /v1/chat/completions tool_choice parsing
+// ---------------------------------------------------------------------------
+//
+// Parallel to the MergeOpenResponsesConfig specs above, but for the chat
+// completions path. The parsing block lives in mergeOpenAIRequestAndModelConfig
+// (called from SetOpenAIRequest), so these tests drive the full middleware
+// chain the way the production /v1/chat/completions route does.
+//
+// What we assert per shape:
+//   - "required"                                  -> ShouldUseFunctions=true,  no specific name
+//   - "none"                                      -> ShouldUseFunctions=false (tools disabled)
+//   - "auto"                                      -> ShouldUseFunctions=true,  no specific name
+//   - {type:function, function:{name:"X"}} (spec) -> ShouldCallSpecificFunction=true, FunctionToCall="X"
+//   - {type:function, name:"X"}        (legacy)   -> ShouldCallSpecificFunction=true, FunctionToCall="X"
+//   - nested+flat both present                    -> nested wins
+//   - malformed (no type / no name)               -> no-op
+var _ = Describe("SetModelAndConfig tool_choice parsing (chat completions)", func() {
+	var (
+		app            *echo.Echo
+		modelDir       string
+		capturedConfig *config.ModelConfig
+	)
+
+	BeforeEach(func() {
+		var err error
+		modelDir, err = os.MkdirTemp("", "localai-test-models-*")
+		Expect(err).ToNot(HaveOccurred())
+
+		cfgContent := []byte("name: test-model\nbackend: llama-cpp\n")
+		Expect(os.WriteFile(filepath.Join(modelDir, "test-model.yaml"), cfgContent, 0644)).To(Succeed())
+
+		ss := &system.SystemState{
+			Model: system.Model{ModelsPath: modelDir},
+		}
+		appConfig := config.NewApplicationConfig()
+		appConfig.SystemState = ss
+
+		mcl := config.NewModelConfigLoader(modelDir)
+		ml := model.NewModelLoader(ss)
+		re := NewRequestExtractor(mcl, ml, appConfig)
+
+		capturedConfig = nil
+		app = echo.New()
+		app.POST("/v1/chat/completions",
+			func(c echo.Context) error {
+				if cfg, ok := c.Get(CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig); ok {
+					capturedConfig = cfg
+				}
+				return c.String(http.StatusOK, "ok")
+			},
+			re.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.OpenAIRequest) }),
+			func(next echo.HandlerFunc) echo.HandlerFunc {
+				return func(c echo.Context) error {
+					if err := re.SetOpenAIRequest(c); err != nil {
+						return err
+					}
+					return next(c)
+				}
+			},
+		)
+	})
+
+	AfterEach(func() {
+		_ = os.RemoveAll(modelDir)
+	})
+
+	// chatReq wraps a tool_choice JSON fragment in a minimal valid chat-completions
+	// payload. The tools array is non-empty so downstream code paths that gate on
+	// len(input.Functions) see something to work with.
+	chatReq := func(toolChoiceJSON string) string {
+		return `{"model":"test-model",` +
+			`"messages":[{"role":"user","content":"hi"}],` +
+			`"tools":[{"type":"function","function":{"name":"get_weather"}}],` +
+			`"tool_choice":` + toolChoiceJSON + `}`
+	}
+
+	Context("string tool_choice", func() {
+		It("enables function calling without forcing a name for tool_choice=\"required\"", func() {
+			rec := postJSON(app, "/v1/chat/completions", chatReq(`"required"`))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(capturedConfig).ToNot(BeNil())
+			Expect(capturedConfig.ShouldCallSpecificFunction()).To(BeFalse())
+			Expect(capturedConfig.ShouldUseFunctions()).To(BeTrue())
+		})
+
+		It("disables tools for tool_choice=\"none\"", func() {
+			// Before #9559 this was a silent no-op (json.Unmarshal of "none"
+			// into functions.Tool failed); now "none" is honored per OpenAI spec.
+			rec := postJSON(app, "/v1/chat/completions", chatReq(`"none"`))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(capturedConfig).ToNot(BeNil())
+			Expect(capturedConfig.ShouldUseFunctions()).To(BeFalse())
+			Expect(capturedConfig.ShouldCallSpecificFunction()).To(BeFalse())
+		})
+
+		It("leaves config untouched for tool_choice=\"auto\"", func() {
+			rec := postJSON(app, "/v1/chat/completions", chatReq(`"auto"`))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(capturedConfig).ToNot(BeNil())
+			Expect(capturedConfig.ShouldCallSpecificFunction()).To(BeFalse())
+			// "auto" is the default: tools available, model decides.
+			Expect(capturedConfig.ShouldUseFunctions()).To(BeTrue())
+			Expect(capturedConfig.FunctionToCall()).To(Equal(""))
+		})
+	})
+
+	Context("specific-function tool_choice (OpenAI spec shape)", func() {
+		It("parses {type:function, function:{name:...}} and forces the named function", func() {
+			rec := postJSON(app, "/v1/chat/completions",
+				chatReq(`{"type":"function","function":{"name":"get_weather"}}`))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(capturedConfig).ToNot(BeNil())
+			// Key invariant: a correctly-formed OpenAI tool_choice must engage
+			// grammar-based forcing via SetFunctionCallNameString.
+			Expect(capturedConfig.ShouldCallSpecificFunction()).To(BeTrue())
+			Expect(capturedConfig.FunctionToCall()).To(Equal("get_weather"))
+		})
+
+		It("prefers the nested function.name over a stray top-level name", func() {
+			rec := postJSON(app, "/v1/chat/completions",
+				chatReq(`{"type":"function","function":{"name":"correct_name"},"name":"legacy_name"}`))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(capturedConfig).ToNot(BeNil())
+			Expect(capturedConfig.FunctionToCall()).To(Equal("correct_name"))
+		})
+	})
+
+	Context("specific-function tool_choice (legacy Anthropic-compat shape)", func() {
+		It("parses {type:function, name:...} and forces the named function", func() {
+			rec := postJSON(app, "/v1/chat/completions",
+				chatReq(`{"type":"function","name":"get_weather"}`))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(capturedConfig).ToNot(BeNil())
+			Expect(capturedConfig.ShouldCallSpecificFunction()).To(BeTrue())
+			Expect(capturedConfig.FunctionToCall()).To(Equal("get_weather"))
+		})
+	})
+
+	// Some non-spec clients send the object form serialized as a JSON string.
+	// The pre-#9559 code accepted that by accident; this Context locks in
+	// continued tolerance so those clients do not silently regress.
+	Context("double-encoded tool_choice (JSON string of an object, non-spec)", func() {
+		It("parses a serialized OpenAI-spec nested object", func() {
+			// The tool_choice value is itself a JSON-encoded string containing
+			// the object form. json.Marshal produces the outer string, so the
+			// quote escaping is correct by construction rather than hand-written.
+			inner := `{"type":"function","function":{"name":"get_weather"}}`
+			encoded, err := json.Marshal(inner)
+			Expect(err).ToNot(HaveOccurred())
+			rec := postJSON(app, "/v1/chat/completions", chatReq(string(encoded)))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(capturedConfig).ToNot(BeNil())
+			Expect(capturedConfig.ShouldCallSpecificFunction()).To(BeTrue())
+			Expect(capturedConfig.FunctionToCall()).To(Equal("get_weather"))
+		})
+
+		It("parses a serialized legacy/Anthropic flat object", func() {
+			inner := `{"type":"function","name":"get_weather"}`
+			encoded, err := json.Marshal(inner)
+			Expect(err).ToNot(HaveOccurred())
+			rec := postJSON(app, "/v1/chat/completions", chatReq(string(encoded)))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(capturedConfig).ToNot(BeNil())
+			Expect(capturedConfig.ShouldCallSpecificFunction()).To(BeTrue())
+			Expect(capturedConfig.FunctionToCall()).To(Equal("get_weather"))
+		})
+
+		It("falls back to mode-string handling when the JSON string parses but has no usable name", func() {
+			// A JSON-string that decodes to a map without a function name
+			// should not engage specific-function forcing. We expect it to
+			// fall through to the mode-string path; the resulting mode is
+			// the raw blob (nonsense), but ShouldCallSpecificFunction stays
+			// false - the invariant that matters.
+			inner := `{"type":"function"}`
+			encoded, err := json.Marshal(inner)
+			Expect(err).ToNot(HaveOccurred())
+			rec := postJSON(app, "/v1/chat/completions", chatReq(string(encoded)))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(capturedConfig).ToNot(BeNil())
+			Expect(capturedConfig.ShouldCallSpecificFunction()).To(BeFalse())
+		})
+	})
+
+	Context("malformed tool_choice", func() {
+		It("is a no-op when type is missing", func() {
+			rec := postJSON(app, "/v1/chat/completions",
+				chatReq(`{"function":{"name":"get_weather"}}`))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(capturedConfig).ToNot(BeNil())
+			Expect(capturedConfig.ShouldCallSpecificFunction()).To(BeFalse())
+		})
+
+		It("is a no-op when type is not \"function\"", func() {
+			rec := postJSON(app, "/v1/chat/completions",
+				chatReq(`{"type":"object","function":{"name":"get_weather"}}`))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(capturedConfig).ToNot(BeNil())
+			Expect(capturedConfig.ShouldCallSpecificFunction()).To(BeFalse())
+		})
+
+		It("is a no-op when name is missing from both shapes", func() {
+			rec := postJSON(app, "/v1/chat/completions",
+				chatReq(`{"type":"function","function":{}}`))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(capturedConfig).ToNot(BeNil())
+			Expect(capturedConfig.ShouldCallSpecificFunction()).To(BeFalse())
+			Expect(capturedConfig.FunctionToCall()).To(Equal(""))
+		})
+
+		It("is a no-op when name is empty string", func() {
+			rec := postJSON(app, "/v1/chat/completions",
+				chatReq(`{"type":"function","function":{"name":""}}`))
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(capturedConfig).ToNot(BeNil())
+			Expect(capturedConfig.ShouldCallSpecificFunction()).To(BeFalse())
+		})
+	})
+
+	Context("nil tool_choice", func() {
+		It("is a no-op", func() {
+			rec := postJSON(app, "/v1/chat/completions",
+				`{"model":"test-model","messages":[{"role":"user","content":"hi"}]}`)
+
+			Expect(rec.Code).To(Equal(http.StatusOK))
+			Expect(capturedConfig).ToNot(BeNil())
+			Expect(capturedConfig.ShouldCallSpecificFunction()).To(BeFalse())
+			Expect(capturedConfig.FunctionToCall()).To(Equal(""))
+		})
+	})
+})
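
The specs above pin down how a `tool_choice` value maps to a forced function name. The following is a minimal, self-contained sketch of those rules, not the project's implementation: the body of `extractToolChoiceFunctionName` is not shown in this diff, so `toolChoiceName` and `parseToolChoice` are hypothetical stand-ins written only from the behavior the tests assert. Falling back to the flat `name` when the nested name is empty is an assumption; the tests do not cover that combination.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolChoiceName is a hypothetical stand-in for extractToolChoiceFunctionName.
// It returns "" unless the object has type "function" and a non-empty name.
func toolChoiceName(m map[string]any) string {
	if t, _ := m["type"].(string); t != "function" {
		return ""
	}
	// OpenAI spec shape: {"type":"function","function":{"name":"X"}}.
	// Checked first, so the nested name wins over a stray top-level name.
	if fn, ok := m["function"].(map[string]any); ok {
		if name, _ := fn["name"].(string); name != "" {
			return name
		}
	}
	// Legacy flat shape: {"type":"function","name":"X"}.
	name, _ := m["name"].(string)
	return name
}

// parseToolChoice (hypothetical) decodes a raw tool_choice JSON value and
// returns the forced function name, or "" when no specific function applies.
func parseToolChoice(raw string) string {
	var v any
	if err := json.Unmarshal([]byte(raw), &v); err != nil {
		return ""
	}
	switch c := v.(type) {
	case map[string]any:
		return toolChoiceName(c)
	case string:
		// Non-spec clients may send the object form serialized as a string.
		// Mode strings such as "auto" fail to decode as an object and yield "".
		var nested map[string]any
		if err := json.Unmarshal([]byte(c), &nested); err == nil {
			return toolChoiceName(nested)
		}
	}
	return ""
}

func main() {
	inputs := []string{
		`"required"`,
		`{"type":"function","function":{"name":"get_weather"}}`,
		`{"type":"function","name":"get_weather"}`,
		`{"type":"function","function":{"name":"correct_name"},"name":"legacy_name"}`,
		`"{\"type\":\"function\",\"name\":\"get_weather\"}"`,
		`{"function":{"name":"get_weather"}}`,
	}
	for _, raw := range inputs {
		fmt.Printf("%s -> %q\n", raw, parseToolChoice(raw))
	}
}
```

An empty result corresponds to the cases where the specs expect `ShouldCallSpecificFunction()` to stay false; how the mode strings themselves (`"none"`, `"required"`, `"auto"`) are applied is outside this sketch.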
--- a/core/http/react-ui/public/locales/en/models.json
+++ b/core/http/react-ui/public/locales/en/models.json
@@ -24,6 +24,7 @@
    "diarization": "Diarization",
    "soundGen": "Sound",
    "audioTransform": "Audio FX",
+    "realtimeAudio": "Realtime Audio",
    "embedding": "Embeddings",
    "rerank": "Rerank",
    "detection": "Detection",
--- a/core/http/react-ui/src/hooks/useChat.js
+++ b/core/http/react-ui/src/hooks/useChat.js
@@ -218,9 +218,15 @@ export function useChat(initialModel = '') {
          })
          userFiles.push({ name: file.name, type: 'audio' })
        } else {
-          // Text/PDF files - append to content
-          userFiles.push({ name: file.name, type: 'file', content: file.textContent || '' })
-        }
+          // Text/PDF files - append to content
+          if (file.textContent) {
+            messageContent.push({
+              type: 'text',
+              text: `\n\n--- File: ${file.name} ---\n${file.textContent}\n--- End of ${file.name} ---`,
+            })
+          }
+          userFiles.push({ name: file.name, type: 'file', content: file.textContent || '' })
+        }
      }
    } else {
      messageContent = content
@@ -255,7 +261,10 @@ export function useChat(initialModel = '') {
    )
    messages.push(...historyForApi, { role: 'user', content: messageContent })

-    const requestBody = { model, messages, stream: true }
+    // include_usage tells LocalAI to emit a trailing chunk with token totals;
+    // without it the spec-compliant server drops `usage` from the stream and
+    // the token-count badge would never populate.
+    const requestBody = { model, messages, stream: true, stream_options: { include_usage: true } }
    if (temperature !== null && temperature !== undefined) requestBody.temperature = temperature
    if (topP !== null && topP !== undefined) requestBody.top_p = topP
    if (topK !== null && topK !== undefined) requestBody.top_k = topK
--- a/core/http/react-ui/src/pages/FineTune.jsx
+++ b/core/http/react-ui/src/pages/FineTune.jsx
@@ -732,6 +732,9 @@ export default function FineTune() {
  const [seed, setSeed] = useState(0)
  const [mixedPrecision, setMixedPrecision] = useState('')
  const [extraOptions, setExtraOptions] = useState([])
+  // liquid-audio specific knobs (folded into extra_options on submit)
+  const [liquidAudioVoice, setLiquidAudioVoice] = useState('')
+  const [liquidAudioValDataset, setLiquidAudioValDataset] = useState('')
  const [hfToken, setHfToken] = useState('')
  const [showAdvanced, setShowAdvanced] = useState(false)
  const [resumeFromCheckpoint, setResumeFromCheckpoint] = useState('')
@@ -801,6 +804,12 @@ export default function FineTune() {
      for (const { key, value } of extraOptions) {
        if (key.trim()) extra[key.trim()] = value
      }
+      // Fold liquid-audio specific fields into extra_options. The Python
+      // backend reads `voice` and `val_dataset` directly from there.
+      if (backend === 'liquid-audio') {
+        if (liquidAudioVoice) extra.voice = liquidAudioVoice
+        if (liquidAudioValDataset.trim()) extra.val_dataset = liquidAudioValDataset.trim()
+      }

      const isAdapter = ['lora', 'loha', 'lokr'].includes(trainingType)

@@ -872,6 +881,10 @@ export default function FineTune() {
    for (const { key, value } of extraOptions) {
      if (key.trim()) extra[key.trim()] = value
    }
+    if (backend === 'liquid-audio') {
+      if (liquidAudioVoice) extra.voice = liquidAudioVoice
+      if (liquidAudioValDataset.trim()) extra.val_dataset = liquidAudioValDataset.trim()
+    }
    return {
      model,
      backend,
@@ -965,10 +978,15 @@ export default function FineTune() {
      setSaveTotalLimit(Number(config.extra_options.save_total_limit))
    }

+    // Restore liquid-audio specific extras (also filtered out of the
+    // freeform list below).
+    if (config.extra_options?.voice != null) setLiquidAudioVoice(String(config.extra_options.voice))
+    if (config.extra_options?.val_dataset != null) setLiquidAudioValDataset(String(config.extra_options.val_dataset))
+
    // Convert extra_options object to [{key, value}] entries, filtering out handled keys
    if (config.extra_options && typeof config.extra_options === 'object') {
      const entries = Object.entries(config.extra_options)
-        .filter(([k]) => !['max_seq_length', 'save_total_limit', 'hf_token', 'eval_strategy', 'eval_steps', 'eval_split', 'eval_dataset_source', 'eval_split_ratio'].includes(k))
+        .filter(([k]) => !['max_seq_length', 'save_total_limit', 'hf_token', 'eval_strategy', 'eval_steps', 'eval_split', 'eval_dataset_source', 'eval_split_ratio', 'voice', 'val_dataset'].includes(k))
        .map(([key, value]) => ({ key, value: String(value) }))
      setExtraOptions(entries)
    }
@@ -1458,6 +1476,31 @@ export default function FineTune() {
                  </div>
                )}

+                {backend === 'liquid-audio' && (
+                  <div style={{ marginBottom: 'var(--spacing-md)' }}>
+                    <label className="form-label">Liquid Audio</label>
+                    <div style={{ fontSize: '0.8125rem', color: 'var(--color-text-muted)', marginBottom: 'var(--spacing-sm)' }}>
+                      The dataset must be preprocessed by <code>LFM2AudioChatMapper</code> (a directory of LFM2DataLoader-ready Arrow files). See <code>liquid_audio/examples/preprocess_jenny_tts.py</code> for the conversion recipe.
+                    </div>
+                    <div style={{ display: 'grid', gridTemplateColumns: 'repeat(auto-fit, minmax(220px, 1fr))', gap: 'var(--spacing-sm)' }}>
+                      <div>
+                        <label className="form-label">TTS Voice (optional)</label>
+                        <select value={liquidAudioVoice} onChange={e => setLiquidAudioVoice(e.target.value)} className="input">
+                          <option value="">— inherit from system prompt —</option>
+                          <option value="us_male">us_male</option>
+                          <option value="us_female">us_female</option>
+                          <option value="uk_male">uk_male</option>
+                          <option value="uk_female">uk_female</option>
+                        </select>
+                      </div>
+                      <div>
+                        <label className="form-label">Validation Dataset (path)</label>
+                        <input type="text" value={liquidAudioValDataset} onChange={e => setLiquidAudioValDataset(e.target.value)} placeholder="e.g. /data/jenny_tts/val" className="input" />
+                      </div>
+                    </div>
+                  </div>
+                )}
+
                <div>
                  <label className="form-label">Extra Options (backend-specific key-value pairs)</label>
                  <KeyValueEditor entries={extraOptions} onChange={setExtraOptions} />
--- a/core/http/react-ui/src/pages/Home.jsx
+++ b/core/http/react-ui/src/pages/Home.jsx
@@ -161,7 +161,11 @@ export default function Home() {
    const newFiles = []
    for (const file of fileList) {
      const base64 = await fileToBase64(file)
-      newFiles.push({ name: file.name, type: file.type, base64 })
+      const entry = { name: file.name, type: file.type, base64 }
+      if (!file.type.startsWith('image/') && !file.type.startsWith('audio/')) {
+        entry.textContent = await file.text().catch(() => '')
+      }
+      newFiles.push(entry)
    }
    setter(prev => [...prev, ...newFiles])
  }, [])
--- a/core/http/react-ui/src/pages/Models.jsx
+++ b/core/http/react-ui/src/pages/Models.jsx
@@ -28,6 +28,7 @@ const FILTERS = [
  { key: 'diarization', labelKey: 'filters.diarization', icon: 'fa-users' },
  { key: 'sound_generation', labelKey: 'filters.soundGen', icon: 'fa-music' },
  { key: 'audio_transform', labelKey: 'filters.audioTransform', icon: 'fa-sliders' },
+  { key: 'realtime_audio', labelKey: 'filters.realtimeAudio', icon: 'fa-tower-broadcast' },
  { key: 'embeddings', labelKey: 'filters.embedding', icon: 'fa-vector-square' },
  { key: 'rerank', labelKey: 'filters.rerank', icon: 'fa-sort' },
  { key: 'detection', labelKey: 'filters.detection', icon: 'fa-bullseye' },
--- a/core/http/react-ui/src/pages/Talk.jsx
+++ b/core/http/react-ui/src/pages/Talk.jsx
@@ -2,6 +2,10 @@ import { useState, useRef, useEffect, useCallback, useMemo } from 'react'
 import { useOutletContext, useNavigate } from 'react-router-dom'
 import { realtimeApi } from '../utils/api'
 import ModelSelector from '../components/ModelSelector'
+import ClientMCPDropdown from '../components/ClientMCPDropdown'
+import { useMCPClient } from '../hooks/useMCPClient'
+import { loadClientMCPServers } from '../utils/mcpClientStorage'
+import { useAuth } from '../context/AuthContext'

 const STATUS_STYLES = {
  disconnected: { icon: 'fa-solid fa-circle', color: 'var(--color-text-secondary)', bg: 'transparent' },
@@ -40,6 +44,27 @@ export default function Talk() {
  const [voiceEdited, setVoiceEdited] = useState(false)
  const [language, setLanguage] = useState('')

+  // Client MCP — mirrors the chat page's wiring (useMCPClient + ClientMCPDropdown).
+  // Talk has a single ephemeral session, so the active server set lives in component
+  // state rather than per-chat config.
+  const [clientMCPServers, setClientMCPServers] = useState(() => loadClientMCPServers())
+  const [activeMCPIds, setActiveMCPIds] = useState([])
+  const {
+    connect: mcpConnect,
+    disconnect: mcpDisconnect,
+    getToolsForLLM,
+    isClientTool,
+    executeTool,
+    connectionStatuses,
+    getConnectedTools,
+  } = useMCPClient()
+
+  // LocalAI Assistant ("Manage Mode") — mirrors the chat-page toggle.
+  // Admin-only; the realtime endpoint enforces the gate too. When on, the
+  // backend mounts the in-process MCP admin tool surface for this session.
+  const { isAdmin } = useAuth()
+  const [manageMode, setManageMode] = useState(false)
+
  // Diagnostics
  const [diagVisible, setDiagVisible] = useState(false)

@@ -75,7 +100,7 @@ export default function Talk() {
          if (!voiceEdited) setVoice(models[0].voice || '')
        }
      })
-      .catch(err => addToast(`Failed to load pipeline models: ${err.message}`, 'error', 5000, { link: { href: '/app/traces?tab=backend', text: 'View traces' } }))
+      .catch(err => addToast(`Failed to load realtime models: ${err.message}`, 'error', 5000, { link: { href: '/app/traces?tab=backend', text: 'View traces' } }))
      .finally(() => setModelsLoading(false))
  }, [])

@@ -84,6 +109,32 @@ export default function Talk() {
    transcriptEndRef.current?.scrollIntoView({ behavior: 'smooth' })
  }, [transcript])

+  // Mirror Chat.jsx: connect / disconnect client MCP servers as the user toggles them.
+  useEffect(() => {
+    const activeSet = new Set(activeMCPIds)
+    for (const server of clientMCPServers) {
+      const status = connectionStatuses[server.id]?.status
+      if (activeSet.has(server.id) && status !== 'connected' && status !== 'connecting') {
+        mcpConnect(server)
+      } else if (!activeSet.has(server.id) && (status === 'connected' || status === 'connecting')) {
+        mcpDisconnect(server.id)
+      }
+    }
+  }, [activeMCPIds.join(','), clientMCPServers, connectionStatuses, mcpConnect, mcpDisconnect])
+
+  const handleClientMCPToggle = useCallback((serverId) => {
+    setActiveMCPIds(prev => prev.includes(serverId) ? prev.filter(s => s !== serverId) : [...prev, serverId])
+  }, [])
+  const handleClientMCPServerAdded = useCallback((server) => {
+    setClientMCPServers(loadClientMCPServers())
+    setActiveMCPIds(prev => prev.includes(server.id) ? prev : [...prev, server.id])
+  }, [])
+  const handleClientMCPServerRemoved = useCallback(async (id) => {
+    await mcpDisconnect(id)
+    setClientMCPServers(loadClientMCPServers())
+    setActiveMCPIds(prev => prev.filter(s => s !== id))
+  }, [mcpDisconnect])
+
  const selectedModelInfo = pipelineModels.find(m => m.name === selectedModel)

  // ── Status helper ──
@@ -96,7 +147,9 @@ export default function Talk() {
  const sendSessionUpdate = useCallback(() => {
    const dc = dcRef.current
    if (!dc || dc.readyState !== 'open') return
-    if (!instructions.trim() && !voice.trim() && !language.trim()) return
+
+    const tools = getToolsForLLM()
+    if (!instructions.trim() && !voice.trim() && !language.trim() && tools.length === 0) return

    const session = {}
    if (instructions.trim()) session.instructions = instructions.trim()
@@ -105,9 +158,57 @@ export default function Talk() {
      if (voice.trim()) session.audio.output = { voice: voice.trim() }
      if (language.trim()) session.audio.input = { transcription: { language: language.trim() } }
    }
+    // Pass MCP-server-advertised tools straight through. Server-side they
+    // get rendered into the model's prompt via the function:/argument_regex
+    // pair on the model config (gallery/lfm.yaml for LFM2.5-Audio).
+    if (tools.length > 0) session.tools = tools

    dc.send(JSON.stringify({ type: 'session.update', session }))
-  }, [instructions, voice, language])
+  }, [instructions, voice, language, getToolsForLLM])
+
+  // Re-send session.update whenever the tool set changes mid-session so the
+  // model sees newly-toggled MCP servers without a reconnect.
+  useEffect(() => {
+    if (isConnected) sendSessionUpdate()
+    // eslint-disable-next-line react-hooks/exhaustive-deps
+  }, [activeMCPIds.join(',')])
+
+  // ── Function-call dispatcher ──
+  // Mirrors the chat-page agentic loop: collect args from the model's
+  // function_call_arguments.done event, hand them to the MCP client's
+  // executeTool, then echo the result back via conversation.item.create +
+  // response.create so the model can complete its turn with the tool output.
+  const handleFunctionCall = useCallback(async (event) => {
+    const dc = dcRef.current
+    if (!dc || dc.readyState !== 'open') return
+    const { call_id: callId, name, arguments: argsJson } = event
+    if (!callId || !name) return
+    if (!isClientTool(name)) {
+      // No MCP server advertises this tool — let the model know so it can
+      // recover instead of hanging.
+      dc.send(JSON.stringify({
+        type: 'conversation.item.create',
+        item: { type: 'function_call_output', call_id: callId, output: `Error: unknown tool "${name}"` },
+      }))
+      dc.send(JSON.stringify({ type: 'response.create' }))
+      return
+    }
+    updateStatus('thinking', `Running tool ${name}...`)
+    try {
+      const result = await executeTool(name, argsJson)
+      dc.send(JSON.stringify({
+        type: 'conversation.item.create',
+        item: { type: 'function_call_output', call_id: callId, output: typeof result === 'string' ? result : JSON.stringify(result) },
+      }))
+      dc.send(JSON.stringify({ type: 'response.create' }))
+    } catch (err) {
+      dc.send(JSON.stringify({
+        type: 'conversation.item.create',
+        item: { type: 'function_call_output', call_id: callId, output: `Error: ${err?.message || err}` },
+      }))
+      dc.send(JSON.stringify({ type: 'response.create' }))
+    }
+  }, [executeTool, isClientTool, updateStatus])

  // ── Server event handler ──
  const handleServerEvent = useCallback((event) => {
@@ -163,6 +264,32 @@ export default function Talk() {
      case 'response.output_audio.delta':
        updateStatus('speaking', 'Speaking...')
        break
+      case 'response.output_item.done': {
+        // Server-executed tools (Manage Mode) surface as output items —
+        // FunctionCall when the model invokes a tool, FunctionCallOutput
+        // once the server has run it. Render both on `done` so we get
+        // each transcript entry exactly once.
+        const item = event.item
+        if (!item) break
+        if (item.FunctionCall) {
+          setTranscript(prev => [...prev, {
+            role: 'tool_call',
+            text: `${item.FunctionCall.name}(${item.FunctionCall.arguments || ''})`,
+          }])
+        } else if (item.FunctionCallOutput) {
+          let preview = item.FunctionCallOutput.output || ''
+          // Pretty-print JSON for readability; fall back to raw string.
+          try { preview = JSON.stringify(JSON.parse(preview), null, 2) } catch (_) { /* keep raw */ }
+          setTranscript(prev => [...prev, { role: 'tool_result', text: preview }])
+          streamingRef.current = null  // tool result ends the current assistant text run
+        }
+        break
+      }
+      case 'response.function_call_arguments.done':
+        // Don't await — keep the event loop free; handleFunctionCall sends
+        // conversation.item.create + response.create when it's done.
+        handleFunctionCall(event)
+        break
      case 'response.done':
        updateStatus('listening', 'Listening...')
        break
@@ -171,12 +298,12 @@ export default function Talk() {
        updateStatus('error', 'Error: ' + (event.error?.message || 'Unknown error'))
        break
    }
-  }, [sendSessionUpdate, updateStatus])
+  }, [sendSessionUpdate, updateStatus, handleFunctionCall])

  // ── Connect ──
  const connect = useCallback(async () => {
    if (!selectedModel) {
-      addToast('Please select a pipeline model first.', 'warning')
+      addToast('Please select a realtime model first.', 'warning')
      return
    }
    if (!navigator.mediaDevices?.getUserMedia) {
@@ -237,6 +364,7 @@ export default function Talk() {
      const data = await realtimeApi.call({
        sdp: pc.localDescription.sdp,
        model: selectedModel,
+        localai_assistant: manageMode,
      })

      await pc.setRemoteDescription({ type: 'answer', sdp: data.sdp })
@@ -245,7 +373,7 @@ export default function Talk() {
      updateStatus('error', 'Connection failed: ' + err.message)
      disconnect()
    }
-  }, [selectedModel, diagVisible, handleServerEvent, updateStatus, addToast])
+  }, [selectedModel, manageMode, diagVisible, handleServerEvent, updateStatus, addToast])

  // ── Disconnect ──
  const disconnect = useCallback(() => {
@@ -508,8 +636,58 @@ export default function Talk() {
            </button>
          </div>

+          {/* Tools (client-side MCP servers, mirroring the chat page) */}
+          <div style={{ marginBottom: 'var(--spacing-md)' }}>
+            <label className="form-label" style={{ fontSize: '0.8125rem' }}>
+              <i className="fas fa-screwdriver-wrench" style={{ color: 'var(--color-primary)', marginRight: 4 }} /> Tools
+            </label>
+            <ClientMCPDropdown
+              activeServerIds={activeMCPIds}
+              onToggleServer={handleClientMCPToggle}
+              onServerAdded={handleClientMCPServerAdded}
+              onServerRemoved={handleClientMCPServerRemoved}
+              connectionStatuses={connectionStatuses}
+              getConnectedTools={getConnectedTools}
+            />
+            {isAdmin && (
+              <label style={{
+                display: 'flex', alignItems: 'center', gap: 'var(--spacing-xs)',
+                marginTop: 'var(--spacing-xs)', fontSize: '0.8125rem',
+                cursor: isConnected ? 'not-allowed' : 'pointer',
+                color: isConnected ? 'var(--color-text-secondary)' : 'var(--color-text)',
+              }}>
+                <input
+                  type="checkbox"
+                  checked={manageMode}
+                  disabled={isConnected}
+                  onChange={(e) => setManageMode(e.target.checked)}
+                />
+                <i className="fas fa-user-shield" style={{ color: 'var(--color-primary)' }} />
+                Manage Mode
+                <span style={{ color: 'var(--color-text-secondary)', fontSize: '0.75rem' }}>
+                  — let the model query LocalAI (models, backends, system info)
+                </span>
+              </label>
+            )}
+          </div>
+
          {/* Pipeline details */}
-          {selectedModelInfo && (
+          {selectedModelInfo && selectedModelInfo.self_contained && (
+            <div style={{
+              background: 'var(--color-bg-secondary)', borderRadius: 'var(--radius-sm)',
+              padding: 'var(--spacing-xs) var(--spacing-sm)', border: '1px solid var(--color-border)',
+              marginBottom: 'var(--spacing-xs)', fontSize: '0.75rem',
+              display: 'flex', alignItems: 'center', gap: 'var(--spacing-xs)',
+            }}>
+              <i className="fas fa-tower-broadcast" style={{ color: 'var(--color-primary)' }} />
+              <span style={{ color: 'var(--color-text-secondary)' }}>Self-contained any-to-any —</span>
+              <span style={{ fontFamily: 'var(--font-mono)', overflow: 'hidden', textOverflow: 'ellipsis', whiteSpace: 'nowrap' }}>
+                {selectedModelInfo.name}
+              </span>
+              <span style={{ color: 'var(--color-text-secondary)', marginLeft: 'auto' }}>handles VAD · STT · LLM · TTS</span>
+            </div>
+          )}
+          {selectedModelInfo && !selectedModelInfo.self_contained && (
            <div style={{
              display: 'grid', gridTemplateColumns: 'repeat(4, 1fr)', gap: 'var(--spacing-xs)',
              marginBottom: 'var(--spacing-xs)', fontSize: '0.75rem',
@@ -533,7 +711,8 @@ export default function Talk() {
          {selectedModelInfo && !isConnected && (
            <div style={{ marginBottom: 'var(--spacing-md)' }}>
              <button className="btn btn-secondary btn-sm" onClick={() => navigate(`/app/model-editor/${encodeURIComponent(selectedModel)}`)}>
-                <i className="fas fa-pen-to-square" style={{ marginRight: 'var(--spacing-xs)' }} /> Edit Pipeline
+                <i className="fas fa-pen-to-square" style={{ marginRight: 'var(--spacing-xs)' }} />
+                {selectedModelInfo.self_contained ? ' Edit Model Config' : ' Edit Pipeline'}
              </button>
            </div>
          )}
@@ -600,16 +779,28 @@ export default function Talk() {
                Conversation will appear here...
              </p>
            )}
-            {transcript.map((entry, i) => (
-              <div key={i} style={{ display: 'flex', alignItems: 'flex-start', gap: 'var(--spacing-xs)' }}>
-                <i className={entry.role === 'user' ? 'fa-solid fa-user' : 'fa-solid fa-robot'}
-                  style={{
-                    color: entry.role === 'user' ? 'var(--color-primary)' : 'var(--color-accent)',
-                    marginTop: 3, flexShrink: 0, fontSize: '0.75rem',
-                  }} />
-                <p style={{ margin: 0 }}>{entry.text}</p>
-              </div>
-            ))}
+            {transcript.map((entry, i) => {
+              const isToolCall = entry.role === 'tool_call'
+              const isToolResult = entry.role === 'tool_result'
+              const isUser = entry.role === 'user'
+              const iconClass = isToolCall ? 'fa-solid fa-screwdriver-wrench'
+                              : isToolResult ? 'fa-solid fa-clipboard-list'
+                              : isUser ? 'fa-solid fa-user' : 'fa-solid fa-robot'
+              const iconColor = isToolCall || isToolResult ? 'var(--color-text-secondary)'
+                              : isUser ? 'var(--color-primary)' : 'var(--color-accent)'
+              return (
+                <div key={i} style={{ display: 'flex', alignItems: 'flex-start', gap: 'var(--spacing-xs)' }}>
+                  <i className={iconClass} style={{ color: iconColor, marginTop: 3, flexShrink: 0, fontSize: '0.75rem' }} />
+                  <p style={{
+                    margin: 0,
+                    fontFamily: (isToolCall || isToolResult) ? 'var(--font-mono)' : undefined,
+                    fontSize: (isToolCall || isToolResult) ? '0.8125rem' : undefined,
+                    color: (isToolCall || isToolResult) ? 'var(--color-text-secondary)' : undefined,
+                    whiteSpace: isToolResult ? 'pre-wrap' : undefined,
+                  }}>{entry.text}</p>
+                </div>
+              )
+            })}
            <div ref={transcriptEndRef} />
          </div>

--- a/core/http/react-ui/src/utils/capabilities.js
+++ b/core/http/react-ui/src/utils/capabilities.js
@@ -20,3 +20,4 @@ export const CAP_DETECTION = 'FLAG_DETECTION'
 export const CAP_FACE_RECOGNITION = 'FLAG_FACE_RECOGNITION'
 export const CAP_SPEAKER_RECOGNITION = 'FLAG_SPEAKER_RECOGNITION'
 export const CAP_AUDIO_TRANSFORM = 'FLAG_AUDIO_TRANSFORM'
+export const CAP_REALTIME_AUDIO = 'FLAG_REALTIME_AUDIO'
--- a/core/http/routes/ui.go
+++ b/core/http/routes/ui.go
@@ -18,7 +18,11 @@ func RegisterUIRoutes(app *echo.Echo,
 	// SPA routes are handled by the 404 fallback in app.go which serves
 	// index.html for any unmatched HTML request, enabling client-side routing.

-	// Pipeline models API (for the Talk page WebRTC interface)
+	// Pipeline models API (for the Talk page WebRTC interface).
+	// A model qualifies when it either declares an explicit VAD+STT+LLM+TTS
+	// pipeline (legacy/composed) or carries the realtime_audio usecase (a
+	// self-contained any-to-any audio backend like liquid-audio that owns the
+	// full loop in a single AudioToAudioStream RPC).
 	app.GET("/api/pipeline-models", func(c echo.Context) error {
 		type pipelineModelInfo struct {
 			Name          string `json:"name"`
@@ -27,9 +31,17 @@ func RegisterUIRoutes(app *echo.Echo,
 			LLM           string `json:"llm"`
 			TTS           string `json:"tts"`
 			Voice         string `json:"voice"`
+			// SelfContained is true for any-to-any audio models — the four
+			// pipeline slots are populated with the model's own name so the
+			// UI can render them, but the Realtime API routes the session
+			// directly to the backend's AudioToAudioStream RPC.
+			SelfContained bool `json:"self_contained,omitempty"`
 		}

 		pipelineModels := cl.GetModelConfigsByFilter(func(_ string, cfg *config.ModelConfig) bool {
+			if cfg.HasUsecases(config.FLAG_REALTIME_AUDIO) {
+				return true
+			}
 			p := cfg.Pipeline
 			return p.VAD != "" && p.Transcription != "" && p.LLM != "" && p.TTS != ""
 		})
@@ -38,8 +50,20 @@ func RegisterUIRoutes(app *echo.Echo,
 			return cmp.Compare(a.Name, b.Name)
 		})

-		var models []pipelineModelInfo
+		models := make([]pipelineModelInfo, 0, len(pipelineModels))
 		for _, cfg := range pipelineModels {
+			if cfg.HasUsecases(config.FLAG_REALTIME_AUDIO) {
+				models = append(models, pipelineModelInfo{
+					Name:          cfg.Name,
+					VAD:           cfg.Name,
+					Transcription: cfg.Name,
+					LLM:           cfg.Name,
+					TTS:           cfg.Name,
+					Voice:         cfg.TTSConfig.Voice,
+					SelfContained: true,
+				})
+				continue
+			}
 			models = append(models, pipelineModelInfo{
 				Name:          cfg.Name,
 				VAD:           cfg.Pipeline.VAD,
--- a/core/http/routes/ui_api.go
+++ b/core/http/routes/ui_api.go
@@ -54,6 +54,7 @@ var usecaseFilters = map[string]config.ModelConfigUsecase{
 	config.UsecaseVAD:             config.FLAG_VAD,
 	config.UsecaseAudioTransform:  config.FLAG_AUDIO_TRANSFORM,
 	config.UsecaseDiarization:     config.FLAG_DIARIZATION,
+	config.UsecaseRealtimeAudio:   config.FLAG_REALTIME_AUDIO,
 }


--- a/core/http/routes/ui_pipeline_models_test.go
+++ b/core/http/routes/ui_pipeline_models_test.go
@@ -0,0 +1,153 @@
+package routes_test
+
+import (
+	"encoding/json"
+	"io"
+	"net/http"
+	"net/http/httptest"
+	"os"
+	"path/filepath"
+
+	"github.com/labstack/echo/v4"
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/http/routes"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("Pipeline models API", func() {
+	var (
+		app          *echo.Echo
+		tempDir      string
+		configLoader *config.ModelConfigLoader
+	)
+
+	BeforeEach(func() {
+		var err error
+		tempDir, err = os.MkdirTemp("", "pipeline-models-test-*")
+		Expect(err).NotTo(HaveOccurred())
+
+		configLoader = config.NewModelConfigLoader(tempDir)
+	})
+
+	AfterEach(func() {
+		Expect(os.RemoveAll(tempDir)).To(Succeed())
+	})
+
+	writeConfig := func(name, body string) {
+		path := filepath.Join(tempDir, name+".yaml")
+		Expect(os.WriteFile(path, []byte(body), 0o644)).To(Succeed())
+	}
+
+	queryPipelineModels := func() []map[string]any {
+		Expect(configLoader.LoadModelConfigsFromPath(tempDir)).To(Succeed())
+
+		app = echo.New()
+		routes.RegisterUIRoutes(app, configLoader, nil, nil, func(next echo.HandlerFunc) echo.HandlerFunc { return next })
+
+		req := httptest.NewRequest(http.MethodGet, "/api/pipeline-models", nil)
+		rec := httptest.NewRecorder()
+		app.ServeHTTP(rec, req)
+		Expect(rec.Code).To(Equal(http.StatusOK))
+		body, err := io.ReadAll(rec.Body)
+		Expect(err).NotTo(HaveOccurred())
+
+		var got []map[string]any
+		Expect(json.Unmarshal(body, &got)).To(Succeed())
+		return got
+	}
+
+	It("returns models with an explicit VAD/STT/LLM/TTS pipeline", func() {
+		writeConfig("legacy-pipeline", `
+name: legacy-pipeline
+backend: llama-cpp
+pipeline:
+  vad: silero
+  transcription: whisper
+  llm: llama
+  tts: piper
+tts:
+  voice: en-amy
+`)
+		// A model with a partial pipeline must not appear.
+		writeConfig("half-pipeline", `
+name: half-pipeline
+backend: llama-cpp
+pipeline:
+  vad: silero
+  transcription: whisper
+`)
+
+		models := queryPipelineModels()
+		Expect(models).To(HaveLen(1))
+		Expect(models[0]["name"]).To(Equal("legacy-pipeline"))
+		Expect(models[0]["vad"]).To(Equal("silero"))
+		Expect(models[0]["llm"]).To(Equal("llama"))
+		Expect(models[0]["voice"]).To(Equal("en-amy"))
+		// self_contained is omitempty — absent for legacy pipelines.
+		_, hasFlag := models[0]["self_contained"]
+		Expect(hasFlag).To(BeFalse())
+	})
+
+	It("surfaces self-contained any-to-any models tagged with realtime_audio", func() {
+		writeConfig("lfm-realtime", `
+name: lfm-realtime
+backend: liquid-audio
+known_usecases:
+  - realtime_audio
+  - chat
+  - tts
+  - transcript
+tts:
+  voice: us_female
+`)
+
+		models := queryPipelineModels()
+		Expect(models).To(HaveLen(1))
+		Expect(models[0]["name"]).To(Equal("lfm-realtime"))
+		// All four pipeline slots are populated with the model's own name so
+		// the Talk page UI has something to render.
+		Expect(models[0]["vad"]).To(Equal("lfm-realtime"))
+		Expect(models[0]["transcription"]).To(Equal("lfm-realtime"))
+		Expect(models[0]["llm"]).To(Equal("lfm-realtime"))
+		Expect(models[0]["tts"]).To(Equal("lfm-realtime"))
+		Expect(models[0]["voice"]).To(Equal("us_female"))
+		Expect(models[0]["self_contained"]).To(BeTrue())
+	})
+
+	It("includes both legacy and self-contained models in the same response", func() {
+		writeConfig("legacy", `
+name: legacy
+backend: llama-cpp
+pipeline:
+  vad: silero
+  transcription: whisper
+  llm: llama
+  tts: piper
+`)
+		writeConfig("realtime", `
+name: realtime
+backend: liquid-audio
+known_usecases:
+  - realtime_audio
+`)
+
+		models := queryPipelineModels()
+		Expect(models).To(HaveLen(2))
+		// Sorted by name → legacy, realtime.
+		Expect(models[0]["name"]).To(Equal("legacy"))
+		Expect(models[1]["name"]).To(Equal("realtime"))
+		Expect(models[1]["self_contained"]).To(BeTrue())
+	})
+
+	It("excludes models that have neither a pipeline nor realtime_audio", func() {
+		writeConfig("plain-chat", `
+name: plain-chat
+backend: llama-cpp
+known_usecases:
+  - chat
+`)
+
+		Expect(queryPipelineModels()).To(BeEmpty())
+	})
+})