fix(distributed): make admin backend installs resilient and observable (#9958)

* feat(distributed): add configurable NATS backend install/upgrade timeouts

Adds BackendInstallTimeout and BackendUpgradeTimeout to DistributedConfig with 15m defaults, following the existing MCPToolTimeout / WorkerWaitTimeout pattern. These will replace the hardcoded literals in RemoteUnloaderAdapter so admin-driven backend installs across the cluster survive long OCI image pulls that previously timed out at 3m.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(distributed): gofmt alignment after timeout fields

Re-aligns the Validate() negative-duration map and the Default* const block so the new BackendInstall/UpgradeTimeout entries do not leave the surrounding columns mis-padded.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(cli): surface LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT and _UPGRADE_TIMEOUT

Parses the two new env vars on the run CLI and threads them through the existing AppOption builder so DistributedConfig picks them up. Invalid duration strings now fail loudly at startup rather than silently falling back to the default.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): inject NATS install/upgrade timeouts into RemoteUnloaderAdapter

Removes the hardcoded 3m / 15m literals from RemoteUnloaderAdapter and threads in DistributedConfig.BackendInstallTimeoutOrDefault() and BackendUpgradeTimeoutOrDefault() at construction. Install now defaults to 15m (was 3m); cold OCI image pulls on Jetson Wi-Fi routinely blew past the old ceiling. Scripted messaging client captures the timeout so tests can assert the configured value actually reaches the NATS request.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): introduce galleryop.ErrWorkerStillInstalling sentinel

When the NATS request-reply for backend.install (or .upgrade) times out the worker is almost always still pulling the OCI image. Wrap the timeout in a typed sentinel so the manager above can distinguish "worker hung" from "worker still working" and leave the pending_backend_ops row in place for the reconciler to confirm via backend.list.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): treat NATS install timeout as in-progress, not failure

When a worker times out replying to backend.install but the install is still running on the worker, enqueueAndDrainBackendOp now reports a running_on_worker status and pushes NextRetryAt out by the install timeout so the reconciler does not immediately re-fire another install while the worker is still pulling the image. The pending_backend_ops row stays in place for the next reconciler pass to confirm via backend.list.

InstallBackend wraps the result in galleryop.ErrWorkerStillInstalling so callers can branch (galleryop renders yellow in-progress instead of red error). UpgradeBackend uses the same wrap.

Adds RemoteUnloaderAdapter.InstallTimeout() so the manager can push NextRetryAt by the configured timeout without reaching into a private field, and NodeRegistry.RecordPendingBackendOpInFlight as the soft cousin of RecordPendingBackendOpFailure.

Also includes incidental gofmt-driven struct-field alignment in registry.go on lines unrelated to the change (touched files are re-formatted to canonical form per project policy).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): don't increment Attempts on in-flight install timeout

An in-flight timeout (worker still pulling the OCI image) is not a failed attempt, it's a delayed one. Incrementing Attempts let genuinely-progressing slow installs (e.g. 30 GB CUDA images on Wi-Fi) trip the reconciler's maxPendingBackendOpAttempts cap and dead-letter the queue row while the worker was still legitimately working. RecordPendingBackendOpInFlight now only updates LastError and NextRetryAt.

Also documents "running_on_worker" in the NodeOpStatus.Status enum comment so Task 6 implementers see the full surface.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): surface ErrWorkerStillInstalling as non-error OpStatus

When the distributed backend manager returns an error that wraps ErrWorkerStillInstalling, backendHandler now completes the op with a "still installing in background" message rather than marking it as a red failure. Admin UI sees a yellow in-progress state; reconciler confirms completion on its next pass.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): end-to-end install-timeout-then-reconcile

Wires Task 1-6 end-to-end so any seam mismatch surfaces in CI rather than during a real cluster install. NATS times out, the queue row stays alive with running_on_worker status, the worker eventually reports the backend installed via backend.list, the manager surfaces it via ListBackends.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / _UPGRADE_TIMEOUT

Add the two new operator-tunable env vars to the Frontend Configuration table in the distributed-mode docs. Explains the 15m default, when to raise it (slow links pulling multi-GB OCI images), and the new "still installing in background" admin-UI state when the round-trip times out but the worker is still working.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): clear pending install rows when backend.list confirms

DistributedBackendManager.ListBackends now proactively clears pending_backend_ops install rows whose (nodeID, backend) is reported installed by backend.list. Operator UI updates immediately instead of waiting up to installTimeout (default 15m) for the next reconciler tick after NextRetryAt.

Only install rows are cleared; upgrade and delete intents are not satisfied by presence in backend.list and continue to drain through their normal reconciler paths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(messaging): add BackendInstallProgressEvent wire type and subject

New NATS subject nodes.<nodeID>.backend.install.<opID>.progress lets the worker publish transient progress events (file, current/total bytes, percentage, phase) while a long-running install pulls its OCI image. BackendInstallRequest gains an optional OpID field so the worker knows which subject to publish on.

Transient pub/sub (not JetStream): the install reply remains ground truth for success/failure; dropped progress events are tolerable.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(messaging): drop em-dash from BackendInstallProgress test comment

Per project convention (no em-dashes anywhere). Comment substance is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): worker publishes debounced install progress over NATS

When BackendInstallRequest.OpID is set, the worker's backend.install handler wires a debounced publisher (250ms window) into the gallery download callback. Each tick becomes a BackendInstallProgressEvent on nodes.<nodeID>.backend.install.<opID>.progress; the publisher always emits a final event on Flush so the UI sees the terminal percentage.

Old masters that do not set OpID continue to run silent installs: no behavior change for them.

Lock ordering: the publisher releases its mutex before calling messaging.Publish so a slow network never stalls the install loop.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): RemoteUnloaderAdapter subscribes to install progress

InstallBackend gains opID + onProgress parameters. When both are set, the adapter subscribes to nodes.<nodeID>.backend.install.<opID>.progress BEFORE publishing the install request, decodes each message into the caller's onProgress callback in a goroutine (so a slow callback never stalls the NATS reader thread), and unsubscribes after RequestJSON returns.

When onProgress is nil OR opID is empty (the reconciler retry path), subscription is skipped entirely - silent installs cost nothing extra.

Subscribe failure is logged at Warn and the install proceeds without progress streaming; the NATS round-trip still owns terminal status.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): forward backend install progress into galleryop OpStatus

DistributedBackendManager.InstallBackend now passes the gallery op ID and a progress bridge into the adapter call. Each BackendInstallProgressEvent from the worker becomes a galleryop.ProgressCallback tick - which the existing backendHandler already turns into OpStatus.UpdateStatus, so the admin UI/SSE polling sees per-byte progress for distributed installs without any UI-side change.

UpgradeBackend is intentionally left silent for now: its wire request (BackendUpgradeRequest) does not carry OpID, and rolling-update fallback is the rarer path. Will be picked up in a follow-up if the worker upgrade path also gets a progress channel.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): InstallBackend tolerates silent (pre-Phase-2) workers

A worker on pre-Phase-2 code never publishes progress events. The new master subscribes optimistically; this spec pins that a silent worker still produces a green install with no progressCb ticks. The install reply is the source of truth for terminal state; the progress stream is a best-effort UX enrichment.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document install progress streaming

Note the new nodes.<nodeID>.backend.install.<opID>.progress subject and the silent-worker compatibility behavior so operators know to expect real-time progress and what happens on a mixed-version cluster.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): note progress-event ordering trade-off in InstallBackend

Document near the goroutine dispatch why ordering at the consumer is best-effort, why it rarely matters in practice (worker debounce >> goroutine jitter), and what a future hardening pass would look like (Seq field + stale-by-seq drop). Stops the next reader from accidentally "fixing" the goroutine pool away.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): add NodeProgress + OpStatus.Nodes for per-node breakdown

Adds the data model the UI needs to render an expandable per-node breakdown of a fanned-out backend install. NodeProgress carries node identity (ID + name), per-node status (queued / running_on_worker / success / error / downloading), the current file + bytes + percentage from the Phase 2 progress stream, and any per-node error. OpStatus.Nodes is the slice the /api/operations handler will surface in a follow-up.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): UpdateNodeProgress merges per-node ticks by NodeID

GalleryService.UpdateNodeProgress(opID, nodeID, np) merges a NodeProgress into OpStatus.Nodes (keyed by NodeID, no duplicates) and mirrors the latest tick into the aggregate Progress / FileName / DownloadedFileSize / TotalFileSize fields so the legacy single-bar OperationsBar view keeps working unchanged alongside the new per-node breakdown. Concurrent-safe via the existing g.Mutex.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): write per-node OpStatus entries during install fan-out

DistributedBackendManager now accepts a nodeProgressSink and feeds it two streams:

1. enqueueAndDrainBackendOp emits a per-node terminal entry on each status it appends to BackendOpResult (queued, success, error, running_on_worker). The opID is threaded through the function so the sink gets the right gallery op identity.
2. The install apply closure fans each BackendInstallProgressEvent into the sink as a downloading entry, alongside the legacy progressCb path so the aggregate single-bar view stays correct.

Production wiring passes the GalleryService (which implements UpdateNodeProgress via Task 2) as the sink. Single-node tests pass nil. DeleteBackend and UpgradeBackend pass an empty opID so the sink path no-ops for ops that aren't gallery-tracked the same way as Install.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(operations): expose per-node breakdown on /api/operations

When an operation's OpStatus has Nodes entries (populated by the Phase 4 progress sink wiring), surface them as a "nodes" array on the /api/operations response, sorted by node_name for stable rendering.

Backward compatible: legacy clients ignore the field; ops without any node entries (single-node mode, model installs) omit the array entirely thanks to the empty-slice guard.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): per-node breakdown in OperationsBar

When an install op fans out to more than one worker, the operations bar now shows a "N nodes" chevron that expands into a per-node list. Each row carries the node's status (color-coded pill), the current file being downloaded, byte counts, percentage, and a thin per-node progress bar. Yellow "Worker busy" pill marks running_on_worker status with a tooltip explaining the NATS round-trip timed out but the worker is still installing in the background.

Backward compatible: ops without a nodes field (legacy or single-node mode) render as before. State for expand/collapse is local to the component, keyed by jobID/id - reload starts collapsed.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document per-node breakdown in the operations bar

Adds a short subsection covering the expandable "N nodes" chevron in the OperationsBar admin UI, the meaning of each status pill, and how it relates to the /api/operations nodes array.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(galleryop): UpdateStatus preserves Nodes when caller sends none

Real-world bug surfaced by the Phase 4 multi-worker smoke test: the nodes[] array in /api/operations flickered between a single node at a time on a 2-worker install.

Root cause: the Phase 2 progress bridge also calls the legacy progressCb -> UpdateStatus(&OpStatus{...}) on every tick. UpdateStatus then overwrote the entire status pointer, wiping the Nodes slice that UpdateNodeProgress had just merged in.

Fix: in UpdateStatus, if the incoming op has an empty Nodes slice, carry forward the previous status's Nodes before storing. Callers that explicitly populate Nodes still win (their slice replaces the prior one, no merge across the two code paths).

Two regression specs added pinning both directions of the contract.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): strip implementation details from user-facing docs

Trim the new install/upgrade timeout rows and the install-progress sections to focus on what the operator sees and tunes. Drops:

- the NATS subject names and pub/sub mechanics
- "round-trip" / reconciler / backend.list jargon
- /api/operations polling cadence
- "pre-2026-05-22" version references

Reframes the breakdown text around the admin UI (Operations Bar, chevron, status pills, "Worker busy" tooltip). Implementation context lives in the agent notes and code comments.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(config): move DistributedConfig.Validate flag names to constants

The negative-duration check map was a wall of literal kebab-case strings that had to stay in sync with the kong-derived CLI flag names manually. Move them to a Flag* const block alongside the existing Default* block so a rename of either the Go field or the CLI naming convention forces a compile error rather than silent drift.

Sole consumer today is Validate; the constants are exported so future operator-facing surfaces (e.g. error messages on other validation paths) can reference them by name instead of repeating the literals.

Tests pin both the literal values (so a future "let's just rename this" doesn't accidentally regress the CLI flag) and the negative-duration error message for the new BackendInstall / BackendUpgrade fields.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(distributed): extract NodeStatus and Phase enums to constants

Sweep for the same literal-string-as-identifier pattern called out on the Validate flag names: the per-node install status enum ("queued" | "downloading" | "running_on_worker" | "success" | "error") appeared as raw literals across managers_distributed.go (10+ sites, including 3 separate `n.Status == "running_on_worker"` checks), operation.go, and the test suite. Same shape for the Phase enum ("resolving" | "downloading" | "extracting" | "starting") in the worker-side progress publisher.

Promote both to exported const blocks:

- galleryop.NodeStatus{Queued,Downloading,RunningOnWorker,Success,Error} shared between galleryop.NodeProgress.Status (the wire field) and nodes.NodeOpStatus.Status (the in-process per-node summary)
- messaging.Phase{Resolving,Downloading,Extracting,Starting} shared between the worker publisher and any future consumer that needs to switch on phase

Tests pin both the literal values (so a future "let's just rename" doesn't silently change the JSON wire) and use the constants in setup (so the producer side stays drift-protected). Wire-format assertions on the /api/operations JSON output keep their literals deliberately, so the constant value can never silently diverge from what the UI receives.

Out of scope for this PR (separate cleanup): the finetune and quantization job-status enums have the same anti-pattern with 14+ literal sites each, but predate this PR's work.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
chore: ⬆️ Update ggml-org/llama.cpp to 1acee6bf8939948f9bcbf4b14034e4b475f06069 (#9952)
2026-05-23 08:10:48 -04:00 · 2026-05-23 12:35:44 +02:00 · 2026-05-23 08:38:29 +02:00 · 2026-05-23 08:37:26 +02:00 · 2026-05-23 08:37:10 +02:00 · 2026-05-23 00:20:28 +02:00
244 changed files with 13581 additions and 1509 deletions
--- a/.agents/adding-backends.md
+++ b/.agents/adding-backends.md
@@ -112,6 +112,8 @@ Add a YAML anchor definition in the `## metas` section (around line 2-300). Look
 Add image entries at the end of the file, following the pattern of similar backends such as `diffusers` or `chatterbox`. Include both `latest` (production) and `master` (development) tags.
 **Note on integrity:** OCI backends installed from a gallery whose `verification:` block is set are verified against a keyless-cosign policy before extraction; tarball/HTTP backends use the optional `sha256:` field. New backends do not need any extra YAML — the gallery-level `verification:` block covers every entry. See [.agents/backend-signing.md](backend-signing.md) for the producer-side CI step.
 ## 4. Update the Makefile
 The Makefile needs to be updated in several places to support building and testing the new backend:
--- a/.agents/api-endpoints-and-auth.md
+++ b/.agents/api-endpoints-and-auth.md
@@ -284,7 +284,17 @@ Also bump the expected-length count in `api_instructions_test.go` and add the na
 ### 3. `capabilities.js` symbol (for new model-config FLAG_* flags)
-If your feature needs a new `FLAG_*` usecase flag in `core/config/model_config.go` (so users can filter gallery models by it, and so `/v1/models` surfaces it), also declare the matching symbol in `core/http/react-ui/src/utils/capabilities.js`:
+If your feature needs a new `FLAG_*` usecase flag in `core/config/model_config.go` (so users can filter gallery models by it, and so `/v1/models` surfaces it), you need to update **all** of:
 - `Usecase<Name>` string constant in `core/config/backend_capabilities.go`
 - `UsecaseInfoMap` entry mapping the string to its flag + gRPC method
 - `FLAG_<NAME>` bitmask in `core/config/model_config.go`
 - `GetAllModelConfigUsecases()` map entry (otherwise the YAML loader silently ignores the string)
 - `ModalityGroups` membership if the flag should affect `IsMultimodal()` (e.g. realtime_audio is in both speech-input and audio-output groups so a lone flag still reads as multimodal)
 - `GuessUsecases()` branch listing the backends that own this capability
 - `usecaseFilters` in `core/http/routes/ui_api.go` (drives the gallery filter dropdown)
 - `Models.jsx` `FILTERS` array + matching `filters.<camelCase>` i18n key in `core/http/react-ui/public/locales/en/models.json`
 - `core/http/react-ui/src/utils/capabilities.js`:
 ```js
 export const CAP_MY_CAPABILITY = 'FLAG_MY_CAPABILITY'
--- a/.agents/backend-signing.md
+++ b/.agents/backend-signing.md
@@ -0,0 +1,120 @@
 # Backend image signing & verification
 LocalAI verifies backend OCI images against a per-gallery keyless-cosign
 policy. This page documents the trust model, the producer side
 (`.github/workflows/backend_merge.yml` in this repo), and the consumer
 side (`pkg/oci/cosignverify` plus the gallery YAML).
 ## Trust model
 - **Producer:** `.github/workflows/backend_merge.yml` signs each pushed
  manifest list with `cosign sign --recursive` in keyless mode after
  `docker buildx imagetools create`. The signing cert is issued by
  Fulcio bound to the workflow's OIDC identity. There is no long-lived
  signing key. `--recursive` signs both the manifest list and every
  per-arch entry — needed because our consumer resolves a tag to a
  per-arch manifest before checking signatures.
 - **Storage:** Signatures are written as OCI 1.1 referrers
  (`--registry-referrers-mode=oci-1-1`) in the new Sigstore bundle format
  (current cosign releases do this by default; no `--new-bundle-format`
  flag). No `:sha256-<hex>.sig` tag clutter.
 - **Consumer:** `pkg/oci/cosignverify` discovers the bundle via the
  referrers API, hands it to `sigstore-go`, and verifies it against the
  policy declared in the gallery YAML (`Gallery.Verification`).
 - **Revocation:** Keyless cosign certs are ephemeral (10-minute Fulcio
  validity), so revocation is policy-side, not CA-side. The gallery's
  `verification.not_before` (RFC3339) is the kill-switch — advance it to
  invalidate every signature produced before a known compromise window.
 ## Producer setup
 `backend_merge.yml` is the workflow that joins per-arch digests into the
 multi-arch manifest list users actually pull, so it's also the right place
 to sign. The job needs:
 - `permissions: { id-token: write, contents: read }` at the job level so
  the runner can exchange its GitHub OIDC token for a Fulcio cert.
 - `sigstore/cosign-installer@v3` step (current cosign releases already
  default to the new bundle format).
 - After each `docker buildx imagetools create`, resolve the resulting
  list digest with `docker buildx imagetools inspect <tag> --format
  '{{.Manifest.Digest}}'` and sign:
 ```sh
 cosign sign --yes --recursive \
  --registry-referrers-mode=oci-1-1 \
  "${REGISTRY_REPO}@${DIGEST}"
 ```
 Sign by digest, never by tag — signing by tag binds the signature to
 whatever the tag points at *now*, and a subsequent tag push orphans it.
 `backend_build_darwin.yml` builds and pushes single-arch darwin images
 that bypass the manifest-list merge. If/when those entries get a gallery
 `verification:` policy, the equivalent cosign step has to land there
 too.
 ## Consumer setup (in `mudler/LocalAI` gallery YAML)
 Once CI is signing, add a `verification:` block to the backend gallery
 entry (`backend/index.yaml`):
 ```yaml
 - name: localai
  url: github:mudler/LocalAI/backend/index.yaml@master
  verification:
    issuer: "https://token.actions.githubusercontent.com"
    identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@refs/heads/master$"
    # Optional revocation cutoff; advance during incident response.
    # not_before: "2026-06-01T00:00:00Z"
 ```
 Identity matching pins the OIDC subject Fulcio issued the signing cert
 to. Without this, any image signed by *anyone* with a Fulcio cert would
 pass — the regex is what makes a signature mean "produced by our CI".
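
As a rough illustration of why the identity regex matters, the sketch below applies the issuer and `identity_regex` values from the YAML above to a few candidate certificate identities. It is a simplified stand-in: the real check is performed by `sigstore-go` inside `pkg/oci/cosignverify`, and `identityAllowed` is a hypothetical helper that only mimics the matching rule.

```go
package main

import (
	"fmt"
	"regexp"
)

// Values copied from the gallery YAML example. The YAML double-quoted
// form `\\.` corresponds to `\.` in this Go raw string.
const expectedIssuer = "https://token.actions.githubusercontent.com"

var identityRe = regexp.MustCompile(
	`^https://github\.com/mudler/LocalAI/\.github/workflows/backend_merge\.yml@refs/heads/master$`)

// identityAllowed is a hypothetical helper: it accepts a signing
// certificate only if both the OIDC issuer and the subject match.
func identityAllowed(issuer, subject string) bool {
	return issuer == expectedIssuer && identityRe.MatchString(subject)
}

func main() {
	subjects := []string{
		"https://github.com/mudler/LocalAI/.github/workflows/backend_merge.yml@refs/heads/master",
		"https://github.com/attacker/LocalAI/.github/workflows/backend_merge.yml@refs/heads/master",
		"https://github.com/mudler/LocalAI/.github/workflows/backend_merge.yml@refs/heads/feature",
	}
	for _, s := range subjects {
		fmt.Printf("%v  %s\n", identityAllowed(expectedIssuer, s), s)
	}
}
```

The anchors (`^` and `$`) and the escaped dots are what keep a fork or a non-master ref from matching.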
 ## Strict mode
 Default behaviour: OCI backends without a `verification:` block install
 with a warning (logs include `installing OCI backend without signature
 verification`). Tarball/HTTP backends without a `sha256` field log a
 similar warning.
 For production, set `LOCALAI_REQUIRE_BACKEND_INTEGRITY=1` (or pass
 `--require-backend-integrity` to `local-ai run` / `local-ai backends
 install` / `local-ai models install`). The warning becomes a hard error
 and unverifiable backends refuse to install.
 ## Revocation playbook
 If `backend_merge.yml` (or any workflow with `id-token: write`) is
 compromised and we've shipped malicious signed images:
 1. **Identify the compromise window.** Find the earliest and latest
   IntegratedTime from the bad signatures (Rekor search by `subject` filter).
 2. **Set `verification.not_before`** in `backend/index.yaml` to a
   timestamp just *after* that window's end, so that every bad signature
   predates the cutoff.
 3. **Push the YAML.** Deployed LocalAI instances pick it up on next
   gallery refresh (1-hour cache in `core/gallery/gallery.go`).
 4. **Fix the underlying compromise** in the workflow and re-sign images
   with the new build, which will have IntegratedTime > `not_before`.
 5. **Optional:** for absolute decisiveness, also rotate to a new
   workflow path (`backend_merge_v2.yml`) and update `identity_regex`.
 ## Where the code lives
 - `pkg/oci/cosignverify/` — verifier, policy, OCI referrer fetch, NotBefore enforcement.
 - `pkg/downloader/uri.go` — `WithImageVerifier` option threaded through `DownloadFileWithContext`.
 - `core/gallery/backends.go` — `backendDownloadOptions` builds the verifier from the gallery's policy.
 - `core/config/gallery.go` — `Gallery.Verification` YAML schema.
 - `core/cli/run.go`, `core/cli/backends.go`, `core/cli/models.go` — `--require-backend-integrity` flag propagation.
 - `.github/workflows/backend_merge.yml` — producer-side `cosign sign --recursive` after each multi-arch manifest list push.
 ## Out of scope (follow-ups)
 - **Signing the gallery YAML itself.** The index is fetched over HTTPS
  from GitHub; we trust the host. A cosign blob signature on the YAML
  would close that gap but adds key-management overhead. Revisit this
  page if/when added.
 - **Tarball/HTTP backend signing.** Cosign can sign arbitrary blobs, but
  for now non-OCI backends keep using the `sha256:` field in YAML.
--- a/.agents/llama-cpp-backend.md
+++ b/.agents/llama-cpp-backend.md
@@ -61,6 +61,12 @@ Always check `llama.cpp` for new model configuration options that should be supp
   - `reasoning_format` - Reasoning format options
   - Any new flags or parameters
 ### Speculative Decoding Types
 The `spec_type` option in `grpc-server.cpp` delegates to upstream's `common_speculative_types_from_names()`, so new speculative types added to the `common_speculative_type_from_name` map in `common/speculative.cpp` are picked up automatically with no code changes - only docs need an entry in `docs/content/advanced/model-configuration.md`. Current values: `none`, `draft-simple`, `draft-eagle3`, `draft-mtp`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, `ngram-cache`.
 `draft-mtp` (Multi-Token Prediction, [ggml-org/llama.cpp#22673](https://github.com/ggml-org/llama.cpp/pull/22673)) does not need a separate draft GGUF: when `spec_type` includes `draft-mtp` and `draftmodel` is empty, the upstream server creates an MTP context off the target model itself. LocalAI's gRPC layer needs no changes for this — it works through the existing `params.speculative.types` plumbing and the derived `cparams.n_rs_seq = params.speculative.need_n_rs_seq()` in `common_context_params_to_llama`.
 ### Implementation Guidelines
 1. **Feature Parity**: Always aim for feature parity with llama.cpp's implementation
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -278,6 +278,19 @@ include:
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "8"
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-12-liquid-audio'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "liquid-audio"
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "8"
@@ -808,6 +821,19 @@ include:
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-13-liquid-audio'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "liquid-audio"
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -1088,6 +1114,19 @@ include:
    backend: "vibevoice"
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
  - build-type: 'l4t'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/arm64'
    tag-latest: 'auto'
    tag-suffix: '-nvidia-l4t-cuda-13-arm64-liquid-audio'
    runs-on: 'ubuntu-24.04-arm'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    ubuntu-version: '2404'
    backend: "liquid-audio"
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
  - build-type: 'l4t'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -1729,6 +1768,19 @@ include:
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'hipblas'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-rocm-hipblas-liquid-audio'
    runs-on: 'ubuntu-latest'
    base-image: "rocm/dev-ubuntu-24.04:7.2.1"
    skip-drivers: 'false'
    backend: "liquid-audio"
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'hipblas'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2177,6 +2229,19 @@ include:
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'intel'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-intel-liquid-audio'
    runs-on: 'ubuntu-latest'
    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
    skip-drivers: 'false'
    backend: "liquid-audio"
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'intel'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -3503,6 +3568,20 @@ include:
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    platform-tag: 'amd64'
    tag-latest: 'auto'
    tag-suffix: '-cpu-liquid-audio'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "liquid-audio"
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
--- a/.github/workflows/backend_merge.yml
+++ b/.github/workflows/backend_merge.yml
@@ -31,6 +31,13 @@ on:
 jobs:
  merge:
    runs-on: ubuntu-latest
    # id-token: write is required for keyless cosign — the workflow
    # exchanges the GitHub OIDC token for a short-lived Fulcio cert that
    # signs each pushed manifest. Without this permission the runner
    # cannot mint the token, and `cosign sign` fails with "no token".
    permissions:
      contents: read
      id-token: write
    env:
      quay_username: ${{ secrets.quayUsername }}
    steps:
@@ -57,6 +64,16 @@ jobs:
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@master
      # cosign signs each pushed manifest list with --recursive so the
      # index and every per-arch entry get an attached Sigstore bundle.
      # Recent cosign releases always emit the new bundle format, so
      # there's no extra CLI flag to opt into it.
      - name: Install cosign
        if: github.event_name != 'pull_request'
        uses: sigstore/cosign-installer@v3
        with:
          cosign-release: 'v2.4.1'
      - name: Login to DockerHub
        if: github.event_name != 'pull_request'
        uses: docker/login-action@v4
@@ -88,6 +105,25 @@ jobs:
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.tag-suffix }},onlatest=true
      # Source from ci-cache, not local-ai-backends.
      #
      # The build job pushes per-arch manifests to local-ai-backends with
      # push-by-digest=true (no tag), then anchors a tagged copy into
      # ci-cache so the manifest can be retrieved hours later when this
      # merge runs. Quay's manifest GC, however, is per-repository: the
      # anchor tag in ci-cache protects the manifest there, but the same
      # digest in local-ai-backends has no tag in *that* repo and gets
      # reaped independently. Sourcing local-ai-backends@<digest> here
      # then fails with "manifest not found" — exactly the regression
      # we hit on v4.2.2 (19/37 multiarch merges failed).
      #
      # ci-cache@<digest> resolves because we anchored it there. buildx
      # imagetools create copies the manifest into local-ai-backends
      # (cross-repo within the same registry, blobs already cross-mounted
      # from the original push so no transfer needed) and publishes the
      # manifest list with the user-facing tags. The resulting manifest
      # list is fully self-contained in local-ai-backends — child digests
      # only, no embedded references to ci-cache.
      - name: Create manifest list and push (quay)
        if: github.event_name != 'pull_request'
        working-directory: /tmp/digests
@@ -101,11 +137,25 @@ jobs:
          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          if [ -z "$tags" ]; then
            echo "No quay.io tags from docker/metadata-action; skipping quay merge"
-          else
+            exit 0
            # shellcheck disable=SC2086
            docker buildx imagetools create $tags \
              $(printf 'quay.io/go-skynet/local-ai-backends@sha256:%s ' *)
          fi
          # shellcheck disable=SC2086
          docker buildx imagetools create $tags \
            $(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
          # Resolve the manifest-list digest (any tag points at it) so
          # cosign can sign by digest. Signing by tag would leave the
          # signature orphaned the next time the tag moves.
          first_tag=$(jq -cr '
            .tags | map(select(startswith("quay.io/"))) | .[0]
          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
          # --recursive walks the list and signs every per-arch entry
          # too — clients that resolve a tag to a platform-specific
          # manifest before checking signatures need the per-arch
          # signatures, not just the list-level one.
          cosign sign --yes --recursive \
            --registry-referrers-mode=oci-1-1 \
            "quay.io/go-skynet/local-ai-backends@${digest}"
      - name: Create manifest list and push (dockerhub)
        if: github.event_name != 'pull_request'
@@ -120,11 +170,18 @@ jobs:
          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          if [ -z "$tags" ]; then
            echo "No dockerhub tags from docker/metadata-action; skipping dockerhub merge"
-          else
+            exit 0
            # shellcheck disable=SC2086
            docker buildx imagetools create $tags \
              $(printf 'localai/localai-backends@sha256:%s ' *)
          fi
          # shellcheck disable=SC2086
          docker buildx imagetools create $tags \
            $(printf 'localai/localai-backends@sha256:%s ' *)
          first_tag=$(jq -cr '
            .tags | map(select(startswith("localai/"))) | .[0]
          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
          cosign sign --yes --recursive \
            --registry-referrers-mode=oci-1-1 \
            "localai/localai-backends@${digest}"
      - name: Inspect manifest
        if: github.event_name != 'pull_request'
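
The quay merge step above can be read as a small planning function: skip when no quay.io tag exists, otherwise source every per-arch digest from ci-cache and publish under the user-facing tags. The sketch below is an illustration only. The function name `merge_plan` and its return shape are hypothetical, and it assumes the elided jq filter turns the metadata tags into `-t <tag>` arguments (`-t` is the standard `--tag` flag of `docker buildx imagetools create`).

```python
from typing import Optional


def merge_plan(tags: list[str], digests: list[str],
               source_repo: str = "quay.io/go-skynet/ci-cache") -> Optional[dict]:
    """Build the imagetools argv for the quay merge, or None to skip.

    Hypothetical helper: mirrors the shell step, it is not part of the workflow.
    """
    quay_tags = [t for t in tags if t.startswith("quay.io/")]
    if not quay_tags:
        # Same outcome as the `exit 0` branch: nothing to merge for quay.
        return None

    argv = ["docker", "buildx", "imagetools", "create"]
    for tag in quay_tags:
        argv += ["-t", tag]
    # The shell glob `*` expands in sorted order; each file name in
    # /tmp/digests is a bare sha256 hex digest.
    argv += [f"{source_repo}@sha256:{d}" for d in sorted(digests)]

    # Any tag resolves to the manifest-list digest that cosign signs.
    return {"argv": argv, "first_tag": quay_tags[0]}
```

Sourcing from `ci-cache` rather than `local-ai-backends` is the point of the change: only the former holds a tag that keeps the digest alive until the merge runs.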
--- a/.github/workflows/image.yml
+++ b/.github/workflows/image.yml
@@ -151,7 +151,11 @@
              ubuntu-codename: 'noble'
    core-image-merge:
-      if: github.repository == 'mudler/LocalAI'
+      # !cancelled(): without it, GHA's default `needs:` cascade skips the
      # merge whenever any matrix cell of the parent build fails or is
      # cancelled. Same fix as backend.yml's merge jobs — we still want to
      # publish the manifest list for tag-suffixes whose legs all succeeded.
      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: core-image-build
      uses: ./.github/workflows/image_merge.yml
      with:
@@ -164,7 +168,7 @@
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    gpu-vulkan-image-merge:
-      if: github.repository == 'mudler/LocalAI'
+      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: core-image-build
      uses: ./.github/workflows/image_merge.yml
      with:
@@ -175,7 +179,91 @@
        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-  
+
    # Single-arch server-image merges. Same conceptual fix as the backend
    # singletons in PR #9781: image_build.yml pushes by canonical digest
    # only, so without a downstream merge step there's no tag for consumers
    # (no :latest-gpu-nvidia-cuda-12, no :v<X>-gpu-nvidia-cuda-12, etc.).
    # Each merge job needs only its parent build matrix and is filtered by
    # tag-suffix in image_merge.yml's artifact-download pattern.
    gpu-nvidia-cuda-12-image-merge:
      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: core-image-build
      uses: ./.github/workflows/image_merge.yml
      with:
        tag-latest: 'auto'
        tag-suffix: '-gpu-nvidia-cuda-12'
      secrets:
        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    gpu-nvidia-cuda-13-image-merge:
      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: core-image-build
      uses: ./.github/workflows/image_merge.yml
      with:
        tag-latest: 'auto'
        tag-suffix: '-gpu-nvidia-cuda-13'
      secrets:
        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    gpu-intel-image-merge:
      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: core-image-build
      uses: ./.github/workflows/image_merge.yml
      with:
        tag-latest: 'auto'
        tag-suffix: '-gpu-intel'
      secrets:
        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    gpu-hipblas-image-merge:
      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: hipblas-jobs
      uses: ./.github/workflows/image_merge.yml
      with:
        tag-latest: 'auto'
        tag-suffix: '-gpu-hipblas'
      secrets:
        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    nvidia-l4t-arm64-image-merge:
      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: gh-runner
      uses: ./.github/workflows/image_merge.yml
      with:
        tag-latest: 'auto'
        tag-suffix: '-nvidia-l4t-arm64'
      secrets:
        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    nvidia-l4t-arm64-cuda-13-image-merge:
      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: gh-runner
      uses: ./.github/workflows/image_merge.yml
      with:
        tag-latest: 'auto'
        tag-suffix: '-nvidia-l4t-arm64-cuda-13'
      secrets:
        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    gh-runner:
      if: github.repository == 'mudler/LocalAI'
      uses: ./.github/workflows/image_build.yml
--- a/.github/workflows/image_build.yml
+++ b/.github/workflows/image_build.yml
@@ -106,6 +106,7 @@ jobs:
            type=ref,event=branch
            type=semver,pattern={{raw}}
            type=sha
            type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
          flavor: |
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.tag-suffix }},onlatest=true
@@ -185,11 +186,28 @@ jobs:
          digest="${{ steps.build.outputs.digest }}"
          touch "/tmp/digests/${digest#sha256:}"
      # See .github/scripts/anchor-digest-in-cache.sh for why this is needed
      # and how it interacts with image_merge.yml's cleanup step. Mirrors the
      # same anchor in backend_build.yml — quay's per-repo manifest GC reaps
      # untagged manifests in local-ai before the merge runs.
      - name: Anchor digest in ci-cache so quay GC won't reap before merge
        if: github.event_name != 'pull_request'
        env:
          TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
          PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
          DIGEST: ${{ steps.build.outputs.digest }}
          SOURCE_IMAGE: quay.io/go-skynet/local-ai
        run: .github/scripts/anchor-digest-in-cache.sh
      - name: Upload digest artifact
        if: github.event_name != 'pull_request'
        uses: actions/upload-artifact@v7
        with:
-          name: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}-${{ inputs.platform-tag }}
+          # `--` separator + 'single' placeholder for empty platform-tag —
          # same pattern as backend_build.yml. Prevents prefix collisions
          # in the merge-side glob (e.g. -nvidia-l4t-arm64 is a prefix of
          # -nvidia-l4t-arm64-cuda-13).
          name: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
          path: /tmp/digests/*
          if-no-files-found: error
          retention-days: 1
--- a/.github/workflows/image_merge.yml
+++ b/.github/workflows/image_merge.yml
@@ -33,10 +33,22 @@ jobs:
    env:
      quay_username: ${{ secrets.quayUsername }}
    steps:
      # Sparse checkout: needed for .github/scripts/ (the keepalive cleanup
      # script). Skips the rest of the source tree.
      - name: Checkout (.github/scripts only)
        uses: actions/checkout@v6
        with:
          sparse-checkout: |
            .github/scripts
          sparse-checkout-cone-mode: false
      - name: Download digests
        uses: actions/download-artifact@v8
        with:
-          pattern: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}-*
+          # `--` separator anchors the glob so we don't over-match sibling
          # tag-suffixes (e.g. -nvidia-l4t-arm64 vs -nvidia-l4t-arm64-cuda-13).
          # Must stay in sync with image_build.yml's upload-artifact name.
          pattern: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--*
          merge-multiple: true
          path: /tmp/digests
@@ -68,10 +80,18 @@ jobs:
            type=ref,event=branch
            type=semver,pattern={{raw}}
            type=sha
            type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
          flavor: |
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.tag-suffix }},onlatest=true
      # Source from ci-cache, not local-ai. See backend_merge.yml for the
      # detailed rationale — quay's manifest GC is per-repository, so the
      # untagged digest in local-ai gets reaped while the same content lives
      # tagged under ci-cache (anchored by image_build.yml). buildx imagetools
      # create copies the manifest into local-ai (blobs already cross-mounted)
      # and publishes the manifest list with user-facing tags. End state in
      # local-ai is self-contained; no embedded reference to ci-cache.
      - name: Create manifest list and push (quay)
        working-directory: /tmp/digests
        run: |
@@ -82,7 +102,7 @@ jobs:
          else
            # shellcheck disable=SC2086
            docker buildx imagetools create $tags \
-              $(printf 'quay.io/go-skynet/local-ai@sha256:%s ' *)
+              $(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
          fi
      - name: Create manifest list and push (dockerhub)
@@ -107,6 +127,15 @@ jobs:
            docker buildx imagetools inspect "$first_tag"
          fi
      # See .github/scripts/cleanup-keepalive-tags.sh for the best-effort
      # semantics — fails soft when the registry credential isn't OAuth-scoped.
      - name: Cleanup keepalive tags in ci-cache
        if: github.event_name != 'pull_request' && success()
        env:
          TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
          QUAY_TOKEN: ${{ secrets.quayPassword }}
        run: .github/scripts/cleanup-keepalive-tags.sh
      - name: Job summary
        run: |
          set -euo pipefail
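
The prefix collision that the `--` separator prevents can be reproduced offline. The sketch below rebuilds the artifact names and download patterns from the expressions in image_build.yml and image_merge.yml and matches them with Python's `fnmatchcase`. That matcher is an assumption standing in for the glob matching done by actions/download-artifact, and the helper names are hypothetical.

```python
from fnmatch import fnmatchcase


def artifact_name(tag_suffix: str, platform_tag: str, sep: str) -> str:
    """Mirror the upload-artifact name expression from image_build.yml."""
    suffix = tag_suffix or "-core"  # empty tag-suffix falls back to '-core'
    # Only the new '--' scheme substitutes 'single' for an empty platform-tag.
    platform = platform_tag or ("single" if sep == "--" else "")
    return f"digests-localai{suffix}{sep}{platform}"


def download_pattern(tag_suffix: str, sep: str) -> str:
    """Mirror the download-artifact pattern from image_merge.yml."""
    suffix = tag_suffix or "-core"
    return f"digests-localai{suffix}{sep}*"


SUFFIXES = ["-nvidia-l4t-arm64", "-nvidia-l4t-arm64-cuda-13"]


def matches(sep: str) -> dict:
    """Map each merge job's suffix to the artifact names its glob selects."""
    names = [artifact_name(s, "", sep) for s in SUFFIXES]
    return {
        s: [n for n in names if fnmatchcase(n, download_pattern(s, sep))]
        for s in SUFFIXES
    }


if __name__ == "__main__":
    print("old '-' scheme: ", matches("-"))
    print("new '--' scheme:", matches("--"))
```

Under the old single-dash scheme the `-nvidia-l4t-arm64` merge job also selects the `-nvidia-l4t-arm64-cuda-13` artifact. With `--` each job selects only its own.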
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
@@ -28,6 +28,7 @@ jobs:
      qwen-asr: ${{ steps.detect.outputs.qwen-asr }}
      nemo: ${{ steps.detect.outputs.nemo }}
      voxcpm: ${{ steps.detect.outputs.voxcpm }}
      liquid-audio: ${{ steps.detect.outputs.liquid-audio }}
      llama-cpp-quantization: ${{ steps.detect.outputs.llama-cpp-quantization }}
      llama-cpp: ${{ steps.detect.outputs.llama-cpp }}
      ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
@@ -447,6 +448,32 @@ jobs:
        run: |
          make --jobs=5 --output-sync=target -C backend/python/voxcpm
          make --jobs=5 --output-sync=target -C backend/python/voxcpm test
  # liquid-audio: LFM2.5-Audio any-to-any backend. The CI smoke test
  # exercises Health() and LoadModel(mode:finetune) — fine-tune mode
  # short-circuits before pulling weights (backend.py:192), so no
  # HuggingFace download or GPU is needed. The full-inference path is
  # gated on LIQUID_AUDIO_MODEL_ID, which we don't set here.
  tests-liquid-audio:
    needs: detect-changes
    if: needs.detect-changes.outputs.liquid-audio == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential ffmpeg
          sudo apt-get install -y ca-certificates cmake curl patch python3-pip
          # Install UV
          curl -LsSf https://astral.sh/uv/install.sh | sh
          pip install --user --no-cache-dir grpcio-tools==1.64.1
      - name: Test liquid-audio
        run: |
          make --jobs=5 --output-sync=target -C backend/python/liquid-audio
          make --jobs=5 --output-sync=target -C backend/python/liquid-audio test
  tests-llama-cpp-quantization:
    needs: detect-changes
    if: needs.detect-changes.outputs.llama-cpp-quantization == 'true' || needs.detect-changes.outputs.run-all == 'true'
--- a/.gitignore
+++ b/.gitignore
@@ -77,3 +77,6 @@ local-backends/
 tests/e2e-ui/ui-test-server
 core/http/react-ui/playwright-report/
 core/http/react-ui/test-results/
 # Local worktrees
 .worktrees/
--- a/.golangci.yml
+++ b/.golangci.yml
@@ -46,8 +46,52 @@ linters:
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.Fail. See .agents/coding-style.md.'
        - pattern: '^t\.FailNow$'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.FailNow. See .agents/coding-style.md.'
        # In-process config should flow through ApplicationConfig / kong-bound
        # CLI flags, not via os.Getenv. The CLI layer is the legitimate
        # env→struct boundary (kong's `env:"..."` tag); anything deeper that
        # reads env directly leaks process state into business logic and
        # makes flags impossible to test or override per-request. Backend
        # subprocesses, the system/capabilities probe, and a few places that
        # read non-LocalAI env vars (HOME, PATH, AUTH_TOKEN passed by parent)
        # are exempt — see linters.exclusions.rules below.
        - pattern: '^os\.(Getenv|LookupEnv|Environ)$'
          msg: 'Plumb config through ApplicationConfig (or the relevant CLI struct) instead of reading env directly. CLI entry points (core/cli/) bind env vars via kong''s `env:` tag — that is the only sanctioned env→struct boundary. See .agents/coding-style.md.'
  exclusions:
    paths:
      # Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
      - 'backend/go/whisper/sources'
      - 'docs/'
    rules:
      # CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
      # boundary, and a handful of subcommands legitimately propagate values
      # to spawned subprocesses (LLAMACPP_GRPC_SERVERS, MLX hostfile, ...).
      - path: ^core/cli/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      # Backend subprocesses are independent binaries with their own env
      # surface; they're not "in-process config" of the LocalAI server.
      - path: ^backend/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      # System capability probe reads HOME, PATH-style vars to discover
      # GPUs, default paths, etc. — not LocalAI config.
      - path: ^pkg/system/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      # gRPC server reads AUTH_TOKEN passed in by the parent process at spawn
      # time; model.Loader sets/inherits env to communicate with subprocesses.
      - path: ^pkg/grpc/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      - path: ^pkg/model/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      # Top-level main binaries (local-ai, launcher) are entry points.
      - path: ^cmd/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      # Tests legitimately read $HOME, $TMPDIR, and gating env vars
      # (LOCALAI_COSIGN_LIVE, etc.) to skip live-network specs.
      - path: _test\.go$
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
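
To see which call sites the new forbidigo rule would report, the pattern and the path exclusions above can be modelled with plain regular expressions. This is a rough approximation, not golangci-lint's actual evaluation order: it ignores the `exclusions.paths` entries, and the function name `is_flagged` and the sample file paths in the usage note are made up for illustration.

```python
import re

# Pattern copied from the forbidigo rule in .golangci.yml.
FORBIDDEN = re.compile(r"^os\.(Getenv|LookupEnv|Environ)$")

# Path regexes copied from linters.exclusions.rules.
EXEMPT_PATHS = [
    re.compile(p)
    for p in (
        r"^core/cli/",
        r"^backend/",
        r"^pkg/system/",
        r"^pkg/grpc/",
        r"^pkg/model/",
        r"^cmd/",
        r"_test\.go$",
    )
]


def is_flagged(path: str, call: str) -> bool:
    """Return True when `call` in file `path` would be reported (approximation)."""
    if not FORBIDDEN.search(call):
        return False  # e.g. os.Setenv is not covered by the rule
    return not any(p.search(path) for p in EXEMPT_PATHS)
```

For example, `is_flagged("core/http/app.go", "os.Getenv")` is true in this model, while the same call under `core/cli/` or in any `_test.go` file is exempt.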
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -31,6 +31,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 | [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
 | [.agents/adding-gallery-models.md](.agents/adding-gallery-models.md) | Adding GGUF models from HuggingFace to the model gallery |
 | [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) | LocalAI Assistant chat modality — adding admin tools to the in-process MCP server, editing skill prompts, keeping REST + MCP + skills in sync |
 | [.agents/backend-signing.md](.agents/backend-signing.md) | Backend OCI image signing (keyless cosign + sigstore-go) — producer-side CI setup, consumer-side gallery `verification:` block, strict mode (`LOCALAI_REQUIRE_BACKEND_INTEGRITY`), revocation via `not_before` |
 ## Quick Reference
--- a/Makefile
+++ b/Makefile
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -463,6 +463,7 @@ prepare-test-extra: protogen-python
 	$(MAKE) -C backend/python/vllm-omni
 	$(MAKE) -C backend/python/sglang
 	$(MAKE) -C backend/python/vibevoice
 	$(MAKE) -C backend/python/liquid-audio
 	$(MAKE) -C backend/python/moonshine
 	$(MAKE) -C backend/python/pocket-tts
 	$(MAKE) -C backend/python/qwen-tts
@@ -488,6 +489,7 @@ test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/vllm test
 	$(MAKE) -C backend/python/vllm-omni test
 	$(MAKE) -C backend/python/vibevoice test
 	$(MAKE) -C backend/python/liquid-audio test
 	$(MAKE) -C backend/python/moonshine test
 	$(MAKE) -C backend/python/pocket-tts test
 	$(MAKE) -C backend/python/qwen-tts test
@@ -1092,6 +1094,7 @@ BACKEND_SGLANG = sglang|python|.|false|true
 BACKEND_DIFFUSERS = diffusers|python|.|--progress=plain|true
 BACKEND_CHATTERBOX = chatterbox|python|.|false|true
 BACKEND_VIBEVOICE = vibevoice|python|.|--progress=plain|true
 BACKEND_LIQUID_AUDIO = liquid-audio|python|.|--progress=plain|true
 BACKEND_MOONSHINE = moonshine|python|.|false|true
 BACKEND_POCKET_TTS = pocket-tts|python|.|false|true
 BACKEND_QWEN_TTS = qwen-tts|python|.|false|true
@@ -1169,6 +1172,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SGLANG)))
 $(eval $(call generate-docker-build-target,$(BACKEND_DIFFUSERS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_CHATTERBOX)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LIQUID_AUDIO)))
 $(eval $(call generate-docker-build-target,$(BACKEND_MOONSHINE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_POCKET_TTS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_QWEN_TTS)))
@@ -1197,7 +1201,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar
-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
 ########################################################
 ### Mock Backend for E2E Tests
--- a/backend/backend.proto
+++ b/backend/backend.proto
@@ -48,6 +48,11 @@ service Backend {
  rpc AudioTransform(AudioTransformRequest) returns (AudioTransformResult) {}
  rpc AudioTransformStream(stream AudioTransformFrameRequest) returns (stream AudioTransformFrameResponse) {}
  // AudioToAudioStream is the bidirectional any-to-any S2S RPC. Backends
  // that load a speech-to-speech model consume input audio frames and emit
  // interleaved audio + transcript + tool-call deltas as typed events.
  // Backends without S2S support return UNIMPLEMENTED.
  rpc AudioToAudioStream(stream AudioToAudioRequest) returns (stream AudioToAudioResponse) {}
  rpc ModelMetadata(ModelOptions) returns (ModelMetadataResponse) {}
@@ -768,6 +773,93 @@ message AudioTransformFrameResponse {
  int64 frame_index = 2;
 }
 // === AudioToAudioStream messages =========================================
 //
 // Bidirectional stream between the LocalAI core and an any-to-any audio
 // model. The client opens the stream with a Config payload, then alternates
 // Frame (input audio) and Control (turn boundaries, function-call results,
 // session updates) payloads. The server streams back typed events: audio
 // frames carry PCM in `pcm`; transcript / tool-call deltas carry JSON in
 // `meta`; the stream ends with a `response.done` (success) or `error` event.
 message AudioToAudioRequest {
  oneof payload {
    AudioToAudioConfig  config  = 1;
    AudioToAudioFrame   frame   = 2;
    AudioToAudioControl control = 3;
  }
 }
 message AudioToAudioConfig {
  // PCM format for client→server audio. 0 => backend default
  // (16 kHz for the LFM2-Audio Conformer encoder).
  int32 input_sample_rate = 1;
  // Preferred server→client audio rate. 0 => backend default
  // (24 kHz for the LFM2-Audio vocoder).
  int32 output_sample_rate = 2;
  // Optional system prompt override. Empty => backend chooses based on
  // mode (e.g. "Respond with interleaved text and audio.").
  string system_prompt = 3;
  // Optional baked-voice id. Models that only ship a fixed set of
  // voices (e.g. LFM2-Audio: us_male/us_female/uk_male/uk_female) match
  // this against their voice table; an empty string keeps the default.
  string voice = 4;
  // JSON-encoded array of tool definitions in OpenAI Chat Completions
  // format. Empty => no tools.
  string tools = 5;
  // Free-form sampling / decoding parameters (temperature, top_k,
  // max_new_tokens, audio_top_k, etc).
  map<string, string> params = 6;
  // True => reset any session-scoped state before processing further
  // frames on this stream. The first Config implicitly resets.
  bool reset = 7;
 }
 message AudioToAudioFrame {
  // Raw PCM s16le mono at config.input_sample_rate. Empty pcm + end_of_input
  // is a valid "user finished speaking" marker without trailing audio.
  bytes pcm = 1;
  // Marks the last frame of a user turn. The backend may begin emitting
  // a response immediately after seeing this.
  bool end_of_input = 2;
 }
 message AudioToAudioControl {
  // Free-form control event names. Initial set:
  //   "input_audio_buffer.commit"     — user finished speaking
  //   "response.cancel"               — abort in-flight generation
  //   "conversation.item.create"      — inject a non-audio item (e.g.
  //                                     function_call_output as JSON in
  //                                     `payload`)
  //   "session.update"                — re-configure mid-stream
  string event = 1;
  // Event-specific JSON payload.
  bytes payload = 2;
 }
 message AudioToAudioResponse {
  // Event identifies what this frame carries. Mirrors the OpenAI Realtime
  // API server-event names where applicable. Initial set:
  //   "response.audio.delta"
  //   "response.audio_transcript.delta"
  //   "response.function_call_arguments.delta"
  //   "response.function_call_arguments.done"
  //   "response.done"
  //   "error"
  string event = 1;
  // Populated when event = response.audio.delta.
  bytes pcm = 2;
  // Populated alongside pcm to identify its rate. 0 => same as the
  // session's negotiated output_sample_rate.
  int32 sample_rate = 3;
  // JSON payload for non-PCM events (transcript chunk, tool args, error
  // body).
  bytes meta = 4;
  // Monotonic per-stream counter, useful for client reordering and
  // debugging.
  int64 sequence = 5;
 }
 message ModelMetadataResponse {
  bool supports_thinking = 1;
  string rendered_template = 2;  // The rendered chat template with enable_thinking=true (empty if not applicable)
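
The client side of the AudioToAudioStream contract described above (one Config first, then Frames, then an end-of-input marker) can be sketched without the generated gRPC stubs. The `Config` and `Frame` dataclasses below are simplified stand-ins for AudioToAudioConfig and AudioToAudioFrame, not the generated protobuf classes. The 20 ms frame size is an arbitrary choice for the example, and the 16 kHz fallback follows the comment on `input_sample_rate`.

```python
from dataclasses import dataclass
from typing import Iterator, Union

BYTES_PER_SAMPLE = 2      # s16le mono, per the AudioToAudioFrame comment
DEFAULT_INPUT_RATE = 16000  # assumed backend default when the field is 0


@dataclass
class Config:
    """Stand-in for AudioToAudioConfig (only the rate fields)."""
    input_sample_rate: int = 0   # 0 => backend default
    output_sample_rate: int = 0  # 0 => backend default


@dataclass
class Frame:
    """Stand-in for AudioToAudioFrame."""
    pcm: bytes
    end_of_input: bool = False


def request_stream(pcm: bytes, cfg: Config,
                   frame_ms: int = 20) -> Iterator[Union[Config, Frame]]:
    """Yield the payloads of one user turn in the order the proto describes."""
    yield cfg  # the stream opens with a Config payload
    rate = cfg.input_sample_rate or DEFAULT_INPUT_RATE
    step = rate * frame_ms // 1000 * BYTES_PER_SAMPLE
    for offset in range(0, len(pcm), step):
        yield Frame(pcm=pcm[offset:offset + step])
    # Empty pcm + end_of_input marks "user finished speaking".
    yield Frame(pcm=b"", end_of_input=True)
```

In a real client each yielded item would be wrapped in the `payload` oneof of AudioToAudioRequest before being written to the stream.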
--- a/backend/cpp/ds4/Makefile
+++ b/backend/cpp/ds4/Makefile
@@ -1,10 +1,10 @@
 # ds4 backend Makefile.
 #
-# Upstream pin lives below as DS4_VERSION?= so the bump-deps bot
+# Upstream pin lives below as DS4_VERSION?=8d576642c39b9a2d782a80159ba84ef5a81c0b81
 # (.github/bump_deps.sh) can find and update it - matches the
 # llama-cpp / ik-llama-cpp / turboquant convention.
-DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
+DS4_VERSION?=8d576642c39b9a2d782a80159ba84ef5a81c0b81
 DS4_REPO?=https://github.com/antirez/ds4
 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@
-IK_LLAMA_VERSION?=eb570eb96689c235933b813693ca28ab9d3d26de
+IK_LLAMA_VERSION?=b3d39cff8bffbd67296d6badd4076a1486a0715c
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@
-LLAMA_VERSION?=1ec7ba0c14f33f17e980daeeda5f35b225d41994
+LLAMA_VERSION?=1acee6bf8939948f9bcbf4b14034e4b475f06069
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -32,6 +32,7 @@
 #include <grpcpp/health_check_service_interface.h>
 #include <grpcpp/security/server_credentials.h>
 #include <regex>
 #include <algorithm>
 #include <atomic>
 #include <cstdlib>
 #include <fstream>
@@ -450,6 +451,8 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
        // vector; the turboquant fork still uses the legacy scalar. The
        // LOCALAI_LEGACY_LLAMA_CPP_SPEC macro is injected by
        // backend/cpp/turboquant/patch-grpc-server.sh for fork builds only.
        // Upstream renamed COMMON_SPECULATIVE_TYPE_DRAFT -> ..._DRAFT_SIMPLE
        // in ggml-org/llama.cpp#22964; the fork still uses the old name.
 #ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
        if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
            params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
@@ -458,7 +461,7 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
        const bool no_spec_type = params.speculative.types.empty() ||
            (params.speculative.types.size() == 1 && params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE);
        if (no_spec_type) {
-            params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT };
+            params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE };
        }
 #endif
    }
@@ -514,16 +517,27 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
    params.warmup = true;
    // no_op_offload: disable host tensor op offload (default: false)
    params.no_op_offload = false;
-    // kv_unified: enable unified KV cache (default: false)
-    params.kv_unified = false;
-    // n_ctx_checkpoints: max context checkpoints per slot (default: 8)
-    params.n_ctx_checkpoints = 8;
-
-    // llama memory fit fails if we don't provide a buffer for tensor overrides
-    const size_t ntbo = llama_max_tensor_buft_overrides();
-    while (params.tensor_buft_overrides.size() < ntbo) {
-        params.tensor_buft_overrides.push_back({nullptr, nullptr});
-    }
+    // kv_unified: enable unified KV cache. Upstream's server auto-enables this
+    // when the slot count is auto (-np <0), bumping n_parallel to 4 alongside.
+    // LocalAI keeps n_parallel=1 by default, which would skip that auto path
+    // and leave kv_unified=false. We flip the default to true here so the
+    // server-side prompt cache (cache_idle_slots) is actually usable on the
+    // single-slot path that LocalAI ships with: without it, idle slots are
+    // never persisted across requests and the prompt cache is dead weight.
+    // Users can opt out with `options: [ "kv_unified:false" ]`.
+    params.kv_unified = true;
+    // n_ctx_checkpoints: max context checkpoints per slot. Match upstream's
+    // default (32); the previous LocalAI-specific 8 was unnecessarily tight
+    // and limits partial-prefix recovery without a clear memory rationale.
+    params.n_ctx_checkpoints = 32;
+    // cache_idle_slots: save and clear idle slot KV to the prompt cache on
+    // task switch. Upstream default is true; the server auto-disables it if
+    // kv_unified=false or cache_ram_mib=0, so flipping kv_unified above is
+    // what actually unlocks it.
+    params.cache_idle_slots = true;
+    // checkpoint_every_nt: create a context checkpoint every N tokens during
+    // prefill (-1 disables). Match upstream's default (8192).
+    params.checkpoint_every_nt = 8192;
     // decode options. Options are in form optname:optvale, or if booleans only optname.
    for (int i = 0; i < request->options_size(); i++) {
@@ -682,9 +696,161 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                try {
                    params.n_ctx_checkpoints = std::stoi(optval_str);
                } catch (const std::exception& e) {
-                    // If conversion fails, keep default value (8)
+                    // If conversion fails, keep default value (32)
                }
            }
        // --- server-side idle-slot prompt cache toggle (upstream --cache-idle-slots) ---
        // Saves the slot's KV state into the host-side prompt cache on task
        // switch so a later request with the same prefix can warm-load it.
        // Auto-disabled by the server if kv_unified=false or cache_ram=0.
        } else if (!strcmp(optname, "cache_idle_slots") || !strcmp(optname, "idle_slots_cache")) {
            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
                params.cache_idle_slots = true;
            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
                params.cache_idle_slots = false;
            }
        // --- prefill checkpoint cadence (upstream -cpent / --checkpoint-every-n-tokens) ---
        // -1 disables checkpointing during prefill.
        } else if (!strcmp(optname, "checkpoint_every_nt") || !strcmp(optname, "checkpoint_every_n_tokens")) {
            if (optval != NULL) {
                try {
                    params.checkpoint_every_nt = std::stoi(optval_str);
                } catch (const std::exception& e) {
                    // If conversion fails, keep default value (8192)
                }
            }
        // --- physical batch size (upstream -ub / --ubatch-size) ---
        // Note: line ~482 already aliases n_ubatch to n_batch as a default; this
        // option lets users decouple the two (useful for embeddings/rerank).
        } else if (!strcmp(optname, "n_ubatch") || !strcmp(optname, "ubatch")) {
            if (optval != NULL) {
                try { params.n_ubatch = std::stoi(optval_str); } catch (...) {}
            }
        // --- main-model batch threads (upstream -tb / --threads-batch) ---
        } else if (!strcmp(optname, "threads_batch") || !strcmp(optname, "n_threads_batch")) {
            if (optval != NULL) {
                try {
                    int n = std::stoi(optval_str);
                    if (n <= 0) n = (int)std::thread::hardware_concurrency();
                    params.cpuparams_batch.n_threads = n;
                } catch (...) {}
            }
        // --- pooling type for embeddings (upstream --pooling) ---
        } else if (!strcmp(optname, "pooling_type") || !strcmp(optname, "pooling")) {
            if (optval != NULL) {
                if      (optval_str == "none") params.pooling_type = LLAMA_POOLING_TYPE_NONE;
                else if (optval_str == "mean") params.pooling_type = LLAMA_POOLING_TYPE_MEAN;
                else if (optval_str == "cls")  params.pooling_type = LLAMA_POOLING_TYPE_CLS;
                else if (optval_str == "last") params.pooling_type = LLAMA_POOLING_TYPE_LAST;
                else if (optval_str == "rank") params.pooling_type = LLAMA_POOLING_TYPE_RANK;
                // unknown values silently leave UNSPECIFIED (auto-detect)
            }
        // --- llama log verbosity threshold (upstream -lv / --verbosity) ---
        } else if (!strcmp(optname, "verbosity")) {
            if (optval != NULL) {
                try { params.verbosity = std::stoi(optval_str); } catch (...) {}
            }
        // --- O_DIRECT model loading (upstream --direct-io) ---
        } else if (!strcmp(optname, "direct_io") || !strcmp(optname, "use_direct_io")) {
            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
                params.use_direct_io = true;
            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
                params.use_direct_io = false;
            }
        // --- embedding normalization (upstream --embd-normalize) ---
        // -1 none, 0 max-abs, 1 taxicab, 2 L2 (default), >2 p-norm
        } else if (!strcmp(optname, "embd_normalize") || !strcmp(optname, "embedding_normalize")) {
            if (optval != NULL) {
                try { params.embd_normalize = std::stoi(optval_str); } catch (...) {}
            }
        // --- reasoning parser (upstream --reasoning-format) ---
        // Picks the parser for <think> blocks emitted by reasoning models.
        // none / auto / deepseek / deepseek-legacy
        } else if (!strcmp(optname, "reasoning_format")) {
            if (optval != NULL) {
                if      (optval_str == "none")             params.reasoning_format = COMMON_REASONING_FORMAT_NONE;
                else if (optval_str == "auto")             params.reasoning_format = COMMON_REASONING_FORMAT_AUTO;
                else if (optval_str == "deepseek")         params.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
                else if (optval_str == "deepseek-legacy" || optval_str == "deepseek_legacy")
                                                            params.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY;
                // unknown values silently keep the upstream default (DEEPSEEK)
            }
        // --- reasoning budget (upstream --reasoning-budget) ---
        // -1 unlimited, 0 disabled, >0 token budget for thinking blocks.
        // Distinct from per-request `enable_thinking` (chat_template_kwargs).
        } else if (!strcmp(optname, "enable_reasoning") || !strcmp(optname, "reasoning_budget")) {
            if (optval != NULL) {
                try { params.enable_reasoning = std::stoi(optval_str); } catch (...) {}
            }
        // --- prefill assistant turn (upstream --no-prefill-assistant) ---
        } else if (!strcmp(optname, "prefill_assistant")) {
            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
                params.prefill_assistant = true;
            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
                params.prefill_assistant = false;
            }
        // --- mmproj GPU offload (upstream --no-mmproj-offload, inverted) ---
        } else if (!strcmp(optname, "mmproj_use_gpu") || !strcmp(optname, "mmproj_offload")) {
            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
                params.mmproj_use_gpu = true;
            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
                params.mmproj_use_gpu = false;
            }
        // --- per-image vision token budget (upstream --image-min/max-tokens) ---
        } else if (!strcmp(optname, "image_min_tokens")) {
            if (optval != NULL) {
                try { params.image_min_tokens = std::stoi(optval_str); } catch (...) {}
            }
        } else if (!strcmp(optname, "image_max_tokens")) {
            if (optval != NULL) {
                try { params.image_max_tokens = std::stoi(optval_str); } catch (...) {}
            }
        // --- main-model tensor buffer overrides (upstream --override-tensor) ---
        // Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...
        // Mirrors the existing `draft_override_tensor` parser below.
        } else if (!strcmp(optname, "override_tensor") || !strcmp(optname, "tensor_buft_overrides")) {
            ggml_backend_load_all();
            std::map<std::string, ggml_backend_buffer_type_t> buft_list;
            for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
                auto * dev = ggml_backend_dev_get(i);
                auto * buft = ggml_backend_dev_buffer_type(dev);
                if (buft) {
                    buft_list[ggml_backend_buft_name(buft)] = buft;
                }
            }
            static std::list<std::string> override_names;
            std::string cur;
            auto flush = [&](const std::string & spec) {
                auto pos = spec.find('=');
                if (pos == std::string::npos) return;
                const std::string name = spec.substr(0, pos);
                const std::string type = spec.substr(pos + 1);
                auto it = buft_list.find(type);
                if (it == buft_list.end()) return; // unknown buffer type: ignore
                override_names.push_back(name);
                params.tensor_buft_overrides.push_back(
                    {override_names.back().c_str(), it->second});
            };
            for (char c : optval_str) {
                if (c == ',') { if (!cur.empty()) { flush(cur); cur.clear(); } }
                else { cur.push_back(c); }
            }
            if (!cur.empty()) flush(cur);
        // Speculative decoding options
        } else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
 #ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
@@ -701,16 +867,27 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            // Upstream switched to a vector of types (comma-separated for multi-type
            // chaining via common_speculative_types_from_names). We keep accepting a
            // single value here, but also tolerate comma-separated lists.
            //
            // ggml-org/llama.cpp#22964 also renamed the registered names from
            // underscore- to dash-separated form, and replaced the bare
            // `draft`/`eagle3` aliases with `draft-simple`/`draft-eagle3`. We
            // normalize each token here so existing model configs keep working.
            auto normalize_spec_name = [](std::string s) -> std::string {
                std::replace(s.begin(), s.end(), '_', '-');
                if (s == "draft")  return "draft-simple";
                if (s == "eagle3") return "draft-eagle3";
                return s;
            };
            std::vector<std::string> names;
            std::string item;
            for (char c : optval_str) {
                if (c == ',') {
-                    if (!item.empty()) { names.push_back(item); item.clear(); }
+                    if (!item.empty()) { names.push_back(normalize_spec_name(item)); item.clear(); }
                } else {
                    item.push_back(c);
                }
            }
-            if (!item.empty()) names.push_back(item);
+            if (!item.empty()) names.push_back(normalize_spec_name(item));
            auto parsed = common_speculative_types_from_names(names);
            if (!parsed.empty()) {
                params.speculative.types = parsed;
@@ -937,6 +1114,20 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
        params.kv_overrides.back().key[0] = 0;
    }
    // tensor_buft_overrides sentinel termination (mirrors upstream common/arg.cpp).
    // Real entries are pushed during option parsing; here we pad/terminate so the
    // model loader sees back().pattern == nullptr (GGML_ASSERT at common.cpp:1543)
    // and so llama_params_fit has the placeholder slots it requires.
    {
        const size_t ntbo = llama_max_tensor_buft_overrides();
        while (params.tensor_buft_overrides.size() < ntbo) {
            params.tensor_buft_overrides.push_back({nullptr, nullptr});
        }
    }
    if (!params.speculative.draft.tensor_buft_overrides.empty()) {
        params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
    }
    // TODO: Add yarn
    if (!request->tensorsplit().empty()) {
@@ -2794,7 +2985,9 @@ public:
            }
        }
-        int embd_normalize = 2; // default to Euclidean/L2 norm
+        // Honor the load-time embd_normalize set via options:embd_normalize.
+        // -1 none, 0 max-abs, 1 taxicab, 2 L2 (default), >2 p-norm.
+        int embd_normalize = params_base.embd_normalize;
        // create and queue the task
        auto rd = ctx_server.get_response_reader();
        {
--- a/backend/cpp/turboquant/Makefile
+++ b/backend/cpp/turboquant/Makefile
@@ -1,7 +1,7 @@
 # Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
 # Auto-bumped nightly by .github/workflows/bump_deps.yaml.
-TURBOQUANT_VERSION?=69d8e4be47243e83b3d0d71e932bc7aa61c644dc
+TURBOQUANT_VERSION?=5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403
 LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant
 CMAKE_ARGS?=
--- a/backend/go/acestep-cpp/Makefile
+++ b/backend/go/acestep-cpp/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
 # acestep.cpp version
 ACESTEP_REPO?=https://github.com/ace-step/acestep.cpp
-ACESTEP_CPP_VERSION?=e0c8d75a672fca5684c88c68dbf6d12f58754258
+ACESTEP_CPP_VERSION?=ed53caf164e4492a5620b2e3f2264629cf66da24
 SO_TARGET?=libgoacestepcpp.so
 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/go/acestep-cpp/cpp/goacestepcpp.cpp
+++ b/backend/go/acestep-cpp/cpp/goacestepcpp.cpp
@@ -22,12 +22,11 @@
 #include <vector>
 // Global model contexts (loaded once, reused across requests)
-static DiTGGML       g_dit       = {};
-static DiTGGMLConfig g_dit_cfg;
-static VAEGGML       g_vae       = {};
-static bool          g_dit_loaded = false;
-static bool          g_vae_loaded = false;
-static bool          g_is_turbo   = false;
+static DiTGGML g_dit        = {};
+static VAEGGML g_vae        = {};
+static bool    g_dit_loaded = false;
+static bool    g_vae_loaded = false;
+static bool    g_is_turbo   = false;
 // Silence latent [15000, 64] — read once from DiT GGUF
 static std::vector<float> g_silence_full;
@@ -72,10 +71,9 @@ int load_model(const char * lm_model_path, const char * text_encoder_path,
    g_text_enc_path = text_encoder_path;
    g_dit_path      = dit_model_path;
-    // Load DiT model
+    // Load DiT model (backend init + config are handled inside dit_ggml_load)
    fprintf(stderr, "[acestep-cpp] Loading DiT from %s\n", dit_model_path);
-    dit_ggml_init_backend(&g_dit);
-    if (!dit_ggml_load(&g_dit, dit_model_path, g_dit_cfg, nullptr, 0.0f)) {
+    if (!dit_ggml_load(&g_dit, dit_model_path)) {
        fprintf(stderr, "[acestep-cpp] FATAL: failed to load DiT from %s\n", dit_model_path);
        return 1;
    }
@@ -149,16 +147,16 @@ int generate_music(const char * caption, const char * lyrics, int bpm,
    // Compute T (latent frames at 25Hz)
    int T = (int)(duration * FRAMES_PER_SECOND);
-    T     = ((T + g_dit_cfg.patch_size - 1) / g_dit_cfg.patch_size) * g_dit_cfg.patch_size;
+    T     = ((T + g_dit.cfg.patch_size - 1) / g_dit.cfg.patch_size) * g_dit.cfg.patch_size;
-    int S = T / g_dit_cfg.patch_size;
+    int S = T / g_dit.cfg.patch_size;
    if (T > 15000) {
        fprintf(stderr, "[acestep-cpp] ERROR: T=%d exceeds max 15000\n", T);
        return 2;
    }
-    int Oc     = g_dit_cfg.out_channels;      // 64
+    int Oc     = g_dit.cfg.out_channels;      // 64
-    int ctx_ch = g_dit_cfg.in_channels - Oc;  // 128
+    int ctx_ch = g_dit.cfg.in_channels - Oc;  // 128
    fprintf(stderr, "[acestep-cpp] T=%d, S=%d, duration=%.1fs, seed=%d\n", T, S, duration, seed);
@@ -191,9 +189,8 @@ int generate_music(const char * caption, const char * lyrics, int bpm,
    fprintf(stderr, "[acestep-cpp] caption: %d tokens, lyrics: %d tokens\n", S_text, S_lyric);
-    // 4. Text encoder forward
+    // 4. Text encoder forward (backend init handled inside qwen3_load_text_encoder)
    Qwen3GGML text_enc = {};
-    qwen3_init_backend(&text_enc);
    if (!qwen3_load_text_encoder(&text_enc, g_text_enc_path.c_str())) {
        fprintf(stderr, "[acestep-cpp] FATAL: failed to load text encoder\n");
        return 4;
@@ -209,9 +206,8 @@ int generate_music(const char * caption, const char * lyrics, int bpm,
    std::vector<float> lyric_embed(H_text * S_lyric);
    qwen3_embed_lookup(&text_enc, lyric_ids.data(), S_lyric, lyric_embed.data());
-    // 6. Condition encoder
+    // 6. Condition encoder (backend init handled inside cond_ggml_load)
    CondGGML cond = {};
-    cond_ggml_init_backend(&cond);
    if (!cond_ggml_load(&cond, g_dit_path.c_str())) {
        fprintf(stderr, "[acestep-cpp] FATAL: failed to load condition encoder\n");
        qwen3_free(&text_enc);
--- a/backend/go/stablediffusion-ggml/Makefile
+++ b/backend/go/stablediffusion-ggml/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
 # stablediffusion.cpp (ggml)
 STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
-STABLEDIFFUSION_GGML_VERSION?=90e87bc846f17059771efb8aaa31e9ef0cab6f78
+STABLEDIFFUSION_GGML_VERSION?=0baf721215f45335a5df8caf0ecb34e870c956e7
 CMAKE_ARGS+=-DGGML_MAX_NAME=128
--- a/backend/go/stablediffusion-ggml/cpp/gosd.cpp
+++ b/backend/go/stablediffusion-ggml/cpp/gosd.cpp
@@ -1188,6 +1188,9 @@ int gen_video(sd_vid_gen_params_t *p, int steps, char *dst, float cfg_scale, int
    p->high_noise_sample_params.scheduler                = scheduler;
    p->high_noise_sample_params.flow_shift               = flow_shift;
    // Pin output fps in params; upstream uses it for audio sync (and we also mux at this rate).
    p->fps = fps;
    // Load init/end reference images if provided (resized to output dims).
    uint8_t* init_buf = nullptr;
    uint8_t* end_buf  = nullptr;
@@ -1206,11 +1209,14 @@ int gen_video(sd_vid_gen_params_t *p, int steps, char *dst, float cfg_scale, int
    // Generate
    int num_frames_out = 0;
-    sd_image_t* frames = generate_video(sd_c, p, &num_frames_out);
+    sd_image_t* frames = nullptr;
+    sd_audio_t* audio = nullptr;
+    bool ok = generate_video(sd_c, p, &frames, &num_frames_out, &audio);
    std::free(p);
-    if (!frames || num_frames_out == 0) {
+    if (!ok || !frames || num_frames_out == 0) {
        fprintf(stderr, "generate_video produced no frames\n");
        if (audio) free_sd_audio(audio);
        if (init_buf) free(init_buf);
        if (end_buf) free(end_buf);
        return 1;
@@ -1224,6 +1230,7 @@ int gen_video(sd_vid_gen_params_t *p, int steps, char *dst, float cfg_scale, int
        if (frames[i].data) free(frames[i].data);
    }
    free(frames);
    if (audio) free_sd_audio(audio);
    if (init_buf) free(init_buf);
    if (end_buf) free(end_buf);
--- a/backend/go/whisper/Makefile
+++ b/backend/go/whisper/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
 # whisper.cpp version
 WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
-WHISPER_CPP_VERSION?=c33c5618b72bb345df029b730b36bc0e369845a3
+WHISPER_CPP_VERSION?=0ccd896f5b882628e1c077f9769735ef4ce52860
 SO_TARGET?=libgowhisper.so
 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -847,6 +847,35 @@
    nvidia-l4t-cuda-12: "nvidia-l4t-vibevoice"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vibevoice"
  icon: https://avatars.githubusercontent.com/u/6154722?s=200&v=4
 - &liquid-audio
  urls:
    - https://github.com/Liquid4All/liquid-audio
    - https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B
  description: |
    LiquidAI LFM2 / LFM2.5 Audio Python backend. End-to-end speech-to-speech, ASR,
    TTS (4 baked voices), and text chat from a single 1.5B model. Wraps the
    upstream `liquid-audio` package; supports fine-tuning via LocalAI's
    /v1/fine-tuning/jobs endpoint.
  tags:
    - speech-to-speech
    - any-to-any
    - text-to-speech
    - speech-to-text
    - TTS
    - ASR
    - realtime
  license: LFM-Open-License-v1.0
  name: "liquid-audio"
  alias: "liquid-audio"
  capabilities:
    nvidia: "cuda12-liquid-audio"
    intel: "intel-liquid-audio"
    amd: "rocm-liquid-audio"
    default: "cpu-liquid-audio"
    nvidia-cuda-13: "cuda13-liquid-audio"
    nvidia-cuda-12: "cuda12-liquid-audio"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-liquid-audio"
  icon: https://cdn-avatars.huggingface.co/v1/production/uploads/61b8e2ba285851687028d395/7_6D7rWrLxp2hb6OHSV1p.png
 - &qwen-tts
  urls:
    - https://github.com/QwenLM/Qwen3-TTS
@@ -3437,6 +3466,77 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-vibevoice"
  mirrors:
    - localai/localai-backends:master-metal-darwin-arm64-vibevoice
 ## liquid-audio
 - !!merge <<: *liquid-audio
  name: "liquid-audio-development"
  capabilities:
    nvidia: "cuda12-liquid-audio-development"
    intel: "intel-liquid-audio-development"
    amd: "rocm-liquid-audio-development"
    default: "cpu-liquid-audio-development"
    nvidia-cuda-13: "cuda13-liquid-audio-development"
    nvidia-cuda-12: "cuda12-liquid-audio-development"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-liquid-audio-development"
 - !!merge <<: *liquid-audio
  name: "cpu-liquid-audio"
  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-liquid-audio"
  mirrors:
    - localai/localai-backends:latest-cpu-liquid-audio
 - !!merge <<: *liquid-audio
  name: "cpu-liquid-audio-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-liquid-audio"
  mirrors:
    - localai/localai-backends:master-cpu-liquid-audio
 - !!merge <<: *liquid-audio
  name: "cuda12-liquid-audio"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-liquid-audio"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-12-liquid-audio
 - !!merge <<: *liquid-audio
  name: "cuda12-liquid-audio-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-liquid-audio"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-liquid-audio
 - !!merge <<: *liquid-audio
  name: "cuda13-liquid-audio"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-liquid-audio"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-13-liquid-audio
 - !!merge <<: *liquid-audio
  name: "cuda13-liquid-audio-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-liquid-audio"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-13-liquid-audio
 - !!merge <<: *liquid-audio
  name: "intel-liquid-audio"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-liquid-audio"
  mirrors:
    - localai/localai-backends:latest-gpu-intel-liquid-audio
 - !!merge <<: *liquid-audio
  name: "intel-liquid-audio-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-liquid-audio"
  mirrors:
    - localai/localai-backends:master-gpu-intel-liquid-audio
 - !!merge <<: *liquid-audio
  name: "rocm-liquid-audio"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-liquid-audio"
  mirrors:
    - localai/localai-backends:latest-gpu-rocm-hipblas-liquid-audio
 - !!merge <<: *liquid-audio
  name: "rocm-liquid-audio-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-liquid-audio"
  mirrors:
    - localai/localai-backends:master-gpu-rocm-hipblas-liquid-audio
 - !!merge <<: *liquid-audio
  name: "cuda13-nvidia-l4t-arm64-liquid-audio"
  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-liquid-audio"
  mirrors:
    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-liquid-audio
 - !!merge <<: *liquid-audio
  name: "cuda13-nvidia-l4t-arm64-liquid-audio-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-liquid-audio"
  mirrors:
    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-liquid-audio
 ## qwen-tts
 - !!merge <<: *qwen-tts
  name: "qwen-tts-development"
--- a/backend/python/liquid-audio/Makefile
+++ b/backend/python/liquid-audio/Makefile
@@ -0,0 +1,23 @@
 .PHONY: liquid-audio
 liquid-audio:
 	bash install.sh
 .PHONY: run
 run: liquid-audio
 	@echo "Running liquid-audio..."
 	bash run.sh
 	@echo "liquid-audio run."
 .PHONY: test
 test: liquid-audio
 	@echo "Testing liquid-audio..."
 	bash test.sh
 	@echo "liquid-audio tested."
 .PHONY: protogen-clean
 protogen-clean:
 	$(RM) backend_pb2_grpc.py backend_pb2.py
 .PHONY: clean
 clean: protogen-clean
 	rm -rf venv __pycache__
--- a/backend/python/liquid-audio/backend.py
+++ b/backend/python/liquid-audio/backend.py
@@ -0,0 +1,871 @@
 #!/usr/bin/env python3
 """
 Liquid Audio backend for LocalAI.
 Wraps LiquidAI's `liquid-audio` Python package (https://github.com/Liquid4All/liquid-audio).
 The same model serves four roles, selected by the `mode` option at load time:
 chat, asr, tts, s2s. Fine-tuning is exposed via StartFineTune.
 """
 from concurrent import futures
 import argparse
 import json
 import os
 import queue
 import signal
 import sys
 import threading
 import time
 import traceback
 import uuid
 import grpc
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
 from grpc_auth import get_auth_interceptors  # noqa: E402
 from python_utils import parse_options  # noqa: E402
 import backend_pb2  # noqa: E402
 import backend_pb2_grpc  # noqa: E402
 _ONE_DAY_IN_SECONDS = 60 * 60 * 24
 MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))
 # Voice id → system-prompt suffix. The model only ships these four voices.
 VOICE_PROMPTS = {
    "us_male":   "Perform TTS. Use the US male voice.",
    "us_female": "Perform TTS. Use the US female voice.",
    "uk_male":   "Perform TTS. Use the UK male voice.",
    "uk_female": "Perform TTS. Use the UK female voice.",
 }
 DEFAULT_VOICE = "us_female"
 # Special-token IDs that LFM2-Audio emits to delimit modality boundaries.
 # Sourced from liquid_audio/model/lfm2_audio.py (see generate_sequential/_sample_*).
 TEXT_END_TOKEN = 130        # <|text_end|>
 AUDIO_START_TOKEN = 128     # <|audio_start|>
 IM_END_TOKEN = 7            # <|im_end|>
 AUDIO_EOS_CODE = 2048       # signals end-of-audio in any codebook position
 _PATCHED_LOCAL_PATHS = False
 def _patch_liquid_audio_local_paths():
    """Make liquid_audio.utils.get_model_dir() tolerate local directories.
    Upstream always passes its argument to huggingface_hub.snapshot_download,
    which only accepts `owner/repo` ids. LocalAI's gallery hands us absolute
    paths under <ModelPath>/<owner>/<repo>, so we intercept snapshot_download
    in the liquid_audio.utils namespace and return the directory as-is when
    it already exists on disk. Idempotent.
    """
    global _PATCHED_LOCAL_PATHS
    if _PATCHED_LOCAL_PATHS:
        return
    import liquid_audio.utils as _la_utils
    _orig_snapshot_download = _la_utils.snapshot_download
    def _local_first_snapshot_download(repo_id, revision=None, **kwargs):
        if isinstance(repo_id, (str, os.PathLike)) and os.path.isdir(str(repo_id)):
            return str(repo_id)
        return _orig_snapshot_download(repo_id, revision=revision, **kwargs)
    _la_utils.snapshot_download = _local_first_snapshot_download
    _PATCHED_LOCAL_PATHS = True
 def _select_device():
    import torch
    if torch.cuda.is_available():
        return "cuda"
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return "mps"
    return "cpu"
 class ActiveJob:
    """Tracks an in-flight fine-tune so FineTuneProgress can stream from its queue."""
    def __init__(self, job_id):
        self.job_id = job_id
        self.progress_queue = queue.Queue()
        self.thread = None
        self.stopped = False
        self.completed = False
        self.error = None
 class BackendServicer(backend_pb2_grpc.BackendServicer):
    def __init__(self):
        self.processor = None
        self.model = None
        self.device = "cpu"
        self.dtype = None
        self.options = {}
        self.model_id = None
        self.active_job = None
    @property
    def mode(self):
        return str(self.options.get("mode", "chat")).lower()
    @property
    def voice(self):
        v = str(self.options.get("voice", DEFAULT_VOICE)).lower()
        return v if v in VOICE_PROMPTS else DEFAULT_VOICE
    def Free(self, request, context):
        # Called by LocalAI when unloading the model. Drop GPU tensors so the
        # next load starts from a clean state instead of bumping into OOM.
        try:
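             for attr in ("model", "processor", "tokenizer"):
                 if hasattr(self, attr):
                     # Reset to None rather than delattr: the request handlers
                     # test `self.model is None`, which would raise
                     # AttributeError once the attribute is gone.
                     setattr(self, attr, None)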
            import gc
            gc.collect()
            try:
                import torch
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
            except Exception:
                pass
            return backend_pb2.Result(success=True, message="OK")
        except Exception as exc:
            print(f"Free failed: {exc}", file=sys.stderr)
            return backend_pb2.Result(success=False, message=str(exc))
    def Health(self, request, context):
        return backend_pb2.Reply(message=bytes("OK", 'utf-8'))
    def LoadModel(self, request, context):
        try:
            import torch
            self.options = parse_options(request.Options)
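             # Compare case-insensitively, matching the `voice` property and TTS.
             requested_voice = self.options.get("voice")
             if requested_voice and str(requested_voice).lower() not in VOICE_PROMPTS:
                 print(f"Warning: unknown voice '{requested_voice}'; defaulting to '{DEFAULT_VOICE}'",
                       file=sys.stderr)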
            requested_device = self.options.get("device")
            self.device = requested_device or _select_device()
            if self.device == "cuda" and not torch.cuda.is_available():
                return backend_pb2.Result(success=False, message="CUDA requested but not available")
            if self.device == "mps" and not (hasattr(torch.backends, "mps") and
                                             torch.backends.mps.is_available()):
                print("MPS not available; falling back to CPU", file=sys.stderr)
                self.device = "cpu"
            dtype_name = str(self.options.get("dtype", "bfloat16")).lower()
            self.dtype = {
                "bfloat16": torch.bfloat16,
                "bf16":     torch.bfloat16,
                "float16":  torch.float16,
                "fp16":     torch.float16,
                "half":     torch.float16,
                "float32":  torch.float32,
                "fp32":     torch.float32,
            }.get(dtype_name, torch.bfloat16)
            # request.Model holds the raw `parameters.model` value (an HF
            # repo id like "LiquidAI/LFM2.5-Audio-1.5B"); request.ModelFile
            # is LocalAI's ModelPath-prefixed local copy that exists only
            # when the gallery supplied a `files:` list. Mirror the
            # transformers/vibevoice convention: prefer the repo id and
            # only switch to the local path if it's been staged on disk.
            model_id = request.Model
            if not model_id:
                model_id = request.ModelFile
            if not model_id:
                return backend_pb2.Result(success=False, message="No model identifier provided")
            if request.ModelFile and os.path.isdir(request.ModelFile):
                model_id = request.ModelFile
            self.model_id = model_id
            # Pure fine-tune jobs don't need an in-memory inference model — the
            # Trainer instantiates its own copy at StartFineTune time.
            if self.mode == "finetune":
                print(f"Loaded liquid-audio backend in fine-tune mode (model id: {model_id})",
                      file=sys.stderr)
                return backend_pb2.Result(success=True, message="OK")
            from liquid_audio import LFM2AudioModel, LFM2AudioProcessor
            # liquid_audio's from_pretrained unconditionally routes through
            # huggingface_hub.snapshot_download, which rejects local paths
            # (HFValidationError on `/models/LiquidAI/LFM2.5-Audio-1.5B`).
            # When LocalAI's gallery has already staged the weights on disk,
            # short-circuit the download to return the local directory.
            _patch_liquid_audio_local_paths()
            print(f"Loading liquid-audio model '{model_id}' on {self.device} ({self.dtype})",
                  file=sys.stderr)
            self.processor = LFM2AudioProcessor.from_pretrained(model_id, device=self.device).eval()
            self.model = LFM2AudioModel.from_pretrained(
                model_id, device=self.device, dtype=self.dtype
            ).eval()
            print(f"Liquid-audio mode={self.mode}, voice={self.voice}", file=sys.stderr)
            return backend_pb2.Result(success=True, message="OK")
        except Exception as exc:
            print(f"LoadModel failed: {exc}", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
            return backend_pb2.Result(success=False, message=str(exc))
    def Predict(self, request, context):
        try:
            text = "".join(self._generate_text_stream(request))
            return backend_pb2.Reply(message=text.encode("utf-8"))
        except Exception as exc:
            print(f"Predict failed: {exc}", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
            context.set_code(grpc.StatusCode.INTERNAL)
            context.set_details(str(exc))
            return backend_pb2.Reply()
    def PredictStream(self, request, context):
        try:
            for delta in self._generate_text_stream(request):
                yield backend_pb2.Reply(message=delta.encode("utf-8"))
        except Exception as exc:
            print(f"PredictStream failed: {exc}", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
            context.set_code(grpc.StatusCode.INTERNAL)
            context.set_details(str(exc))
    def VAD(self, request, context):
        # Stub voice-activity detector: RMS-energy threshold over 30ms frames at
        # 16 kHz. Good enough for the realtime endpoint's handleVAD loop, which
        # only inspects segment presence + last segment end. The proper signal
        # would come from the model's audio encoder, but that ride-along is a
        # PR-D scope item — until then this keeps the legacy pipeline path
        # working without forcing the operator to install a separate VAD model.
        import numpy as np
        try:
            audio = np.asarray(request.audio, dtype=np.float32)
            if audio.size == 0:
                return backend_pb2.VADResponse(segments=[])
            sample_rate = 16000
            frame_size = sample_rate * 30 // 1000  # 30ms → 480 samples
            threshold = float(self.options.get("vad_rms_threshold", 0.01))
            min_speech_frames = int(self.options.get("vad_min_speech_frames", 2))  # ≥60ms
            # handleVAD ticks every 300 ms and only inspects segment presence
            # + last segment end relative to silence_threshold (~500 ms). Cap
            # the analysed window to the tail of the buffer so we don't redo
            # the entire growing utterance every tick.
            window_s = float(self.options.get("vad_window_s", 5.0))
            window_samples = int(window_s * sample_rate)
            time_offset_s = 0.0
            if audio.size > window_samples:
                time_offset_s = (audio.size - window_samples) / sample_rate
                audio = audio[-window_samples:]
            n_frames = audio.size // frame_size
            if n_frames == 0:
                return backend_pb2.VADResponse(segments=[])
            frames = audio[: n_frames * frame_size].reshape(n_frames, frame_size)
            rms = np.sqrt(np.mean(frames ** 2, axis=1))
            speech = rms > threshold
            def _emit(start_idx, end_idx, out):
                if end_idx - start_idx >= min_speech_frames:
                    out.append(backend_pb2.VADSegment(
                        start=time_offset_s + start_idx * frame_size / sample_rate,
                        end=time_offset_s + end_idx * frame_size / sample_rate,
                    ))
            segments = []
            start_idx = None
            for i, is_speech in enumerate(speech):
                if is_speech and start_idx is None:
                    start_idx = i
                elif not is_speech and start_idx is not None:
                    _emit(start_idx, i, segments)
                    start_idx = None
            if start_idx is not None:
                _emit(start_idx, n_frames, segments)
            return backend_pb2.VADResponse(segments=segments)
        except Exception as exc:
            print(f"VAD failed: {exc}", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
            context.set_code(grpc.StatusCode.INTERNAL)
            context.set_details(str(exc))
            return backend_pb2.VADResponse(segments=[])
    def TTS(self, request, context):
        try:
            if self.model is None or self.processor is None:
                return backend_pb2.Result(success=False, message="Model not loaded")
            import torch
            import torchaudio
            from liquid_audio import ChatState
            voice = request.voice.lower() if request.voice else self.voice
            voice = voice.removeprefix("lfm2:").removeprefix("lfm:")
            if voice not in VOICE_PROMPTS:
                voice = self.voice
            system_prompt = VOICE_PROMPTS[voice]
            chat = ChatState(self.processor)
            chat.new_turn("system")
            chat.add_text(system_prompt)
            chat.end_turn()
            chat.new_turn("user")
            chat.add_text(request.text or "")
            chat.end_turn()
            chat.new_turn("assistant")
            audio_top_k = int(self.options.get("audio_top_k", 64))
            audio_temp = float(self.options.get("audio_temperature", 0.8))
            max_new = int(self.options.get("max_new_tokens", 2048))
            audio_out = []
            for tok in self.model.generate_sequential(
                **chat,
                max_new_tokens=max_new,
                audio_temperature=audio_temp,
                audio_top_k=audio_top_k,
            ):
                if tok.numel() > 1:
                    audio_out.append(tok)
            if len(audio_out) <= 1:
                return backend_pb2.Result(success=False, message="No audio frames generated")
            # Drop the trailing end-of-audio frame, matching the package's examples.
            audio_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
            waveform = self.processor.decode(audio_codes)
            out_path = request.dst
            if not out_path:
                return backend_pb2.Result(success=False, message="dst path is required")
            os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
            # soundfile in preference to torchaudio.save — the latter routes
            # through torchcodec, whose native libs need NVIDIA NPP that we
            # don't bundle in the cuda13 image.
            import soundfile as _sf
            _sf.write(out_path, waveform.cpu().numpy().squeeze(0).T, 24_000)
            return backend_pb2.Result(success=True)
        except Exception as exc:
            print(f"TTS failed: {exc}", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
            return backend_pb2.Result(success=False, message=str(exc))
    def AudioToAudioStream(self, request_iterator, context):
        """Bidirectional any-to-any speech-to-speech stream.
        See `backend.proto` AudioToAudioStream for the wire protocol. Audio
        is decoded once per turn here; chunked detokenization for sub-second
        TTFB is left to a future iteration once the LFM2AudioDetokenizer
        gains a streaming entry point.
        """
        try:
            yield from self._audio_to_audio_stream(request_iterator, context)
        except Exception as exc:
            print(f"AudioToAudioStream failed: {exc}", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
            yield backend_pb2.AudioToAudioResponse(
                event="error",
                meta=json.dumps({"message": str(exc)}).encode("utf-8"),
            )
    def _audio_to_audio_stream(self, request_iterator, context):
        if self.model is None or self.processor is None:
            raise RuntimeError("Model not loaded")
        import torch
        import torchaudio
        from liquid_audio import ChatState
        cfg = None
        chat = None
        input_sample_rate = 16000
        output_sample_rate = 24000
        sequence = 0
        def _new_event(event, **kwargs):
            nonlocal sequence
            sequence += 1
            kwargs.setdefault("sequence", sequence)
            return backend_pb2.AudioToAudioResponse(event=event, **kwargs)
        def _ensure_chat():
            """Build a fresh ChatState seeded with the system prompt."""
            nonlocal chat
            chat = ChatState(self.processor)
            system_prompt = (cfg.system_prompt if cfg and cfg.system_prompt
                             else "Respond with interleaved text and audio.")
            chat.new_turn("system")
            chat.add_text(system_prompt)
            chat.end_turn()
        # Buffers for the in-flight user turn
        pcm_buffer = bytearray()
        def _consume_user_turn():
            nonlocal pcm_buffer
            if not pcm_buffer:
                return
             # Avoid the bytes(pcm_buffer) copy: view the buffer as int16 via
             # numpy and torch, widen to float32 (one allocation), then scale
             # in-place.
            import numpy as np
            arr = np.frombuffer(memoryview(pcm_buffer), dtype=np.int16)
            wav = torch.from_numpy(arr).to(torch.float32).div_(32768.0).unsqueeze(0)
            chat.new_turn("user")
            chat.add_audio(wav, input_sample_rate)
            chat.end_turn()
            pcm_buffer = bytearray()
        def _run_generation():
            """Run generate_interleaved; yield response events as we go."""
            chat.new_turn("assistant")
            audio_top_k = int(self.options.get("audio_top_k", 4))
            audio_temp = float(self.options.get("audio_temperature", 1.0))
            text_top_k = int(self.options.get("text_top_k", 0)) or None
            text_temp = float(self.options.get("text_temperature", 0)) or None
            max_new = int(self.options.get("max_new_tokens", 512))
            audio_tokens = []
            for tok in self.model.generate_interleaved(
                **chat,
                max_new_tokens=max_new,
                text_temperature=text_temp,
                text_top_k=text_top_k,
                audio_temperature=audio_temp,
                audio_top_k=audio_top_k,
            ):
                if tok.numel() == 1:
                    if tok.item() == IM_END_TOKEN:
                        break
                    text = self.processor.text.decode(tok)
                    if not text:
                        continue
                    yield _new_event(
                        "response.audio_transcript.delta",
                        meta=json.dumps({"delta": text}).encode("utf-8"),
                    )
                else:
                    audio_tokens.append(tok)
            # Detokenize the accumulated audio at end-of-turn — the
            # LFM2AudioDetokenizer is non-streaming today.
            if len(audio_tokens) > 1:
                audio_codes = torch.stack(audio_tokens[:-1], 1).unsqueeze(0)
                waveform = self.processor.decode(audio_codes)
                # Convert to s16le PCM bytes at output_sample_rate
                if output_sample_rate != 24000:
                    waveform = torchaudio.functional.resample(
                        waveform.cpu(), 24000, output_sample_rate
                    )
                pcm = (waveform.cpu().squeeze(0).clamp(-1, 1) * 32767.0).to(
                    torch.int16
                ).numpy().tobytes()
                yield _new_event(
                    "response.audio.delta",
                    pcm=pcm,
                    sample_rate=output_sample_rate,
                )
            yield _new_event("response.done", meta=b"{}")
        for req in request_iterator:
            if not context.is_active():
                return
            payload = req.WhichOneof("payload")
            if payload == "config":
                cfg = req.config
                if cfg.input_sample_rate > 0:
                    input_sample_rate = cfg.input_sample_rate
                if cfg.output_sample_rate > 0:
                    output_sample_rate = cfg.output_sample_rate
                 # Every config message resets the chat state and pending audio.
                _ensure_chat()
                pcm_buffer = bytearray()
            elif payload == "frame":
                if chat is None:
                    _ensure_chat()
                if req.frame.pcm:
                    pcm_buffer.extend(req.frame.pcm)
                if req.frame.end_of_input:
                    _consume_user_turn()
                    yield from _run_generation()
            elif payload == "control":
                event = req.control.event
                if event == "input_audio_buffer.commit":
                    _consume_user_turn()
                    yield from _run_generation()
                elif event == "response.cancel":
                    # Synchronous generation here means cancel can only
                    # take effect between turns; we ack so the client unblocks.
                    yield _new_event("response.done", meta=b'{"cancelled":true}')
                elif event == "session.update":
                    # Free-form session re-config; treat as a soft reset.
                    _ensure_chat()
                    pcm_buffer = bytearray()
                # Unknown events are ignored — forward-compatible.
    def AudioTranscription(self, request, context):
        try:
            if self.model is None or self.processor is None:
                return backend_pb2.TranscriptResult(segments=[], text="")
            import torchaudio
            from liquid_audio import ChatState
            audio_path = request.dst
            if not audio_path:
                return backend_pb2.TranscriptResult(segments=[], text="")
            chat = ChatState(self.processor)
            chat.new_turn("system")
            chat.add_text("Perform ASR.")
            chat.end_turn()
            chat.new_turn("user")
            # soundfile in preference to torchaudio.load — the latter routes
            # through torchcodec which needs NVIDIA NPP libs we don't bundle.
            import soundfile as _sf
            import torch
            audio_np, sr = _sf.read(audio_path, dtype="float32", always_2d=True)
            wav = torch.from_numpy(audio_np.T)  # (channels, samples)
            if wav.shape[0] > 1:
                # Down-mix to mono — the processor expects a single channel
                wav = wav.mean(dim=0, keepdim=True)
            chat.add_audio(wav, sr)
            chat.end_turn()
            chat.new_turn("assistant")
            max_new = int(self.options.get("max_new_tokens", 1024))
            pieces = []
            for tok in self.model.generate_sequential(**chat, max_new_tokens=max_new):
                if tok.numel() == 1:
                    if tok.item() == IM_END_TOKEN:
                        break
                    pieces.append(self.processor.text.decode(tok))
            text = "".join(pieces).strip()
            duration_ms = int((wav.shape[1] / sr) * 1000)
            segment = backend_pb2.TranscriptSegment(
                id=0, start=0, end=duration_ms, text=text, tokens=[],
            )
            return backend_pb2.TranscriptResult(segments=[segment], text=text)
        except Exception as exc:
            print(f"AudioTranscription failed: {exc}", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
            return backend_pb2.TranscriptResult(segments=[], text="")
    def StartFineTune(self, request, context):
        if self.active_job is not None and not self.active_job.completed:
            return backend_pb2.FineTuneJobResult(
                job_id="", success=False,
                message="A fine-tuning job is already running",
            )
        job_id = request.job_id or str(uuid.uuid4())
        job = ActiveJob(job_id)
        self.active_job = job
        thread = threading.Thread(target=self._run_training, args=(request, job), daemon=True)
        job.thread = thread
        thread.start()
        return backend_pb2.FineTuneJobResult(
            job_id=job_id, success=True, message="Training started",
        )
    def FineTuneProgress(self, request, context):
        if self.active_job is None or self.active_job.job_id != request.job_id:
            context.set_code(grpc.StatusCode.NOT_FOUND)
            context.set_details(f"Job {request.job_id} not found")
            return
        job = self.active_job
        while True:
            try:
                update = job.progress_queue.get(timeout=1.0)
            except queue.Empty:
                if job.completed or job.stopped:
                    break
                if not context.is_active():
                    break
                continue
            if update is None:
                break
            yield update
            if update.status in ("completed", "failed", "stopped"):
                break
    def StopFineTune(self, request, context):
        # We can't kill the Accelerate training loop mid-step cleanly from here;
        # LocalAI's job manager kills the backend process on stop. The flag below
        # at least lets the progress stream terminate quickly.
        if self.active_job is not None and self.active_job.job_id == request.job_id:
            self.active_job.stopped = True
            self.active_job.progress_queue.put(None)
        return backend_pb2.Result(success=True, message="OK")
    def _run_training(self, request, job):
        try:
            self._do_train(request, job)
            job.completed = True
            job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
                job_id=job.job_id, status="completed", message="Training completed",
                progress_percent=100.0,
            ))
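         except KeyboardInterrupt:
             # Raised by QueuedTrainer.log when a stop was requested. It derives
             # from BaseException, so `except Exception` below would miss it and
             # leave job.completed False, blocking every later StartFineTune.
             job.completed = True
             job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
                 job_id=job.job_id, status="stopped", message="Training stopped",
             ))
         except Exception as exc: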
            job.error = str(exc)
            job.completed = True
            print(f"Training failed: {exc}", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
            job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
                job_id=job.job_id, status="failed", message=str(exc),
            ))
        finally:
            job.progress_queue.put(None)
    def _do_train(self, request, job):
        from liquid_audio import LFM2AudioModel  # noqa: F401  (sanity import)
        from liquid_audio.data.dataloader import LFM2DataLoader
        from liquid_audio.trainer import Trainer
        model_id = request.model or self.model_id or "LiquidAI/LFM2.5-Audio-1.5B"
        dataset_path = request.dataset_source
        if not dataset_path:
            raise ValueError("dataset_source is required (path to a preprocessed dataset)")
        extras = dict(request.extra_options) if request.extra_options else {}
        val_path = extras.get("val_dataset")
        # Map FineTuneRequest hyperparameters to liquid_audio.Trainer constructor args
        lr = request.learning_rate or 3e-5
        max_steps = request.max_steps or 1000
        warmup_steps = request.warmup_steps or min(100, max_steps // 10)
        batch_size = request.batch_size or 16
        save_interval = request.save_steps or max(1, max_steps // 4)
        output_dir = request.output_dir or os.path.join(
            os.environ.get("LIQUID_AUDIO_OUTPUT_DIR", "/tmp"),
            f"liquid-audio-{job.job_id}",
        )
        os.makedirs(output_dir, exist_ok=True)
        job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
            job_id=job.job_id, status="loading_dataset",
            message=f"Loading preprocessed dataset from {dataset_path}",
        ))
        train_data = LFM2DataLoader(dataset_path)
        val_data = LFM2DataLoader(val_path) if val_path else None
        job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
            job_id=job.job_id, status="loading_model",
            message=f"Loading base model {model_id}",
        ))
        # The Liquid Trainer logs via self.accelerator.print; we subclass it to
        # also push progress events onto the queue every logging_interval steps.
        progress_q = job.progress_queue
        class QueuedTrainer(Trainer):
            def log(self_, model_output):
                if self_.step > 0 and self_.step % self_.logging_interval == 0:
                    try:
                        loss = self_.accelerator.reduce(
                            model_output.loss.detach(), reduction="mean"
                        ).item()
                    except Exception:
                        loss = float("nan")
                    lr_now = self_.optimizer.param_groups[0]["lr"]
                    pct = (self_.step / self_.max_steps * 100.0) if self_.max_steps else 0.0
                    progress_q.put(backend_pb2.FineTuneProgressUpdate(
                        job_id=job.job_id,
                        current_step=int(self_.step),
                        total_steps=int(self_.max_steps),
                        current_epoch=float(self_.epoch),
                        loss=float(loss),
                        learning_rate=float(lr_now),
                        progress_percent=float(pct),
                        status="training",
                    ))
                # Honour stop requests: raising here terminates the loop cleanly
                if job.stopped:
                    raise KeyboardInterrupt("stop requested")
                return super().log(model_output)
            def validate(self_):
                progress_q.put(backend_pb2.FineTuneProgressUpdate(
                    job_id=job.job_id, current_step=int(self_.step),
                    total_steps=int(self_.max_steps), status="training",
                    message=f"Running validation at step {self_.step}",
                ))
                return super().validate()
        trainer = QueuedTrainer(
            model_id=model_id,
            train_data=train_data,
            val_data=val_data,
            lr=lr,
            max_steps=max_steps,
            warmup_steps=warmup_steps,
            batch_size=batch_size,
            save_interval=save_interval,
            output_dir=output_dir,
            weight_decay=request.weight_decay or 0.1,
        )
        job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
            job_id=job.job_id, status="training", message="Training started",
            total_steps=int(max_steps),
        ))
        trainer.train()
        job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
            job_id=job.job_id, status="saving",
            message=f"Saved final model to {output_dir}",
            checkpoint_path=os.path.join(output_dir, "final"),
        ))
    def _build_chat_state(self, messages, user_prompt, tools_prelude=None):
        """Build a ChatState from a list of (role, content) tuples plus an optional final user turn.
        tools_prelude, when non-empty, is prepended as an extra system turn carrying
        the LFM2 tool-list block — mirrors gallery/lfm.yaml's `function:` template
        so the model sees the same prompt shape whether served via llama-cpp or here.
        """
        from liquid_audio import ChatState
        chat = ChatState(self.processor)
        if tools_prelude:
            chat.new_turn("system")
            chat.add_text(tools_prelude)
            chat.end_turn()
        for role, content in messages:
            chat.new_turn(role)
            chat.add_text(content)
            chat.end_turn()
        if user_prompt:
            chat.new_turn("user")
            chat.add_text(user_prompt)
            chat.end_turn()
        chat.new_turn("assistant")
        return chat
    def _collect_messages(self, request):
        """Translate PredictOptions.Messages into (role, content) tuples."""
        out = []
        for m in request.Messages:
            role = (m.role or "user").lower()
            if role not in ("system", "user", "assistant"):
                role = "user"
            out.append((role, m.content or ""))
        return out
    def _render_tools_prelude(self, request):
        """Build the LFM2 `<|tool_list_start|>…<|tool_list_end|>` system prelude
        from request.Tools (OpenAI Chat-Completions tool JSON). Returns "" when
        no tools are attached. Output mirrors gallery/lfm.yaml's `function:`
        template so the model sees the same prompt whether routed via llama-cpp
        or this backend."""
        tools_raw = getattr(request, "Tools", "") or ""
        if not tools_raw:
            return ""
        try:
            tools = json.loads(tools_raw)
        except json.JSONDecodeError:
            print(f"liquid-audio: ignoring malformed Tools JSON: {tools_raw[:200]!r}",
                  file=sys.stderr)
            return ""
        if not isinstance(tools, list) or not tools:
            return ""
        # The LFM2 chat template uses single-quoted Python-dict-ish syntax in
        # examples, but the tokenizer treats this whole block as opaque text;
        # JSON works fine and is what other backends emit.
        return (
            "You are a function calling AI model. You are provided with functions to "
            "execute. You may call one or more functions to assist with the user query. "
            "Don't make assumptions about what values to plug into functions.\n"
            "List of tools: <|tool_list_start|>"
            + json.dumps(tools, separators=(",", ":"))
            + "<|tool_list_end|>"
        )
    def _generate_text_stream(self, request):
        """Yield text-only deltas from generate_sequential. Caller joins for unary Predict."""
        if self.model is None or self.processor is None:
            raise RuntimeError("Model not loaded")
        messages = self._collect_messages(request)
        user_prompt = request.Prompt or None
        tools_prelude = self._render_tools_prelude(request)
        # If the request already carries Messages, Prompt is the templated form
        # of the same content — don't append a duplicate user turn.
        chat = self._build_chat_state(
            messages,
            user_prompt if not messages else None,
            tools_prelude=tools_prelude,
        )
        max_new = request.Tokens if request.Tokens > 0 else int(self.options.get("max_new_tokens", 512))
        temperature = request.Temperature if request.Temperature > 0 else None
        top_k = request.TopK if request.TopK > 0 else None
        for tok in self.model.generate_sequential(
            **chat,
            max_new_tokens=max_new,
            text_temperature=temperature,
            text_top_k=top_k,
        ):
            if tok.numel() == 1:
                if tok.item() == IM_END_TOKEN:
                    break
                yield self.processor.text.decode(tok)
 def serve(address):
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=MAX_WORKERS),
        options=[
            ('grpc.max_message_length', 50 * 1024 * 1024),
            ('grpc.max_send_message_length', 50 * 1024 * 1024),
            ('grpc.max_receive_message_length', 50 * 1024 * 1024),
        ],
        interceptors=get_auth_interceptors(),
    )
    backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
    server.add_insecure_port(address)
    server.start()
    print(f"Liquid-audio backend listening on {address}", file=sys.stderr, flush=True)
    def stop(_signum, _frame):
        server.stop(0)
        sys.exit(0)
    signal.signal(signal.SIGTERM, stop)
    signal.signal(signal.SIGINT, stop)
    try:
        while True:
            time.sleep(_ONE_DAY_IN_SECONDS)
    except KeyboardInterrupt:
        server.stop(0)
 if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Liquid Audio gRPC backend")
    parser.add_argument("--addr", default="localhost:50051", help="gRPC server address")
    args = parser.parse_args()
    serve(args.addr)
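Review note on `_generate_text_stream` above: request fields that are not positive fall back to a default (or to `None`), and decoding stops at the end-of-turn token without emitting it. The following is a minimal sketch of that control flow with no model involved. `FakeRequest`, `resolve_sampling`, `stream_until_end`, and the value `IM_END_TOKEN = 7` are hypothetical stand-ins for illustration, not names or values taken from liquid-audio.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

IM_END_TOKEN = 7  # hypothetical id; the real value is defined by the backend


@dataclass
class FakeRequest:
    """Hypothetical stand-in for the PredictOptions fields read above."""
    Tokens: int = 0
    Temperature: float = 0.0
    TopK: int = 0


def resolve_sampling(request: FakeRequest, options: dict) -> dict:
    """Apply the same fallback rule: non-positive values mean 'not set'."""
    return {
        "max_new_tokens": request.Tokens if request.Tokens > 0
        else int(options.get("max_new_tokens", 512)),
        "text_temperature": request.Temperature if request.Temperature > 0 else None,
        "text_top_k": request.TopK if request.TopK > 0 else None,
    }


def stream_until_end(token_ids: Iterable[int]) -> Iterator[int]:
    """Yield token ids until the end-of-turn marker, which is not yielded."""
    for tok in token_ids:
        if tok == IM_END_TOKEN:
            break
        yield tok
```

One consequence worth noting: a request with `Temperature=0.0` is forwarded as `None` rather than as an explicit zero, so the sampling behaviour in that case is whatever the model call does when no temperature is supplied.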
--- a/backend/python/liquid-audio/install.sh
+++ b/backend/python/liquid-audio/install.sh
@@ -0,0 +1,18 @@
 #!/bin/bash
 set -e
 # liquid-audio requires Python ≥ 3.12 (per its pyproject.toml); the default
 # portable Python in libbackend.sh is 3.10. Override before sourcing.
 export PYTHON_VERSION="${PYTHON_VERSION:-3.12}"
 export PYTHON_PATCH="${PYTHON_PATCH:-11}"
 backend_dir=$(dirname $0)
 if [ -d $backend_dir/common ]; then
    source $backend_dir/common/libbackend.sh
 else
    source $backend_dir/../common/libbackend.sh
 fi
 # liquid-audio's torch wheels are large; allow upgrades to satisfy transitive pins
 EXTRA_PIP_INSTALL_FLAGS+=" --upgrade --index-strategy=unsafe-first-match"
 installRequirements
--- a/backend/python/liquid-audio/protogen.sh
+++ b/backend/python/liquid-audio/protogen.sh
@@ -0,0 +1,11 @@
 #!/bin/bash
 set -e
 backend_dir=$(dirname $0)
 if [ -d $backend_dir/common ]; then
    source $backend_dir/common/libbackend.sh
 else
    source $backend_dir/../common/libbackend.sh
 fi
 runProtogen
--- a/backend/python/liquid-audio/requirements-cpu.txt
+++ b/backend/python/liquid-audio/requirements-cpu.txt
@@ -0,0 +1,13 @@
 --extra-index-url https://download.pytorch.org/whl/cpu
 torch>=2.8.0
 torchaudio>=2.8.0
 torchcodec>=0.9.1
 transformers>=4.55.4
 accelerate>=1.10.1
 datasets>=4.8.4
 einops>=0.8.1
 librosa>=0.11.0
 soundfile>=0.12.1
 sentencepiece>=0.2.1
 huggingface-hub>=1.3.0
 liquid-audio>=1.2.0
--- a/backend/python/liquid-audio/requirements-cublas12.txt
+++ b/backend/python/liquid-audio/requirements-cublas12.txt
@@ -0,0 +1,13 @@
 --extra-index-url https://download.pytorch.org/whl/cu121
 torch>=2.8.0
 torchaudio>=2.8.0
 torchcodec>=0.9.1
 transformers>=4.55.4
 accelerate>=1.10.1
 datasets>=4.8.4
 einops>=0.8.1
 librosa>=0.11.0
 soundfile>=0.12.1
 sentencepiece>=0.2.1
 huggingface-hub>=1.3.0
 liquid-audio>=1.2.0
--- a/backend/python/liquid-audio/requirements-cublas13.txt
+++ b/backend/python/liquid-audio/requirements-cublas13.txt
@@ -0,0 +1,13 @@
 --extra-index-url https://download.pytorch.org/whl/cu130
 torch>=2.8.0
 torchaudio>=2.8.0
 torchcodec>=0.9.1
 transformers>=4.55.4
 accelerate>=1.10.1
 datasets>=4.8.4
 einops>=0.8.1
 librosa>=0.11.0
 soundfile>=0.12.1
 sentencepiece>=0.2.1
 huggingface-hub>=1.3.0
 liquid-audio>=1.2.0
--- a/backend/python/liquid-audio/requirements-hipblas.txt
+++ b/backend/python/liquid-audio/requirements-hipblas.txt
@@ -0,0 +1,13 @@
 --extra-index-url https://download.pytorch.org/whl/rocm7.0
 torch>=2.8.0
 torchaudio>=2.8.0
 torchcodec>=0.9.1
 transformers>=4.55.4
 accelerate>=1.10.1
 datasets>=4.8.4
 einops>=0.8.1
 librosa>=0.11.0
 soundfile>=0.12.1
 sentencepiece>=0.2.1
 huggingface-hub>=1.3.0
 liquid-audio>=1.2.0
--- a/backend/python/liquid-audio/requirements-l4t13.txt
+++ b/backend/python/liquid-audio/requirements-l4t13.txt
@@ -0,0 +1,13 @@
 --extra-index-url https://pypi.jetson-ai-lab.io/jp7/cu130
 torch>=2.8.0
 torchaudio>=2.8.0
 torchcodec>=0.9.1
 transformers>=4.55.4
 accelerate>=1.10.1
 datasets>=4.8.4
 einops>=0.8.1
 librosa>=0.11.0
 soundfile>=0.12.1
 sentencepiece>=0.2.1
 huggingface-hub>=1.3.0
 liquid-audio>=1.2.0
--- a/backend/python/liquid-audio/requirements-mps.txt
+++ b/backend/python/liquid-audio/requirements-mps.txt
@@ -0,0 +1,12 @@
 torch>=2.8.0
 torchaudio>=2.8.0
 torchcodec>=0.9.1
 transformers>=4.55.4
 accelerate>=1.10.1
 datasets>=4.8.4
 einops>=0.8.1
 librosa>=0.11.0
 soundfile>=0.12.1
 sentencepiece>=0.2.1
 huggingface-hub>=1.3.0
 liquid-audio>=1.2.0
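Review note: the per-profile requirement files above carry the same twelve pins and differ only in the `--extra-index-url` line. A small check along the following lines could catch drift between them. This is a sketch under assumptions: `pinned_requirements` and `profiles_in_sync` are hypothetical helper names, and the parsing only handles the simple one-requirement-per-line layout used here.

```python
def pinned_requirements(text: str) -> list[str]:
    """Return requirement lines, skipping blanks, comments, and pip options."""
    pins = []
    for raw in text.splitlines():
        line = raw.strip()
        # Lines starting with '-' are pip options such as --extra-index-url.
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        pins.append(line)
    return pins


def profiles_in_sync(files: dict[str, str]) -> bool:
    """True when every profile file carries the same set of pins."""
    pin_sets = {frozenset(pinned_requirements(body)) for body in files.values()}
    return len(pin_sets) <= 1
```

In practice the `files` mapping would be filled by reading each `requirements-*.txt` from the backend directory; the comparison ignores ordering and index options on purpose.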
--- a/backend/python/liquid-audio/requirements.txt
+++ b/backend/python/liquid-audio/requirements.txt
@@ -0,0 +1,3 @@
 grpcio==1.78.1
 protobuf
 certifi
--- a/backend/python/liquid-audio/run.sh
+++ b/backend/python/liquid-audio/run.sh
@@ -0,0 +1,10 @@
 #!/bin/bash
 backend_dir=$(dirname $0)
 if [ -d $backend_dir/common ]; then
    source $backend_dir/common/libbackend.sh
 else
    source $backend_dir/../common/libbackend.sh
 fi
 startBackend $@
--- a/backend/python/liquid-audio/test.py
+++ b/backend/python/liquid-audio/test.py
@@ -0,0 +1,89 @@
 """Smoke tests for the liquid-audio backend.
 These run without contacting HuggingFace or loading model weights:
 they only verify that the gRPC service starts and Health() responds.
 To run an end-to-end inference test, set LIQUID_AUDIO_MODEL_ID
 (e.g. "LiquidAI/LFM2.5-Audio-1.5B") in the environment — see test_inference().
 """
 import os
 import subprocess
 import sys
 import time
 import unittest
 import grpc
 # Ensure generated protobuf stubs are importable
 sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
 import backend_pb2
 import backend_pb2_grpc
 class TestBackend(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        addr = os.environ.get("LIQUID_AUDIO_TEST_ADDR", "localhost:50053")
        cls.addr = addr
        cls.server = subprocess.Popen(
            [sys.executable, os.path.join(os.path.dirname(__file__), "backend.py"), "--addr", addr],
        )
        time.sleep(2)  # Give the server a moment to bind
    @classmethod
    def tearDownClass(cls):
        cls.server.terminate()
        try:
            cls.server.wait(timeout=5)
        except subprocess.TimeoutExpired:
            cls.server.kill()
    def _stub(self):
        channel = grpc.insecure_channel(self.addr)
        return backend_pb2_grpc.BackendStub(channel)
    def test_health(self):
        stub = self._stub()
        reply = stub.Health(backend_pb2.HealthMessage(), timeout=5)
        self.assertEqual(reply.message, b"OK")
    def test_load_finetune_mode_without_weights(self):
        """Loading in fine-tune mode should succeed without pulling model weights."""
        stub = self._stub()
        result = stub.LoadModel(
            backend_pb2.ModelOptions(
                Model="LiquidAI/LFM2.5-Audio-1.5B",
                Options=["mode:finetune"],
            ),
            timeout=10,
        )
        self.assertTrue(result.success, msg=result.message)
    @unittest.skipUnless(os.environ.get("LIQUID_AUDIO_MODEL_ID"),
                         "Set LIQUID_AUDIO_MODEL_ID to run an end-to-end inference smoke test")
    def test_inference(self):
        """End-to-end: load a real LFM2-Audio model and run one short prediction."""
        stub = self._stub()
        model_id = os.environ["LIQUID_AUDIO_MODEL_ID"]
        result = stub.LoadModel(
            backend_pb2.ModelOptions(
                Model=model_id,
                Options=["mode:chat"],
            ),
            timeout=600,
        )
        self.assertTrue(result.success, msg=result.message)
        reply = stub.Predict(
            backend_pb2.PredictOptions(
                Prompt="Hello!",
                Tokens=8,
                Temperature=0.0,
            ),
            timeout=120,
        )
        self.assertGreater(len(reply.message), 0)
 if __name__ == "__main__":
    unittest.main()
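Review note on `setUpClass` above: the fixed `time.sleep(2)` can be too short on a slow machine and is wasted time on a fast one. One possible alternative is to poll the port until it accepts a TCP connection. The sketch below uses only the standard library; `wait_for_port` is a hypothetical helper, and a successful connect only shows that the port is bound, not that every RPC is ready.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0, interval: float = 0.1) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Connection is closed immediately; we only care that it was accepted.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

In the test this would replace the sleep with something like `wait_for_port("localhost", 50053)` after `subprocess.Popen`, failing the setup if it returns `False`. With grpcio already imported, `grpc.channel_ready_future(channel).result(timeout=...)` is another option that waits on the channel itself.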
--- a/backend/python/liquid-audio/test.sh
+++ b/backend/python/liquid-audio/test.sh
@@ -0,0 +1,11 @@
 #!/bin/bash
 set -e
 backend_dir=$(dirname $0)
 if [ -d $backend_dir/common ]; then
    source $backend_dir/common/libbackend.sh
 else
    source $backend_dir/../common/libbackend.sh
 fi
 runUnittests
--- a/backend/python/sglang/install.sh
+++ b/backend/python/sglang/install.sh
@@ -36,15 +36,11 @@ fi
 # flash-attn-4 4.0 stable lands.
 EXTRA_PIP_INSTALL_FLAGS+=" --prerelease=allow"
-# JetPack 7 / L4T arm64 wheels are built for cp312 and shipped via
+# JetPack 7 / L4T arm64 sglang + torch wheels come straight from PyPI now
-# pypi.jetson-ai-lab.io. Bump the venv Python so the prebuilt sglang
+# (torch 2.11+ ships aarch64 + cu130 manylinux wheels and sglang 0.5.11+
-# wheel resolves cleanly. The actual install on l4t13 goes through
+# ships a cp312 aarch64 wheel pinned to that torch). They're cp312-only,
-# pyproject.toml (see the elif branch below) so [tool.uv.sources] can
+# so bump the venv Python accordingly.
-# pin only torch/torchvision/torchaudio/sglang to the jetson-ai-lab
+# https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
-# index — leaving PyPI as the path for transitive deps like
-# markdown-it-py / anthropic / propcache that the L4T mirror's proxy
-# 503s on. No --index-strategy flag here: the explicit index keeps the
-# scoping clean.
 if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
    PYTHON_VERSION="3.12"
    PYTHON_PATCH="12"
@@ -110,27 +106,6 @@ if [ "x${BUILD_TYPE}" == "x" ] || [ "x${FROM_SOURCE:-}" == "xtrue" ]; then
        fi
        uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} .
    popd
-# L4T arm64 (JetPack 7): drive the install through pyproject.toml so that
-# [tool.uv.sources] can pin torch/torchvision/torchaudio/sglang to the
-# jetson-ai-lab index, while everything else (transitive deps and
-# PyPI-resolvable packages like transformers / accelerate) comes from
-# PyPI. Bypasses installRequirements because uv pip install -r
-# requirements.txt does not honor sources — see
-# backend/python/sglang/pyproject.toml for the rationale. Mirrors the
-# equivalent path in backend/python/vllm/install.sh.
-elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
-    ensureVenv
-    if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then
-        export C_INCLUDE_PATH="${C_INCLUDE_PATH:-}:$(_portable_dir)/include/python${PYTHON_VERSION}"
-    fi
-    pushd "${backend_dir}"
-        # Build deps first (matches installRequirements' requirements-install.txt
-        # pass — sglang/sgl-kernel sdists need packaging/setuptools-scm in the
-        # venv before they can build under --no-build-isolation).
-        uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements-install.txt
-        uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --requirement pyproject.toml
-    popd
-    runProtogen
 else
    installRequirements
 fi
--- a/backend/python/sglang/pyproject.toml
+++ b/backend/python/sglang/pyproject.toml
@@ -1,68 +0,0 @@
 # L4T arm64 (JetPack 7 / sbsa cu130) install spec for the sglang backend.
 #
 # Why this file exists, and why only the l4t13 BUILD_PROFILE consumes it:
 #
 # pypi.jetson-ai-lab.io hosts the L4T-specific torch / sglang / sgl-kernel
 # wheels we need on aarch64 + cuda13, but it ALSO transparently proxies the
 # rest of PyPI through `/+f/<sha>/<filename>` URLs that 503 frequently.
 # With `--extra-index-url` + `--index-strategy=unsafe-best-match` (the
 # historical fix in install.sh) uv would pick those proxy URLs for ordinary
 # PyPI packages — markdown-it-py, anthropic, propcache, etc. — and trip on
 # the 503s. See e.g. CI run 25439791228 (markdown-it-py-4.0.0).
 #
 # `explicit = true` on the index makes uv consult the L4T mirror ONLY for
 # packages mapped under [tool.uv.sources]. Everything else goes to PyPI.
 # This breaks the historical 503 path without losing access to the L4T
 # wheels we actually need from there. Mirrors the equivalent fix already
 # in backend/python/vllm/pyproject.toml.
 #
 # `uv pip install -r requirements.txt` does NOT honor [tool.uv.sources]
 # (sources are project-mode only, not pip-compat mode), so install.sh's
 # l4t13 branch invokes `uv pip install --requirement pyproject.toml`
 # directly. Other BUILD_PROFILEs continue to use the requirements-*.txt
 # pipeline through libbackend.sh's installRequirements and never read
 # this file.
 [project]
 name = "localai-sglang-l4t13"
 version = "0.0.0"
 requires-python = ">=3.12,<3.13"
 dependencies = [
    # Mirror of requirements.txt — kept in sync manually for now since the
    # l4t13 path bypasses installRequirements (see install.sh).
    "grpcio==1.80.0",
    "protobuf",
    "certifi",
    "setuptools",
    "pillow",
    # L4T-specific accelerator stack (sourced from jetson-ai-lab below).
    "torch",
    "torchvision",
    "torchaudio",
    # sglang on jetson — the [all] extra is deliberately omitted because it
    # pulls outlines/decord, and decord has no aarch64 cp312 wheel anywhere
    # (PyPI nor the jetson-ai-lab index ships only legacy cp35-cp37). With
    # [all] uv backtracks through versions trying to satisfy decord and
    # lands on sglang==0.1.16. The 0.5.0 floor matches the only major
    # series the jetson-ai-lab sbsa/cu130 mirror currently publishes
    # (sglang==0.5.1.post2 as of 2026-05-06). Bumping to >=0.5.11 here
    # would make the build unsatisfiable until the mirror catches up.
    # Gemma 4 / MTP recipes are therefore not supported on l4t13 — those
    # features land on cublas12/cublas13 hosts that pull the newer wheel
    # from PyPI. backend.py keeps backward compat with the 0.5.x SamplingParams
    # field rename via runtime detection.
    "sglang>=0.5.0",
    # PyPI-resolvable packages that complete the runtime.
    "accelerate",
    "transformers",
 ]
 [[tool.uv.index]]
 name = "jetson-ai-lab"
 url = "https://pypi.jetson-ai-lab.io/sbsa/cu130"
 explicit = true
 [tool.uv.sources]
 torch = { index = "jetson-ai-lab" }
 torchvision = { index = "jetson-ai-lab" }
 torchaudio = { index = "jetson-ai-lab" }
 sglang = { index = "jetson-ai-lab" }
--- a/backend/python/sglang/requirements-l4t13-after.txt
+++ b/backend/python/sglang/requirements-l4t13-after.txt
@@ -0,0 +1,15 @@
 # sglang 0.5.11+ ships an aarch64 manylinux wheel on PyPI whose Requires-Dist
 # pins torch==2.11.0 / torchaudio==2.11.0, locking an ABI-consistent set with
 # the cu130 torch wheel installed above. 0.5.11 is the floor for Gemma 4
 # support (sgl-project/sglang#21952).
 #
 # The [all] extra is deliberately NOT used on aarch64: it pulls the
 # [diffusion] sub-extra which requires `xatlas`, and xatlas ships no
 # aarch64 wheel and its sdist depends on scikit_build_core without
 # declaring it in build-system.requires — so under --no-build-isolation
 # uv can't build it. Upstream sglang gates st_attn and vsa on
 # platform_machine != aarch64 in the diffusion extra but forgot xatlas.
 # Plain `sglang` carries everything backend.py uses (Engine, ServerArgs,
 # FunctionCallParser, ReasoningParser); the [all] extras are optional
 # accelerators not required at import time.
 sglang>=0.5.11
--- a/backend/python/sglang/requirements-l4t13.txt
+++ b/backend/python/sglang/requirements-l4t13.txt
@@ -0,0 +1,9 @@
 # JetPack 7 / L4T arm64 + CUDA 13. Since PyTorch 2.11 (April 2026), PyPI ships
 # aarch64 + cu130 manylinux wheels for torch/torchvision/torchaudio directly,
 # so we no longer need a custom --extra-index-url for the L4T mirror.
 # https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
 accelerate
 torch
 torchvision
 torchaudio
 transformers
--- a/backend/python/transformers/requirements-cpu.txt
+++ b/backend/python/transformers/requirements-cpu.txt
@@ -2,9 +2,9 @@ torch==2.7.1
 llvmlite==0.43.0
 numba==0.60.0
 accelerate
-transformers>=5.8.0
+transformers>=5.8.1
 bitsandbytes
-sentence-transformers==5.4.0
+sentence-transformers==5.5.0
 diffusers
 soundfile
 protobuf==6.33.5
--- a/backend/python/transformers/requirements-cublas12.txt
+++ b/backend/python/transformers/requirements-cublas12.txt
@@ -2,9 +2,9 @@ torch==2.7.1
 accelerate
 llvmlite==0.43.0
 numba==0.60.0
-transformers>=5.8.0
+transformers>=5.8.1
 bitsandbytes
-sentence-transformers==5.4.0
+sentence-transformers==5.5.0
 diffusers
 soundfile
 protobuf==6.33.5
--- a/backend/python/transformers/requirements-cublas13.txt
+++ b/backend/python/transformers/requirements-cublas13.txt
@@ -2,9 +2,9 @@
 torch==2.9.0
 llvmlite==0.43.0
 numba==0.60.0
-transformers>=5.8.0
+transformers>=5.8.1
 bitsandbytes
-sentence-transformers==5.4.0
+sentence-transformers==5.5.0
 diffusers
 soundfile
 protobuf==6.33.5
--- a/backend/python/transformers/requirements-hipblas.txt
+++ b/backend/python/transformers/requirements-hipblas.txt
@@ -1,11 +1,11 @@
 --extra-index-url https://download.pytorch.org/whl/rocm7.0
 torch==2.10.0+rocm7.0
 accelerate
-transformers>=5.8.0
+transformers>=5.8.1
 llvmlite==0.43.0
 numba==0.60.0
 bitsandbytes
-sentence-transformers==5.4.0
+sentence-transformers==5.5.0
 diffusers
 soundfile
 protobuf==6.33.5
--- a/backend/python/transformers/requirements-intel.txt
+++ b/backend/python/transformers/requirements-intel.txt
@@ -3,9 +3,9 @@ torch
 optimum[openvino]
 llvmlite==0.43.0
 numba==0.60.0
-transformers>=5.8.0
+transformers>=5.8.1
 bitsandbytes
-sentence-transformers==5.4.0
+sentence-transformers==5.5.0
 diffusers
 soundfile
 protobuf==6.33.5
--- a/backend/python/transformers/requirements-mps.txt
+++ b/backend/python/transformers/requirements-mps.txt
@@ -2,9 +2,9 @@ torch==2.7.1
 llvmlite==0.43.0
 numba==0.60.0
 accelerate
-transformers>=5.8.0
+transformers>=5.8.1
 bitsandbytes
-sentence-transformers==5.4.0
+sentence-transformers==5.5.0
 diffusers
 soundfile
 protobuf==6.33.5
--- a/backend/python/vllm-omni/install.sh
+++ b/backend/python/vllm-omni/install.sh
@@ -13,14 +13,14 @@ else
 fi
 # Handle l4t build profiles (Python 3.12, pip fallback) if needed.
-# unsafe-best-match is required on l4t13 because the jetson-ai-lab index
+# Since PyTorch 2.11 (April 2026) PyPI ships aarch64 + cu130 manylinux wheels
-# lists transitive deps at limited versions — without it uv pins to the
+# directly for torch/torchvision/torchaudio and an aarch64 vllm wheel pinned
-# first matching index and fails to resolve a compatible wheel from PyPI.
+# to that torch, so the jetson-ai-lab mirror is no longer needed.
+# https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
 if [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
  PYTHON_VERSION="3.12"
  PYTHON_PATCH="12"
  PY_STANDALONE_TAG="20251120"
-  EXTRA_PIP_INSTALL_FLAGS="${EXTRA_PIP_INSTALL_FLAGS:-} --index-strategy=unsafe-best-match"
 fi
 if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then
@@ -42,18 +42,11 @@ if [ "x${BUILD_TYPE}" == "xhipblas" ]; then
    else
        uv pip install vllm==0.14.0 --extra-index-url https://wheels.vllm.ai/rocm/0.14.0/rocm700
    fi
-elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
+elif [ "x${BUILD_PROFILE}" == "xcublas13" ] || [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
-    # JetPack 7 / L4T arm64 cu130 — vllm comes from the prebuilt SBSA wheel
+    # cublas13 (x86_64) and l4t13 (aarch64) both pull vllm from PyPI now:
-    # at jetson-ai-lab. Version is unpinned: the index ships whatever build
+    # vllm 0.19+ defaults to cu130 wheels on x86_64 and vllm 0.20+ ships an
-    # matches the cu130/cp312 ABI. unsafe-best-match lets uv fall through
+    # aarch64 manylinux wheel pinned to torch==2.11.0. No extra index needed
-    # to PyPI for transitive deps not present on the jetson-ai-lab index.
+    # in either case.
-    if [ "x${USE_PIP}" == "xtrue" ]; then
-        pip install vllm --extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
-    else
-        uv pip install --index-strategy=unsafe-best-match vllm --extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
-    fi
-elif [ "x${BUILD_PROFILE}" == "xcublas13" ]; then
-    # vllm 0.19+ defaults to cu130 wheels on PyPI, no extra index needed.
    if [ "x${USE_PIP}" == "xtrue" ]; then
        pip install vllm --torch-backend=auto
    else
--- a/backend/python/vllm-omni/requirements-l4t13.txt
+++ b/backend/python/vllm-omni/requirements-l4t13.txt
@@ -1,11 +1,15 @@
---extra-index-url https://pypi.jetson-ai-lab.io/sbsa/cu130
+# JetPack 7 / L4T arm64 + CUDA 13. PyPI ships aarch64 + cu130 manylinux wheels
+# for torch/torchvision/torchaudio directly since PyTorch 2.11 (April 2026),
+# so no custom index is needed. flash-attn is dropped here: PyPI has no
+# aarch64 wheel for it, but vLLM 0.20+ bundles its own vllm_flash_attn
+# (fa2 + fa3) inside the main wheel, so it is not required at runtime.
+# https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
 accelerate
 torch
 torchvision
 torchaudio
 transformers
 bitsandbytes
-flash-attn
 diffusers
 librosa
 soundfile
--- a/backend/python/vllm/install.sh
+++ b/backend/python/vllm/install.sh
@@ -43,14 +43,11 @@ if [ "x${BUILD_PROFILE}" == "xcublas13" ]; then
    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
 fi
-# JetPack 7 / L4T arm64 wheels (torch, vllm, flash-attn) live on
+# JetPack 7 / L4T arm64 vllm + torch wheels come straight from PyPI now
-# pypi.jetson-ai-lab.io and are built for cp312, so bump the venv Python
+# (torch 2.11+ ships aarch64 + cu130 manylinux wheels and vllm 0.20+ ships
-# accordingly. JetPack 6 keeps cp310 + USE_PIP=true.
+# an aarch64 wheel pinned to that torch). They're cp312-only, so bump the
-#
+# venv Python accordingly. JetPack 6 keeps cp310 + USE_PIP=true.
-# l4t13 uses pyproject.toml (see the elif branch below) to pin only the
+# https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
-# L4T-specific wheels to the jetson-ai-lab index via [tool.uv.sources].
-# That keeps PyPI as the resolution path for transitive deps like
-# anthropic/openai/propcache, which the L4T mirror's proxy 503s on.
 if [ "x${BUILD_PROFILE}" == "xl4t12" ]; then
    USE_PIP=true
 fi
@@ -103,25 +100,6 @@ if [ "x${BUILD_TYPE}" == "xintel" ]; then
        export CMAKE_PREFIX_PATH="$(python -c 'import site; print(site.getsitepackages()[0])'):${CMAKE_PREFIX_PATH:-}"
        VLLM_TARGET_DEVICE=xpu uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --no-deps .
    popd
-# L4T arm64 (JetPack 7): drive the install through pyproject.toml so that
-# [tool.uv.sources] can pin torch/vllm/flash-attn/torchvision/torchaudio
-# to the jetson-ai-lab index, while everything else (transitive deps and
-# PyPI-resolvable packages like transformers) comes from PyPI. Bypasses
-# installRequirements because uv pip install -r requirements.txt does not
-# honor sources — see backend/python/vllm/pyproject.toml for the rationale.
-elif [ "x${BUILD_PROFILE}" == "xl4t13" ]; then
-    ensureVenv
-    if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then
-        export C_INCLUDE_PATH="${C_INCLUDE_PATH:-}:$(_portable_dir)/include/python${PYTHON_VERSION}"
-    fi
-    pushd "${backend_dir}"
-        # Build deps first (matches installRequirements' requirements-install.txt
-        # pass — fastsafetensors and friends need pybind11 in the venv before
-        # their sdists can build under --no-build-isolation).
-        uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} -r requirements-install.txt
-        uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --requirement pyproject.toml
-    popd
-    runProtogen
 # FROM_SOURCE=true on a CPU build skips the prebuilt vllm wheel in
 # requirements-cpu-after.txt and compiles vllm locally against the host's
 # actual CPU. Not used by default because it takes ~30-40 minutes, but
--- a/backend/python/vllm/pyproject.toml
+++ b/backend/python/vllm/pyproject.toml
@@ -1,61 +0,0 @@
 # L4T arm64 (JetPack 7 / sbsa cu130) install spec for the vllm backend.
 #
 # Why this file exists, and why only the l4t13 BUILD_PROFILE consumes it:
 #
 # pypi.jetson-ai-lab.io hosts the L4T-specific torch / vllm / flash-attn
 # wheels we need on aarch64 + cuda13, but it ALSO transparently proxies the
 # rest of PyPI through `/+f/<sha>/<filename>` URLs that 503 frequently. With
 # `--extra-index-url` + `--index-strategy=unsafe-best-match` (the historical
 # fix in install.sh) uv would pick those proxy URLs for ordinary PyPI
 # packages — `anthropic`, `openai`, `propcache`, `annotated-types` — and
 # trip on the 503s. See e.g. CI run 25212201349 (anthropic-0.97.0).
 #
 # `explicit = true` on the index makes uv consult the L4T mirror ONLY for
 # packages mapped under [tool.uv.sources]. Everything else goes to PyPI.
 # This breaks the historical 503 path without losing access to the L4T
 # wheels we actually need from there.
 #
 # `uv pip install -r requirements.txt` does NOT honor [tool.uv.sources]
 # (sources are project-mode only, not pip-compat mode), so install.sh's
 # l4t13 branch invokes `uv pip install --requirement pyproject.toml`
 # directly. Other BUILD_PROFILEs continue to use the requirements-*.txt
 # pipeline through libbackend.sh's installRequirements and never read
 # this file.
 [project]
 name = "localai-vllm-l4t13"
 version = "0.0.0"
 requires-python = ">=3.12,<3.13"
 dependencies = [
    # Mirror of requirements.txt — kept in sync manually for now since the
    # l4t13 path bypasses installRequirements (see install.sh).
    "grpcio==1.80.0",
    "protobuf",
    "certifi",
    "setuptools",
    "pillow",
    "charset-normalizer>=3.4.7",
    "chardet",
    # L4T-specific accelerator stack (sourced from jetson-ai-lab below).
    "torch",
    "torchvision",
    "torchaudio",
    "flash-attn",
    "vllm",
    # PyPI-resolvable packages that complete the runtime — accelerate,
    # transformers, bitsandbytes carry their own wheels for aarch64.
    "accelerate",
    "transformers",
    "bitsandbytes",
 ]
 [[tool.uv.index]]
 name = "jetson-ai-lab"
 url = "https://pypi.jetson-ai-lab.io/sbsa/cu130"
 explicit = true
 [tool.uv.sources]
 torch = { index = "jetson-ai-lab" }
 torchvision = { index = "jetson-ai-lab" }
 torchaudio = { index = "jetson-ai-lab" }
 flash-attn = { index = "jetson-ai-lab" }
 vllm = { index = "jetson-ai-lab" }
--- a/backend/python/vllm/requirements-cublas13-after.txt
+++ b/backend/python/vllm/requirements-cublas13-after.txt
@@ -3,5 +3,5 @@
 # on a cu130 host. Pull the cu130-flavoured wheel from vLLM's per-tag index
 # instead — the cublas13 case in install.sh adds --index-strategy=unsafe-best-match
 # so uv consults this index alongside PyPI.
---extra-index-url https://wheels.vllm.ai/0.20.2/cu130
+--extra-index-url https://wheels.vllm.ai/0.21.0/cu130
-vllm==0.20.2
+vllm==0.21.0
--- a/backend/python/vllm/requirements-l4t13-after.txt
+++ b/backend/python/vllm/requirements-l4t13-after.txt
@@ -0,0 +1,4 @@
 # vLLM 0.20+ ships an aarch64 manylinux wheel on PyPI whose Requires-Dist pins
 # torch==2.11.0 / torchvision==0.26.0 / torchaudio==2.11.0, locking an ABI-
 # consistent set with the cu130 torch wheel installed above.
 vllm
--- a/backend/python/vllm/requirements-l4t13.txt
+++ b/backend/python/vllm/requirements-l4t13.txt
@@ -0,0 +1,8 @@
 # JetPack 7 / L4T arm64 + CUDA 13. Since PyTorch 2.11 (April 2026), PyPI ships
 # aarch64 + cu130 manylinux wheels for torch/torchvision/torchaudio directly,
 # so we no longer need a custom --extra-index-url for the L4T mirror.
 # https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
 accelerate
 torch
 transformers
 bitsandbytes
--- a/core/application/distributed.go
+++ b/core/application/distributed.go
@@ -169,7 +169,7 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB, configLoade
 		cfg.Distributed.HealthCheckIntervalOrDefault(),
 		cfg.Distributed.StaleNodeThresholdOrDefault(),
 		routerAuthToken,
-		cfg.Distributed.PerModelHealthCheck,
+		!cfg.Distributed.DisablePerModelHealthCheck,
 	)
 	// Initialize job store
@@ -233,7 +233,12 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB, configLoade
 		xlog.Info("File stager initialized (HTTP direct transfer)")
 	}
 	// Create RemoteUnloaderAdapter — needed by SmartRouter and startup.go
-	remoteUnloader := nodes.NewRemoteUnloaderAdapter(registry, natsClient)
+	remoteUnloader := nodes.NewRemoteUnloaderAdapter(
+		registry,
+		natsClient,
+		cfg.Distributed.BackendInstallTimeoutOrDefault(),
+		cfg.Distributed.BackendUpgradeTimeoutOrDefault(),
+	)
 	// All dependencies ready — build SmartRouter with all options at once
 	var conflictResolver nodes.ConcurrencyConflictResolver
--- a/core/application/startup.go
+++ b/core/application/startup.go
@@ -17,9 +17,9 @@ import (
 	"github.com/mudler/LocalAI/core/services/jobs"
 	"github.com/mudler/LocalAI/core/services/nodes"
 	"github.com/mudler/LocalAI/core/services/storage"
-	"github.com/mudler/LocalAI/pkg/vram"
 	coreStartup "github.com/mudler/LocalAI/core/startup"
 	"github.com/mudler/LocalAI/internal"
+	"github.com/mudler/LocalAI/pkg/vram"
 	"github.com/mudler/LocalAI/pkg/model"
 	"github.com/mudler/LocalAI/pkg/sanitize"
@@ -200,7 +200,7 @@ func New(opts ...config.AppOption) (*Application, error) {
 				nodes.NewDistributedModelManager(options, application.modelLoader, distSvc.Unloader),
 			)
 			application.galleryService.SetBackendManager(
-				nodes.NewDistributedBackendManager(options, application.modelLoader, distSvc.Unloader, distSvc.Registry),
+				nodes.NewDistributedBackendManager(options, application.modelLoader, distSvc.Unloader, distSvc.Registry, application.galleryService),
 			)
 		}
 	}
@@ -212,12 +212,12 @@ func New(opts ...config.AppOption) (*Application, error) {
 		}
 	}
-	if err := coreStartup.InstallModels(options.Context, application.GalleryService(), options.Galleries, options.BackendGalleries, options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, nil, options.ModelsURL...); err != nil {
+	if err := coreStartup.InstallModels(options.Context, application.GalleryService(), options.Galleries, options.BackendGalleries, options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.RequireBackendIntegrity, nil, options.ModelsURL...); err != nil {
 		xlog.Error("error installing models", "error", err)
 	}
 	for _, backend := range options.ExternalBackends {
-		if err := galleryop.InstallExternalBackend(options.Context, options.BackendGalleries, options.SystemState, application.ModelLoader(), nil, backend, "", ""); err != nil {
+		if err := galleryop.InstallExternalBackend(options.Context, options.BackendGalleries, options.SystemState, application.ModelLoader(), nil, backend, "", "", options.RequireBackendIntegrity); err != nil {
 			xlog.Error("error installing external backend", "error", err)
 		}
 	}
@@ -267,13 +267,13 @@ func New(opts ...config.AppOption) (*Application, error) {
 	}
 	if options.PreloadJSONModels != "" {
-		if err := galleryop.ApplyGalleryFromString(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadJSONModels); err != nil {
+		if err := galleryop.ApplyGalleryFromString(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadJSONModels, options.RequireBackendIntegrity); err != nil {
 			return nil, err
 		}
 	}
 	if options.PreloadModelsFromPath != "" {
-		if err := galleryop.ApplyGalleryFromFile(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadModelsFromPath); err != nil {
+		if err := galleryop.ApplyGalleryFromFile(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadModelsFromPath, options.RequireBackendIntegrity); err != nil {
 			return nil, err
 		}
 	}
@@ -552,6 +552,13 @@ func loadRuntimeSettingsFromFile(options *config.ApplicationConfig) {
 			options.TracingMaxItems = *settings.TracingMaxItems
 		}
 	}
+	if settings.TracingMaxBodyBytes != nil {
+		// Allow the on-disk setting to override the CLI/env default. The
+		// startup default is non-zero (see NewApplicationConfig), so a plain
+		// `== 0` guard like the others would never trigger; we instead respect
+		// any value the file specifies. 0 in the file means "uncapped".
+		options.TracingMaxBodyBytes = *settings.TracingMaxBodyBytes
+	}
 	// Branding / whitelabeling. There are no env vars for these — the file is
 	// the only source — so apply unconditionally. Without this block a server
--- a/core/application/upgrade_checker.go
+++ b/core/application/upgrade_checker.go
@@ -217,7 +217,7 @@ func (uc *UpgradeChecker) runCheck(ctx context.Context) {
 				err = bm.UpgradeBackend(ctx, name, nil)
 			} else {
 				err = gallery.UpgradeBackend(ctx, uc.systemState, uc.modelLoader,
-					uc.galleries, name, nil)
+					uc.galleries, name, nil, uc.appConfig.RequireBackendIntegrity)
 			}
 			if err != nil {
 				xlog.Error("Failed to auto-upgrade backend",
--- a/core/backend/llm.go
+++ b/core/backend/llm.go
@@ -86,7 +86,7 @@ func ModelInference(ctx context.Context, s string, messages schema.Messages, ima
 		if !slices.Contains(modelNames, modelName) {
 			utils.ResetDownloadTimers()
 			// if we failed to load the model, we try to download it
-			err := gallery.InstallModelFromGallery(ctx, o.Galleries, o.BackendGalleries, o.SystemState, loader, modelName, gallery.GalleryModel{}, utils.DisplayDownloadFunction, o.EnforcePredownloadScans, o.AutoloadBackendGalleries)
+			err := gallery.InstallModelFromGallery(ctx, o.Galleries, o.BackendGalleries, o.SystemState, loader, modelName, gallery.GalleryModel{}, utils.DisplayDownloadFunction, o.EnforcePredownloadScans, o.AutoloadBackendGalleries, o.RequireBackendIntegrity)
 			if err != nil {
 				xlog.Error("failed to install model from gallery", "error", err, "model", modelFile)
 				//return nil, err
--- a/core/backend/options.go
+++ b/core/backend/options.go
@@ -277,7 +277,7 @@ func gRPCPredictOpts(c config.ModelConfig, modelPath string) *pb.PredictOptions
 		MinP:                float32(*c.MinP),
 		Tokens:              int32(*c.Maxtokens),
 		Threads:             int32(*c.Threads),
-		PromptCacheAll:      c.PromptCacheAll,
+		PromptCacheAll:      *c.PromptCacheAll,
 		PromptCacheRO:       c.PromptCacheRO,
 		PromptCachePath:     promptCachePath,
 		F16KV:               *c.F16,
--- a/core/cli/backends.go
+++ b/core/cli/backends.go
@@ -17,9 +17,10 @@ import (
 )
 type BackendsCMDFlags struct {
-	BackendGalleries   string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
+	BackendGalleries        string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
-	BackendsPath       string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"storage"`
+	BackendsPath            string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"storage"`
-	BackendsSystemPath string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
+	BackendsSystemPath      string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
+	RequireBackendIntegrity bool   `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, reject backend installs without a configured signature verification policy (OCI URIs) or SHA256 (tarball/HTTP URIs)." group:"hardening" default:"false"`
 }
 type BackendsList struct {
@@ -126,7 +127,7 @@ func (bi *BackendsInstall) Run(ctx *cliContext.Context) error {
 	}
 	modelLoader := model.NewModelLoader(systemState)
-	err = galleryop.InstallExternalBackend(context.Background(), galleries, systemState, modelLoader, progressCallback, bi.BackendArgs, bi.Name, bi.Alias)
+	err = galleryop.InstallExternalBackend(context.Background(), galleries, systemState, modelLoader, progressCallback, bi.BackendArgs, bi.Name, bi.Alias, bi.RequireBackendIntegrity)
 	if err != nil {
 		return err
 	}
@@ -197,7 +198,7 @@ func (bu *BackendsUpgrade) Run(ctx *cliContext.Context) error {
 			}
 		}
-		if err := gallery.UpgradeBackend(context.Background(), systemState, modelLoader, galleries, name, progressCallback); err != nil {
+		if err := gallery.UpgradeBackend(context.Background(), systemState, modelLoader, galleries, name, progressCallback, bu.RequireBackendIntegrity); err != nil {
 			fmt.Printf("Failed to upgrade %s: %v\n", name, err)
 		} else {
 			fmt.Printf("Backend %s upgraded successfully\n", name)
--- a/core/cli/models.go
+++ b/core/cli/models.go
@@ -32,6 +32,7 @@ type ModelsList struct {
 type ModelsInstall struct {
 	DisablePredownloadScan   bool     `env:"LOCALAI_DISABLE_PREDOWNLOAD_SCAN" help:"If true, disables the best-effort security scanner before downloading any files." group:"hardening" default:"false"`
+	RequireBackendIntegrity  bool     `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, reject backend installs without a configured signature verification policy (OCI URIs) or SHA256 (tarball/HTTP URIs)." group:"hardening" default:"false"`
 	AutoloadBackendGalleries bool     `env:"LOCALAI_AUTOLOAD_BACKEND_GALLERIES" help:"If true, automatically loads backend galleries" group:"backends" default:"true"`
 	ModelArgs                []string `arg:"" optional:"" name:"models" help:"Model configuration URLs to load"`
@@ -71,7 +72,6 @@ func (ml *ModelsList) Run(ctx *cliContext.Context) error {
 }
 func (mi *ModelsInstall) Run(ctx *cliContext.Context) error {
 	systemState, err := system.GetSystemState(
 		system.WithModelPath(mi.ModelsPath),
 		system.WithBackendPath(mi.BackendsPath),
@@ -135,7 +135,7 @@ func (mi *ModelsInstall) Run(ctx *cliContext.Context) error {
 		}
 		modelLoader := model.NewModelLoader(systemState)
-		err = startup.InstallModels(context.Background(), galleryService, galleries, backendGalleries, systemState, modelLoader, !mi.DisablePredownloadScan, mi.AutoloadBackendGalleries, progressCallback, modelName)
+		err = startup.InstallModels(context.Background(), galleryService, galleries, backendGalleries, systemState, modelLoader, !mi.DisablePredownloadScan, mi.AutoloadBackendGalleries, mi.RequireBackendIntegrity, progressCallback, modelName)
 		if err != nil {
 			return err
 		}
--- a/core/cli/run.go
+++ b/core/cli/run.go
@@ -39,19 +39,19 @@ type RunCMD struct {
 	LocalaiConfigDir             string        `env:"LOCALAI_CONFIG_DIR" type:"path" default:"${basepath}/configuration" help:"Directory for dynamic loading of certain configuration files (currently api_keys.json and external_backends.json)" group:"storage"`
 	LocalaiConfigDirPollInterval time.Duration `env:"LOCALAI_CONFIG_DIR_POLL_INTERVAL" help:"Typically the config path picks up changes automatically, but if your system has broken fsnotify events, set this to an interval to poll the LocalAI Config Dir (example: 1m)" group:"storage"`
 	// The alias on this option is there to preserve functionality with the old `--config-file` parameter
-	ModelsConfigFile         string   `env:"LOCALAI_MODELS_CONFIG_FILE,CONFIG_FILE" aliases:"config-file" help:"YAML file containing a list of model backend configs" group:"storage"`
+	ModelsConfigFile          string   `env:"LOCALAI_MODELS_CONFIG_FILE,CONFIG_FILE" aliases:"config-file" help:"YAML file containing a list of model backend configs" group:"storage"`
-	BackendGalleries         string   `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
+	BackendGalleries          string   `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
-	Galleries                string   `env:"LOCALAI_GALLERIES,GALLERIES" help:"JSON list of galleries" group:"models" default:"${galleries}"`
+	Galleries                 string   `env:"LOCALAI_GALLERIES,GALLERIES" help:"JSON list of galleries" group:"models" default:"${galleries}"`
-	AutoloadGalleries        bool     `env:"LOCALAI_AUTOLOAD_GALLERIES,AUTOLOAD_GALLERIES" group:"models" default:"true"`
+	AutoloadGalleries         bool     `env:"LOCALAI_AUTOLOAD_GALLERIES,AUTOLOAD_GALLERIES" group:"models" default:"true"`
-	AutoloadBackendGalleries bool     `env:"LOCALAI_AUTOLOAD_BACKEND_GALLERIES,AUTOLOAD_BACKEND_GALLERIES" group:"backends" default:"true"`
+	AutoloadBackendGalleries  bool     `env:"LOCALAI_AUTOLOAD_BACKEND_GALLERIES,AUTOLOAD_BACKEND_GALLERIES" group:"backends" default:"true"`
-	BackendImagesReleaseTag  string   `env:"LOCALAI_BACKEND_IMAGES_RELEASE_TAG,BACKEND_IMAGES_RELEASE_TAG" help:"Fallback release tag for backend images" group:"backends" default:"latest"`
+	BackendImagesReleaseTag   string   `env:"LOCALAI_BACKEND_IMAGES_RELEASE_TAG,BACKEND_IMAGES_RELEASE_TAG" help:"Fallback release tag for backend images" group:"backends" default:"latest"`
-	BackendImagesBranchTag   string   `env:"LOCALAI_BACKEND_IMAGES_BRANCH_TAG,BACKEND_IMAGES_BRANCH_TAG" help:"Fallback branch tag for backend images" group:"backends" default:"master"`
+	BackendImagesBranchTag    string   `env:"LOCALAI_BACKEND_IMAGES_BRANCH_TAG,BACKEND_IMAGES_BRANCH_TAG" help:"Fallback branch tag for backend images" group:"backends" default:"master"`
-	BackendDevSuffix         string   `env:"LOCALAI_BACKEND_DEV_SUFFIX,BACKEND_DEV_SUFFIX" help:"Development suffix for backend images" group:"backends" default:"development"`
+	BackendDevSuffix          string   `env:"LOCALAI_BACKEND_DEV_SUFFIX,BACKEND_DEV_SUFFIX" help:"Development suffix for backend images" group:"backends" default:"development"`
 	AutoUpgradeBackends       bool     `env:"LOCALAI_AUTO_UPGRADE_BACKENDS,AUTO_UPGRADE_BACKENDS" help:"Automatically upgrade backends when new versions are detected" group:"backends" default:"false"`
 	PreferDevelopmentBackends bool     `env:"LOCALAI_PREFER_DEV_BACKENDS,PREFER_DEV_BACKENDS" help:"Prefer development backend versions (shows development backends by default in UI)" group:"backends" default:"false"`
 	PreloadModels             string   `env:"LOCALAI_PRELOAD_MODELS,PRELOAD_MODELS" help:"A List of models to apply in JSON at start" group:"models"`
-	Models                   []string `env:"LOCALAI_MODELS,MODELS" help:"A List of model configuration URLs to load" group:"models"`
+	Models                    []string `env:"LOCALAI_MODELS,MODELS" help:"A List of model configuration URLs to load" group:"models"`
-	PreloadModelsConfig      string   `env:"LOCALAI_PRELOAD_MODELS_CONFIG,PRELOAD_MODELS_CONFIG" help:"A List of models to apply at startup. Path to a YAML config file" group:"models"`
+	PreloadModelsConfig       string   `env:"LOCALAI_PRELOAD_MODELS_CONFIG,PRELOAD_MODELS_CONFIG" help:"A List of models to apply at startup. Path to a YAML config file" group:"models"`
 	F16         bool `name:"f16" env:"LOCALAI_F16,F16" help:"Enable GPU acceleration" group:"performance"`
 	Threads     int  `env:"LOCALAI_THREADS,THREADS" short:"t" help:"Number of threads used for parallel computation. Usage of the number of physical cores in the system is suggested" group:"performance"`
@@ -67,6 +67,7 @@ type RunCMD struct {
 	OllamaAPIRootEndpoint              bool     `env:"LOCALAI_OLLAMA_API_ROOT_ENDPOINT" default:"false" help:"Register Ollama-compatible health check on / (replaces web UI on root path). The /api/* Ollama endpoints are always available regardless of this flag" group:"api"`
 	DisableRuntimeSettings             bool     `env:"LOCALAI_DISABLE_RUNTIME_SETTINGS,DISABLE_RUNTIME_SETTINGS" default:"false" help:"Disables the runtime settings. When set to true, the server will not load the runtime settings from the runtime_settings.json file" group:"api"`
 	DisablePredownloadScan             bool     `env:"LOCALAI_DISABLE_PREDOWNLOAD_SCAN" help:"If true, disables the best-effort security scanner before downloading any files." group:"hardening" default:"false"`
+	RequireBackendIntegrity            bool     `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, backend installs without a configured signature verification policy (for OCI URIs) or SHA256 (for tarball/HTTP URIs) are rejected. Default is to warn and install. Set this in production once your gallery's verification: block is populated." group:"hardening" default:"false"`
 	OpaqueErrors                       bool     `env:"LOCALAI_OPAQUE_ERRORS" default:"false" help:"If true, all error responses are replaced with blank 500 errors. This is intended only for hardening against information leaks and is normally not recommended." group:"hardening"`
 	UseSubtleKeyComparison             bool     `env:"LOCALAI_SUBTLE_KEY_COMPARISON" default:"false" help:"If true, API Key validation comparisons will be performed using constant-time comparisons rather than simple equality. This trades off performance on each request for resiliancy against timing attacks." group:"hardening"`
 	DisableApiKeyRequirementForHttpGet bool     `env:"LOCALAI_DISABLE_API_KEY_REQUIREMENT_FOR_HTTP_GET" default:"false" help:"If true, a valid API key is not required to issue GET requests to portions of the web ui. This should only be enabled in secure testing environments" group:"hardening"`
@@ -99,6 +100,7 @@ type RunCMD struct {
 	LoadToMemory                       []string `env:"LOCALAI_LOAD_TO_MEMORY,LOAD_TO_MEMORY" help:"A list of models to load into memory at startup" group:"models"`
 	EnableTracing                      bool     `env:"LOCALAI_ENABLE_TRACING,ENABLE_TRACING" help:"Enable API tracing" group:"api"`
 	TracingMaxItems                    int      `env:"LOCALAI_TRACING_MAX_ITEMS" default:"1024" help:"Maximum number of traces to keep" group:"api"`
+	TracingMaxBodyBytes                int      `env:"LOCALAI_TRACING_MAX_BODY_BYTES" default:"65536" help:"Maximum bytes captured per request/response body in the trace buffer (0 = uncapped). Caps memory growth from chatty endpoints like /embeddings." group:"api"`
 	AgentJobRetentionDays              int      `env:"LOCALAI_AGENT_JOB_RETENTION_DAYS,AGENT_JOB_RETENTION_DAYS" default:"30" help:"Number of days to keep agent job history (default: 30)" group:"api"`
 	OpenResponsesStoreTTL              string   `env:"LOCALAI_OPEN_RESPONSES_STORE_TTL,OPEN_RESPONSES_STORE_TTL" default:"0" help:"TTL for Open Responses store (e.g., 1h, 30m, 0 = no expiration)" group:"api"`
@@ -143,16 +145,18 @@ type RunCMD struct {
 	DefaultAPIKeyExpiry  string `env:"LOCALAI_DEFAULT_API_KEY_EXPIRY" help:"Default expiry for API keys (e.g. 90d, 1y; empty = no expiry)" group:"auth"`
 	// Distributed / Horizontal Scaling
-	Distributed       bool   `env:"LOCALAI_DISTRIBUTED" default:"false" help:"Enable distributed mode (requires PostgreSQL + NATS)" group:"distributed"`
+	Distributed           bool   `env:"LOCALAI_DISTRIBUTED" default:"false" help:"Enable distributed mode (requires PostgreSQL + NATS)" group:"distributed"`
-	InstanceID        string `env:"LOCALAI_INSTANCE_ID" help:"Unique instance ID for distributed mode (auto-generated UUID if empty)" group:"distributed"`
+	InstanceID            string `env:"LOCALAI_INSTANCE_ID" help:"Unique instance ID for distributed mode (auto-generated UUID if empty)" group:"distributed"`
-	NatsURL           string `env:"LOCALAI_NATS_URL" help:"NATS server URL (e.g., nats://localhost:4222)" group:"distributed"`
+	NatsURL               string `env:"LOCALAI_NATS_URL" help:"NATS server URL (e.g., nats://localhost:4222)" group:"distributed"`
-	StorageURL        string `env:"LOCALAI_STORAGE_URL" help:"S3-compatible storage endpoint URL (e.g., http://minio:9000)" group:"distributed"`
+	StorageURL            string `env:"LOCALAI_STORAGE_URL" help:"S3-compatible storage endpoint URL (e.g., http://minio:9000)" group:"distributed"`
-	StorageBucket     string `env:"LOCALAI_STORAGE_BUCKET" default:"localai" help:"S3 bucket name for object storage" group:"distributed"`
+	StorageBucket         string `env:"LOCALAI_STORAGE_BUCKET" default:"localai" help:"S3 bucket name for object storage" group:"distributed"`
-	StorageRegion     string `env:"LOCALAI_STORAGE_REGION" default:"us-east-1" help:"S3 region" group:"distributed"`
+	StorageRegion         string `env:"LOCALAI_STORAGE_REGION" default:"us-east-1" help:"S3 region" group:"distributed"`
-	StorageAccessKey  string `env:"LOCALAI_STORAGE_ACCESS_KEY" help:"S3 access key ID" group:"distributed"`
+	StorageAccessKey      string `env:"LOCALAI_STORAGE_ACCESS_KEY" help:"S3 access key ID" group:"distributed"`
-	StorageSecretKey  string `env:"LOCALAI_STORAGE_SECRET_KEY" help:"S3 secret access key" group:"distributed"`
+	StorageSecretKey      string `env:"LOCALAI_STORAGE_SECRET_KEY" help:"S3 secret access key" group:"distributed"`
-	RegistrationToken string `env:"LOCALAI_REGISTRATION_TOKEN" help:"Token that backend nodes must provide to register (empty = no auth required)" group:"distributed"`
+	RegistrationToken     string `env:"LOCALAI_REGISTRATION_TOKEN" help:"Token that backend nodes must provide to register (empty = no auth required)" group:"distributed"`
-	AutoApproveNodes  bool   `env:"LOCALAI_AUTO_APPROVE_NODES" default:"false" help:"Auto-approve new worker nodes (skip admin approval)" group:"distributed"`
+	AutoApproveNodes      bool   `env:"LOCALAI_AUTO_APPROVE_NODES" default:"false" help:"Auto-approve new worker nodes (skip admin approval)" group:"distributed"`
+	BackendInstallTimeout string `env:"LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT" help:"NATS round-trip timeout for backend.install requests sent to worker nodes (default 15m). Increase for slow links pulling multi-GB images." group:"distributed"`
+	BackendUpgradeTimeout string `env:"LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT" help:"NATS round-trip timeout for backend.upgrade requests (default 15m)." group:"distributed"`
 	Version bool
 }
@@ -253,6 +257,20 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
 	if r.StorageSecretKey != "" {
 		opts = append(opts, config.WithStorageSecretKey(r.StorageSecretKey))
 	}
+	if r.BackendInstallTimeout != "" {
+		d, err := time.ParseDuration(r.BackendInstallTimeout)
+		if err != nil {
+			return fmt.Errorf("invalid LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT %q: %w", r.BackendInstallTimeout, err)
+		}
+		opts = append(opts, config.WithBackendInstallTimeout(d))
+	}
+	if r.BackendUpgradeTimeout != "" {
+		d, err := time.ParseDuration(r.BackendUpgradeTimeout)
+		if err != nil {
+			return fmt.Errorf("invalid LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT %q: %w", r.BackendUpgradeTimeout, err)
+		}
+		opts = append(opts, config.WithBackendUpgradeTimeout(d))
+	}
 	if r.RegistrationToken != "" {
 		opts = append(opts, config.WithRegistrationToken(r.RegistrationToken))
 	}
@@ -272,6 +290,7 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
 		opts = append(opts, config.EnableTracing)
 	}
 	opts = append(opts, config.WithTracingMaxItems(r.TracingMaxItems))
+	opts = append(opts, config.WithTracingMaxBodyBytes(r.TracingMaxBodyBytes))
 	token := ""
 	if r.Peer2Peer || r.Peer2PeerToken != "" {
@@ -503,6 +522,10 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
 		opts = append(opts, config.WithAutoUpgradeBackends(r.AutoUpgradeBackends))
 	}
+	if r.RequireBackendIntegrity {
+		opts = append(opts, config.WithRequireBackendIntegrity(r.RequireBackendIntegrity))
+	}
 	if r.PreferDevelopmentBackends {
 		opts = append(opts, config.WithPreferDevelopmentBackends(r.PreferDevelopmentBackends))
 	}
--- a/core/cli/worker/worker.go
+++ b/core/cli/worker/worker.go
@@ -1,10 +1,11 @@
 package worker
 type WorkerFlags struct {
-	BackendsPath       string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"backends"`
+	BackendsPath            string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"backends"`
-	BackendGalleries   string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
+	BackendGalleries        string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
-	BackendsSystemPath string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
+	BackendsSystemPath      string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
-	ExtraLLamaCPPArgs  string `name:"llama-cpp-args" env:"LOCALAI_EXTRA_LLAMA_CPP_ARGS,EXTRA_LLAMA_CPP_ARGS" help:"Extra arguments to pass to llama-cpp-rpc-server"`
+	RequireBackendIntegrity bool   `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, reject backend installs without a configured signature verification policy (OCI URIs) or SHA256 (tarball/HTTP URIs)." group:"hardening" default:"false"`
+	ExtraLLamaCPPArgs       string `name:"llama-cpp-args" env:"LOCALAI_EXTRA_LLAMA_CPP_ARGS,EXTRA_LLAMA_CPP_ARGS" help:"Extra arguments to pass to llama-cpp-rpc-server"`
 }
 type Worker struct {
--- a/core/cli/worker/worker_backend_common.go
+++ b/core/cli/worker/worker_backend_common.go
@@ -18,7 +18,7 @@ import (
 // installing the backend from the gallery if it isn't present.
 // `name` is the gallery entry name (for vLLM the meta entry "vllm"
 // resolves to a platform-specific package via capability lookup).
-func findBackendPath(name, galleries string, systemState *system.SystemState) (string, error) {
+func findBackendPath(name, galleries string, systemState *system.SystemState, requireIntegrity bool) (string, error) {
 	backends, err := gallery.ListSystemBackends(systemState)
 	if err != nil {
 		return "", err
@@ -33,7 +33,7 @@ func findBackendPath(name, galleries string, systemState *system.SystemState) (s
 		xlog.Error("failed loading galleries", "error", err)
 		return "", err
 	}
-	if err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, name, nil, true); err != nil {
+	if err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, name, nil, true, requireIntegrity); err != nil {
 		xlog.Error("backend not found, failed to install it", "name", name, "error", err)
 		return "", err
 	}
--- a/core/cli/worker/worker_llamacpp.go
+++ b/core/cli/worker/worker_llamacpp.go
@@ -27,7 +27,7 @@ const (
 	llamaCPPGalleryName   = "llama-cpp"
 )
-func findLLamaCPPBackend(galleries string, systemState *system.SystemState) (string, error) {
+func findLLamaCPPBackend(galleries string, systemState *system.SystemState, requireIntegrity bool) (string, error) {
 	backends, err := gallery.ListSystemBackends(systemState)
 	if err != nil {
 		xlog.Warn("Failed listing system backends", "error", err)
@@ -43,7 +43,7 @@ func findLLamaCPPBackend(galleries string, systemState *system.SystemState) (str
 			xlog.Error("failed loading galleries", "error", err)
 			return "", err
 		}
-		err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, llamaCPPGalleryName, nil, true)
+		err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, llamaCPPGalleryName, nil, true, requireIntegrity)
 		if err != nil {
 			xlog.Error("llama-cpp backend not found, failed to install it", "error", err)
 			return "", err
@@ -76,7 +76,7 @@ func (r *LLamaCPP) Run(ctx *cliContext.Context) error {
 	if err != nil {
 		return err
 	}
-	grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState)
+	grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
 	if err != nil {
 		return err
 	}
--- a/core/cli/worker/worker_mlx_common.go
+++ b/core/cli/worker/worker_mlx_common.go
@@ -9,8 +9,8 @@ import (
 const mlxDistributedGalleryName = "mlx-distributed"
-func findMLXDistributedBackendPath(galleries string, systemState *system.SystemState) (string, error) {
+func findMLXDistributedBackendPath(galleries string, systemState *system.SystemState, requireIntegrity bool) (string, error) {
-	return findBackendPath(mlxDistributedGalleryName, galleries, systemState)
+	return findBackendPath(mlxDistributedGalleryName, galleries, systemState, requireIntegrity)
 }
 // buildMLXCommand builds the exec.Cmd to launch the mlx-distributed backend.
--- a/core/cli/worker/worker_mlx_distributed.go
+++ b/core/cli/worker/worker_mlx_distributed.go
@@ -28,7 +28,7 @@ func (r *MLXDistributed) Run(ctx *cliContext.Context) error {
 		return err
 	}
-	backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState)
+	backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
 	if err != nil {
 		return fmt.Errorf("cannot find mlx-distributed backend: %w", err)
 	}
--- a/core/cli/worker/worker_p2p.go
+++ b/core/cli/worker/worker_p2p.go
@@ -73,7 +73,7 @@ func (r *P2P) Run(ctx *cliContext.Context) error {
 			for {
 				xlog.Info("Starting llama-cpp-rpc-server", "address", address, "port", port)
-				grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState)
+				grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
 				if err != nil {
 					xlog.Error("Failed to find llama-cpp-rpc-server", "error", err)
 					return
--- a/core/cli/worker/worker_p2p_mlx.go
+++ b/core/cli/worker/worker_p2p_mlx.go
@@ -48,7 +48,7 @@ func (r *P2PMLX) Run(ctx *cliContext.Context) error {
 	c, cancel := context.WithCancel(context.Background())
 	defer cancel()
-	backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState)
+	backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
 	if err != nil {
 		xlog.Warn("Could not find mlx-distributed backend from gallery, will try backend.py directly", "error", err)
 	}
--- a/core/cli/worker/worker_vllm.go
+++ b/core/cli/worker/worker_vllm.go
@@ -77,7 +77,7 @@ func (r *VLLMDistributed) Run(ctx *cliContext.Context) error {
 		return fmt.Errorf("getting system state: %w", err)
 	}
-	backendPath, err := findBackendPath("vllm", r.BackendGalleries, systemState)
+	backendPath, err := findBackendPath("vllm", r.BackendGalleries, systemState, r.RequireBackendIntegrity)
 	if err != nil {
 		return fmt.Errorf("cannot find vllm backend: %w", err)
 	}
--- a/core/config/application_config.go
+++ b/core/config/application_config.go
@@ -21,6 +21,7 @@ type ApplicationConfig struct {
 	Debug                               bool
 	EnableTracing                       bool
 	TracingMaxItems                     int
+	TracingMaxBodyBytes                 int // Per-body cap for captured request/response bodies; 0 disables the cap
 	EnableBackendLogging                bool
 	GeneratedContentDir                 string
@@ -60,6 +61,13 @@ type ApplicationConfig struct {
 	AutoUpgradeBackends                         bool
 	PreferDevelopmentBackends                   bool
+	// RequireBackendIntegrity promotes a missing SHA256 (tarball/HTTP URIs)
+	// or missing verification policy (OCI URIs) from a warning to a hard
+	// failure during backend install/upgrade. Off by default to keep
+	// upgrades non-breaking; operators opt in explicitly via
+	// --require-backend-integrity / LOCALAI_REQUIRE_BACKEND_INTEGRITY.
+	RequireBackendIntegrity bool
 	SingleBackend           bool // Deprecated: use MaxActiveBackends = 1 instead
 	MaxActiveBackends       int  // Maximum number of active backends (0 = unlimited, 1 = single backend mode)
 	WatchDogIdle bool
@@ -180,6 +188,7 @@ func NewApplicationConfig(o ...AppOption) *ApplicationConfig {
 		LRUEvictionRetryInterval: 1 * time.Second,        // Default: 1 second
 		WatchDogInterval:         500 * time.Millisecond, // Default: 500ms
 		TracingMaxItems:          1024,
+		TracingMaxBodyBytes:      64 * 1024, // 64 KiB - caps each request/response body in the trace buffer
 		AgentPool: AgentPoolConfig{
 			Enabled:         true,
 			Timeout:         "5m",
@@ -436,6 +445,10 @@ func WithAutoUpgradeBackends(v bool) AppOption {
 	return func(o *ApplicationConfig) { o.AutoUpgradeBackends = v }
 }
+func WithRequireBackendIntegrity(v bool) AppOption {
+	return func(o *ApplicationConfig) { o.RequireBackendIntegrity = v }
+}
 func WithPreferDevelopmentBackends(v bool) AppOption {
 	return func(o *ApplicationConfig) { o.PreferDevelopmentBackends = v }
 }
@@ -567,6 +580,12 @@ func WithTracingMaxItems(items int) AppOption {
 	}
 }
+func WithTracingMaxBodyBytes(bytes int) AppOption {
+	return func(o *ApplicationConfig) {
+		o.TracingMaxBodyBytes = bytes
+	}
+}
 func WithGeneratedContentDir(generatedContentDir string) AppOption {
 	return func(o *ApplicationConfig) {
 		o.GeneratedContentDir = generatedContentDir
@@ -909,6 +928,7 @@ func (o *ApplicationConfig) ToRuntimeSettings() RuntimeSettings {
 	f16 := o.F16
 	debug := o.Debug
 	tracingMaxItems := o.TracingMaxItems
+	tracingMaxBodyBytes := o.TracingMaxBodyBytes
 	enableTracing := o.EnableTracing
 	enableBackendLogging := o.EnableBackendLogging
 	cors := o.CORS
@@ -997,6 +1017,7 @@ func (o *ApplicationConfig) ToRuntimeSettings() RuntimeSettings {
 		F16:                       &f16,
 		Debug:                     &debug,
 		TracingMaxItems:           &tracingMaxItems,
+		TracingMaxBodyBytes:       &tracingMaxBodyBytes,
 		EnableTracing:             &enableTracing,
 		EnableBackendLogging:      &enableBackendLogging,
 		CORS:                      &cors,
@@ -1135,6 +1156,9 @@ func (o *ApplicationConfig) ApplyRuntimeSettings(settings *RuntimeSettings) (req
 	if settings.TracingMaxItems != nil {
 		o.TracingMaxItems = *settings.TracingMaxItems
 	}
+	if settings.TracingMaxBodyBytes != nil {
+		o.TracingMaxBodyBytes = *settings.TracingMaxBodyBytes
+	}
 	if settings.EnableBackendLogging != nil {
 		o.EnableBackendLogging = *settings.EnableBackendLogging
 	}
--- a/core/config/backend_capabilities.go
+++ b/core/config/backend_capabilities.go
@@ -24,6 +24,7 @@ const (
 	UsecaseVAD             = "vad"
 	UsecaseAudioTransform  = "audio_transform"
 	UsecaseDiarization     = "diarization"
+	UsecaseRealtimeAudio   = "realtime_audio"
 )
 // GRPCMethod identifies a Backend service RPC from backend.proto.
@@ -45,6 +46,7 @@ const (
 	MethodVAD                GRPCMethod = "VAD"
 	MethodAudioTransform     GRPCMethod = "AudioTransform"
 	MethodDiarize            GRPCMethod = "Diarize"
+	MethodAudioToAudioStream GRPCMethod = "AudioToAudioStream"
 )
 // UsecaseInfo describes a single known_usecase value and how it maps
@@ -147,6 +149,11 @@ var UsecaseInfoMap = map[string]UsecaseInfo{
 		GRPCMethod:  MethodDiarize,
 		Description: "Speaker diarization (who-spoke-when, per-speaker segments) via the Diarize RPC.",
 	},
+	UsecaseRealtimeAudio: {
+		Flag:        FLAG_REALTIME_AUDIO,
+		GRPCMethod:  MethodAudioToAudioStream,
+		Description: "Self-contained any-to-any audio model for the Realtime API — accepts microphone audio and emits speech + transcript (+ optional function calls) from a single backend via the AudioToAudioStream RPC.",
+	},
 }
 // BackendCapability describes which gRPC methods and usecases a backend supports.
@@ -397,6 +404,15 @@ var BackendCapabilities = map[string]BackendCapability{
 		Description:      "Meta MusicGen via transformers — music generation from text",
 	},
+	// --- Any-to-any audio backends ---
+	"liquid-audio": {
+		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodAudioTranscription, MethodTTS, MethodAudioToAudioStream, MethodVAD},
+		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTranscript, UsecaseTTS, UsecaseRealtimeAudio, UsecaseVAD},
+		DefaultUsecases:  []string{UsecaseRealtimeAudio, UsecaseChat, UsecaseTranscript, UsecaseTTS, UsecaseVAD},
+		AcceptsAudios:    true,
+		Description:      "LFM2 / LFM2.5-Audio — self-contained any-to-any audio model for the Realtime API; also exposes chat, transcription, TTS and a stub energy-based VAD endpoint",
+	},
 	// --- Audio transform backends ---
 	"localvqe": {
 		GRPCMethods:      []GRPCMethod{MethodAudioTransform},
--- a/core/config/distributed_config.go
+++ b/core/config/distributed_config.go
@@ -31,8 +31,19 @@ type DistributedConfig struct {
 	DrainTimeout        time.Duration // Time to wait for in-flight requests during drain (default 30s)
 	HealthCheckInterval time.Duration // Health monitor check interval (default 15s)
 	StaleNodeThreshold  time.Duration // Time before a node is considered stale (default 60s)
-	PerModelHealthCheck bool          // Enable per-model backend health checking (default false)
+	// DisablePerModelHealthCheck turns off the health monitor's per-model
-	MCPCIJobTimeout     time.Duration // MCP CI job execution timeout (default 10m)
+	// gRPC probe. With the probe on (the default), the monitor pings each model's
 	// gRPC address and removes stale node_models rows whose backend has
 	// crashed even though the worker's node-level heartbeat is still arriving.
 	// Without per-model probing, /embeddings and /completions can be dispatched
 	// to a backend that silently returns garbage (see also the cascading
 	// model-row cleanup on MarkUnhealthy / MarkDraining).
 	DisablePerModelHealthCheck bool
 	MCPCIJobTimeout time.Duration // MCP CI job execution timeout (default 10m)
 	BackendInstallTimeout time.Duration // NATS round-trip timeout for backend.install (default 15m)
 	BackendUpgradeTimeout time.Duration // NATS round-trip timeout for backend.upgrade (default 15m)
 	MaxUploadSize int64 // Maximum upload body size in bytes (default 50 GB)
@@ -60,13 +71,15 @@ func (c DistributedConfig) Validate() error {
 	}
 	// Check for negative durations
 	for name, d := range map[string]time.Duration{
-		"mcp-tool-timeout":      c.MCPToolTimeout,
+		FlagMCPToolTimeout:        c.MCPToolTimeout,
-		"mcp-discovery-timeout": c.MCPDiscoveryTimeout,
+		FlagMCPDiscoveryTimeout:   c.MCPDiscoveryTimeout,
-		"worker-wait-timeout":   c.WorkerWaitTimeout,
+		FlagWorkerWaitTimeout:     c.WorkerWaitTimeout,
-		"drain-timeout":         c.DrainTimeout,
+		FlagDrainTimeout:          c.DrainTimeout,
-		"health-check-interval": c.HealthCheckInterval,
+		FlagHealthCheckInterval:   c.HealthCheckInterval,
-		"stale-node-threshold":  c.StaleNodeThreshold,
+		FlagStaleNodeThreshold:    c.StaleNodeThreshold,
-		"mcp-ci-job-timeout":    c.MCPCIJobTimeout,
+		FlagMCPCIJobTimeout:       c.MCPCIJobTimeout,
 		FlagBackendInstallTimeout: c.BackendInstallTimeout,
 		FlagBackendUpgradeTimeout: c.BackendUpgradeTimeout,
 	} {
 		if d < 0 {
 			return fmt.Errorf("%s must not be negative", name)
@@ -129,24 +142,66 @@ func WithStorageSecretKey(key string) AppOption {
 	}
 }
 func WithBackendInstallTimeout(d time.Duration) AppOption {
 	return func(o *ApplicationConfig) {
 		o.Distributed.BackendInstallTimeout = d
 	}
 }
 func WithBackendUpgradeTimeout(d time.Duration) AppOption {
 	return func(o *ApplicationConfig) {
 		o.Distributed.BackendUpgradeTimeout = d
 	}
 }
 var EnableAutoApproveNodes = func(o *ApplicationConfig) {
 	o.Distributed.AutoApproveNodes = true
 }
 // Flag names for distributed timeout / interval configuration. These are
 // the kebab-case identifiers kong derives from the matching RunCMD struct
 // fields; they appear in Validate error messages and any other operator-
 // facing surface that needs to reference a specific knob by name. Keeping
 // them as constants leaves a single place to update when a flag is renamed,
 // so these strings do not drift from the actual CLI flags.
 const (
 	FlagMCPToolTimeout        = "mcp-tool-timeout"
 	FlagMCPDiscoveryTimeout   = "mcp-discovery-timeout"
 	FlagWorkerWaitTimeout     = "worker-wait-timeout"
 	FlagDrainTimeout          = "drain-timeout"
 	FlagHealthCheckInterval   = "health-check-interval"
 	FlagStaleNodeThreshold    = "stale-node-threshold"
 	FlagMCPCIJobTimeout       = "mcp-ci-job-timeout"
 	FlagBackendInstallTimeout = "backend-install-timeout"
 	FlagBackendUpgradeTimeout = "backend-upgrade-timeout"
 )
 // Defaults for distributed timeouts.
 const (
-	DefaultMCPToolTimeout      = 360 * time.Second
+	DefaultMCPToolTimeout        = 360 * time.Second
-	DefaultMCPDiscoveryTimeout = 60 * time.Second
+	DefaultMCPDiscoveryTimeout   = 60 * time.Second
-	DefaultWorkerWaitTimeout   = 5 * time.Minute
+	DefaultWorkerWaitTimeout     = 5 * time.Minute
-	DefaultDrainTimeout        = 30 * time.Second
+	DefaultDrainTimeout          = 30 * time.Second
-	DefaultHealthCheckInterval = 15 * time.Second
+	DefaultHealthCheckInterval   = 15 * time.Second
-	DefaultStaleNodeThreshold  = 60 * time.Second
+	DefaultStaleNodeThreshold    = 60 * time.Second
-	DefaultMCPCIJobTimeout     = 10 * time.Minute
+	DefaultMCPCIJobTimeout       = 10 * time.Minute
 	DefaultBackendInstallTimeout = 15 * time.Minute
 	DefaultBackendUpgradeTimeout = 15 * time.Minute
 )
 // DefaultMaxUploadSize is the default maximum upload body size (50 GB).
 const DefaultMaxUploadSize int64 = 50 << 30
 // BackendInstallTimeoutOrDefault returns the configured timeout or the default.
 func (c DistributedConfig) BackendInstallTimeoutOrDefault() time.Duration {
 	return cmp.Or(c.BackendInstallTimeout, DefaultBackendInstallTimeout)
 }
 // BackendUpgradeTimeoutOrDefault returns the configured timeout or the default.
 func (c DistributedConfig) BackendUpgradeTimeoutOrDefault() time.Duration {
 	return cmp.Or(c.BackendUpgradeTimeout, DefaultBackendUpgradeTimeout)
 }
 // MCPToolTimeoutOrDefault returns the configured timeout or the default.
 func (c DistributedConfig) MCPToolTimeoutOrDefault() time.Duration {
 	return cmp.Or(c.MCPToolTimeout, DefaultMCPToolTimeout)
--- a/core/config/distributed_config_test.go
+++ b/core/config/distributed_config_test.go
@@ -0,0 +1,90 @@
 package config_test
 import (
 	"time"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
 	"github.com/mudler/LocalAI/core/config"
 )
 var _ = Describe("DistributedConfig backend NATS timeouts", func() {
 	Context("BackendInstallTimeoutOrDefault", func() {
 		It("returns 15 minutes when unset", func() {
 			c := config.DistributedConfig{}
 			Expect(c.BackendInstallTimeoutOrDefault()).To(Equal(15 * time.Minute))
 		})
 		It("returns the configured value when set", func() {
 			c := config.DistributedConfig{BackendInstallTimeout: 42 * time.Minute}
 			Expect(c.BackendInstallTimeoutOrDefault()).To(Equal(42 * time.Minute))
 		})
 	})
 	Context("BackendUpgradeTimeoutOrDefault", func() {
 		It("returns 15 minutes when unset", func() {
 			c := config.DistributedConfig{}
 			Expect(c.BackendUpgradeTimeoutOrDefault()).To(Equal(15 * time.Minute))
 		})
 		It("returns the configured value when set", func() {
 			c := config.DistributedConfig{BackendUpgradeTimeout: 30 * time.Minute}
 			Expect(c.BackendUpgradeTimeoutOrDefault()).To(Equal(30 * time.Minute))
 		})
 	})
 })
 var _ = Describe("DistributedConfig flag-name constants", func() {
 	// Pin the kebab-case strings so a rename of the Go field name (or a
 	// CLI flag naming convention change) forces the constant to update,
 	// keeping the Validate error messages and any future operator-facing
 	// surface in sync with the actual CLI flag.
 	DescribeTable("flag name constants",
 		func(actual, expected string) {
 			Expect(actual).To(Equal(expected))
 		},
 		Entry("MCP tool timeout", config.FlagMCPToolTimeout, "mcp-tool-timeout"),
 		Entry("MCP discovery timeout", config.FlagMCPDiscoveryTimeout, "mcp-discovery-timeout"),
 		Entry("worker wait timeout", config.FlagWorkerWaitTimeout, "worker-wait-timeout"),
 		Entry("drain timeout", config.FlagDrainTimeout, "drain-timeout"),
 		Entry("health check interval", config.FlagHealthCheckInterval, "health-check-interval"),
 		Entry("stale node threshold", config.FlagStaleNodeThreshold, "stale-node-threshold"),
 		Entry("MCP CI job timeout", config.FlagMCPCIJobTimeout, "mcp-ci-job-timeout"),
 		Entry("backend install timeout", config.FlagBackendInstallTimeout, "backend-install-timeout"),
 		Entry("backend upgrade timeout", config.FlagBackendUpgradeTimeout, "backend-upgrade-timeout"),
 	)
 })
 var _ = Describe("DistributedConfig.Validate negative-duration errors", func() {
 	It("rejects a negative BackendInstallTimeout with the flag name in the error", func() {
 		c := config.DistributedConfig{
 			Enabled:               true,
 			NatsURL:               "nats://localhost:4222",
 			BackendInstallTimeout: -1 * time.Second,
 		}
 		err := c.Validate()
 		Expect(err).To(HaveOccurred())
 		Expect(err.Error()).To(ContainSubstring(config.FlagBackendInstallTimeout))
 		Expect(err.Error()).To(ContainSubstring("must not be negative"))
 	})
 	It("rejects a negative BackendUpgradeTimeout with the flag name in the error", func() {
 		c := config.DistributedConfig{
 			Enabled:               true,
 			NatsURL:               "nats://localhost:4222",
 			BackendUpgradeTimeout: -1 * time.Second,
 		}
 		err := c.Validate()
 		Expect(err).To(HaveOccurred())
 		Expect(err.Error()).To(ContainSubstring(config.FlagBackendUpgradeTimeout))
 	})
 	It("accepts all-zero durations as valid (defaults apply)", func() {
 		c := config.DistributedConfig{
 			Enabled: true,
 			NatsURL: "nats://localhost:4222",
 		}
 		Expect(c.Validate()).To(Succeed())
 	})
 })
--- a/core/config/gallery.go
+++ b/core/config/gallery.go
@@ -1,6 +1,37 @@
 package config
-type Gallery struct {
+// GalleryVerification declares the keyless-cosign signature policy that
-	URL  string `json:"url" yaml:"url"`
+// every OCI backend image fetched from this gallery must satisfy.
-	Name string `json:"name" yaml:"name"`
+//
 // Verification is opt-in: galleries without a Verification block install
 // backends with no signature check (the downloader logs a warning when
 // LOCALAI_REQUIRE_BACKEND_INTEGRITY is unset; that flag turns the warning
 // into a hard error).
 //
 // Identity matching: set Issuer (exact) or IssuerRegex, AND Identity
 // (exact) or IdentityRegex. For GitHub Actions keyless signing the
 // typical shape is:
 //
 //	verification:
 //	  issuer: "https://token.actions.githubusercontent.com"
 //	  identity_regex: "^https://github\\.com/mudler/local-ai-backends/\\.github/workflows/build\\.yaml@refs/heads/master$"
 //	  not_before: "2026-05-01T00:00:00Z"
 //
 // NotBefore is the revocation lever: advance it to invalidate every
 // signature produced before a known compromise window. Keyless cosign
 // certs are ephemeral so there is no CA-side revocation.
 type GalleryVerification struct {
 	Issuer        string `json:"issuer,omitempty" yaml:"issuer,omitempty"`
 	IssuerRegex   string `json:"issuer_regex,omitempty" yaml:"issuer_regex,omitempty"`
 	Identity      string `json:"identity,omitempty" yaml:"identity,omitempty"`
 	IdentityRegex string `json:"identity_regex,omitempty" yaml:"identity_regex,omitempty"`
 	// NotBefore is an RFC3339 timestamp. Empty disables the time check.
 	NotBefore string `json:"not_before,omitempty" yaml:"not_before,omitempty"`
 }
 type Gallery struct {
 	URL          string               `json:"url" yaml:"url"`
 	Name         string               `json:"name" yaml:"name"`
 	Verification *GalleryVerification `json:"verification,omitempty" yaml:"verification,omitempty"`
 }
--- a/core/config/gguf.go
+++ b/core/config/gguf.go
@@ -54,6 +54,13 @@ func guessGGUFFromFile(cfg *ModelConfig, f *gguf.GGUFFile, defaultCtx int) {
 		cfg.modelTemplate = chatTemplate.ValueString()
 	}
 	// Auto-enable Multi-Token Prediction (ggml-org/llama.cpp#22673) when the
 	// GGUF carries an embedded MTP head. Skipped silently for non-MTP models
 	// and when the user already configured a spec_type.
 	if n, ok := HasEmbeddedMTPHead(f); ok {
 		ApplyMTPDefaults(cfg, n)
 	}
 	// Thinking support detection is done after model load via DetectThinkingSupportFromBackend
 	// template estimations
--- a/core/config/hooks_test.go
+++ b/core/config/hooks_test.go
@@ -136,4 +136,36 @@ var _ = Describe("Backend hooks and parser defaults", func() {
 			Expect(cfg.EngineArgs["enable_chunked_prefill"]).To(Equal(true))
 		})
 	})
 	Context("PromptCacheAll default", func() {
 		It("defaults to true when omitted from YAML", func() {
 			cfg := &ModelConfig{}
 			cfg.SetDefaults()
 			Expect(cfg.PromptCacheAll).NotTo(BeNil())
 			Expect(*cfg.PromptCacheAll).To(BeTrue())
 		})
 		It("preserves an explicit false from YAML", func() {
 			falseV := false
 			cfg := &ModelConfig{
 				LLMConfig: LLMConfig{PromptCacheAll: &falseV},
 			}
 			cfg.SetDefaults()
 			Expect(cfg.PromptCacheAll).NotTo(BeNil())
 			Expect(*cfg.PromptCacheAll).To(BeFalse())
 		})
 		It("preserves an explicit true from YAML", func() {
 			trueV := true
 			cfg := &ModelConfig{
 				LLMConfig: LLMConfig{PromptCacheAll: &trueV},
 			}
 			cfg.SetDefaults()
 			Expect(cfg.PromptCacheAll).NotTo(BeNil())
 			Expect(*cfg.PromptCacheAll).To(BeTrue())
 		})
 	})
 })
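Review note: the three cases tested above depend on `PromptCacheAll` being a `*bool`, since a plain `bool` cannot distinguish "omitted" from "explicitly false". A minimal sketch of that tri-state pattern, assuming a hypothetical trimmed-down `llmConfig` type rather than the real `ModelConfig`:

```go
package main

import "fmt"

// llmConfig is a hypothetical, trimmed-down stand-in for LLMConfig.
type llmConfig struct {
	PromptCacheAll *bool // nil = not set, otherwise the explicit choice
}

// setDefaults fills in true only when the field was never set, so an
// explicit false survives.
func (c *llmConfig) setDefaults() {
	if c.PromptCacheAll == nil {
		v := true
		c.PromptCacheAll = &v
	}
}

func main() {
	off := false
	unset := &llmConfig{}
	explicit := &llmConfig{PromptCacheAll: &off}

	unset.setDefaults()
	explicit.setDefaults()

	fmt.Println(*unset.PromptCacheAll, *explicit.PromptCacheAll)
}
```

With a non-pointer `bool`, the second case would be indistinguishable from the first and the new default would silently override the opt-out.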
--- a/core/config/model_config.go
+++ b/core/config/model_config.go
@@ -209,7 +209,7 @@ type LLMConfig struct {
 	RMSNormEps      float32  `yaml:"rms_norm_eps,omitempty" json:"rms_norm_eps,omitempty"`
 	NGQA            int32    `yaml:"ngqa,omitempty" json:"ngqa,omitempty"`
 	PromptCachePath string   `yaml:"prompt_cache_path,omitempty" json:"prompt_cache_path,omitempty"`
-	PromptCacheAll  bool     `yaml:"prompt_cache_all,omitempty" json:"prompt_cache_all,omitempty"`
+	PromptCacheAll  *bool    `yaml:"prompt_cache_all,omitempty" json:"prompt_cache_all,omitempty"`
 	PromptCacheRO   bool     `yaml:"prompt_cache_ro,omitempty" json:"prompt_cache_ro,omitempty"`
 	MirostatETA     *float64 `yaml:"mirostat_eta,omitempty" json:"mirostat_eta,omitempty"`
 	MirostatTAU     *float64 `yaml:"mirostat_tau,omitempty" json:"mirostat_tau,omitempty"`
@@ -494,6 +494,13 @@ func (cfg *ModelConfig) SetDefaults(opts ...ConfigLoaderOption) {
 		cfg.Reranking = &falseV
 	}
 	if cfg.PromptCacheAll == nil {
 		// Match upstream llama.cpp's default (common/common.h: cache_prompt = true)
 		// and let cache_idle_slots / kv_unified actually do useful work; users can
 		// opt out with an explicit `prompt_cache_all: false` in the model YAML.
 		cfg.PromptCacheAll = &trueV
 	}
 	if threads == 0 {
 		// Threads can't be 0
 		threads = 4
@@ -636,6 +643,7 @@ const (
 	FLAG_SPEAKER_RECOGNITION ModelConfigUsecase = 0b1000000000000000
 	FLAG_AUDIO_TRANSFORM     ModelConfigUsecase = 0b10000000000000000
 	FLAG_DIARIZATION         ModelConfigUsecase = 0b100000000000000000
 	FLAG_REALTIME_AUDIO      ModelConfigUsecase = 0b1000000000000000000
 	// Common Subsets
 	FLAG_LLM ModelConfigUsecase = FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT
@@ -645,12 +653,12 @@ const (
 // Flags within the same group are NOT orthogonal (e.g., chat and completion are
 // both text/language). A model is multimodal when its usecases span 2+ groups.
 var ModalityGroups = []ModelConfigUsecase{
-	FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT, // text/language
+	FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT,    // text/language
-	FLAG_VISION | FLAG_DETECTION,            // visual understanding
+	FLAG_VISION | FLAG_DETECTION,               // visual understanding
-	FLAG_TRANSCRIPT,                         // speech input
+	FLAG_TRANSCRIPT | FLAG_REALTIME_AUDIO,      // speech input — realtime_audio is any-to-any, so it counts here too
-	FLAG_TTS | FLAG_SOUND_GENERATION,        // audio output
+	FLAG_TTS | FLAG_SOUND_GENERATION | FLAG_REALTIME_AUDIO, // audio output — and here, so a lone realtime_audio flag still reads as multimodal
-	FLAG_AUDIO_TRANSFORM,                    // audio in/out transforms
+	FLAG_AUDIO_TRANSFORM,                       // audio in/out transforms
-	FLAG_IMAGE | FLAG_VIDEO,                 // visual generation
+	FLAG_IMAGE | FLAG_VIDEO,                    // visual generation
 }
 // IsMultimodal returns true if the given usecases span two or more orthogonal
@@ -692,6 +700,7 @@ func GetAllModelConfigUsecases() map[string]ModelConfigUsecase {
 		"FLAG_SPEAKER_RECOGNITION": FLAG_SPEAKER_RECOGNITION,
 		"FLAG_AUDIO_TRANSFORM":     FLAG_AUDIO_TRANSFORM,
 		"FLAG_DIARIZATION":         FLAG_DIARIZATION,
 		"FLAG_REALTIME_AUDIO":      FLAG_REALTIME_AUDIO,
 	}
 }
@@ -866,6 +875,16 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
 		}
 	}
 	if (u & FLAG_REALTIME_AUDIO) == FLAG_REALTIME_AUDIO {
 		// Backends that own a single any-to-any loop and implement
 		// AudioToAudioStream — listed here so models without an explicit
 		// known_usecases still surface on the Talk page.
 		realtimeAudioBackends := []string{"liquid-audio"}
 		if !slices.Contains(realtimeAudioBackends, c.Backend) {
 			return false
 		}
 	}
 	return true
 }
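Review note: listing `FLAG_REALTIME_AUDIO` in both the speech-input and audio-output groups is what lets a model with only that flag count as multimodal. The sketch below illustrates the group-counting idea with a reduced, hypothetical flag set; the body of `isMultimodal` is an assumption based on the "span 2+ groups" comment, not a copy of the project's `IsMultimodal`.

```go
package main

import "fmt"

type usecase uint32

// A reduced, hypothetical flag set for illustration only.
const (
	flagChat usecase = 1 << iota
	flagTranscript
	flagTTS
	flagRealtimeAudio
)

// groups lists modality groups; a flag may belong to more than one group.
var groups = []usecase{
	flagChat,                           // text/language
	flagTranscript | flagRealtimeAudio, // speech input
	flagTTS | flagRealtimeAudio,        // audio output
}

// isMultimodal reports whether u touches two or more groups.
func isMultimodal(u usecase) bool {
	n := 0
	for _, g := range groups {
		if u&g != 0 {
			n++
		}
	}
	return n >= 2
}

func main() {
	fmt.Println(isMultimodal(flagRealtimeAudio))    // lone any-to-any flag
	fmt.Println(isMultimodal(flagTranscript))       // single group
	fmt.Println(isMultimodal(flagChat | flagTTS))   // two distinct groups
}
```

Under this counting rule a flag that sits in two groups contributes to both, so no special case is needed for any-to-any models.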
--- a/core/config/mtp.go
+++ b/core/config/mtp.go
@@ -0,0 +1,84 @@
 package config
 import (
 	"strings"
 	gguf "github.com/gpustack/gguf-parser-go"
 	"github.com/mudler/xlog"
 )
 // mtpSpecOptions lists the speculative-decoding option keys auto-applied when
 // an MTP head is detected on a llama-cpp GGUF. Defaults track the upstream
 // MTP PR (ggml-org/llama.cpp#22673):
 //
 //   - spec_type:draft-mtp      activates Multi-Token Prediction
 //   - spec_n_max:6             draft window
 //   - spec_p_min:0.75          pinned because upstream marked the 0.75 default
 //     with a "change to 0.0f" TODO; locking it here keeps acceptance
 //     thresholds stable across future bumps
 var mtpSpecOptions = []string{
 	"spec_type:draft-mtp",
 	"spec_n_max:6",
 	"spec_p_min:0.75",
 }
 // MTPSpecOptions returns a copy of the option keys auto-applied when an MTP
 // head is detected. Exported for testing and for the importer.
 func MTPSpecOptions() []string {
 	out := make([]string, len(mtpSpecOptions))
 	copy(out, mtpSpecOptions)
 	return out
 }
 // HasEmbeddedMTPHead reports whether the parsed GGUF declares a Multi-Token
 // Prediction head. Detection reads `<arch>.nextn_predict_layers`, which is
 // what `gguf_writer.add_nextn_predict_layers(n)` emits in upstream's
 // `conversion/qwen.py` MTP mixin. A positive layer count means the head is
 // present in the same GGUF as the trunk.
 func HasEmbeddedMTPHead(f *gguf.GGUFFile) (uint32, bool) {
 	if f == nil {
 		return 0, false
 	}
 	arch := f.Architecture().Architecture
 	if arch == "" {
 		return 0, false
 	}
 	v, ok := f.Header.MetadataKV.Get(arch + ".nextn_predict_layers")
 	if !ok {
 		return 0, false
 	}
 	n := gguf.ValueNumeric[uint32](v)
 	return n, n > 0
 }
 // hasSpecTypeOption returns true when the slice already contains a
 // user-configured `spec_type:` / `speculative_type:` entry. Used to avoid
 // clobbering an explicit choice with the MTP auto-defaults.
 func hasSpecTypeOption(opts []string) bool {
 	for _, o := range opts {
 		if strings.HasPrefix(o, "spec_type:") || strings.HasPrefix(o, "speculative_type:") {
 			return true
 		}
 	}
 	return false
 }
 // ApplyMTPDefaults appends the auto-MTP option keys to cfg.Options when none
 // is already configured. It is a no-op when the user already picked a
 // `spec_type` (either via YAML or via the importer's preferences flow).
 //
 // `layers` is the value read from `<arch>.nextn_predict_layers` and is only
 // used for the diagnostic log line.
 func ApplyMTPDefaults(cfg *ModelConfig, layers uint32) {
 	if cfg == nil {
 		return
 	}
 	if hasSpecTypeOption(cfg.Options) {
 		xlog.Debug("[mtp] embedded MTP head detected but spec_type already configured; leaving user choice intact",
 			"name", cfg.Name, "nextn_layers", layers)
 		return
 	}
 	cfg.Options = append(cfg.Options, mtpSpecOptions...)
 	xlog.Info("[mtp] embedded MTP head detected; enabling draft-mtp speculative decoding",
 		"name", cfg.Name, "nextn_layers", layers, "spec_n_max", 6, "spec_p_min", 0.75)
 }
--- a/core/config/mtp_test.go
+++ b/core/config/mtp_test.go
@@ -0,0 +1,86 @@
 package config_test
 import (
 	. "github.com/mudler/LocalAI/core/config"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
 )
 var _ = Describe("MTP auto-defaults", func() {
 	Context("MTPSpecOptions", func() {
 		It("returns the upstream-recommended speculative tuple", func() {
 			Expect(MTPSpecOptions()).To(Equal([]string{
 				"spec_type:draft-mtp",
 				"spec_n_max:6",
 				"spec_p_min:0.75",
 			}))
 		})
 		It("returns a defensive copy so callers cannot mutate the package default", func() {
 			opts := MTPSpecOptions()
 			opts[0] = "spec_type:none"
 			Expect(MTPSpecOptions()[0]).To(Equal("spec_type:draft-mtp"))
 		})
 	})
 	Context("ApplyMTPDefaults", func() {
 		It("appends MTP options when nothing is configured", func() {
 			cfg := &ModelConfig{Name: "qwen-mtp"}
 			ApplyMTPDefaults(cfg, 1)
 			Expect(cfg.Options).To(Equal([]string{
 				"spec_type:draft-mtp",
 				"spec_n_max:6",
 				"spec_p_min:0.75",
 			}))
 		})
 		It("preserves unrelated options already on the config", func() {
 			cfg := &ModelConfig{
 				Name:    "qwen-mtp",
 				Options: []string{"use_jinja:true", "cache_reuse:256"},
 			}
 			ApplyMTPDefaults(cfg, 1)
 			Expect(cfg.Options).To(Equal([]string{
 				"use_jinja:true",
 				"cache_reuse:256",
 				"spec_type:draft-mtp",
 				"spec_n_max:6",
 				"spec_p_min:0.75",
 			}))
 		})
 		It("is a no-op when the user already configured spec_type", func() {
 			cfg := &ModelConfig{
 				Name:    "qwen-mtp",
 				Options: []string{"spec_type:ngram-simple", "use_jinja:true"},
 			}
 			ApplyMTPDefaults(cfg, 1)
 			Expect(cfg.Options).To(Equal([]string{
 				"spec_type:ngram-simple",
 				"use_jinja:true",
 			}))
 		})
 		It("also respects the legacy speculative_type alias", func() {
 			cfg := &ModelConfig{
 				Name:    "qwen-mtp",
 				Options: []string{"speculative_type:ngram-mod"},
 			}
 			ApplyMTPDefaults(cfg, 1)
 			Expect(cfg.Options).To(Equal([]string{"speculative_type:ngram-mod"}))
 		})
 		It("tolerates a nil config", func() {
 			Expect(func() { ApplyMTPDefaults(nil, 1) }).ToNot(Panic())
 		})
 	})
 	Context("HasEmbeddedMTPHead", func() {
 		It("returns false on a nil GGUF file", func() {
 			n, ok := HasEmbeddedMTPHead(nil)
 			Expect(ok).To(BeFalse())
 			Expect(n).To(BeZero())
 		})
 	})
 })
--- a/core/config/runtime_settings.go
+++ b/core/config/runtime_settings.go
@@ -38,6 +38,7 @@ type RuntimeSettings struct {
 	Debug                *bool `json:"debug,omitempty"`
 	EnableTracing        *bool `json:"enable_tracing,omitempty"`
 	TracingMaxItems      *int  `json:"tracing_max_items,omitempty"`
 	TracingMaxBodyBytes  *int  `json:"tracing_max_body_bytes,omitempty"` // Per-body cap in bytes; 0 disables the cap
 	EnableBackendLogging *bool `json:"enable_backend_logging,omitempty"`
 	// Security/CORS settings
--- a/core/gallery/backends.go
+++ b/core/gallery/backends.go
@@ -16,6 +16,7 @@ import (
 	"github.com/mudler/LocalAI/pkg/downloader"
 	"github.com/mudler/LocalAI/pkg/model"
 	"github.com/mudler/LocalAI/pkg/oci"
 	"github.com/mudler/LocalAI/pkg/oci/cosignverify"
 	"github.com/mudler/LocalAI/pkg/system"
 	"github.com/mudler/xlog"
 	cp "github.com/otiai10/copy"
@@ -102,8 +103,81 @@ func writeBackendMetadata(backendPath string, metadata *BackendMetadata) error {
 	return nil
 }
 // backendDownloadOptions translates the gallery's verification policy into
 // downloader options, and gates the call on strict-integrity mode. Both
 // InstallBackend and UpgradeBackend MUST route their download through these
 // options — without them, the corresponding code path silently downloads
 // and activates unverified backend bytes even when the gallery has a
 // verification: policy configured.
 //
 // For OCI URIs with a verification policy, returns a slice containing
 // downloader.WithImageVerifier(v) — the downloader will then run cosign
 // signature verification between fetching the manifest and extracting
 // layers (see pkg/downloader/uri.go OCI branch).
 //
 // For OCI URIs without a verification policy, or non-OCI URIs without a
 // SHA256, the function either lets the install proceed with a non-fatal
 // warning (requireIntegrity false) or fails it (requireIntegrity true).
 func backendDownloadOptions(config *GalleryBackend, requireIntegrity bool) ([]downloader.DownloadOption, error) {
 	uri := downloader.URI(config.URI)
 	hasVerification := config.Gallery.Verification != nil
 	hasSHA := config.SHA256 != ""
 	switch {
 	case uri.LooksLikeOCI():
 		if !hasVerification {
 			if requireIntegrity {
 				return nil, fmt.Errorf("strict integrity: gallery %q has no verification policy for OCI backend %q (set verification: in the gallery YAML or disable --require-backend-integrity)",
 					config.Gallery.Name, config.Name)
 			}
 			xlog.Warn("installing OCI backend without signature verification",
 				"backend", config.Name, "gallery", config.Gallery.Name, "uri", config.URI)
 			return nil, nil
 		}
 		v, err := newGalleryVerifier(config.Gallery.Verification)
 		if err != nil {
 			return nil, fmt.Errorf("gallery %q verification policy: %w", config.Gallery.Name, err)
 		}
 		return []downloader.DownloadOption{downloader.WithImageVerifier(v)}, nil
 	case uri.LooksLikeDir():
 		// Local directory — out of scope for integrity checks.
 		return nil, nil
 	default:
 		if !hasSHA && requireIntegrity {
 			return nil, fmt.Errorf("strict integrity: backend %q has no SHA256 (gallery %q)",
 				config.Name, config.Gallery.Name)
 		}
 		// Non-strict: pkg/downloader already emits a warning when sha is empty.
 		return nil, nil
 	}
 }
 // newGalleryVerifier constructs a cosignverify.Verifier from the gallery
 // policy. Parses NotBefore (RFC3339) here so YAML errors surface at install
 // time rather than during signature verification.
 func newGalleryVerifier(p *config.GalleryVerification) (*cosignverify.Verifier, error) {
 	pol := cosignverify.Policy{
 		Issuer:        p.Issuer,
 		IssuerRegex:   p.IssuerRegex,
 		Identity:      p.Identity,
 		IdentityRegex: p.IdentityRegex,
 	}
 	if p.NotBefore != "" {
 		t, err := time.Parse(time.RFC3339, p.NotBefore)
 		if err != nil {
 			return nil, fmt.Errorf("not_before %q: %w", p.NotBefore, err)
 		}
 		pol.NotBefore = t
 	}
 	return cosignverify.NewVerifier(pol, nil, nil)
 }
 // InstallBackendFromGallery installs a backend from the gallery.
-func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, modelLoader *model.ModelLoader, name string, downloadStatus func(string, string, string, float64), force bool) error {
+// requireIntegrity escalates a missing SHA256 / verification policy from a
 // warning to a hard failure (see backendDownloadOptions).
 func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, modelLoader *model.ModelLoader, name string, downloadStatus func(string, string, string, float64), force, requireIntegrity bool) error {
 	if !force {
 		// check if we already have the backend installed
 		backends, err := ListSystemBackends(systemState)
@@ -149,7 +223,7 @@ func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery,
 		xlog.Debug("Installing backend from meta backend", "name", name, "bestBackend", bestBackend.Name)
 		// Then, let's install the best backend
-		if err := InstallBackend(ctx, systemState, modelLoader, bestBackend, downloadStatus); err != nil {
+		if err := InstallBackend(ctx, systemState, modelLoader, bestBackend, downloadStatus, requireIntegrity); err != nil {
 			return err
 		}
@@ -175,10 +249,10 @@ func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery,
 		return nil
 	}
-	return InstallBackend(ctx, systemState, modelLoader, backend, downloadStatus)
+	return InstallBackend(ctx, systemState, modelLoader, backend, downloadStatus, requireIntegrity)
 }
-func InstallBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, config *GalleryBackend, downloadStatus func(string, string, string, float64)) error {
+func InstallBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, config *GalleryBackend, downloadStatus func(string, string, string, float64), requireIntegrity bool) error {
 	// Get configurable fallback tag values from SystemState
 	latestTag, masterTag, devSuffix := getFallbackTagValues(systemState)
@@ -213,6 +287,14 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
 		return fmt.Errorf("failed to create base path: %v", err)
 	}
 	// Build the download options once and reuse for every retry path —
 	// mirrors and tag fallbacks must verify against the same gallery
 	// policy or we open a hole where a non-default URI bypasses the check.
 	downloadOpts, optsErr := backendDownloadOptions(config, requireIntegrity)
 	if optsErr != nil {
 		return fmt.Errorf("backend %q: %w", config.Name, optsErr)
 	}
 	uri := downloader.URI(config.URI)
 	// Check if it is a directory
 	if uri.LooksLikeDir() {
@@ -222,7 +304,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
 		}
 	} else {
 		xlog.Debug("Downloading backend", "uri", config.URI, "backendPath", backendPath)
-		if err := uri.DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err != nil {
+		if err := uri.DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err != nil {
 			xlog.Debug("Backend download failed, trying fallback", "backendPath", backendPath, "error", err)
 			// resetBackendPath cleans up partial state from a failed OCI extraction
@@ -243,7 +325,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
 				default:
 				}
 				resetBackendPath()
-				if err := downloader.URI(mirror).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err == nil {
+				if err := downloader.URI(mirror).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err == nil {
 					success = true
 					xlog.Debug("Downloaded backend from mirror", "uri", config.URI, "backendPath", backendPath)
 					break
@@ -256,7 +338,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
 				if fallbackURI != string(config.URI) {
 					resetBackendPath()
 					xlog.Info("Trying fallback URI", "original", config.URI, "fallback", fallbackURI)
-					if err := downloader.URI(fallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err == nil {
+					if err := downloader.URI(fallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err == nil {
 						xlog.Info("Downloaded backend using fallback URI", "uri", fallbackURI, "backendPath", backendPath)
 						success = true
 					} else {
@@ -265,7 +347,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
 							resetBackendPath()
 							devFallbackURI := fallbackURI + "-" + devSuffix
 							xlog.Info("Trying development fallback URI", "fallback", devFallbackURI)
-							if err := downloader.URI(devFallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err == nil {
+							if err := downloader.URI(devFallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err == nil {
 								xlog.Info("Downloaded backend using development fallback URI", "uri", devFallbackURI, "backendPath", backendPath)
 								success = true
 							} else {
--- a/core/gallery/backends_test.go
+++ b/core/gallery/backends_test.go
@@ -117,13 +117,13 @@ var _ = Describe("Gallery Backends", func() {
 	Describe("InstallBackendFromGallery", func() {
 		It("should return error when backend is not found", func() {
-			err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "non-existent", nil, true)
+			err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "non-existent", nil, true, false)
 			Expect(err).To(HaveOccurred())
 			Expect(err.Error()).To(ContainSubstring("no backend found with name \"non-existent\""))
 		})
 		It("should install backend from gallery", func() {
-			err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "test-backend", nil, true)
+			err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "test-backend", nil, true, false)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(filepath.Join(tempDir, "test-backend", "run.sh")).To(BeARegularFile())
 		})
@@ -545,7 +545,7 @@ var _ = Describe("Gallery Backends", func() {
 				VRAM:      1000000000000,
 				Backend:   system.Backend{BackendsPath: tempDir},
 			}
-			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true)
+			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true, false)
 			Expect(err).NotTo(HaveOccurred())
 			metaBackendPath := filepath.Join(tempDir, "meta-backend")
@@ -625,7 +625,7 @@ var _ = Describe("Gallery Backends", func() {
 				VRAM:      1000000000000,
 				Backend:   system.Backend{BackendsPath: tempDir},
 			}
-			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true)
+			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true, false)
 			Expect(err).NotTo(HaveOccurred())
 			metaBackendPath := filepath.Join(tempDir, "meta-backend")
@@ -709,7 +709,7 @@ var _ = Describe("Gallery Backends", func() {
 				VRAM:      1000000000000,
 				Backend:   system.Backend{BackendsPath: tempDir},
 			}
-			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true)
+			err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true, false)
 			Expect(err).NotTo(HaveOccurred())
 			metaBackendPath := filepath.Join(tempDir, "meta-backend")
@@ -808,7 +808,7 @@ var _ = Describe("Gallery Backends", func() {
 				system.WithBackendPath(newPath),
 			)
 			Expect(err).NotTo(HaveOccurred())
-			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil)
+			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
 			Expect(newPath).To(BeADirectory())
 			Expect(err).To(HaveOccurred()) // Will fail due to invalid URI, but path should be created
 		})
@@ -840,7 +840,7 @@ var _ = Describe("Gallery Backends", func() {
 				system.WithBackendPath(tempDir),
 			)
 			Expect(err).NotTo(HaveOccurred())
-			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil)
+			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile())
 			dat, err := os.ReadFile(filepath.Join(tempDir, "test-backend", "metadata.json"))
@@ -873,7 +873,7 @@ var _ = Describe("Gallery Backends", func() {
 			Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).ToNot(BeARegularFile())
-			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil)
+			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile())
 		})
@@ -894,7 +894,7 @@ var _ = Describe("Gallery Backends", func() {
 				system.WithBackendPath(tempDir),
 			)
 			Expect(err).NotTo(HaveOccurred())
-			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil)
+			err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile())
--- a/core/gallery/backends_version_test.go
+++ b/core/gallery/backends_version_test.go
@@ -47,7 +47,7 @@ var _ = Describe("Backend versioning", func() {
 		backend.URI = srcDir
 		backend.Version = "1.2.3"
-		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil)
+		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil, false)
 		Expect(err).NotTo(HaveOccurred())
 		// Read the metadata file and check version
@@ -74,7 +74,7 @@ var _ = Describe("Backend versioning", func() {
 		backend.URI = srcDir
 		backend.Version = "2.0.0"
-		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil)
+		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil, false)
 		Expect(err).NotTo(HaveOccurred())
 		metadataPath := filepath.Join(tempDir, "test-backend-uri", "metadata.json")
@@ -100,7 +100,7 @@ var _ = Describe("Backend versioning", func() {
 		backend.URI = srcDir
 		// Version intentionally left empty
-		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil)
+		err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil, false)
 		Expect(err).NotTo(HaveOccurred())
 		metadataPath := filepath.Join(tempDir, "test-backend-noversion", "metadata.json")
--- a/core/gallery/importers/importers.go
+++ b/core/gallery/importers/importers.go
@@ -130,6 +130,8 @@ var defaultImporters = []Importer{
 	// and would otherwise swallow the C++ port's GGUF bundles.
 	&VibeVoiceCppImporter{},
 	&VibeVoiceImporter{},
 	// LiquidAudio (Python) — keep before LlamaCPP so non-GGUF LFM2-Audio repos route here.
 	&LiquidAudioImporter{},
 	&CoquiImporter{},
 	// Image/Video (Batch 3)
 	&StableDiffusionGGMLImporter{},
--- a/core/gallery/importers/liquid-audio.go
+++ b/core/gallery/importers/liquid-audio.go
@@ -0,0 +1,145 @@
 package importers
 import (
 	"encoding/json"
 	"path/filepath"
 	"strings"
 	"github.com/mudler/LocalAI/core/config"
 	"github.com/mudler/LocalAI/core/gallery"
 	"github.com/mudler/LocalAI/core/schema"
 	"go.yaml.in/yaml/v2"
 )
 var _ Importer = &LiquidAudioImporter{}
 // LiquidAudioImporter recognises LiquidAI's LFM2-Audio family (LFM2-Audio-1.5B,
 // LFM2.5-Audio-1.5B, community finetunes) and routes them to the Python
 // `liquid-audio` backend. Detection is by repo-name substring so third-party
 // mirrors still match. preferences.backend="liquid-audio" overrides detection.
 //
 // Once upstream llama.cpp PR #18641 lands and the GGUF gallery entries are
 // added, GGUF mirrors of these models should route to llama-cpp; that's
 // handled by ordering LlamaCPPImporter after this one and by the explicit
 // "-gguf" exclusion below.
 type LiquidAudioImporter struct{}
 func (i *LiquidAudioImporter) Name() string      { return "liquid-audio" }
 func (i *LiquidAudioImporter) Modality() string  { return "tts" }
 func (i *LiquidAudioImporter) AutoDetects() bool { return true }
 func (i *LiquidAudioImporter) Match(details Details) bool {
 	preferences, err := details.Preferences.MarshalJSON()
 	if err != nil {
 		return false
 	}
 	preferencesMap := make(map[string]any)
 	if len(preferences) > 0 {
 		if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
 			return false
 		}
 	}
 	if b, ok := preferencesMap["backend"].(string); ok && b == "liquid-audio" {
 		return true
 	}
 	matchRepo := func(repo string) bool {
 		r := strings.ToLower(repo)
 		// Cede GGUF mirrors to the (later-ordered) llama-cpp importer.
 		if strings.HasSuffix(r, "-gguf") {
 			return false
 		}
 		return strings.Contains(r, "lfm2-audio") || strings.Contains(r, "lfm2.5-audio")
 	}
 	if details.HuggingFace != nil {
 		repoName := details.HuggingFace.ModelID
 		if idx := strings.Index(repoName, "/"); idx >= 0 {
 			repoName = repoName[idx+1:]
 		}
 		if matchRepo(repoName) {
 			return true
 		}
 	}
 	if _, repo, ok := HFOwnerRepoFromURI(details.URI); ok {
 		return matchRepo(repo)
 	}
 	return false
 }
 func (i *LiquidAudioImporter) Import(details Details) (gallery.ModelConfig, error) {
 	preferences, err := details.Preferences.MarshalJSON()
 	if err != nil {
 		return gallery.ModelConfig{}, err
 	}
 	preferencesMap := make(map[string]any)
 	if len(preferences) > 0 {
 		if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
 			return gallery.ModelConfig{}, err
 		}
 	}
 	name, ok := preferencesMap["name"].(string)
 	if !ok {
 		name = filepath.Base(details.URI)
 	}
 	description, ok := preferencesMap["description"].(string)
 	if !ok {
 		description = "Imported from " + details.URI
 	}
 	model := details.URI
 	if details.HuggingFace != nil && details.HuggingFace.ModelID != "" {
 		model = details.HuggingFace.ModelID
 	}
 	// Preferences may pin the mode (chat / asr / tts / s2s / finetune).
 	// Default to s2s — the headline any-to-any use case.
 	mode, _ := preferencesMap["mode"].(string)
 	if mode == "" {
 		mode = "s2s"
 	}
 	options := []string{"mode:" + mode}
 	if voice, ok := preferencesMap["voice"].(string); ok && voice != "" {
 		options = append(options, "voice:"+voice)
 	}
 	usecases := []string{"chat"}
 	switch mode {
 	case "asr":
 		usecases = []string{"transcript"}
 	case "tts":
 		usecases = []string{"tts"}
 	case "s2s":
 		// realtime_audio surfaces the model on the Talk page; chat/tts/
 		// transcript/vad keep the standalone OpenAI-compatible endpoints
 		// working since liquid-audio implements all of them.
 		usecases = []string{"realtime_audio", "chat", "tts", "transcript", "vad"}
 	}
 	modelConfig := config.ModelConfig{
 		Name:                name,
 		Description:         description,
 		Backend:             "liquid-audio",
 		KnownUsecaseStrings: usecases,
 		Options:             options,
 		PredictionOptions: schema.PredictionOptions{
 			BasicModelRequest: schema.BasicModelRequest{Model: model},
 		},
 	}
 	data, err := yaml.Marshal(modelConfig)
 	if err != nil {
 		return gallery.ModelConfig{}, err
 	}
 	return gallery.ModelConfig{
 		Name:        name,
 		Description: description,
 		ConfigFile:  string(data),
 	}, nil
 }
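Review note: the repo-name detection in `Match` can be exercised on its own. The sketch below lifts the inline `matchRepo` closure into a standalone function; the name `matchesLiquidAudioRepo` and the sample IDs in `main` are assumptions made for illustration, not part of this change.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesLiquidAudioRepo is a hypothetical standalone version of the inline
// matchRepo closure: case-insensitive substring detection on the repo name,
// with "-gguf" mirrors ceded to a later-ordered importer.
func matchesLiquidAudioRepo(modelID string) bool {
	repo := modelID
	// Drop the "owner/" prefix when the ID has one.
	if idx := strings.Index(repo, "/"); idx >= 0 {
		repo = repo[idx+1:]
	}
	r := strings.ToLower(repo)
	if strings.HasSuffix(r, "-gguf") {
		return false
	}
	return strings.Contains(r, "lfm2-audio") || strings.Contains(r, "lfm2.5-audio")
}

func main() {
	for _, id := range []string{
		"LiquidAI/LFM2.5-Audio-1.5B",
		"LiquidAI/LFM2-Audio-1.5B",
		"LiquidAI/LFM2.5-Audio-1.5B-GGUF",
		"someone/unrelated-model",
	} {
		fmt.Printf("%-34s -> %v\n", id, matchesLiquidAudioRepo(id))
	}
}
```

Because the check is a substring match, third-party mirrors that keep the family name in the repo name still match, which is the behaviour the importer comment describes.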
--- a/core/gallery/importers/liquid-audio_test.go
+++ b/core/gallery/importers/liquid-audio_test.go
@@ -0,0 +1,91 @@
 package importers_test
 import (
 	"encoding/json"
 	"fmt"
 	"github.com/mudler/LocalAI/core/gallery/importers"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
 )
 var _ = Describe("LiquidAudioImporter", func() {
 	Context("detection from HuggingFace", func() {
 		It("matches LiquidAI/LFM2.5-Audio-1.5B", func() {
 			uri := "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B"
 			preferences := json.RawMessage(`{}`)
 			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
 			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
 			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: liquid-audio"))
 			Expect(modelConfig.ConfigFile).To(ContainSubstring("LiquidAI/LFM2.5-Audio-1.5B"))
 		})
 		It("matches LiquidAI/LFM2-Audio-1.5B (older variant)", func() {
 			uri := "https://huggingface.co/LiquidAI/LFM2-Audio-1.5B"
 			preferences := json.RawMessage(`{}`)
 			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
 			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
 			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: liquid-audio"))
 		})
 		It("cedes -GGUF mirrors to the llama-cpp importer", func() {
 			// LiquidAI/LFM2.5-Audio-1.5B-GGUF should NOT route to liquid-audio.
 			// Once upstream PR #18641 lands and the GGUF gallery entry exists,
 			// this is the path that lets users opt into the C++ runtime.
 			uri := "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B-GGUF"
 			preferences := json.RawMessage(`{}`)
 			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
 			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
 			Expect(modelConfig.ConfigFile).ToNot(ContainSubstring("backend: liquid-audio"),
 				fmt.Sprintf("GGUF repo should not match Python importer; got: %s", modelConfig.ConfigFile))
 		})
 	})
 	Context("preference override", func() {
 		It("honours preferences.backend=liquid-audio for arbitrary URIs", func() {
 			uri := "https://example.com/some-unrelated-model"
 			preferences := json.RawMessage(`{"backend": "liquid-audio"}`)
 			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
 			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
 			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: liquid-audio"))
 		})
 		It("picks up the mode preference", func() {
 			uri := "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B"
 			preferences := json.RawMessage(`{"mode": "asr"}`)
 			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
 			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
 			Expect(modelConfig.ConfigFile).To(ContainSubstring("mode:asr"))
 			Expect(modelConfig.ConfigFile).To(ContainSubstring("transcript"))
 		})
 		It("picks up the voice preference", func() {
 			uri := "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B"
 			preferences := json.RawMessage(`{"mode": "tts", "voice": "uk_male"}`)
 			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
 			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
 			Expect(modelConfig.ConfigFile).To(ContainSubstring("voice:uk_male"))
 		})
 	})
 	Context("Importer interface metadata", func() {
 		It("exposes name/modality/autodetect", func() {
 			imp := &importers.LiquidAudioImporter{}
 			Expect(imp.Name()).To(Equal("liquid-audio"))
 			Expect(imp.Modality()).To(Equal("tts"))
 			Expect(imp.AutoDetects()).To(BeTrue())
 		})
 	})
 })
--- a/core/gallery/importers/llama-cpp.go
+++ b/core/gallery/importers/llama-cpp.go
@@ -1,10 +1,13 @@
 package importers
 import (
 	"context"
 	"encoding/json"
 	"path/filepath"
 	"strings"
 	"time"
 	gguf "github.com/gpustack/gguf-parser-go"
 	"github.com/mudler/LocalAI/core/config"
 	"github.com/mudler/LocalAI/core/gallery"
 	"github.com/mudler/LocalAI/core/schema"
@@ -261,6 +264,13 @@ func (i *LlamaCPPImporter) Import(details Details) (gallery.ModelConfig, error)
 	// Apply per-model-family inference parameter defaults
 	config.ApplyInferenceDefaults(&modelConfig, details.URI)
 	// Auto-detect Multi-Token Prediction heads (ggml-org/llama.cpp#22673) and
 	// enable speculative decoding. Mirrors the load-time hook so freshly
 	// imported configs already carry spec_type:draft-mtp before the model is
 	// ever loaded - users see it in the YAML preview rather than discovering
 	// it after the first start.
 	maybeApplyMTPDefaults(&modelConfig, details, &cfg)
 	data, err := yaml.Marshal(modelConfig)
 	if err != nil {
 		return gallery.ModelConfig{}, err
@@ -291,6 +301,85 @@ func pickPreferredGroup(groups []hfapi.ShardGroup, prefs []string) *hfapi.ShardG
 	return &groups[len(groups)-1]
 }
 // maybeApplyMTPDefaults parses the picked GGUF header (range-fetched over
 // HTTP for HF/URL imports) and, if the file declares a Multi-Token Prediction
 // head, appends the auto-MTP option keys to modelConfig.Options. Failures
 // during the probe are non-fatal: the importer keeps the config without MTP
 // so an unrelated network blip or weird header doesn't break the import.
 //
 // OCI/Ollama URIs are skipped because the artifact isn't directly fetchable
 // as a GGUF byte stream - the load-time hook (core/config/gguf.go) covers
 // those once the model is materialised on disk.
 func maybeApplyMTPDefaults(modelConfig *config.ModelConfig, details Details, cfg *gallery.ModelConfig) {
 	probeURL := pickMTPProbeURL(details, cfg)
 	if probeURL == "" {
 		return
 	}
 	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
 	defer cancel()
 	defer func() {
 		if r := recover(); r != nil {
 			xlog.Debug("[mtp-importer] panic while probing GGUF header", "uri", probeURL, "recover", r)
 		}
 	}()
 	f, err := gguf.ParseGGUFFileRemote(ctx, probeURL)
 	if err != nil {
 		xlog.Debug("[mtp-importer] failed to read remote GGUF header for MTP detection", "uri", probeURL, "error", err)
 		return
 	}
 	n, ok := config.HasEmbeddedMTPHead(f)
 	if !ok {
 		return
 	}
 	config.ApplyMTPDefaults(modelConfig, n)
 }
 // pickMTPProbeURL returns an HTTP(S) URL pointing at the main (non-mmproj)
 // GGUF shard that should be inspected for an MTP head, or "" when no
 // suitable URL is available. Custom URI schemes (`huggingface://`,
 // `ollama://`, etc.) are run through `downloader.URI.ResolveURL` so the
 // resulting URL is something `gguf.ParseGGUFFileRemote` can actually open.
 // OCI/Ollama URIs are skipped because the artifact is not directly
 // streamable as a GGUF byte range.
 func pickMTPProbeURL(details Details, cfg *gallery.ModelConfig) string {
 	uri := downloader.URI(details.URI)
 	if uri.LooksLikeOCI() {
 		return ""
 	}
 	if strings.HasSuffix(strings.ToLower(details.URI), ".gguf") {
 		return resolveHTTPProbe(details.URI)
 	}
 	for _, f := range cfg.Files {
 		lower := strings.ToLower(f.Filename)
 		if strings.Contains(lower, "mmproj") {
 			continue
 		}
 		if !strings.HasSuffix(lower, ".gguf") {
 			continue
 		}
 		return resolveHTTPProbe(f.URI)
 	}
 	return ""
 }
 // resolveHTTPProbe resolves an importer-side URI to the HTTP(S) URL that
 // `gguf.ParseGGUFFileRemote` can range-fetch. Returns "" if the URI can't
 // be reduced to an HTTP(S) endpoint (e.g. local path, unsupported scheme).
 func resolveHTTPProbe(uri string) string {
 	resolved := downloader.URI(uri).ResolveURL()
 	if downloader.URI(resolved).LooksLikeHTTPURL() {
 		return resolved
 	}
 	return ""
 }
 // appendShardGroup copies every shard of group into cfg.Files under dest,
 // skipping any entry whose target filename is already present so repeated
 // calls (e.g. the rare case of mmproj + model picking the same group)
--- a/core/gallery/models.go
+++ b/core/gallery/models.go
@@ -77,7 +77,7 @@ func InstallModelFromGallery(
 	modelGalleries, backendGalleries []lconfig.Gallery,
 	systemState *system.SystemState,
 	modelLoader *model.ModelLoader,
-	name string, req GalleryModel, downloadStatus func(string, string, string, float64), enforceScan, automaticallyInstallBackend bool) error {
+	name string, req GalleryModel, downloadStatus func(string, string, string, float64), enforceScan, automaticallyInstallBackend, requireBackendIntegrity bool) error {
 	applyModel := func(model *GalleryModel) error {
 		name = strings.ReplaceAll(name, string(os.PathSeparator), "__")
@@ -137,7 +137,7 @@ func InstallModelFromGallery(
 		if automaticallyInstallBackend && installedModel.Backend != "" {
 			xlog.Debug("Installing backend", "backend", installedModel.Backend)
-			if err := InstallBackendFromGallery(ctx, backendGalleries, systemState, modelLoader, installedModel.Backend, downloadStatus, false); err != nil {
+			if err := InstallBackendFromGallery(ctx, backendGalleries, systemState, modelLoader, installedModel.Backend, downloadStatus, false, requireBackendIntegrity); err != nil {
 				return err
 			}
 		}
--- a/core/gallery/models_test.go
+++ b/core/gallery/models_test.go
@@ -89,7 +89,7 @@ var _ = Describe("Model test", func() {
 			Expect(models[0].URL).To(Equal(bertEmbeddingsURL))
 			Expect(models[0].Installed).To(BeFalse())
-			err = InstallModelFromGallery(context.TODO(), galleries, []config.Gallery{}, systemState, nil, "test@bert", GalleryModel{}, func(s1, s2, s3 string, f float64) {}, true, true)
+			err = InstallModelFromGallery(context.TODO(), galleries, []config.Gallery{}, systemState, nil, "test@bert", GalleryModel{}, func(s1, s2, s3 string, f float64) {}, true, true, false)
 			Expect(err).ToNot(HaveOccurred())
 			dat, err := os.ReadFile(filepath.Join(tempdir, "bert.yaml"))
--- a/core/gallery/upgrade.go
+++ b/core/gallery/upgrade.go
@@ -232,7 +232,7 @@ func summarizeNodeDrift(nodes []NodeBackendRef) (majority struct{ version, diges
 // UpgradeBackend upgrades a single backend to the latest gallery version using
 // an atomic swap with backup-based rollback on failure.
-func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, galleries []config.Gallery, backendName string, downloadStatus func(string, string, string, float64)) error {
+func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, galleries []config.Gallery, backendName string, downloadStatus func(string, string, string, float64), requireIntegrity bool) error {
 	// Look up the installed backend
 	installedBackends, err := ListSystemBackends(systemState)
 	if err != nil {
@@ -251,7 +251,7 @@ func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelL
 	// If this is a meta backend, recursively upgrade the concrete backend it points to
 	if installed.Metadata != nil && installed.Metadata.MetaBackendFor != "" {
 		xlog.Info("Meta backend detected, upgrading concrete backend", "meta", backendName, "concrete", installed.Metadata.MetaBackendFor)
-		return UpgradeBackend(ctx, systemState, modelLoader, galleries, installed.Metadata.MetaBackendFor, downloadStatus)
+		return UpgradeBackend(ctx, systemState, modelLoader, galleries, installed.Metadata.MetaBackendFor, downloadStatus, requireIntegrity)
 	}
 	// Find the gallery entry
@@ -265,6 +265,16 @@ func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelL
 		return fmt.Errorf("no gallery entry found for backend %q", backendName)
 	}
 	// Resolve integrity options (cosign verifier for OCI URIs, strict-mode
 	// gate for missing SHA256/policy) BEFORE writing anything to disk.
 	// Without this, the upgrade path would atomically swap in an
 	// unverified backend even when the gallery has a verification policy
 	// — see backendDownloadOptions in backends.go.
 	downloadOpts, err := backendDownloadOptions(galleryEntry, requireIntegrity)
 	if err != nil {
 		return fmt.Errorf("upgrade %q: %w", backendName, err)
 	}
 	backendPath := filepath.Join(systemState.Backend.BackendsPath, backendName)
 	tmpPath := backendPath + ".upgrade-tmp"
 	backupPath := backendPath + ".backup"
@@ -285,7 +295,7 @@ func UpgradeBackend(ctx context.Context, systemState *system.SystemState, modelL
 			return fmt.Errorf("failed to copy backend from directory: %w", err)
 		}
 	} else {
-		if err := uri.DownloadFileWithContext(ctx, tmpPath, "", 1, 1, downloadStatus); err != nil {
+		if err := uri.DownloadFileWithContext(ctx, tmpPath, galleryEntry.SHA256, 1, 1, downloadStatus, downloadOpts...); err != nil {
 			os.RemoveAll(tmpPath)
 			return fmt.Errorf("failed to download backend: %w", err)
 		}
--- a/core/gallery/upgrade_test.go
+++ b/core/gallery/upgrade_test.go
@@ -383,7 +383,7 @@ var _ = Describe("Upgrade Detection and Execution", func() {
 			})
 			ml := model.NewModelLoader(systemState)
-			err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil)
+			err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil, false)
 			Expect(err).NotTo(HaveOccurred())
 			// Verify run.sh was updated
@@ -417,7 +417,7 @@ var _ = Describe("Upgrade Detection and Execution", func() {
 			})
 			ml := model.NewModelLoader(systemState)
-			err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil)
+			err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil, false)
 			Expect(err).To(HaveOccurred())
 			// Verify v1 is still intact
@@ -432,5 +432,41 @@ var _ = Describe("Upgrade Detection and Execution", func() {
 			Expect(json.Unmarshal(metaData, &meta)).To(Succeed())
 			Expect(meta.Version).To(Equal("1.0.0"))
 		})
 		// Regression: an earlier version of UpgradeBackend wrote the
 		// downloaded bytes to disk without going through
 		// backendDownloadOptions, so the gallery's verification policy
 		// (and strict-integrity gate) didn't apply on upgrade. This test
 		// pins the upgrade path to the same integrity gate as installs:
 		// strict mode + an OCI URI without a verification: block must
 		// hard-fail *before* anything is downloaded or swapped in.
 		It("should refuse to upgrade an OCI backend that bypasses integrity in strict mode", func() {
 			installBackendWithVersion("my-backend", "1.0.0", "#!/bin/sh\necho v1")
 			// OCI URI, no Gallery.Verification → backendDownloadOptions
 			// returns a strict-integrity error before any network call.
 			writeGalleryYAML([]GalleryBackend{
 				{
 					Metadata: Metadata{
 						Name: "my-backend",
 					},
 					URI:     "oci://example.invalid/missing:never-fetched",
 					Version: "2.0.0",
 				},
 			})
 			ml := model.NewModelLoader(systemState)
 			err := UpgradeBackend(context.Background(), systemState, ml, galleries, "my-backend", nil, true)
 			Expect(err).To(HaveOccurred())
 			Expect(err.Error()).To(ContainSubstring("strict integrity"))
 			// The installed v1 must be untouched — the upgrade should
 			// have aborted before writing anything.
 			content, err := os.ReadFile(filepath.Join(backendsPath, "my-backend", "run.sh"))
 			Expect(err).NotTo(HaveOccurred())
 			Expect(string(content)).To(Equal("#!/bin/sh\necho v1"))
 			Expect(filepath.Join(backendsPath, "my-backend.upgrade-tmp")).NotTo(BeAnExistingFile())
 			Expect(filepath.Join(backendsPath, "my-backend.backup")).NotTo(BeAnExistingFile())
 		})
 	})
 })
--- a/core/http/app.go
+++ b/core/http/app.go
@@ -28,6 +28,7 @@ import (
 	"github.com/mudler/LocalAI/core/services/monitoring"
 	"github.com/mudler/LocalAI/core/services/nodes"
 	"github.com/mudler/LocalAI/core/services/quantization"
 	"github.com/mudler/LocalAI/pkg/signals"
 	"github.com/mudler/xlog"
 )
@@ -267,9 +268,12 @@ func API(application *application.Application) (*echo.Echo, error) {
 		e.Static("/generated-videos", videoPath)
 	}
-	// Initialize usage recording when auth DB is available
+	// Initialize usage recording when auth DB is available, and ensure the
 	// batcher drains its in-memory queue on graceful shutdown so the last
 	// few seconds of usage don't disappear when the process exits.
 	if application.AuthDB() != nil {
 		httpMiddleware.InitUsageRecorder(application.AuthDB())
 		signals.RegisterGracefulTerminationHandler(httpMiddleware.ShutdownUsageRecorder)
 	}
 	// Auth is applied to _all_ endpoints. Filtering out endpoints to bypass is
@@ -403,7 +407,7 @@ func API(application *application.Application) (*echo.Echo, error) {
 		}
 	}
 	routes.RegisterNodeSelfServiceRoutes(e, registry, distCfg.RegistrationToken, distCfg.AutoApproveNodes, application.AuthDB(), application.ApplicationConfig().Auth.APIKeyHMACSecret)
-	routes.RegisterNodeAdminRoutes(e, registry, remoteUnloader, adminMiddleware, application.AuthDB(), application.ApplicationConfig().Auth.APIKeyHMACSecret, application.ApplicationConfig().Distributed.RegistrationToken)
+	routes.RegisterNodeAdminRoutes(e, registry, remoteUnloader, application.GalleryService(), opcache, application.ApplicationConfig(), adminMiddleware, application.AuthDB(), application.ApplicationConfig().Auth.APIKeyHMACSecret, application.ApplicationConfig().Distributed.RegistrationToken)
 	// Distributed SSE routes (job progress + agent events via NATS)
 	if d := application.Distributed(); d != nil {
@@ -443,6 +447,25 @@ func API(application *application.Application) (*echo.Echo, error) {
 					baseTag := `<base href="` + httpMiddleware.SecureBaseHref(baseURL) + `" />`
 					indexHTML = []byte(strings.Replace(string(indexHTML), "<head>", "<head>\n  "+baseTag, 1))
 				}
 				// <base href> only changes how relative URLs resolve; path-absolute
 				// URLs (those starting with `/`) still resolve against the origin
 				// and would bypass the reverse-proxy prefix. Rewrite the internal
 				// path-absolute references emitted by the build so the browser
 				// requests them through the proxy under the prefix.
 				//
 				// HTML-escape the prefix before interpolating it into attributes:
 				// BasePathPrefix already gates X-Forwarded-Prefix via
 				// SafeForwardedPrefix, but the validator only blocks open-redirect
 				// shapes (// prefix, backslashes, control chars), not attribute
 				// breakout characters like `"`. Escaping makes this resilient
 				// even if the validator ever loosens.
 				if prefix := httpMiddleware.BasePathPrefix(c); prefix != "/" {
 					safePrefix := httpMiddleware.SecureBaseHref(prefix)
 					html := string(indexHTML)
 					html = strings.ReplaceAll(html, `="/assets/`, `="`+safePrefix+`assets/`)
 					html = strings.ReplaceAll(html, `="/favicon.svg"`, `="`+safePrefix+`favicon.svg"`)
 					indexHTML = []byte(html)
 				}
 				return c.HTMLBlob(http.StatusOK, indexHTML)
 			}
--- a/core/http/app_test.go
+++ b/core/http/app_test.go
@@ -446,6 +446,42 @@ var _ = Describe("API test", func() {
 				Expect(sc).To(Equal(200), "status code")
 				Expect(string(body)).To(ContainSubstring(`<base href="https://example.org/myprefix/" />`), "body")
 			})
 			// Caddy's `handle_path` (and similar directives) strip the matched
 			// prefix before forwarding upstream, so LocalAI receives the
 			// already-stripped path together with X-Forwarded-Prefix. The base
 			// href and asset URLs must still include the prefix so the browser
 			// requests them through the proxy.
 			It("Should support reverse-proxy when prefix is stripped by the proxy", func() {
 				err, sc, body := getRequest("http://127.0.0.1:9090/app", http.Header{
 					"X-Forwarded-Proto":  {"https"},
 					"X-Forwarded-Host":   {"example.org"},
 					"X-Forwarded-Prefix": {"/myprefix"},
 				})
 				Expect(err).To(BeNil(), "error")
 				Expect(sc).To(Equal(200), "status code")
 				Expect(string(body)).To(ContainSubstring(`<base href="https://example.org/myprefix/" />`), "body")
 				Expect(string(body)).ToNot(ContainSubstring(`="/assets/`), "asset URLs must include the prefix")
 				Expect(string(body)).ToNot(ContainSubstring(`="/favicon.svg"`), "favicon URL must include the prefix")
 			})
 			// X-Forwarded-Prefix is attacker-controllable on misconfigured
 			// proxy chains. A value like "//evil.com" would otherwise turn the
 			// asset URL rewrite into a protocol-relative URL that loads JS
 			// from a foreign origin. BasePathPrefix must reject these via
 			// SafeForwardedPrefix and fall back to "/".
 			It("Should ignore an unsafe X-Forwarded-Prefix and not poison asset URLs", func() {
 				err, sc, body := getRequest("http://127.0.0.1:9090/app", http.Header{
 					"X-Forwarded-Proto":  {"https"},
 					"X-Forwarded-Host":   {"example.org"},
 					"X-Forwarded-Prefix": {"//evil.com"},
 				})
 				Expect(err).To(BeNil(), "error")
 				Expect(sc).To(Equal(200), "status code")
 				Expect(string(body)).ToNot(ContainSubstring("evil.com"), "unsafe prefix must not leak into the response")
 				Expect(string(body)).ToNot(ContainSubstring(`="//`), "asset URLs must not become protocol-relative")
 			})
 		})
 		Context("Applying models", func() {
--- a/core/http/auth/db.go
+++ b/core/http/auth/db.go
@@ -38,9 +38,15 @@ func InitDB(databaseURL string) (*gorm.DB, error) {
 	}
 	// Backfill: users created before the provider column existed have an empty
-	// provider — treat them as local accounts so the UI can identify them.
+	// provider - treat them as local accounts so the UI can identify them.
 	db.Exec("UPDATE users SET provider = ? WHERE provider = '' OR provider IS NULL", ProviderLocal)
 	// Backfill: pre-feature usage_records have no source column. Classify them so the
 	// new per-source aggregators include them.
 	if err := BackfillUsageSource(db); err != nil {
 		return nil, fmt.Errorf("failed to backfill usage source: %w", err)
 	}
 	// Create composite index on users(provider, subject) for fast OAuth lookups
 	if err := db.Exec("CREATE INDEX IF NOT EXISTS idx_users_provider_subject ON users(provider, subject)").Error; err != nil {
 		// Ignore error on postgres if index already exists
--- a/core/http/auth/middleware.go
+++ b/core/http/auth/middleware.go
@@ -16,8 +16,10 @@ import (
 )
 const (
-	contextKeyUser = "auth_user"
+	contextKeyUser   = "auth_user"
-	contextKeyRole = "auth_role"
+	contextKeyRole   = "auth_role"
 	contextKeyAPIKey = "auth_apikey"
 	contextKeySource = "auth_source"
 )
 // Middleware returns an Echo middleware that handles authentication.
@@ -75,6 +77,7 @@ func Middleware(db *gorm.DB, appConfig *config.ApplicationConfig) echo.Middlewar
 					}
 					c.Set(contextKeyUser, syntheticUser)
 					c.Set(contextKeyRole, RoleAdmin)
 					c.Set(contextKeySource, UsageSourceLegacy)
 					authenticated = true
 				}
 			}
@@ -213,6 +216,20 @@ func GetUserRole(c echo.Context) string {
 	return role
 }
 // GetAPIKey returns the resolved API key from the echo context, or nil.
 // Nil for session-cookie and legacy-env-key authentication.
 func GetAPIKey(c echo.Context) *UserAPIKey {
 	k, _ := c.Get(contextKeyAPIKey).(*UserAPIKey)
 	return k
 }
 // GetSource returns the request's authentication source: UsageSourceAPIKey,
 // UsageSourceWeb, UsageSourceLegacy, or empty if no authentication was performed.
 func GetSource(c echo.Context) string {
 	s, _ := c.Get(contextKeySource).(string)
 	return s
 }
 // RequireRouteFeature returns a global middleware that checks the user has access
 // to the feature required by the matched route. It uses the RouteFeatureRegistry
 // to look up the required feature for each route pattern + HTTP method.
@@ -421,47 +438,67 @@ func RequireQuota(db *gorm.DB) echo.MiddlewareFunc {
 }
 // tryAuthenticate attempts to authenticate the request using the database.
 //
 // On success it returns the user and, as a side effect, sets the following
 // values on the Echo context:
 //   - contextKeySource ("auth_source"): always set, one of UsageSourceWeb /
 //     UsageSourceAPIKey. UsageSourceLegacy is set elsewhere by the parent
 //     Middleware when a legacy env key matches.
 //   - contextKeyAPIKey ("auth_apikey"): set to the resolved *UserAPIKey for
 //     named-key branches (Bearer, x-api-key, xi-api-key, token cookie).
 //   - "_auth_session": session record, used by Middleware to drive cookie
 //     rotation. Only set on the session-cookie branch.
 //
 // contextKeyUser and contextKeyRole are populated by the parent Middleware
 // after this function returns.
 func tryAuthenticate(c echo.Context, db *gorm.DB, appConfig *config.ApplicationConfig) *User {
 	hmacSecret := appConfig.Auth.APIKeyHMACSecret
-	// a. Session cookie
+	// a. Session cookie -> web UI
 	if cookie, err := c.Cookie(sessionCookie); err == nil && cookie.Value != "" {
 		if user, session := ValidateSession(db, cookie.Value, hmacSecret); user != nil {
 			// Store session for rotation check in middleware
 			c.Set("_auth_session", session)
 			c.Set(contextKeySource, UsageSourceWeb)
 			return user
 		}
 	}
-	// b. Authorization: Bearer token
+	// b. Authorization: Bearer
 	authHeader := c.Request().Header.Get("Authorization")
 	if strings.HasPrefix(authHeader, "Bearer ") {
 		token := strings.TrimPrefix(authHeader, "Bearer ")
-		// Try as session ID first
+		// b1. Session token via Bearer -> still web UI
 		if user, _ := ValidateSession(db, token, hmacSecret); user != nil {
 			c.Set(contextKeySource, UsageSourceWeb)
 			return user
 		}
-		// Try as user API key
+		// b2. Named API key
 		if key, err := ValidateAPIKey(db, token, hmacSecret); err == nil {
 			c.Set(contextKeySource, UsageSourceAPIKey)
 			c.Set(contextKeyAPIKey, key)
 			return &key.User
 		}
 	}
-	// c. x-api-key / xi-api-key headers
+	// c. x-api-key / xi-api-key -> named API key
 	for _, header := range []string{"x-api-key", "xi-api-key"} {
-		if key := c.Request().Header.Get(header); key != "" {
+		if k := c.Request().Header.Get(header); k != "" {
-			if apiKey, err := ValidateAPIKey(db, key, hmacSecret); err == nil {
+			if apiKey, err := ValidateAPIKey(db, k, hmacSecret); err == nil {
 				c.Set(contextKeySource, UsageSourceAPIKey)
 				c.Set(contextKeyAPIKey, apiKey)
 				return &apiKey.User
 			}
 		}
 	}
-	// d. token cookie (legacy)
+	// d. token cookie -> named API key
 	if cookie, err := c.Cookie("token"); err == nil && cookie.Value != "" {
 		// Try as user API key
 		if key, err := ValidateAPIKey(db, cookie.Value, hmacSecret); err == nil {
 			c.Set(contextKeySource, UsageSourceAPIKey)
 			c.Set(contextKeyAPIKey, key)
 			return &key.User
 		}
 	}
--- a/core/http/auth/middleware_test.go
+++ b/core/http/auth/middleware_test.go
@@ -303,4 +303,122 @@ var _ = Describe("Auth Middleware", func() {
 			}
 		})
 	})
 	Describe("auth context plumbing for usage source", func() {
 		// probeApp builds a minimal echo app with the auth middleware and a single
 		// "/probe" route that captures the user, source, and apikey from context.
 		type probe struct {
 			user   *auth.User
 			source string
 			key    *auth.UserAPIKey
 		}
 		probeApp := func(db *gorm.DB, appConfig *config.ApplicationConfig, p *probe) *echo.Echo {
 			e := echo.New()
 			e.Use(auth.Middleware(db, appConfig))
 			e.GET("/probe", func(c echo.Context) error {
 				p.user = auth.GetUser(c)
 				p.source = auth.GetSource(c)
 				p.key = auth.GetAPIKey(c)
 				return c.NoContent(http.StatusOK)
 			})
 			return e
 		}
 		It("session cookie sets source=web, apikey=nil", func() {
 			db := testDB()
 			appConfig := config.NewApplicationConfig()
 			user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
 			token := createTestSession(db, user.ID)
 			var p probe
 			app := probeApp(db, appConfig, &p)
 			rec := doRequest(app, http.MethodGet, "/probe", withSessionCookie(token))
 			Expect(rec.Code).To(Equal(http.StatusOK))
 			Expect(p.user).ToNot(BeNil())
 			Expect(p.user.ID).To(Equal(user.ID))
 			Expect(p.source).To(Equal(auth.UsageSourceWeb))
 			Expect(p.key).To(BeNil())
 		})
 		It("Bearer session token sets source=web, apikey=nil", func() {
 			db := testDB()
 			appConfig := config.NewApplicationConfig()
 			user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
 			token := createTestSession(db, user.ID)
 			var p probe
 			app := probeApp(db, appConfig, &p)
 			rec := doRequest(app, http.MethodGet, "/probe", withBearerToken(token))
 			Expect(rec.Code).To(Equal(http.StatusOK))
 			Expect(p.user).ToNot(BeNil())
 			Expect(p.user.ID).To(Equal(user.ID))
 			Expect(p.source).To(Equal(auth.UsageSourceWeb))
 			Expect(p.key).To(BeNil())
 		})
 		It("Bearer API key sets source=apikey and exposes the resolved *UserAPIKey", func() {
 			db := testDB()
 			appConfig := config.NewApplicationConfig()
 			user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
 			plaintext, key, err := auth.CreateAPIKey(db, user.ID, "ci", auth.RoleUser, appConfig.Auth.APIKeyHMACSecret, nil)
 			Expect(err).ToNot(HaveOccurred())
 			var p probe
 			app := probeApp(db, appConfig, &p)
 			rec := doRequest(app, http.MethodGet, "/probe", withBearerToken(plaintext))
 			Expect(rec.Code).To(Equal(http.StatusOK))
 			Expect(p.source).To(Equal(auth.UsageSourceAPIKey))
 			Expect(p.key).ToNot(BeNil())
 			Expect(p.key.ID).To(Equal(key.ID))
 		})
 		It("x-api-key header sets source=apikey", func() {
 			db := testDB()
 			appConfig := config.NewApplicationConfig()
 			user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
 			plaintext, _, err := auth.CreateAPIKey(db, user.ID, "ci", auth.RoleUser, appConfig.Auth.APIKeyHMACSecret, nil)
 			Expect(err).ToNot(HaveOccurred())
 			var p probe
 			app := probeApp(db, appConfig, &p)
 			rec := doRequest(app, http.MethodGet, "/probe", withXApiKey(plaintext))
 			Expect(rec.Code).To(Equal(http.StatusOK))
 			Expect(p.source).To(Equal(auth.UsageSourceAPIKey))
 			Expect(p.key).ToNot(BeNil())
 		})
 		It("token cookie sets source=apikey", func() {
 			db := testDB()
 			appConfig := config.NewApplicationConfig()
 			user := createTestUser(db, "alice@example.com", auth.RoleUser, auth.ProviderLocal)
 			plaintext, _, err := auth.CreateAPIKey(db, user.ID, "ci", auth.RoleUser, appConfig.Auth.APIKeyHMACSecret, nil)
 			Expect(err).ToNot(HaveOccurred())
 			var p probe
 			app := probeApp(db, appConfig, &p)
 			rec := doRequest(app, http.MethodGet, "/probe", withTokenCookie(plaintext))
 			Expect(rec.Code).To(Equal(http.StatusOK))
 			Expect(p.source).To(Equal(auth.UsageSourceAPIKey))
 			Expect(p.key).ToNot(BeNil())
 		})
 		It("legacy env key sets source=legacy, apikey=nil", func() {
 			db := testDB()
 			appConfig := config.NewApplicationConfig()
 			appConfig.ApiKeys = []string{"legacy-secret"}
 			var p probe
 			app := probeApp(db, appConfig, &p)
 			rec := doRequest(app, http.MethodGet, "/probe", withBearerToken("legacy-secret"))
 			Expect(rec.Code).To(Equal(http.StatusOK))
 			Expect(p.source).To(Equal(auth.UsageSourceLegacy))
 			Expect(p.key).To(BeNil())
 		})
 	})
 })