docs(llama-cpp): correct run.sh comment for arm64/darwin cpu-all

arm64 and darwin CPU images now also ship llama-cpp-cpu-all (not fallback-only); only GPU images ship fallback-only. Fix the stale comment to match.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
fix(llama-cpp,turboquant): only CPU_ALL_VARIANTS for pure-CPU builds, GPU uses fallback
2026-06-25 09:09:07 -04:00 · 2026-06-25 07:05:06 +00:00 · 2026-06-25 07:04:06 +00:00 · 2026-06-24 21:59:29 +00:00 · 2026-06-24 21:50:29 +00:00 · 2026-06-24 21:33:32 +00:00
525 changed files with 32232 additions and 8377 deletions
--- a/.agents/adding-backends.md
+++ b/.agents/adding-backends.md
@@ -198,6 +198,27 @@ docker-build-backends: ... docker-build-<backend-name>
 - If the backend is in `backend/python/<backend-name>/` but uses `.` as context in the workflow file, use `.` context
 - Check similar backends to determine the correct context
 ## Documenting the backend (README + docs)
 A backend is not "added" until it is discoverable. Update the user-facing docs:
 - **`docs/content/features/backends.md`** - add the backend to the right
  category in the "LocalAI supports various types of backends" list (and add a
  new category if it introduces a new modality, e.g. sound classification).
 - If the backend introduces a **new API surface** (a new endpoint or a realtime
  capability), document it under `docs/content/` where its area lives (audio,
  vision, etc.) and follow the api-endpoints checklist in
  [api-endpoints-and-auth.md](api-endpoints-and-auth.md).
 **If the backend is a native C/C++/GGML engine created and maintained by the
 LocalAI team** (a from-scratch port like `parakeet.cpp`, `ced.cpp`,
 `vibevoice.cpp`, `rf-detr.cpp`, not a wrapper around a third-party runtime), it
 ALSO belongs in the top-level **`README.md`** table under "native C/C++/GGML
 engines ... developed and maintained by the LocalAI project itself". Add a row
 linking the upstream engine repo with a one-line description. This is the
 project's showcase of its own engines; a new in-house backend that is missing
 from it is a documentation bug.
 ## 5. Verification Checklist
 After adding a new backend, verify:
@@ -211,6 +232,8 @@ After adding a new backend, verify:
 - [ ] No YAML syntax errors (check with linter)
 - [ ] No Makefile syntax errors (check with linter)
 - [ ] Follows the same pattern as similar backends (e.g., if it's a transcription backend, follow `faster-whisper` pattern)
 - [ ] Documented: added to the category list in `docs/content/features/backends.md` (and any new endpoint/realtime capability documented under `docs/content/`)
 - [ ] If it is an in-house native C/C++/GGML engine, added to the maintained-engines table in the top-level `README.md`
 ## Bundling runtime shared libraries (`package.sh`)
--- a/.agents/ds4-backend.md
+++ b/.agents/ds4-backend.md
@@ -44,6 +44,39 @@ maps to `DS4_THINK_HIGH`. We pass the chosen mode to `ds4_chat_append_assistant_
 via `ModelOptions.Options[] = "kv_cache_dir:/some/path"`. Format is **our own** -
 NOT bit-compatible with ds4-server's KVC files (interop is a follow-up plan).
 ## Engine options (LoadModel)
 `LoadModel` maps `ModelOptions.Options[]` (`"key:value"`, from model-YAML
 `options:`) onto `ds4_engine_options` through a **declarative table**
 (`kEngineOptSpecs` + `apply_engine_option` in `grpc-server.cpp`). The struct is
 plain C with no reflection, so the field set is enumerated once in the table;
 adding a future engine knob is a one-line table row, not a new branch. Unknown
 keys are ignored (back-compat). A bare flag (`ssd_streaming` with no value)
 means `true`. Path-type values (`mtp_path`, `expert_profile_path`,
 `directional_steering_file`) resolve **relative to the model directory**, so a
 gallery entry can reference a companion file it downloaded by bare filename;
 absolute values pass through. `ds4_role` / `ds4_layers` / `ds4_listen` /
 `ds4_route_timeout` / `kv_cache_dir` keep their dedicated handling (validation
 + coordinator wiring) and are not in the table.
 Wired keys: `mtp_path`, `mtp_draft`, `mtp_margin`, `prefill_chunk`,
 `power_percent`, `warm_weights`, `quality`, `ssd_streaming`,
 `ssd_streaming_cold`, `ssd_streaming_preload_experts`,
 `ssd_streaming_cache_experts` (count or `NGB`, sets both experts+bytes via
 `ds4_parse_streaming_cache_experts_arg`), `simulate_used_memory` (`NGB` via
 `ds4_parse_gib_arg`), `expert_profile_path`, `directional_steering_file`,
 `directional_steering_attn`, `directional_steering_ffn`.
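
The option semantics described above (bare flag means `true`, path-type values resolve against the model directory, absolute paths pass through) can be sketched in a few lines of shell. This is only an illustration of the documented behaviour, not the C++ table in `grpc-server.cpp`: `parse_option` and `PATH_KEYS` are hypothetical names, `/models` is an assumed model directory, and the sketch does not model the "unknown keys are ignored" rule or the dedicated handling of `ds4_role` and friends.

```bash
#!/usr/bin/env bash
# Illustrative sketch of the documented "key:value" semantics. This is NOT the
# C++ table in grpc-server.cpp; parse_option and PATH_KEYS are hypothetical names.

# Keys documented as path-type: their values resolve against the model directory.
PATH_KEYS="mtp_path expert_profile_path directional_steering_file"

# parse_option MODEL_DIR OPTION
# Prints the parsed option as "key=value".
parse_option() {
  local model_dir="$1" opt="$2" key value
  key="${opt%%:*}"            # text before the first ':'
  if [ "$key" = "$opt" ]; then
    value="true"              # bare flag, e.g. "ssd_streaming"
  else
    value="${opt#*:}"         # text after the first ':'
    case " $PATH_KEYS " in
      *" $key "*)
        # Relative path values are joined to the model directory;
        # absolute values pass through unchanged.
        case "$value" in
          /*) ;;
          *) value="$model_dir/$value" ;;
        esac
        ;;
    esac
  fi
  printf '%s=%s\n' "$key" "$value"
}

parse_option /models "ssd_streaming"            # ssd_streaming=true
parse_option /models "mtp_path:mtp.gguf"        # mtp_path=/models/mtp.gguf
parse_option /models "mtp_path:/abs/mtp.gguf"   # mtp_path=/abs/mtp.gguf
```

Splitting on the first `:` only means a value such as `kv_cache_dir:/some/path` keeps any later colons intact.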
 ## SSD streaming (running models larger than RAM)
 ds4's **SSD streaming** keeps non-routed weights resident and streams routed MoE
 experts from the GGUF on cache misses, turning "does it fit in RAM" into a speed
 spectrum. **Metal (Darwin) only** - it is a no-op on CUDA/CPU. Enable with
 `options: ["ssd_streaming"]`; size the routed-expert cache with
 `ssd_streaming_cache_experts:NGB` (omit for ds4's automatic 80%-of-working-set
 budget). Gallery entries built on this: `deepseek-v4-flash-q4-ssd` (153 GB Flash
 on a 128 GB Mac) and `deepseek-v4-pro-q2-ssd` (433 GB Pro, experimental).
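
As a rough sketch of the two configurations just described, the helper below prints the model-YAML `options:` line with and without an explicit routed-expert cache size. `ssd_streaming_options` is a hypothetical helper, the value `64` is an arbitrary example, and reading `NGB` as "an integer followed by `GB`" is an assumption about the accepted syntax.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints a model-YAML `options:` line for SSD streaming.
# ssd_streaming_options [CACHE_GIB]
ssd_streaming_options() {
  local cache_gib="${1:-}"
  if [ -z "$cache_gib" ]; then
    # No explicit size: ds4 uses its automatic routed-expert cache budget.
    printf 'options: ["ssd_streaming"]\n'
  else
    # Explicit size, assuming the NGB form is an integer followed by "GB".
    printf 'options: ["ssd_streaming", "ssd_streaming_cache_experts:%sGB"]\n' "$cache_gib"
  fi
}

ssd_streaming_options      # options: ["ssd_streaming"]
ssd_streaming_options 64   # options: ["ssd_streaming", "ssd_streaming_cache_experts:64GB"]
```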
 ## Build matrix
 | Build | Where | Notes |
--- a/.docker/install-base-deps.sh
+++ b/.docker/install-base-deps.sh
@@ -70,6 +70,12 @@ if [ "${BUILD_TYPE:-}" = "vulkan" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; t
        git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
        ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
        clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
    # Mesa Vulkan ICD drivers (ANV/RADV/lavapipe + Arm SoC) and their ICD
    # manifests. The LunarG SDK below only provides the loader and shader
    # tooling, not hardware drivers — without Mesa the packaged Vulkan backend
    # would ship a loader that finds no GPU. package-gpu-libs.sh bundles these
    # .so files plus their deps into the backend so it stays self-contained.
    apt-get install -y mesa-vulkan-drivers libdrm2
    if [ "amd64" = "${TARGETARCH:-}" ]; then
        wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz"
        tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz
--- a/.docker/llama-cpp-compile.sh
+++ b/.docker/llama-cpp-compile.sh
@@ -17,19 +17,29 @@ if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
  rm -rf /LocalAI/backend/cpp/llama-cpp-*-build
 fi
-if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
-  cd /LocalAI/backend/cpp/llama-cpp
-  make llama-cpp-fallback
-  make llama-cpp-grpc
-  make llama-cpp-rpc-server
+cd /LocalAI/backend/cpp/llama-cpp
+if [ -z "${BUILD_TYPE:-}" ]; then
+  # Pure CPU image (BUILD_TYPE empty): one build with ggml CPU_ALL_VARIANTS replaces the
+  # per-microarch binaries (x86: avx/avx2/avx512/fallback; arm64: armv8.x/armv9.x). ggml
+  # dlopens the best libggml-cpu-*.so at runtime by probing host CPU features.
+  #
+  # arm64: the CPU_ALL_VARIANTS table includes armv9.2 SME variants whose -march=...+sme is
+  # rejected by the Ubuntu 24.04 default gcc-13. gcc-14 accepts it, so build the arm64
+  # variants with it (the host never *selects* SME unless it has it, but every variant must
+  # still compile).
+  if [ "${TARGETARCH}" = "arm64" ]; then
+    apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
+    export CC=gcc-14 CXX=g++-14
+  fi
+  make llama-cpp-cpu-all
 else
-  cd /LocalAI/backend/cpp/llama-cpp
-  make llama-cpp-avx
-  make llama-cpp-avx2
-  make llama-cpp-avx512
+  # GPU build (cublas/hipblas/sycl/vulkan/...): the accelerator does the compute, so a
+  # single fallback CPU build is enough - no per-microarch CPU variants needed. (This also
+  # keeps the heavy GPU backend compile from also building the whole CPU variant matrix,
+  # and avoids the gcc-14 apt step on GPU base images such as nvidia l4t.)
   make llama-cpp-fallback
-  make llama-cpp-grpc
-  make llama-cpp-rpc-server
 fi
+make llama-cpp-grpc
+make llama-cpp-rpc-server
 ccache -s || true
--- a/.docker/turboquant-compile.sh
+++ b/.docker/turboquant-compile.sh
@@ -19,17 +19,21 @@ fi
 cd /LocalAI/backend/cpp/turboquant
-if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
-  make turboquant-fallback
-  make turboquant-grpc
-  make turboquant-rpc-server
+if [ -z "${BUILD_TYPE:-}" ]; then
+  # Pure CPU image: one ggml CPU_ALL_VARIANTS build replaces the per-microarch binaries.
+  # arm64: the armv9.2 SME variants need gcc-14 (gcc-13 rejects +sme).
+  if [ "${TARGETARCH}" = "arm64" ]; then
+    apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
+    export CC=gcc-14 CXX=g++-14
+  fi
+  make turboquant-cpu-all
 else
-  make turboquant-avx
-  make turboquant-avx2
-  make turboquant-avx512
+  # GPU build (cublas/hipblas/sycl/vulkan/...): single fallback CPU build, the accelerator
+  # does the compute. Keeps the GPU compile from also building the CPU variant matrix and
+  # avoids the gcc-14 apt step on GPU base images such as nvidia l4t.
   make turboquant-fallback
-  make turboquant-grpc
-  make turboquant-rpc-server
 fi
+make turboquant-grpc
+make turboquant-rpc-server
 ccache -s || true
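
The selection logic shared by the two compile scripts above can be condensed into a small sketch: an empty `BUILD_TYPE` means a pure-CPU image that builds the `cpu-all` target, anything else builds only the fallback CPU target, and the gRPC and RPC server targets are built in both cases. `cpu_targets` and `needs_gcc14` are hypothetical helper names introduced here for illustration; the real scripts inline this logic and call `make` directly.

```bash
#!/usr/bin/env bash
# Hypothetical helpers summarising the branch taken by the compile scripts.

# cpu_targets BUILD_TYPE [BACKEND]
# Prints the make targets, one per line, in build order.
cpu_targets() {
  local build_type="${1:-}" backend="${2:-llama-cpp}"
  if [ -z "$build_type" ]; then
    echo "${backend}-cpu-all"     # pure CPU image: all ggml CPU variants
  else
    echo "${backend}-fallback"    # GPU image: single fallback CPU build
  fi
  echo "${backend}-grpc"
  echo "${backend}-rpc-server"
}

# needs_gcc14 BUILD_TYPE TARGETARCH
# Succeeds only for pure-CPU arm64 builds, where the SME variants need gcc-14.
needs_gcc14() {
  [ -z "${1:-}" ] && [ "${2:-}" = "arm64" ]
}

cpu_targets "" llama-cpp          # llama-cpp-cpu-all, llama-cpp-grpc, llama-cpp-rpc-server
cpu_targets cublas turboquant     # turboquant-fallback, turboquant-grpc, turboquant-rpc-server
needs_gcc14 "" arm64 && echo "install gcc-14"
```

Keying the branch on `BUILD_TYPE` alone, rather than on `TARGETARCH`, is what lets arm64 CPU images get the full variant set while arm64 GPU images such as l4t skip both the variant matrix and the gcc-14 install.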
--- a/.dockerignore
+++ b/.dockerignore
@@ -31,6 +31,15 @@ backend/python/**/source
 backend/cpp/llama-cpp/llama.cpp
 backend/cpp/llama-cpp-*-build
 # privacy-filter: same in-place pattern. The Makefile fetches privacy-filter.cpp
 # at the pinned commit (or symlinks a PRIVACY_FILTER_SRC checkout for local dev).
 # A stale dir/symlink COPY'd into the image makes the clone step fail (dangling
 # symlink) or compile against the wrong commit, so keep host build state out.
 backend/cpp/privacy-filter/privacy-filter.cpp
 backend/cpp/privacy-filter/build
 backend/cpp/privacy-filter/grpc-server
 backend/cpp/privacy-filter/package
 # Rust backend build output (sources are tracked; target/ is generated)
 backend/rust/*/target
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -716,6 +716,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "8"
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-12-depth-anything-cpp'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "depth-anything-cpp"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "8"
@@ -1582,6 +1595,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-13-depth-anything-cpp'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "depth-anything-cpp"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -1621,6 +1647,19 @@ include:
    backend: "locate-anything-cpp"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/arm64'
    skip-drivers: 'false'
    tag-latest: 'auto'
    tag-suffix: '-nvidia-l4t-cuda-13-arm64-depth-anything-cpp'
    base-image: "ubuntu:24.04"
    ubuntu-version: '2404'
    runs-on: 'ubuntu-24.04-arm'
    backend: "depth-anything-cpp"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -2631,6 +2670,78 @@ include:
    dockerfile: "./backend/Dockerfile.ds4"
    context: "./"
    ubuntu-version: '2404'
  # privacy-filter: PII/NER token classifier (per-arch native -> manifest merge).
  # Every variant builds FROM a prebuilt quay.io/go-skynet/ci-cache:base-grpc-*
  # image (gRPC + cmake + protoc + conditional CUDA/Vulkan already installed),
  # exactly like llama-cpp — no toolchain is installed in Dockerfile.privacy-filter.
  # builder-base-image makes the workflow use the Dockerfile's builder-prebuilt
  # stage; without it (local builds) the builder-fromsource stage runs the same
  # .docker/install-base-deps.sh.
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    platform-tag: 'amd64'
    tag-latest: 'auto'
    tag-suffix: '-cpu-privacy-filter'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-amd64'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'true'
    backend: "privacy-filter"
    dockerfile: "./backend/Dockerfile.privacy-filter"
    context: "./"
    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/arm64'
    platform-tag: 'arm64'
    tag-latest: 'auto'
    tag-suffix: '-cpu-privacy-filter'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-arm64'
    runs-on: 'ubuntu-24.04-arm'
    base-image: "ubuntu:24.04"
    skip-drivers: 'true'
    backend: "privacy-filter"
    dockerfile: "./backend/Dockerfile.privacy-filter"
    context: "./"
    ubuntu-version: '2404'
  # Vulkan: base-grpc-vulkan-amd64 carries the SDK. arm64 vulkan is a one-line
  # add once amd64 is proven in CI.
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    platform-tag: 'amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-vulkan-privacy-filter'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-vulkan-amd64'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "privacy-filter"
    dockerfile: "./backend/Dockerfile.privacy-filter"
    context: "./"
    ubuntu-version: '2404'
  # CUDA: base-grpc-cuda-13-amd64 carries the toolkit; BUILD_TYPE=cublas ->
  # -DPF_CUDA=ON. cuda-12 and arm64/l4t are one-line adds once cuda-13 amd64 is
  # proven in CI.
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/amd64'
    platform-tag: 'amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-13-privacy-filter'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'true'
    backend: "privacy-filter"
    dockerfile: "./backend/Dockerfile.privacy-filter"
    context: "./"
    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2898,6 +3009,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-cpu-depth-anything-cpp'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "depth-anything-cpp"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'sycl_f32'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2911,6 +3035,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'sycl_f32'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-intel-sycl-f32-depth-anything-cpp'
    runs-on: 'ubuntu-latest'
    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
    skip-drivers: 'false'
    backend: "depth-anything-cpp"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'sycl_f16'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2924,6 +3061,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'sycl_f16'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-intel-sycl-f16-depth-anything-cpp'
    runs-on: 'ubuntu-latest'
    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
    skip-drivers: 'false'
    backend: "depth-anything-cpp"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2938,6 +3088,20 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    platform-tag: 'amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-vulkan-depth-anything-cpp'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "depth-anything-cpp"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2952,6 +3116,20 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/arm64'
    platform-tag: 'arm64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-vulkan-depth-anything-cpp'
    runs-on: 'ubuntu-24.04-arm'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "depth-anything-cpp"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'sycl_f32'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -3058,6 +3236,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2204'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "0"
    platforms: 'linux/arm64'
    skip-drivers: 'false'
    tag-latest: 'auto'
    tag-suffix: '-nvidia-l4t-arm64-depth-anything-cpp'
    base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
    runs-on: 'ubuntu-24.04-arm'
    backend: "depth-anything-cpp"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2204'
  # whisper
  - build-type: ''
    cuda-major-version: ""
@@ -3384,6 +3575,154 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  # ced
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "8"
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-12-ced'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-13-ced'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/arm64'
    skip-drivers: 'false'
    tag-latest: 'auto'
    tag-suffix: '-nvidia-l4t-cuda-13-arm64-ced'
    base-image: "ubuntu:24.04"
    ubuntu-version: '2404'
    runs-on: 'ubuntu-24.04-arm'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    platform-tag: 'amd64'
    tag-latest: 'auto'
    tag-suffix: '-cpu-ced'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/arm64'
    platform-tag: 'arm64'
    tag-latest: 'auto'
    tag-suffix: '-cpu-ced'
    runs-on: 'ubuntu-24.04-arm'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'sycl_f32'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-intel-sycl-f32-ced'
    runs-on: 'ubuntu-latest'
    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'sycl_f16'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-intel-sycl-f16-ced'
    runs-on: 'ubuntu-latest'
    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    platform-tag: 'amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-vulkan-ced'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/arm64'
    platform-tag: 'arm64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-vulkan-ced'
    runs-on: 'ubuntu-24.04-arm'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "0"
    platforms: 'linux/arm64'
    skip-drivers: 'false'
    tag-latest: 'auto'
    tag-suffix: '-nvidia-l4t-arm64-ced'
    base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
    runs-on: 'ubuntu-24.04-arm'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2204'
  - build-type: 'hipblas'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-rocm-hipblas-ced'
    base-image: "rocm/dev-ubuntu-24.04:7.2.1"
    runs-on: 'ubuntu-latest'
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  # acestep-cpp
  - build-type: ''
    cuda-major-version: ""
@@ -4490,6 +4829,36 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  # supertonic CPU (amd64)
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    platform-tag: 'amd64'
    tag-latest: 'auto'
    tag-suffix: '-cpu-supertonic'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "supertonic"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  # supertonic CPU (arm64)
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/arm64'
    platform-tag: 'arm64'
    tag-latest: 'auto'
    tag-suffix: '-cpu-supertonic'
    runs-on: 'ubuntu-24.04-arm'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "supertonic"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
 # Darwin matrix (consumed by backend-jobs-darwin).
 includeDarwin:
@@ -4533,6 +4902,10 @@ includeDarwin:
    tag-suffix: "-metal-darwin-arm64-parakeet-cpp"
    build-type: "metal"
    lang: "go"
  - backend: "ced"
    tag-suffix: "-metal-darwin-arm64-ced"
    build-type: "metal"
    lang: "go"
  - backend: "acestep-cpp"
    tag-suffix: "-metal-darwin-arm64-acestep-cpp"
    build-type: "metal"
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
@@ -44,7 +44,7 @@ jobs:
      has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
      - name: Setup Bun
        uses: oven-sh/setup-bun@v2
--- a/.github/workflows/backend_build.yml
+++ b/.github/workflows/backend_build.yml
@@ -101,7 +101,7 @@ jobs:
    steps:
      - name: Checkout
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
--- a/.github/workflows/backend_build_darwin.yml
+++ b/.github/workflows/backend_build_darwin.yml
@@ -57,7 +57,7 @@ jobs:
      HOMEBREW_NO_ANALYTICS: '1'
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
@@ -98,6 +98,7 @@ jobs:
            /opt/homebrew/Cellar/hiredis
            /opt/homebrew/Cellar/xxhash
            /opt/homebrew/Cellar/zstd
            /opt/homebrew/Cellar/nlohmann-json
          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
      - name: Dependencies
@@ -109,7 +110,10 @@ jobs:
          # Without explicitly installing them, a brew cache-hit run restores
          # ccache's Cellar dir but skips installing those transitive deps,
          # and ccache fails at runtime with `dyld: Library not loaded`.
-          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd
+          # nlohmann-json is header-only and required by the ds4 backend
+          # (dsml_renderer.cpp includes <nlohmann/json.hpp>); on Linux it comes
+          # from the apt-installed nlohmann-json3-dev in the build image.
+          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd nlohmann-json
          # Force-reinstall ccache so brew re-validates its full runtime-dep
          # closure on every run. This is the durable fix: when the upstream
          # ccache formula gains a new transitive dep (as it has multiple times
@@ -128,7 +132,7 @@ jobs:
          # and decides "already installed" without re-linking, so on a cache-
          # hit run the formulas aren't on PATH. Force-link them; --overwrite
          # tolerates pre-existing symlinks from earlier installs.
-          brew link --overwrite protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd 2>/dev/null || true
+          brew link --overwrite protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd nlohmann-json 2>/dev/null || true
      - name: Save Homebrew cache
        if: github.event_name != 'pull_request' && steps.brew-cache.outputs.cache-hit != 'true'
@@ -148,6 +152,7 @@ jobs:
            /opt/homebrew/Cellar/hiredis
            /opt/homebrew/Cellar/xxhash
            /opt/homebrew/Cellar/zstd
            /opt/homebrew/Cellar/nlohmann-json
          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
      # ---- ccache for llama.cpp CMake builds ----
--- a/.github/workflows/backend_merge.yml
+++ b/.github/workflows/backend_merge.yml
@@ -49,7 +49,7 @@ jobs:
      # Sparse checkout: the merge job needs `.github/scripts/` (for the
      # keepalive cleanup script) but none of the source tree.
      - name: Checkout (.github/scripts only)
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          sparse-checkout: |
            .github/scripts
--- a/.github/workflows/backend_pr.yml
+++ b/.github/workflows/backend_pr.yml
@@ -23,7 +23,7 @@ jobs:
      has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
      - name: Setup Bun
        uses: oven-sh/setup-bun@v2
--- a/.github/workflows/base-images.yml
+++ b/.github/workflows/base-images.yml
@@ -127,7 +127,7 @@ jobs:
            # the original l4t matrix entry which set skip-drivers: 'true'.
            skip-drivers: 'true'
    steps:
-      - uses: actions/checkout@v6
+      - uses: actions/checkout@v7
        with:
          submodules: false
      - name: Free disk space
--- a/.github/workflows/build-test.yaml
+++ b/.github/workflows/build-test.yaml
@@ -11,7 +11,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          fetch-depth: 0
      - name: Set up Go
@@ -25,7 +25,7 @@ jobs:
    runs-on: macos-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          fetch-depth: 0
      - name: Set up Go
@@ -47,7 +47,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          fetch-depth: 0
      - name: Configure apt mirror on runner
--- a/.github/workflows/bump-inference-defaults.yml
+++ b/.github/workflows/bump-inference-defaults.yml
@@ -14,7 +14,7 @@ jobs:
  bump:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v6
+      - uses: actions/checkout@v7
      - uses: actions/setup-go@v5
        with:
--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -26,6 +26,10 @@ jobs:
            variable: "DS4_VERSION"
            branch: "main"
            file: "backend/cpp/ds4/Makefile"
          - repository: "localai-org/privacy-filter.cpp"
            variable: "PRIVACY_FILTER_VERSION"
            branch: "master"
            file: "backend/cpp/privacy-filter/Makefile"
          - repository: "ggml-org/whisper.cpp"
            variable: "WHISPER_CPP_VERSION"
            branch: "master"
@@ -38,6 +42,14 @@ jobs:
            variable: "PARAKEET_VERSION"
            branch: "master"
            file: "backend/go/parakeet-cpp/Makefile"
          - repository: "mudler/ced.cpp"
            variable: "CED_VERSION"
            branch: "master"
            file: "backend/go/ced/Makefile"
          - repository: "mudler/depth-anything.cpp"
            variable: "DEPTHANYTHING_VERSION"
            branch: "master"
            file: "backend/go/depth-anything-cpp/Makefile"
          - repository: "leejet/stable-diffusion.cpp"
            variable: "STABLEDIFFUSION_GGML_VERSION"
            branch: "master"
@@ -66,9 +78,9 @@ jobs:
            variable: "LOCATEANYTHING_VERSION"
            branch: "master"
            file: "backend/go/locate-anything-cpp/Makefile"
-          - repository: "predict-woo/qwen3-tts.cpp"
+          - repository: "ServeurpersoCom/qwentts.cpp"
            variable: "QWEN3TTS_CPP_VERSION"
-            branch: "main"
+            branch: "master"
            file: "backend/go/qwen3-tts-cpp/Makefile"
          - repository: "ServeurpersoCom/omnivoice.cpp"
            variable: "OMNIVOICE_VERSION"
@@ -80,7 +92,7 @@ jobs:
            file: "backend/go/vibevoice-cpp/Makefile"
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v6
+      - uses: actions/checkout@v7
      - name: Bump dependencies 🔧
        id: bump
        run: |
@@ -116,7 +128,7 @@ jobs:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v6
+      - uses: actions/checkout@v7
      - name: Bump vLLM cu130 wheel pin 🔧
        id: bump
        run: |
--- a/.github/workflows/bump_docs.yaml
+++ b/.github/workflows/bump_docs.yaml
@@ -13,7 +13,7 @@ jobs:
          - repository: "mudler/LocalAI"
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v6
+      - uses: actions/checkout@v7
      - name: Bump dependencies 🔧
        run: |
          bash .github/bump_docs.sh ${{ matrix.repository }}
--- a/.github/workflows/checksum_checker.yaml
+++ b/.github/workflows/checksum_checker.yaml
@@ -8,7 +8,7 @@ jobs:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v6
+      - uses: actions/checkout@v7
      - name: Configure apt mirror on runner
        uses: ./.github/actions/configure-apt-mirror
      - name: Install dependencies
--- a/.github/workflows/deploy-explorer.yaml
+++ b/.github/workflows/deploy-explorer.yaml
@@ -16,7 +16,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - uses: actions/setup-go@v5
--- a/.github/workflows/gallery-agent.yaml
+++ b/.github/workflows/gallery-agent.yaml
@@ -31,7 +31,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
--- a/.github/workflows/generate_intel_image.yaml
+++ b/.github/workflows/generate_intel_image.yaml
@@ -44,7 +44,7 @@ jobs:
        uses: docker/setup-buildx-action@master
      - name: Checkout
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
      - name: Cache Intel images
        uses: docker/build-push-action@v7
--- a/.github/workflows/gh-pages.yml
+++ b/.github/workflows/gh-pages.yml
@@ -28,7 +28,7 @@ jobs:
      HUGO_VERSION: "0.146.3"
    steps:
      - name: Checkout
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          fetch-depth: 0  # needed for enableGitInfo
          submodules: true
--- a/.github/workflows/image_build.yml
+++ b/.github/workflows/image_build.yml
@@ -80,7 +80,7 @@ jobs:
    steps:
      - name: Checkout
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
      - name: Configure apt mirror on runner
        id: apt_mirror
--- a/.github/workflows/image_merge.yml
+++ b/.github/workflows/image_merge.yml
@@ -36,7 +36,7 @@ jobs:
      # Sparse checkout: needed for .github/scripts/ (the keepalive cleanup
      # script). Skips the rest of the source tree.
      - name: Checkout (.github/scripts only)
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          sparse-checkout: |
            .github/scripts
--- a/.github/workflows/lint.yml
+++ b/.github/workflows/lint.yml
@@ -20,7 +20,7 @@ jobs:
  golangci-lint:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v6
+      - uses: actions/checkout@v7
        with:
          # Full history so golangci-lint's new-from-merge-base can reach
          # origin/master and compute the diff against it.
--- a/.github/workflows/release.yaml
+++ b/.github/workflows/release.yaml
@@ -10,7 +10,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          fetch-depth: 0
      - name: Set up Go
@@ -28,7 +28,7 @@ jobs:
    runs-on: macos-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          fetch-depth: 0
      - name: Set up Go
@@ -46,7 +46,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          fetch-depth: 0
      - name: Configure apt mirror on runner
--- a/.github/workflows/secscan.yaml
+++ b/.github/workflows/secscan.yaml
@@ -14,14 +14,17 @@ jobs:
      GO111MODULE: on
    steps:
      - name: Checkout Source
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        if: ${{ github.actor != 'dependabot[bot]' }}
      - name: Run Gosec Security Scanner
        if: ${{ github.actor != 'dependabot[bot]' }}
        uses: securego/gosec@v2.27.1
        with:
          # we let the report trigger content trigger a failure using the GitHub Security features.
-          args: '-no-fail -fmt sarif -out results.sarif ./...'
+          # backend/go/supertonic is excluded: it vendors upstream supertone-inc/supertonic
+          # (helper.go), whose findings (G304 model-file loads, G404 math/rand for flow-matching
+          # noise, G104 unhandled errors) are inherent to that upstream code, not ours to rewrite.
+          args: '-no-fail -exclude-dir=backend/go/supertonic -fmt sarif -out results.sarif ./...'
      - name: Upload SARIF file
        if: ${{ github.actor != 'dependabot[bot]' }}
        uses: github/codeql-action/upload-sarif@v4
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
@@ -50,7 +50,7 @@ jobs:
      parakeet-cpp: ${{ steps.detect.outputs.parakeet-cpp }}
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
      - name: Setup Bun
        uses: oven-sh/setup-bun@v2
      - name: Install dependencies
@@ -67,7 +67,7 @@ jobs:
  #   runs-on: ubuntu-latest
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v6
+  #       uses: actions/checkout@v7
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -90,7 +90,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -113,7 +113,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -137,7 +137,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -158,7 +158,7 @@ jobs:
  #  runs-on: ubuntu-latest
  #  steps:
  #    - name: Clone
-  #      uses: actions/checkout@v6
+  #      uses: actions/checkout@v7
  #      with:
  #        submodules: true
  #    - name: Dependencies
@@ -178,7 +178,7 @@ jobs:
  #   runs-on: ubuntu-latest
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v6
+  #       uses: actions/checkout@v7
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -240,7 +240,7 @@ jobs:
  #           sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
  #           df -h
  #     - name: Clone
-  #       uses: actions/checkout@v6
+  #       uses: actions/checkout@v7
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -265,7 +265,7 @@ jobs:
  #   runs-on: ubuntu-latest
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v6
+  #       uses: actions/checkout@v7
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -288,7 +288,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -309,7 +309,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -330,7 +330,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -351,7 +351,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -373,7 +373,7 @@ jobs:
  #   timeout-minutes: 45
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v6
+  #       uses: actions/checkout@v7
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -394,7 +394,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -415,7 +415,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -436,7 +436,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -462,7 +462,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -484,7 +484,7 @@ jobs:
    timeout-minutes: 30
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -513,7 +513,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Setup Go
@@ -530,7 +530,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Setup Go
@@ -552,7 +552,7 @@ jobs:
    timeout-minutes: 20
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Setup Go
@@ -579,7 +579,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Setup Go
@@ -604,7 +604,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Setup Go
@@ -625,7 +625,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Setup Go
@@ -645,7 +645,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Setup Go
@@ -664,7 +664,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Setup Go
@@ -681,7 +681,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Setup Go
@@ -698,7 +698,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Setup Go
@@ -741,7 +741,7 @@ jobs:
  #   timeout-minutes: 90
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v6
+  #       uses: actions/checkout@v7
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -783,7 +783,7 @@ jobs:
  #   timeout-minutes: 90
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v6
+  #       uses: actions/checkout@v7
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -808,7 +808,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -840,7 +840,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -876,7 +876,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -915,7 +915,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -952,7 +952,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -987,7 +987,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Setup Go
@@ -1013,7 +1013,7 @@ jobs:
    timeout-minutes: 150
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -1042,7 +1042,7 @@ jobs:
    timeout-minutes: 60
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Setup Go
@@ -1058,7 +1058,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -1091,7 +1091,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -1114,7 +1114,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
@@ -1140,7 +1140,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -21,7 +21,7 @@ jobs:
        go-version: ['1.26.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Free disk space
@@ -84,7 +84,7 @@ jobs:
        go-version: ['1.26.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Setup Go ${{ matrix.go-version }}
--- a/.github/workflows/tests-aio.yml
+++ b/.github/workflows/tests-aio.yml
@@ -62,7 +62,7 @@ jobs:
          sudo rm -rfv build || true
          df -h
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Dependencies
--- a/.github/workflows/tests-e2e.yml
+++ b/.github/workflows/tests-e2e.yml
@@ -21,7 +21,7 @@ jobs:
        go-version: ['1.25.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Configure apt mirror on runner
--- a/.github/workflows/tests-pii-ner-e2e.yml
+++ b/.github/workflows/tests-pii-ner-e2e.yml
@@ -0,0 +1,97 @@
 ---
 name: 'PII NER tier E2E (live GGUF, CPU)'
 # Runs the real privacy-filter GGUF NER tier end-to-end on CPU — the gap the
 # hermetic tests/e2e suite cannot cover (it only exercises the in-process
 # pattern tier). Heavy (builds the C++ backend image + downloads a ~2.7 GB
 # GGUF), so it is path-filtered on PRs and otherwise runs nightly / on demand.
 #
 # This drives the container-level harness (tests/e2e-backends) via
 # `make test-extra-backend-privacy-filter`: it builds the privacy-filter image,
 # downloads the model, loads it on CPU, and asserts byte-correct, UTF-8-aligned
 # TokenClassify spans. The complementary HTTP-path specs in tests/e2e
 # (e2e_pii_ner_test.go) Skip unless PII_NER_MODEL_GGUF is wired.
 on:
  workflow_dispatch:
  schedule:
    - cron: '0 3 * * *'
  push:
    branches:
      - master
    paths:
      - 'backend/cpp/privacy-filter/**'
      - 'backend/Dockerfile.privacy-filter'
      - 'core/services/routing/pii/**'
      - 'core/services/routing/piidetector/**'
      - 'core/backend/token_classify.go'
      - 'core/http/endpoints/localai/pii.go'
      - 'core/schema/pii.go'
      - 'tests/e2e-backends/**'
      - 'tests/e2e/e2e_pii_ner_test.go'
      - 'tests/e2e/e2e_suite_test.go'
      - '.github/workflows/tests-pii-ner-e2e.yml'
  pull_request:
    paths:
      - 'backend/cpp/privacy-filter/**'
      - 'backend/Dockerfile.privacy-filter'
      - 'core/services/routing/pii/**'
      - 'core/services/routing/piidetector/**'
      - 'core/backend/token_classify.go'
      - 'core/http/endpoints/localai/pii.go'
      - 'core/schema/pii.go'
      - 'tests/e2e-backends/**'
      - 'tests/e2e/e2e_pii_ner_test.go'
      - 'tests/e2e/e2e_suite_test.go'
      - '.github/workflows/tests-pii-ner-e2e.yml'
 concurrency:
  group: ci-tests-pii-ner-e2e-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
 jobs:
  tests-pii-ner-e2e:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        go-version: ['1.25.x']
    steps:
      - name: Clone
        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Free disk space
        run: |
          sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc /opt/hostedtoolcache/CodeQL || true
          sudo docker image prune --all --force || true
          df -h
      - name: Configure apt mirror on runner
        uses: ./.github/actions/configure-apt-mirror
      - name: Setup Go ${{ matrix.go-version }}
        uses: actions/setup-go@v5
        with:
          go-version: ${{ matrix.go-version }}
          cache: false
      - name: Proto Dependencies
        run: |
          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
          rm protoc.zip
          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
          PATH="$PATH:$HOME/go/bin" make protogen-go
      - name: Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential
      # Builds local-ai-backend:privacy-filter, downloads the GGUF, loads it on
      # CPU and runs the token_classify capability spec (byte-offset contract).
      - name: Run live PII NER backend E2E
        run: PATH="$PATH:$HOME/go/bin" make test-extra-backend-privacy-filter
      - name: Setup tmate session if tests fail
        if: ${{ failure() }}
        uses: mxschmitt/action-tmate@v3.23
        with:
          detached: true
          connect-timeout-seconds: 180
          limit-access-to-actor: true
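Review note: the workflow above asserts "byte-correct, UTF-8-aligned TokenClassify spans". As a rough illustration of what that contract means, here is a minimal Python sketch. It assumes spans are half-open `[start, end)` byte offsets into the UTF-8 encoding of the input text; `is_utf8_boundary` and `check_span` are hypothetical helper names and are not taken from the tests/e2e-backends harness.

```python
def is_utf8_boundary(data: bytes, offset: int) -> bool:
    """Return True if offset sits on a code point boundary of data."""
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


def check_span(text: str, start: int, end: int, expected: str) -> bool:
    """Check an assumed [start, end) byte span against the expected entity text."""
    data = text.encode("utf-8")
    if not (0 <= start <= end <= len(data)):
        return False
    # Aligned: neither edge may split a multi-byte character.
    if not (is_utf8_boundary(data, start) and is_utf8_boundary(data, end)):
        return False
    # Byte-correct: the slice must decode to exactly the entity text.
    return data[start:end].decode("utf-8") == expected


if __name__ == "__main__":
    sample = "Zoë lives in Köln"
    print(check_span(sample, 14, 19, "Köln"))  # byte offsets
    print(check_span(sample, 13, 17, "Köln"))  # character offsets by mistake
```

The second call shows the failure mode this kind of check is meant to catch: offsets computed in characters rather than bytes drift as soon as the text contains a multi-byte character.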
--- a/.github/workflows/tests-ui-e2e.yml
+++ b/.github/workflows/tests-ui-e2e.yml
@@ -23,7 +23,7 @@ jobs:
        go-version: ['1.26.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v6
+        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Configure apt mirror on runner
--- a/.github/workflows/update_swagger.yaml
+++ b/.github/workflows/update_swagger.yaml
@@ -10,7 +10,7 @@ jobs:
      fail-fast: false
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v6
+      - uses: actions/checkout@v7
      - name: Configure apt mirror on runner
        uses: ./.github/actions/configure-apt-mirror
      - uses: actions/setup-go@v5
--- a/.gitignore
+++ b/.gitignore
@@ -91,3 +91,6 @@ core/http/react-ui/test-results/
 # Local worktrees
 .worktrees/
 # SDD / brainstorm scratch (agent-driven development)
 .superpowers/
--- a/.golangci.yml
+++ b/.golangci.yml
@@ -74,6 +74,8 @@ linters:
    paths:
      # Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
      - 'backend/go/whisper/sources'
      # Vendored upstream supertonic pipeline (supertone-inc/supertonic go/helper.go).
      - 'backend/go/supertonic/helper.go'
      - 'docs/'
    rules:
      # CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
--- a/Makefile
+++ b/Makefile
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter
 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -595,6 +595,8 @@ test-extra: prepare-test-extra
 	$(MAKE) -C backend/rust/kokoros test
 	$(MAKE) -C backend/go/rfdetr-cpp test
 	$(MAKE) -C backend/go/locate-anything-cpp test
 	$(MAKE) -C backend/go/depth-anything-cpp test
 	$(MAKE) -C backend/go/supertonic test
 ##
 ## End-to-end gRPC tests that exercise a built backend container image.
@@ -688,6 +690,16 @@ test-extra-backend-llama-cpp-transcription: docker-build-llama-cpp
 	BACKEND_TEST_CTX_SIZE=2048 \
 	$(MAKE) test-extra-backend
 ## privacy-filter: the PII/NER token-classification backend. Exercises the
 ## TokenClassify RPC and asserts byte-correct, UTF-8-aligned span offsets
 ## against the openai-privacy-filter multilingual GGUF (CPU-runnable, ~50M
 ## active params). This is the live-backend coverage for the PII NER tier.
 test-extra-backend-privacy-filter: docker-build-privacy-filter
 	BACKEND_IMAGE=local-ai-backend:privacy-filter \
 	BACKEND_TEST_MODEL_URL=https://huggingface.co/LocalAI-io/privacy-filter-multilingual-GGUF/resolve/main/privacy-filter-multilingual-f16.gguf \
 	BACKEND_TEST_CAPS=health,load,token_classify \
 	$(MAKE) test-extra-backend
 ## vllm is resolved from a HuggingFace model id (no file download) and
 ## exercises Predict + streaming + tool-call extraction via the hermes parser.
 ## Requires a host CPU with the SIMD instructions the prebuilt vllm CPU
@@ -1162,6 +1174,10 @@ BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
 # Single-model; hardware-only validation lives at tests/e2e-backends/
 # (BACKEND_BINARY mode); see docs/superpowers/plans/2026-05-11-ds4-backend.md.
 BACKEND_DS4 = ds4|ds4|.|false|false
 # privacy-filter wraps the standalone privacy-filter.cpp GGML engine (the
 # openai-privacy-filter PII/NER token classifier) — the TokenClassify RPC for
 # the PII redactor tier, on stock ggml with no llama.cpp carry-patches.
 BACKEND_PRIVACY_FILTER = privacy-filter|privacy-filter|.|false|false
 # Golang backends
 BACKEND_PIPER = piper|golang|.|false|true
@@ -1173,6 +1189,7 @@ BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|tr
 BACKEND_WHISPER = whisper|golang|.|false|true
 BACKEND_CRISPASR = crispasr|golang|.|false|true
 BACKEND_PARAKEET_CPP = parakeet-cpp|golang|.|false|true
 BACKEND_DEPTH_ANYTHING_CPP = depth-anything-cpp|golang|.|false|true
 BACKEND_VOXTRAL = voxtral|golang|.|false|true
 BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
 BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
@@ -1181,6 +1198,7 @@ BACKEND_VIBEVOICE_CPP = vibevoice-cpp|golang|.|false|true
 BACKEND_LOCALVQE = localvqe|golang|.|false|true
 BACKEND_OPUS = opus|golang|.|false|true
 BACKEND_SHERPA_ONNX = sherpa-onnx|golang|.|false|true
 BACKEND_SUPERTONIC = supertonic|golang|.|false|true
 # Python backends with root context
 BACKEND_RERANKERS = rerankers|python|.|false|true
@@ -1254,6 +1272,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
 $(eval $(call generate-docker-build-target,$(BACKEND_DS4)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PRIVACY_FILTER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_CLOUD_PROXY)))
@@ -1263,6 +1282,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_STABLEDIFFUSION_GGML)))
 $(eval $(call generate-docker-build-target,$(BACKEND_WHISPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_CRISPASR)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PARAKEET_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_DEPTH_ANYTHING_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VOXTRAL)))
 $(eval $(call generate-docker-build-target,$(BACKEND_OPUS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_RERANKERS)))
@@ -1308,12 +1328,13 @@ $(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_RFDETR_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SUPERTONIC)))
 # Pattern rule for docker-save targets
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar
-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-omnivoice-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-omnivoice-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy docker-build-supertonic docker-build-depth-anything-cpp docker-build-privacy-filter
 ########################################################
 ### Mock Backend for E2E Tests
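Review note: each `BACKEND_*` value in the hunks above packs five `|`-separated fields that are passed to `generate-docker-build-target`. The sketch below shows one way to read that layout in Python. The field names (name, Dockerfile type, build context, progress flag, and a trailing boolean) are assumptions and are not confirmed by this diff; `BackendDef` and `parse_backend_def` are hypothetical helpers. The `docker-build-<name>` target and `local-ai-backend:<name>` image tag patterns do appear in the hunks above.

```python
from typing import NamedTuple


class BackendDef(NamedTuple):
    # Field names are guesses; only the five-field shape comes from the diff.
    name: str
    dockerfile_type: str
    context: str
    progress_flag: str
    last_flag: str

    @property
    def build_target(self) -> str:
        """Make target name, following the docker-build-<name> pattern."""
        return f"docker-build-{self.name}"

    @property
    def image(self) -> str:
        """Image tag, following the local-ai-backend:<name> pattern."""
        return f"local-ai-backend:{self.name}"


def parse_backend_def(value: str) -> BackendDef:
    """Split a 'a|b|c|d|e' definition into its five fields."""
    parts = value.split("|")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {value!r}")
    return BackendDef(*parts)


if __name__ == "__main__":
    d = parse_backend_def("privacy-filter|privacy-filter|.|false|false")
    print(d.build_target, d.image, d.context)
```

Rejecting any definition that does not have exactly five fields keeps a typo such as a missing `|` from silently shifting the remaining values.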
--- a/README.md
+++ b/README.md
@@ -29,6 +29,18 @@
 <a href="https://trendshift.io/repositories/5539" target="_blank"><img src="https://trendshift.io/api/badge/repositories/5539" alt="mudler%2FLocalAI | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 </p>
 <!-- Keep these links, translations synced daily. -->
 <p align="center">
 <a href="https://zdoc.app/de/mudler/LocalAI">Deutsch</a> |
 <a href="https://zdoc.app/es/mudler/LocalAI">Español</a> |
 <a href="https://zdoc.app/fr/mudler/LocalAI">français</a> |
 <a href="https://zdoc.app/ja/mudler/LocalAI">日本語</a> |
 <a href="https://zdoc.app/ko/mudler/LocalAI">한국어</a> |
 <a href="https://zdoc.app/pt/mudler/LocalAI">Português</a> |
 <a href="https://zdoc.app/ru/mudler/LocalAI">Русский</a> |
 <a href="https://zdoc.app/zh/mudler/LocalAI">中文</a>
 </p>
 **LocalAI** is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.
 **A small core, not a bundle.** Each backend wraps a best-in-class engine (llama.cpp, vLLM, whisper.cpp, stable-diffusion, MLX...) in its own image, pulled only when a model needs it. You install nothing you don't use.
@@ -208,10 +220,29 @@ For older news and full release notes, see [GitHub Releases](https://github.com/
 ## Supported Backends & Acceleration
-LocalAI supports **36+ backends** including llama.cpp, vLLM, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).
+LocalAI supports **60+ backends** including llama.cpp, vLLM, SGLang, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).
 See the full [Backend & Model Compatibility Table](https://localai.io/model-compatibility/) and [GPU Acceleration guide](https://localai.io/features/gpu-acceleration/).
 ### Backends built by us
 Most backends wrap a best-in-class upstream engine. A handful of them are native C/C++/GGML engines (no Python at inference) developed and maintained by the LocalAI project itself:
 | Backend | What it does |
 |---------|-------------|
 | [parakeet.cpp](https://github.com/mudler/parakeet.cpp) | C++/GGML port of NVIDIA NeMo Parakeet ASR (tdt/ctc/rnnt/hybrid), with cache-aware streaming transcription |
 | [ced.cpp](https://github.com/mudler/ced.cpp) | C++/GGML port of the CED audio-tagging models: sound-event classification (527-class AudioSet) over REST and the realtime API for live recognition |
 | [voxtral.c](https://github.com/mudler/voxtral.c) | Voxtral Realtime 4B speech-to-text in pure C |
 | [vibevoice.cpp](https://github.com/mudler/vibevoice.cpp) | Native port of Microsoft VibeVoice for TTS (voice cloning) and long-form ASR with speaker diarization |
 | [rf-detr.cpp](https://github.com/mudler/rf-detr.cpp) | Native RF-DETR object detection and instance segmentation |
 | [locate-anything.cpp](https://github.com/mudler/locate-anything.cpp) | Open-vocabulary object detection and visual grounding (LocateAnything-3B) |
 | [depth-anything.cpp](https://github.com/mudler/depth-anything.cpp) | Depth Anything 3 monocular metric depth + camera pose estimation |
 | [privacy-filter.cpp](https://github.com/localai-org/privacy-filter.cpp) | Standalone GGML PII/NER token-classification engine powering LocalAI's PII redaction tier |
 | [LocalVQE](https://github.com/localai-org/LocalVQE) | Joint acoustic echo cancellation, noise suppression, and dereverberation |
 | [local-store](https://github.com/mudler/LocalAI) | Local-first vector database for embeddings (shipped in-tree) |
 We also maintain [apex-quant](https://github.com/localai-org/apex-quant), a per-tensor, per-layer quantization recipe for Mixture-of-Experts models that exploits their structural sparsity to produce GGUFs matching or beating Q8_0 quality - and they run out of the box on stock llama.cpp.
 ## Resources
 - [Documentation](https://localai.io/)
--- a/backend/Dockerfile.golang
+++ b/backend/Dockerfile.golang
@@ -65,7 +65,12 @@ RUN <<EOT bash
            libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
            git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
            ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
-            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
+            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils && \
        apt-get install -y mesa-vulkan-drivers libdrm2
        # Mesa Vulkan ICD drivers (ANV/RADV/lavapipe) + their manifests. The
        # LunarG SDK below only provides the loader and shader tooling, not
        # hardware drivers — without Mesa, package-gpu-libs.sh has no ICD to
        # bundle and the packaged backend finds no GPU at runtime.
        if [ "amd64" = "$TARGETARCH" ]; then
            wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
            tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
--- a/backend/Dockerfile.privacy-filter
+++ b/backend/Dockerfile.privacy-filter
@@ -0,0 +1,109 @@
 ARG BASE_IMAGE=ubuntu:24.04
 # BUILDER_BASE_IMAGE defaults to BASE_IMAGE so the Dockerfile parses when no
 # prebuilt base is supplied; the builder-prebuilt stage is only entered when
 # BUILDER_TARGET=builder-prebuilt, so the fallback content is harmless
 # (BuildKit prunes the unreferenced builder).
 ARG BUILDER_BASE_IMAGE=${BASE_IMAGE}
 # BUILDER_TARGET selects which builder stage the scratch image copies from.
 # Declared before any FROM so it is usable in `FROM ${BUILDER_TARGET}`. The
 # backend_build workflow sets it to builder-prebuilt when the matrix entry
 # provides builder-base-image, else builder-fromsource (the local default).
 ARG BUILDER_TARGET=builder-fromsource
 ARG APT_MIRROR=""
 ARG APT_PORTS_MIRROR=""
 # privacy-filter: standalone GGML engine for the openai-privacy-filter PII/NER
 # token classifier, wrapped as a LocalAI gRPC backend.
 #
 # Mirrors backend/Dockerfile.llama-cpp: the build toolchain (gRPC + cmake +
 # protoc + conditional CUDA/Vulkan) comes from the shared
 # .docker/install-base-deps.sh (from-source path) or a prebuilt
 # quay.io/go-skynet/ci-cache:base-grpc-* image (CI path) — nothing GPU-specific
 # is hand-rolled here. BUILD_TYPE selects the engine backend in the Makefile:
 # "" = cpu, "cublas" -> -DPF_CUDA=ON, "vulkan" -> -DPF_VULKAN=ON.
 # ============================================================================
 # Stage: builder-fromsource — self-contained build. Runs the same install
 # script backend/Dockerfile.base-grpc-builder runs, so this path is
 # bit-equivalent to the prebuilt base. Used when BUILDER_TARGET=builder-fromsource
 # (the default; local `make backends/privacy-filter`).
 # ============================================================================
 FROM ${BASE_IMAGE} AS builder-fromsource
 ARG BUILD_TYPE
 ARG CUDA_MAJOR_VERSION
 ARG CUDA_MINOR_VERSION
 ARG CMAKE_FROM_SOURCE=false
 # CUDA Toolkit 13.x needs CMake 3.31.9+ for correct toolchain/arch detection.
 ARG CMAKE_VERSION=3.31.10
 ARG GRPC_VERSION=v1.65.0
 ARG GRPC_MAKEFLAGS="-j4 -Otarget"
 ARG SKIP_DRIVERS=false
 ARG TARGETARCH
 ARG UBUNTU_VERSION=2404
 ARG APT_MIRROR
 ARG APT_PORTS_MIRROR
 ENV BUILD_TYPE=${BUILD_TYPE} \
    CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
    CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
    CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
    CMAKE_VERSION=${CMAKE_VERSION} \
    GRPC_VERSION=${GRPC_VERSION} \
    GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
    SKIP_DRIVERS=${SKIP_DRIVERS} \
    TARGETARCH=${TARGETARCH} \
    UBUNTU_VERSION=${UBUNTU_VERSION} \
    APT_MIRROR=${APT_MIRROR} \
    APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
    DEBIAN_FRONTEND=noninteractive
 # CUDA on PATH (a no-op when CUDA is not installed, e.g. cpu/vulkan builds).
 ENV PATH=/usr/local/cuda/bin:${PATH}
 WORKDIR /build
 # apt deps + cmake + protoc + gRPC + conditional CUDA/Vulkan, all from the
 # shared script (the source of truth that base-grpc-builder also runs).
 RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
    --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
    bash /usr/local/sbin/install-base-deps
 # install-base-deps installs gRPC under /opt/grpc; copy it to /usr/local so the
 # backend's find_package(gRPC CONFIG) resolves it at the canonical prefix.
 RUN cp -a /opt/grpc/. /usr/local/
 COPY . /LocalAI
 RUN --mount=type=cache,target=/root/.ccache,id=privacy-filter-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
    make -C /LocalAI/backend/cpp/privacy-filter BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
 # ============================================================================
 # Stage: builder-prebuilt — FROM a prebuilt
 # quay.io/go-skynet/ci-cache:base-grpc-* image (gRPC at /opt/grpc + apt deps +
 # CUDA/Vulkan already installed). Used in CI when the matrix entry sets
 # builder-base-image.
 # ============================================================================
 FROM ${BUILDER_BASE_IMAGE} AS builder-prebuilt
 ARG BUILD_TYPE
 ARG TARGETARCH
 ENV BUILD_TYPE=${BUILD_TYPE}
 # CUDA on PATH (a no-op for the cpu/vulkan base images).
 ENV PATH=/usr/local/cuda/bin:${PATH}
 # Mirror builder-fromsource: the base-grpc image installs gRPC to /opt/grpc but
 # does not copy it to /usr/local.
 RUN cp -a /opt/grpc/. /usr/local/
 COPY . /LocalAI
 RUN --mount=type=cache,target=/root/.ccache,id=privacy-filter-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
    make -C /LocalAI/backend/cpp/privacy-filter BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
 # ============================================================================
 # Final stage — copy the package output from the selected builder. BuildKit
 # does not expand variables in `COPY --from=`, so alias the chosen builder to a
 # fixed stage name first.
 # ============================================================================
 FROM ${BUILDER_TARGET} AS builder
 FROM scratch
 COPY --from=builder /LocalAI/backend/cpp/privacy-filter/package/. ./
--- a/backend/Dockerfile.python
+++ b/backend/Dockerfile.python
@@ -66,7 +66,12 @@ RUN <<EOT bash
            libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
            git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
            ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
-            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
+            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils && \
        apt-get install -y mesa-vulkan-drivers libdrm2
        # Mesa Vulkan ICD drivers (ANV/RADV/lavapipe) + their manifests. The
        # LunarG SDK below only provides the loader and shader tooling, not
        # hardware drivers — without Mesa, package-gpu-libs.sh has no ICD to
        # bundle and the packaged backend finds no GPU at runtime.
        if [ "amd64" = "$TARGETARCH" ]; then
            wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
            tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
--- a/backend/backend.proto
+++ b/backend/backend.proto
@@ -24,6 +24,10 @@ service Backend {
  rpc TokenizeString(PredictOptions) returns (TokenizationResponse) {}
  rpc Status(HealthMessage) returns (StatusResponse) {}
  rpc Detect(DetectOptions) returns (DetectResponse) {}
  // SoundDetection runs an audio-tagging / sound-event-classification model
  // (e.g. CED over the AudioSet ontology) on a clip and returns scored labels.
  rpc SoundDetection(SoundDetectionRequest) returns (SoundDetectionResponse) {}
  rpc Depth(DepthRequest) returns (DepthResponse) {}
  rpc FaceVerify(FaceVerifyRequest) returns (FaceVerifyResponse) {}
  rpc FaceAnalyze(FaceAnalyzeRequest) returns (FaceAnalyzeResponse) {}
  rpc VoiceVerify(VoiceVerifyRequest) returns (VoiceVerifyResponse) {}
@@ -670,6 +674,53 @@ message DetectResponse {
  repeated Detection Detections = 1;
 }
 // --- Sound-event classification / audio tagging messages (CED) ---
 message SoundDetectionRequest {
  string src = 1;       // audio file path (LocalAI writes the upload to disk)
  int32 top_k = 2;      // number of top tags to return (0 = all classes)
  float threshold = 3;  // optional: drop tags scoring below this
 }
 message SoundClass {
  string label = 1;     // AudioSet class name, e.g. "Baby cry, infant cry"
  float score = 2;      // per-class probability (multi-label, independent)
  int32 index = 3;      // class index in the model ontology
 }
 message SoundDetectionResponse {
  repeated SoundClass detections = 1;  // score-descending
 }
 // --- Depth estimation messages (Depth Anything 3) ---
 message DepthRequest {
  string src = 1;                  // input image (filesystem path or base64-encoded payload)
  string dst = 2;                  // optional output directory for exports (glb/colmap)
  bool include_depth = 3;          // return the per-pixel metric depth map
  bool include_confidence = 4;     // return the per-pixel confidence map (DualDPT)
  bool include_pose = 5;           // return camera extrinsics/intrinsics (DualDPT)
  bool include_sky = 6;            // return the per-pixel sky map (mono models)
  bool include_points = 7;         // back-project to a 3D point cloud (DualDPT)
  float points_conf_thresh = 8;    // keep points with confidence >= this threshold
  repeated string exports = 9;     // requested exports: "glb", "colmap"
 }
 message DepthResponse {
  int32 width = 1;                 // processed depth-map width
  int32 height = 2;                // processed depth-map height
  repeated float depth = 3;        // width*height row-major metric depth
  repeated float confidence = 4;   // width*height row-major confidence (DualDPT)
  repeated float sky = 5;          // width*height row-major sky map (mono)
  repeated float extrinsics = 6;   // 12 floats, 3x4 row-major (world-to-camera)
  repeated float intrinsics = 7;   // 9 floats, 3x3 row-major
  int32 num_points = 8;            // number of 3D points
  repeated float points = 9;       // num_points*3 xyz, world space
  bytes point_colors = 10;         // num_points*3 uint8 rgb
  repeated string export_paths = 11; // paths written for the requested exports
  bool is_metric = 12;             // depth is in metric units
 }
 // --- Face recognition messages ---
 message FacialArea {
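The `top_k` and `threshold` comments on `SoundDetectionRequest` in the hunk above can be read as a small post-processing step over the scored labels. The following is a minimal sketch of that reading, not the backend's actual code: `SoundClassSketch` and `select_tags` are hypothetical names, and applying the threshold before the `top_k` cut is an assumption the proto does not state.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory mirror of the SoundClass message (not generated code).
struct SoundClassSketch {
    std::string label;  // class name
    float       score;  // per-class probability
    int32_t     index;  // class index in the model ontology
};

// Assumed semantics: drop tags scoring below `threshold`, sort the rest
// score-descending, then keep at most `top_k` entries (0 = keep all).
std::vector<SoundClassSketch> select_tags(std::vector<SoundClassSketch> tags,
                                          int32_t top_k, float threshold) {
    tags.erase(std::remove_if(tags.begin(), tags.end(),
                              [threshold](const SoundClassSketch &t) {
                                  return t.score < threshold;
                              }),
               tags.end());
    std::stable_sort(tags.begin(), tags.end(),
                     [](const SoundClassSketch &a, const SoundClassSketch &b) {
                         return a.score > b.score;
                     });
    if (top_k > 0 && static_cast<std::size_t>(top_k) < tags.size()) {
        tags.resize(static_cast<std::size_t>(top_k));
    }
    return tags;
}
```

Because the scores are described as multi-label and independent, they are not renormalized after filtering in this sketch.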
--- a/backend/cpp/ds4/CMakeLists.txt
+++ b/backend/cpp/ds4/CMakeLists.txt
@@ -9,6 +9,22 @@ option(DS4_NATIVE "Compile with -march=native / -mcpu=native" ON)
 set(DS4_GPU "cpu" CACHE STRING "GPU backend: cpu, cuda, or metal")
 set(DS4_DIR "${CMAKE_CURRENT_SOURCE_DIR}/ds4" CACHE PATH "Path to cloned ds4 source")
 if(${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
    # Homebrew installs protobuf/grpc under a non-default prefix. The generated
    # backend.pb.cc / backend.grpc.pb.cc pull in google/protobuf and grpcpp
    # headers, but the hw_grpc_proto library links neither target, so on macOS
    # the headers (e.g. google/protobuf/runtime_version.h) are never on the
    # compiler's include path. Add the Homebrew prefix globally, matching the
    # llama-cpp backend which builds on Darwin CI.
    if(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "arm64")
        set(HOMEBREW_DEFAULT_PREFIX "/opt/homebrew")
    else()
        set(HOMEBREW_DEFAULT_PREFIX "/usr/local")
    endif()
    link_directories("${HOMEBREW_DEFAULT_PREFIX}/lib")
    include_directories("${HOMEBREW_DEFAULT_PREFIX}/include")
 endif()
 find_package(Threads REQUIRED)
 find_package(Protobuf CONFIG QUIET)
 if(NOT Protobuf_FOUND)
--- a/backend/cpp/ds4/Makefile
+++ b/backend/cpp/ds4/Makefile
@@ -1,10 +1,10 @@
 # ds4 backend Makefile.
 #
-# Upstream pin lives below as DS4_VERSION?=d881f2a05e8ff6bec001315a36b794b4aa310173
+# Upstream pin lives below as DS4_VERSION?=80ebbc396aee40eedc1d829222f3362d10fa4c6c
 # (.github/bump_deps.sh) can find and update it - matches the
 # llama-cpp / ik-llama-cpp / turboquant convention.
-DS4_VERSION?=d881f2a05e8ff6bec001315a36b794b4aa310173
+DS4_VERSION?=80ebbc396aee40eedc1d829222f3362d10fa4c6c
 DS4_REPO?=https://github.com/antirez/ds4
 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
--- a/backend/cpp/ds4/grpc-server.cpp
+++ b/backend/cpp/ds4/grpc-server.cpp
@@ -25,6 +25,8 @@ extern "C" {
 #include <chrono>
 #include <climits>
 #include <csignal>
 #include <cstddef>
 #include <cstdint>
 #include <cstdlib>
 #include <cstring>
 #include <ctime>
@@ -105,6 +107,130 @@ static bool parse_layers_spec(const std::string &spec, ds4_distributed_layers *o
    return true;
 }
 // Parse a boolean LoadModel option. An empty value (a bare flag-style option
 // like "ssd_streaming" with no colon) means true so model YAMLs can write
 // options: ["ssd_streaming"] to enable a switch.
 static bool parse_bool_option(const std::string &s, bool *out) {
    if (s.empty() || s == "true" || s == "1" || s == "yes" || s == "on") { *out = true; return true; }
    if (s == "false" || s == "0" || s == "no" || s == "off") { *out = false; return true; }
    return false;
 }
 // Table-driven mapping from LoadModel option keys to ds4_engine_options fields.
 // ds4_engine_options is a fixed C struct with no reflection, so the field set
 // is enumerated once here; adding a future engine knob is a one-line table
 // entry rather than a new branch in LoadModel. Two fields need ds4's own typed
 // parsers (Gib, CacheExperts) so a plain string passthrough can't cover them.
 enum class DsOptType { Bool, Int, Uint, Float, Str, Gib, CacheExperts };
 struct DsOptSpec {
    const char *key;
    DsOptType   type;
    size_t      off;      // byte offset into ds4_engine_options
    size_t      off2;     // second offset (CacheExperts writes experts + bytes)
    bool        is_path;  // Str values: resolve a relative value against the model dir
 };
 static const DsOptSpec kEngineOptSpecs[] = {
    {"mtp_path",                      DsOptType::Str,          offsetof(ds4_engine_options, mtp_path),                      0, true},
    {"mtp_draft",                     DsOptType::Int,          offsetof(ds4_engine_options, mtp_draft_tokens),              0},
    {"mtp_margin",                    DsOptType::Float,        offsetof(ds4_engine_options, mtp_margin),                    0},
    {"prefill_chunk",                 DsOptType::Uint,         offsetof(ds4_engine_options, prefill_chunk),                 0},
    {"power_percent",                 DsOptType::Int,          offsetof(ds4_engine_options, power_percent),                 0},
    {"warm_weights",                  DsOptType::Bool,         offsetof(ds4_engine_options, warm_weights),                  0},
    {"quality",                       DsOptType::Bool,         offsetof(ds4_engine_options, quality),                       0},
    {"ssd_streaming",                 DsOptType::Bool,         offsetof(ds4_engine_options, ssd_streaming),                 0},
    {"ssd_streaming_cold",            DsOptType::Bool,         offsetof(ds4_engine_options, ssd_streaming_cold),            0},
    {"ssd_streaming_preload_experts", DsOptType::Uint,         offsetof(ds4_engine_options, ssd_streaming_preload_experts), 0},
    {"ssd_streaming_cache_experts",   DsOptType::CacheExperts, offsetof(ds4_engine_options, ssd_streaming_cache_experts),
                                                               offsetof(ds4_engine_options, ssd_streaming_cache_bytes)},
    {"simulate_used_memory",          DsOptType::Gib,          offsetof(ds4_engine_options, simulate_used_memory_bytes),    0},
    {"expert_profile_path",           DsOptType::Str,          offsetof(ds4_engine_options, expert_profile_path),           0, true},
    {"directional_steering_file",     DsOptType::Str,          offsetof(ds4_engine_options, directional_steering_file),     0, true},
    {"directional_steering_attn",     DsOptType::Float,        offsetof(ds4_engine_options, directional_steering_attn),     0},
    {"directional_steering_ffn",      DsOptType::Float,        offsetof(ds4_engine_options, directional_steering_ffn),      0},
 };
 // Apply a single key:value LoadModel option to the engine options struct.
 // Unknown keys are ignored (back-compat: callers pass mixed option sets).
 // String values are copied into `storage`, whose elements the engine reads by
 // pointer during ds4_engine_open; `storage` MUST have reserved capacity so
 // push_back never reallocates and dangles an earlier c_str(). Returns false
 // with `err` set when a recognized key has an invalid value.
 static bool apply_engine_option(ds4_engine_options *opt, const std::string &key,
                                const std::string &val, const std::string &model_dir,
                                std::vector<std::string> &storage, std::string &err) {
    const DsOptSpec *spec = nullptr;
    for (const auto &s : kEngineOptSpecs) {
        if (key == s.key) { spec = &s; break; }
    }
    if (!spec) return true; // unknown key: ignore
    char *base = reinterpret_cast<char *>(opt);
    switch (spec->type) {
    case DsOptType::Bool: {
        bool b = false;
        if (!parse_bool_option(val, &b)) { err = key + " must be true/false"; return false; }
        *reinterpret_cast<bool *>(base + spec->off) = b;
        return true;
    }
    case DsOptType::Int: {
        char *end = nullptr;
        long v = std::strtol(val.c_str(), &end, 10);
        if (val.empty() || !end || *end != '\0') { err = key + " must be an integer"; return false; }
        *reinterpret_cast<int *>(base + spec->off) = static_cast<int>(v);
        return true;
    }
    case DsOptType::Uint: {
        char *end = nullptr;
        long v = std::strtol(val.c_str(), &end, 10);
        if (val.empty() || !end || *end != '\0' || v < 0 || v > static_cast<long>(UINT32_MAX)) {
            err = key + " must be a non-negative integer"; return false;
        }
        *reinterpret_cast<uint32_t *>(base + spec->off) = static_cast<uint32_t>(v);
        return true;
    }
    case DsOptType::Float: {
        char *end = nullptr;
        float f = std::strtof(val.c_str(), &end);
        if (val.empty() || !end || *end != '\0') { err = key + " must be a number"; return false; }
        *reinterpret_cast<float *>(base + spec->off) = f;
        return true;
    }
    case DsOptType::Str: {
        // Resolve a relative path option (e.g. mtp_path: a sibling GGUF the
        // gallery downloaded next to the model) against the model directory, so
        // YAMLs reference companion files by name. Absolute values pass through.
        if (spec->is_path && !model_dir.empty() && !val.empty() && val.front() != '/') {
            storage.push_back(model_dir + "/" + val);
        } else {
            storage.push_back(val);
        }
        *reinterpret_cast<const char **>(base + spec->off) = storage.back().c_str();
        return true;
    }
    case DsOptType::Gib: {
        uint64_t bytes = 0;
        if (!ds4_parse_gib_arg(val.c_str(), &bytes)) {
            err = key + " must be a GiB value, e.g. 64GB"; return false;
        }
        *reinterpret_cast<uint64_t *>(base + spec->off) = bytes;
        return true;
    }
    case DsOptType::CacheExperts: {
        uint32_t experts = 0;
        uint64_t bytes = 0;
        if (!ds4_parse_streaming_cache_experts_arg(val.c_str(), &experts, &bytes)) {
            err = key + " must be a positive expert count or a <number>GB budget"; return false;
        }
        *reinterpret_cast<uint32_t *>(base + spec->off)  = experts;
        *reinterpret_cast<uint64_t *>(base + spec->off2) = bytes;
        return true;
    }
    }
    return true;
 }
 // When acting as a distributed coordinator, block until the worker route
 // covers all layers (ds4_session_distributed_route_ready == 1) or the timeout
 // elapses. Returns an empty string on success, or an error message to return
@@ -476,39 +602,10 @@ public:
            return GStatus::OK;
        }
-        std::string mtp_path;
-        int mtp_draft = 0;
-        float mtp_margin = 3.0f;
-        std::string ds4_role, ds4_layers, ds4_listen;
-        for (const auto &opt : request->options()) {
-            auto [k, v] = split_option(opt);
-            if (k == "mtp_path") mtp_path = v;
-            else if (k == "mtp_draft") mtp_draft = std::stoi(v);
-            else if (k == "mtp_margin") mtp_margin = std::stof(v);
-            else if (k == "kv_cache_dir") g_kv_cache_dir = v;
-            else if (k == "ds4_role") ds4_role = v;
-            else if (k == "ds4_layers") ds4_layers = v;
-            else if (k == "ds4_listen") ds4_listen = v;
-            else if (k == "ds4_route_timeout") {
-                if (!parse_positive_int(v, &g_route_timeout_sec)) {
-                    result->set_success(false);
-                    result->set_message("ds4: ds4_route_timeout must be a positive integer");
-                    return GStatus::OK;
-                }
-            }
-        }
-        g_kv_cache.SetDir(g_kv_cache_dir);
         ds4_engine_options opt = {};
         opt.model_path = model_path.c_str();
-        opt.mtp_path = mtp_path.empty() ? nullptr : mtp_path.c_str();
         opt.n_threads = request->threads() > 0 ? request->threads() : 0;
-        opt.mtp_draft_tokens = mtp_draft;
-        opt.mtp_margin = mtp_margin;
-        opt.directional_steering_file = nullptr;
-        opt.warm_weights = false;
-        opt.quality = false;
+        opt.mtp_margin = 3.0f; // ds4 default; overridable via the mtp_margin option
 #if defined(DS4_NO_GPU)
        opt.backend = DS4_BACKEND_CPU;
@@ -518,6 +615,46 @@ public:
        opt.backend = DS4_BACKEND_CUDA;
 #endif
        // Stable storage for string-valued engine options. The engine reads
        // these by pointer during ds4_engine_open, so the std::string backing
        // store must outlive the call and not reallocate; reserve up front so
        // push_back keeps every prior c_str() valid. Static + clear() reuses
        // the buffer across LoadModel calls (the old engine is closed above).
        static std::vector<std::string> s_opt_strings;
        s_opt_strings.clear();
        s_opt_strings.reserve(sizeof(kEngineOptSpecs) / sizeof(kEngineOptSpecs[0]));
        // Directory of the main model, used to resolve relative path options.
        std::string model_dir;
        if (auto slash = model_path.find_last_of('/'); slash != std::string::npos) {
            model_dir = model_path.substr(0, slash);
        }
        std::string ds4_role, ds4_layers, ds4_listen;
        for (const auto &o : request->options()) {
            auto [k, v] = split_option(o);
            if (k == "kv_cache_dir") { g_kv_cache_dir = v; continue; }
            else if (k == "ds4_role") { ds4_role = v; continue; }
            else if (k == "ds4_layers") { ds4_layers = v; continue; }
            else if (k == "ds4_listen") { ds4_listen = v; continue; }
            else if (k == "ds4_route_timeout") {
                if (!parse_positive_int(v, &g_route_timeout_sec)) {
                    result->set_success(false);
                    result->set_message("ds4: ds4_route_timeout must be a positive integer");
                    return GStatus::OK;
                }
                continue;
            }
            std::string err;
            if (!apply_engine_option(&opt, k, v, model_dir, s_opt_strings, err)) {
                result->set_success(false);
                result->set_message("ds4: " + err);
                return GStatus::OK;
            }
        }
        g_kv_cache.SetDir(g_kv_cache_dir);
        // Coordinator wiring. 'ds4_role:coordinator' enables layer-split
        // distributed inference: this process listens on ds4_listen and owns
        // the ds4_layers slice; workers dial in (see `local-ai worker
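The table-driven option mapping introduced in the ds4 hunks above (a key, a type tag, and an `offsetof` into a plain C struct) can be illustrated in isolation. This is a reduced sketch under stated assumptions: `DemoOptions`, `DemoSpec`, `kDemoSpecs`, and `apply_demo_option` are hypothetical stand-ins, not `ds4_engine_options` or the backend's real table, and only four field types are shown.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical plain C-style options struct (NOT ds4_engine_options).
struct DemoOptions {
    int         draft_tokens;
    float       margin;
    bool        warm;
    const char *aux_path;
};

enum class DemoType { Bool, Int, Float, Str };

struct DemoSpec {
    const char *key;
    DemoType    type;
    std::size_t off;  // byte offset of the target field inside DemoOptions
};

// One row per supported key; adding a knob is a one-line table entry.
static const DemoSpec kDemoSpecs[] = {
    {"draft",    DemoType::Int,   offsetof(DemoOptions, draft_tokens)},
    {"margin",   DemoType::Float, offsetof(DemoOptions, margin)},
    {"warm",     DemoType::Bool,  offsetof(DemoOptions, warm)},
    {"aux_path", DemoType::Str,   offsetof(DemoOptions, aux_path)},
};

// Returns false only when a known key has a malformed value; unknown keys are
// ignored. `storage` must have reserved capacity before the first call:
// a reallocation would move the strings and invalidate earlier c_str() pointers.
bool apply_demo_option(DemoOptions *opt, const std::string &key,
                       const std::string &val, std::vector<std::string> &storage) {
    const DemoSpec *spec = nullptr;
    for (const auto &s : kDemoSpecs) {
        if (key == s.key) { spec = &s; break; }
    }
    if (!spec) return true;
    char *base = reinterpret_cast<char *>(opt);
    char *end = nullptr;
    switch (spec->type) {
    case DemoType::Bool:
        // An empty value (bare flag) counts as true.
        if (val.empty() || val == "true" || val == "1") {
            *reinterpret_cast<bool *>(base + spec->off) = true;
            return true;
        }
        if (val == "false" || val == "0") {
            *reinterpret_cast<bool *>(base + spec->off) = false;
            return true;
        }
        return false;
    case DemoType::Int: {
        long v = std::strtol(val.c_str(), &end, 10);
        if (val.empty() || *end != '\0') return false;
        *reinterpret_cast<int *>(base + spec->off) = static_cast<int>(v);
        return true;
    }
    case DemoType::Float: {
        float f = std::strtof(val.c_str(), &end);
        if (val.empty() || *end != '\0') return false;
        *reinterpret_cast<float *>(base + spec->off) = f;
        return true;
    }
    case DemoType::Str:
        storage.push_back(val);
        *reinterpret_cast<const char **>(base + spec->off) = storage.back().c_str();
        return true;
    }
    return true;
}
```

A caller would typically zero-initialize the struct, call `storage.reserve(...)` with the table size, and then feed each `key:value` pair through `apply_demo_option`, stopping at the first `false`.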
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@
-IK_LLAMA_VERSION?=e6f8112f3ba126eed3ff5b30cdd08085414a7516
+IK_LLAMA_VERSION?=7ccf1d209588962b96eacca325b37e9b3e8faf5e
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/CMakeLists.txt
+++ b/backend/cpp/llama-cpp/CMakeLists.txt
@@ -50,8 +50,13 @@ add_custom_command(
        "${hw_proto}"
      DEPENDS "${hw_proto}")
-# hw_grpc_proto
-add_library(hw_grpc_proto
+# hw_grpc_proto: force STATIC. Under the CPU_ALL_VARIANTS build BUILD_SHARED_LIBS=ON
+# (ggml/llama become shared), which would otherwise make this glue library a DSO. As a
+# DSO it references the hidden-visibility symbols in the static libprotobuf.a, which the
+# linker cannot satisfy ("hidden symbol ... in libprotobuf.a is referenced by DSO").
+# Keeping it STATIC links protobuf/gRPC directly into the grpc-server executable while
+# only ggml/llama stay shared. No effect on the static variants (already BUILD_SHARED_LIBS=OFF).
+add_library(hw_grpc_proto STATIC
  ${hw_grpc_srcs}
  ${hw_grpc_hdrs}
  ${hw_proto_srcs}
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@
-LLAMA_VERSION?=4c6595503fe45d5a39f88d194e270f64c7424677
+LLAMA_VERSION?=be4a6a63eb2b848e19c277bdcf2bd399e8af76d9
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
 CMAKE_ARGS?=
@@ -10,8 +10,16 @@ TARGET?=--target grpc-server
 JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
 ARCH?=$(shell uname -m)
-# Disable Shared libs as we are linking on static gRPC and we can't mix shared and static
-CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=OFF
+# Shared libs default to OFF: we link static gRPC and the avx/avx2/avx512/fallback
+# variants are fully static. The CPU_ALL_VARIANTS build flips SHARED_LIBS=ON (ggml/llama
+# become shared so the dynamic CPU backends work; gRPC stays static via its imported
+# targets). SHARED_LIBS is a make variable, not an appended -D, so it survives the
+# recursive sub-make into the VARIANT build dir (which re-parses this Makefile) instead
+# of being re-clobbered by a second -DBUILD_SHARED_LIBS=OFF. EXTRA_CMAKE_ARGS is the hook
+# the CPU_ALL_VARIANTS target uses to inject -DGGML_BACKEND_DL/-DGGML_CPU_ALL_VARIANTS.
+SHARED_LIBS?=OFF
+EXTRA_CMAKE_ARGS?=
+CMAKE_ARGS+=-DBUILD_SHARED_LIBS=$(SHARED_LIBS) -DLLAMA_CURL=OFF $(EXTRA_CMAKE_ARGS)
 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
 ifeq ($(NATIVE),false)
@@ -120,6 +128,30 @@ llama-cpp-fallback: llama.cpp
 	CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) VARIANT="llama-cpp-fallback-build" build-llama-cpp-grpc-server
 	cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-fallback-build/grpc-server llama-cpp-fallback
 # Single-build CPU backend using ggml's CPU_ALL_VARIANTS. Produces ONE grpc-server
 # plus a set of dlopen-able libggml-cpu-*.so (sandybridge/haswell/skylakex/...) that
 # ggml's backend registry selects from at runtime by probing host CPU features.
 # Replaces the avx/avx2/avx512/fallback multi-binary build on x86.
 #
 # CPU_ALL_VARIANTS requires GGML_BACKEND_DL, which requires BUILD_SHARED_LIBS=ON, so we
 # pass SHARED_LIBS=ON and the DL flags as make variables (NOT pre-expanded into the
 # CMAKE_ARGS env string): command-line make variables propagate through every recursive
 # sub-make, so the deepest VARIANT-dir build computes BUILD_SHARED_LIBS=ON consistently.
 # Only ggml/llama go shared - gRPC is found via its static imported targets, so the
 # grpc-server binary keeps static gRPC and only dynamically links ggml.
 #
 # TARGET adds "ggml": the per-microarch backends are runtime-dlopened, not link deps of
 # grpc-server, so they only build because each is an add_dependencies() of the ggml target.
 llama-cpp-cpu-all: llama.cpp
 	cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build
 	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build purge
 	$(info ${GREEN}I llama-cpp build info:cpu-all-variants${RESET})
 	$(MAKE) SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" VARIANT="llama-cpp-cpu-all-build" build-llama-cpp-grpc-server
 	cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build/grpc-server llama-cpp-cpu-all
 	rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs
 	find $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \;
 	@echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/
 llama-cpp-grpc: llama.cpp
 	cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build
 	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build purge
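The `llama-cpp-cpu-all` comment above describes one `grpc-server` plus several per-microarchitecture `libggml-cpu-*.so` files, one of which is chosen at runtime from the host CPU's features. The idea can be sketched as a pure function. This is an illustration only: `CpuFeaturesSketch`, `pick_cpu_variant`, and the `"baseline"` name are hypothetical, the feature-to-variant mapping is an assumption based on what those microarchitectures support, and ggml's real selection logic lives in its backend registry and is not reproduced here.

```cpp
#include <string>

// Hypothetical feature summary; a real probe would fill this from CPUID.
struct CpuFeaturesSketch {
    bool avx;
    bool avx2;
    bool avx512f;
};

// Pick the most capable variant the host can run, falling back to a
// baseline build when no AVX-family extension is available.
std::string pick_cpu_variant(const CpuFeaturesSketch &f) {
    if (f.avx && f.avx2 && f.avx512f) return "skylakex";
    if (f.avx && f.avx2) return "haswell";
    if (f.avx) return "sandybridge";
    return "baseline";  // hypothetical name for the no-AVX case
}
```

Checking the most demanding variant first means a host that satisfies several variants always gets the widest instruction set it supports.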
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -18,6 +18,18 @@
 #if __has_include("server-chat.cpp")
 #include "server-chat.cpp"
 #endif
 // server-schema.cpp exists only in llama.cpp after the upstream refactor that
 // extracted the JSON request-schema evaluation (previously the static
 // server_task::params_from_json_cmpl) into server_schema::eval_llama_cmpl_schema.
 // server-context.cpp and grpc-server.cpp both call into it, so its definitions
 // must be part of this translation unit or the link fails. __has_include keeps
 // the source compatible with older pins/forks (e.g. llama-cpp-turboquant) that
 // predate the split and still expose params_from_json_cmpl (see the guarded
 // call sites below).
 #if __has_include("server-schema.cpp")
 #define LOCALAI_HAS_SERVER_SCHEMA 1
 #include "server-schema.cpp"
 #endif
 #include "server-context.cpp"
 // LocalAI
@@ -1922,25 +1934,27 @@ public:
                    body_json["min_p"] = data["min_p"];
                }
-                // Pass enable_thinking via chat_template_kwargs (where oaicompat_chat_params_parse reads it)
+                // Forward the chat_template_kwargs the Go layer resolved (model config
                // chat_template_kwargs + per-request metadata: enable_thinking,
                // reasoning_effort, preserve_thinking, ...). One generic merge replaces
                // the previous per-key handling - new template levers need no C++ change.
                // oaicompat_chat_params_parse reads these from body_json.
                const auto& metadata = request->metadata();
-                auto et_it = metadata.find("enable_thinking");
+                auto ctk_it = metadata.find("chat_template_kwargs");
-                if (et_it != metadata.end()) {
+                if (ctk_it != metadata.end() && !ctk_it->second.empty()) {
-                    if (!body_json.contains("chat_template_kwargs")) {
+                    try {
-                        body_json["chat_template_kwargs"] = json::object();
+                        json ctk = json::parse(ctk_it->second);
                        if (ctk.is_object()) {
                            if (!body_json.contains("chat_template_kwargs")) {
                                body_json["chat_template_kwargs"] = json::object();
                            }
                            for (auto& el : ctk.items()) {
                                body_json["chat_template_kwargs"][el.key()] = el.value();
                            }
                        }
                    } catch (const std::exception & e) {
                        SRV_WRN("failed to parse chat_template_kwargs metadata: %s\n", e.what());
                    }
-                    body_json["chat_template_kwargs"]["enable_thinking"] = (et_it->second == "true");
                 }
-                // Pass reasoning_effort via chat_template_kwargs too: the lever
-                // jinja templates like gpt-oss (Harmony) / LFM2.5 read, distinct
-                // from enable_thinking which those templates ignore.
-                auto re_it = metadata.find("reasoning_effort");
-                if (re_it != metadata.end() && !re_it->second.empty()) {
-                    if (!body_json.contains("chat_template_kwargs")) {
-                        body_json["chat_template_kwargs"] = json::object();
-                    }
-                    body_json["chat_template_kwargs"]["reasoning_effort"] = re_it->second;
-                }
                // Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
@@ -2100,7 +2114,11 @@ public:
                task.index = i;
                task.tokens    = std::move(inputs[i]);
 #ifdef LOCALAI_HAS_SERVER_SCHEMA
                task.params           = server_schema::eval_llama_cmpl_schema(
 #else
                task.params           = server_task::params_from_json_cmpl(
 #endif
                        ctx_server.impl->vocab,
                        params_base,
                        ctx_server.get_meta().slot_n_ctx,
@@ -2114,7 +2132,7 @@ public:
                // cannot detect tool calls or separate reasoning from content.
                task.params.res_type                 = TASK_RESPONSE_TYPE_OAI_CHAT;
                task.params.oaicompat_cmpl_id         = completion_id;
-                // oaicompat_model is already populated by params_from_json_cmpl
+                // oaicompat_model is already populated by eval_llama_cmpl_schema
                tasks.push_back(std::move(task));
            }
@@ -2756,25 +2774,26 @@ public:
                    body_json["min_p"] = data["min_p"];
                }
-                // Pass enable_thinking via chat_template_kwargs (where oaicompat_chat_params_parse reads it)
+                // Forward the chat_template_kwargs the Go layer resolved (model config
                // chat_template_kwargs + per-request metadata: enable_thinking,
                // reasoning_effort, preserve_thinking, ...). One generic merge replaces
                // the previous per-key handling - new template levers need no C++ change.
                const auto& predict_metadata = request->metadata();
-                auto predict_et_it = predict_metadata.find("enable_thinking");
+                auto predict_ctk_it = predict_metadata.find("chat_template_kwargs");
-                if (predict_et_it != predict_metadata.end()) {
+                if (predict_ctk_it != predict_metadata.end() && !predict_ctk_it->second.empty()) {
-                    if (!body_json.contains("chat_template_kwargs")) {
+                    try {
-                        body_json["chat_template_kwargs"] = json::object();
+                        json ctk = json::parse(predict_ctk_it->second);
                        if (ctk.is_object()) {
                            if (!body_json.contains("chat_template_kwargs")) {
                                body_json["chat_template_kwargs"] = json::object();
                            }
                            for (auto& el : ctk.items()) {
                                body_json["chat_template_kwargs"][el.key()] = el.value();
                            }
                        }
                    } catch (const std::exception & e) {
                        SRV_WRN("failed to parse chat_template_kwargs metadata: %s\n", e.what());
                    }
-                    body_json["chat_template_kwargs"]["enable_thinking"] = (predict_et_it->second == "true");
                 }
-                // Pass reasoning_effort via chat_template_kwargs too: the lever
-                // jinja templates like gpt-oss (Harmony) / LFM2.5 read, distinct
-                // from enable_thinking which those templates ignore.
-                auto predict_re_it = predict_metadata.find("reasoning_effort");
-                if (predict_re_it != predict_metadata.end() && !predict_re_it->second.empty()) {
-                    if (!body_json.contains("chat_template_kwargs")) {
-                        body_json["chat_template_kwargs"] = json::object();
-                    }
-                    body_json["chat_template_kwargs"]["reasoning_effort"] = predict_re_it->second;
-                }
                // Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
@@ -2937,7 +2956,11 @@ public:
                task.index = i;
                task.tokens    = std::move(inputs[i]);
 #ifdef LOCALAI_HAS_SERVER_SCHEMA
                task.params           = server_schema::eval_llama_cmpl_schema(
 #else
                task.params           = server_task::params_from_json_cmpl(
 #endif
                        ctx_server.impl->vocab,
                        params_base,
                        ctx_server.get_meta().slot_n_ctx,
@@ -2949,7 +2972,7 @@ public:
                // reasoning, tool calls, and content are classified into ChatDeltas.
                task.params.res_type                 = TASK_RESPONSE_TYPE_OAI_CHAT;
                task.params.oaicompat_cmpl_id         = completion_id;
-                // oaicompat_model is already populated by params_from_json_cmpl
+                // oaicompat_model is already populated by eval_llama_cmpl_schema
                tasks.push_back(std::move(task));
            }
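Reviewer sketch (not part of the patch): the new block copies every key of the parsed `chat_template_kwargs` metadata object over whatever `body_json["chat_template_kwargs"]` already holds, and skips metadata that fails to parse or is not a JSON object. The snippet below mimics only the per-key overwrite, with a `std::map` standing in for the JSON objects. `KwargMap` and `merge_template_kwargs` are hypothetical names introduced for illustration; they do not exist in grpc-server.cpp.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a JSON object: key -> serialized value.
using KwargMap = std::map<std::string, std::string>;

// Copy every entry of `incoming` into `target`. Keys already present in
// `target` are overwritten; keys absent from `incoming` are left untouched.
inline void merge_template_kwargs(KwargMap & target, const KwargMap & incoming) {
    for (const auto & el : incoming) {
        target[el.first] = el.second;
    }
}
```

Under this assumption, a request that sends `enable_thinking` and `reasoning_effort` overrides any matching keys already in the body, while unrelated keys survive the merge.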
--- a/backend/cpp/llama-cpp/package.sh
+++ b/backend/cpp/llama-cpp/package.sh
@@ -14,6 +14,22 @@ mkdir -p $CURDIR/package/lib
 cp -avrf $CURDIR/llama-cpp-* $CURDIR/package/
 cp -rfv $CURDIR/run.sh $CURDIR/package/
 # Bundle the ggml shared backends produced by the CPU_ALL_VARIANTS build (libggml-base.so,
 # libggml.so, libllama.so and the per-microarch libggml-cpu-*.so), all into package/lib.
 #
 # Two distinct resolution mechanisms both land here:
 #   - NEEDED deps (libggml-base/libggml/libllama): resolved by the dynamic linker via the
 #     LD_LIBRARY_PATH=$CURDIR/lib that run.sh exports.
 #   - The per-microarch libggml-cpu-*.so are NOT linked; ggml *discovers* them at runtime by
 #     scanning the executable's own directory (readlink /proc/self/exe). run.sh launches via
 #     the bundled $CURDIR/lib/ld.so, so /proc/self/exe -> .../lib/ld.so and ggml scans lib/.
 #     That is why the variants must sit in lib/ (next to ld.so), not just on the link path.
 # No-op on builds (the GPU images) that don't produce the all-variants set.
 if [ -d "$CURDIR/ggml-shared-libs" ]; then
    echo "Bundling ggml shared backends (CPU_ALL_VARIANTS)..."
    cp -avf $CURDIR/ggml-shared-libs/*.so* $CURDIR/package/lib/
 fi
 # Detect architecture and copy appropriate libraries
 if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    # x86_64 architecture
--- a/backend/cpp/llama-cpp/run.sh
+++ b/backend/cpp/llama-cpp/run.sh
@@ -12,26 +12,12 @@ grep -e "flags" /proc/cpuinfo | head -1
 BINARY=llama-cpp-fallback
-if grep -q -e "\savx\s" /proc/cpuinfo ; then
+# CPU images (x86, arm64, darwin) ship a single llama-cpp-cpu-all built with ggml
-	echo "CPU:    AVX    found OK"
+# CPU_ALL_VARIANTS: ggml's backend registry dlopens the best libggml-cpu-*.so for this
-	if [ -e $CURDIR/llama-cpp-avx ]; then
+# host, so no shell-side AVX probing. GPU images (cublas/sycl/vulkan/hipblas) ship only
-		BINARY=llama-cpp-avx
+# llama-cpp-fallback (the accelerator does the compute), so fall back to it when absent.
-	fi
+if [ -e $CURDIR/llama-cpp-cpu-all ]; then
-fi
+	BINARY=llama-cpp-cpu-all
+fi
-if grep -q -e "\savx2\s" /proc/cpuinfo ; then
-	echo "CPU:    AVX2   found OK"
-	if [ -e $CURDIR/llama-cpp-avx2 ]; then
-		BINARY=llama-cpp-avx2
-	fi
-fi
-# Check avx 512
-if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
-	echo "CPU:    AVX512F found OK"
-	if [ -e $CURDIR/llama-cpp-avx512 ]; then
-		BINARY=llama-cpp-avx512
-	fi
-fi
 if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then
--- a/backend/cpp/privacy-filter/.gitignore
+++ b/backend/cpp/privacy-filter/.gitignore
@@ -0,0 +1,9 @@
 /privacy-filter.cpp
 build/
 package/
 grpc-server
 *.o
 backend.pb.cc
 backend.pb.h
 backend.grpc.pb.cc
 backend.grpc.pb.h
--- a/backend/cpp/privacy-filter/CMakeLists.txt
+++ b/backend/cpp/privacy-filter/CMakeLists.txt
@@ -0,0 +1,69 @@
 cmake_minimum_required(VERSION 3.21)
 project(privacy-filter-grpc-server LANGUAGES CXX C)
 set(CMAKE_CXX_STANDARD 17)
 set(CMAKE_CXX_STANDARD_REQUIRED ON)
 set(TARGET grpc-server)
 # Path to the privacy-filter.cpp engine sources. The Makefile arranges for this
 # to exist (clone of a pinned commit, or a symlink to PRIVACY_FILTER_SRC).
 set(PRIVACY_FILTER_DIR "${CMAKE_CURRENT_SOURCE_DIR}/privacy-filter.cpp"
    CACHE PATH "Path to the privacy-filter.cpp engine source tree")
 find_package(Threads REQUIRED)
 find_package(Protobuf CONFIG QUIET)
 if(NOT Protobuf_FOUND)
    find_package(Protobuf REQUIRED)
 endif()
 find_package(gRPC CONFIG QUIET)
 if(NOT gRPC_FOUND)
    # Ubuntu's apt-installed grpc++ does not ship a CMake config - fall back.
    find_library(GRPCPP_LIB grpc++ REQUIRED)
    find_library(GRPCPP_REFLECTION_LIB grpc++_reflection REQUIRED)
    add_library(gRPC::grpc++ INTERFACE IMPORTED)
    set_target_properties(gRPC::grpc++ PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_LIB}")
    add_library(gRPC::grpc++_reflection INTERFACE IMPORTED)
    set_target_properties(gRPC::grpc++_reflection PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_REFLECTION_LIB}")
 endif()
 find_program(_PROTOC NAMES protoc REQUIRED)
 find_program(_GRPC_CPP_PLUGIN NAMES grpc_cpp_plugin REQUIRED)
 get_filename_component(HW_PROTO "${CMAKE_CURRENT_SOURCE_DIR}/../../backend.proto" ABSOLUTE)
 get_filename_component(HW_PROTO_PATH "${HW_PROTO}" PATH)
 set(HW_PROTO_SRCS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.cc")
 set(HW_PROTO_HDRS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.h")
 set(HW_GRPC_SRCS  "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.cc")
 set(HW_GRPC_HDRS  "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.h")
 add_custom_command(
    OUTPUT "${HW_PROTO_SRCS}" "${HW_PROTO_HDRS}" "${HW_GRPC_SRCS}" "${HW_GRPC_HDRS}"
    COMMAND ${_PROTOC}
    ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
         --cpp_out  "${CMAKE_CURRENT_BINARY_DIR}"
         -I "${HW_PROTO_PATH}"
         --plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN}"
         "${HW_PROTO}"
    DEPENDS "${HW_PROTO}")
 add_library(hw_grpc_proto STATIC
    ${HW_GRPC_SRCS} ${HW_GRPC_HDRS}
    ${HW_PROTO_SRCS} ${HW_PROTO_HDRS})
 target_include_directories(hw_grpc_proto PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
 # Build only the pf static lib (+ ggml) from the engine tree — no CLI/bench/tests.
 # PF_VULKAN is honored when passed on the cmake command line (it lands in the
 # shared cache the engine reads).
 set(PF_BUILD_TOOLS OFF CACHE BOOL "" FORCE)
 set(PF_BUILD_TESTS OFF CACHE BOOL "" FORCE)
 add_subdirectory(${PRIVACY_FILTER_DIR} ${CMAKE_CURRENT_BINARY_DIR}/privacy-filter.cpp)
 add_executable(${TARGET} grpc-server.cpp)
 target_link_libraries(${TARGET} PRIVATE
    pf
    hw_grpc_proto
    gRPC::grpc++
    gRPC::grpc++_reflection
    protobuf::libprotobuf
    Threads::Threads)
--- a/backend/cpp/privacy-filter/Makefile
+++ b/backend/cpp/privacy-filter/Makefile
@@ -0,0 +1,77 @@
 # privacy-filter backend Makefile.
 #
 # Wraps the standalone privacy-filter.cpp GGML engine (the openai-privacy-filter
 # PII/NER token classifier) as a LocalAI gRPC backend. The engine source is
 # fetched at the pin below — .github/workflows/bump_deps.yaml finds and updates
 # PRIVACY_FILTER_VERSION, matching the llama-cpp / ds4 convention.
 #
 # Local development: point at a working checkout instead of cloning, e.g.
 #   make PRIVACY_FILTER_SRC=$HOME/c/privacy-filter.cpp grpc-server
 PRIVACY_FILTER_VERSION?=98f52c5ef2250f207cc6b9a6aef05393a120cb7c
 PRIVACY_FILTER_REPO?=https://github.com/localai-org/privacy-filter.cpp
 PRIVACY_FILTER_SRC?=
 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
 BUILD_DIR := build
 BUILD_TYPE ?=
 NATIVE ?= false
 JOBS ?= $(shell nproc 2>/dev/null || echo 4)
 CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release
 # GPU backends; the default (cpu) needs no extra flags. 'cublas' is LocalAI's
 # name for the CUDA build (matches llama-cpp / ds4), mapping to the engine's
 # GGML_CUDA path; 'vulkan' selects the ggml Vulkan backend.
 ifeq ($(BUILD_TYPE),cublas)
    CMAKE_ARGS += -DPF_CUDA=ON
 endif
 ifeq ($(BUILD_TYPE),vulkan)
    CMAKE_ARGS += -DPF_VULKAN=ON
 endif
 # Portable binaries for distribution: disable -march=native unless asked.
 ifneq ($(NATIVE),true)
    CMAKE_ARGS += -DGGML_NATIVE=OFF
 endif
 .PHONY: grpc-server package clean purge test all
 all: grpc-server
 # Provide the engine sources at ./privacy-filter.cpp. With PRIVACY_FILTER_SRC
 # set we symlink a local checkout (instant, no network); otherwise we clone the
 # pinned commit and its ggml submodule. The directory/symlink is the target, so
 # make only does this once — run 'make purge && make' to refetch after a bump.
 privacy-filter.cpp:
 ifneq ($(PRIVACY_FILTER_SRC),)
 	ln -sfn $(abspath $(PRIVACY_FILTER_SRC)) privacy-filter.cpp
 else
 	mkdir -p privacy-filter.cpp
 	cd privacy-filter.cpp && \
 	git init -q && \
 	git remote add origin $(PRIVACY_FILTER_REPO) && \
 	git fetch --depth 1 origin $(PRIVACY_FILTER_VERSION) && \
 	git checkout FETCH_HEAD && \
 	git submodule update --init --recursive --depth 1
 endif
 grpc-server: privacy-filter.cpp
 	@echo "Building privacy-filter grpc-server ($(BUILD_TYPE)) with $(CMAKE_ARGS)"
 	mkdir -p $(BUILD_DIR)
 	cd $(BUILD_DIR) && cmake $(CMAKE_ARGS) $(CURRENT_MAKEFILE_DIR) && cmake --build . --config Release -j $(JOBS)
 	cp $(BUILD_DIR)/grpc-server grpc-server
 package: grpc-server
 	bash package.sh
 test:
 	@echo "privacy-filter backend: parity/regression coverage lives in the engine repo"
 clean:
 	rm -rf $(BUILD_DIR) grpc-server package
 # 'privacy-filter.cpp' may be a symlink (PRIVACY_FILTER_SRC) — rm without a
 # trailing slash removes the link, never the linked-to checkout.
 purge: clean
 	rm -rf privacy-filter.cpp
--- a/backend/cpp/privacy-filter/grpc-server.cpp
+++ b/backend/cpp/privacy-filter/grpc-server.cpp
@@ -0,0 +1,210 @@
 // privacy-filter LocalAI gRPC backend.
 //
 // Thin shim over privacy-filter.cpp's flat C API (include/pf.h): a standalone
 // GGML engine for the openai-privacy-filter token-classification model family
 // (PII NER). It replaces the llama.cpp-patched TokenClassify path for this one
 // model family — same GGUF files, no llama.cpp carry-patches.
 //
 // Only the RPCs the PII tier needs are implemented: LoadModel, TokenClassify,
 // plus Health / Status / Free. Everything else inherits the generated base
 // class default (UNIMPLEMENTED).
 #include "backend.pb.h"
 #include "backend.grpc.pb.h"
 #include "pf.h"
 #include <grpcpp/grpcpp.h>
 #include <grpcpp/server.h>
 #include <grpcpp/server_builder.h>
 #include <grpcpp/ext/proto_server_reflection_plugin.h>
 #include <atomic>
 #include <chrono>
 #include <csignal>
 #include <iostream>
 #include <memory>
 #include <mutex>
 #include <string>
 using grpc::Server;
 using grpc::ServerBuilder;
 using grpc::ServerContext;
 // NOTE: do NOT alias grpc::Status as Status — the Status RPC method below would
 // shadow the type and break the other method signatures. Use GStatus instead.
 using GStatus = ::grpc::Status;
 using grpc::StatusCode;
 namespace {
 // The engine is single-model-per-process: LocalAI spawns one backend process
 // per loaded model. g_mu guards (re)load against in-flight classification.
 std::mutex          g_mu;
 pf_ctx *            g_ctx = nullptr;
 std::atomic<Server *> g_server{nullptr};
 // Resolve the device string the engine expects ("cpu" / "gpu" / "cuda" /
 // "vulkan", optionally ":N"). Priority: an explicit "device:..." in
 // ModelOptions.Options, then a non-zero NGPULayers as a coarse "use the GPU"
 // signal, else CPU. "gpu" lets the engine pick whichever GPU backend this
 // binary was compiled with (CUDA or Vulkan), so the same config works on
 // either build; pin "device:cuda"/"device:vulkan" to be explicit.
 std::string resolve_device(const backend::ModelOptions * opts) {
    for (const auto & o : opts->options()) {
        const std::string prefix = "device:";
        if (o.rfind(prefix, 0) == 0) {
            return o.substr(prefix.size());
        }
    }
    if (opts->ngpulayers() > 0) {
        return "gpu";
    }
    return "cpu";
 }
 class PrivacyFilterBackend final : public backend::Backend::Service {
 public:
    GStatus Health(ServerContext *, const backend::HealthMessage *,
                   backend::Reply * reply) override {
        reply->set_message("OK");
        return GStatus::OK;
    }
    GStatus Status(ServerContext *, const backend::HealthMessage *,
                   backend::StatusResponse * response) override {
        std::lock_guard<std::mutex> lock(g_mu);
        response->set_state(g_ctx ? backend::StatusResponse::READY
                                  : backend::StatusResponse::UNINITIALIZED);
        return GStatus::OK;
    }
    GStatus LoadModel(ServerContext *, const backend::ModelOptions * request,
                      backend::Result * result) override {
        std::lock_guard<std::mutex> lock(g_mu);
        // ModelFile is the absolute path LocalAI resolves; Model is the bare
        // name. Prefer the former, fall back to the latter.
        const std::string path =
            !request->modelfile().empty() ? request->modelfile() : request->model();
        if (path.empty()) {
            result->set_success(false);
            result->set_message("no model path supplied");
            return GStatus::OK;
        }
        const std::string device = resolve_device(request);
        if (g_ctx) { pf_free(g_ctx); g_ctx = nullptr; }
        pf_ctx * ctx = pf_load(path.c_str(), device.c_str(), request->threads());
        const char * err = pf_last_error(ctx);
        if (err) {
            result->set_success(false);
            result->set_message(std::string("privacy-filter load failed: ") + err);
            pf_free(ctx);
            return GStatus::OK;
        }
        // ContextSize, when set, becomes the per-forward window. The engine
        // ignores values that are too small to window (<= 2*halo) and just
        // runs a single forward, so passing it through is always safe.
        if (request->contextsize() > 0) {
            pf_set_window(ctx, request->contextsize());
        }
        g_ctx = ctx;
        result->set_success(true);
        result->set_message("privacy-filter loaded (" + device + ")");
        return GStatus::OK;
    }
    GStatus TokenClassify(ServerContext *, const backend::TokenClassifyRequest * request,
                          backend::TokenClassifyResponse * response) override {
        std::lock_guard<std::mutex> lock(g_mu);
        if (!g_ctx) {
            return GStatus(StatusCode::FAILED_PRECONDITION, "Model not loaded");
        }
        const std::string & text = request->text();
        if (text.empty()) {
            return GStatus::OK;  // no text -> no entities
        }
        pf_entity * ents = nullptr;
        size_t      n    = 0;
        if (pf_classify(g_ctx, text.data(), text.size(), request->threshold(), &ents, &n) != 0) {
            const char * err = pf_last_error(g_ctx);
            return GStatus(StatusCode::INTERNAL,
                           std::string("TokenClassify failed: ") + (err ? err : "unknown"));
        }
        // Byte offsets are into the original UTF-8 text; the engine already
        // applied the threshold and whitespace-trimmed span edges.
        for (size_t i = 0; i < n; i++) {
            backend::TokenClassifyEntity * ent = response->add_entities();
            ent->set_entity_group(ents[i].label ? ents[i].label : "");
            ent->set_start(ents[i].start);
            ent->set_end(ents[i].end);
            ent->set_score(ents[i].score);
            ent->set_text(text.substr((size_t) ents[i].start,
                                      (size_t) (ents[i].end - ents[i].start)));
        }
        pf_entities_free(ents, n);
        return GStatus::OK;
    }
    GStatus Free(ServerContext *, const backend::HealthMessage *,
                 backend::Result * result) override {
        std::lock_guard<std::mutex> lock(g_mu);
        if (g_ctx) { pf_free(g_ctx); g_ctx = nullptr; }
        result->set_success(true);
        return GStatus::OK;
    }
 };
 void RunServer(const std::string & addr) {
    PrivacyFilterBackend service;
    grpc::EnableDefaultHealthCheckService(true);
    grpc::reflection::InitProtoReflectionServerBuilderPlugin();
    ServerBuilder builder;
    builder.AddListeningPort(addr, grpc::InsecureServerCredentials());
    builder.RegisterService(&service);
    builder.SetMaxReceiveMessageSize(64 * 1024 * 1024);
    builder.SetMaxSendMessageSize(64 * 1024 * 1024);
    std::unique_ptr<Server> server(builder.BuildAndStart());
    if (!server) {
        std::cerr << "privacy-filter grpc-server: failed to bind " << addr << "\n";
        std::exit(1);
    }
    g_server = server.get();
    std::cerr << "privacy-filter grpc-server listening on " << addr << "\n";
    server->Wait();
 }
 void signal_handler(int) {
    if (auto * srv = g_server.load()) {
        srv->Shutdown(std::chrono::system_clock::now() + std::chrono::seconds(3));
    }
 }
 } // namespace
 int main(int argc, char * argv[]) {
    std::string addr = "127.0.0.1:50051";
    for (int i = 1; i < argc; ++i) {
        std::string a = argv[i];
        const std::string addr_flag = "--addr=";
        if (a.rfind(addr_flag, 0) == 0)      addr = a.substr(addr_flag.size());
        else if (a == "--addr" && i + 1 < argc) addr = argv[++i];
        else if (a == "--help" || a == "-h") {
            std::cout << "Usage: grpc-server --addr=HOST:PORT\n";
            return 0;
        }
    }
    std::signal(SIGINT,  signal_handler);
    std::signal(SIGTERM, signal_handler);
    RunServer(addr);
    return 0;
 }
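Reviewer sketch (not part of the patch): the device priority that `resolve_device` documents can be exercised without the generated protobuf types. The snippet below assumes a plain struct, `DeviceHints`, in place of `backend::ModelOptions`. Both `DeviceHints` and `resolve_device_sketch` are hypothetical names used only for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the two ModelOptions fields the resolver reads.
struct DeviceHints {
    std::vector<std::string> options;  // e.g. {"device:vulkan:1"}
    int ngpulayers = 0;
};

// Priority: an explicit "device:..." option wins, then a non-zero
// ngpulayers selects "gpu", otherwise the result is "cpu".
inline std::string resolve_device_sketch(const DeviceHints & hints) {
    const std::string prefix = "device:";
    for (const auto & o : hints.options) {
        if (o.rfind(prefix, 0) == 0) {  // true only when `o` starts with the prefix
            return o.substr(prefix.size());
        }
    }
    return hints.ngpulayers > 0 ? "gpu" : "cpu";
}
```

With these assumptions, empty hints resolve to `cpu`, a positive `ngpulayers` resolves to `gpu`, and an explicit `device:vulkan:1` option takes precedence over both.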
--- a/backend/cpp/privacy-filter/package.sh
+++ b/backend/cpp/privacy-filter/package.sh
@@ -0,0 +1,39 @@
 #!/bin/bash
 # Assemble package/ for the from-scratch backend image: the grpc-server binary,
 # run.sh, the dynamic loader, and every shared library the binary needs.
 set -e
 CURDIR=$(dirname "$(realpath "$0")")
 REPO_ROOT="${CURDIR}/../../.."
 mkdir -p "$CURDIR/package/lib"
 cp -avf "$CURDIR/grpc-server" "$CURDIR/package/"
 cp -rfv "$CURDIR/run.sh"      "$CURDIR/package/"
 # The dynamic loader, renamed to lib/ld.so so run.sh can invoke it explicitly
 # (makes the image independent of the host's glibc layout).
 if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
 elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
 else
    echo "package.sh: unknown architecture" >&2; exit 1
 fi
 # Bundle the binary's transitive shared deps (libstdc++, libgomp, and the apt
 # grpc++/protobuf/absl stack) by walking ldd — robust to whichever of those are
 # linked shared vs static. The loader line (no "=>") is skipped; ld.so above
 # already covers it.
 ldd "$CURDIR/grpc-server" | awk '$2 == "=>" && $3 ~ /^\// { print $3 }' | sort -u | \
 while read -r so; do
    [ -f "$so" ] && cp -arfLv "$so" "$CURDIR/package/lib/"
 done
 # Vulkan loader / GPU libs when building the GPU variant.
 GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
 if [ -f "$GPU_LIB_SCRIPT" ]; then
    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
    package_gpu_libs
 fi
 echo "privacy-filter package contents:"
 ls -lah "$CURDIR/package/" "$CURDIR/package/lib/"
--- a/backend/cpp/privacy-filter/run.sh
+++ b/backend/cpp/privacy-filter/run.sh
@@ -0,0 +1,9 @@
 #!/bin/bash
 # Entry point for the privacy-filter backend image / BACKEND_BINARY mode.
 set -e
 CURDIR=$(dirname "$(realpath "$0")")
 export LD_LIBRARY_PATH="$CURDIR/lib:$LD_LIBRARY_PATH"
 if [ -f "$CURDIR/lib/ld.so" ]; then
    exec "$CURDIR/lib/ld.so" "$CURDIR/grpc-server" "$@"
 fi
 exec "$CURDIR/grpc-server" "$@"
--- a/backend/cpp/turboquant/Makefile
+++ b/backend/cpp/turboquant/Makefile
@@ -65,6 +65,29 @@ turboquant-avx:
 turboquant-fallback:
 	$(call turboquant-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
 # Single-build CPU backend via ggml CPU_ALL_VARIANTS (mirrors llama-cpp-cpu-all).
 # turboquant reuses backend/cpp/llama-cpp's CMakeLists.txt (hw_grpc_proto STATIC) and
 # Makefile (SHARED_LIBS make-var + EXTRA_CMAKE_ARGS), so this passes the same overrides
 # through to the copied build: SHARED_LIBS=ON, the DL flags, and --target ggml (which
 # pulls in the per-microarch libggml-cpu-*.so via ggml's add_dependencies). The .so set
 # is collected for package.sh to bundle into package/lib.
 turboquant-cpu-all:
 	rm -rf $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build
 	cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build
 	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build purge
 	bash $(CURRENT_MAKEFILE_DIR)/patch-grpc-server.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/grpc-server.cpp
 	$(info $(GREEN)I turboquant build info:cpu-all-variants$(RESET))
 	LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
 	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build llama.cpp
 	bash $(CURRENT_MAKEFILE_DIR)/apply-patches.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/llama.cpp $(PATCHES_DIR)
 	SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" \
 	LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
 	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build grpc-server
 	cp -rfv $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/grpc-server turboquant-cpu-all
 	rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs
 	find $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \;
 	@echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/
 turboquant-grpc:
 	$(call turboquant-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target rpc-server)
--- a/backend/cpp/turboquant/package.sh
+++ b/backend/cpp/turboquant/package.sh
@@ -14,6 +14,15 @@ mkdir -p $CURDIR/package/lib
 cp -avrf $CURDIR/turboquant-* $CURDIR/package/
 cp -rfv $CURDIR/run.sh $CURDIR/package/
 # Bundle the ggml shared backends from the CPU_ALL_VARIANTS build into package/lib. ggml
 # discovers the per-microarch libggml-cpu-*.so by scanning the executable directory, which
 # (via the bundled lib/ld.so that run.sh launches through) resolves to lib/. See the
 # matching comment in backend/cpp/llama-cpp/package.sh. No-op on the fallback/ROCm builds.
 if [ -d "$CURDIR/ggml-shared-libs" ]; then
    echo "Bundling ggml shared backends (CPU_ALL_VARIANTS)..."
    cp -avf $CURDIR/ggml-shared-libs/*.so* $CURDIR/package/lib/
 fi
 # Detect architecture and copy appropriate libraries
 if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    # x86_64 architecture
--- a/backend/cpp/turboquant/run.sh
+++ b/backend/cpp/turboquant/run.sh
@@ -12,26 +12,11 @@ grep -e "flags" /proc/cpuinfo | head -1
 BINARY=turboquant-fallback
-if grep -q -e "\savx\s" /proc/cpuinfo ; then
+# x86/arm64 ship a single turboquant-cpu-all built with ggml CPU_ALL_VARIANTS: ggml's
-	echo "CPU:    AVX    found OK"
+# backend registry dlopens the best libggml-cpu-*.so for this host, so no shell-side
-	if [ -e $CURDIR/turboquant-avx ]; then
+# probing. ROCm ships only turboquant-fallback, so fall back to it when cpu-all is absent.
-		BINARY=turboquant-avx
+if [ -e $CURDIR/turboquant-cpu-all ]; then
-	fi
+	BINARY=turboquant-cpu-all
 fi
-if grep -q -e "\savx2\s" /proc/cpuinfo ; then
-	echo "CPU:    AVX2   found OK"
-	if [ -e $CURDIR/turboquant-avx2 ]; then
-		BINARY=turboquant-avx2
-	fi
-fi
-# Check avx 512
-if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
-	echo "CPU:    AVX512F found OK"
-	if [ -e $CURDIR/turboquant-avx512 ]; then
-		BINARY=turboquant-avx512
-	fi
-fi
 if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then
--- a/backend/go/ced/.gitignore
+++ b/backend/go/ced/.gitignore
@@ -0,0 +1,11 @@
 .cache/
 sources/
 build/
 package/
 ced-grpc
 # build artifacts staged in-tree by the Makefile (cp from sources/) or
 # symlinked for local dev; the real sources live in ced.cpp upstream.
 *.so
 *.so.*
 ced_capi.h
 compile_commands.json
--- a/backend/go/ced/Makefile
+++ b/backend/go/ced/Makefile
@@ -0,0 +1,77 @@
 # ced sound-classification backend Makefile.
 #
 # Upstream pin lives below as CED_VERSION?=<sha> so .github/bump_deps.sh can find
 # and update it (matches the parakeet-cpp / whisper.cpp convention).
 #
 # Local dev shortcut: symlink an out-of-tree ced.cpp shared build + header and
 # skip the clone/cmake steps entirely:
 #   ln -sf /path/to/ced.cpp/build-shared/libced.so .
 #   ln -sf /path/to/ced.cpp/include/ced_capi.h .
 #   go build -o ced-grpc .
 CED_VERSION?=c04ac14b7992d00584d9e812c9bb6268598a6ce7
 CED_REPO?=https://github.com/mudler/ced.cpp
 GOCMD?=go
 GO_TAGS?=
 JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
 BUILD_TYPE?=
 NATIVE?=false
 # Static-link ggml into libced.so (PIC) so the shared lib is self-contained:
 # dlopen needs no libggml*.so alongside it, only system libs the runtime image
 # already provides.
 CMAKE_ARGS?=-DCMAKE_BUILD_TYPE=Release -DCED_SHARED=ON -DCED_BUILD_CLI=OFF -DCED_BUILD_TESTS=OFF -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON
 ifeq ($(NATIVE),false)
 	CMAKE_ARGS+=-DGGML_NATIVE=OFF
 endif
 # ced.cpp gates its ggml backends behind CED_GGML_* options (set(... CACHE BOOL
 # "" FORCE)), so forward those instead of a bare -DGGML_CUDA=ON.
 ifeq ($(BUILD_TYPE),cublas)
 	CMAKE_ARGS+=-DCED_GGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON
 else ifeq ($(BUILD_TYPE),openblas)
 	CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
 else ifeq ($(BUILD_TYPE),hipblas)
 	CMAKE_ARGS+=-DCED_GGML_HIP=ON
 else ifeq ($(BUILD_TYPE),vulkan)
 	CMAKE_ARGS+=-DCED_GGML_VULKAN=ON
 endif
 .PHONY: ced-grpc package build clean purge test all
 all: ced-grpc
 sources/ced.cpp:
 	mkdir -p sources/ced.cpp
 	cd sources/ced.cpp && \
 	git init -q && \
 	git remote add origin $(CED_REPO) && \
 	git fetch --depth 1 origin $(CED_VERSION) && \
 	git checkout FETCH_HEAD && \
 	git submodule update --init --recursive --depth 1 --single-branch
 libced.so: sources/ced.cpp
 	cmake -B sources/ced.cpp/build-shared -S sources/ced.cpp $(CMAKE_ARGS)
 	cmake --build sources/ced.cpp/build-shared --config Release -j$(JOBS)
 	cp -fv sources/ced.cpp/build-shared/libced.so* ./ 2>/dev/null || true
 	cp -fv sources/ced.cpp/include/ced_capi.h ./
 ced-grpc: libced.so main.go goced.go
 	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o ced-grpc .
 package: ced-grpc
 	bash package.sh
 build: package
 test:
 	LD_LIBRARY_PATH=$(CURDIR):$$LD_LIBRARY_PATH $(GOCMD) test ./... -count=1
 clean: purge
 	rm -rf libced.so* ced_capi.h package ced-grpc
 purge:
 	rm -rf sources/ced.cpp
--- a/backend/go/ced/goced.go
+++ b/backend/go/ced/goced.go
@@ -0,0 +1,130 @@
 package main
 // Go side of the ced backend: purego bindings over ced_capi.h plus the gRPC
 // SoundDetection implementation.
 //
 // SKETCH: the pb.SoundDetection* types come from backend.proto (regenerate with
 // `make protogen-go`). The C side is single-threaded per ctx, so we guard the
 // engine with engineMu; LocalAI also serializes via base.SingleThread.
 import (
 	"context"
 	"encoding/json"
 	"errors"
 	"fmt"
 	"sort"
 	"sync"
 	"unsafe"
 	"github.com/mudler/LocalAI/pkg/grpc/base"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 )
 // purego-bound entry points from libced.so. Names match ced_capi.h exactly.
 var (
 	CppAbiVersion       func() int32
 	CppLoad             func(ggufPath string) uintptr
 	CppFree             func(ctx uintptr)
 	CppLastError        func(ctx uintptr) string
 	CppNumClasses       func(ctx uintptr) int32
 	CppSampleRate       func(ctx uintptr) int32
 	CppClassifyPathJSON func(ctx uintptr, wavPath string, topK int32) uintptr
 	CppClassifyPcmJSON  func(ctx uintptr, pcm []float32, nSamples int32, sampleRate int32, topK int32) uintptr
 	CppFreeString       func(s uintptr)
 )
 // cstr copies a malloc'd C string (returned as uintptr) into a Go string and
 // frees the original via ced_capi_free_string. Empty/0 -> "".
 func cstr(p uintptr) string {
 	if p == 0 {
 		return ""
 	}
 	defer CppFreeString(p)
 	var b []byte
 	for i := 0; ; i++ {
 		ch := *(*byte)(unsafe.Pointer(p + uintptr(i))) //nolint:govet // #nosec G103 -- C-owned NUL-terminated string from libced (not Go-GC memory)
 		if ch == 0 {
 			break
 		}
 		b = append(b, ch)
 	}
 	return string(b)
 }
 // Ced is the gRPC backend. One loaded CED model per instance.
 type Ced struct {
 	base.Base
 	ctxPtr   uintptr
 	engineMu sync.Mutex
 }
 // Load resolves the GGUF and opens the C-API context.
 func (c *Ced) Load(opts *pb.ModelOptions) error {
 	if opts.ModelFile == "" {
 		return errors.New("ced: ModelFile is required")
 	}
 	ctx := CppLoad(opts.ModelFile)
 	if ctx == 0 {
 		return fmt.Errorf("ced: ced_capi_load failed for %q: %s", opts.ModelFile, CppLastError(0))
 	}
 	c.ctxPtr = ctx
 	return nil
 }
 // jsonTag mirrors the ced_capi JSON tag objects.
 type jsonTag struct {
 	Index int     `json:"index"`
 	Score float32 `json:"score"`
 	Label string  `json:"label"`
 }
 // SoundDetection classifies the clip at req.Src and returns scored AudioSet tags.
 func (c *Ced) SoundDetection(ctx context.Context, req *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) {
 	if c.ctxPtr == 0 {
 		return nil, errors.New("ced: model not loaded")
 	}
 	if req.GetSrc() == "" {
 		return nil, errors.New("ced: SoundDetectionRequest.src (audio path) is required")
 	}
 	topK := req.GetTopK()
 	if topK <= 0 {
 		topK = 10 // sensible default for a tagging response
 	}
 	c.engineMu.Lock()
 	out := cstr(CppClassifyPathJSON(c.ctxPtr, req.GetSrc(), topK))
 	lastErr := CppLastError(c.ctxPtr)
 	c.engineMu.Unlock()
 	if out == "" {
 		return nil, fmt.Errorf("ced: classification failed: %s", lastErr)
 	}
 	var tags []jsonTag
 	if err := json.Unmarshal([]byte(out), &tags); err != nil {
 		return nil, fmt.Errorf("ced: bad classifier JSON: %w", err)
 	}
 	thr := req.GetThreshold()
 	resp := &pb.SoundDetectionResponse{}
 	for _, t := range tags {
 		if t.Score < thr {
 			continue
 		}
 		resp.Detections = append(resp.Detections, &pb.SoundClass{
 			Label: t.Label, Score: t.Score, Index: int32(t.Index),
 		})
 	}
 	sort.Slice(resp.Detections, func(i, j int) bool {
 		return resp.Detections[i].Score > resp.Detections[j].Score
 	})
 	return resp, nil
 }
 func (c *Ced) Free() error {
 	c.engineMu.Lock()
 	defer c.engineMu.Unlock()
 	if c.ctxPtr != 0 {
 		CppFree(c.ctxPtr)
 		c.ctxPtr = 0
 	}
 	return nil
 }
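
The post-processing in `SoundDetection` above (decode the classifier JSON, drop tags below the threshold, sort by descending score) can be sketched in isolation. This is a minimal illustration rather than the backend's code: `tag` and `filterTags` are hypothetical names, and the payload in `main` is made up, since the exact JSON emitted by `ced_capi_classify_path_json` is only assumed to match the `jsonTag` fields.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// tag mirrors the {index, score, label} shape of jsonTag (assumed layout).
type tag struct {
	Index int     `json:"index"`
	Score float32 `json:"score"`
	Label string  `json:"label"`
}

// filterTags decodes a JSON array of tags, drops entries scoring below thr,
// and returns the remainder sorted by descending score.
func filterTags(raw string, thr float32) ([]tag, error) {
	var tags []tag
	if err := json.Unmarshal([]byte(raw), &tags); err != nil {
		return nil, fmt.Errorf("bad classifier JSON: %w", err)
	}
	kept := make([]tag, 0, len(tags))
	for _, t := range tags {
		if t.Score >= thr {
			kept = append(kept, t)
		}
	}
	sort.Slice(kept, func(i, j int) bool { return kept[i].Score > kept[j].Score })
	return kept, nil
}

func main() {
	// Hypothetical payload: labels and scores are invented for illustration.
	raw := `[{"index":0,"score":0.2,"label":"Speech"},` +
		`{"index":1,"score":0.9,"label":"Dog"},` +
		`{"index":2,"score":0.5,"label":"Music"}]`
	kept, err := filterTags(raw, 0.3)
	if err != nil {
		panic(err)
	}
	for _, t := range kept {
		fmt.Printf("%s %.2f\n", t.Label, t.Score)
	}
}
```

Filtering before sorting keeps the sort small when the threshold is strict; a threshold of zero leaves every non-negative score in place, which is the behaviour when the request omits it.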
--- a/backend/go/ced/main.go
+++ b/backend/go/ced/main.go
@@ -0,0 +1,59 @@
 package main
 // ced sound-classification backend. Started internally by LocalAI: one gRPC
 // server per loaded model. Loads libced.so via purego and registers the flat
 // C-API declared in ced_capi.h. The library name can be overridden with
 // CED_LIBRARY (mirrors PARAKEET_LIBRARY / WHISPER_LIBRARY); the default looks
 // for the .so next to this binary.
 //
 // SKETCH: requires `make protogen-go` after the backend.proto SoundDetection
 // addition, and a built libced.so (see Makefile). See DESIGN.md.
 import (
 	"flag"
 	"fmt"
 	"os"
 	"github.com/ebitengine/purego"
 	grpc "github.com/mudler/LocalAI/pkg/grpc"
 )
 var addr = flag.String("addr", "localhost:50051", "the address to listen on")
 type libFunc struct {
 	ptr  any
 	name string
 }
 func main() {
 	libName := os.Getenv("CED_LIBRARY")
 	if libName == "" {
 		libName = "libced.so"
 	}
 	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
 	if err != nil {
 		panic(fmt.Errorf("ced: dlopen %q: %w", libName, err))
 	}
 	// Bound 1:1 to ced_capi.h. char*-returning functions are declared uintptr
 	// so we can free the same pointer with ced_capi_free_string after copying
 	// (purego's string return would copy and leak the original).
 	for _, lf := range []libFunc{
 		{&CppAbiVersion, "ced_capi_abi_version"},
 		{&CppLoad, "ced_capi_load"},
 		{&CppFree, "ced_capi_free"},
 		{&CppLastError, "ced_capi_last_error"},
 		{&CppNumClasses, "ced_capi_num_classes"},
 		{&CppSampleRate, "ced_capi_sample_rate"},
 		{&CppClassifyPathJSON, "ced_capi_classify_path_json"},
 		{&CppClassifyPcmJSON, "ced_capi_classify_pcm_json"},
 		{&CppFreeString, "ced_capi_free_string"},
 	} {
 		purego.RegisterLibFunc(lf.ptr, lib, lf.name)
 	}
 	fmt.Fprintf(os.Stderr, "[ced] ABI=%d\n", CppAbiVersion())
 	flag.Parse()
 	if err := grpc.StartServer(*addr, &Ced{}); err != nil {
 		panic(err)
 	}
 }
--- a/backend/go/ced/package.sh
+++ b/backend/go/ced/package.sh
@@ -0,0 +1,60 @@
 #!/bin/bash
 #
 # Bundle the ced-grpc binary, libced.so, the core runtime libs (libc/libstdc++/
 # libgomp + ld.so) and the GPU runtime for the active BUILD_TYPE so the package
 # is self-contained. Mirrors backend/go/parakeet-cpp/package.sh; run.sh routes
 # the (CGO_ENABLED=0) binary through lib/ld.so so the packaged libc is used.
 set -e
 CURDIR=$(dirname "$(realpath "$0")")
 REPO_ROOT="${CURDIR}/../../.."
 mkdir -p "$CURDIR/package/lib"
 cp -avf "$CURDIR/ced-grpc" "$CURDIR/package/"
 cp -avf "$CURDIR/run.sh" "$CURDIR/package/"
 cp -avf "$CURDIR"/libced.so* "$CURDIR/package/lib/" 2>/dev/null || {
 	echo "ERROR: libced.so not found in $CURDIR, run 'make' first" >&2
 	exit 1
 }
 if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    echo "Detected x86_64 architecture, copying x86_64 libraries..."
    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
 elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
    echo "Detected ARM64 architecture, copying ARM64 libraries..."
    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
 elif [ "$(uname -s)" = "Darwin" ]; then
    echo "Detected Darwin"
 else
    echo "Error: Could not detect architecture"
    exit 1
 fi
 GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
 if [ -f "$GPU_LIB_SCRIPT" ]; then
    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
    package_gpu_libs
 fi
 echo "Packaging completed successfully"
 ls -liah "$CURDIR/package/" "$CURDIR/package/lib/"
--- a/backend/go/ced/run.sh
+++ b/backend/go/ced/run.sh
@@ -0,0 +1,15 @@
 #!/bin/bash
 set -e
 CURDIR=$(dirname "$(realpath "$0")")
 export LD_LIBRARY_PATH="$CURDIR/lib:$CURDIR:${LD_LIBRARY_PATH:-}"
 # If a self-contained ld.so was packaged, route through it so the packaged
 # libc / libstdc++ are used instead of the host's (matches the sibling backends).
 if [ -f "$CURDIR/lib/ld.so" ]; then
 	echo "Using lib/ld.so"
 	exec "$CURDIR/lib/ld.so" "$CURDIR/ced-grpc" "$@"
 fi
 exec "$CURDIR/ced-grpc" "$@"
--- a/backend/go/crispasr/Makefile
+++ b/backend/go/crispasr/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
 # CrispASR version (release tag)
 CRISPASR_REPO?=https://github.com/CrispStrobe/CrispASR
-CRISPASR_VERSION?=d745bda4386ae0f9d1d2f23fff8ec95d76428221
+CRISPASR_VERSION?=96b2a6ee31d30389fed8a7ef1a54239b75231ddc
 SO_TARGET?=libgocrispasr.so
 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
@@ -67,7 +67,7 @@ sources/CrispASR:
 	# it, so ${CMAKE_SOURCE_DIR} is THIS backend dir and the talk-llama sources
 	# aren't found. Rewrite to ${PROJECT_SOURCE_DIR} (the crispasr project root),
 	# which is correct both standalone and as a subproject. Idempotent.
-	sed -i 's#\$${CMAKE_SOURCE_DIR}/examples/talk-llama#\$${PROJECT_SOURCE_DIR}/examples/talk-llama#' sources/CrispASR/src/CMakeLists.txt
+	sed -i.bak 's#\$${CMAKE_SOURCE_DIR}/examples/talk-llama#\$${PROJECT_SOURCE_DIR}/examples/talk-llama#' sources/CrispASR/src/CMakeLists.txt && rm -f sources/CrispASR/src/CMakeLists.txt.bak
 # Detect OS
 UNAME_S := $(shell uname -s)
--- a/backend/go/crispasr/cpp/crispasr_shim.cpp
+++ b/backend/go/crispasr/cpp/crispasr_shim.cpp
@@ -47,6 +47,74 @@ extern "C" void set_abort(int v) {
  g_abort.store(v, std::memory_order_relaxed);
 }
 // --- word-level timestamp accessors ---
 extern "C" {
 int crispasr_session_result_n_words(crispasr_session_result *r, int seg_i);
 const char *crispasr_session_result_word_text(crispasr_session_result *r,
                                               int seg_i, int word_i);
 int64_t crispasr_session_result_word_t0(crispasr_session_result *r, int seg_i,
                                         int word_i);
 int64_t crispasr_session_result_word_t1(crispasr_session_result *r, int seg_i,
                                         int word_i);
 // Parakeet-specific word accessors
 int crispasr_parakeet_result_n_words(void *r);
 const char *crispasr_parakeet_result_word_text(void *r, int word_i);
 int64_t crispasr_parakeet_result_word_t0(void *r, int word_i);
 int64_t crispasr_parakeet_result_word_t1(void *r, int word_i);
 }
 void *get_result(void) { return g_result; }
 int get_word_count(int seg_i) {
  if (!g_result)
    return 0;
  return crispasr_session_result_n_words(g_result, seg_i);
 }
 const char *get_word_text(int seg_i, int word_i) {
  if (!g_result)
    return "";
  return crispasr_session_result_word_text(g_result, seg_i, word_i);
 }
 int64_t get_word_t0(int seg_i, int word_i) {
  if (!g_result)
    return 0;
  return crispasr_session_result_word_t0(g_result, seg_i, word_i);
 }
 int64_t get_word_t1(int seg_i, int word_i) {
  if (!g_result)
    return 0;
  return crispasr_session_result_word_t1(g_result, seg_i, word_i);
 }
 // Parakeet-specific word accessors
 int get_parakeet_word_count(void) {
  if (!g_result)
    return 0;
  return crispasr_parakeet_result_n_words(g_result);
 }
 const char *get_parakeet_word_text(int word_i) {
  if (!g_result)
    return "";
  return crispasr_parakeet_result_word_text(g_result, word_i);
 }
 int64_t get_parakeet_word_t0(int word_i) {
  if (!g_result)
    return 0;
  return crispasr_parakeet_result_word_t0(g_result, word_i);
 }
 int64_t get_parakeet_word_t1(int word_i) {
  if (!g_result)
    return 0;
  return crispasr_parakeet_result_word_t1(g_result, word_i);
 }
 static void ggml_log_cb(enum ggml_log_level level, const char *log,
                        void *data) {
  const char *level_str;
--- a/backend/go/crispasr/cpp/crispasr_shim.h
+++ b/backend/go/crispasr/cpp/crispasr_shim.h
@@ -20,4 +20,18 @@ float *tts_synthesize(const char *text, int *out_n_samples); // 24kHz mono float
 void tts_free(float *pcm);
 int tts_set_voice(const char *name); // best-effort speaker selection; 0 ok
 int tts_set_voice_file(const char *path, const char *ref_text); // load voice pack (.gguf) or zero-shot clone (.wav + ref_text)
 // --- word-level timestamp accessors ---
 // Session-based (works for whisper-like backends)
 void *get_result(void);
 int get_word_count(int seg_i);
 const char *get_word_text(int seg_i, int word_i);
 int64_t get_word_t0(int seg_i, int word_i);
 int64_t get_word_t1(int seg_i, int word_i);
 // Parakeet-specific (global word list, no segment index)
 int get_parakeet_word_count(void);
 const char *get_parakeet_word_text(int word_i);
 int64_t get_parakeet_word_t0(int word_i);
 int64_t get_parakeet_word_t1(int word_i);
 }
--- a/backend/go/crispasr/gocrispasr.go
+++ b/backend/go/crispasr/gocrispasr.go
@@ -34,6 +34,18 @@ var (
 	CppTTSFree         func(ptr uintptr)
 	CppTTSSetVoice     func(name string) int
 	CppTTSSetVoiceFile func(path string, refText string) int
 	// Word-level timestamp accessors (session-based, per-segment)
 	CppGetWordCount func(segI int) int
 	CppGetWordText  func(segI int, wordI int) string
 	CppGetWordT0    func(segI int, wordI int) int64
 	CppGetWordT1    func(segI int, wordI int) int64
 	// Parakeet-specific word accessors (global, no segment index)
 	CppGetParakeetWordCount func() int
 	CppGetParakeetWordText  func(wordI int) string
 	CppGetParakeetWordT0    func(wordI int) int64
 	CppGetParakeetWordT1    func(wordI int) int64
 )
 type CrispASR struct {
@@ -212,6 +224,28 @@ func (w *CrispASR) VAD(req *pb.VADRequest) (pb.VADResponse, error) {
 	}, nil
 }
 // isValidWord reports whether a TranscriptWord contains recognisable speech
 // content. The parakeet-specific word accessors can return stale initialisation
 // data (model name, binary blobs) when a segment has no real speech. A word is
 // considered valid only when:
 //   - the text is non-empty after trimming,
 //   - it contains no U+FFFD replacement characters (from binary data scrubbing),
 //   - both timestamps are non-negative,
 //   - the word has positive duration (end > start).
 func isValidWord(w *pb.TranscriptWord) bool {
 	txt := strings.TrimSpace(w.Text)
 	if txt == "" {
 		return false
 	}
 	if strings.ContainsRune(txt, '\uFFFD') {
 		return false
 	}
 	if w.Start < 0 || w.End < 0 || w.End <= w.Start {
 		return false
 	}
 	return true
 }
 func (w *CrispASR) AudioTranscription(ctx context.Context, opts *pb.TranscriptRequest) (pb.TranscriptResult, error) {
 	if err := ctx.Err(); err != nil {
 		return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
@@ -290,15 +324,54 @@ func (w *CrispASR) AudioTranscription(ctx context.Context, opts *pb.TranscriptRe
 		// IDs, so Tokens is left empty.
 		txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "\uFFFD")
 		// Populate word-level timestamps. Try session-based functions first
 		// (per-segment); fall back to parakeet-specific functions (global word
 		// list with no segment index — only populated on the first segment to
 		// avoid duplication).
 		words := []*pb.TranscriptWord{}
 		wordCount := CppGetWordCount(i)
 		if wordCount == 0 && i == 0 {
 			wordCount = CppGetParakeetWordCount()
 			for j := 0; j < wordCount; j++ {
 				w := &pb.TranscriptWord{
 					Start: CppGetParakeetWordT0(j) * (10000000),
 					End:   CppGetParakeetWordT1(j) * (10000000),
 					Text:  strings.ToValidUTF8(strings.Clone(CppGetParakeetWordText(j)), "\uFFFD"),
 				}
 				if isValidWord(w) {
 					words = append(words, w)
 				}
 			}
 		} else {
 			for j := 0; j < wordCount; j++ {
 				w := &pb.TranscriptWord{
 					Start: CppGetWordT0(i, j) * (10000000),
 					End:   CppGetWordT1(i, j) * (10000000),
 					Text:  strings.ToValidUTF8(strings.Clone(CppGetWordText(i, j)), "\uFFFD"),
 				}
 				if isValidWord(w) {
 					words = append(words, w)
 				}
 			}
 		}
 		// Skip empty segments with no recognisable content (e.g. trailing
 		// silence segments that parakeet emits with stale init data).
 		trimmed := strings.TrimSpace(txt)
 		if trimmed == "" && len(words) == 0 {
 			continue
 		}
 		segment := &pb.TranscriptSegment{
 			Id:    int32(i),
 			Text:  txt,
 			Start: s, End: t,
 			Words: words,
 		}
 		segments = append(segments, segment)
-		text += " " + strings.TrimSpace(txt)
+		text += " " + trimmed
 	}
 	return pb.TranscriptResult{
@@ -390,13 +463,20 @@ func (w *CrispASR) AudioTranscriptionStream(ctx context.Context, opts *pb.Transc
 		s := CppGetSegmentStart(i) * 10000000
 		t := CppGetSegmentEnd(i) * 10000000
 		txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "\uFFFD")
 		// Skip empty segments (e.g. trailing silence that parakeet emits
 		// with stale init data).
 		trimmed := strings.TrimSpace(txt)
 		if trimmed == "" && s == t {
 			continue
 		}
 		segments = append(segments, &pb.TranscriptSegment{
 			Id:    int32(i),
 			Text:  txt,
 			Start: s, End: t,
 		})
 		if trimmed == "" {
 			continue
 		}
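
The word-validity rules documented on `isValidWord` in this file can be exercised on their own. The sketch below is an assumption-laden stand-in: `word` is a hypothetical local type replacing `pb.TranscriptWord`, and `validWord` restates the same four checks (non-empty text, no U+FFFD, non-negative timestamps, positive duration) so they can be run without the generated protobuf package.

```go
package main

import (
	"fmt"
	"strings"
)

// word is a hypothetical stand-in for pb.TranscriptWord.
type word struct {
	Text       string
	Start, End int64
}

// validWord reports whether w looks like real speech: non-blank text, no
// U+FFFD replacement runes, and a non-negative start with End > Start
// (which also implies a non-negative end).
func validWord(w word) bool {
	txt := strings.TrimSpace(w.Text)
	if txt == "" || strings.ContainsRune(txt, '\uFFFD') {
		return false
	}
	return w.Start >= 0 && w.End > w.Start
}

func main() {
	// Invented samples covering each rejection rule.
	samples := []word{
		{"hello", 0, 120},
		{"   ", 0, 120},
		{"bad\uFFFD", 0, 120},
		{"zero", 50, 50},
		{"neg", -1, 10},
	}
	for _, s := range samples {
		fmt.Printf("%q -> %v\n", s.Text, validWord(s))
	}
}
```

Checking for U+FFFD only works because the text has already been passed through `strings.ToValidUTF8` with that rune as the replacement, so any invalid byte sequence from stale buffers shows up as at least one replacement character.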
--- a/backend/go/crispasr/main.go
+++ b/backend/go/crispasr/main.go
@@ -44,6 +44,14 @@ func main() {
 		{&CppTTSFree, "tts_free"},
 		{&CppTTSSetVoice, "tts_set_voice"},
 		{&CppTTSSetVoiceFile, "tts_set_voice_file"},
 		{&CppGetWordCount, "get_word_count"},
 		{&CppGetWordText, "get_word_text"},
 		{&CppGetWordT0, "get_word_t0"},
 		{&CppGetWordT1, "get_word_t1"},
 		{&CppGetParakeetWordCount, "get_parakeet_word_count"},
 		{&CppGetParakeetWordText, "get_parakeet_word_text"},
 		{&CppGetParakeetWordT0, "get_parakeet_word_t0"},
 		{&CppGetParakeetWordT1, "get_parakeet_word_t1"},
 	}
 	for _, lf := range libFuncs {
--- a/backend/go/depth-anything-cpp/.gitignore
+++ b/backend/go/depth-anything-cpp/.gitignore
@@ -0,0 +1,7 @@
 sources/
 build*/
 package/
 libdepthanythingcpp*.so
 depth-anything-cpp
 test-models/
 test-data/
--- a/backend/go/depth-anything-cpp/CMakeLists.txt
+++ b/backend/go/depth-anything-cpp/CMakeLists.txt
@@ -0,0 +1,28 @@
 cmake_minimum_required(VERSION 3.18)
 project(libdepthanythingcpp LANGUAGES C CXX)
 set(CMAKE_POSITION_INDEPENDENT_CODE ON)
 set(CMAKE_CXX_STANDARD 17)
 set(CMAKE_CXX_STANDARD_REQUIRED ON)
 # Static-link ggml into the depth-anything shared library so the resulting .so
 # has no runtime dependency on an external libggml — only on
 # libc/libstdc++/libgomp, which the LocalAI package step bundles into the
 # docker image.
 set(BUILD_SHARED_LIBS OFF CACHE BOOL "Build static libraries" FORCE)
 # depth-anything.cpp build switches: skip CLI/tests, but build libdepthanything
 # itself as a SHARED library (DA_SHARED) while ggml stays static
 # (BUILD_SHARED_LIBS OFF above). The da_capi_* C ABI is compiled into
 # src/da_capi.cpp and re-exported by that shared library, so no extra MODULE
 # wrapper is needed (unlike locate-anything.cpp).
 set(DA_BUILD_CLI OFF CACHE BOOL "Disable depth-anything CLI" FORCE)
 set(DA_BUILD_TESTS OFF CACHE BOOL "Disable depth-anything tests" FORCE)
 set(DA_SHARED ON CACHE BOOL "Build libdepthanything as a shared lib" FORCE)
 add_subdirectory(./sources/depth-anything.cpp)
 # Emit libdepthanything.so into the top-level build dir so the Makefile can
 # rename it to the per-variant libdepthanythingcpp-<variant>.so.
 set_target_properties(depthanything PROPERTIES
    LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
--- a/backend/go/depth-anything-cpp/Makefile
+++ b/backend/go/depth-anything-cpp/Makefile
@@ -0,0 +1,141 @@
 CMAKE_ARGS?=
 BUILD_TYPE?=
 NATIVE?=false
 GOCMD?=go
 GO_TAGS?=
 JOBS?=$(shell nproc --ignore=1)
 # depth-anything.cpp. Pin to a specific commit for a stable build; a squash
 # merge upstream can orphan a branch, so the native version is pinned by SHA.
 # This SHA adds the Depth Anything V2 engine + C-API routing (depth-only,
 # relative + metric) on top of the nested two-file metric C-API (abi_version 4,
 # da_capi_load_nested) required by the depth-anything-3-nested gallery model.
 # It is kept alive by the upstream tag da2-support (survives a squash-merge);
 # repoint to the master merge commit once mudler/depth-anything.cpp PR #1 lands.
 DEPTHANYTHING_REPO?=https://github.com/mudler/depth-anything.cpp.git
 DEPTHANYTHING_VERSION?=f4e17dea695dd12ae76bea98ba58030996b98118
 ifeq ($(NATIVE),false)
 	CMAKE_ARGS+=-DGGML_NATIVE=OFF
 endif
 # Forward LocalAI's BUILD_TYPE to the matching ggml backend switch. depth-anything.cpp
 # force-sets GGML_CUDA/GGML_VULKAN/GGML_METAL from its own DA_GGML_* options, so
 # those must be toggled via the DA_GGML_* names (a bare -DGGML_CUDA=ON would be
 # overridden); the remaining ggml switches pass straight through.
 ifeq ($(BUILD_TYPE),cublas)
 	CMAKE_ARGS+=-DGGML_CUDA=ON -DDA_GGML_CUDA=ON
 else ifeq ($(BUILD_TYPE),openblas)
 	CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
 else ifeq ($(BUILD_TYPE),clblas)
 	CMAKE_ARGS+=-DGGML_CLBLAST=ON
 else ifeq ($(BUILD_TYPE),hipblas)
 	ROCM_HOME ?= /opt/rocm
 	ROCM_PATH ?= /opt/rocm
 	export CXX=$(ROCM_HOME)/llvm/bin/clang++
 	export CC=$(ROCM_HOME)/llvm/bin/clang
 	AMDGPU_TARGETS?=gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1200,gfx1201
 	CMAKE_ARGS+=-DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
 else ifeq ($(BUILD_TYPE),vulkan)
 	CMAKE_ARGS+=-DGGML_VULKAN=ON -DDA_GGML_VULKAN=ON
 else ifeq ($(OS),Darwin)
 	ifneq ($(BUILD_TYPE),metal)
 		CMAKE_ARGS+=-DGGML_METAL=OFF
 	else
 		CMAKE_ARGS+=-DGGML_METAL=ON
 		CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
 		CMAKE_ARGS+=-DDA_GGML_METAL=ON
 	endif
 endif
 ifeq ($(BUILD_TYPE),sycl_f16)
 	CMAKE_ARGS+=-DGGML_SYCL=ON \
 		-DCMAKE_C_COMPILER=icx \
 		-DCMAKE_CXX_COMPILER=icpx \
 		-DGGML_SYCL_F16=ON
 endif
 ifeq ($(BUILD_TYPE),sycl_f32)
 	CMAKE_ARGS+=-DGGML_SYCL=ON \
 		-DCMAKE_C_COMPILER=icx \
 		-DCMAKE_CXX_COMPILER=icpx
 endif
 sources/depth-anything.cpp:
 	mkdir -p sources && \
 	git clone --recursive $(DEPTHANYTHING_REPO) sources/depth-anything.cpp && \
 	cd sources/depth-anything.cpp && \
 	git checkout $(DEPTHANYTHING_VERSION) && \
 	git submodule update --init --recursive --depth 1 --single-branch
 # Detect OS
 UNAME_S := $(shell uname -s)
 # Only build CPU variants on Linux
 ifeq ($(UNAME_S),Linux)
 	VARIANT_TARGETS = libdepthanythingcpp-avx.so libdepthanythingcpp-avx2.so libdepthanythingcpp-avx512.so libdepthanythingcpp-fallback.so
 else
 	# On non-Linux (e.g., Darwin), build only fallback variant
 	VARIANT_TARGETS = libdepthanythingcpp-fallback.so
 endif
 depth-anything-cpp: main.go godepthanythingcpp.go $(VARIANT_TARGETS)
 	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o depth-anything-cpp ./
 package: depth-anything-cpp
 	bash package.sh
 build: package
 clean: purge
 	rm -rf libdepthanythingcpp*.so depth-anything-cpp package sources
 purge:
 	rm -rf build*
 # Build all variants (Linux only)
 ifeq ($(UNAME_S),Linux)
 libdepthanythingcpp-avx.so: sources/depth-anything.cpp
 	rm -rfv build-$@
 	$(info ${GREEN}I depth-anything-cpp build info:avx${RESET})
 	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libdepthanythingcpp-custom
 	rm -rfv build-$@
 libdepthanythingcpp-avx2.so: sources/depth-anything.cpp
 	rm -rfv build-$@
 	$(info ${GREEN}I depth-anything-cpp build info:avx2${RESET})
 	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libdepthanythingcpp-custom
 	rm -rfv build-$@
 libdepthanythingcpp-avx512.so: sources/depth-anything.cpp
 	rm -rfv build-$@
 	$(info ${GREEN}I depth-anything-cpp build info:avx512${RESET})
 	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libdepthanythingcpp-custom
 	rm -rfv build-$@
 endif
 # Build fallback variant (all platforms)
 libdepthanythingcpp-fallback.so: sources/depth-anything.cpp
 	rm -rfv build-$@
 	$(info ${GREEN}I depth-anything-cpp build info:fallback${RESET})
 	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libdepthanythingcpp-custom
 	rm -rfv build-$@
 libdepthanythingcpp-custom: CMakeLists.txt
 	mkdir -p build-$(SO_TARGET) && \
 	cd build-$(SO_TARGET) && \
 	cmake .. $(CMAKE_ARGS) && \
 	cmake --build . --config Release -j$(JOBS) && \
 	cd .. && \
 	mv build-$(SO_TARGET)/libdepthanything.so ./$(SO_TARGET)
 all: depth-anything-cpp package
 # `test` is invoked by the top-level Makefile's `test-extra` target. It builds
 # the backend binary + the fallback shared library (needed for dlopen at
 # runtime), then runs test.sh which downloads a small GGUF + a test image and
 # exercises the gRPC Load/Predict wire path via the Go smoke test in
 # main_test.go.
 test: depth-anything-cpp libdepthanythingcpp-fallback.so
 	bash test.sh
--- a/backend/go/depth-anything-cpp/godepthanythingcpp.go
+++ b/backend/go/depth-anything-cpp/godepthanythingcpp.go
@@ -0,0 +1,556 @@
 package main
 // godepthanythingcpp.go - gRPC handlers (Load, Predict, GenerateImage) for the
 // depth-anything-cpp backend, wrapping the Depth Anything 3 ggml C-API
 // (libdepthanythingcpp-<variant>.so) via purego.
 //
 // Embeds base.SingleThread to default the unimplemented RPCs to "not supported"
 // and to serialize calls — the C side shares a ggml graph allocator and is NOT
 // reentrant, so all inference must run one-at-a-time.
 //
 // Depth has no native OpenAI endpoint, so the model is exposed two ways:
 //
 //   - GenerateImage(src, dst): run depth on the src image and write a
 //     min-max-normalised grayscale depth PNG to dst.
 //   - Predict(images[0]): run depth+pose and return a JSON blob with the depth
 //     dimensions, depth stats and the camera extrinsics (3x4) / intrinsics (3x3).
 import (
 	"encoding/base64"
 	"encoding/json"
 	"fmt"
 	"image"
 	"image/png"
 	"math"
 	"os"
 	"path/filepath"
 	"strings"
 	"unsafe"
 	"github.com/mudler/LocalAI/pkg/grpc/base"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 )
 // C-API function pointers, registered in main.go via purego. The da_capi_*
 // symbols live inside libdepthanything (src/da_capi.cpp) and are re-exported by
 // the DA_SHARED build.
 var (
 	// da_capi_load(const char* gguf_path, int n_threads) -> da_ctx* (0 = fail)
 	CapiLoad func(gguf string, nThreads int32) uintptr
 	// da_capi_load_nested(const char* anyview_gguf, const char* metric_gguf,
 	//   int n_threads) -> da_ctx* (0 = fail). The returned ctx serves the nested
 	//   metric model: depth/pose calls produce final metric-scale depth + scaled pose.
 	CapiLoadNested func(anyview string, metric string, nThreads int32) uintptr
 	// da_capi_free(da_ctx* ctx) — safe on a 0 handle.
 	CapiFree func(handle uintptr)
 	// da_capi_last_error(da_ctx* ctx) -> const char* (owned by ctx, "" if none).
 	// purego marshals the returned C string into a Go string (a copy), so we
 	// never free it.
 	CapiLastError func(handle uintptr) string
 	// da_capi_depth_path(ctx, image_path, out_h*, out_w*) -> float* depth map
 	// (row-major H*W); nil on error. Caller frees via da_capi_free_floats.
 	CapiDepthPath func(handle uintptr, imagePath string, outH *int32, outW *int32) *float32
 	// da_capi_free_floats(float* p)
 	CapiFreeFloats func(p *float32)
 	// da_capi_pose_path(ctx, image_path, out_ext[12], out_intr[9]) -> 0 ok, -1 err
 	CapiPosePath func(handle uintptr, imagePath string, outExt *float32, outIntr *float32) int32
 	// da_capi_depth_dense(ctx, image_path, out_h*, out_w*, out_depth**, out_conf**,
 	//   out_sky**, out_ext[12], out_intr[9], out_is_metric*) -> 0 ok, -1 err.
 	// Each non-NULL out_depth/out_conf/out_sky receives a malloc'd float[H*W] (free
 	// via da_capi_free_floats); buffers the model doesn't produce are set NULL.
 	CapiDepthDense func(handle uintptr, imagePath string,
 		outH, outW *int32,
 		outDepth, outConf, outSky **float32,
 		outExt, outIntr *float32,
 		outIsMetric *int32) int32
 	// da_capi_points(ctx, image_path, conf_thresh, out_n*, out_xyz**, out_rgb**) ->
 	//   0 ok, -1 err. *out_xyz = malloc'd float[3*N] (free via da_capi_free_floats),
 	//   *out_rgb = malloc'd uint8[3*N] (free via da_capi_free_bytes).
 	CapiPoints func(handle uintptr, imagePath string, confThresh float32,
 		outN *int32, outXyz **float32, outRgb **byte) int32
 	// da_capi_free_bytes(unsigned char* p)
 	CapiFreeBytes func(p *byte)
 	// da_capi_export_glb(ctx, image_path, out_glb) -> 0 ok, -1 err
 	CapiExportGlb func(handle uintptr, imagePath string, outGlb string) int32
 	// da_capi_export_colmap(ctx, image_path, out_dir, binary) -> 0 ok, -1 err
 	CapiExportColmap func(handle uintptr, imagePath string, outDir string, binary int32) int32
 )
 type DepthAnythingCpp struct {
 	base.SingleThread
 	handle uintptr
 }
 // Load loads the GGUF model at opts.ModelFile (joined with opts.ModelPath if
 // relative) and stores the da_ctx handle for later inference calls.
 func (r *DepthAnythingCpp) Load(opts *pb.ModelOptions) error {
 	modelFile := opts.ModelFile
 	if modelFile == "" {
 		modelFile = opts.Model
 	}
 	if modelFile == "" {
 		return fmt.Errorf("depth-anything-cpp: ModelFile is empty")
 	}
 	resolve := func(name string) string {
 		if filepath.IsAbs(name) {
 			return name
 		}
 		return filepath.Join(opts.ModelPath, name)
 	}
 	modelPath := resolve(modelFile)
 	if _, err := os.Stat(modelPath); err != nil {
 		return fmt.Errorf("depth-anything-cpp: model file not found: %s: %w", modelPath, err)
 	}
 	// Nested metric models are a two-file pair: the main model is the anyview
 	// (GIANT) branch and the metric (ViT-L + DPT/sky) branch is named via a
 	// "metric_model:<filename>" entry in opts.Options. When present we load both
 	// branches so the engine runs the nested metric alignment.
 	metricFile := optionValue(opts.Options, "metric_model")
 	threads := opts.Threads
 	if threads <= 0 {
 		threads = 4
 	}
 	// Release previous model if any (re-Load).
 	if r.handle != 0 {
 		CapiFree(r.handle)
 		r.handle = 0
 	}
 	var h uintptr
 	if metricFile != "" {
 		metricPath := resolve(metricFile)
 		if _, err := os.Stat(metricPath); err != nil {
 			return fmt.Errorf("depth-anything-cpp: metric_model file not found: %s: %w", metricPath, err)
 		}
 		h = CapiLoadNested(modelPath, metricPath, threads)
 		if h == 0 {
 			if msg := CapiLastError(0); msg != "" {
 				return fmt.Errorf("depth-anything-cpp: da_capi_load_nested failed for %s + %s: %s", modelPath, metricPath, msg)
 			}
 			return fmt.Errorf("depth-anything-cpp: da_capi_load_nested failed for %s + %s", modelPath, metricPath)
 		}
 	} else {
 		h = CapiLoad(modelPath, threads)
 		if h == 0 {
 			// da_capi_last_error needs a ctx; on a failed load we have none (it
 			// returns "" for a null ctx), so the text is best-effort.
 			if msg := CapiLastError(0); msg != "" {
 				return fmt.Errorf("depth-anything-cpp: da_capi_load failed for %s: %s", modelPath, msg)
 			}
 			return fmt.Errorf("depth-anything-cpp: da_capi_load failed for %s", modelPath)
 		}
 	}
 	r.handle = h
 	return nil
 }
 // optionValue returns the value of the first "key:value" entry in opts whose key
 // matches (case-sensitive), or "" if absent. Mirrors how other LocalAI backends
 // read ModelOptions.Options.
 func optionValue(opts []string, key string) string {
 	prefix := key + ":"
 	for _, o := range opts {
 		if strings.HasPrefix(o, prefix) {
 			return strings.TrimSpace(o[len(prefix):])
 		}
 	}
 	return ""
 }
 // depthResult is the JSON payload returned by Predict.
 type depthResult struct {
 	DepthW     int         `json:"depth_w"`
 	DepthH     int         `json:"depth_h"`
 	DepthMin   float32     `json:"depth_min"`
 	DepthMax   float32     `json:"depth_max"`
 	Extrinsics [12]float32 `json:"extrinsics"` // 3x4 row-major
 	Intrinsics [9]float32  `json:"intrinsics"` // 3x3 row-major
 }
 // Predict runs depth+pose on the first supplied image and returns depth
 // statistics + camera pose as a JSON string. LocalAI wraps the string into the
 // Reply.Message of the gRPC response. The image in Images[0] may be a
 // filesystem path or a base64-encoded payload.
 func (r *DepthAnythingCpp) Predict(opts *pb.PredictOptions) (string, error) {
 	imgs := opts.GetImages()
 	if len(imgs) == 0 {
 		return "", fmt.Errorf("depth-anything-cpp: Predict requires an image in Images[]")
 	}
 	imgPath, cleanup, err := materializeImage(imgs[0])
 	if err != nil {
 		return "", fmt.Errorf("depth-anything-cpp: %w", err)
 	}
 	defer cleanup()
 	depth, h, w, ext, intr, err := r.runDepthPose(imgPath)
 	if err != nil {
 		return "", err
 	}
 	dmin, dmax := minMax(depth)
 	payload, err := json.Marshal(depthResult{
 		DepthW: w, DepthH: h,
 		DepthMin: dmin, DepthMax: dmax,
 		Extrinsics: ext, Intrinsics: intr,
 	})
 	if err != nil {
 		return "", fmt.Errorf("depth-anything-cpp: marshal: %w", err)
 	}
 	return string(payload), nil
 }
 // GenerateImage runs depth on req.Src and writes a normalised grayscale depth
 // PNG to req.Dst.
 func (r *DepthAnythingCpp) GenerateImage(req *pb.GenerateImageRequest) error {
 	if req.GetSrc() == "" {
 		return fmt.Errorf("depth-anything-cpp: GenerateImage requires src")
 	}
 	if req.GetDst() == "" {
 		return fmt.Errorf("depth-anything-cpp: GenerateImage requires dst")
 	}
 	imgPath, cleanup, err := materializeImage(req.GetSrc())
 	if err != nil {
 		return fmt.Errorf("depth-anything-cpp: %w", err)
 	}
 	defer cleanup()
 	depth, h, w, _, _, err := r.runDepthPose(imgPath)
 	if err != nil {
 		return err
 	}
 	return writeDepthPNG(req.GetDst(), depth, h, w)
 }
 // Depth is the typed Depth RPC. It runs the Depth Anything 3 pipeline on the
 // request's src image and fills a DepthResponse honoring the include_* flags and
 // exports: per-pixel metric depth + confidence (DualDPT) or depth + sky (mono),
 // camera extrinsics/intrinsics, an optional back-projected 3D point cloud and
 // glb/COLMAP exports. The src may be a filesystem path or a base64 payload.
 func (r *DepthAnythingCpp) Depth(in *pb.DepthRequest) (pb.DepthResponse, error) {
 	// Accumulate into locals and return a single composite literal at the end:
 	// returning a named pb.DepthResponse value would copy its embedded mutex
 	// (go vet copylocks).
 	if r.handle == 0 {
 		return pb.DepthResponse{}, fmt.Errorf("depth-anything-cpp: model not loaded")
 	}
 	if in.GetSrc() == "" {
 		return pb.DepthResponse{}, fmt.Errorf("depth-anything-cpp: Depth requires src")
 	}
 	imgPath, cleanup, err := materializeImage(in.GetSrc())
 	if err != nil {
 		return pb.DepthResponse{}, fmt.Errorf("depth-anything-cpp: %w", err)
 	}
 	defer cleanup()
 	// Dense per-pixel output + pose. Pass buffer pointers only for the
 	// requested maps so the native side can skip unrequested work; ext/intr
 	// must always point at 12/9 floats per the C ABI.
 	var (
 		h, w, isMetric      int32
 		depthPtr, confPtr   *float32
 		skyPtr              *float32
 		ext                 [12]float32
 		intr                [9]float32
 		pDepth, pConf, pSky **float32
 	)
 	if in.GetIncludeDepth() {
 		pDepth = &depthPtr
 	}
 	if in.GetIncludeConfidence() {
 		pConf = &confPtr
 	}
 	if in.GetIncludeSky() {
 		pSky = &skyPtr
 	}
 	rc := CapiDepthDense(r.handle, imgPath, &h, &w, pDepth, pConf, pSky, &ext[0], &intr[0], &isMetric)
 	if rc != 0 {
 		return pb.DepthResponse{}, fmt.Errorf("depth-anything-cpp: da_capi_depth_dense failed (rc=%d): %s", rc, r.lastError())
 	}
 	n := int(h) * int(w)
 	var (
 		depth, conf, sky      []float32
 		extrinsics, intrinsic []float32
 		numPoints             int32
 		points                []float32
 		pointColors           []byte
 		exportPaths           []string
 	)
 	if depthPtr != nil {
 		depth = copyFloats(depthPtr, n)
 		CapiFreeFloats(depthPtr)
 	}
 	if confPtr != nil {
 		conf = copyFloats(confPtr, n)
 		CapiFreeFloats(confPtr)
 	}
 	if skyPtr != nil {
 		sky = copyFloats(skyPtr, n)
 		CapiFreeFloats(skyPtr)
 	}
 	if in.GetIncludePose() {
 		extrinsics = append([]float32(nil), ext[:]...)
 		intrinsic = append([]float32(nil), intr[:]...)
 	}
 	// 3D point cloud (DualDPT / pose-capable models only).
 	if in.GetIncludePoints() {
 		var (
 			np     int32
 			xyzPtr *float32
 			rgbPtr *byte
 		)
 		if rc := CapiPoints(r.handle, imgPath, in.GetPointsConfThresh(), &np, &xyzPtr, &rgbPtr); rc != 0 {
 			return pb.DepthResponse{}, fmt.Errorf("depth-anything-cpp: da_capi_points failed (rc=%d): %s", rc, r.lastError())
 		}
 		numPoints = np
 		if xyzPtr != nil {
 			points = copyFloats(xyzPtr, int(np)*3)
 			CapiFreeFloats(xyzPtr)
 		}
 		if rgbPtr != nil {
 			pointColors = copyBytes(rgbPtr, int(np)*3)
 			CapiFreeBytes(rgbPtr)
 		}
 	}
 	// Exports (glb / colmap). They are written under in.Dst (a directory); a
 	// temp dir is used when Dst is empty.
 	if len(in.GetExports()) > 0 {
 		exportPaths, err = r.runExports(imgPath, in.GetDst(), in.GetExports())
 		if err != nil {
 			return pb.DepthResponse{}, err
 		}
 	}
 	return pb.DepthResponse{
 		Width:       w,
 		Height:      h,
 		Depth:       depth,
 		Confidence:  conf,
 		Sky:         sky,
 		Extrinsics:  extrinsics,
 		Intrinsics:  intrinsic,
 		NumPoints:   numPoints,
 		Points:      points,
 		PointColors: pointColors,
 		ExportPaths: exportPaths,
 		IsMetric:    isMetric != 0,
 	}, nil
 }
 // runExports writes the requested exports for imgPath into dstDir and returns
 // the written paths. Supported exports: "glb", "colmap".
 func (r *DepthAnythingCpp) runExports(imgPath, dstDir string, exports []string) ([]string, error) {
 	if dstDir == "" {
 		tmp, err := os.MkdirTemp("", "depth-anything-export-*")
 		if err != nil {
 			return nil, fmt.Errorf("depth-anything-cpp: mkdir export dir: %w", err)
 		}
 		dstDir = tmp
 	} else if err := os.MkdirAll(dstDir, 0o750); err != nil {
 		return nil, fmt.Errorf("depth-anything-cpp: mkdir %s: %w", dstDir, err)
 	}
 	var paths []string
 	for _, exp := range exports {
 		switch exp {
 		case "glb":
 			out := filepath.Join(dstDir, "pointcloud.glb")
 			if rc := CapiExportGlb(r.handle, imgPath, out); rc != 0 {
 				return nil, fmt.Errorf("depth-anything-cpp: da_capi_export_glb failed (rc=%d): %s", rc, r.lastError())
 			}
 			paths = append(paths, out)
 		case "colmap":
 			out := filepath.Join(dstDir, "colmap")
 			if err := os.MkdirAll(out, 0o750); err != nil {
 				return nil, fmt.Errorf("depth-anything-cpp: mkdir %s: %w", out, err)
 			}
 			if rc := CapiExportColmap(r.handle, imgPath, out, 1); rc != 0 {
 				return nil, fmt.Errorf("depth-anything-cpp: da_capi_export_colmap failed (rc=%d): %s", rc, r.lastError())
 			}
 			paths = append(paths, out)
 		default:
 			return nil, fmt.Errorf("depth-anything-cpp: unknown export %q (want glb|colmap)", exp)
 		}
 	}
 	return paths, nil
 }
 // copyFloats copies n float32 values from a C heap pointer into a fresh Go
 // slice so the C buffer can be freed afterwards.
 func copyFloats(p *float32, n int) []float32 {
 	if p == nil || n <= 0 {
 		return nil
 	}
 	src := unsafe.Slice(p, n)
 	out := make([]float32, n)
 	copy(out, src)
 	return out
 }
 // copyBytes copies n bytes from a C heap pointer into a fresh Go slice.
 func copyBytes(p *byte, n int) []byte {
 	if p == nil || n <= 0 {
 		return nil
 	}
 	src := unsafe.Slice(p, n)
 	out := make([]byte, n)
 	copy(out, src)
 	return out
 }
 // runDepthPose runs depth estimation then pose recovery on an image file via
 // two C-API calls (depth then pose). It returns the row-major depth map
 // (length h*w), its dimensions, the 3x4 extrinsics (12 floats) and 3x3
 // intrinsics (9 floats).
 // For a nested metric model both calls run the full two-branch pipeline, so this
 // path infers twice; the typed Depth RPC (single da_capi_depth_dense call) is the
 // efficient path for nested models.
 func (r *DepthAnythingCpp) runDepthPose(imagePath string) (depth []float32, h, w int, ext [12]float32, intr [9]float32, err error) {
 	if r.handle == 0 {
 		err = fmt.Errorf("depth-anything-cpp: model not loaded")
 		return
 	}
 	var ch, cw int32
 	ptr := CapiDepthPath(r.handle, imagePath, &ch, &cw)
 	if ptr == nil {
 		err = fmt.Errorf("depth-anything-cpp: da_capi_depth_path failed: %s", r.lastError())
 		return
 	}
 	h, w = int(ch), int(cw)
 	n := h * w
 	if n > 0 {
 		src := unsafe.Slice(ptr, n)
 		depth = make([]float32, n)
 		copy(depth, src)
 	}
 	CapiFreeFloats(ptr)
 	if rc := CapiPosePath(r.handle, imagePath, &ext[0], &intr[0]); rc != 0 {
 		err = fmt.Errorf("depth-anything-cpp: da_capi_pose_path failed (rc=%d): %s", rc, r.lastError())
 		return
 	}
 	return
 }
 // lastError returns the context's last error string, or "" if none.
 func (r *DepthAnythingCpp) lastError() string {
 	if CapiLastError == nil || r.handle == 0 {
 		return ""
 	}
 	return CapiLastError(r.handle)
 }
 // materializeImage returns a filesystem path for an image argument that may be
 // either an existing path or a base64-encoded payload. When the input is
 // base64 it is decoded into a temp file; cleanup removes it (no-op for a path).
 func materializeImage(arg string) (path string, cleanup func(), err error) {
 	cleanup = func() {}
 	if _, statErr := os.Stat(arg); statErr == nil {
 		return arg, cleanup, nil
 	}
 	// Strip an optional data URL prefix (data:image/...;base64,<payload>).
 	b64 := arg
 	if i := indexComma(b64); i >= 0 && hasDataPrefix(b64) {
 		b64 = b64[i+1:]
 	}
 	data, decErr := base64.StdEncoding.DecodeString(b64)
 	if decErr != nil {
 		return "", cleanup, fmt.Errorf("image is neither an existing path nor valid base64: %v", decErr)
 	}
 	f, tErr := os.CreateTemp("", "depth-anything-*.img")
 	if tErr != nil {
 		return "", cleanup, tErr
 	}
 	if _, wErr := f.Write(data); wErr != nil {
 		_ = f.Close()
 		_ = os.Remove(f.Name())
 		return "", cleanup, wErr
 	}
 	_ = f.Close()
 	name := f.Name()
 	return name, func() { _ = os.Remove(name) }, nil
 }
 func hasDataPrefix(s string) bool {
 	return len(s) >= 5 && s[:5] == "data:"
 }
 func indexComma(s string) int {
 	for i := 0; i < len(s); i++ {
 		if s[i] == ',' {
 			return i
 		}
 	}
 	return -1
 }
 // writeDepthPNG min-max normalises a depth map and writes it as an 8-bit
 // grayscale PNG. The largest value maps to white (255) and the smallest to
 // black (0), so for inverse-depth-like outputs near = bright and far = dark.
 func writeDepthPNG(dst string, depth []float32, h, w int) error {
 	if h <= 0 || w <= 0 || len(depth) < h*w {
 		return fmt.Errorf("depth-anything-cpp: writeDepthPNG: bad dims h=%d w=%d len=%d", h, w, len(depth))
 	}
 	dmin, dmax := minMax(depth)
 	span := dmax - dmin
 	if span <= 0 || math.IsNaN(float64(span)) {
 		span = 1
 	}
 	img := image.NewGray(image.Rect(0, 0, w, h))
 	for y := 0; y < h; y++ {
 		for x := 0; x < w; x++ {
 			v := depth[y*w+x]
 			n := (v - dmin) / span // 0..1
 			if math.IsNaN(float64(n)) {
 				n = 0
 			}
 			if n < 0 {
 				n = 0
 			} else if n > 1 {
 				n = 1
 			}
 			img.Pix[y*img.Stride+x] = uint8(n * 255)
 		}
 	}
 	// dst is the gRPC-provided output path chosen by the LocalAI core (the
 	// intended write destination for the rendered depth map), not
 	// attacker-controlled input, so the variable path is expected here.
 	f, err := os.Create(dst) // #nosec G304
 	if err != nil {
 		return err
 	}
 	defer func() { _ = f.Close() }()
 	return png.Encode(f, img)
 }
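 // minMax returns the smallest and largest finite values in v, skipping NaN and
 // Inf. It returns 0, 0 when v is empty or holds no finite value.
 func minMax(v []float32) (mn, mx float32) {
 	// Seed from the first finite value rather than v[0]: a NaN seed would
 	// never be replaced, because every comparison against NaN is false.
 	seeded := false
 	for _, x := range v {
 		if math.IsNaN(float64(x)) || math.IsInf(float64(x), 0) {
 			continue
 		}
 		if !seeded {
 			mn, mx = x, x
 			seeded = true
 			continue
 		}
 		if x < mn {
 			mn = x
 		}
 		if x > mx {
 			mx = x
 		}
 	}
 	return mn, mx
 }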
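The normalisation that `writeDepthPNG` and `minMax` perform can be illustrated in isolation. The following is a minimal sketch, not the backend's code: `finiteRange` and `toGray8` are hypothetical names introduced here, and the sketch assumes non-finite samples should be ignored when computing the range and clamped when mapped to pixels.

```go
package main

import (
	"fmt"
	"math"
)

// finiteRange (hypothetical helper) returns the min and max over the finite
// values of v; ok is false when v contains no finite value.
func finiteRange(v []float32) (mn, mx float32, ok bool) {
	for _, x := range v {
		f := float64(x)
		if math.IsNaN(f) || math.IsInf(f, 0) {
			continue
		}
		if !ok {
			mn, mx, ok = x, x, true
			continue
		}
		if x < mn {
			mn = x
		}
		if x > mx {
			mx = x
		}
	}
	return mn, mx, ok
}

// toGray8 (hypothetical helper) maps each sample to 0..255 with the smallest
// finite value at 0 and the largest at 255. NaN maps to 0 and out-of-range
// values are clamped. A constant or all-non-finite input yields all zeros.
func toGray8(v []float32) []uint8 {
	out := make([]uint8, len(v))
	mn, mx, ok := finiteRange(v)
	if !ok || mx <= mn {
		return out
	}
	span := mx - mn
	for i, x := range v {
		n := (x - mn) / span
		switch {
		case math.IsNaN(float64(n)) || n < 0:
			n = 0
		case n > 1:
			n = 1
		}
		out[i] = uint8(n * 255)
	}
	return out
}

func main() {
	samples := []float32{float32(math.NaN()), 1, 3, 2}
	fmt.Println(toGray8(samples))
}
```

Keeping the range computation separate from the pixel mapping makes the "no finite value" case explicit instead of relying on a fallback span.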
--- a/backend/go/depth-anything-cpp/main.go
+++ b/backend/go/depth-anything-cpp/main.go
@@ -0,0 +1,62 @@
 package main
 // main.go - entry point for the depth-anything-cpp gRPC backend.
 //
 // Dlopens libdepthanythingcpp-<variant>.so via purego at the path in
 // DEPTHANYTHING_LIBRARY (set by run.sh based on /proc/cpuinfo), registers the
 // da_capi_* C ABI symbols, then starts the gRPC server.
 import (
 	"flag"
 	"os"
 	"github.com/ebitengine/purego"
 	grpc "github.com/mudler/LocalAI/pkg/grpc"
 )
 var (
 	addr = flag.String("addr", "localhost:50051", "the address to listen on")
 )
 type LibFuncs struct {
 	FuncPtr any
 	Name    string
 }
 func main() {
 	// Get library name from environment variable, default to fallback
 	libName := os.Getenv("DEPTHANYTHING_LIBRARY")
 	if libName == "" {
 		libName = "./libdepthanythingcpp-fallback.so"
 	}
 	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
 	if err != nil {
 		panic(err)
 	}
 	libFuncs := []LibFuncs{
 		{&CapiLoad, "da_capi_load"},
 		{&CapiLoadNested, "da_capi_load_nested"},
 		{&CapiFree, "da_capi_free"},
 		{&CapiLastError, "da_capi_last_error"},
 		{&CapiDepthPath, "da_capi_depth_path"},
 		{&CapiFreeFloats, "da_capi_free_floats"},
 		{&CapiPosePath, "da_capi_pose_path"},
 		{&CapiDepthDense, "da_capi_depth_dense"},
 		{&CapiPoints, "da_capi_points"},
 		{&CapiFreeBytes, "da_capi_free_bytes"},
 		{&CapiExportGlb, "da_capi_export_glb"},
 		{&CapiExportColmap, "da_capi_export_colmap"},
 	}
 	for _, lf := range libFuncs {
 		purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
 	}
 	flag.Parse()
 	if err := grpc.StartServer(*addr, &DepthAnythingCpp{}); err != nil {
 		panic(err)
 	}
 }
--- a/backend/go/depth-anything-cpp/main_test.go
+++ b/backend/go/depth-anything-cpp/main_test.go
@@ -0,0 +1,167 @@
 package main
 // main_test.go - end-to-end smoke test for the depth-anything-cpp gRPC backend.
 //
 // Spawns the compiled depth-anything-cpp binary on a free local port, dials it
 // via gRPC, and exercises LoadModel + Predict against the test fixtures
 // downloaded by test.sh: the small (vits) f32 GGUF of Depth Anything 3 and a
 // real photo. Asserts that Predict returns a JSON payload with a positive
 // depth-map width/height.
 //
 // The spec Skip()s cleanly if its fixtures (the model, the test image, the
 // built binary, or the fallback .so) are missing, so the test target stays
 // usable on a fresh checkout / on CI runners where the model hasn't been
 // downloaded.
 import (
 	"context"
 	"encoding/base64"
 	"encoding/json"
 	"fmt"
 	"net"
 	"os"
 	"os/exec"
 	"path/filepath"
 	"testing"
 	"time"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
 	"google.golang.org/grpc"
 	"google.golang.org/grpc/credentials/insecure"
 )
 func TestDepth(t *testing.T) {
 	RegisterFailHandler(Fail)
 	RunSpecs(t, "depth-anything-cpp backend smoke suite")
 }
 // freePort grabs an ephemeral TCP port and immediately releases it so the
 // spawned backend can bind to it. There is a tiny TOCTOU window here but in
 // practice it's adequate for a smoke test on a quiet runner.
 func freePort() int {
 	l, err := net.Listen("tcp", "127.0.0.1:0")
 	Expect(err).ToNot(HaveOccurred(), "freePort listen")
 	port := l.Addr().(*net.TCPAddr).Port
 	Expect(l.Close()).To(Succeed())
 	return port
 }
 // startBackend spawns the depth-anything-cpp binary on the given port and waits
 // until it accepts TCP connections (up to 10s). It mirrors how main.go resolves
 // the purego library: the DEPTHANYTHING_LIBRARY env var points the dlopen at the
 // freshly built fallback .so. The returned cleanup func kills the process.
 func startBackend(port int) func() {
 	binary, err := filepath.Abs("./depth-anything-cpp")
 	Expect(err).ToNot(HaveOccurred())
 	if _, err := os.Stat(binary); err != nil {
 		Skip(fmt.Sprintf("backend binary not built: %s (run `make depth-anything-cpp` first)", binary))
 	}
 	libPath, err := filepath.Abs("./libdepthanythingcpp-fallback.so")
 	Expect(err).ToNot(HaveOccurred())
 	if _, err := os.Stat(libPath); err != nil {
 		Skip(fmt.Sprintf("fallback library not built: %s (run `make libdepthanythingcpp-fallback.so` first)", libPath))
 	}
 	addr := fmt.Sprintf("127.0.0.1:%d", port)
 	cmd := exec.Command(binary, "--addr", addr)
 	cmd.Env = append(os.Environ(), "DEPTHANYTHING_LIBRARY="+libPath)
 	cmd.Stdout = os.Stderr
 	cmd.Stderr = os.Stderr
 	Expect(cmd.Start()).To(Succeed())
 	cleanup := func() {
 		if cmd.Process != nil {
 			_ = cmd.Process.Kill()
 			_, _ = cmd.Process.Wait()
 		}
 	}
 	deadline := time.Now().Add(10 * time.Second)
 	for time.Now().Before(deadline) {
 		c, err := net.DialTimeout("tcp", addr, 200*time.Millisecond)
 		if err == nil {
 			_ = c.Close()
 			return cleanup
 		}
 		time.Sleep(200 * time.Millisecond)
 	}
 	cleanup()
 	Fail(fmt.Sprintf("backend did not become ready on %s within 10s", addr))
 	return func() {}
 }
 // loadTestImage reads the test image downloaded by test.sh and returns its
 // base64-encoded content (one of the wire formats accepted by Predict).
 func loadTestImage() string {
 	imgPath, err := filepath.Abs("test-data/test.jpg")
 	Expect(err).ToNot(HaveOccurred())
 	imgBytes, err := os.ReadFile(imgPath)
 	if err != nil {
 		Skip(fmt.Sprintf("test image not present: %s (run test.sh first)", imgPath))
 	}
 	return base64.StdEncoding.EncodeToString(imgBytes)
 }
 // dialBackend opens a gRPC client connection to the spawned backend.
 func dialBackend(port int) (pb.BackendClient, func()) {
 	addr := fmt.Sprintf("127.0.0.1:%d", port)
 	conn, err := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
 	Expect(err).ToNot(HaveOccurred())
 	return pb.NewBackendClient(conn), func() { _ = conn.Close() }
 }
 // modelPathOrSkip resolves the model file under ./test-models/ and Skip()s the
 // current spec if it's missing (not present on a fresh checkout / on CI runners
 // without the download).
 func modelPathOrSkip(name string) string {
 	modelDir, err := filepath.Abs("test-models")
 	Expect(err).ToNot(HaveOccurred())
 	modelPath := filepath.Join(modelDir, name)
 	if _, err := os.Stat(modelPath); err != nil {
 		Skip(fmt.Sprintf("model not present: %s (run test.sh first)", modelPath))
 	}
 	return modelPath
 }
 var _ = Describe("depth-anything-cpp backend", func() {
 	It("runs depth+pose against a known-good image", func() {
 		modelPath := modelPathOrSkip("depth-anything-small-f32.gguf")
 		imgB64 := loadTestImage()
 		port := freePort()
 		cleanup := startBackend(port)
 		defer cleanup()
 		client, closeConn := dialBackend(port)
 		defer closeConn()
 		ctx, cancel := context.WithTimeout(context.Background(), 20*time.Minute)
 		defer cancel()
 		loadResp, err := client.LoadModel(ctx, &pb.ModelOptions{
 			Model:     "depth-anything-small-f32.gguf",
 			ModelFile: modelPath,
 			Threads:   4,
 		})
 		Expect(err).ToNot(HaveOccurred(), "LoadModel")
 		Expect(loadResp.GetSuccess()).To(BeTrue(), "LoadModel reported failure: %s", loadResp.GetMessage())
 		// Predict runs depth+pose and returns the JSON depthResult in Reply.Message.
 		reply, err := client.Predict(ctx, &pb.PredictOptions{
 			Images: []string{imgB64},
 		})
 		Expect(err).ToNot(HaveOccurred(), "Predict")
 		var res depthResult
 		Expect(json.Unmarshal(reply.GetMessage(), &res)).To(Succeed(), "Predict returned non-JSON: %q", string(reply.GetMessage()))
 		Expect(res.DepthW).To(BeNumerically(">", 0), "depth width should be positive")
 		Expect(res.DepthH).To(BeNumerically(">", 0), "depth height should be positive")
 		_, _ = fmt.Fprintf(GinkgoWriter, "depth OK: %dx%d min=%.3f max=%.3f\n",
 			res.DepthW, res.DepthH, res.DepthMin, res.DepthMax)
 	})
 })
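The smoke test sends the image as raw base64, one of the two payload shapes the backend accepts besides a filesystem path. As a rough sketch of how the raw and data-URL shapes could be decoded uniformly, the standalone program below uses a hypothetical `decodeImagePayload` helper (not part of the backend) built on `strings.Cut` and `encoding/base64`.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// decodeImagePayload (hypothetical helper) accepts either raw standard base64
// or a data URL of the form "data:<mime>;base64,<payload>" and returns the
// decoded bytes.
func decodeImagePayload(arg string) ([]byte, error) {
	if strings.HasPrefix(arg, "data:") {
		// Keep only what follows the first comma, i.e. the base64 payload.
		if _, payload, ok := strings.Cut(arg, ","); ok {
			arg = payload
		}
	}
	return base64.StdEncoding.DecodeString(arg)
}

func main() {
	raw := base64.StdEncoding.EncodeToString([]byte("hello"))
	for _, in := range []string{raw, "data:image/jpeg;base64," + raw} {
		b, err := decodeImagePayload(in)
		fmt.Printf("%q %v\n", b, err)
	}
}
```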
--- a/backend/go/depth-anything-cpp/nested_e2e_test.go
+++ b/backend/go/depth-anything-cpp/nested_e2e_test.go
@@ -0,0 +1,64 @@
 package main
 // nested_e2e_test.go - e2e smoke for the nested two-file metric model. Loads the
 // anyview branch as the main model and points the metric branch via the
 // "metric_model:<file>" option (exactly as the depth-anything-3-nested gallery
 // entry does), then exercises the typed Depth RPC and asserts a metric depth map.
 //
 // Skips cleanly unless both nested GGUFs are present under ./test-models/ and the
 // backend binary + fallback .so are built.
 import (
 	"context"
 	"fmt"
 	"path/filepath"
 	"time"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
 )
 var _ = Describe("depth-anything-cpp nested metric model", func() {
 	It("loads the two-file pair via the metric_model option and returns metric depth", func() {
 		anyviewPath := modelPathOrSkip("depth-anything-nested-anyview.gguf")
 		_ = modelPathOrSkip("depth-anything-nested-metric.gguf")
 		imgB64 := loadTestImage()
 		port := freePort()
 		cleanup := startBackend(port)
 		defer cleanup()
 		client, closeConn := dialBackend(port)
 		defer closeConn()
 		ctx, cancel := context.WithTimeout(context.Background(), 25*time.Minute)
 		defer cancel()
 		loadResp, err := client.LoadModel(ctx, &pb.ModelOptions{
 			Model:     "depth-anything-nested-anyview.gguf",
 			ModelFile: anyviewPath,
 			ModelPath: filepath.Dir(anyviewPath),
 			Options:   []string{"metric_model:depth-anything-nested-metric.gguf"},
 			Threads:   8,
 		})
 		Expect(err).ToNot(HaveOccurred(), "LoadModel(nested)")
 		Expect(loadResp.GetSuccess()).To(BeTrue(), "LoadModel reported failure: %s", loadResp.GetMessage())
 		resp, err := client.Depth(ctx, &pb.DepthRequest{
 			Src:          imgB64,
 			IncludeDepth: true,
 			IncludePose:  true,
 		})
 		Expect(err).ToNot(HaveOccurred(), "Depth(nested)")
 		Expect(resp.GetWidth()).To(BeNumerically(">", 0), "depth width")
 		Expect(resp.GetHeight()).To(BeNumerically(">", 0), "depth height")
 		Expect(resp.GetIsMetric()).To(BeTrue(), "nested output must be metric")
 		Expect(len(resp.GetDepth())).To(Equal(int(resp.GetWidth())*int(resp.GetHeight())), "dense depth length")
 		Expect(len(resp.GetExtrinsics())).To(Equal(12), "extrinsics 3x4")
 		Expect(len(resp.GetIntrinsics())).To(Equal(9), "intrinsics 3x3")
 		Expect(resp.GetIntrinsics()[0]).To(BeNumerically(">", 0), "fx > 0")
 		_, _ = fmt.Fprintf(GinkgoWriter, "nested depth OK: %dx%d is_metric=%v fx=%.2f\n",
 			resp.GetWidth(), resp.GetHeight(), resp.GetIsMetric(), resp.GetIntrinsics()[0])
 	})
 })
--- a/backend/go/depth-anything-cpp/options_test.go
+++ b/backend/go/depth-anything-cpp/options_test.go
@@ -0,0 +1,20 @@
 package main
 import (
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
 )
 var _ = DescribeTable("optionValue",
 	func(opts []string, key, want string) {
 		Expect(optionValue(opts, key)).To(Equal(want))
 	},
 	Entry("present", []string{"foo:bar", "metric_model:m.gguf"}, "metric_model", "m.gguf"),
 	Entry("absent", []string{"foo:bar"}, "metric_model", ""),
 	Entry("nil", []string(nil), "metric_model", ""),
 	Entry("trims space", []string{"metric_model:  m.gguf  "}, "metric_model", "m.gguf"),
 	Entry("value with colon", []string{"metric_model:a:b.gguf"}, "metric_model", "a:b.gguf"),
 	Entry("first wins", []string{"metric_model:first.gguf", "metric_model:second.gguf"}, "metric_model", "first.gguf"),
 	Entry("empty value", []string{"metric_model:"}, "metric_model", ""),
 	Entry("prefix not key", []string{"metric_model_extra:x"}, "metric_model", ""),
 )
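The table above pins down the `key:value` semantics: first match wins, the value may itself contain colons, and a longer key sharing the prefix does not match. `optionValue` cannot distinguish an absent key from an empty value, though. The sketch below shows one possible variant that reports presence separately. `lookupOption` is a hypothetical name, not an existing LocalAI helper, and it relies on `strings.Cut` (Go 1.18+).

```go
package main

import (
	"fmt"
	"strings"
)

// lookupOption (hypothetical helper) returns the trimmed value of the first
// "key:value" entry whose key matches exactly, plus whether such an entry
// exists. The split happens at the first colon only, so values may contain
// colons themselves.
func lookupOption(opts []string, key string) (string, bool) {
	for _, o := range opts {
		k, v, ok := strings.Cut(o, ":")
		if ok && k == key {
			return strings.TrimSpace(v), true
		}
	}
	return "", false
}

func main() {
	opts := []string{"foo:bar", "metric_model:  a:b.gguf  "}
	v, found := lookupOption(opts, "metric_model")
	fmt.Printf("%q %v\n", v, found)
	v, found = lookupOption(opts, "metric")
	fmt.Printf("%q %v\n", v, found)
}
```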
--- a/backend/go/depth-anything-cpp/package.sh
+++ b/backend/go/depth-anything-cpp/package.sh
@@ -0,0 +1,59 @@
 #!/bin/bash
 # Script to copy the appropriate libraries based on architecture
 set -e
 CURDIR=$(dirname "$(realpath $0)")
 REPO_ROOT="${CURDIR}/../../.."
 # Create lib directory
 mkdir -p $CURDIR/package/lib
 cp -avf $CURDIR/libdepthanythingcpp-*.so $CURDIR/package/
 cp -avf $CURDIR/depth-anything-cpp $CURDIR/package/
 cp -fv $CURDIR/run.sh $CURDIR/package/
 # Detect architecture and copy appropriate libraries
 if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    # x86_64 architecture
    echo "Detected x86_64 architecture, copying x86_64 libraries..."
    cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
 elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
    # ARM64 architecture
    echo "Detected ARM64 architecture, copying ARM64 libraries..."
    cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
 elif [ $(uname -s) = "Darwin" ]; then
    echo "Detected Darwin"
 else
    echo "Error: Could not detect architecture"
    exit 1
 fi
 # Package GPU libraries based on BUILD_TYPE
 GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
 if [ -f "$GPU_LIB_SCRIPT" ]; then
    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
    package_gpu_libs
 fi
 echo "Packaging completed successfully"
 ls -liah $CURDIR/package/
 ls -liah $CURDIR/package/lib/
--- a/backend/go/depth-anything-cpp/run.sh
+++ b/backend/go/depth-anything-cpp/run.sh
@@ -0,0 +1,52 @@
 #!/bin/bash
 set -ex
 # Get the absolute current dir where the script is located
 CURDIR=$(dirname "$(realpath $0)")
 cd /
 echo "CPU info:"
 if [ "$(uname)" != "Darwin" ]; then
 	grep -e "model\sname" /proc/cpuinfo | head -1
 	grep -e "flags" /proc/cpuinfo | head -1
 fi
 LIBRARY="$CURDIR/libdepthanythingcpp-fallback.so"
 if [ "$(uname)" != "Darwin" ]; then
 	if grep -q -e "\savx\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX    found OK"
 		if [ -e $CURDIR/libdepthanythingcpp-avx.so ]; then
 			LIBRARY="$CURDIR/libdepthanythingcpp-avx.so"
 		fi
 	fi
 	if grep -q -e "\savx2\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX2   found OK"
 		if [ -e $CURDIR/libdepthanythingcpp-avx2.so ]; then
 			LIBRARY="$CURDIR/libdepthanythingcpp-avx2.so"
 		fi
 	fi
 	# Check avx 512
 	if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX512F found OK"
 		if [ -e $CURDIR/libdepthanythingcpp-avx512.so ]; then
 			LIBRARY="$CURDIR/libdepthanythingcpp-avx512.so"
 		fi
 	fi
 fi
 export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 export DEPTHANYTHING_LIBRARY=$LIBRARY
 # If there is a lib/ld.so, use it
 if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using library: $LIBRARY"
 	exec $CURDIR/lib/ld.so $CURDIR/depth-anything-cpp "$@"
 fi
 echo "Using library: $LIBRARY"
 exec $CURDIR/depth-anything-cpp "$@"
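The selection order in run.sh (start from the fallback library, then let each stronger CPU variant override it when both the CPU flag and the `.so` are present) can be expressed as a small pure function. This is an illustrative sketch only: `pickVariant` is a hypothetical name, and it takes the flags line and the set of shipped variants as arguments rather than reading `/proc/cpuinfo` or the filesystem.

```go
package main

import (
	"fmt"
	"strings"
)

// pickVariant (hypothetical helper) mirrors the precedence in run.sh: the
// result starts as "fallback" and is overridden by each variant whose CPU flag
// is present and whose library is available, checked weakest to strongest.
func pickVariant(cpuFlags string, available map[string]bool) string {
	flags := make(map[string]bool)
	for _, f := range strings.Fields(cpuFlags) {
		flags[f] = true
	}
	choice := "fallback"
	for _, c := range []struct{ flag, variant string }{
		{"avx", "avx"},
		{"avx2", "avx2"},
		{"avx512f", "avx512"},
	} {
		if flags[c.flag] && available[c.variant] {
			choice = c.variant
		}
	}
	return choice
}

func main() {
	shipped := map[string]bool{"avx": true, "avx2": true}
	fmt.Println(pickVariant("fpu sse2 avx avx2 avx512f", shipped))
}
```

Because a missing library never overrides the current choice, a CPU that supports AVX512F still gets the AVX2 build when no AVX512 library is shipped.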
--- a/backend/go/depth-anything-cpp/test.sh
+++ b/backend/go/depth-anything-cpp/test.sh
@@ -0,0 +1,45 @@
 #!/bin/bash
 set -e
 CURDIR=$(dirname "$(realpath $0)")
 echo "Running depth-anything-cpp backend tests..."
 # Test model from the mudler/depth-anything.cpp-gguf HuggingFace repo. The small
 # (vits) f32 GGUF is the lightest backbone (~131 MB), so it keeps the download
 # cheap. It is resumed with `curl -C -` and skipped entirely if already present.
 DEPTHANYTHING_MODEL_DIR="${DEPTHANYTHING_MODEL_DIR:-$CURDIR/test-models}"
 DEPTHANYTHING_MODEL_FILE="${DEPTHANYTHING_MODEL_FILE:-depth-anything-small-f32.gguf}"
 DEPTHANYTHING_MODEL_URL="${DEPTHANYTHING_MODEL_URL:-https://huggingface.co/mudler/depth-anything.cpp-gguf/resolve/main/depth-anything-small-f32.gguf}"
 mkdir -p "$DEPTHANYTHING_MODEL_DIR"
 if [ ! -f "$DEPTHANYTHING_MODEL_DIR/$DEPTHANYTHING_MODEL_FILE" ]; then
    echo "Downloading depth-anything small f32 model (~131 MB)..."
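    # -C - resumes a partial download so an interrupted run doesn't restart from 0.
    # Download to a .part file and rename on success; otherwise the -f check
    # above would treat a truncated file as complete and never resume it.
    curl -L -C - -o "$DEPTHANYTHING_MODEL_DIR/$DEPTHANYTHING_MODEL_FILE.part" "$DEPTHANYTHING_MODEL_URL" --progress-bar
    mv "$DEPTHANYTHING_MODEL_DIR/$DEPTHANYTHING_MODEL_FILE.part" "$DEPTHANYTHING_MODEL_DIR/$DEPTHANYTHING_MODEL_FILE"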
 fi
 # Use a real photo (people + cars) from the upstream rf-detr.cpp repo (~46 KB).
 # Depth estimation needs real content; a synthetic image would be degenerate.
 TEST_IMAGE_DIR="$CURDIR/test-data"
 TEST_IMAGE_FILE="$TEST_IMAGE_DIR/test.jpg"
 TEST_IMAGE_URL="${TEST_IMAGE_URL:-https://raw.githubusercontent.com/mudler/rf-detr.cpp/main/tests/fixtures/ci/test_image.jpg}"
 mkdir -p "$TEST_IMAGE_DIR"
 if [ ! -f "$TEST_IMAGE_FILE" ]; then
    echo "Downloading test image..."
    curl -L -o "$TEST_IMAGE_FILE" "$TEST_IMAGE_URL" --progress-bar
 fi
 echo "depth-anything-cpp test setup complete."
 echo "  model:      $DEPTHANYTHING_MODEL_DIR/$DEPTHANYTHING_MODEL_FILE"
 echo "  test image: $TEST_IMAGE_FILE"
 # Run the Go smoke test: spawns the backend binary on a free port, calls
 # LoadModel + Predict via gRPC against the downloaded GGUF + image.
 echo ""
 echo "Running Go smoke test..."
 cd "$CURDIR"
 go test -v -timeout 30m ./...
--- a/backend/go/omnivoice-cpp/Makefile
+++ b/backend/go/omnivoice-cpp/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
 # omnivoice.cpp version
 OMNIVOICE_REPO?=https://github.com/ServeurpersoCom/omnivoice.cpp
-OMNIVOICE_VERSION?=2603355a5dfacae5cfc33531d5d0933221843509
+OMNIVOICE_VERSION?=96d30169afd5e6bb3fd6a0e9be0eb505bfe81fcd
 SO_TARGET?=libgomnivoicecpp.so
 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/go/parakeet-cpp/Makefile
+++ b/backend/go/parakeet-cpp/Makefile
@@ -1,6 +1,6 @@
 # parakeet-cpp backend Makefile.
 #
-# Upstream pin lives below as PARAKEET_VERSION?=b8012f11e5269126eddb7f4fd02f891a2ccc29b0
+# Upstream pin lives below as PARAKEET_VERSION?=89f5e2977b4d8bccd45e7bcc6f2ef7c4ed49e89a
 # (.github/bump_deps.sh) can find and update it - matches the
 # whisper.cpp / ds4 / vibevoice-cpp convention.
 #
@@ -15,7 +15,7 @@
 # That's what the L0 smoke test uses. The default target below does the
 # proper clone-at-pin + cmake build so CI doesn't need a side-checkout.
-PARAKEET_VERSION?=b8012f11e5269126eddb7f4fd02f891a2ccc29b0
+PARAKEET_VERSION?=89f5e2977b4d8bccd45e7bcc6f2ef7c4ed49e89a
 PARAKEET_REPO?=https://github.com/mudler/parakeet.cpp
 GOCMD?=go
--- a/backend/go/parakeet-cpp/package.sh
+++ b/backend/go/parakeet-cpp/package.sh
@@ -1,23 +1,68 @@
 #!/bin/bash
 #
-# L0 packaging stub: copy the binary, run.sh and libparakeet.so* into
+# Bundle the parakeet-cpp-grpc binary, libparakeet.so, the core runtime
-# package/. The full ldd walk (libc, libstdc++, libgomp, GPU runtimes,
+# libs (libc/libstdc++/libgomp + ld.so) and the GPU runtime for the active
-# arch detection) lands in L3, mirroring backend/go/whisper/package.sh.
+# BUILD_TYPE so the package is self-contained. Mirrors
 # backend/go/whisper/package.sh; run.sh routes the (CGO_ENABLED=0) binary
 # through lib/ld.so so the packaged libc is used instead of the host's.
 set -e
 CURDIR=$(dirname "$(realpath "$0")")
 REPO_ROOT="${CURDIR}/../../.."
 mkdir -p "$CURDIR/package/lib"
 cp -avf "$CURDIR/parakeet-cpp-grpc" "$CURDIR/package/"
 cp -avf "$CURDIR/run.sh" "$CURDIR/package/"
-# libparakeet.so + any soname symlinks (libparakeet.so.X, libparakeet.so.X.Y).
+# libparakeet.so + any soname symlinks (libparakeet.so.X[.Y]). purego.Dlopen
 # resolves it via LD_LIBRARY_PATH, which run.sh points at lib/.
 cp -avf "$CURDIR"/libparakeet.so* "$CURDIR/package/lib/" 2>/dev/null || {
 	echo "ERROR: libparakeet.so not found in $CURDIR, run 'make' first" >&2
 	exit 1
 }
-echo "L0 package layout (full ldd walk lands in L3):"
+# Detect architecture and copy the core runtime libs libparakeet.so links
 # against, plus the matching dynamic loader as lib/ld.so.
 if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    echo "Detected x86_64 architecture, copying x86_64 libraries..."
    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
 elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
    echo "Detected ARM64 architecture, copying ARM64 libraries..."
    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
 elif [ "$(uname -s)" = "Darwin" ]; then
    echo "Detected Darwin"
 else
    echo "Error: Could not detect architecture"
    exit 1
 fi
 # Package GPU libraries (CUDA/ROCm/Intel/Vulkan loader + ICDs + drivers)
 # based on BUILD_TYPE so the backend can reach the GPU without the runtime
 # base image shipping those drivers.
 GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
 if [ -f "$GPU_LIB_SCRIPT" ]; then
    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
    package_gpu_libs
 fi
 echo "Packaging completed successfully"
 ls -liah "$CURDIR/package/" "$CURDIR/package/lib/"
--- a/backend/go/qwen3-tts-cpp/Makefile
+++ b/backend/go/qwen3-tts-cpp/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
 # qwentts.cpp version
 QWEN3TTS_REPO?=https://github.com/ServeurpersoCom/qwentts.cpp
-QWEN3TTS_CPP_VERSION?=0bf4a18b22e8bb8718d95294e9f7f45c0d4270a4
+QWEN3TTS_CPP_VERSION?=4536dcdce27c3764a93a06d6bf64026b124962f5
 SO_TARGET?=libgoqwen3ttscpp.so
 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/go/stablediffusion-ggml/Makefile
+++ b/backend/go/stablediffusion-ggml/Makefile
@@ -8,10 +8,16 @@ JOBS?=$(shell nproc --ignore=1)
 # stablediffusion.cpp (ggml)
 STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
-STABLEDIFFUSION_GGML_VERSION?=19bdfe22d255d5b4dff39d449318b9bc5ea2317f
+STABLEDIFFUSION_GGML_VERSION?=f440ad9c29dd8bc34e5d1f4b863832b96d6ea05f
 CMAKE_ARGS+=-DGGML_MAX_NAME=128
 # Enable the ggml RPC backend so generation can be sharded across remote
 # rpc-server workers (the same backend-agnostic ggml rpc-server used by the
 # llama.cpp backend). Servers are selected via the `rpc_servers` option or the
 # LLAMACPP_GRPC_SERVERS env var (populated automatically in p2p worker mode).
 CMAKE_ARGS+=-DSD_RPC=ON
 ifeq ($(NATIVE),false)
 	CMAKE_ARGS+=-DGGML_NATIVE=OFF
 endif
--- a/backend/go/stablediffusion-ggml/cpp/gosd.cpp
+++ b/backend/go/stablediffusion-ggml/cpp/gosd.cpp
@@ -391,10 +391,18 @@ int load_model(const char *model, char *model_path, char* options[], int threads
    const char *control_net_path = "";
    const char *embedding_dir = "";
    const char *photo_maker_path = "";
    const char *pulid_weights_path = "";
    const char *tensor_type_rules = "";
    char *lora_dir = model_path;
-    bool vae_decode_only = true;
+    // Upstream backend/parameter placement specs (see docs/.../stablediffusion).
    // Empty means "leave at upstream default" (nullptr).
    const char *backend_arg = "";
    const char *params_backend_arg = "";
    const char *rpc_servers_arg = "";
    const char *max_vram_arg = "";
    bool stream_layers = false;
    int n_threads = threads;
    enum sd_type_t wtype = SD_TYPE_COUNT;
    enum rng_type_t rng_type = CUDA_RNG;
@@ -418,7 +426,9 @@ int load_model(const char *model, char *model_path, char* options[], int threads
    // If options is not NULL, parse options
    for (int i = 0; options[i] != NULL; i++) {
        const char *optname = strtok(options[i], ":");
-        const char *optval = strtok(NULL, ":");
+        // Take everything after the first ':' as the value so values may
        // themselves contain colons (e.g. rpc_servers host:port lists).
        const char *optval = strtok(NULL, "");
        if (optval == NULL) {
            optval = "true";
        }
@@ -490,9 +500,21 @@ int load_model(const char *model, char *model_path, char* options[], int threads
            }
        }
        if (!strcmp(optname, "photo_maker_path")) photo_maker_path = strdup(optval);
        if (!strcmp(optname, "pulid_weights_path")) pulid_weights_path = strdup(optval);
        if (!strcmp(optname, "tensor_type_rules")) tensor_type_rules = strdup(optval);
-        if (!strcmp(optname, "vae_decode_only")) vae_decode_only = (strcmp(optval, "true") == 0 || strcmp(optval, "1") == 0);
+        // Backend / parameter placement specs (see prepare_backend_assignments
        // in the upstream CLI). These compose with the legacy keep_*_on_cpu /
        // offload_params_to_cpu booleans below.
        if (!strcmp(optname, "backend")) backend_arg = strdup(optval);
        if (!strcmp(optname, "params_backend")) params_backend_arg = strdup(optval);
        if (!strcmp(optname, "rpc_servers")) rpc_servers_arg = strdup(optval);
        if (!strcmp(optname, "max_vram")) max_vram_arg = strdup(optval);
        if (!strcmp(optname, "stream_layers")) stream_layers = (strcmp(optval, "true") == 0 || strcmp(optval, "1") == 0);
        // vae_decode_only is still accepted for backwards compatibility with
        // existing gallery configs, but upstream dropped the option (the model
        // now decides), so it is parsed and ignored.
        if (!strcmp(optname, "offload_params_to_cpu")) offload_params_to_cpu = (strcmp(optval, "true") == 0 || strcmp(optval, "1") == 0);
        if (!strcmp(optname, "keep_clip_on_cpu")) keep_clip_on_cpu = (strcmp(optval, "true") == 0 || strcmp(optval, "1") == 0);
        if (!strcmp(optname, "keep_control_net_on_cpu")) keep_control_net_on_cpu = (strcmp(optval, "true") == 0 || strcmp(optval, "1") == 0);
@@ -591,20 +613,48 @@ int load_model(const char *model, char *model_path, char* options[], int threads
    ctx_params.embeddings = embedding_vec.empty() ? NULL : embedding_vec.data();
    ctx_params.embedding_count = static_cast<uint32_t>(embedding_vec.size());
    ctx_params.photo_maker_path = photo_maker_path;
    if (strlen(pulid_weights_path) > 0) ctx_params.pulid_weights_path = pulid_weights_path;
    ctx_params.tensor_type_rules = tensor_type_rules;
-    ctx_params.vae_decode_only = vae_decode_only;
    // XXX: Setting to true causes a segfault on the second run
    ctx_params.free_params_immediately = false;
    ctx_params.n_threads = n_threads;
    ctx_params.rng_type = rng_type;
-    ctx_params.keep_clip_on_cpu = keep_clip_on_cpu;
    if (wtype != SD_TYPE_COUNT) ctx_params.wtype = wtype;
    if (sampler_rng_type != RNG_TYPE_COUNT) ctx_params.sampler_rng_type = sampler_rng_type;
    if (prediction != PREDICTION_COUNT) ctx_params.prediction = prediction;
    if (lora_apply_mode != LORA_APPLY_MODE_COUNT) ctx_params.lora_apply_mode = lora_apply_mode;
-    ctx_params.offload_params_to_cpu = offload_params_to_cpu;
+    // Backend / parameter placement specs. Upstream replaced the boolean
-    ctx_params.keep_control_net_on_cpu = keep_control_net_on_cpu;
+    // CPU-offload knobs (offload_params_to_cpu, keep_clip_on_cpu, keep_vae_on_cpu,
-    ctx_params.keep_vae_on_cpu = keep_vae_on_cpu;
+    // keep_control_net_on_cpu) with these specs. Seed from the explicit
    // backend/params_backend options, then prepend the legacy boolean-derived
    // assignments, mirroring prepare_backend_assignments() in the upstream CLI.
    // These strings must outlive new_sd_ctx() below.
    std::string backend_spec = backend_arg;
    std::string params_backend_spec = params_backend_arg;
    auto prepend_spec = [](std::string& spec, const char* assignment) {
        spec = spec.empty() ? std::string(assignment) : std::string(assignment) + "," + spec;
    };
    if (offload_params_to_cpu) prepend_spec(params_backend_spec, "*=cpu");
    if (keep_clip_on_cpu) prepend_spec(backend_spec, "te=cpu");
    if (keep_vae_on_cpu) prepend_spec(backend_spec, "vae=cpu");
    if (keep_control_net_on_cpu) prepend_spec(backend_spec, "controlnet=cpu");
    if (!backend_spec.empty()) ctx_params.backend = backend_spec.c_str();
    if (!params_backend_spec.empty()) ctx_params.params_backend = params_backend_spec.c_str();
    // RPC servers: prefer the explicit option, otherwise fall back to the
    // LLAMACPP_GRPC_SERVERS env var. LocalAI's p2p worker mode populates that
    // var with discovered ggml rpc-server workers (shared with the llama.cpp
    // backend), so distributed image generation works with no extra config.
    if (strlen(rpc_servers_arg) > 0) {
        ctx_params.rpc_servers = rpc_servers_arg;
    } else {
        const char* env_rpc_servers = std::getenv("LLAMACPP_GRPC_SERVERS");
        if (env_rpc_servers != NULL && strlen(env_rpc_servers) > 0) {
            ctx_params.rpc_servers = env_rpc_servers;
        }
    }
    // max_vram: GiB budget or per-backend spec for graph-cut segmented param
    // offload ("0" = disabled, "-1" = auto). stream_layers only has effect when
    // max_vram is set.
    if (strlen(max_vram_arg) > 0) ctx_params.max_vram = max_vram_arg;
    ctx_params.stream_layers = stream_layers;
    ctx_params.diffusion_flash_attn = diffusion_flash_attn;
    ctx_params.tae_preview_only = tae_preview_only;
    ctx_params.diffusion_conv_direct = diffusion_conv_direct;
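Reviewer note: the option handling in this hunk rests on two small string rules, namely "everything after the first `:` is the value" and "legacy boolean assignments are prepended to the explicit spec". The Go sketch below is only an analogue of those rules, not the C++ code in `gosd.cpp`. `splitOption` and `prependSpec` are hypothetical names, and the `diffusion=cuda0` spec value is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOption is a hypothetical Go analogue of the strtok-based parsing:
// only the first ':' separates the name from the value, and a missing or
// empty value is treated as the boolean flag "true".
func splitOption(opt string) (string, string) {
	name, value, _ := strings.Cut(opt, ":")
	if value == "" {
		return name, "true"
	}
	return name, value
}

// prependSpec mirrors the prepend_spec lambda: a legacy boolean-derived
// assignment goes in front of whatever the explicit option supplied.
func prependSpec(spec, assignment string) string {
	if spec == "" {
		return assignment
	}
	return assignment + "," + spec
}

func main() {
	// The value keeps its own colons (host:port pairs).
	name, value := splitOption("rpc_servers:10.0.0.2:50052,10.0.0.3:50052")
	fmt.Println(name, "=", value)

	// keep_clip_on_cpu and keep_vae_on_cpu layered on an explicit spec.
	spec := "diffusion=cuda0" // made-up value, for illustration only
	spec = prependSpec(spec, "te=cpu")
	spec = prependSpec(spec, "vae=cpu")
	fmt.Println(spec)
}
```

With the previous `strtok(NULL, ":")` call the first example would have yielded only `10.0.0.2` as the value, which is why the delimiter change matters for `rpc_servers`.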
--- a/backend/go/supertonic/.gitignore
+++ b/backend/go/supertonic/.gitignore
@@ -0,0 +1,4 @@
 /supertonic
 /sources/
 /backend-assets/
 /package/
--- a/backend/go/supertonic/Makefile
+++ b/backend/go/supertonic/Makefile
@@ -0,0 +1,62 @@
 CURRENT_DIR=$(abspath ./)
 GOCMD=go
 ONNX_VERSION?=1.24.4
 ONNX_ARCH?=x64
 ONNX_OS?=linux
 ifneq (,$(findstring aarch64,$(shell uname -m)))
 	ONNX_ARCH=aarch64
 endif
 ifeq ($(OS),Darwin)
 	ONNX_OS=osx
 	ifneq (,$(findstring arm64,$(shell uname -m)))
 		ONNX_ARCH=arm64
 	else
 		ONNX_ARCH=x86_64
 	endif
 endif
 # CUDA 12 ships as -gpu, CUDA 13 as -gpu_cuda13 (underscore). CPU has no suffix.
 ifeq ($(BUILD_TYPE),cublas)
 	ONNX_PROVIDER=cuda
 	ifeq ($(CUDA_MAJOR_VERSION),13)
 		ONNX_VARIANT=-gpu_cuda13
 	else
 		ONNX_VARIANT=-gpu
 	endif
 else
 	ONNX_VARIANT=
 	ONNX_PROVIDER=cpu
 endif
 sources/onnxruntime:
 	mkdir -p sources/onnxruntime
 	curl -L https://github.com/microsoft/onnxruntime/releases/download/v$(ONNX_VERSION)/onnxruntime-$(ONNX_OS)-$(ONNX_ARCH)$(ONNX_VARIANT)-$(ONNX_VERSION).tgz \
 	  -o sources/onnxruntime/onnxruntime.tgz
 	cd sources/onnxruntime && tar -xf onnxruntime.tgz --strip-components=1 && rm onnxruntime.tgz
 backend-assets/lib: sources/onnxruntime
 	mkdir -p backend-assets/lib
 	cp -rfLv sources/onnxruntime/lib/* backend-assets/lib/
 supertonic: backend-assets/lib
 	CGO_ENABLED=1 $(GOCMD) build \
 	  -ldflags "$(LD_FLAGS) -X main.onnxProvider=$(ONNX_PROVIDER)" \
 	  -tags "$(GO_TAGS)" -o supertonic ./
 package:
 	bash package.sh
 build: supertonic package
 # Tests need only the Go toolchain plus a C compiler (gcc, for cgo). The
 # yalue/onnxruntime_go binding dlopens onnxruntime at runtime, so no tarball
 # download is required to compile or run the unit specs.
 test:
 	CGO_ENABLED=1 $(GOCMD) test -v -timeout 120s ./...
 clean:
 	rm -rf supertonic sources/ backend-assets/ package/
 .PHONY: build package clean test
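Reviewer note: the variant selection and URL pattern in the `sources/onnxruntime` rule can be checked by hand with a small Go sketch. The function names `onnxVariant` and `onnxTarballURL` are hypothetical; the suffixes and the URL template are copied from the Makefile above. Whether a given asset actually exists on the upstream release page is not verified here.

```go
package main

import "fmt"

// onnxVariant maps the build settings to the release-asset suffix used in
// the Makefile: "-gpu" for CUDA 12, "-gpu_cuda13" for CUDA 13, empty for CPU.
func onnxVariant(buildType, cudaMajor string) string {
	if buildType != "cublas" {
		return ""
	}
	if cudaMajor == "13" {
		return "-gpu_cuda13"
	}
	return "-gpu"
}

// onnxTarballURL assembles the download URL with the same pattern as the
// sources/onnxruntime rule.
func onnxTarballURL(version, osName, arch, variant string) string {
	return fmt.Sprintf(
		"https://github.com/microsoft/onnxruntime/releases/download/v%s/onnxruntime-%s-%s%s-%s.tgz",
		version, osName, arch, variant, version)
}

func main() {
	// CPU build on x86_64 Linux.
	fmt.Println(onnxTarballURL("1.24.4", "linux", "x64", onnxVariant("", "")))
	// CUDA 13 build on x86_64 Linux.
	fmt.Println(onnxTarballURL("1.24.4", "linux", "x64", onnxVariant("cublas", "13")))
}
```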
--- a/backend/go/supertonic/backend.go
+++ b/backend/go/supertonic/backend.go
@@ -0,0 +1,307 @@
 package main
 import (
 	"bytes"
 	"encoding/binary"
 	"fmt"
 	"os"
 	"path/filepath"
 	"strconv"
 	"strings"
 	"sync"
 	laudio "github.com/mudler/LocalAI/pkg/audio"
 	"github.com/mudler/LocalAI/pkg/grpc/base"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 )
 // onnxProvider is set at link time via -ldflags "-X main.onnxProvider=...";
 // the Makefile passes "cuda" for BUILD_TYPE=cublas and "cpu" otherwise.
 // Defaults to CPU when built without that flag.
 var onnxProvider = "cpu"
 // Per-model generation defaults, overridable via ModelOptions.Options:
 //
 //	supertonic.steps=<int>          denoising steps (quality), default 8
 //	supertonic.speed=<float>        speech rate, default 1.05
 //	supertonic.silence=<float>      inter-chunk silence seconds, default 0.3
 //	supertonic.default_voice=<name> voice-style used when request omits voice
 //	supertonic.default_lang=<lang>  language tag used when request omits it
 const (
 	optionSteps        = "supertonic.steps="
 	optionSpeed        = "supertonic.speed="
 	optionSilence      = "supertonic.silence="
 	optionDefaultVoice = "supertonic.default_voice="
 	optionDefaultLang  = "supertonic.default_lang="
 )
 type SupertonicBackend struct {
 	base.SingleThread
 	tts          *TextToSpeech
 	cfg          Config
 	modelDir     string
 	voicesDir    string
 	defaultVoice string
 	defaultLang  string
 	steps        int
 	speed        float32
 	silence      float32
 	styleMu sync.Mutex
 	styles  map[string]*Style // voice name -> loaded style cache
 }
 func (s *SupertonicBackend) Load(opts *pb.ModelOptions) error {
 	modelDir, err := resolveModelDir(opts.ModelFile)
 	if err != nil {
 		return err
 	}
 	s.modelDir = modelDir
 	s.voicesDir = resolveVoicesDir(modelDir)
 	cfg, err := LoadCfgs(modelDir)
 	if err != nil {
 		return fmt.Errorf("loading tts.json from %s: %w", modelDir, err)
 	}
 	s.cfg = cfg
 	// onnxProvider is "cpu" for the CPU build; the CUDA build sets it to
 	// "cuda" via -ldflags. Upstream LoadTextToSpeech still errors on GPU
 	// until the CUDA phase wires the execution provider.
 	tts, err := LoadTextToSpeech(modelDir, onnxProvider == "cuda", cfg)
 	if err != nil {
 		return fmt.Errorf("loading supertonic models from %s: %w", modelDir, err)
 	}
 	s.tts = tts
 	s.steps = int(findOptionInt(opts, optionSteps, 8))
 	s.speed = findOptionFloat(opts, optionSpeed, 1.05)
 	s.silence = findOptionFloat(opts, optionSilence, 0.3)
 	s.defaultVoice = findOptionValue(opts, optionDefaultVoice, "")
 	s.defaultLang = findOptionValue(opts, optionDefaultLang, "na")
 	s.styles = map[string]*Style{}
 	return nil
 }
 func (s *SupertonicBackend) TTS(req *pb.TTSRequest) error {
 	wav, sr, err := s.synthesize(req)
 	if err != nil {
 		return err
 	}
 	out := make([]float64, len(wav))
 	for i, v := range wav {
 		out[i] = float64(v)
 	}
 	if err := writeWavFile(req.Dst, out, sr); err != nil {
 		return fmt.Errorf("writing wav to %s: %w", req.Dst, err)
 	}
 	return nil
 }
 func (s *SupertonicBackend) TTSStream(req *pb.TTSRequest, results chan []byte) error {
 	defer close(results)
 	wav, sr, err := s.synthesize(req)
 	if err != nil {
 		return err
 	}
 	results <- streamingWAVHeader(uint32(sr))
 	const chunkSamples = 4096
 	for off := 0; off < len(wav); off += chunkSamples {
 		end := off + chunkSamples
 		if end > len(wav) {
 			end = len(wav)
 		}
 		results <- pcmFloatToInt16LE(wav[off:end])
 	}
 	return nil
 }
 // synthesize runs the full pipeline and returns the trimmed mono float32
 // PCM and its sample rate.
 func (s *SupertonicBackend) synthesize(req *pb.TTSRequest) ([]float32, int, error) {
 	if s.tts == nil {
 		return nil, 0, fmt.Errorf("supertonic model not loaded")
 	}
 	if strings.TrimSpace(req.Text) == "" {
 		return nil, 0, fmt.Errorf("empty text")
 	}
 	style, err := s.loadStyle(s.voiceName(req.Voice))
 	if err != nil {
 		return nil, 0, err
 	}
 	lang := s.resolveLang("")
 	if req.Language != nil {
 		lang = s.resolveLang(*req.Language)
 	}
 	wav, dur, err := s.tts.Call(req.Text, lang, style, s.steps, s.speed, s.silence)
 	if err != nil {
 		return nil, 0, err
 	}
 	sr := s.tts.SampleRate
 	// Call returns concatenated audio; trim to the reported duration.
 	wavLen := int(float32(sr) * dur)
 	if wavLen < 0 {
 		wavLen = 0
 	}
 	if wavLen > len(wav) {
 		wavLen = len(wav)
 	}
 	return wav[:wavLen], sr, nil
 }
 // voiceName picks the request voice, falling back to the model default.
 func (s *SupertonicBackend) voiceName(reqVoice string) string {
 	v := strings.TrimSpace(reqVoice)
 	if v == "" {
 		return s.defaultVoice
 	}
 	return v
 }
 // resolveLang validates against AvailableLangs, falling back to the model
 // default (then "na").
 func (s *SupertonicBackend) resolveLang(reqLang string) string {
 	l := strings.TrimSpace(reqLang)
 	if l != "" && isValidLang(l) {
 		return l
 	}
 	if s.defaultLang != "" && isValidLang(s.defaultLang) {
 		return s.defaultLang
 	}
 	return "na"
 }
 // loadStyle resolves and caches a voice-style. An empty name with no model
 // default is an error (supertonic requires a style embedding).
 func (s *SupertonicBackend) loadStyle(name string) (*Style, error) {
 	if name == "" {
 		return nil, fmt.Errorf("no voice specified and no supertonic.default_voice set")
 	}
 	s.styleMu.Lock()
 	defer s.styleMu.Unlock()
 	if st, ok := s.styles[name]; ok {
 		return st, nil
 	}
 	path := s.voiceStylePath(name)
 	st, err := LoadVoiceStyle([]string{path}, false)
 	if err != nil {
 		return nil, fmt.Errorf("loading voice style %q (%s): %w", name, path, err)
 	}
 	s.styles[name] = st
 	return st, nil
 }
 // voiceStylePath maps a voice name to a JSON path. Absolute paths are honored;
 // names containing a separator resolve under modelDir; bare names resolve under
 // the resolved voicesDir (see resolveVoicesDir).
 func (s *SupertonicBackend) voiceStylePath(name string) string {
 	if !strings.HasSuffix(name, ".json") {
 		name += ".json"
 	}
 	if filepath.IsAbs(name) {
 		return name
 	}
 	if strings.ContainsRune(name, filepath.Separator) {
 		return filepath.Join(s.modelDir, name)
 	}
 	return filepath.Join(s.voicesDir, name)
 }
 // resolveVoicesDir locates the voice_styles directory. The HF model layout
 // puts the ONNX files in an onnx/ subdir with voice_styles/ as its sibling,
 // so check modelDir/voice_styles first, then the parent's voice_styles.
 func resolveVoicesDir(modelDir string) string {
 	candidates := []string{
 		filepath.Join(modelDir, "voice_styles"),
 		filepath.Join(filepath.Dir(modelDir), "voice_styles"),
 	}
 	for _, c := range candidates {
 		if info, err := os.Stat(c); err == nil && info.IsDir() {
 			return c
 		}
 	}
 	return candidates[0]
 }
 // resolveModelDir accepts either a directory (used as-is) or a file (its
 // parent dir is used).
 func resolveModelDir(modelFile string) (string, error) {
 	if modelFile == "" {
 		return "", fmt.Errorf("empty model path")
 	}
 	info, err := os.Stat(modelFile)
 	if err != nil {
 		return "", fmt.Errorf("stat model path %s: %w", modelFile, err)
 	}
 	if info.IsDir() {
 		return modelFile, nil
 	}
 	return filepath.Dir(modelFile), nil
 }
 // ---- option helpers (mirrors backend/go/sherpa-onnx/backend.go) ----
 func findOptionValue(opts *pb.ModelOptions, prefix, def string) string {
 	for _, o := range opts.Options {
 		if strings.HasPrefix(o, prefix) {
 			return strings.TrimPrefix(o, prefix)
 		}
 	}
 	return def
 }
 func findOptionFloat(opts *pb.ModelOptions, prefix string, def float32) float32 {
 	raw := findOptionValue(opts, prefix, "")
 	if raw == "" {
 		return def
 	}
 	v, err := strconv.ParseFloat(raw, 32)
 	if err != nil {
 		return def
 	}
 	return float32(v)
 }
 func findOptionInt(opts *pb.ModelOptions, prefix string, def int32) int32 {
 	raw := findOptionValue(opts, prefix, "")
 	if raw == "" {
 		return def
 	}
 	v, err := strconv.ParseInt(raw, 10, 32)
 	if err != nil {
 		return def
 	}
 	return int32(v)
 }
 // ---- PCM helpers ----
 func pcmFloatToInt16LE(samples []float32) []byte {
 	buf := make([]byte, len(samples)*2)
 	for i, f := range samples {
 		v := int32(f * 32767)
 		if v > 32767 {
 			v = 32767
 		} else if v < -32768 {
 			v = -32768
 		}
 		binary.LittleEndian.PutUint16(buf[2*i:], uint16(int16(v)))
 	}
 	return buf
 }
 func streamingWAVHeader(sampleRate uint32) []byte {
 	const streamingSize = 0xFFFFFFFF
 	h := laudio.NewWAVHeaderWithRate(streamingSize, sampleRate)
 	h.ChunkSize = streamingSize
 	var buf bytes.Buffer
 	_ = h.Write(&buf)
 	return buf.Bytes()
 }
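Reviewer note: the float-to-PCM conversion used by `TTSStream` is easy to misread, because the scale factor is 32767 rather than 32768, so -1.0 maps to -32767 and only out-of-range input reaches the clamp. The standalone Go sketch below reproduces the same arithmetic under a hypothetical name (`floatToInt16LE`) so the byte layout can be inspected without loading a model.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// floatToInt16LE converts float32 samples in [-1, 1] to signed 16-bit
// little-endian PCM, clamping anything outside the int16 range.
func floatToInt16LE(samples []float32) []byte {
	buf := make([]byte, len(samples)*2)
	for i, f := range samples {
		v := int32(f * 32767)
		if v > 32767 {
			v = 32767
		} else if v < -32768 {
			v = -32768
		}
		binary.LittleEndian.PutUint16(buf[2*i:], uint16(int16(v)))
	}
	return buf
}

func main() {
	// 0 -> 0x0000, 1.0 -> 0x7fff, -1.0 -> 0x8001, 2.0 clamps to 0x7fff.
	fmt.Printf("% x\n", floatToInt16LE([]float32{0, 1, -1, 2}))
	// prints: 00 00 ff 7f 01 80 ff 7f
}
```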
--- a/backend/go/supertonic/backend_test.go
+++ b/backend/go/supertonic/backend_test.go
@@ -0,0 +1,86 @@
 package main
 import (
 	"os"
 	"path/filepath"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 )
 var _ = Describe("voiceStylePath", func() {
 	s := &SupertonicBackend{modelDir: "/models/st/onnx", voicesDir: "/models/st/voice_styles"}
 	It("resolves a bare name under the resolved voicesDir", func() {
 		Expect(s.voiceStylePath("M1")).To(Equal(filepath.Join("/models/st/voice_styles", "M1.json")))
 	})
 	It("keeps an explicit .json suffix", func() {
 		Expect(s.voiceStylePath("M1.json")).To(Equal(filepath.Join("/models/st/voice_styles", "M1.json")))
 	})
 	It("honors absolute paths", func() {
 		Expect(s.voiceStylePath("/abs/v.json")).To(Equal("/abs/v.json"))
 	})
 })
 var _ = Describe("resolveVoicesDir", func() {
 	It("prefers voice_styles under modelDir", func() {
 		dir := GinkgoT().TempDir()
 		Expect(os.MkdirAll(filepath.Join(dir, "voice_styles"), 0o755)).To(Succeed())
 		Expect(resolveVoicesDir(dir)).To(Equal(filepath.Join(dir, "voice_styles")))
 	})
 	It("falls back to the sibling voice_styles next to an onnx subdir", func() {
 		root := GinkgoT().TempDir()
 		Expect(os.MkdirAll(filepath.Join(root, "voice_styles"), 0o755)).To(Succeed())
 		Expect(os.MkdirAll(filepath.Join(root, "onnx"), 0o755)).To(Succeed())
 		Expect(resolveVoicesDir(filepath.Join(root, "onnx"))).To(Equal(filepath.Join(root, "voice_styles")))
 	})
 })
 var _ = Describe("resolveLang", func() {
 	It("accepts a valid request language", func() {
 		s := &SupertonicBackend{defaultLang: "na"}
 		Expect(s.resolveLang("ko")).To(Equal("ko"))
 	})
 	It("falls back to the model default for an invalid language", func() {
 		s := &SupertonicBackend{defaultLang: "en"}
 		Expect(s.resolveLang("zz")).To(Equal("en"))
 	})
 	It("falls back to na when nothing is valid", func() {
 		s := &SupertonicBackend{defaultLang: ""}
 		Expect(s.resolveLang("")).To(Equal("na"))
 	})
 })
 var _ = Describe("pcmFloatToInt16LE", func() {
 	It("clamps and encodes little-endian", func() {
 		out := pcmFloatToInt16LE([]float32{0, 1.0, -1.0, 2.0})
 		Expect(out).To(HaveLen(8))
 		Expect(out[0:2]).To(Equal([]byte{0x00, 0x00})) // 0
 		Expect(out[2:4]).To(Equal([]byte{0xff, 0x7f})) // 32767
 		Expect(out[4:6]).To(Equal([]byte{0x01, 0x80})) // -1.0 -> -32767 (scale is 32767)
 		Expect(out[6:8]).To(Equal([]byte{0xff, 0x7f})) // clamp 2.0 -> 32767
 	})
 })
 var _ = Describe("end-to-end synthesis", Ordered, func() {
 	var modelDir string
 	BeforeAll(func() {
 		modelDir = os.Getenv("SUPERTONIC_MODEL_PATH")
 		if modelDir == "" {
 			Skip("set SUPERTONIC_MODEL_PATH to a supertonic model dir to run")
 		}
 		Expect(InitializeONNXRuntime()).To(Succeed())
 	})
 	It("synthesizes a wav file", func() {
 		b := &SupertonicBackend{}
 		Expect(b.Load(&pb.ModelOptions{ModelFile: modelDir, Options: []string{"supertonic.default_voice=F1"}})).To(Succeed())
 		dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
 		lang := "en"
 		Expect(b.TTS(&pb.TTSRequest{Text: "Hello from LocalAI.", Dst: dst, Language: &lang})).To(Succeed())
 		info, err := os.Stat(dst)
 		Expect(err).ToNot(HaveOccurred())
 		Expect(info.Size()).To(BeNumerically(">", 44)) // header + PCM
 	})
 })
--- a/backend/go/supertonic/helper.go
+++ b/backend/go/supertonic/helper.go
--- a/backend/go/supertonic/main.go
+++ b/backend/go/supertonic/main.go
@@ -0,0 +1,27 @@
 package main
 // Started internally by LocalAI; a server is allocated per model.
 import (
 	"flag"
 	grpc "github.com/mudler/LocalAI/pkg/grpc"
 	ort "github.com/yalue/onnxruntime_go"
 )
 var addr = flag.String("addr", "localhost:50051", "the address the gRPC server listens on")
 func main() {
 	flag.Parse()
 	// InitializeONNXRuntime reads ONNXRUNTIME_LIB_PATH (set by run.sh) and
 	// dlopens libonnxruntime before any session is created in Load().
 	if err := InitializeONNXRuntime(); err != nil {
 		panic(err)
 	}
 	defer func() { _ = ort.DestroyEnvironment() }()
 	if err := grpc.StartServer(*addr, &SupertonicBackend{}); err != nil {
 		panic(err)
 	}
 }
--- a/core/services/cloudproxy/ssewire/ssewire_suite_test.go
+++ b/core/services/cloudproxy/ssewire/ssewire_suite_test.go
@@ -1,4 +1,4 @@
-package ssewire
+package main
 import (
 	"testing"
@@ -7,7 +7,7 @@ import (
 	. "github.com/onsi/gomega"
 )
-func TestSsewire(t *testing.T) {
+func TestSupertonic(t *testing.T) {
 	RegisterFailHandler(Fail)
-	RunSpecs(t, "ssewire test suite")
+	RunSpecs(t, "Supertonic backend test suite")
 }
--- a/backend/go/supertonic/package.sh
+++ b/backend/go/supertonic/package.sh
@@ -0,0 +1,49 @@
 #!/bin/bash
 set -e
 CURDIR=$(dirname "$(realpath "$0")")
 REPO_ROOT="${CURDIR}/../../.."
 mkdir -p "$CURDIR/package/lib"
 cp -avf "$CURDIR/supertonic" "$CURDIR/package/"
 cp -avf "$CURDIR/run.sh" "$CURDIR/package/"
 cp -rfLv "$CURDIR"/backend-assets/lib/* "$CURDIR/package/lib/"
 if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    echo "Detected x86_64 architecture, copying x86_64 libraries..."
    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
 elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
    echo "Detected ARM64 architecture, copying ARM64 libraries..."
    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
 else
    echo "Error: Could not detect architecture"
    exit 1
 fi
 GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
 if [ -f "$GPU_LIB_SCRIPT" ]; then
    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
    package_gpu_libs
 fi
 echo "Packaging completed successfully"
 ls -liah "$CURDIR/package/"
 ls -liah "$CURDIR/package/lib/"
--- a/backend/go/supertonic/run.sh
+++ b/backend/go/supertonic/run.sh
@@ -0,0 +1,14 @@
 #!/bin/bash
 set -ex
 CURDIR=$(dirname "$(realpath "$0")")
 export LD_LIBRARY_PATH="$CURDIR/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
 export ONNXRUNTIME_LIB_PATH="$CURDIR/lib/libonnxruntime.so"
 if [ -f "$CURDIR/lib/ld.so" ]; then
 	echo "Using lib/ld.so"
 	exec "$CURDIR/lib/ld.so" "$CURDIR/supertonic" "$@"
 fi
 exec "$CURDIR/supertonic" "$@"
--- a/backend/go/whisper/Makefile
+++ b/backend/go/whisper/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
 # whisper.cpp version
 WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
-WHISPER_CPP_VERSION?=df7638d8229a243af8a4b5a8ae557e0d74e0a0ae
+WHISPER_CPP_VERSION?=43d78af5be58f41d6ffbc227d608f104577741ea
 SO_TARGET?=libgowhisper.so
 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -178,6 +178,37 @@
    nvidia-cuda-12: "cuda12-parakeet-cpp"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-parakeet-cpp"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-parakeet-cpp"
 - &ced
  name: "ced"
  alias: "ced"
  license: mit
  icon: https://avatars.githubusercontent.com/u/95302084
  description: |
    CED sound-event classification / audio tagging (527-class AudioSet).
    ced.cpp is a C++/ggml port that performs audio tagging over the AudioSet
    taxonomy, exposed through the SoundDetection gRPC rpc and the
    /v1/audio/classification REST endpoint. It runs on CPU, NVIDIA CUDA,
    AMD ROCm/HIP, Intel SYCL, Vulkan and NVIDIA Jetson (L4T) targets.
  urls:
    - https://github.com/mudler/ced.cpp
  tags:
    - audio-classification
    - CPU
    - GPU
    - CUDA
    - HIP
  capabilities:
    default: "cpu-ced"
    nvidia: "cuda12-ced"
    intel: "intel-sycl-f16-ced"
    metal: "metal-ced"
    amd: "rocm-ced"
    vulkan: "vulkan-ced"
    nvidia-l4t: "nvidia-l4t-arm64-ced"
    nvidia-cuda-13: "cuda13-ced"
    nvidia-cuda-12: "cuda12-ced"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-ced"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ced"
 - &voxtral
  name: "voxtral"
  alias: "voxtral"
@@ -458,6 +489,126 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-locate-anything-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-vulkan-locate-anything-cpp
 - &depthanything
  name: "depth-anything"
  alias: "depth-anything"
  license: apache-2.0
  description: |
    Depth Anything 3 monocular metric depth + camera pose estimation in C/C++
    using GGML. Loads pre-built GGUF weights and, given an image, returns a
    dense depth map plus the recovered camera extrinsics (3x4) and intrinsics
    (3x3). No Python at inference (purego, cgo-less).
  urls:
    - https://github.com/mudler/depth-anything.cpp
    - https://huggingface.co/depth-anything/Depth-Anything-V3
  tags:
    - depth-estimation
    - camera-pose
    - depth-anything
    - gpu
    - cpu
  capabilities:
    default: "cpu-depth-anything-cpp"
    nvidia: "cuda12-depth-anything-cpp"
    nvidia-cuda-12: "cuda12-depth-anything-cpp"
    nvidia-cuda-13: "cuda13-depth-anything-cpp"
    nvidia-l4t: "nvidia-l4t-arm64-depth-anything-cpp"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-depth-anything-cpp"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-depth-anything-cpp"
    intel: "intel-sycl-f32-depth-anything-cpp"
    vulkan: "vulkan-depth-anything-cpp"
 - !!merge <<: *depthanything
  name: "depth-anything-development"
  capabilities:
    default: "cpu-depth-anything-cpp-development"
    nvidia: "cuda12-depth-anything-cpp-development"
    nvidia-cuda-12: "cuda12-depth-anything-cpp-development"
    nvidia-cuda-13: "cuda13-depth-anything-cpp-development"
    nvidia-l4t: "nvidia-l4t-arm64-depth-anything-cpp-development"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-depth-anything-cpp-development"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-depth-anything-cpp-development"
    intel: "intel-sycl-f32-depth-anything-cpp-development"
    vulkan: "vulkan-depth-anything-cpp-development"
 - !!merge <<: *depthanything
  name: "cpu-depth-anything-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-depth-anything-cpp"
  mirrors:
    - localai/localai-backends:latest-cpu-depth-anything-cpp
 - !!merge <<: *depthanything
  name: "cpu-depth-anything-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-depth-anything-cpp"
  mirrors:
    - localai/localai-backends:master-cpu-depth-anything-cpp
 - !!merge <<: *depthanything
  name: "cuda12-depth-anything-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-depth-anything-cpp"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-12-depth-anything-cpp
 - !!merge <<: *depthanything
  name: "cuda12-depth-anything-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-depth-anything-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-depth-anything-cpp
 - !!merge <<: *depthanything
  name: "cuda13-depth-anything-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-depth-anything-cpp"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-13-depth-anything-cpp
 - !!merge <<: *depthanything
  name: "cuda13-depth-anything-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-depth-anything-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-13-depth-anything-cpp
 - !!merge <<: *depthanything
  name: "nvidia-l4t-arm64-depth-anything-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-depth-anything-cpp"
  mirrors:
    - localai/localai-backends:latest-nvidia-l4t-arm64-depth-anything-cpp
 - !!merge <<: *depthanything
  name: "nvidia-l4t-arm64-depth-anything-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-depth-anything-cpp"
  mirrors:
    - localai/localai-backends:master-nvidia-l4t-arm64-depth-anything-cpp
 - !!merge <<: *depthanything
  name: "cuda13-nvidia-l4t-arm64-depth-anything-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-depth-anything-cpp"
  mirrors:
    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-depth-anything-cpp
 - !!merge <<: *depthanything
  name: "cuda13-nvidia-l4t-arm64-depth-anything-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-depth-anything-cpp"
  mirrors:
    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-depth-anything-cpp
 - !!merge <<: *depthanything
  name: "intel-sycl-f32-depth-anything-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-depth-anything-cpp"
  mirrors:
    - localai/localai-backends:latest-gpu-intel-sycl-f32-depth-anything-cpp
 - !!merge <<: *depthanything
  name: "intel-sycl-f32-depth-anything-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-depth-anything-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-intel-sycl-f32-depth-anything-cpp
 - !!merge <<: *depthanything
  name: "intel-sycl-f16-depth-anything-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-depth-anything-cpp"
  mirrors:
    - localai/localai-backends:latest-gpu-intel-sycl-f16-depth-anything-cpp
 - !!merge <<: *depthanything
  name: "intel-sycl-f16-depth-anything-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-depth-anything-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-intel-sycl-f16-depth-anything-cpp
 - !!merge <<: *depthanything
  name: "vulkan-depth-anything-cpp"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-depth-anything-cpp"
  mirrors:
    - localai/localai-backends:latest-gpu-vulkan-depth-anything-cpp
 - !!merge <<: *depthanything
  name: "vulkan-depth-anything-cpp-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-depth-anything-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-vulkan-depth-anything-cpp
 - &vllm
  name: "vllm"
  license: apache-2.0
@@ -879,6 +1030,42 @@
    nvidia-l4t: "vulkan-localvqe"
    nvidia-l4t-cuda-12: "vulkan-localvqe"
    nvidia-l4t-cuda-13: "vulkan-localvqe"
 - &privacyfilter
  name: "privacy-filter"
  alias: "privacy-filter"
  icon: https://cdn-avatars.huggingface.co/v1/production/uploads/5fd5e18a90b6dc4633f6d292/QPiv8pt4JNxr0FdGnpFef.png
  description: |
    Standalone GGML engine (privacy-filter.cpp) for the OpenMed privacy-filter
    PII/NER token-classification model family. It runs the openai-privacy-filter
    architecture (a gpt-oss-style sparse-MoE bidirectional token classifier) on
    stock upstream GGML — no llama.cpp coupling and no Python — and serves the
    TokenClassify RPC (constrained BIOES Viterbi decode into UTF-8 byte-offset
    entity spans) used by LocalAI's NER PII redaction tier.
  urls:
    - https://github.com/localai-org/privacy-filter.cpp
  tags:
    - token-classification
    - ner
    - pii
    - privacy
    - CPU
    - GPU
  license: apache-2.0
  # Builds: CPU (amd64+arm64 manifest), Vulkan (amd64) and CUDA 13 (amd64).
  # Only a host that actually reports CUDA 13 gets the CUDA image (it bundles
  # the CUDA 13 runtime and needs a recent driver); every other GPU — including
  # NVIDIA without a CUDA-13 toolkit, AMD and Intel — routes to the Vulkan
  # image, which only needs a Vulkan ICD. Everything else (incl. arm64/Jetson,
  # where Vulkan/CUDA images are a future add) falls back to the CPU build,
  # already fast for this ~50M-active-param model.
  capabilities:
    default: "cpu-privacy-filter"
    nvidia: "vulkan-privacy-filter"
    nvidia-cuda-12: "vulkan-privacy-filter"
    nvidia-cuda-13: "cuda13-privacy-filter"
    amd: "vulkan-privacy-filter"
    intel: "vulkan-privacy-filter"
    vulkan: "vulkan-privacy-filter"
 - &faster-whisper
  icon: https://avatars.githubusercontent.com/u/1520500?s=200&v=4
  description: |
@@ -1368,6 +1555,20 @@
    nvidia: "cuda12-sherpa-onnx"
    nvidia-cuda-12: "cuda12-sherpa-onnx"
    metal: "metal-sherpa-onnx"
 - &supertonic
  name: "supertonic"
  alias: "supertonic"
  urls:
    - https://github.com/supertone-inc/supertonic
  description: |
    Supertonic backend: lightning-fast, on-device multilingual text-to-speech via ONNX Runtime.
    Runs Supertone's flow-matching TTS model (Supertone/supertonic-3), 44.1kHz output, 31 languages,
    multiple preset voice styles. No espeak-ng dependency.
  tags:
    - text-to-speech
    - TTS
  capabilities:
    default: "cpu-supertonic"
 - !!merge <<: *neutts
  name: "neutts-development"
  capabilities:
@@ -2480,6 +2681,121 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-parakeet-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-13-parakeet-cpp
 ## ced
 - !!merge <<: *ced
  name: "ced-development"
  capabilities:
    default: "cpu-ced-development"
    nvidia: "cuda12-ced-development"
    intel: "intel-sycl-f16-ced-development"
    metal: "metal-ced-development"
    amd: "rocm-ced-development"
    vulkan: "vulkan-ced-development"
    nvidia-l4t: "nvidia-l4t-arm64-ced-development"
    nvidia-cuda-13: "cuda13-ced-development"
    nvidia-cuda-12: "cuda12-ced-development"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-ced-development"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ced-development"
 - !!merge <<: *ced
  name: "nvidia-l4t-arm64-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-ced"
  mirrors:
    - localai/localai-backends:latest-nvidia-l4t-arm64-ced
 - !!merge <<: *ced
  name: "nvidia-l4t-arm64-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-ced"
  mirrors:
    - localai/localai-backends:master-nvidia-l4t-arm64-ced
 - !!merge <<: *ced
  name: "cuda13-nvidia-l4t-arm64-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-ced"
  mirrors:
    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-ced
 - !!merge <<: *ced
  name: "cuda13-nvidia-l4t-arm64-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-ced"
  mirrors:
    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-ced
 - !!merge <<: *ced
  name: "cpu-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-ced"
  mirrors:
    - localai/localai-backends:latest-cpu-ced
 - !!merge <<: *ced
  name: "cpu-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-ced"
  mirrors:
    - localai/localai-backends:master-cpu-ced
 - !!merge <<: *ced
  name: "metal-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-ced"
  mirrors:
    - localai/localai-backends:latest-metal-darwin-arm64-ced
 - !!merge <<: *ced
  name: "metal-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-ced"
  mirrors:
    - localai/localai-backends:master-metal-darwin-arm64-ced
 - !!merge <<: *ced
  name: "cuda12-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-ced"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-12-ced
 - !!merge <<: *ced
  name: "cuda12-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-ced"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-ced
 - !!merge <<: *ced
  name: "rocm-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-ced"
  mirrors:
    - localai/localai-backends:latest-gpu-rocm-hipblas-ced
 - !!merge <<: *ced
  name: "rocm-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-ced"
  mirrors:
    - localai/localai-backends:master-gpu-rocm-hipblas-ced
 - !!merge <<: *ced
  name: "intel-sycl-f32-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-ced"
  mirrors:
    - localai/localai-backends:latest-gpu-intel-sycl-f32-ced
 - !!merge <<: *ced
  name: "intel-sycl-f32-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-ced"
  mirrors:
    - localai/localai-backends:master-gpu-intel-sycl-f32-ced
 - !!merge <<: *ced
  name: "intel-sycl-f16-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-ced"
  mirrors:
    - localai/localai-backends:latest-gpu-intel-sycl-f16-ced
 - !!merge <<: *ced
  name: "intel-sycl-f16-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-ced"
  mirrors:
    - localai/localai-backends:master-gpu-intel-sycl-f16-ced
 - !!merge <<: *ced
  name: "vulkan-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-ced"
  mirrors:
    - localai/localai-backends:latest-gpu-vulkan-ced
 - !!merge <<: *ced
  name: "vulkan-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-ced"
  mirrors:
    - localai/localai-backends:master-gpu-vulkan-ced
 - !!merge <<: *ced
  name: "cuda13-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-ced"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-13-ced
 - !!merge <<: *ced
  name: "cuda13-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-ced"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-13-ced
 ## stablediffusion-ggml
 - !!merge <<: *stablediffusionggml
  name: "cpu-stablediffusion-ggml"
@@ -2569,6 +2885,37 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-stablediffusion-ggml"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-13-stablediffusion-ggml
 ## privacy-filter
 - !!merge <<: *privacyfilter
  name: "cpu-privacy-filter"
  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-privacy-filter"
  mirrors:
    - localai/localai-backends:latest-cpu-privacy-filter
 - !!merge <<: *privacyfilter
  name: "cpu-privacy-filter-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-privacy-filter"
  mirrors:
    - localai/localai-backends:master-cpu-privacy-filter
 - !!merge <<: *privacyfilter
  name: "vulkan-privacy-filter"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-privacy-filter"
  mirrors:
    - localai/localai-backends:latest-gpu-vulkan-privacy-filter
 - !!merge <<: *privacyfilter
  name: "vulkan-privacy-filter-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-privacy-filter"
  mirrors:
    - localai/localai-backends:master-gpu-vulkan-privacy-filter
 - !!merge <<: *privacyfilter
  name: "cuda13-privacy-filter"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-privacy-filter"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-13-privacy-filter
 - !!merge <<: *privacyfilter
  name: "cuda13-privacy-filter-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-privacy-filter"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-13-privacy-filter
 # vllm
 - !!merge <<: *vllm
  name: "vllm-development"
@@ -5132,3 +5479,18 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-sherpa-onnx"
  mirrors:
    - localai/localai-backends:master-metal-darwin-arm64-sherpa-onnx
 ## supertonic
 - !!merge <<: *supertonic
  name: "supertonic-development"
  capabilities:
    default: "cpu-supertonic-development"
 - !!merge <<: *supertonic
  name: "cpu-supertonic"
  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-supertonic"
  mirrors:
    - localai/localai-backends:latest-cpu-supertonic
 - !!merge <<: *supertonic
  name: "cpu-supertonic-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-supertonic"
  mirrors:
    - localai/localai-backends:master-cpu-supertonic
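Note on the `privacy-filter` capabilities map added to backend/index.yaml above: its comment describes a routing rule in which a host reporting CUDA 13 gets the CUDA image, other listed GPU hosts get the Vulkan image, and anything not listed falls back to the CPU build. A minimal sketch of that lookup follows, assuming the resolver does a plain key lookup with a `default` fallback. The function name `resolve_backend` and the lookup behaviour are illustrative assumptions, not LocalAI's actual implementation.

```python
# Hypothetical sketch of capability -> backend resolution. The mapping is
# copied from the privacy-filter entry in backend/index.yaml; the lookup
# logic itself is an assumption made for illustration only.
PRIVACY_FILTER_CAPABILITIES = {
    "default": "cpu-privacy-filter",
    "nvidia": "vulkan-privacy-filter",
    "nvidia-cuda-12": "vulkan-privacy-filter",
    "nvidia-cuda-13": "cuda13-privacy-filter",
    "amd": "vulkan-privacy-filter",
    "intel": "vulkan-privacy-filter",
    "vulkan": "vulkan-privacy-filter",
}


def resolve_backend(capabilities: dict, reported: str) -> str:
    """Return the backend for the reported capability key.

    Keys missing from the map (for example "nvidia-l4t" or "metal" here)
    fall back to the "default" entry, which is the CPU build.
    """
    return capabilities.get(reported, capabilities["default"])


if __name__ == "__main__":
    for host in ("nvidia-cuda-13", "nvidia-cuda-12", "amd", "nvidia-l4t"):
        print(host, "->", resolve_backend(PRIVACY_FILTER_CAPABILITIES, host))
```

Under this assumption, only the `nvidia-cuda-13` key selects the CUDA image, which matches the comment's statement that NVIDIA hosts without CUDA 13 are routed to Vulkan.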
--- a/backend/python/diffusers/requirements-cpu.txt
+++ b/backend/python/diffusers/requirements-cpu.txt
@@ -1,7 +1,7 @@
 --extra-index-url https://download.pytorch.org/whl/cpu
-git+https://github.com/huggingface/diffusers
+diffusers==0.38.0
 opencv-python
-transformers
+transformers==4.57.6
 torchvision==0.22.1
 accelerate
 git+https://github.com/xhinker/sd_embed
@@ -10,9 +10,15 @@ sentencepiece
 torch==2.7.1
 optimum-quanto
 ftfy
-# TODO: re-add compel once it supports transformers >= 5.
-# Tracking: https://github.com/damian0815/compel/pull/129
-#           https://github.com/damian0815/compel/issues/128
-# compel currently pins transformers~=4.25, which forced pip into multi-hour
-# resolver backtracking storms in CI. backend.py imports it lazily and gates
-# the COMPEL=1 env var on the import succeeding, so dropping it here is safe.
+# diffusers and transformers are pinned together on purpose. transformers v5
+# restructured CLIPTextModel and dropped the `.text_model` attribute, which
+# breaks single-file Stable Diffusion loading on every released diffusers
+# (<=0.38.0); only unreleased diffusers main supports transformers v5. Tracking
+# main via git froze whichever broken pair existed at image-build time. Pin the
+# last known-good released pair so builds are reproducible and can't drift into
+# the broken window. See https://github.com/mudler/LocalAI/issues/9979
+#
+# compel is intentionally omitted: it pins transformers~=4.25, which conflicts
+# with this pin and previously forced pip into multi-hour resolver backtracking
+# storms in CI. backend.py imports it lazily and gates the COMPEL=1 env var on
+# the import succeeding, so dropping it here is safe.