feat(config): default concurrent serving (n_parallel) by GPU VRAM

The llama.cpp backend defaults n_parallel=1, which serializes multi-user
requests and leaves continuous batching off (it auto-enables only at
n_parallel>1). Fold a VRAM-scaled parallel-slot default into the
hardware-config path so multi-user serving works out of the box:
>=32GiB->8, >=8GiB->4, >=4GiB->2, else unchanged.

With the backend's unified KV the slots SHARE the context budget, so this
adds concurrency without multiplying KV memory. Explicit
parallel/n_parallel always wins.

EnsureParallelOption is shared by the single-host path
(ApplyHardwareDefaults with the local GPU) and the distributed router (per
selected node's reported VRAM, since the frontend may have no GPU).
LocalGPU now also reports VRAM.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
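
The VRAM thresholds described in the commit message can be sketched in Go roughly as follows. This is a minimal illustration, not the code from this change: `defaultParallelSlots` and `resolveParallel` are hypothetical names, the real `EnsureParallelOption` signature is not shown in this excerpt, and returning 0 to mean "leave the backend default unchanged" is an assumption made for the sketch.

```go
package main

import "fmt"

// gib is one gibibyte in bytes.
const gib = uint64(1) << 30

// defaultParallelSlots is a hypothetical helper that maps reported GPU VRAM
// to a parallel-slot default using the thresholds from the commit message.
// It returns 0 when the value should be left unchanged (assumed convention).
func defaultParallelSlots(vramBytes uint64) int {
	switch {
	case vramBytes >= 32*gib:
		return 8
	case vramBytes >= 8*gib:
		return 4
	case vramBytes >= 4*gib:
		return 2
	default:
		return 0
	}
}

// resolveParallel is a hypothetical wrapper: an explicit, positive setting
// always wins; otherwise the VRAM-scaled default applies.
func resolveParallel(explicit int, vramBytes uint64) int {
	if explicit > 0 {
		return explicit
	}
	return defaultParallelSlots(vramBytes)
}

func main() {
	for _, v := range []uint64{2 * gib, 4 * gib, 16 * gib, 48 * gib} {
		fmt.Printf("%d GiB -> %d\n", v/gib, resolveParallel(0, v))
	}
}
```

Keeping the threshold table separate from the "explicit wins" check mirrors the message's point that one helper serves both the single-host path and the distributed router: only the source of the VRAM figure differs between the two callers.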
test(config): injectable local-GPU seam + single-instance coverage
2026-06-20 06:39:01 -04:00 · 2026-06-20 09:35:04 +00:00 · 2026-06-19 22:18:27 +00:00 · 2026-06-19 22:02:14 +00:00 · 2026-06-19 21:36:25 +02:00 · 2026-06-19 21:35:21 +02:00
364 changed files with 21022 additions and 6126 deletions
--- a/.agents/ds4-backend.md
+++ b/.agents/ds4-backend.md
@@ -44,6 +44,39 @@ maps to `DS4_THINK_HIGH`. We pass the chosen mode to `ds4_chat_append_assistant_
 via `ModelOptions.Options[] = "kv_cache_dir:/some/path"`. Format is **our own** -
 NOT bit-compatible with ds4-server's KVC files (interop is a follow-up plan).

+## Engine options (LoadModel)
+
+`LoadModel` maps `ModelOptions.Options[]` (`"key:value"`, from model-YAML
+`options:`) onto `ds4_engine_options` through a **declarative table**
+(`kEngineOptSpecs` + `apply_engine_option` in `grpc-server.cpp`). The struct is
+plain C with no reflection, so the field set is enumerated once in the table;
+adding a future engine knob is a one-line table row, not a new branch. Unknown
+keys are ignored (back-compat). A bare flag (`ssd_streaming` with no value)
+means `true`. Path-type values (`mtp_path`, `expert_profile_path`,
+`directional_steering_file`) resolve **relative to the model directory**, so a
+gallery entry can reference a companion file it downloaded by bare filename;
+absolute values pass through. `ds4_role` / `ds4_layers` / `ds4_listen` /
+`ds4_route_timeout` / `kv_cache_dir` keep their dedicated handling (validation
+and coordinator wiring) and are not in the table.
+
+Wired keys: `mtp_path`, `mtp_draft`, `mtp_margin`, `prefill_chunk`,
+`power_percent`, `warm_weights`, `quality`, `ssd_streaming`,
+`ssd_streaming_cold`, `ssd_streaming_preload_experts`,
+`ssd_streaming_cache_experts` (count or `NGB`, sets both experts+bytes via
+`ds4_parse_streaming_cache_experts_arg`), `simulate_used_memory` (`NGB` via
+`ds4_parse_gib_arg`), `expert_profile_path`, `directional_steering_file`,
+`directional_steering_attn`, `directional_steering_ffn`.
+
+## SSD streaming (running models larger than RAM)
+
+ds4's **SSD streaming** keeps non-routed weights resident and streams routed MoE
+experts from the GGUF on cache misses, turning "does it fit in RAM" into a speed
+spectrum. **Metal (Darwin) only** - it is a no-op on CUDA/CPU. Enable with
+`options: ["ssd_streaming"]`; size the routed-expert cache with
+`ssd_streaming_cache_experts:NGB` (omit for ds4's automatic 80%-of-working-set
+budget). Gallery entries built on this: `deepseek-v4-flash-q4-ssd` (153 GB Flash
+on a 128 GB Mac) and `deepseek-v4-pro-q2-ssd` (433 GB Pro, experimental).
+
 ## Build matrix

 | Build | Where | Notes |
--- a/.docker/install-base-deps.sh
+++ b/.docker/install-base-deps.sh
@@ -70,6 +70,12 @@ if [ "${BUILD_TYPE:-}" = "vulkan" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; t
        git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
        ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
        clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
+    # Mesa Vulkan ICD drivers (ANV/RADV/lavapipe + Arm SoC) and their ICD
+    # manifests. The LunarG SDK below only provides the loader and shader
+    # tooling, not hardware drivers — without Mesa the packaged Vulkan backend
+    # would ship a loader that finds no GPU. package-gpu-libs.sh bundles these
+    # .so files plus their deps into the backend so it stays self-contained.
+    apt-get install -y mesa-vulkan-drivers libdrm2
    if [ "amd64" = "${TARGETARCH:-}" ]; then
        wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz"
        tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz
--- a/.dockerignore
+++ b/.dockerignore
@@ -31,6 +31,15 @@ backend/python/**/source
 backend/cpp/llama-cpp/llama.cpp
 backend/cpp/llama-cpp-*-build

+# privacy-filter: same in-place pattern. The Makefile fetches privacy-filter.cpp
+# at the pinned commit (or symlinks a PRIVACY_FILTER_SRC checkout for local dev).
+# A stale dir/symlink COPY'd into the image makes the clone step fail (dangling
+# symlink) or compile against the wrong commit, so keep host build state out.
+backend/cpp/privacy-filter/privacy-filter.cpp
+backend/cpp/privacy-filter/build
+backend/cpp/privacy-filter/grpc-server
+backend/cpp/privacy-filter/package
+
 # Rust backend build output (sources are tracked; target/ is generated)
 backend/rust/*/target

--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -716,6 +716,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "12"
+    cuda-minor-version: "8"
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-12-depth-anything-cpp'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "depth-anything-cpp"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "8"
@@ -1582,6 +1595,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-13-depth-anything-cpp'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "depth-anything-cpp"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -1621,6 +1647,19 @@ include:
    backend: "locate-anything-cpp"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/arm64'
+    skip-drivers: 'false'
+    tag-latest: 'auto'
+    tag-suffix: '-nvidia-l4t-cuda-13-arm64-depth-anything-cpp'
+    base-image: "ubuntu:24.04"
+    ubuntu-version: '2404'
+    runs-on: 'ubuntu-24.04-arm'
+    backend: "depth-anything-cpp"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -2631,6 +2670,78 @@ include:
    dockerfile: "./backend/Dockerfile.ds4"
    context: "./"
    ubuntu-version: '2404'
+  # privacy-filter: PII/NER token classifier (per-arch native -> manifest merge).
+  # Every variant builds FROM a prebuilt quay.io/go-skynet/ci-cache:base-grpc-*
+  # image (gRPC + cmake + protoc + conditional CUDA/Vulkan already installed),
+  # exactly like llama-cpp — no toolchain is installed in Dockerfile.privacy-filter.
+  # builder-base-image makes the workflow use the Dockerfile's builder-prebuilt
+  # stage; without it (local builds) the builder-fromsource stage runs the same
+  # .docker/install-base-deps.sh.
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    platform-tag: 'amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-privacy-filter'
+    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-amd64'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'true'
+    backend: "privacy-filter"
+    dockerfile: "./backend/Dockerfile.privacy-filter"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/arm64'
+    platform-tag: 'arm64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-privacy-filter'
+    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-arm64'
+    runs-on: 'ubuntu-24.04-arm'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'true'
+    backend: "privacy-filter"
+    dockerfile: "./backend/Dockerfile.privacy-filter"
+    context: "./"
+    ubuntu-version: '2404'
+  # Vulkan: base-grpc-vulkan-amd64 carries the SDK. arm64 vulkan is a one-line
+  # add once amd64 is proven in CI.
+  - build-type: 'vulkan'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    platform-tag: 'amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-vulkan-privacy-filter'
+    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-vulkan-amd64'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "privacy-filter"
+    dockerfile: "./backend/Dockerfile.privacy-filter"
+    context: "./"
+    ubuntu-version: '2404'
+  # CUDA: base-grpc-cuda-13-amd64 carries the toolkit; BUILD_TYPE=cublas ->
+  # -DPF_CUDA=ON. cuda-12 and arm64/l4t are one-line adds once cuda-13 amd64 is
+  # proven in CI.
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/amd64'
+    platform-tag: 'amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-13-privacy-filter'
+    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'true'
+    backend: "privacy-filter"
+    dockerfile: "./backend/Dockerfile.privacy-filter"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2898,6 +3009,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-depth-anything-cpp'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "depth-anything-cpp"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'sycl_f32'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2911,6 +3035,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'sycl_f32'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-intel-sycl-f32-depth-anything-cpp'
+    runs-on: 'ubuntu-latest'
+    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+    skip-drivers: 'false'
+    backend: "depth-anything-cpp"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'sycl_f16'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2924,6 +3061,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'sycl_f16'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-intel-sycl-f16-depth-anything-cpp'
+    runs-on: 'ubuntu-latest'
+    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+    skip-drivers: 'false'
+    backend: "depth-anything-cpp"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2938,6 +3088,20 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'vulkan'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    platform-tag: 'amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-vulkan-depth-anything-cpp'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "depth-anything-cpp"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2952,6 +3116,20 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'vulkan'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/arm64'
+    platform-tag: 'arm64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-vulkan-depth-anything-cpp'
+    runs-on: 'ubuntu-24.04-arm'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "depth-anything-cpp"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'sycl_f32'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -3058,6 +3236,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2204'
+  - build-type: 'cublas'
+    cuda-major-version: "12"
+    cuda-minor-version: "0"
+    platforms: 'linux/arm64'
+    skip-drivers: 'false'
+    tag-latest: 'auto'
+    tag-suffix: '-nvidia-l4t-arm64-depth-anything-cpp'
+    base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
+    runs-on: 'ubuntu-24.04-arm'
+    backend: "depth-anything-cpp"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2204'
  # whisper
  - build-type: ''
    cuda-major-version: ""
@@ -4490,6 +4681,36 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  # supertonic CPU (amd64)
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    platform-tag: 'amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-supertonic'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "supertonic"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
+  # supertonic CPU (arm64)
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/arm64'
+    platform-tag: 'arm64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-supertonic'
+    runs-on: 'ubuntu-24.04-arm'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "supertonic"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'

 # Darwin matrix (consumed by backend-jobs-darwin).
 includeDarwin:
--- a/.github/workflows/backend_build_darwin.yml
+++ b/.github/workflows/backend_build_darwin.yml
@@ -98,6 +98,7 @@ jobs:
            /opt/homebrew/Cellar/hiredis
            /opt/homebrew/Cellar/xxhash
            /opt/homebrew/Cellar/zstd
+            /opt/homebrew/Cellar/nlohmann-json
          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}

      - name: Dependencies
@@ -109,7 +110,10 @@ jobs:
          # Without explicitly installing them, a brew cache-hit run restores
          # ccache's Cellar dir but skips installing those transitive deps,
          # and ccache fails at runtime with `dyld: Library not loaded`.
-          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd
+          # nlohmann-json is header-only and required by the ds4 backend
+          # (dsml_renderer.cpp includes <nlohmann/json.hpp>); on Linux it comes
+          # from the apt-installed nlohmann-json3-dev in the build image.
+          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd nlohmann-json
          # Force-reinstall ccache so brew re-validates its full runtime-dep
          # closure on every run. This is the durable fix: when the upstream
          # ccache formula gains a new transitive dep (as it has multiple times
@@ -128,7 +132,7 @@ jobs:
          # and decides "already installed" without re-linking, so on a cache-
          # hit run the formulas aren't on PATH. Force-link them; --overwrite
          # tolerates pre-existing symlinks from earlier installs.
-          brew link --overwrite protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd 2>/dev/null || true
+          brew link --overwrite protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd nlohmann-json 2>/dev/null || true

      - name: Save Homebrew cache
        if: github.event_name != 'pull_request' && steps.brew-cache.outputs.cache-hit != 'true'
@@ -148,6 +152,7 @@ jobs:
            /opt/homebrew/Cellar/hiredis
            /opt/homebrew/Cellar/xxhash
            /opt/homebrew/Cellar/zstd
+            /opt/homebrew/Cellar/nlohmann-json
          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}

      # ---- ccache for llama.cpp CMake builds ----
--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -26,6 +26,10 @@ jobs:
            variable: "DS4_VERSION"
            branch: "main"
            file: "backend/cpp/ds4/Makefile"
+          - repository: "localai-org/privacy-filter.cpp"
+            variable: "PRIVACY_FILTER_VERSION"
+            branch: "master"
+            file: "backend/cpp/privacy-filter/Makefile"
          - repository: "ggml-org/whisper.cpp"
            variable: "WHISPER_CPP_VERSION"
            branch: "master"
@@ -38,6 +42,10 @@ jobs:
            variable: "PARAKEET_VERSION"
            branch: "master"
            file: "backend/go/parakeet-cpp/Makefile"
+          - repository: "mudler/depth-anything.cpp"
+            variable: "DEPTHANYTHING_VERSION"
+            branch: "master"
+            file: "backend/go/depth-anything-cpp/Makefile"
          - repository: "leejet/stable-diffusion.cpp"
            variable: "STABLEDIFFUSION_GGML_VERSION"
            branch: "master"
@@ -66,9 +74,9 @@ jobs:
            variable: "LOCATEANYTHING_VERSION"
            branch: "master"
            file: "backend/go/locate-anything-cpp/Makefile"
-          - repository: "predict-woo/qwen3-tts.cpp"
+          - repository: "ServeurpersoCom/qwentts.cpp"
            variable: "QWEN3TTS_CPP_VERSION"
-            branch: "main"
+            branch: "master"
            file: "backend/go/qwen3-tts-cpp/Makefile"
          - repository: "ServeurpersoCom/omnivoice.cpp"
            variable: "OMNIVOICE_VERSION"
--- a/.github/workflows/secscan.yaml
+++ b/.github/workflows/secscan.yaml
@@ -21,7 +21,10 @@ jobs:
        uses: securego/gosec@v2.27.1
        with:
          # we let the report trigger content trigger a failure using the GitHub Security features.
-          args: '-no-fail -fmt sarif -out results.sarif ./...'
+          # backend/go/supertonic is excluded: it vendors upstream supertone-inc/supertonic
+          # (helper.go), whose findings (G304 model-file loads, G404 math/rand for flow-matching
+          # noise, G104 unhandled errors) are inherent to that upstream code, not ours to rewrite.
+          args: '-no-fail -exclude-dir=backend/go/supertonic -fmt sarif -out results.sarif ./...'
      - name: Upload SARIF file
        if: ${{ github.actor != 'dependabot[bot]' }}
        uses: github/codeql-action/upload-sarif@v4
--- a/.golangci.yml
+++ b/.golangci.yml
@@ -74,6 +74,8 @@ linters:
    paths:
      # Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
      - 'backend/go/whisper/sources'
+      # Vendored upstream supertonic pipeline (supertone-inc/supertonic go/helper.go).
+      - 'backend/go/supertonic/helper.go'
      - 'docs/'
    rules:
      # CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
--- a/Makefile
+++ b/Makefile
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter

 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -595,6 +595,8 @@ test-extra: prepare-test-extra
 	$(MAKE) -C backend/rust/kokoros test
 	$(MAKE) -C backend/go/rfdetr-cpp test
 	$(MAKE) -C backend/go/locate-anything-cpp test
+	$(MAKE) -C backend/go/depth-anything-cpp test
+	$(MAKE) -C backend/go/supertonic test

 ##
 ## End-to-end gRPC tests that exercise a built backend container image.
@@ -1162,6 +1164,10 @@ BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
 # Single-model; hardware-only validation lives at tests/e2e-backends/
 # (BACKEND_BINARY mode); see docs/superpowers/plans/2026-05-11-ds4-backend.md.
 BACKEND_DS4 = ds4|ds4|.|false|false
+# privacy-filter wraps the standalone privacy-filter.cpp GGML engine (the
+# openai-privacy-filter PII/NER token classifier) — the TokenClassify RPC for
+# the PII redactor tier, on stock ggml with no llama.cpp carry-patches.
+BACKEND_PRIVACY_FILTER = privacy-filter|privacy-filter|.|false|false

 # Golang backends
 BACKEND_PIPER = piper|golang|.|false|true
@@ -1173,6 +1179,7 @@ BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|tr
 BACKEND_WHISPER = whisper|golang|.|false|true
 BACKEND_CRISPASR = crispasr|golang|.|false|true
 BACKEND_PARAKEET_CPP = parakeet-cpp|golang|.|false|true
+BACKEND_DEPTH_ANYTHING_CPP = depth-anything-cpp|golang|.|false|true
 BACKEND_VOXTRAL = voxtral|golang|.|false|true
 BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
 BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
@@ -1181,6 +1188,7 @@ BACKEND_VIBEVOICE_CPP = vibevoice-cpp|golang|.|false|true
 BACKEND_LOCALVQE = localvqe|golang|.|false|true
 BACKEND_OPUS = opus|golang|.|false|true
 BACKEND_SHERPA_ONNX = sherpa-onnx|golang|.|false|true
+BACKEND_SUPERTONIC = supertonic|golang|.|false|true

 # Python backends with root context
 BACKEND_RERANKERS = rerankers|python|.|false|true
@@ -1254,6 +1262,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
 $(eval $(call generate-docker-build-target,$(BACKEND_DS4)))
+$(eval $(call generate-docker-build-target,$(BACKEND_PRIVACY_FILTER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_CLOUD_PROXY)))
@@ -1263,6 +1272,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_STABLEDIFFUSION_GGML)))
 $(eval $(call generate-docker-build-target,$(BACKEND_WHISPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_CRISPASR)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PARAKEET_CPP)))
+$(eval $(call generate-docker-build-target,$(BACKEND_DEPTH_ANYTHING_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VOXTRAL)))
 $(eval $(call generate-docker-build-target,$(BACKEND_OPUS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_RERANKERS)))
@@ -1308,12 +1318,13 @@ $(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_RFDETR_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
+$(eval $(call generate-docker-build-target,$(BACKEND_SUPERTONIC)))

 # Pattern rule for docker-save targets
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar

-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-omnivoice-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-omnivoice-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy docker-build-supertonic docker-build-depth-anything-cpp docker-build-privacy-filter

 ########################################################
 ### Mock Backend for E2E Tests
--- a/README.md
+++ b/README.md
@@ -29,6 +29,18 @@
 <a href="https://trendshift.io/repositories/5539" target="_blank"><img src="https://trendshift.io/api/badge/repositories/5539" alt="mudler%2FLocalAI | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 </p>

+<!-- Keep these links, translations synced daily. -->
+<p align="center">
+<a href="https://zdoc.app/de/mudler/LocalAI">Deutsch</a> |
+<a href="https://zdoc.app/es/mudler/LocalAI">Español</a> |
+<a href="https://zdoc.app/fr/mudler/LocalAI">français</a> |
+<a href="https://zdoc.app/ja/mudler/LocalAI">日本語</a> |
+<a href="https://zdoc.app/ko/mudler/LocalAI">한국어</a> |
+<a href="https://zdoc.app/pt/mudler/LocalAI">Português</a> |
+<a href="https://zdoc.app/ru/mudler/LocalAI">Русский</a> |
+<a href="https://zdoc.app/zh/mudler/LocalAI">中文</a>
+</p>
+
 **LocalAI** is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.

 **A small core, not a bundle.** Each backend wraps a best-in-class engine (llama.cpp, vLLM, whisper.cpp, stable-diffusion, MLX...) in its own image, pulled only when a model needs it. You install nothing you don't use.
@@ -208,10 +220,26 @@ For older news and full release notes, see [GitHub Releases](https://github.com/

 ## Supported Backends & Acceleration

-LocalAI supports **36+ backends** including llama.cpp, vLLM, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).
+LocalAI supports **60+ backends** including llama.cpp, vLLM, SGLang, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).

 See the full [Backend & Model Compatibility Table](https://localai.io/model-compatibility/) and [GPU Acceleration guide](https://localai.io/features/gpu-acceleration/).

+### Backends built by us
+
+Most backends wrap a best-in-class upstream engine. A handful of them are native C/C++/GGML engines (no Python at inference) developed and maintained by the LocalAI project itself:
+
+| Backend | What it does |
+|---------|-------------|
+| [parakeet.cpp](https://github.com/mudler/parakeet.cpp) | C++/GGML port of NVIDIA NeMo Parakeet ASR (tdt/ctc/rnnt/hybrid), with cache-aware streaming transcription |
+| [voxtral.c](https://github.com/mudler/voxtral.c) | Voxtral Realtime 4B speech-to-text in pure C |
+| [vibevoice.cpp](https://github.com/mudler/vibevoice.cpp) | Native port of Microsoft VibeVoice for TTS (voice cloning) and long-form ASR with speaker diarization |
+| [rf-detr.cpp](https://github.com/mudler/rf-detr.cpp) | Native RF-DETR object detection and instance segmentation |
+| [locate-anything.cpp](https://github.com/mudler/locate-anything.cpp) | Open-vocabulary object detection and visual grounding (LocateAnything-3B) |
+| [depth-anything.cpp](https://github.com/mudler/depth-anything.cpp) | Depth Anything 3 monocular metric depth + camera pose estimation |
+| [privacy-filter.cpp](https://github.com/localai-org/privacy-filter.cpp) | Standalone GGML PII/NER token-classification engine powering LocalAI's PII redaction tier |
+| [LocalVQE](https://github.com/localai-org/LocalVQE) | Joint acoustic echo cancellation, noise suppression, and dereverberation |
+| [local-store](https://github.com/mudler/LocalAI) | Local-first vector database for embeddings (shipped in-tree) |
+
 ## Resources

 - [Documentation](https://localai.io/)
--- a/backend/Dockerfile.golang
+++ b/backend/Dockerfile.golang
@@ -65,7 +65,12 @@ RUN <<EOT bash
            libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
            git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
            ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
-            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
+            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils && \
+        apt-get install -y mesa-vulkan-drivers libdrm2
+        # Mesa Vulkan ICD drivers (ANV/RADV/lavapipe) + their manifests. The
+        # LunarG SDK below only provides the loader and shader tooling, not
+        # hardware drivers — without Mesa, package-gpu-libs.sh has no ICD to
+        # bundle and the packaged backend finds no GPU at runtime.
        if [ "amd64" = "$TARGETARCH" ]; then
            wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
            tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
--- a/backend/Dockerfile.privacy-filter
+++ b/backend/Dockerfile.privacy-filter
@@ -0,0 +1,109 @@
+ARG BASE_IMAGE=ubuntu:24.04
+# BUILDER_BASE_IMAGE defaults to BASE_IMAGE so the Dockerfile parses when no
+# prebuilt base is supplied; the builder-prebuilt stage is only entered when
+# BUILDER_TARGET=builder-prebuilt, so the fallback content is harmless
+# (BuildKit prunes the unreferenced builder).
+ARG BUILDER_BASE_IMAGE=${BASE_IMAGE}
+# BUILDER_TARGET selects which builder stage the scratch image copies from.
+# Declared before any FROM so it is usable in `FROM ${BUILDER_TARGET}`. The
+# backend_build workflow sets it to builder-prebuilt when the matrix entry
+# provides builder-base-image, else builder-fromsource (the local default).
+ARG BUILDER_TARGET=builder-fromsource
+ARG APT_MIRROR=""
+ARG APT_PORTS_MIRROR=""
+
+# privacy-filter: standalone GGML engine for the openai-privacy-filter PII/NER
+# token classifier, wrapped as a LocalAI gRPC backend.
+#
+# Mirrors backend/Dockerfile.llama-cpp: the build toolchain (gRPC + cmake +
+# protoc + conditional CUDA/Vulkan) comes from the shared
+# .docker/install-base-deps.sh (from-source path) or a prebuilt
+# quay.io/go-skynet/ci-cache:base-grpc-* image (CI path) — nothing GPU-specific
+# is hand-rolled here. BUILD_TYPE selects the engine backend in the Makefile:
+# "" = cpu, "cublas" -> -DPF_CUDA=ON, "vulkan" -> -DPF_VULKAN=ON.
+
+# ============================================================================
+# Stage: builder-fromsource — self-contained build. Runs the same install
+# script backend/Dockerfile.base-grpc-builder runs, so this path is
+# bit-equivalent to the prebuilt base. Used when BUILDER_TARGET=builder-fromsource
+# (the default; local `make backends/privacy-filter`).
+# ============================================================================
+FROM ${BASE_IMAGE} AS builder-fromsource
+ARG BUILD_TYPE
+ARG CUDA_MAJOR_VERSION
+ARG CUDA_MINOR_VERSION
+ARG CMAKE_FROM_SOURCE=false
+# CUDA Toolkit 13.x needs CMake 3.31.9+ for correct toolchain/arch detection.
+ARG CMAKE_VERSION=3.31.10
+ARG GRPC_VERSION=v1.65.0
+ARG GRPC_MAKEFLAGS="-j4 -Otarget"
+ARG SKIP_DRIVERS=false
+ARG TARGETARCH
+ARG UBUNTU_VERSION=2404
+ARG APT_MIRROR
+ARG APT_PORTS_MIRROR
+
+ENV BUILD_TYPE=${BUILD_TYPE} \
+    CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
+    CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
+    CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
+    CMAKE_VERSION=${CMAKE_VERSION} \
+    GRPC_VERSION=${GRPC_VERSION} \
+    GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
+    SKIP_DRIVERS=${SKIP_DRIVERS} \
+    TARGETARCH=${TARGETARCH} \
+    UBUNTU_VERSION=${UBUNTU_VERSION} \
+    APT_MIRROR=${APT_MIRROR} \
+    APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
+    DEBIAN_FRONTEND=noninteractive
+# CUDA on PATH (a no-op when CUDA is not installed, e.g. cpu/vulkan builds).
+ENV PATH=/usr/local/cuda/bin:${PATH}
+
+WORKDIR /build
+
+# apt deps + cmake + protoc + gRPC + conditional CUDA/Vulkan, all from the
+# shared script (the source of truth that base-grpc-builder also runs).
+RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
+    --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
+    bash /usr/local/sbin/install-base-deps
+
+# install-base-deps installs gRPC under /opt/grpc; copy it to /usr/local so the
+# backend's find_package(gRPC CONFIG) resolves it at the canonical prefix.
+RUN cp -a /opt/grpc/. /usr/local/
+
+COPY . /LocalAI
+
+RUN --mount=type=cache,target=/root/.ccache,id=privacy-filter-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
+    make -C /LocalAI/backend/cpp/privacy-filter BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
+
+# ============================================================================
+# Stage: builder-prebuilt — FROM a prebuilt
+# quay.io/go-skynet/ci-cache:base-grpc-* image (gRPC at /opt/grpc + apt deps +
+# CUDA/Vulkan already installed). Used in CI when the matrix entry sets
+# builder-base-image.
+# ============================================================================
+FROM ${BUILDER_BASE_IMAGE} AS builder-prebuilt
+ARG BUILD_TYPE
+ARG TARGETARCH
+ENV BUILD_TYPE=${BUILD_TYPE}
+# CUDA on PATH (a no-op for the cpu/vulkan base images).
+ENV PATH=/usr/local/cuda/bin:${PATH}
+
+# Mirror builder-fromsource: the base-grpc image installs gRPC to /opt/grpc but
+# does not copy it to /usr/local.
+RUN cp -a /opt/grpc/. /usr/local/
+
+COPY . /LocalAI
+
+RUN --mount=type=cache,target=/root/.ccache,id=privacy-filter-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
+    make -C /LocalAI/backend/cpp/privacy-filter BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
+
+# ============================================================================
+# Final stage — copy the package output from the selected builder. BuildKit
+# does not expand variables in `COPY --from=`, so alias the chosen builder to a
+# fixed stage name first.
+# ============================================================================
+FROM ${BUILDER_TARGET} AS builder
+
+FROM scratch
+COPY --from=builder /LocalAI/backend/cpp/privacy-filter/package/. ./
--- a/backend/Dockerfile.python
+++ b/backend/Dockerfile.python
@@ -66,7 +66,12 @@ RUN <<EOT bash
            libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
            git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
            ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
-            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
+            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils && \
+        apt-get install -y mesa-vulkan-drivers libdrm2
+        # Mesa Vulkan ICD drivers (ANV/RADV/lavapipe) + their manifests. The
+        # LunarG SDK below only provides the loader and shader tooling, not
+        # hardware drivers — without Mesa, package-gpu-libs.sh has no ICD to
+        # bundle and the packaged backend finds no GPU at runtime.
        if [ "amd64" = "$TARGETARCH" ]; then
            wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
            tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
--- a/backend/backend.proto
+++ b/backend/backend.proto
@@ -24,6 +24,7 @@ service Backend {
  rpc TokenizeString(PredictOptions) returns (TokenizationResponse) {}
  rpc Status(HealthMessage) returns (StatusResponse) {}
  rpc Detect(DetectOptions) returns (DetectResponse) {}
+  rpc Depth(DepthRequest) returns (DepthResponse) {}
  rpc FaceVerify(FaceVerifyRequest) returns (FaceVerifyResponse) {}
  rpc FaceAnalyze(FaceAnalyzeRequest) returns (FaceAnalyzeResponse) {}
  rpc VoiceVerify(VoiceVerifyRequest) returns (VoiceVerifyResponse) {}
@@ -670,6 +671,35 @@ message DetectResponse {
  repeated Detection Detections = 1;
 }

+// --- Depth estimation messages (Depth Anything 3) ---
+
+message DepthRequest {
+  string src = 1;                  // input image (filesystem path or base64-encoded payload)
+  string dst = 2;                  // optional output directory for exports (glb/colmap)
+  bool include_depth = 3;          // return the per-pixel metric depth map
+  bool include_confidence = 4;     // return the per-pixel confidence map (DualDPT)
+  bool include_pose = 5;           // return camera extrinsics/intrinsics (DualDPT)
+  bool include_sky = 6;            // return the per-pixel sky map (mono models)
+  bool include_points = 7;         // back-project to a 3D point cloud (DualDPT)
+  float points_conf_thresh = 8;    // keep points with confidence >= this threshold
+  repeated string exports = 9;     // requested exports: "glb", "colmap"
+}
+
+message DepthResponse {
+  int32 width = 1;                 // processed depth-map width
+  int32 height = 2;                // processed depth-map height
+  repeated float depth = 3;        // width*height row-major metric depth
+  repeated float confidence = 4;   // width*height row-major confidence (DualDPT)
+  repeated float sky = 5;          // width*height row-major sky map (mono)
+  repeated float extrinsics = 6;   // 12 floats, 3x4 row-major (world-to-camera)
+  repeated float intrinsics = 7;   // 9 floats, 3x3 row-major
+  int32 num_points = 8;            // number of 3D points
+  repeated float points = 9;       // num_points*3 xyz, world space
+  bytes point_colors = 10;         // num_points*3 uint8 rgb
+  repeated string export_paths = 11; // paths written for the requested exports
+  bool is_metric = 12;             // depth is in metric units
+}
+
 // --- Face recognition messages ---

 message FacialArea {
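Reviewer note on the `DepthResponse` layout above: the field comments say `depth` is `width*height` row-major and `intrinsics` is a 3x3 row-major matrix. The following is a minimal sketch of how a consumer might index and back-project a pixel under those comments. `DepthMap`, `depth_at`, and `backproject` are hypothetical names, not part of the generated protobuf API, and the pinhole layout of the intrinsics (fx at index 0, cx at 2, fy at 4, cy at 5) is an assumption the proto does not state.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical plain-struct mirror of the DepthResponse fields used here.
struct DepthMap {
    int width = 0;
    int height = 0;
    std::vector<float> depth;            // width*height, row-major
    std::array<float, 9> intrinsics{};   // 3x3, row-major
};

// Row-major lookup: pixel (x, y) lives at index y*width + x.
inline float depth_at(const DepthMap &m, int x, int y) {
    if (x < 0 || y < 0 || x >= m.width || y >= m.height) {
        throw std::out_of_range("pixel outside depth map");
    }
    return m.depth[static_cast<std::size_t>(y) * m.width + x];
}

// Back-project pixel (x, y) into camera space.
// ASSUMPTION: intrinsics is a pinhole K = [fx 0 cx; 0 fy cy; 0 0 1].
inline std::array<float, 3> backproject(const DepthMap &m, int x, int y) {
    const float fx = m.intrinsics[0], cx = m.intrinsics[2];
    const float fy = m.intrinsics[4], cy = m.intrinsics[5];
    const float z = depth_at(m, x, y);
    return {(x - cx) * z / fx, (y - cy) * z / fy, z};
}
```

The same `y*width + x` indexing would apply to the `confidence` and `sky` arrays, since their comments describe the same layout.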
--- a/backend/cpp/ds4/CMakeLists.txt
+++ b/backend/cpp/ds4/CMakeLists.txt
@@ -9,6 +9,22 @@ option(DS4_NATIVE "Compile with -march=native / -mcpu=native" ON)
 set(DS4_GPU "cpu" CACHE STRING "GPU backend: cpu, cuda, or metal")
 set(DS4_DIR "${CMAKE_CURRENT_SOURCE_DIR}/ds4" CACHE PATH "Path to cloned ds4 source")

+if(${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
+    # Homebrew installs protobuf/grpc under a non-default prefix. The generated
+    # backend.pb.cc / backend.grpc.pb.cc pull in google/protobuf and grpcpp
+    # headers, but the hw_grpc_proto library links neither target, so on macOS
+    # the headers (e.g. google/protobuf/runtime_version.h) are never on the
+    # compiler's include path. Add the Homebrew prefix globally, matching the
+    # llama-cpp backend which builds on Darwin CI.
+    if(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "arm64")
+        set(HOMEBREW_DEFAULT_PREFIX "/opt/homebrew")
+    else()
+        set(HOMEBREW_DEFAULT_PREFIX "/usr/local")
+    endif()
+    link_directories("${HOMEBREW_DEFAULT_PREFIX}/lib")
+    include_directories("${HOMEBREW_DEFAULT_PREFIX}/include")
+endif()
+
 find_package(Threads REQUIRED)
 find_package(Protobuf CONFIG QUIET)
 if(NOT Protobuf_FOUND)
--- a/backend/cpp/ds4/Makefile
+++ b/backend/cpp/ds4/Makefile
@@ -1,10 +1,10 @@
 # ds4 backend Makefile.
 #
-# Upstream pin lives below as DS4_VERSION?=d881f2a05e8ff6bec001315a36b794b4aa310173
+# Upstream pin lives below as DS4_VERSION?=80ebbc396aee40eedc1d829222f3362d10fa4c6c
 # (.github/bump_deps.sh) can find and update it - matches the
 # llama-cpp / ik-llama-cpp / turboquant convention.

-DS4_VERSION?=d881f2a05e8ff6bec001315a36b794b4aa310173
+DS4_VERSION?=80ebbc396aee40eedc1d829222f3362d10fa4c6c
 DS4_REPO?=https://github.com/antirez/ds4

 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
--- a/backend/cpp/ds4/grpc-server.cpp
+++ b/backend/cpp/ds4/grpc-server.cpp
@@ -25,6 +25,8 @@ extern "C" {
 #include <chrono>
 #include <climits>
 #include <csignal>
+#include <cstddef>
+#include <cstdint>
 #include <cstdlib>
 #include <cstring>
 #include <ctime>
@@ -105,6 +107,130 @@ static bool parse_layers_spec(const std::string &spec, ds4_distributed_layers *o
    return true;
 }

+// Parse a boolean LoadModel option. An empty value (a bare flag-style option
+// like "ssd_streaming" with no colon) means true so model YAMLs can write
+// options: ["ssd_streaming"] to enable a switch.
+static bool parse_bool_option(const std::string &s, bool *out) {
+    if (s.empty() || s == "true" || s == "1" || s == "yes" || s == "on") { *out = true; return true; }
+    if (s == "false" || s == "0" || s == "no" || s == "off") { *out = false; return true; }
+    return false;
+}
+
+// Table-driven mapping from LoadModel option keys to ds4_engine_options fields.
+// ds4_engine_options is a fixed C struct with no reflection, so the field set
+// is enumerated once here; adding a future engine knob is a one-line table
+// entry rather than a new branch in LoadModel. Two fields need ds4's own typed
+// parsers (Gib, CacheExperts) so a plain string passthrough can't cover them.
+enum class DsOptType { Bool, Int, Uint, Float, Str, Gib, CacheExperts };
+
+struct DsOptSpec {
+    const char *key;
+    DsOptType   type;
+    size_t      off;      // byte offset into ds4_engine_options
+    size_t      off2;     // second offset (CacheExperts writes experts + bytes)
+    bool        is_path;  // Str values: resolve a relative value against the model dir
+};
+
+static const DsOptSpec kEngineOptSpecs[] = {
+    {"mtp_path",                      DsOptType::Str,          offsetof(ds4_engine_options, mtp_path),                      0, true},
+    {"mtp_draft",                     DsOptType::Int,          offsetof(ds4_engine_options, mtp_draft_tokens),              0},
+    {"mtp_margin",                    DsOptType::Float,        offsetof(ds4_engine_options, mtp_margin),                    0},
+    {"prefill_chunk",                 DsOptType::Uint,         offsetof(ds4_engine_options, prefill_chunk),                 0},
+    {"power_percent",                 DsOptType::Int,          offsetof(ds4_engine_options, power_percent),                 0},
+    {"warm_weights",                  DsOptType::Bool,         offsetof(ds4_engine_options, warm_weights),                  0},
+    {"quality",                       DsOptType::Bool,         offsetof(ds4_engine_options, quality),                       0},
+    {"ssd_streaming",                 DsOptType::Bool,         offsetof(ds4_engine_options, ssd_streaming),                 0},
+    {"ssd_streaming_cold",            DsOptType::Bool,         offsetof(ds4_engine_options, ssd_streaming_cold),            0},
+    {"ssd_streaming_preload_experts", DsOptType::Uint,         offsetof(ds4_engine_options, ssd_streaming_preload_experts), 0},
+    {"ssd_streaming_cache_experts",   DsOptType::CacheExperts, offsetof(ds4_engine_options, ssd_streaming_cache_experts),
+                                                               offsetof(ds4_engine_options, ssd_streaming_cache_bytes)},
+    {"simulate_used_memory",          DsOptType::Gib,          offsetof(ds4_engine_options, simulate_used_memory_bytes),    0},
+    {"expert_profile_path",           DsOptType::Str,          offsetof(ds4_engine_options, expert_profile_path),           0, true},
+    {"directional_steering_file",     DsOptType::Str,          offsetof(ds4_engine_options, directional_steering_file),     0, true},
+    {"directional_steering_attn",     DsOptType::Float,        offsetof(ds4_engine_options, directional_steering_attn),     0},
+    {"directional_steering_ffn",      DsOptType::Float,        offsetof(ds4_engine_options, directional_steering_ffn),      0},
+};
+
+// Apply a single key:value LoadModel option to the engine options struct.
+// Unknown keys are ignored (back-compat: callers pass mixed option sets).
+// String values are copied into `storage`, whose elements the engine reads by
+// pointer during ds4_engine_open; `storage` MUST have reserved capacity so
+// push_back never reallocates and dangles an earlier c_str(). Returns false
+// with `err` set when a recognized key has an invalid value.
+static bool apply_engine_option(ds4_engine_options *opt, const std::string &key,
+                                const std::string &val, const std::string &model_dir,
+                                std::vector<std::string> &storage, std::string &err) {
+    const DsOptSpec *spec = nullptr;
+    for (const auto &s : kEngineOptSpecs) {
+        if (key == s.key) { spec = &s; break; }
+    }
+    if (!spec) return true; // unknown key: ignore
+
+    char *base = reinterpret_cast<char *>(opt);
+    switch (spec->type) {
+    case DsOptType::Bool: {
+        bool b = false;
+        if (!parse_bool_option(val, &b)) { err = key + " must be true/false"; return false; }
+        *reinterpret_cast<bool *>(base + spec->off) = b;
+        return true;
+    }
+    case DsOptType::Int: {
+        char *end = nullptr;
+        long v = std::strtol(val.c_str(), &end, 10);
+        if (val.empty() || !end || *end != '\0' || v < INT_MIN || v > INT_MAX) { err = key + " must be an integer"; return false; }
+        *reinterpret_cast<int *>(base + spec->off) = static_cast<int>(v);
+        return true;
+    }
+    case DsOptType::Uint: {
+        char *end = nullptr;
+        long long v = std::strtoll(val.c_str(), &end, 10);
+        if (val.empty() || !end || *end != '\0' || v < 0 || v > static_cast<long long>(UINT32_MAX)) {
+            err = key + " must be a non-negative integer"; return false;
+        }
+        *reinterpret_cast<uint32_t *>(base + spec->off) = static_cast<uint32_t>(v);
+        return true;
+    }
+    case DsOptType::Float: {
+        char *end = nullptr;
+        float f = std::strtof(val.c_str(), &end);
+        if (val.empty() || !end || *end != '\0') { err = key + " must be a number"; return false; }
+        *reinterpret_cast<float *>(base + spec->off) = f;
+        return true;
+    }
+    case DsOptType::Str: {
+        // Resolve a relative path option (e.g. mtp_path: a sibling GGUF the
+        // gallery downloaded next to the model) against the model directory, so
+        // YAMLs reference companion files by name. Absolute values pass through.
+        if (spec->is_path && !model_dir.empty() && !val.empty() && val.front() != '/') {
+            storage.push_back(model_dir + "/" + val);
+        } else {
+            storage.push_back(val);
+        }
+        *reinterpret_cast<const char **>(base + spec->off) = storage.back().c_str();
+        return true;
+    }
+    case DsOptType::Gib: {
+        uint64_t bytes = 0;
+        if (!ds4_parse_gib_arg(val.c_str(), &bytes)) {
+            err = key + " must be a GiB value, e.g. 64GB"; return false;
+        }
+        *reinterpret_cast<uint64_t *>(base + spec->off) = bytes;
+        return true;
+    }
+    case DsOptType::CacheExperts: {
+        uint32_t experts = 0;
+        uint64_t bytes = 0;
+        if (!ds4_parse_streaming_cache_experts_arg(val.c_str(), &experts, &bytes)) {
+            err = key + " must be a positive expert count or a <number>GB budget"; return false;
+        }
+        *reinterpret_cast<uint32_t *>(base + spec->off)  = experts;
+        *reinterpret_cast<uint64_t *>(base + spec->off2) = bytes;
+        return true;
+    }
+    }
+    return true;
+}
+
 // When acting as a distributed coordinator, block until the worker route
 // covers all layers (ds4_session_distributed_route_ready == 1) or the timeout
 // elapses. Returns an empty string on success, or an error message to return
@@ -476,39 +602,10 @@ public:
            return GStatus::OK;
        }

-        std::string mtp_path;
-        int mtp_draft = 0;
-        float mtp_margin = 3.0f;
-        std::string ds4_role, ds4_layers, ds4_listen;
-        for (const auto &opt : request->options()) {
-            auto [k, v] = split_option(opt);
-            if (k == "mtp_path") mtp_path = v;
-            else if (k == "mtp_draft") mtp_draft = std::stoi(v);
-            else if (k == "mtp_margin") mtp_margin = std::stof(v);
-            else if (k == "kv_cache_dir") g_kv_cache_dir = v;
-            else if (k == "ds4_role") ds4_role = v;
-            else if (k == "ds4_layers") ds4_layers = v;
-            else if (k == "ds4_listen") ds4_listen = v;
-            else if (k == "ds4_route_timeout") {
-                if (!parse_positive_int(v, &g_route_timeout_sec)) {
-                    result->set_success(false);
-                    result->set_message("ds4: ds4_route_timeout must be a positive integer");
-                    return GStatus::OK;
-                }
-            }
-        }
-
-        g_kv_cache.SetDir(g_kv_cache_dir);
-
        ds4_engine_options opt = {};
        opt.model_path = model_path.c_str();
-        opt.mtp_path = mtp_path.empty() ? nullptr : mtp_path.c_str();
        opt.n_threads = request->threads() > 0 ? request->threads() : 0;
-        opt.mtp_draft_tokens = mtp_draft;
-        opt.mtp_margin = mtp_margin;
-        opt.directional_steering_file = nullptr;
-        opt.warm_weights = false;
-        opt.quality = false;
+        opt.mtp_margin = 3.0f; // ds4 default; overridable via the mtp_margin option

 #if defined(DS4_NO_GPU)
        opt.backend = DS4_BACKEND_CPU;
@@ -518,6 +615,46 @@ public:
        opt.backend = DS4_BACKEND_CUDA;
 #endif

+        // Stable storage for string-valued engine options. The engine reads
+        // these by pointer during ds4_engine_open, so the std::string backing
+        // store must outlive the call and not reallocate; reserve up front so
+        // push_back keeps every prior c_str() valid. Static + clear() reuses
+        // the buffer across LoadModel calls (the old engine is closed above).
+        static std::vector<std::string> s_opt_strings;
+        s_opt_strings.clear();
+        s_opt_strings.reserve(sizeof(kEngineOptSpecs) / sizeof(kEngineOptSpecs[0]));
+
+        // Directory of the main model, used to resolve relative path options.
+        std::string model_dir;
+        if (auto slash = model_path.find_last_of('/'); slash != std::string::npos) {
+            model_dir = model_path.substr(0, slash);
+        }
+
+        std::string ds4_role, ds4_layers, ds4_listen;
+        for (const auto &o : request->options()) {
+            auto [k, v] = split_option(o);
+            if (k == "kv_cache_dir") { g_kv_cache_dir = v; continue; }
+            else if (k == "ds4_role") { ds4_role = v; continue; }
+            else if (k == "ds4_layers") { ds4_layers = v; continue; }
+            else if (k == "ds4_listen") { ds4_listen = v; continue; }
+            else if (k == "ds4_route_timeout") {
+                if (!parse_positive_int(v, &g_route_timeout_sec)) {
+                    result->set_success(false);
+                    result->set_message("ds4: ds4_route_timeout must be a positive integer");
+                    return GStatus::OK;
+                }
+                continue;
+            }
+            std::string err;
+            if (!apply_engine_option(&opt, k, v, model_dir, s_opt_strings, err)) {
+                result->set_success(false);
+                result->set_message("ds4: " + err);
+                return GStatus::OK;
+            }
+        }
+
+        g_kv_cache.SetDir(g_kv_cache_dir);
+
        // Coordinator wiring. 'ds4_role:coordinator' enables layer-split
        // distributed inference: this process listens on ds4_listen and owns
        // the ds4_layers slice; workers dial in (see `local-ai worker
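Reviewer note on the option table in this file: the pattern is a key, a type tag, and an `offsetof` into a plain struct, with string values kept alive in a pre-reserved vector. Below is a self-contained sketch of that pattern on a toy struct. `DemoOptions`, `DemoSpec`, `kDemoSpecs`, and `demo_apply` are hypothetical names for illustration only; the real table targets `ds4_engine_options` and has more types and error reporting.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-in for a plain C options struct (not ds4_engine_options).
struct DemoOptions {
    bool        warm;
    int         draft_tokens;
    float       margin;
    const char *profile_path;
};

enum class DemoType { Bool, Int, Float, Str };

struct DemoSpec {
    const char *key;
    DemoType    type;
    std::size_t off; // byte offset of the target field
};

static const DemoSpec kDemoSpecs[] = {
    {"warm",    DemoType::Bool,  offsetof(DemoOptions, warm)},
    {"draft",   DemoType::Int,   offsetof(DemoOptions, draft_tokens)},
    {"margin",  DemoType::Float, offsetof(DemoOptions, margin)},
    {"profile", DemoType::Str,   offsetof(DemoOptions, profile_path)},
};

// Returns true when the key is unknown (ignored) or the value parsed cleanly.
// `storage` must already have spare capacity: a reallocation would move the
// strings and invalidate pointers handed out by earlier calls.
static bool demo_apply(DemoOptions *opt, const std::string &key,
                       const std::string &val, std::vector<std::string> &storage) {
    const DemoSpec *spec = nullptr;
    for (const auto &s : kDemoSpecs) {
        if (key == s.key) { spec = &s; break; }
    }
    if (!spec) return true;

    char *base = reinterpret_cast<char *>(opt);
    char *end = nullptr;
    switch (spec->type) {
    case DemoType::Bool:
        // An empty value models a bare flag and means true.
        if (val.empty() || val == "true") { *reinterpret_cast<bool *>(base + spec->off) = true; return true; }
        if (val == "false") { *reinterpret_cast<bool *>(base + spec->off) = false; return true; }
        return false;
    case DemoType::Int: {
        long v = std::strtol(val.c_str(), &end, 10);
        if (val.empty() || *end != '\0') return false;
        *reinterpret_cast<int *>(base + spec->off) = static_cast<int>(v);
        return true;
    }
    case DemoType::Float: {
        float f = std::strtof(val.c_str(), &end);
        if (val.empty() || *end != '\0') return false;
        *reinterpret_cast<float *>(base + spec->off) = f;
        return true;
    }
    case DemoType::Str:
        storage.push_back(val);
        *reinterpret_cast<const char **>(base + spec->off) = storage.back().c_str();
        return true;
    }
    return true;
}
```

A caller would zero-initialize the struct, reserve the vector once (for example to the table length), and then call `demo_apply` per `key:value` pair. Adding a field then costs one table row, which is the design point the patch comments make.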
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=e6f8112f3ba126eed3ff5b30cdd08085414a7516
+IK_LLAMA_VERSION?=b3dfb7858cfcb9166e92f366e5af87f19ebc94be
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=4c6595503fe45d5a39f88d194e270f64c7424677
+LLAMA_VERSION?=f3e182816421c648188b5eab269853bf1531d950
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -1922,25 +1922,27 @@ public:
                    body_json["min_p"] = data["min_p"];
                }

-                // Pass enable_thinking via chat_template_kwargs (where oaicompat_chat_params_parse reads it)
+                // Forward the chat_template_kwargs the Go layer resolved (model config
+                // chat_template_kwargs + per-request metadata: enable_thinking,
+                // reasoning_effort, preserve_thinking, ...). One generic merge replaces
+                // the previous per-key handling - new template levers need no C++ change.
+                // oaicompat_chat_params_parse reads these from body_json.
                const auto& metadata = request->metadata();
-                auto et_it = metadata.find("enable_thinking");
-                if (et_it != metadata.end()) {
-                    if (!body_json.contains("chat_template_kwargs")) {
-                        body_json["chat_template_kwargs"] = json::object();
+                auto ctk_it = metadata.find("chat_template_kwargs");
+                if (ctk_it != metadata.end() && !ctk_it->second.empty()) {
+                    try {
+                        json ctk = json::parse(ctk_it->second);
+                        if (ctk.is_object()) {
+                            if (!body_json.contains("chat_template_kwargs")) {
+                                body_json["chat_template_kwargs"] = json::object();
+                            }
+                            for (auto& el : ctk.items()) {
+                                body_json["chat_template_kwargs"][el.key()] = el.value();
+                            }
+                        }
+                    } catch (const std::exception & e) {
+                        SRV_WRN("failed to parse chat_template_kwargs metadata: %s\n", e.what());
                    }
-                    body_json["chat_template_kwargs"]["enable_thinking"] = (et_it->second == "true");
-                }
-
-                // Pass reasoning_effort via chat_template_kwargs too: the lever
-                // jinja templates like gpt-oss (Harmony) / LFM2.5 read, distinct
-                // from enable_thinking which those templates ignore.
-                auto re_it = metadata.find("reasoning_effort");
-                if (re_it != metadata.end() && !re_it->second.empty()) {
-                    if (!body_json.contains("chat_template_kwargs")) {
-                        body_json["chat_template_kwargs"] = json::object();
-                    }
-                    body_json["chat_template_kwargs"]["reasoning_effort"] = re_it->second;
                }

                // Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
@@ -2756,25 +2758,26 @@ public:
                    body_json["min_p"] = data["min_p"];
                }

-                // Pass enable_thinking via chat_template_kwargs (where oaicompat_chat_params_parse reads it)
+                // Forward the chat_template_kwargs the Go layer resolved (model config
+                // chat_template_kwargs + per-request metadata: enable_thinking,
+                // reasoning_effort, preserve_thinking, ...). One generic merge replaces
+                // the previous per-key handling - new template levers need no C++ change.
                const auto& predict_metadata = request->metadata();
-                auto predict_et_it = predict_metadata.find("enable_thinking");
-                if (predict_et_it != predict_metadata.end()) {
-                    if (!body_json.contains("chat_template_kwargs")) {
-                        body_json["chat_template_kwargs"] = json::object();
+                auto predict_ctk_it = predict_metadata.find("chat_template_kwargs");
+                if (predict_ctk_it != predict_metadata.end() && !predict_ctk_it->second.empty()) {
+                    try {
+                        json ctk = json::parse(predict_ctk_it->second);
+                        if (ctk.is_object()) {
+                            if (!body_json.contains("chat_template_kwargs")) {
+                                body_json["chat_template_kwargs"] = json::object();
+                            }
+                            for (auto& el : ctk.items()) {
+                                body_json["chat_template_kwargs"][el.key()] = el.value();
+                            }
+                        }
+                    } catch (const std::exception & e) {
+                        SRV_WRN("failed to parse chat_template_kwargs metadata: %s\n", e.what());
                    }
-                    body_json["chat_template_kwargs"]["enable_thinking"] = (predict_et_it->second == "true");
-                }
-
-                // Pass reasoning_effort via chat_template_kwargs too: the lever
-                // jinja templates like gpt-oss (Harmony) / LFM2.5 read, distinct
-                // from enable_thinking which those templates ignore.
-                auto predict_re_it = predict_metadata.find("reasoning_effort");
-                if (predict_re_it != predict_metadata.end() && !predict_re_it->second.empty()) {
-                    if (!body_json.contains("chat_template_kwargs")) {
-                        body_json["chat_template_kwargs"] = json::object();
-                    }
-                    body_json["chat_template_kwargs"]["reasoning_effort"] = predict_re_it->second;
                }

                // Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
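Reviewer note on the merge semantics in both hunks above: every forwarded key is written into `body_json["chat_template_kwargs"]`, so a forwarded key overwrites one already present in the body while unrelated body keys survive. The sketch below shows only that precedence rule. It uses a flat `std::map` as a stand-in for the JSON object, and `Kwargs` and `merge_kwargs` are hypothetical names; the real code parses the metadata string with the JSON library first and skips the merge when parsing fails or the payload is not an object.

```cpp
#include <map>
#include <string>

// Stand-in for a JSON object of template kwargs (values kept as strings here).
using Kwargs = std::map<std::string, std::string>;

// Copy every forwarded key into dst. A key present on both sides ends up
// with the forwarded value; keys only in dst are left untouched.
inline void merge_kwargs(Kwargs &dst, const Kwargs &forwarded) {
    for (const auto &kv : forwarded) {
        dst[kv.first] = kv.second;
    }
}
```

Under this rule a new template lever only needs to appear in the forwarded object; no per-key branch is required on the C++ side.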
--- a/backend/cpp/privacy-filter/.gitignore
+++ b/backend/cpp/privacy-filter/.gitignore
@@ -0,0 +1,9 @@
+/privacy-filter.cpp
+build/
+package/
+grpc-server
+*.o
+backend.pb.cc
+backend.pb.h
+backend.grpc.pb.cc
+backend.grpc.pb.h
--- a/backend/cpp/privacy-filter/CMakeLists.txt
+++ b/backend/cpp/privacy-filter/CMakeLists.txt
@@ -0,0 +1,69 @@
+cmake_minimum_required(VERSION 3.21)
+project(privacy-filter-grpc-server LANGUAGES CXX C)
+
+set(CMAKE_CXX_STANDARD 17)
+set(CMAKE_CXX_STANDARD_REQUIRED ON)
+set(TARGET grpc-server)
+
+# Path to the privacy-filter.cpp engine sources. The Makefile arranges for this
+# to exist (clone of a pinned commit, or a symlink to PRIVACY_FILTER_SRC).
+set(PRIVACY_FILTER_DIR "${CMAKE_CURRENT_SOURCE_DIR}/privacy-filter.cpp"
+    CACHE PATH "Path to the privacy-filter.cpp engine source tree")
+
+find_package(Threads REQUIRED)
+find_package(Protobuf CONFIG QUIET)
+if(NOT Protobuf_FOUND)
+    find_package(Protobuf REQUIRED)
+endif()
+find_package(gRPC CONFIG QUIET)
+if(NOT gRPC_FOUND)
+    # Ubuntu's apt-installed grpc++ does not ship a CMake config - fall back.
+    find_library(GRPCPP_LIB grpc++ REQUIRED)
+    find_library(GRPCPP_REFLECTION_LIB grpc++_reflection REQUIRED)
+    add_library(gRPC::grpc++ INTERFACE IMPORTED)
+    set_target_properties(gRPC::grpc++ PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_LIB}")
+    add_library(gRPC::grpc++_reflection INTERFACE IMPORTED)
+    set_target_properties(gRPC::grpc++_reflection PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_REFLECTION_LIB}")
+endif()
+
+find_program(_PROTOC NAMES protoc REQUIRED)
+find_program(_GRPC_CPP_PLUGIN NAMES grpc_cpp_plugin REQUIRED)
+
+get_filename_component(HW_PROTO "${CMAKE_CURRENT_SOURCE_DIR}/../../backend.proto" ABSOLUTE)
+get_filename_component(HW_PROTO_PATH "${HW_PROTO}" PATH)
+
+set(HW_PROTO_SRCS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.cc")
+set(HW_PROTO_HDRS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.h")
+set(HW_GRPC_SRCS  "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.cc")
+set(HW_GRPC_HDRS  "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.h")
+
+add_custom_command(
+    OUTPUT "${HW_PROTO_SRCS}" "${HW_PROTO_HDRS}" "${HW_GRPC_SRCS}" "${HW_GRPC_HDRS}"
+    COMMAND ${_PROTOC}
+    ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
+         --cpp_out  "${CMAKE_CURRENT_BINARY_DIR}"
+         -I "${HW_PROTO_PATH}"
+         --plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN}"
+         "${HW_PROTO}"
+    DEPENDS "${HW_PROTO}")
+
+add_library(hw_grpc_proto STATIC
+    ${HW_GRPC_SRCS} ${HW_GRPC_HDRS}
+    ${HW_PROTO_SRCS} ${HW_PROTO_HDRS})
+target_include_directories(hw_grpc_proto PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
+
+# Build only the pf static lib (+ ggml) from the engine tree — no CLI/bench/tests.
+# PF_VULKAN is honored when passed on the cmake command line (it lands in the
+# shared cache the engine reads).
+set(PF_BUILD_TOOLS OFF CACHE BOOL "" FORCE)
+set(PF_BUILD_TESTS OFF CACHE BOOL "" FORCE)
+add_subdirectory(${PRIVACY_FILTER_DIR} ${CMAKE_CURRENT_BINARY_DIR}/privacy-filter.cpp)
+
+add_executable(${TARGET} grpc-server.cpp)
+target_link_libraries(${TARGET} PRIVATE
+    pf
+    hw_grpc_proto
+    gRPC::grpc++
+    gRPC::grpc++_reflection
+    protobuf::libprotobuf
+    Threads::Threads)
--- a/backend/cpp/privacy-filter/Makefile
+++ b/backend/cpp/privacy-filter/Makefile
@@ -0,0 +1,77 @@
+# privacy-filter backend Makefile.
+#
+# Wraps the standalone privacy-filter.cpp GGML engine (the openai-privacy-filter
+# PII/NER token classifier) as a LocalAI gRPC backend. The engine source is
+# fetched at the pin below — .github/workflows/bump_deps.yaml finds and updates
+# PRIVACY_FILTER_VERSION, matching the llama-cpp / ds4 convention.
+#
+# Local development: point at a working checkout instead of cloning, e.g.
+#   make PRIVACY_FILTER_SRC=$HOME/c/privacy-filter.cpp grpc-server
+
+PRIVACY_FILTER_VERSION?=646342f7a59c6b7d195185eac60bad762e572f1d
+PRIVACY_FILTER_REPO?=https://github.com/localai-org/privacy-filter.cpp
+PRIVACY_FILTER_SRC?=
+
+CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
+BUILD_DIR := build
+
+BUILD_TYPE ?=
+NATIVE ?= false
+JOBS ?= $(shell nproc 2>/dev/null || echo 4)
+
+CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release
+
+# GPU backends; the default (cpu) needs no extra flags. 'cublas' is LocalAI's
+# name for the CUDA build (matches llama-cpp / ds4), mapping to the engine's
+# GGML_CUDA path; 'vulkan' selects the ggml Vulkan backend.
+ifeq ($(BUILD_TYPE),cublas)
+    CMAKE_ARGS += -DPF_CUDA=ON
+endif
+ifeq ($(BUILD_TYPE),vulkan)
+    CMAKE_ARGS += -DPF_VULKAN=ON
+endif
+
+# Portable binaries for distribution: disable -march=native unless asked.
+ifneq ($(NATIVE),true)
+    CMAKE_ARGS += -DGGML_NATIVE=OFF
+endif
+
+.PHONY: grpc-server package clean purge test all
+all: grpc-server
+
+# Provide the engine sources at ./privacy-filter.cpp. With PRIVACY_FILTER_SRC
+# set we symlink a local checkout (instant, no network); otherwise we clone the
+# pinned commit and its ggml submodule. The directory/symlink is the target, so
+# make only does this once — run 'make purge && make' to refetch after a bump.
+privacy-filter.cpp:
+ifneq ($(PRIVACY_FILTER_SRC),)
+	ln -sfn $(abspath $(PRIVACY_FILTER_SRC)) privacy-filter.cpp
+else
+	mkdir -p privacy-filter.cpp
+	cd privacy-filter.cpp && \
+	git init -q && \
+	git remote add origin $(PRIVACY_FILTER_REPO) && \
+	git fetch --depth 1 origin $(PRIVACY_FILTER_VERSION) && \
+	git checkout FETCH_HEAD && \
+	git submodule update --init --recursive --depth 1
+endif
+
+grpc-server: privacy-filter.cpp
+	@echo "Building privacy-filter grpc-server ($(BUILD_TYPE)) with $(CMAKE_ARGS)"
+	mkdir -p $(BUILD_DIR)
+	cd $(BUILD_DIR) && cmake $(CMAKE_ARGS) $(CURRENT_MAKEFILE_DIR) && cmake --build . --config Release -j $(JOBS)
+	cp $(BUILD_DIR)/grpc-server grpc-server
+
+package: grpc-server
+	bash package.sh
+
+test:
+	@echo "privacy-filter backend: parity/regression coverage lives in the engine repo"
+
+clean:
+	rm -rf $(BUILD_DIR) grpc-server package
+
+# 'privacy-filter.cpp' may be a symlink (PRIVACY_FILTER_SRC) — rm without a
+# trailing slash removes the link, never the linked-to checkout.
+purge: clean
+	rm -rf privacy-filter.cpp
--- a/backend/cpp/privacy-filter/grpc-server.cpp
+++ b/backend/cpp/privacy-filter/grpc-server.cpp
@@ -0,0 +1,210 @@
+// privacy-filter LocalAI gRPC backend.
+//
+// Thin shim over privacy-filter.cpp's flat C API (include/pf.h): a standalone
+// GGML engine for the openai-privacy-filter token-classification model family
+// (PII NER). It replaces the llama.cpp-patched TokenClassify path for this one
+// model family — same GGUF files, no llama.cpp carry-patches.
+//
+// Only the RPCs the PII tier needs are implemented: LoadModel, TokenClassify,
+// plus Health / Status / Free. Everything else inherits the generated base
+// class default (UNIMPLEMENTED).
+
+#include "backend.pb.h"
+#include "backend.grpc.pb.h"
+
+#include "pf.h"
+
+#include <grpcpp/grpcpp.h>
+#include <grpcpp/server.h>
+#include <grpcpp/server_builder.h>
+#include <grpcpp/ext/proto_server_reflection_plugin.h>
+
+#include <atomic>
+#include <chrono>
+#include <csignal>
+#include <iostream>
+#include <memory>
+#include <mutex>
+#include <string>
+
+using grpc::Server;
+using grpc::ServerBuilder;
+using grpc::ServerContext;
+// NOTE: do NOT alias grpc::Status as Status — the Status RPC method below would
+// shadow the type and break the other method signatures. Use GStatus instead.
+using GStatus = ::grpc::Status;
+using grpc::StatusCode;
+
+namespace {
+
+// The engine is single-model-per-process: LocalAI spawns one backend process
+// per loaded model. g_mu guards (re)load against in-flight classification.
+std::mutex          g_mu;
+pf_ctx *            g_ctx = nullptr;
+std::atomic<Server *> g_server{nullptr};
+
+// Resolve the device string the engine expects ("cpu" / "gpu" / "cuda" /
+// "vulkan", optionally ":N"). Priority: an explicit "device:..." in
+// ModelOptions.Options, then a non-zero NGPULayers as a coarse "use the GPU"
+// signal, else CPU. "gpu" lets the engine pick whichever GPU backend this
+// binary was compiled with (CUDA or Vulkan), so the same config works on
+// either build; pin "device:cuda"/"device:vulkan" to be explicit.
+std::string resolve_device(const backend::ModelOptions * opts) {
+    for (const auto & o : opts->options()) {
+        const std::string prefix = "device:";
+        if (o.rfind(prefix, 0) == 0) {
+            return o.substr(prefix.size());
+        }
+    }
+    if (opts->ngpulayers() > 0) {
+        return "gpu";
+    }
+    return "cpu";
+}
+
+class PrivacyFilterBackend final : public backend::Backend::Service {
+public:
+    GStatus Health(ServerContext *, const backend::HealthMessage *,
+                   backend::Reply * reply) override {
+        reply->set_message("OK");
+        return GStatus::OK;
+    }
+
+    GStatus Status(ServerContext *, const backend::HealthMessage *,
+                   backend::StatusResponse * response) override {
+        std::lock_guard<std::mutex> lock(g_mu);
+        response->set_state(g_ctx ? backend::StatusResponse::READY
+                                  : backend::StatusResponse::UNINITIALIZED);
+        return GStatus::OK;
+    }
+
+    GStatus LoadModel(ServerContext *, const backend::ModelOptions * request,
+                      backend::Result * result) override {
+        std::lock_guard<std::mutex> lock(g_mu);
+
+        // ModelFile is the absolute path LocalAI resolves; Model is the bare
+        // name. Prefer the former, fall back to the latter.
+        const std::string path =
+            !request->modelfile().empty() ? request->modelfile() : request->model();
+        if (path.empty()) {
+            result->set_success(false);
+            result->set_message("no model path supplied");
+            return GStatus::OK;
+        }
+
+        const std::string device = resolve_device(request);
+
+        if (g_ctx) { pf_free(g_ctx); g_ctx = nullptr; }
+
+        pf_ctx * ctx = pf_load(path.c_str(), device.c_str(), request->threads());
+        const char * err = pf_last_error(ctx);
+        if (err) {
+            result->set_success(false);
+            result->set_message(std::string("privacy-filter load failed: ") + err);
+            pf_free(ctx);
+            return GStatus::OK;
+        }
+
+        // ContextSize, when set, becomes the per-forward window. The engine
+        // ignores values that are too small to window (<= 2*halo) and just
+        // runs a single forward, so passing it through is always safe.
+        if (request->contextsize() > 0) {
+            pf_set_window(ctx, request->contextsize());
+        }
+
+        g_ctx = ctx;
+        result->set_success(true);
+        result->set_message("privacy-filter loaded (" + device + ")");
+        return GStatus::OK;
+    }
+
+    GStatus TokenClassify(ServerContext *, const backend::TokenClassifyRequest * request,
+                          backend::TokenClassifyResponse * response) override {
+        std::lock_guard<std::mutex> lock(g_mu);
+        if (!g_ctx) {
+            return GStatus(StatusCode::FAILED_PRECONDITION, "Model not loaded");
+        }
+
+        const std::string & text = request->text();
+        if (text.empty()) {
+            return GStatus::OK;  // no text -> no entities
+        }
+
+        pf_entity * ents = nullptr;
+        size_t      n    = 0;
+        if (pf_classify(g_ctx, text.data(), text.size(), request->threshold(), &ents, &n) != 0) {
+            const char * err = pf_last_error(g_ctx);
+            return GStatus(StatusCode::INTERNAL,
+                           std::string("TokenClassify failed: ") + (err ? err : "unknown"));
+        }
+
+        // Byte offsets are into the original UTF-8 text; the engine already
+        // applied the threshold and whitespace-trimmed span edges.
+        for (size_t i = 0; i < n; i++) {
+            backend::TokenClassifyEntity * ent = response->add_entities();
+            ent->set_entity_group(ents[i].label ? ents[i].label : "");
+            ent->set_start(ents[i].start);
+            ent->set_end(ents[i].end);
+            ent->set_score(ents[i].score);
+            ent->set_text(text.substr((size_t) ents[i].start,
+                                      (size_t) (ents[i].end - ents[i].start)));
+        }
+        pf_entities_free(ents, n);
+        return GStatus::OK;
+    }
+
+    GStatus Free(ServerContext *, const backend::HealthMessage *,
+                 backend::Result * result) override {
+        std::lock_guard<std::mutex> lock(g_mu);
+        if (g_ctx) { pf_free(g_ctx); g_ctx = nullptr; }
+        result->set_success(true);
+        return GStatus::OK;
+    }
+};
+
+void RunServer(const std::string & addr) {
+    PrivacyFilterBackend service;
+    grpc::EnableDefaultHealthCheckService(true);
+    grpc::reflection::InitProtoReflectionServerBuilderPlugin();
+
+    ServerBuilder builder;
+    builder.AddListeningPort(addr, grpc::InsecureServerCredentials());
+    builder.RegisterService(&service);
+    builder.SetMaxReceiveMessageSize(64 * 1024 * 1024);
+    builder.SetMaxSendMessageSize(64 * 1024 * 1024);
+
+    std::unique_ptr<Server> server(builder.BuildAndStart());
+    if (!server) {
+        std::cerr << "privacy-filter grpc-server: failed to bind " << addr << "\n";
+        std::exit(1);
+    }
+    g_server = server.get();
+    std::cerr << "privacy-filter grpc-server listening on " << addr << "\n";
+    server->Wait();
+}
+
+void signal_handler(int) {
+    if (auto * srv = g_server.load()) {
+        srv->Shutdown(std::chrono::system_clock::now() + std::chrono::seconds(3));
+    }
+}
+
+} // namespace
+
+int main(int argc, char * argv[]) {
+    std::string addr = "127.0.0.1:50051";
+    for (int i = 1; i < argc; ++i) {
+        std::string a = argv[i];
+        const std::string addr_flag = "--addr=";
+        if (a.rfind(addr_flag, 0) == 0)      addr = a.substr(addr_flag.size());
+        else if (a == "--addr" && i + 1 < argc) addr = argv[++i];
+        else if (a == "--help" || a == "-h") {
+            std::cout << "Usage: grpc-server --addr=HOST:PORT\n";
+            return 0;
+        }
+    }
+    std::signal(SIGINT,  signal_handler);
+    std::signal(SIGTERM, signal_handler);
+    RunServer(addr);
+    return 0;
+}
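
The device-resolution priority described in the `resolve_device` comment (explicit `device:` option, then a non-zero GPU layer count, else CPU) can be exercised in isolation. The sketch below is only an illustration of that ordering: `DeviceOptions` and `resolve_device_sketch` are hypothetical stand-ins for `backend::ModelOptions` and the real function, and are not part of the patch.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the two backend::ModelOptions fields that the
// resolver reads; the real type is generated from backend.proto.
struct DeviceOptions {
    std::vector<std::string> options;  // "key:value" strings from the model YAML
    int ngpulayers = 0;                // coarse "use the GPU" signal
};

// Same priority order as resolve_device in grpc-server.cpp:
//   1. the first explicit "device:..." option wins,
//   2. otherwise a non-zero layer count selects the generic "gpu" device,
//   3. otherwise fall back to "cpu".
std::string resolve_device_sketch(const DeviceOptions & opts) {
    const std::string prefix = "device:";
    for (const auto & o : opts.options) {
        if (o.rfind(prefix, 0) == 0) {  // rfind(..., 0) == 0 is a starts-with test
            return o.substr(prefix.size());
        }
    }
    return opts.ngpulayers > 0 ? "gpu" : "cpu";
}
```

Under these assumptions, an options list containing `device:cuda:1` resolves to `cuda:1` even when the layer count is zero, because the explicit option is checked first.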
--- a/backend/cpp/privacy-filter/package.sh
+++ b/backend/cpp/privacy-filter/package.sh
@@ -0,0 +1,39 @@
+#!/bin/bash
+# Assemble package/ for the from-scratch backend image: the grpc-server binary,
+# run.sh, the dynamic loader, and every shared library the binary needs.
+set -e
+CURDIR=$(dirname "$(realpath "$0")")
+REPO_ROOT="${CURDIR}/../../.."
+
+mkdir -p "$CURDIR/package/lib"
+cp -avf "$CURDIR/grpc-server" "$CURDIR/package/"
+cp -rfv "$CURDIR/run.sh"      "$CURDIR/package/"
+
+# The dynamic loader, renamed to lib/ld.so so run.sh can invoke it explicitly
+# (makes the image independent of the host's glibc layout).
+if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
+    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
+elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
+    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
+else
+    echo "package.sh: unknown architecture" >&2; exit 1
+fi
+
+# Bundle the binary's transitive shared deps (libstdc++, libgomp, and the apt
+# grpc++/protobuf/absl stack) by walking ldd — robust to whichever of those are
+# linked shared vs static. The loader line (no "=>") is skipped; ld.so above
+# already covers it.
+ldd "$CURDIR/grpc-server" | awk '$2 == "=>" && $3 ~ /^\// { print $3 }' | sort -u | \
+while read -r so; do
+    [ -f "$so" ] && cp -arfLv "$so" "$CURDIR/package/lib/"
+done
+
+# Vulkan loader / GPU libs when building the GPU variant.
+GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
+if [ -f "$GPU_LIB_SCRIPT" ]; then
+    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
+    package_gpu_libs
+fi
+
+echo "privacy-filter package contents:"
+ls -lah "$CURDIR/package/" "$CURDIR/package/lib/"
--- a/backend/cpp/privacy-filter/run.sh
+++ b/backend/cpp/privacy-filter/run.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+# Entry point for the privacy-filter backend image / BACKEND_BINARY mode.
+set -e
+CURDIR=$(dirname "$(realpath "$0")")
+export LD_LIBRARY_PATH="$CURDIR/lib:$LD_LIBRARY_PATH"
+if [ -f "$CURDIR/lib/ld.so" ]; then
+    exec "$CURDIR/lib/ld.so" "$CURDIR/grpc-server" "$@"
+fi
+exec "$CURDIR/grpc-server" "$@"
--- a/backend/go/crispasr/Makefile
+++ b/backend/go/crispasr/Makefile
@@ -67,7 +67,7 @@ sources/CrispASR:
 	# it, so ${CMAKE_SOURCE_DIR} is THIS backend dir and the talk-llama sources
 	# aren't found. Rewrite to ${PROJECT_SOURCE_DIR} (the crispasr project root),
 	# which is correct both standalone and as a subproject. Idempotent.
-	sed -i 's#\$${CMAKE_SOURCE_DIR}/examples/talk-llama#\$${PROJECT_SOURCE_DIR}/examples/talk-llama#' sources/CrispASR/src/CMakeLists.txt
+	sed -i.bak 's#\$${CMAKE_SOURCE_DIR}/examples/talk-llama#\$${PROJECT_SOURCE_DIR}/examples/talk-llama#' sources/CrispASR/src/CMakeLists.txt && rm -f sources/CrispASR/src/CMakeLists.txt.bak

 # Detect OS
 UNAME_S := $(shell uname -s)
--- a/backend/go/crispasr/cpp/crispasr_shim.cpp
+++ b/backend/go/crispasr/cpp/crispasr_shim.cpp
@@ -47,6 +47,74 @@ extern "C" void set_abort(int v) {
  g_abort.store(v, std::memory_order_relaxed);
 }

+// --- word-level timestamp accessors ---
+extern "C" {
+int crispasr_session_result_n_words(crispasr_session_result *r, int seg_i);
+const char *crispasr_session_result_word_text(crispasr_session_result *r,
+                                               int seg_i, int word_i);
+int64_t crispasr_session_result_word_t0(crispasr_session_result *r, int seg_i,
+                                         int word_i);
+int64_t crispasr_session_result_word_t1(crispasr_session_result *r, int seg_i,
+                                         int word_i);
+
+// Parakeet-specific word accessors
+int crispasr_parakeet_result_n_words(void *r);
+const char *crispasr_parakeet_result_word_text(void *r, int word_i);
+int64_t crispasr_parakeet_result_word_t0(void *r, int word_i);
+int64_t crispasr_parakeet_result_word_t1(void *r, int word_i);
+}
+
+void *get_result(void) { return g_result; }
+
+int get_word_count(int seg_i) {
+  if (!g_result)
+    return 0;
+  return crispasr_session_result_n_words(g_result, seg_i);
+}
+
+const char *get_word_text(int seg_i, int word_i) {
+  if (!g_result)
+    return "";
+  return crispasr_session_result_word_text(g_result, seg_i, word_i);
+}
+
+int64_t get_word_t0(int seg_i, int word_i) {
+  if (!g_result)
+    return 0;
+  return crispasr_session_result_word_t0(g_result, seg_i, word_i);
+}
+
+int64_t get_word_t1(int seg_i, int word_i) {
+  if (!g_result)
+    return 0;
+  return crispasr_session_result_word_t1(g_result, seg_i, word_i);
+}
+
+// Parakeet-specific word accessors
+int get_parakeet_word_count(void) {
+  if (!g_result)
+    return 0;
+  return crispasr_parakeet_result_n_words(g_result);
+}
+
+const char *get_parakeet_word_text(int word_i) {
+  if (!g_result)
+    return "";
+  return crispasr_parakeet_result_word_text(g_result, word_i);
+}
+
+int64_t get_parakeet_word_t0(int word_i) {
+  if (!g_result)
+    return 0;
+  return crispasr_parakeet_result_word_t0(g_result, word_i);
+}
+
+int64_t get_parakeet_word_t1(int word_i) {
+  if (!g_result)
+    return 0;
+  return crispasr_parakeet_result_word_t1(g_result, word_i);
+}
+
 static void ggml_log_cb(enum ggml_log_level level, const char *log,
                        void *data) {
  const char *level_str;
--- a/backend/go/crispasr/cpp/crispasr_shim.h
+++ b/backend/go/crispasr/cpp/crispasr_shim.h
@@ -20,4 +20,18 @@ float *tts_synthesize(const char *text, int *out_n_samples); // 24kHz mono float
 void tts_free(float *pcm);
 int tts_set_voice(const char *name); // best-effort speaker selection; 0 ok
 int tts_set_voice_file(const char *path, const char *ref_text); // load voice pack (.gguf) or zero-shot clone (.wav + ref_text)
+
+// --- word-level timestamp accessors ---
+// Session-based (works for whisper-like backends)
+void *get_result(void);
+int get_word_count(int seg_i);
+const char *get_word_text(int seg_i, int word_i);
+int64_t get_word_t0(int seg_i, int word_i);
+int64_t get_word_t1(int seg_i, int word_i);
+
+// Parakeet-specific (global word list, no segment index)
+int get_parakeet_word_count(void);
+const char *get_parakeet_word_text(int word_i);
+int64_t get_parakeet_word_t0(int word_i);
+int64_t get_parakeet_word_t1(int word_i);
 }
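
The two accessor families are meant to be combined by the caller: per-segment words are used when a segment has any, and the Parakeet global list is used only for segment 0 so that words are not duplicated across segments. The following is a rough, self-contained sketch of that selection rule. `RawWord`, `Word`, and `words_for_segment` are hypothetical names invented for illustration, the vectors stand in for what the shim accessors would return, and the `10000000` multiplier is simply the one the Go caller in this change applies (the diff does not state the engine's tick unit).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical raw word as reported by the engine, in engine ticks.
struct RawWord { int64_t t0; int64_t t1; std::string text; };

// Hypothetical converted word, using the same scaling as the Go caller.
struct Word { int64_t start; int64_t end; std::string text; };

// Multiplier copied from the Go caller in this change; the unit of the
// engine tick is an assumption and is not stated in the diff.
constexpr int64_t kTickScale = 10000000;

// Pick the word source for one segment:
//   - per-segment words when the segment has any,
//   - otherwise the global (Parakeet-style) list, but only for segment 0,
//   - otherwise no words.
std::vector<Word> words_for_segment(
        int seg_i,
        const std::vector<std::vector<RawWord>> & per_segment,
        const std::vector<RawWord> & global_words) {
    const std::vector<RawWord> * src = nullptr;
    if (seg_i >= 0 && static_cast<std::size_t>(seg_i) < per_segment.size() &&
        !per_segment[seg_i].empty()) {
        src = &per_segment[seg_i];
    } else if (seg_i == 0) {
        src = &global_words;
    }

    std::vector<Word> out;
    if (src == nullptr) {
        return out;
    }
    for (const auto & w : *src) {
        out.push_back({w.t0 * kTickScale, w.t1 * kTickScale, w.text});
    }
    return out;
}
```

Restricting the global fallback to segment 0 keeps every word attached to exactly one segment, at the cost of losing the word-to-segment mapping for engines that only expose a flat list.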
--- a/backend/go/crispasr/gocrispasr.go
+++ b/backend/go/crispasr/gocrispasr.go
@@ -34,6 +34,18 @@ var (
 	CppTTSFree         func(ptr uintptr)
 	CppTTSSetVoice     func(name string) int
 	CppTTSSetVoiceFile func(path string, refText string) int
+
+	// Word-level timestamp accessors (session-based, per-segment)
+	CppGetWordCount func(segI int) int
+	CppGetWordText  func(segI int, wordI int) string
+	CppGetWordT0    func(segI int, wordI int) int64
+	CppGetWordT1    func(segI int, wordI int) int64
+
+	// Parakeet-specific word accessors (global, no segment index)
+	CppGetParakeetWordCount func() int
+	CppGetParakeetWordText  func(wordI int) string
+	CppGetParakeetWordT0    func(wordI int) int64
+	CppGetParakeetWordT1    func(wordI int) int64
 )

 type CrispASR struct {
@@ -290,10 +302,36 @@ func (w *CrispASR) AudioTranscription(ctx context.Context, opts *pb.TranscriptRe
 		// IDs, so Tokens is left empty.
 		txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "<22>")

+		// Populate word-level timestamps. Try session-based functions first
+		// (per-segment); fall back to parakeet-specific functions (global word
+		// list with no segment index — only populated on the first segment to
+		// avoid duplication).
+		words := []*pb.TranscriptWord{}
+		wordCount := CppGetWordCount(i)
+		if wordCount == 0 && i == 0 {
+			wordCount = CppGetParakeetWordCount()
+			for j := 0; j < wordCount; j++ {
+				words = append(words, &pb.TranscriptWord{
+					Start: CppGetParakeetWordT0(j) * (10000000),
+					End:   CppGetParakeetWordT1(j) * (10000000),
+					Text:  strings.ToValidUTF8(strings.Clone(CppGetParakeetWordText(j)), "<22>"),
+				})
+			}
+		} else {
+			for j := 0; j < wordCount; j++ {
+				words = append(words, &pb.TranscriptWord{
+					Start: CppGetWordT0(i, j) * (10000000),
+					End:   CppGetWordT1(i, j) * (10000000),
+					Text:  strings.ToValidUTF8(strings.Clone(CppGetWordText(i, j)), "<22>"),
+				})
+			}
+		}
+
 		segment := &pb.TranscriptSegment{
 			Id:    int32(i),
 			Text:  txt,
 			Start: s, End: t,
+			Words: words,
 		}

 		segments = append(segments, segment)
--- a/backend/go/crispasr/main.go
+++ b/backend/go/crispasr/main.go
@@ -44,6 +44,14 @@ func main() {
 		{&CppTTSFree, "tts_free"},
 		{&CppTTSSetVoice, "tts_set_voice"},
 		{&CppTTSSetVoiceFile, "tts_set_voice_file"},
+		{&CppGetWordCount, "get_word_count"},
+		{&CppGetWordText, "get_word_text"},
+		{&CppGetWordT0, "get_word_t0"},
+		{&CppGetWordT1, "get_word_t1"},
+		{&CppGetParakeetWordCount, "get_parakeet_word_count"},
+		{&CppGetParakeetWordText, "get_parakeet_word_text"},
+		{&CppGetParakeetWordT0, "get_parakeet_word_t0"},
+		{&CppGetParakeetWordT1, "get_parakeet_word_t1"},
 	}

 	for _, lf := range libFuncs {
--- a/backend/go/depth-anything-cpp/.gitignore
+++ b/backend/go/depth-anything-cpp/.gitignore
@@ -0,0 +1,7 @@
+sources/
+build*/
+package/
+libdepthanythingcpp*.so
+depth-anything-cpp
+test-models/
+test-data/
--- a/backend/go/depth-anything-cpp/CMakeLists.txt
+++ b/backend/go/depth-anything-cpp/CMakeLists.txt
@@ -0,0 +1,28 @@
+cmake_minimum_required(VERSION 3.18)
+project(libdepthanythingcpp LANGUAGES C CXX)
+
+set(CMAKE_POSITION_INDEPENDENT_CODE ON)
+set(CMAKE_CXX_STANDARD 17)
+set(CMAKE_CXX_STANDARD_REQUIRED ON)
+
+# Static-link ggml into the depth-anything shared library so the resulting .so
+# has no runtime dependency on an external libggml — only on
+# libc/libstdc++/libgomp, which the LocalAI package step bundles into the
+# docker image.
+set(BUILD_SHARED_LIBS OFF CACHE BOOL "Build static libraries" FORCE)
+
+# depth-anything.cpp build switches: skip CLI/tests, but build libdepthanything
+# itself as a SHARED library (DA_SHARED) while ggml stays static
+# (BUILD_SHARED_LIBS OFF above). The da_capi_* C ABI is compiled into
+# src/da_capi.cpp and re-exported by that shared library, so no extra MODULE
+# wrapper is needed (unlike locate-anything.cpp).
+set(DA_BUILD_CLI OFF CACHE BOOL "Disable depth-anything CLI" FORCE)
+set(DA_BUILD_TESTS OFF CACHE BOOL "Disable depth-anything tests" FORCE)
+set(DA_SHARED ON CACHE BOOL "Build libdepthanything as a shared lib" FORCE)
+
+add_subdirectory(./sources/depth-anything.cpp)
+
+# Emit libdepthanything.so into the top-level build dir so the Makefile can
+# rename it to the per-variant libdepthanythingcpp-<variant>.so.
+set_target_properties(depthanything PROPERTIES
+    LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
--- a/backend/go/depth-anything-cpp/Makefile
+++ b/backend/go/depth-anything-cpp/Makefile
@@ -0,0 +1,139 @@
+CMAKE_ARGS?=
+BUILD_TYPE?=
+NATIVE?=false
+
+GOCMD?=go
+GO_TAGS?=
+JOBS?=$(shell nproc --ignore=1)
+
+# depth-anything.cpp. Pin to a specific commit for a stable build; a squash
+# merge upstream can orphan a branch, so the native version is pinned by SHA.
+# This SHA adds the nested two-file metric C-API (abi_version 4,
+# da_capi_load_nested) required by the depth-anything-3-nested gallery model;
+# tag it (e.g. v0.1.3) upstream to keep the SHA alive.
+DEPTHANYTHING_REPO?=https://github.com/mudler/depth-anything.cpp.git
+DEPTHANYTHING_VERSION?=cce5edc395fd1843806093d7ccc0c8b0d0b97b72
+
+ifeq ($(NATIVE),false)
+	CMAKE_ARGS+=-DGGML_NATIVE=OFF
+endif
+
+# Forward LocalAI's BUILD_TYPE to the matching ggml backend switch. depth-anything.cpp
+# force-sets GGML_CUDA/GGML_VULKAN/GGML_METAL from its own DA_GGML_* options, so
+# those must be toggled via the DA_GGML_* names (a bare -DGGML_CUDA=ON would be
+# overridden); the remaining ggml switches pass straight through.
+ifeq ($(BUILD_TYPE),cublas)
+	CMAKE_ARGS+=-DGGML_CUDA=ON -DDA_GGML_CUDA=ON
+else ifeq ($(BUILD_TYPE),openblas)
+	CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
+else ifeq ($(BUILD_TYPE),clblas)
+	CMAKE_ARGS+=-DGGML_CLBLAST=ON
+else ifeq ($(BUILD_TYPE),hipblas)
+	ROCM_HOME ?= /opt/rocm
+	ROCM_PATH ?= /opt/rocm
+	export CXX=$(ROCM_HOME)/llvm/bin/clang++
+	export CC=$(ROCM_HOME)/llvm/bin/clang
+	AMDGPU_TARGETS?=gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1200,gfx1201
+	CMAKE_ARGS+=-DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
+else ifeq ($(BUILD_TYPE),vulkan)
+	CMAKE_ARGS+=-DGGML_VULKAN=ON -DDA_GGML_VULKAN=ON
+else ifeq ($(OS),Darwin)
+	ifneq ($(BUILD_TYPE),metal)
+		CMAKE_ARGS+=-DGGML_METAL=OFF
+	else
+		CMAKE_ARGS+=-DGGML_METAL=ON
+		CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
+		CMAKE_ARGS+=-DDA_GGML_METAL=ON
+	endif
+endif
+
+ifeq ($(BUILD_TYPE),sycl_f16)
+	CMAKE_ARGS+=-DGGML_SYCL=ON \
+		-DCMAKE_C_COMPILER=icx \
+		-DCMAKE_CXX_COMPILER=icpx \
+		-DGGML_SYCL_F16=ON
+endif
+
+ifeq ($(BUILD_TYPE),sycl_f32)
+	CMAKE_ARGS+=-DGGML_SYCL=ON \
+		-DCMAKE_C_COMPILER=icx \
+		-DCMAKE_CXX_COMPILER=icpx
+endif
+
+sources/depth-anything.cpp:
+	mkdir -p sources && \
+	git clone --recursive $(DEPTHANYTHING_REPO) sources/depth-anything.cpp && \
+	cd sources/depth-anything.cpp && \
+	git checkout $(DEPTHANYTHING_VERSION) && \
+	git submodule update --init --recursive --depth 1 --single-branch
+
+# Detect OS
+UNAME_S := $(shell uname -s)
+
+# Only build CPU variants on Linux
+ifeq ($(UNAME_S),Linux)
+	VARIANT_TARGETS = libdepthanythingcpp-avx.so libdepthanythingcpp-avx2.so libdepthanythingcpp-avx512.so libdepthanythingcpp-fallback.so
+else
+	# On non-Linux (e.g., Darwin), build only fallback variant
+	VARIANT_TARGETS = libdepthanythingcpp-fallback.so
+endif
+
+depth-anything-cpp: main.go godepthanythingcpp.go $(VARIANT_TARGETS)
+	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o depth-anything-cpp ./
+
+package: depth-anything-cpp
+	bash package.sh
+
+build: package
+
+clean: purge
+	rm -rf libdepthanythingcpp*.so depth-anything-cpp package sources
+
+purge:
+	rm -rf build*
+
+# Build all variants (Linux only)
+ifeq ($(UNAME_S),Linux)
+libdepthanythingcpp-avx.so: sources/depth-anything.cpp
+	rm -rfv build-$@
+	$(info ${GREEN}I depth-anything-cpp build info:avx${RESET})
+	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libdepthanythingcpp-custom
+	rm -rfv build-$@
+
+libdepthanythingcpp-avx2.so: sources/depth-anything.cpp
+	rm -rfv build-$@
+	$(info ${GREEN}I depth-anything-cpp build info:avx2${RESET})
+	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libdepthanythingcpp-custom
+	rm -rfv build-$@
+
+libdepthanythingcpp-avx512.so: sources/depth-anything.cpp
+	rm -rfv build-$@
+	$(info ${GREEN}I depth-anything-cpp build info:avx512${RESET})
+	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libdepthanythingcpp-custom
+	rm -rfv build-$@
+endif
+
+# Build fallback variant (all platforms)
+libdepthanythingcpp-fallback.so: sources/depth-anything.cpp
+	rm -rfv build-$@
+	$(info ${GREEN}I depth-anything-cpp build info:fallback${RESET})
+	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libdepthanythingcpp-custom
+	rm -rfv build-$@
+
+libdepthanythingcpp-custom: CMakeLists.txt
+	mkdir -p build-$(SO_TARGET) && \
+	cd build-$(SO_TARGET) && \
+	cmake .. $(CMAKE_ARGS) && \
+	cmake --build . --config Release -j$(JOBS) && \
+	cd .. && \
+	mv build-$(SO_TARGET)/libdepthanything.so ./$(SO_TARGET)
+
+all: depth-anything-cpp package
+
+# `test` is invoked by the top-level Makefile's `test-extra` target. It builds
+# the backend binary + the fallback shared library (needed for dlopen at
+# runtime), then runs test.sh which downloads a small GGUF + a test image and
+# exercises the gRPC Load/Predict wire path via the Go smoke test in
+# main_test.go.
+test: depth-anything-cpp libdepthanythingcpp-fallback.so
+	bash test.sh
--- a/backend/go/depth-anything-cpp/godepthanythingcpp.go
+++ b/backend/go/depth-anything-cpp/godepthanythingcpp.go
@@ -0,0 +1,556 @@
+package main
+
+// godepthanythingcpp.go - gRPC handlers (Load, Predict, GenerateImage) for the
+// depth-anything-cpp backend, wrapping the Depth Anything 3 ggml C-API
+// (libdepthanythingcpp-<variant>.so) via purego.
+//
+// Embeds base.SingleThread to default the unimplemented RPCs to "not supported"
+// and to serialize calls — the C side shares a ggml graph allocator and is NOT
+// reentrant, so all inference must run one-at-a-time.
+//
+// Depth has no native OpenAI endpoint, so the model is exposed two ways:
+//
+//   - GenerateImage(src, dst): run depth on the src image and write a
+//     min-max-normalised grayscale depth PNG to dst.
+//   - Predict(images[0]): run depth+pose and return a JSON blob with the depth
+//     dimensions, depth stats and the camera extrinsics (3x4) / intrinsics (3x3).
+
+import (
+	"encoding/base64"
+	"encoding/json"
+	"fmt"
+	"image"
+	"image/png"
+	"math"
+	"os"
+	"path/filepath"
+	"strings"
+	"unsafe"
+
+	"github.com/mudler/LocalAI/pkg/grpc/base"
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+)
+
+// C-API function pointers, registered in main.go via purego. The da_capi_*
+// symbols live inside libdepthanything (src/da_capi.cpp) and are re-exported by
+// the DA_SHARED build.
+var (
+	// da_capi_load(const char* gguf_path, int n_threads) -> da_ctx* (0 = fail)
+	CapiLoad func(gguf string, nThreads int32) uintptr
+	// da_capi_load_nested(const char* anyview_gguf, const char* metric_gguf,
+	//   int n_threads) -> da_ctx* (0 = fail). The returned ctx serves the nested
+	//   metric model: depth/pose calls produce final metric-scale depth + scaled pose.
+	CapiLoadNested func(anyview string, metric string, nThreads int32) uintptr
+	// da_capi_free(da_ctx* ctx) — safe on a 0 handle.
+	CapiFree func(handle uintptr)
+	// da_capi_last_error(da_ctx* ctx) -> const char* (owned by ctx, "" if none).
+	// purego marshals the returned C string into a Go string (a copy), so we
+	// never free it.
+	CapiLastError func(handle uintptr) string
+	// da_capi_depth_path(ctx, image_path, out_h*, out_w*) -> float* depth map
+	// (row-major H*W); nil on error. Caller frees via da_capi_free_floats.
+	CapiDepthPath func(handle uintptr, imagePath string, outH *int32, outW *int32) *float32
+	// da_capi_free_floats(float* p)
+	CapiFreeFloats func(p *float32)
+	// da_capi_pose_path(ctx, image_path, out_ext[12], out_intr[9]) -> 0 ok, -1 err
+	CapiPosePath func(handle uintptr, imagePath string, outExt *float32, outIntr *float32) int32
+	// da_capi_depth_dense(ctx, image_path, out_h*, out_w*, out_depth**, out_conf**,
+	//   out_sky**, out_ext[12], out_intr[9], out_is_metric*) -> 0 ok, -1 err.
+	// Each non-NULL out_depth/out_conf/out_sky receives a malloc'd float[H*W] (free
+	// via da_capi_free_floats); buffers the model doesn't produce are set NULL.
+	CapiDepthDense func(handle uintptr, imagePath string,
+		outH, outW *int32,
+		outDepth, outConf, outSky **float32,
+		outExt, outIntr *float32,
+		outIsMetric *int32) int32
+	// da_capi_points(ctx, image_path, conf_thresh, out_n*, out_xyz**, out_rgb**) ->
+	//   0 ok, -1 err. *out_xyz = malloc'd float[3*N] (free via da_capi_free_floats),
+	//   *out_rgb = malloc'd uint8[3*N] (free via da_capi_free_bytes).
+	CapiPoints func(handle uintptr, imagePath string, confThresh float32,
+		outN *int32, outXyz **float32, outRgb **byte) int32
+	// da_capi_free_bytes(unsigned char* p)
+	CapiFreeBytes func(p *byte)
+	// da_capi_export_glb(ctx, image_path, out_glb) -> 0 ok, -1 err
+	CapiExportGlb func(handle uintptr, imagePath string, outGlb string) int32
+	// da_capi_export_colmap(ctx, image_path, out_dir, binary) -> 0 ok, -1 err
+	CapiExportColmap func(handle uintptr, imagePath string, outDir string, binary int32) int32
+)
+
+type DepthAnythingCpp struct {
+	base.SingleThread
+	handle uintptr
+}
+
+// Load loads the GGUF model at opts.ModelFile (joined with opts.ModelPath if
+// relative) and stores the da_ctx handle for later inference calls.
+func (r *DepthAnythingCpp) Load(opts *pb.ModelOptions) error {
+	modelFile := opts.ModelFile
+	if modelFile == "" {
+		modelFile = opts.Model
+	}
+	if modelFile == "" {
+		return fmt.Errorf("depth-anything-cpp: ModelFile is empty")
+	}
+
+	resolve := func(name string) string {
+		if filepath.IsAbs(name) {
+			return name
+		}
+		return filepath.Join(opts.ModelPath, name)
+	}
+	modelPath := resolve(modelFile)
+
+	if _, err := os.Stat(modelPath); err != nil {
+		return fmt.Errorf("depth-anything-cpp: model file not found: %s: %w", modelPath, err)
+	}
+
+	// Nested metric models are a two-file pair: the main model is the anyview
+	// (GIANT) branch and the metric (ViT-L + DPT/sky) branch is named via a
+	// "metric_model:<filename>" entry in opts.Options. When present we load both
+	// branches so the engine runs the nested metric alignment.
+	metricFile := optionValue(opts.Options, "metric_model")
+
+	threads := opts.Threads
+	if threads <= 0 {
+		threads = 4
+	}
+
+	// Release previous model if any (re-Load).
+	if r.handle != 0 {
+		CapiFree(r.handle)
+		r.handle = 0
+	}
+
+	var h uintptr
+	if metricFile != "" {
+		metricPath := resolve(metricFile)
+		if _, err := os.Stat(metricPath); err != nil {
+			return fmt.Errorf("depth-anything-cpp: metric_model file not found: %s: %w", metricPath, err)
+		}
+		h = CapiLoadNested(modelPath, metricPath, threads)
+		if h == 0 {
+			if msg := CapiLastError(0); msg != "" {
+				return fmt.Errorf("depth-anything-cpp: da_capi_load_nested failed for %s + %s: %s", modelPath, metricPath, msg)
+			}
+			return fmt.Errorf("depth-anything-cpp: da_capi_load_nested failed for %s + %s", modelPath, metricPath)
+		}
+	} else {
+		h = CapiLoad(modelPath, threads)
+		if h == 0 {
+			// da_capi_last_error needs a ctx; on a failed load we have none (it
+			// returns "" for a null ctx), so the text is best-effort.
+			if msg := CapiLastError(0); msg != "" {
+				return fmt.Errorf("depth-anything-cpp: da_capi_load failed for %s: %s", modelPath, msg)
+			}
+			return fmt.Errorf("depth-anything-cpp: da_capi_load failed for %s", modelPath)
+		}
+	}
+	r.handle = h
+	return nil
+}
+
+// optionValue returns the value of the first "key:value" entry in opts whose key
+// matches (case-sensitive), or "" if absent. Mirrors how other LocalAI backends
+// read ModelOptions.Options.
+func optionValue(opts []string, key string) string {
+	prefix := key + ":"
+	for _, o := range opts {
+		if strings.HasPrefix(o, prefix) {
+			return strings.TrimSpace(o[len(prefix):])
+		}
+	}
+	return ""
+}
+
+// depthResult is the JSON payload returned by Predict.
+type depthResult struct {
+	DepthW     int         `json:"depth_w"`
+	DepthH     int         `json:"depth_h"`
+	DepthMin   float32     `json:"depth_min"`
+	DepthMax   float32     `json:"depth_max"`
+	Extrinsics [12]float32 `json:"extrinsics"` // 3x4 row-major
+	Intrinsics [9]float32  `json:"intrinsics"` // 3x3 row-major
+}
+
+// Predict runs depth+pose on the first supplied image and returns depth
+// statistics + camera pose as a JSON string. LocalAI wraps the string into the
+// Reply.Message of the gRPC response. The image in Images[0] may be a
+// filesystem path or a base64-encoded payload.
+func (r *DepthAnythingCpp) Predict(opts *pb.PredictOptions) (string, error) {
+	imgs := opts.GetImages()
+	if len(imgs) == 0 {
+		return "", fmt.Errorf("depth-anything-cpp: Predict requires an image in Images[]")
+	}
+
+	imgPath, cleanup, err := materializeImage(imgs[0])
+	if err != nil {
+		return "", fmt.Errorf("depth-anything-cpp: %w", err)
+	}
+	defer cleanup()
+
+	depth, h, w, ext, intr, err := r.runDepthPose(imgPath)
+	if err != nil {
+		return "", err
+	}
+
+	dmin, dmax := minMax(depth)
+	payload, err := json.Marshal(depthResult{
+		DepthW: w, DepthH: h,
+		DepthMin: dmin, DepthMax: dmax,
+		Extrinsics: ext, Intrinsics: intr,
+	})
+	if err != nil {
+		return "", fmt.Errorf("depth-anything-cpp: marshal: %w", err)
+	}
+	return string(payload), nil
+}
+
+// GenerateImage runs depth on req.Src and writes a normalised grayscale depth
+// PNG to req.Dst.
+func (r *DepthAnythingCpp) GenerateImage(req *pb.GenerateImageRequest) error {
+	if req.GetSrc() == "" {
+		return fmt.Errorf("depth-anything-cpp: GenerateImage requires src")
+	}
+	if req.GetDst() == "" {
+		return fmt.Errorf("depth-anything-cpp: GenerateImage requires dst")
+	}
+
+	imgPath, cleanup, err := materializeImage(req.GetSrc())
+	if err != nil {
+		return fmt.Errorf("depth-anything-cpp: %w", err)
+	}
+	defer cleanup()
+
+	depth, h, w, _, _, err := r.runDepthPose(imgPath)
+	if err != nil {
+		return err
+	}
+	return writeDepthPNG(req.GetDst(), depth, h, w)
+}
+
+// Depth is the typed Depth RPC. It runs the Depth Anything 3 pipeline on the
+// request's src image and fills a DepthResponse honoring the include_* flags and
+// exports: per-pixel metric depth + confidence (DualDPT) or depth + sky (mono),
+// camera extrinsics/intrinsics, an optional back-projected 3D point cloud and
+// glb/COLMAP exports. The src may be a filesystem path or a base64 payload.
+func (r *DepthAnythingCpp) Depth(in *pb.DepthRequest) (pb.DepthResponse, error) {
+	// Accumulate into locals and return a single composite literal at the end:
+	// returning a named pb.DepthResponse value would copy its embedded mutex
+	// (go vet copylocks).
+	if r.handle == 0 {
+		return pb.DepthResponse{}, fmt.Errorf("depth-anything-cpp: model not loaded")
+	}
+	if in.GetSrc() == "" {
+		return pb.DepthResponse{}, fmt.Errorf("depth-anything-cpp: Depth requires src")
+	}
+
+	imgPath, cleanup, err := materializeImage(in.GetSrc())
+	if err != nil {
+		return pb.DepthResponse{}, fmt.Errorf("depth-anything-cpp: %w", err)
+	}
+	defer cleanup()
+
+	// Dense per-pixel output + pose. Pass buffer pointers only for the
+	// requested maps so the native side can skip unrequested work; ext/intr
+	// must always point at 12/9 floats per the C ABI.
+	var (
+		h, w, isMetric      int32
+		depthPtr, confPtr   *float32
+		skyPtr              *float32
+		ext                 [12]float32
+		intr                [9]float32
+		pDepth, pConf, pSky **float32
+	)
+	if in.GetIncludeDepth() {
+		pDepth = &depthPtr
+	}
+	if in.GetIncludeConfidence() {
+		pConf = &confPtr
+	}
+	if in.GetIncludeSky() {
+		pSky = &skyPtr
+	}
+
+	rc := CapiDepthDense(r.handle, imgPath, &h, &w, pDepth, pConf, pSky, &ext[0], &intr[0], &isMetric)
+	if rc != 0 {
+		return pb.DepthResponse{}, fmt.Errorf("depth-anything-cpp: da_capi_depth_dense failed (rc=%d): %s", rc, r.lastError())
+	}
+
+	n := int(h) * int(w)
+	var (
+		depth, conf, sky      []float32
+		extrinsics, intrinsic []float32
+		numPoints             int32
+		points                []float32
+		pointColors           []byte
+		exportPaths           []string
+	)
+
+	if depthPtr != nil {
+		depth = copyFloats(depthPtr, n)
+		CapiFreeFloats(depthPtr)
+	}
+	if confPtr != nil {
+		conf = copyFloats(confPtr, n)
+		CapiFreeFloats(confPtr)
+	}
+	if skyPtr != nil {
+		sky = copyFloats(skyPtr, n)
+		CapiFreeFloats(skyPtr)
+	}
+	if in.GetIncludePose() {
+		extrinsics = append([]float32(nil), ext[:]...)
+		intrinsic = append([]float32(nil), intr[:]...)
+	}
+
+	// 3D point cloud (DualDPT / pose-capable models only).
+	if in.GetIncludePoints() {
+		var (
+			np     int32
+			xyzPtr *float32
+			rgbPtr *byte
+		)
+		if rc := CapiPoints(r.handle, imgPath, in.GetPointsConfThresh(), &np, &xyzPtr, &rgbPtr); rc != 0 {
+			return pb.DepthResponse{}, fmt.Errorf("depth-anything-cpp: da_capi_points failed (rc=%d): %s", rc, r.lastError())
+		}
+		numPoints = np
+		if xyzPtr != nil {
+			points = copyFloats(xyzPtr, int(np)*3)
+			CapiFreeFloats(xyzPtr)
+		}
+		if rgbPtr != nil {
+			pointColors = copyBytes(rgbPtr, int(np)*3)
+			CapiFreeBytes(rgbPtr)
+		}
+	}
+
+	// Exports (glb / colmap). They are written under in.Dst (a directory); a
+	// temp dir is used when Dst is empty.
+	if len(in.GetExports()) > 0 {
+		exportPaths, err = r.runExports(imgPath, in.GetDst(), in.GetExports())
+		if err != nil {
+			return pb.DepthResponse{}, err
+		}
+	}
+
+	return pb.DepthResponse{
+		Width:       w,
+		Height:      h,
+		Depth:       depth,
+		Confidence:  conf,
+		Sky:         sky,
+		Extrinsics:  extrinsics,
+		Intrinsics:  intrinsic,
+		NumPoints:   numPoints,
+		Points:      points,
+		PointColors: pointColors,
+		ExportPaths: exportPaths,
+		IsMetric:    isMetric != 0,
+	}, nil
+}
+
+// runExports writes the requested exports for imgPath into dstDir and returns
+// the written paths. Supported exports: "glb", "colmap".
+func (r *DepthAnythingCpp) runExports(imgPath, dstDir string, exports []string) ([]string, error) {
+	if dstDir == "" {
+		tmp, err := os.MkdirTemp("", "depth-anything-export-*")
+		if err != nil {
+			return nil, fmt.Errorf("depth-anything-cpp: mkdir export dir: %w", err)
+		}
+		dstDir = tmp
+	} else if err := os.MkdirAll(dstDir, 0o750); err != nil {
+		return nil, fmt.Errorf("depth-anything-cpp: mkdir %s: %w", dstDir, err)
+	}
+
+	var paths []string
+	for _, exp := range exports {
+		switch exp {
+		case "glb":
+			out := filepath.Join(dstDir, "pointcloud.glb")
+			if rc := CapiExportGlb(r.handle, imgPath, out); rc != 0 {
+				return nil, fmt.Errorf("depth-anything-cpp: da_capi_export_glb failed (rc=%d): %s", rc, r.lastError())
+			}
+			paths = append(paths, out)
+		case "colmap":
+			out := filepath.Join(dstDir, "colmap")
+			if err := os.MkdirAll(out, 0o750); err != nil {
+				return nil, fmt.Errorf("depth-anything-cpp: mkdir %s: %w", out, err)
+			}
+			if rc := CapiExportColmap(r.handle, imgPath, out, 1); rc != 0 {
+				return nil, fmt.Errorf("depth-anything-cpp: da_capi_export_colmap failed (rc=%d): %s", rc, r.lastError())
+			}
+			paths = append(paths, out)
+		default:
+			return nil, fmt.Errorf("depth-anything-cpp: unknown export %q (want glb|colmap)", exp)
+		}
+	}
+	return paths, nil
+}
+
+// copyFloats copies n float32 values from a C heap pointer into a fresh Go
+// slice so the C buffer can be freed afterwards.
+func copyFloats(p *float32, n int) []float32 {
+	if p == nil || n <= 0 {
+		return nil
+	}
+	src := unsafe.Slice(p, n)
+	out := make([]float32, n)
+	copy(out, src)
+	return out
+}
+
+// copyBytes copies n bytes from a C heap pointer into a fresh Go slice.
+func copyBytes(p *byte, n int) []byte {
+	if p == nil || n <= 0 {
+		return nil
+	}
+	src := unsafe.Slice(p, n)
+	out := make([]byte, n)
+	copy(out, src)
+	return out
+}
+
+// runDepthPose runs depth estimation and then pose recovery on an image file
+// via two C-API calls (da_capi_depth_path, then da_capi_pose_path). It returns
+// the row-major depth map (length h*w), its dimensions, the 3x4 extrinsics
+// (12 floats) and the 3x3 intrinsics (9 floats). For a nested metric model both
+// calls run the full two-branch pipeline, so this path infers twice; the typed
+// Depth RPC (a single da_capi_depth_dense call) is the efficient path for
+// nested models.
+func (r *DepthAnythingCpp) runDepthPose(imagePath string) (depth []float32, h, w int, ext [12]float32, intr [9]float32, err error) {
+	if r.handle == 0 {
+		err = fmt.Errorf("depth-anything-cpp: model not loaded")
+		return
+	}
+
+	var ch, cw int32
+	ptr := CapiDepthPath(r.handle, imagePath, &ch, &cw)
+	if ptr == nil {
+		err = fmt.Errorf("depth-anything-cpp: da_capi_depth_path failed: %s", r.lastError())
+		return
+	}
+	h, w = int(ch), int(cw)
+	n := h * w
+	if n > 0 {
+		src := unsafe.Slice(ptr, n)
+		depth = make([]float32, n)
+		copy(depth, src)
+	}
+	CapiFreeFloats(ptr)
+
+	if rc := CapiPosePath(r.handle, imagePath, &ext[0], &intr[0]); rc != 0 {
+		err = fmt.Errorf("depth-anything-cpp: da_capi_pose_path failed (rc=%d): %s", rc, r.lastError())
+		return
+	}
+	return
+}
+
+// lastError returns the context's last error string, or "" if none.
+func (r *DepthAnythingCpp) lastError() string {
+	if CapiLastError == nil || r.handle == 0 {
+		return ""
+	}
+	return CapiLastError(r.handle)
+}
+
+// materializeImage returns a filesystem path for an image argument that may be
+// either an existing path or a base64-encoded payload. When the input is
+// base64 it is decoded into a temp file; cleanup removes it (no-op for a path).
+func materializeImage(arg string) (path string, cleanup func(), err error) {
+	cleanup = func() {}
+	if _, statErr := os.Stat(arg); statErr == nil {
+		return arg, cleanup, nil
+	}
+	// Strip an optional data URL prefix (data:image/...;base64,<payload>).
+	b64 := arg
+	if i := indexComma(b64); i >= 0 && hasDataPrefix(b64) {
+		b64 = b64[i+1:]
+	}
+	data, decErr := base64.StdEncoding.DecodeString(b64)
+	if decErr != nil {
+		return "", cleanup, fmt.Errorf("image is neither an existing path nor valid base64: %v", decErr)
+	}
+	f, tErr := os.CreateTemp("", "depth-anything-*.img")
+	if tErr != nil {
+		return "", cleanup, tErr
+	}
+	if _, wErr := f.Write(data); wErr != nil {
+		_ = f.Close()
+		_ = os.Remove(f.Name())
+		return "", cleanup, wErr
+	}
+	_ = f.Close()
+	name := f.Name()
+	return name, func() { _ = os.Remove(name) }, nil
+}
+
+func hasDataPrefix(s string) bool {
+	return len(s) >= 5 && s[:5] == "data:"
+}
+
+func indexComma(s string) int {
+	for i := 0; i < len(s); i++ {
+		if s[i] == ',' {
+			return i
+		}
+	}
+	return -1
+}
+
+// writeDepthPNG min-max normalises a depth map and writes it as an 8-bit
+// grayscale PNG. Larger values are brighter (max = 255, min = 0), so near is
+// bright only for inverse-depth-like outputs; metric depth renders near as dark.
+func writeDepthPNG(dst string, depth []float32, h, w int) error {
+	if h <= 0 || w <= 0 || len(depth) < h*w {
+		return fmt.Errorf("depth-anything-cpp: writeDepthPNG: bad dims h=%d w=%d len=%d", h, w, len(depth))
+	}
+	dmin, dmax := minMax(depth)
+	span := dmax - dmin
+	if span <= 0 || math.IsNaN(float64(span)) {
+		span = 1
+	}
+	img := image.NewGray(image.Rect(0, 0, w, h))
+	for y := 0; y < h; y++ {
+		for x := 0; x < w; x++ {
+			v := depth[y*w+x]
+			n := (v - dmin) / span // 0..1
+			if math.IsNaN(float64(n)) {
+				n = 0
+			}
+			if n < 0 {
+				n = 0
+			} else if n > 1 {
+				n = 1
+			}
+			img.Pix[y*img.Stride+x] = uint8(n * 255)
+		}
+	}
+	// dst is the gRPC-provided output path chosen by the LocalAI core (the
+	// intended write destination for the rendered depth map), not
+	// attacker-controlled input, so the variable path is expected here.
+	f, err := os.Create(dst) // #nosec G304
+	if err != nil {
+		return err
+	}
+	defer func() { _ = f.Close() }()
+	return png.Encode(f, img)
+}
+
+func minMax(v []float32) (mn, mx float32) {
+	mn, mx = float32(math.Inf(1)), float32(math.Inf(-1)) // not v[0]: a NaN/Inf seed would stick
+	for _, x := range v {
+		if math.IsNaN(float64(x)) || math.IsInf(float64(x), 0) {
+			continue
+		}
+		if x < mn {
+			mn = x
+		}
+		if x > mx {
+			mx = x
+		}
+	}
+	if mn > mx { // empty input or no finite values
+		return 0, 0
+	}
+	return mn, mx
+}
--- a/backend/go/depth-anything-cpp/main.go
+++ b/backend/go/depth-anything-cpp/main.go
@@ -0,0 +1,62 @@
+package main
+
+// main.go - entry point for the depth-anything-cpp gRPC backend.
+//
+// Dlopens libdepthanythingcpp-<variant>.so via purego at the path in
+// DEPTHANYTHING_LIBRARY (set by run.sh based on /proc/cpuinfo), registers the
+// da_capi_* C ABI symbols, then starts the gRPC server.
+
+import (
+	"flag"
+	"os"
+
+	"github.com/ebitengine/purego"
+	grpc "github.com/mudler/LocalAI/pkg/grpc"
+)
+
+var (
+	addr = flag.String("addr", "localhost:50051", "the address to listen on")
+)
+
+type LibFuncs struct {
+	FuncPtr any
+	Name    string
+}
+
+func main() {
+	// Get library name from environment variable, default to fallback
+	libName := os.Getenv("DEPTHANYTHING_LIBRARY")
+	if libName == "" {
+		libName = "./libdepthanythingcpp-fallback.so"
+	}
+
+	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
+	if err != nil {
+		panic(err)
+	}
+
+	libFuncs := []LibFuncs{
+		{&CapiLoad, "da_capi_load"},
+		{&CapiLoadNested, "da_capi_load_nested"},
+		{&CapiFree, "da_capi_free"},
+		{&CapiLastError, "da_capi_last_error"},
+		{&CapiDepthPath, "da_capi_depth_path"},
+		{&CapiFreeFloats, "da_capi_free_floats"},
+		{&CapiPosePath, "da_capi_pose_path"},
+		{&CapiDepthDense, "da_capi_depth_dense"},
+		{&CapiPoints, "da_capi_points"},
+		{&CapiFreeBytes, "da_capi_free_bytes"},
+		{&CapiExportGlb, "da_capi_export_glb"},
+		{&CapiExportColmap, "da_capi_export_colmap"},
+	}
+
+	for _, lf := range libFuncs {
+		purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
+	}
+
+	flag.Parse()
+
+	if err := grpc.StartServer(*addr, &DepthAnythingCpp{}); err != nil {
+		panic(err)
+	}
+}
--- a/backend/go/depth-anything-cpp/main_test.go
+++ b/backend/go/depth-anything-cpp/main_test.go
@@ -0,0 +1,167 @@
+package main
+
+// main_test.go - end-to-end smoke test for the depth-anything-cpp gRPC backend.
+//
+// Spawns the compiled depth-anything-cpp binary on a free local port, dials it
+// via gRPC, and exercises LoadModel + Predict against the test fixtures
+// downloaded by test.sh: the small (vits) f32 GGUF of Depth Anything 3 and a
+// real photo. Asserts that Predict returns a JSON payload with a positive
+// depth-map width/height.
+//
+// The spec Skip()s cleanly if its fixtures (the model, the test image, the
+// built binary, or the fallback .so) are missing, so the test target stays
+// usable on a fresh checkout / on CI runners where the model hasn't been
+// downloaded.
+
+import (
+	"context"
+	"encoding/base64"
+	"encoding/json"
+	"fmt"
+	"net"
+	"os"
+	"os/exec"
+	"path/filepath"
+	"testing"
+	"time"
+
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+	"google.golang.org/grpc"
+	"google.golang.org/grpc/credentials/insecure"
+)
+
+func TestDepth(t *testing.T) {
+	RegisterFailHandler(Fail)
+	RunSpecs(t, "depth-anything-cpp backend smoke suite")
+}
+
+// freePort grabs an ephemeral TCP port and immediately releases it so the
+// spawned backend can bind to it. There is a tiny TOCTOU window here but in
+// practice it's adequate for a smoke test on a quiet runner.
+func freePort() int {
+	l, err := net.Listen("tcp", "127.0.0.1:0")
+	Expect(err).ToNot(HaveOccurred(), "freePort listen")
+	port := l.Addr().(*net.TCPAddr).Port
+	Expect(l.Close()).To(Succeed())
+	return port
+}
+
+// startBackend spawns the depth-anything-cpp binary on the given port and waits
+// until it accepts TCP connections (up to 10s). It mirrors how main.go resolves
+// the purego library: the DEPTHANYTHING_LIBRARY env var points the dlopen at the
+// freshly built fallback .so. The returned cleanup func kills the process.
+func startBackend(port int) func() {
+	binary, err := filepath.Abs("./depth-anything-cpp")
+	Expect(err).ToNot(HaveOccurred())
+	if _, err := os.Stat(binary); err != nil {
+		Skip(fmt.Sprintf("backend binary not built: %s (run `make depth-anything-cpp` first)", binary))
+	}
+
+	libPath, err := filepath.Abs("./libdepthanythingcpp-fallback.so")
+	Expect(err).ToNot(HaveOccurred())
+	if _, err := os.Stat(libPath); err != nil {
+		Skip(fmt.Sprintf("fallback library not built: %s (run `make libdepthanythingcpp-fallback.so` first)", libPath))
+	}
+
+	addr := fmt.Sprintf("127.0.0.1:%d", port)
+	cmd := exec.Command(binary, "--addr", addr)
+	cmd.Env = append(os.Environ(), "DEPTHANYTHING_LIBRARY="+libPath)
+	cmd.Stdout = os.Stderr
+	cmd.Stderr = os.Stderr
+	Expect(cmd.Start()).To(Succeed())
+
+	cleanup := func() {
+		if cmd.Process != nil {
+			_ = cmd.Process.Kill()
+			_, _ = cmd.Process.Wait()
+		}
+	}
+
+	deadline := time.Now().Add(10 * time.Second)
+	for time.Now().Before(deadline) {
+		c, err := net.DialTimeout("tcp", addr, 200*time.Millisecond)
+		if err == nil {
+			_ = c.Close()
+			return cleanup
+		}
+		time.Sleep(200 * time.Millisecond)
+	}
+
+	cleanup()
+	Fail(fmt.Sprintf("backend did not become ready on %s within 10s", addr))
+	return func() {}
+}
+
+// loadTestImage reads the test image downloaded by test.sh and returns its
+// base64-encoded content (one of the wire formats accepted by Predict).
+func loadTestImage() string {
+	imgPath, err := filepath.Abs("test-data/test.jpg")
+	Expect(err).ToNot(HaveOccurred())
+	imgBytes, err := os.ReadFile(imgPath)
+	if err != nil {
+		Skip(fmt.Sprintf("test image not present: %s (run test.sh first)", imgPath))
+	}
+	return base64.StdEncoding.EncodeToString(imgBytes)
+}
+
+// dialBackend opens a gRPC client connection to the spawned backend.
+func dialBackend(port int) (pb.BackendClient, func()) {
+	addr := fmt.Sprintf("127.0.0.1:%d", port)
+	conn, err := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
+	Expect(err).ToNot(HaveOccurred())
+	return pb.NewBackendClient(conn), func() { _ = conn.Close() }
+}
+
+// modelPathOrSkip resolves the model file under ./test-models/ and Skip()s the
+// current spec if it's missing (not present on a fresh checkout / on CI runners
+// without the download).
+func modelPathOrSkip(name string) string {
+	modelDir, err := filepath.Abs("test-models")
+	Expect(err).ToNot(HaveOccurred())
+	modelPath := filepath.Join(modelDir, name)
+	if _, err := os.Stat(modelPath); err != nil {
+		Skip(fmt.Sprintf("model not present: %s (run test.sh first)", modelPath))
+	}
+	return modelPath
+}
+
+var _ = Describe("depth-anything-cpp backend", func() {
+	It("runs depth+pose against a known-good image", func() {
+		modelPath := modelPathOrSkip("depth-anything-small-f32.gguf")
+		imgB64 := loadTestImage()
+
+		port := freePort()
+		cleanup := startBackend(port)
+		defer cleanup()
+
+		client, closeConn := dialBackend(port)
+		defer closeConn()
+
+		ctx, cancel := context.WithTimeout(context.Background(), 20*time.Minute)
+		defer cancel()
+
+		loadResp, err := client.LoadModel(ctx, &pb.ModelOptions{
+			Model:     "depth-anything-small-f32.gguf",
+			ModelFile: modelPath,
+			Threads:   4,
+		})
+		Expect(err).ToNot(HaveOccurred(), "LoadModel")
+		Expect(loadResp.GetSuccess()).To(BeTrue(), "LoadModel reported failure: %s", loadResp.GetMessage())
+
+		// Predict runs depth+pose and returns the JSON depthResult in Reply.Message.
+		reply, err := client.Predict(ctx, &pb.PredictOptions{
+			Images: []string{imgB64},
+		})
+		Expect(err).ToNot(HaveOccurred(), "Predict")
+
+		var res depthResult
+		Expect(json.Unmarshal(reply.GetMessage(), &res)).To(Succeed(), "Predict returned non-JSON: %q", string(reply.GetMessage()))
+		Expect(res.DepthW).To(BeNumerically(">", 0), "depth width should be positive")
+		Expect(res.DepthH).To(BeNumerically(">", 0), "depth height should be positive")
+
+		_, _ = fmt.Fprintf(GinkgoWriter, "depth OK: %dx%d min=%.3f max=%.3f\n",
+			res.DepthW, res.DepthH, res.DepthMin, res.DepthMax)
+	})
+})
--- a/backend/go/depth-anything-cpp/nested_e2e_test.go
+++ b/backend/go/depth-anything-cpp/nested_e2e_test.go
@@ -0,0 +1,64 @@
+package main
+
+// nested_e2e_test.go - e2e smoke for the nested two-file metric model. Loads the
+// anyview branch as the main model and points the metric branch via the
+// "metric_model:<file>" option (exactly as the depth-anything-3-nested gallery
+// entry does), then exercises the typed Depth RPC and asserts a metric depth map.
+//
+// Skips cleanly unless both nested GGUFs are present under ./test-models/ and the
+// backend binary + fallback .so are built.
+
+import (
+	"context"
+	"fmt"
+	"path/filepath"
+	"time"
+
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("depth-anything-cpp nested metric model", func() {
+	It("loads the two-file pair via the metric_model option and returns metric depth", func() {
+		anyviewPath := modelPathOrSkip("depth-anything-nested-anyview.gguf")
+		_ = modelPathOrSkip("depth-anything-nested-metric.gguf")
+		imgB64 := loadTestImage()
+
+		port := freePort()
+		cleanup := startBackend(port)
+		defer cleanup()
+
+		client, closeConn := dialBackend(port)
+		defer closeConn()
+
+		ctx, cancel := context.WithTimeout(context.Background(), 25*time.Minute)
+		defer cancel()
+
+		loadResp, err := client.LoadModel(ctx, &pb.ModelOptions{
+			Model:     "depth-anything-nested-anyview.gguf",
+			ModelFile: anyviewPath,
+			ModelPath: filepath.Dir(anyviewPath),
+			Options:   []string{"metric_model:depth-anything-nested-metric.gguf"},
+			Threads:   8,
+		})
+		Expect(err).ToNot(HaveOccurred(), "LoadModel(nested)")
+		Expect(loadResp.GetSuccess()).To(BeTrue(), "LoadModel reported failure: %s", loadResp.GetMessage())
+
+		resp, err := client.Depth(ctx, &pb.DepthRequest{
+			Src:          imgB64,
+			IncludeDepth: true,
+			IncludePose:  true,
+		})
+		Expect(err).ToNot(HaveOccurred(), "Depth(nested)")
+		Expect(resp.GetWidth()).To(BeNumerically(">", 0), "depth width")
+		Expect(resp.GetHeight()).To(BeNumerically(">", 0), "depth height")
+		Expect(resp.GetIsMetric()).To(BeTrue(), "nested output must be metric")
+		Expect(len(resp.GetDepth())).To(Equal(int(resp.GetWidth())*int(resp.GetHeight())), "dense depth length")
+		Expect(len(resp.GetExtrinsics())).To(Equal(12), "extrinsics 3x4")
+		Expect(resp.GetIntrinsics()[0]).To(BeNumerically(">", 0), "fx > 0")
+
+		_, _ = fmt.Fprintf(GinkgoWriter, "nested depth OK: %dx%d is_metric=%v fx=%.2f\n",
+			resp.GetWidth(), resp.GetHeight(), resp.GetIsMetric(), resp.GetIntrinsics()[0])
+	})
+})
--- a/backend/go/depth-anything-cpp/options_test.go
+++ b/backend/go/depth-anything-cpp/options_test.go
@@ -0,0 +1,20 @@
+package main
+
+import (
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = DescribeTable("optionValue",
+	func(opts []string, key, want string) {
+		Expect(optionValue(opts, key)).To(Equal(want))
+	},
+	Entry("present", []string{"foo:bar", "metric_model:m.gguf"}, "metric_model", "m.gguf"),
+	Entry("absent", []string{"foo:bar"}, "metric_model", ""),
+	Entry("nil", []string(nil), "metric_model", ""),
+	Entry("trims space", []string{"metric_model:  m.gguf  "}, "metric_model", "m.gguf"),
+	Entry("value with colon", []string{"metric_model:a:b.gguf"}, "metric_model", "a:b.gguf"),
+	Entry("first wins", []string{"metric_model:first.gguf", "metric_model:second.gguf"}, "metric_model", "first.gguf"),
+	Entry("empty value", []string{"metric_model:"}, "metric_model", ""),
+	Entry("prefix not key", []string{"metric_model_extra:x"}, "metric_model", ""),
+)
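
Review note on the "key:value" semantics pinned down by the table above (split at the first colon, trim the value, first entry wins, exact key match). The sketch below shows one possible alternative that parses all options into a map in a single pass using strings.Cut. parseOptions is a hypothetical helper, not part of the backend, and it differs from optionValue in that a bare key with no colon is recorded with an empty value.

```go
package main

import (
	"fmt"
	"strings"
)

// parseOptions splits each "key:value" entry at the first colon.
// The first occurrence of a key wins; values are whitespace-trimmed.
// (Hypothetical helper for illustration only.)
func parseOptions(opts []string) map[string]string {
	m := make(map[string]string, len(opts))
	for _, o := range opts {
		k, v, _ := strings.Cut(o, ":")
		if _, seen := m[k]; seen {
			continue
		}
		m[k] = strings.TrimSpace(v)
	}
	return m
}

func main() {
	m := parseOptions([]string{
		"foo:bar",
		"metric_model: a:b.gguf ",
		"metric_model:second.gguf",
		"metric_model_extra:x",
	})
	fmt.Println(m["metric_model"])
}
```

Because the map is keyed on the full text before the first colon, "metric_model_extra" cannot be mistaken for "metric_model", which is the same property the "prefix not key" entry checks.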
--- a/backend/go/depth-anything-cpp/package.sh
+++ b/backend/go/depth-anything-cpp/package.sh
@@ -0,0 +1,59 @@
+#!/bin/bash
+
+# Script to copy the appropriate libraries based on architecture
+
+set -e
+
+CURDIR=$(dirname "$(realpath $0)")
+REPO_ROOT="${CURDIR}/../../.."
+
+# Create lib directory
+mkdir -p $CURDIR/package/lib
+
+cp -avf $CURDIR/libdepthanythingcpp-*.so $CURDIR/package/
+cp -avf $CURDIR/depth-anything-cpp $CURDIR/package/
+cp -fv $CURDIR/run.sh $CURDIR/package/
+
+# Detect architecture and copy appropriate libraries
+if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
+    # x86_64 architecture
+    echo "Detected x86_64 architecture, copying x86_64 libraries..."
+    cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
+    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
+    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
+elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
+    # ARM64 architecture
+    echo "Detected ARM64 architecture, copying ARM64 libraries..."
+    cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
+    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
+    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
+elif [ "$(uname -s)" = "Darwin" ]; then
+    echo "Detected Darwin"
+else
+    echo "Error: Could not detect architecture"
+    exit 1
+fi
+
+# Package GPU libraries based on BUILD_TYPE
+GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
+if [ -f "$GPU_LIB_SCRIPT" ]; then
+    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
+    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
+    package_gpu_libs
+fi
+
+echo "Packaging completed successfully"
+ls -liah $CURDIR/package/
+ls -liah $CURDIR/package/lib/
--- a/backend/go/depth-anything-cpp/run.sh
+++ b/backend/go/depth-anything-cpp/run.sh
@@ -0,0 +1,52 @@
+#!/bin/bash
+set -ex
+
+# Get the absolute current dir where the script is located
+CURDIR=$(dirname "$(realpath $0)")
+
+cd /
+
+echo "CPU info:"
+if [ "$(uname)" != "Darwin" ]; then
+	grep -e "model\sname" /proc/cpuinfo | head -1
+	grep -e "flags" /proc/cpuinfo | head -1
+fi
+
+LIBRARY="$CURDIR/libdepthanythingcpp-fallback.so"
+
+if [ "$(uname)" != "Darwin" ]; then
+	if grep -q -e "\savx\s" /proc/cpuinfo ; then
+		echo "CPU:    AVX    found OK"
+		if [ -e $CURDIR/libdepthanythingcpp-avx.so ]; then
+			LIBRARY="$CURDIR/libdepthanythingcpp-avx.so"
+		fi
+	fi
+
+	if grep -q -e "\savx2\s" /proc/cpuinfo ; then
+		echo "CPU:    AVX2   found OK"
+		if [ -e $CURDIR/libdepthanythingcpp-avx2.so ]; then
+			LIBRARY="$CURDIR/libdepthanythingcpp-avx2.so"
+		fi
+	fi
+
+	# Check AVX-512 (avx512f flag)
+	if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
+		echo "CPU:    AVX512F found OK"
+		if [ -e $CURDIR/libdepthanythingcpp-avx512.so ]; then
+			LIBRARY="$CURDIR/libdepthanythingcpp-avx512.so"
+		fi
+	fi
+fi
+
+export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
+export DEPTHANYTHING_LIBRARY=$LIBRARY
+
+# If there is a lib/ld.so, use it
+if [ -f $CURDIR/lib/ld.so ]; then
+	echo "Using lib/ld.so"
+	echo "Using library: $LIBRARY"
+	exec $CURDIR/lib/ld.so $CURDIR/depth-anything-cpp "$@"
+fi
+
+echo "Using library: $LIBRARY"
+exec $CURDIR/depth-anything-cpp "$@"
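
Review note on the variant selection in run.sh above: each later check overrides the earlier one, so the effective preference is avx512 over avx2 over avx, each taken only if its .so exists, with the fallback library otherwise. The sketch below restates that rule in Go for clarity. pickVariant and its inputs are hypothetical; the script itself remains the source of truth.

```go
package main

import (
	"fmt"
	"strings"
)

// pickVariant returns the most capable library variant whose CPU flag is
// present and whose file is available, or "fallback" if none qualifies.
// flags is the whitespace-separated flags list from /proc/cpuinfo;
// available reports which variant libraries exist on disk.
func pickVariant(flags string, available map[string]bool) string {
	have := make(map[string]bool)
	for _, f := range strings.Fields(flags) {
		have[f] = true
	}
	// Ordered from most to least capable; the first match wins.
	candidates := []struct{ flag, variant string }{
		{"avx512f", "avx512"},
		{"avx2", "avx2"},
		{"avx", "avx"},
	}
	for _, c := range candidates {
		if have[c.flag] && available[c.variant] {
			return c.variant
		}
	}
	return "fallback"
}

func main() {
	libs := map[string]bool{"avx": true, "avx2": true}
	fmt.Println(pickVariant("fpu sse2 avx avx2 avx512f", libs))
}
```

In this example the CPU advertises avx512f but no avx512 library is shipped, so the avx2 variant is chosen, mirroring the `[ -e ... ]` guards in the script.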
--- a/backend/go/depth-anything-cpp/test.sh
+++ b/backend/go/depth-anything-cpp/test.sh
@@ -0,0 +1,45 @@
+#!/bin/bash
+set -e
+
+CURDIR=$(dirname "$(realpath $0)")
+
+echo "Running depth-anything-cpp backend tests..."
+
+# Test model from the mudler/depth-anything.cpp-gguf HuggingFace repo. The small
+# (vits) f32 GGUF is the lightest backbone (~131 MB), so it keeps the download
+# cheap. It is resumed with `curl -C -` and skipped entirely if already present.
+DEPTHANYTHING_MODEL_DIR="${DEPTHANYTHING_MODEL_DIR:-$CURDIR/test-models}"
+
+DEPTHANYTHING_MODEL_FILE="${DEPTHANYTHING_MODEL_FILE:-depth-anything-small-f32.gguf}"
+DEPTHANYTHING_MODEL_URL="${DEPTHANYTHING_MODEL_URL:-https://huggingface.co/mudler/depth-anything.cpp-gguf/resolve/main/depth-anything-small-f32.gguf}"
+
+mkdir -p "$DEPTHANYTHING_MODEL_DIR"
+
+if [ ! -f "$DEPTHANYTHING_MODEL_DIR/$DEPTHANYTHING_MODEL_FILE" ]; then
+    echo "Downloading depth-anything small f32 model (~131 MB)..."
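+    curl -fL -C - -o "$DEPTHANYTHING_MODEL_DIR/$DEPTHANYTHING_MODEL_FILE.part" "$DEPTHANYTHING_MODEL_URL" --progress-bar  # -C - resumes a partial .part download
+    mv "$DEPTHANYTHING_MODEL_DIR/$DEPTHANYTHING_MODEL_FILE.part" "$DEPTHANYTHING_MODEL_DIR/$DEPTHANYTHING_MODEL_FILE"  # publish only a complete download, so a partial file is never skipped as present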
+fi
+
+# Use a real photo (people + cars) from the upstream rf-detr.cpp repo (~46 KB).
+# Depth estimation needs real content; a synthetic image would be degenerate.
+TEST_IMAGE_DIR="$CURDIR/test-data"
+TEST_IMAGE_FILE="$TEST_IMAGE_DIR/test.jpg"
+TEST_IMAGE_URL="${TEST_IMAGE_URL:-https://raw.githubusercontent.com/mudler/rf-detr.cpp/main/tests/fixtures/ci/test_image.jpg}"
+
+mkdir -p "$TEST_IMAGE_DIR"
+if [ ! -f "$TEST_IMAGE_FILE" ]; then
+    echo "Downloading test image..."
+    curl -L -o "$TEST_IMAGE_FILE" "$TEST_IMAGE_URL" --progress-bar
+fi
+
+echo "depth-anything-cpp test setup complete."
+echo "  model:      $DEPTHANYTHING_MODEL_DIR/$DEPTHANYTHING_MODEL_FILE"
+echo "  test image: $TEST_IMAGE_FILE"
+
+# Run the Go smoke test: spawns the backend binary on a free port, calls
+# LoadModel + Predict via gRPC against the downloaded GGUF + image.
+echo ""
+echo "Running Go smoke test..."
+cd "$CURDIR"
+go test -v -timeout 30m ./...
--- a/backend/go/parakeet-cpp/Makefile
+++ b/backend/go/parakeet-cpp/Makefile
@@ -1,6 +1,6 @@
 # parakeet-cpp backend Makefile.
 #
-# Upstream pin lives below as PARAKEET_VERSION?=b8012f11e5269126eddb7f4fd02f891a2ccc29b0
+# Upstream pin lives below as PARAKEET_VERSION?=92a5f0306be354c109150fe58ae4cc4f8a21ca45
 # (.github/bump_deps.sh) can find and update it - matches the
 # whisper.cpp / ds4 / vibevoice-cpp convention.
 #
@@ -15,7 +15,7 @@
 # That's what the L0 smoke test uses. The default target below does the
 # proper clone-at-pin + cmake build so CI doesn't need a side-checkout.

-PARAKEET_VERSION?=b8012f11e5269126eddb7f4fd02f891a2ccc29b0
+PARAKEET_VERSION?=92a5f0306be354c109150fe58ae4cc4f8a21ca45
 PARAKEET_REPO?=https://github.com/mudler/parakeet.cpp

 GOCMD?=go
--- a/backend/go/parakeet-cpp/package.sh
+++ b/backend/go/parakeet-cpp/package.sh
@@ -1,23 +1,68 @@
 #!/bin/bash
 #
-# L0 packaging stub: copy the binary, run.sh and libparakeet.so* into
-# package/. The full ldd walk (libc, libstdc++, libgomp, GPU runtimes,
-# arch detection) lands in L3, mirroring backend/go/whisper/package.sh.
+# Bundle the parakeet-cpp-grpc binary, libparakeet.so, the core runtime
+# libs (libc/libstdc++/libgomp + ld.so) and the GPU runtime for the active
+# BUILD_TYPE so the package is self-contained. Mirrors
+# backend/go/whisper/package.sh; run.sh routes the (CGO_ENABLED=0) binary
+# through lib/ld.so so the packaged libc is used instead of the host's.

 set -e

 CURDIR=$(dirname "$(realpath "$0")")
+REPO_ROOT="${CURDIR}/../../.."

 mkdir -p "$CURDIR/package/lib"

 cp -avf "$CURDIR/parakeet-cpp-grpc" "$CURDIR/package/"
 cp -avf "$CURDIR/run.sh" "$CURDIR/package/"

-# libparakeet.so + any soname symlinks (libparakeet.so.X, libparakeet.so.X.Y).
+# libparakeet.so + any soname symlinks (libparakeet.so.X[.Y]). purego.Dlopen
+# resolves it via LD_LIBRARY_PATH, which run.sh points at lib/.
 cp -avf "$CURDIR"/libparakeet.so* "$CURDIR/package/lib/" 2>/dev/null || {
 	echo "ERROR: libparakeet.so not found in $CURDIR, run 'make' first" >&2
 	exit 1
 }

-echo "L0 package layout (full ldd walk lands in L3):"
+# Detect architecture and copy the core runtime libs libparakeet.so links
+# against, plus the matching dynamic loader as lib/ld.so.
+if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
+    echo "Detected x86_64 architecture, copying x86_64 libraries..."
+    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
+    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
+    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
+    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
+    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
+    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
+    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
+    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
+    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
+elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
+    echo "Detected ARM64 architecture, copying ARM64 libraries..."
+    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
+    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
+    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
+    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
+    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
+    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
+    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
+    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
+    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
+elif [ "$(uname -s)" = "Darwin" ]; then
+    echo "Detected Darwin"
+else
+    echo "Error: Could not detect architecture"
+    exit 1
+fi
+
+# Package GPU libraries (CUDA/ROCm/Intel/Vulkan loader + ICDs + drivers)
+# based on BUILD_TYPE so the backend can reach the GPU without the runtime
+# base image shipping those drivers.
+GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
+if [ -f "$GPU_LIB_SCRIPT" ]; then
+    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
+    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
+    package_gpu_libs
+fi
+
+echo "Packaging completed successfully"
 ls -liah "$CURDIR/package/" "$CURDIR/package/lib/"
--- a/backend/go/stablediffusion-ggml/Makefile
+++ b/backend/go/stablediffusion-ggml/Makefile
@@ -8,10 +8,16 @@ JOBS?=$(shell nproc --ignore=1)

 # stablediffusion.cpp (ggml)
 STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
-STABLEDIFFUSION_GGML_VERSION?=19bdfe22d255d5b4dff39d449318b9bc5ea2317f
+STABLEDIFFUSION_GGML_VERSION?=7f0e728b7d42f2490dfa5dd9539082d904f2f6b2

 CMAKE_ARGS+=-DGGML_MAX_NAME=128

+# Enable the ggml RPC backend so generation can be sharded across remote
+# rpc-server workers (the same backend-agnostic ggml rpc-server used by the
+# llama.cpp backend). Servers are selected via the `rpc_servers` option or the
+# LLAMACPP_GRPC_SERVERS env var (populated automatically in p2p worker mode).
+CMAKE_ARGS+=-DSD_RPC=ON
+
 ifeq ($(NATIVE),false)
 	CMAKE_ARGS+=-DGGML_NATIVE=OFF
 endif
--- a/backend/go/stablediffusion-ggml/cpp/gosd.cpp
+++ b/backend/go/stablediffusion-ggml/cpp/gosd.cpp
@@ -391,10 +391,18 @@ int load_model(const char *model, char *model_path, char* options[], int threads
    const char *control_net_path = "";
    const char *embedding_dir = "";
    const char *photo_maker_path = "";
+    const char *pulid_weights_path = "";
    const char *tensor_type_rules = "";
    char *lora_dir = model_path;

-    bool vae_decode_only = true;
+    // Upstream backend/parameter placement specs (see docs/.../stablediffusion).
+    // Empty means "leave at upstream default" (nullptr).
+    const char *backend_arg = "";
+    const char *params_backend_arg = "";
+    const char *rpc_servers_arg = "";
+    const char *max_vram_arg = "";
+    bool stream_layers = false;
+
    int n_threads = threads;
    enum sd_type_t wtype = SD_TYPE_COUNT;
    enum rng_type_t rng_type = CUDA_RNG;
@@ -418,7 +426,9 @@ int load_model(const char *model, char *model_path, char* options[], int threads
    // If options is not NULL, parse options
    for (int i = 0; options[i] != NULL; i++) {
        const char *optname = strtok(options[i], ":");
-        const char *optval = strtok(NULL, ":");
+        // Take everything after the first ':' as the value so values may
+        // themselves contain colons (e.g. rpc_servers host:port lists).
+        const char *optval = strtok(NULL, "");
        if (optval == NULL) {
            optval = "true";
        }
@@ -490,9 +500,21 @@ int load_model(const char *model, char *model_path, char* options[], int threads
            }
        }
        if (!strcmp(optname, "photo_maker_path")) photo_maker_path = strdup(optval);
+        if (!strcmp(optname, "pulid_weights_path")) pulid_weights_path = strdup(optval);
        if (!strcmp(optname, "tensor_type_rules")) tensor_type_rules = strdup(optval);

-        if (!strcmp(optname, "vae_decode_only")) vae_decode_only = (strcmp(optval, "true") == 0 || strcmp(optval, "1") == 0);
+        // Backend / parameter placement specs (see prepare_backend_assignments
+        // in the upstream CLI). These compose with the legacy keep_*_on_cpu /
+        // offload_params_to_cpu booleans below.
+        if (!strcmp(optname, "backend")) backend_arg = strdup(optval);
+        if (!strcmp(optname, "params_backend")) params_backend_arg = strdup(optval);
+        if (!strcmp(optname, "rpc_servers")) rpc_servers_arg = strdup(optval);
+        if (!strcmp(optname, "max_vram")) max_vram_arg = strdup(optval);
+        if (!strcmp(optname, "stream_layers")) stream_layers = (strcmp(optval, "true") == 0 || strcmp(optval, "1") == 0);
+
+        // vae_decode_only is still accepted for backwards compatibility with
+        // existing gallery configs, but upstream dropped the option (the model
+        // now decides), so it has no handler here and is ignored.
        if (!strcmp(optname, "offload_params_to_cpu")) offload_params_to_cpu = (strcmp(optval, "true") == 0 || strcmp(optval, "1") == 0);
        if (!strcmp(optname, "keep_clip_on_cpu")) keep_clip_on_cpu = (strcmp(optval, "true") == 0 || strcmp(optval, "1") == 0);
        if (!strcmp(optname, "keep_control_net_on_cpu")) keep_control_net_on_cpu = (strcmp(optval, "true") == 0 || strcmp(optval, "1") == 0);
@@ -591,20 +613,48 @@ int load_model(const char *model, char *model_path, char* options[], int threads
    ctx_params.embeddings = embedding_vec.empty() ? NULL : embedding_vec.data();
    ctx_params.embedding_count = static_cast<uint32_t>(embedding_vec.size());
    ctx_params.photo_maker_path = photo_maker_path;
+    if (strlen(pulid_weights_path) > 0) ctx_params.pulid_weights_path = pulid_weights_path;
    ctx_params.tensor_type_rules = tensor_type_rules;
-    ctx_params.vae_decode_only = vae_decode_only;
-    // XXX: Setting to true causes a segfault on the second run
-    ctx_params.free_params_immediately = false;
    ctx_params.n_threads = n_threads;
    ctx_params.rng_type = rng_type;
-    ctx_params.keep_clip_on_cpu = keep_clip_on_cpu;
    if (wtype != SD_TYPE_COUNT) ctx_params.wtype = wtype;
    if (sampler_rng_type != RNG_TYPE_COUNT) ctx_params.sampler_rng_type = sampler_rng_type;
    if (prediction != PREDICTION_COUNT) ctx_params.prediction = prediction;
    if (lora_apply_mode != LORA_APPLY_MODE_COUNT) ctx_params.lora_apply_mode = lora_apply_mode;
-    ctx_params.offload_params_to_cpu = offload_params_to_cpu;
-    ctx_params.keep_control_net_on_cpu = keep_control_net_on_cpu;
-    ctx_params.keep_vae_on_cpu = keep_vae_on_cpu;
+    // Backend / parameter placement specs. Upstream replaced the boolean
+    // CPU-offload knobs (offload_params_to_cpu, keep_clip_on_cpu, keep_vae_on_cpu,
+    // keep_control_net_on_cpu) with these specs. Seed from the explicit
+    // backend/params_backend options, then prepend the legacy boolean-derived
+    // assignments, mirroring prepare_backend_assignments() in the upstream CLI.
+    // These strings must outlive new_sd_ctx() below.
+    std::string backend_spec = backend_arg;
+    std::string params_backend_spec = params_backend_arg;
+    auto prepend_spec = [](std::string& spec, const char* assignment) {
+        spec = spec.empty() ? std::string(assignment) : std::string(assignment) + "," + spec;
+    };
+    if (offload_params_to_cpu) prepend_spec(params_backend_spec, "*=cpu");
+    if (keep_clip_on_cpu) prepend_spec(backend_spec, "te=cpu");
+    if (keep_vae_on_cpu) prepend_spec(backend_spec, "vae=cpu");
+    if (keep_control_net_on_cpu) prepend_spec(backend_spec, "controlnet=cpu");
+    if (!backend_spec.empty()) ctx_params.backend = backend_spec.c_str();
+    if (!params_backend_spec.empty()) ctx_params.params_backend = params_backend_spec.c_str();
+    // RPC servers: prefer the explicit option, otherwise fall back to the
+    // LLAMACPP_GRPC_SERVERS env var. LocalAI's p2p worker mode populates that
+    // var with discovered ggml rpc-server workers (shared with the llama.cpp
+    // backend), so distributed image generation works with no extra config.
+    if (strlen(rpc_servers_arg) > 0) {
+        ctx_params.rpc_servers = rpc_servers_arg;
+    } else {
+        const char* env_rpc_servers = std::getenv("LLAMACPP_GRPC_SERVERS");
+        if (env_rpc_servers != NULL && strlen(env_rpc_servers) > 0) {
+            ctx_params.rpc_servers = env_rpc_servers;
+        }
+    }
+    // max_vram: GiB budget or per-backend spec for graph-cut segmented param
+    // offload ("0" = disabled, "-1" = auto). stream_layers only has effect when
+    // max_vram is set.
+    if (strlen(max_vram_arg) > 0) ctx_params.max_vram = max_vram_arg;
+    ctx_params.stream_layers = stream_layers;
    ctx_params.diffusion_flash_attn = diffusion_flash_attn;
    ctx_params.tae_preview_only = tae_preview_only;
    ctx_params.diffusion_conv_direct = diffusion_conv_direct;
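The two string conventions introduced in this file (split each option at the first colon only, and prepend legacy boolean-derived assignments to a spec) can be sketched in isolation. This is a minimal Go illustration of the same semantics, not the backend's C++ code: `splitOption` and `prependSpec` are hypothetical names, and the host:port values are made-up examples.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOption (hypothetical helper) splits "key:value" at the FIRST colon
// only, so the value may itself contain colons. A bare key, or a key with an
// empty value, reads as "true".
func splitOption(opt string) (key, value string) {
	key, value, _ = strings.Cut(opt, ":")
	if value == "" {
		value = "true"
	}
	return key, value
}

// prependSpec (hypothetical helper) puts assignment in front of an existing
// comma-separated spec, or returns it alone when the spec is empty.
func prependSpec(spec, assignment string) string {
	if spec == "" {
		return assignment
	}
	return assignment + "," + spec
}

func main() {
	// Illustrative addresses only; the value keeps its inner colons.
	k, v := splitOption("rpc_servers:10.0.0.1:50052,10.0.0.2:50052")
	fmt.Println(k, "=>", v)

	// A bare flag carries no value and is treated as "true".
	k, v = splitOption("stream_layers")
	fmt.Println(k, "=>", v)

	// Legacy booleans are prepended one by one, so the last one applied
	// ends up first in the resulting spec.
	spec := ""
	spec = prependSpec(spec, "te=cpu")
	spec = prependSpec(spec, "vae=cpu")
	fmt.Println(spec)
}
```

With the previous `strtok(NULL, ":")` call, the `rpc_servers` value in this example would have been cut at its first inner colon, which is the case the first-colon split is meant to handle.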
--- a/backend/go/supertonic/.gitignore
+++ b/backend/go/supertonic/.gitignore
@@ -0,0 +1,4 @@
+/supertonic
+/sources/
+/backend-assets/
+/package/
--- a/backend/go/supertonic/Makefile
+++ b/backend/go/supertonic/Makefile
@@ -0,0 +1,62 @@
+CURRENT_DIR=$(abspath ./)
+GOCMD=go
+
+ONNX_VERSION?=1.24.4
+ONNX_ARCH?=x64
+ONNX_OS?=linux
+
+ifneq (,$(findstring aarch64,$(shell uname -m)))
+	ONNX_ARCH=aarch64
+endif
+
+ifeq ($(OS),Darwin)
+	ONNX_OS=osx
+	ifneq (,$(findstring arm64,$(shell uname -m)))
+		ONNX_ARCH=arm64
+	else
+		ONNX_ARCH=x86_64
+	endif
+endif
+
+# CUDA 12 ships as -gpu, CUDA 13 as -gpu_cuda13 (underscore). CPU has no suffix.
+ifeq ($(BUILD_TYPE),cublas)
+	ONNX_PROVIDER=cuda
+	ifeq ($(CUDA_MAJOR_VERSION),13)
+		ONNX_VARIANT=-gpu_cuda13
+	else
+		ONNX_VARIANT=-gpu
+	endif
+else
+	ONNX_VARIANT=
+	ONNX_PROVIDER=cpu
+endif
+
+sources/onnxruntime:
+	mkdir -p sources/onnxruntime
+	curl -L https://github.com/microsoft/onnxruntime/releases/download/v$(ONNX_VERSION)/onnxruntime-$(ONNX_OS)-$(ONNX_ARCH)$(ONNX_VARIANT)-$(ONNX_VERSION).tgz \
+	  -o sources/onnxruntime/onnxruntime.tgz
+	cd sources/onnxruntime && tar -xf onnxruntime.tgz --strip-components=1 && rm onnxruntime.tgz
+
+backend-assets/lib: sources/onnxruntime
+	mkdir -p backend-assets/lib
+	cp -rfLv sources/onnxruntime/lib/* backend-assets/lib/
+
+supertonic: backend-assets/lib
+	CGO_ENABLED=1 $(GOCMD) build \
+	  -ldflags "$(LD_FLAGS) -X main.onnxProvider=$(ONNX_PROVIDER)" \
+	  -tags "$(GO_TAGS)" -o supertonic ./
+
+package:
+	bash package.sh
+
+build: supertonic package
+
+# Tests need only the Go toolchain plus a C compiler (gcc) for cgo; yalue/onnxruntime_go
+# dlopens onnxruntime at runtime, so no tarball download is required to compile or run unit specs.
+test:
+	CGO_ENABLED=1 $(GOCMD) test -v -timeout 120s ./...
+
+clean:
+	rm -rf supertonic sources/ backend-assets/ package/
+
+.PHONY: build package clean test
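To make the variant logic above easier to check, here is a small Go sketch that rebuilds the download URL the `sources/onnxruntime` rule requests from the same inputs. `onnxRuntimeURL` is a hypothetical helper written for illustration; it only mirrors the string composition in the Makefile and does not verify that a given release asset exists.

```go
package main

import "fmt"

// onnxRuntimeURL (hypothetical helper) mirrors the Makefile: the CPU build
// has no suffix, BUILD_TYPE=cublas uses "-gpu", and CUDA 13 uses "-gpu_cuda13".
func onnxRuntimeURL(version, osName, arch, buildType, cudaMajor string) string {
	variant := ""
	if buildType == "cublas" {
		variant = "-gpu"
		if cudaMajor == "13" {
			variant = "-gpu_cuda13"
		}
	}
	return fmt.Sprintf(
		"https://github.com/microsoft/onnxruntime/releases/download/v%s/onnxruntime-%s-%s%s-%s.tgz",
		version, osName, arch, variant, version,
	)
}

func main() {
	// Defaults from the Makefile: ONNX_VERSION=1.24.4, ONNX_OS=linux, ONNX_ARCH=x64.
	fmt.Println(onnxRuntimeURL("1.24.4", "linux", "x64", "", ""))
	fmt.Println(onnxRuntimeURL("1.24.4", "linux", "x64", "cublas", "12"))
	fmt.Println(onnxRuntimeURL("1.24.4", "linux", "x64", "cublas", "13"))
}
```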
--- a/backend/go/supertonic/backend.go
+++ b/backend/go/supertonic/backend.go
@@ -0,0 +1,307 @@
+package main
+
+import (
+	"bytes"
+	"encoding/binary"
+	"fmt"
+	"os"
+	"path/filepath"
+	"strconv"
+	"strings"
+	"sync"
+
+	laudio "github.com/mudler/LocalAI/pkg/audio"
+	"github.com/mudler/LocalAI/pkg/grpc/base"
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+)
+
+// onnxProvider is set via -ldflags "-X main.onnxProvider=<provider>"; the Makefile
+// passes "cuda" for BUILD_TYPE=cublas and "cpu" otherwise. Defaults to CPU.
+var onnxProvider = "cpu"
+
+// Per-model generation defaults, overridable via ModelOptions.Options:
+//
+//	supertonic.steps=<int>          denoising steps (quality), default 8
+//	supertonic.speed=<float>        speech rate, default 1.05
+//	supertonic.silence=<float>      inter-chunk silence seconds, default 0.3
+//	supertonic.default_voice=<name> voice-style used when request omits voice
+//	supertonic.default_lang=<lang>  language tag used when request omits it
+const (
+	optionSteps        = "supertonic.steps="
+	optionSpeed        = "supertonic.speed="
+	optionSilence      = "supertonic.silence="
+	optionDefaultVoice = "supertonic.default_voice="
+	optionDefaultLang  = "supertonic.default_lang="
+)
+
+type SupertonicBackend struct {
+	base.SingleThread
+
+	tts          *TextToSpeech
+	cfg          Config
+	modelDir     string
+	voicesDir    string
+	defaultVoice string
+	defaultLang  string
+	steps        int
+	speed        float32
+	silence      float32
+
+	styleMu sync.Mutex
+	styles  map[string]*Style // voice name -> loaded style cache
+}
+
+func (s *SupertonicBackend) Load(opts *pb.ModelOptions) error {
+	modelDir, err := resolveModelDir(opts.ModelFile)
+	if err != nil {
+		return err
+	}
+	s.modelDir = modelDir
+	s.voicesDir = resolveVoicesDir(modelDir)
+
+	cfg, err := LoadCfgs(modelDir)
+	if err != nil {
+		return fmt.Errorf("loading tts.json from %s: %w", modelDir, err)
+	}
+	s.cfg = cfg
+
+	// onnxProvider is "cpu" for the CPU build; the CUDA build sets it to
+	// "cuda" via -ldflags. Upstream LoadTextToSpeech still errors on GPU
+	// until the CUDA phase wires the execution provider.
+	tts, err := LoadTextToSpeech(modelDir, onnxProvider == "cuda", cfg)
+	if err != nil {
+		return fmt.Errorf("loading supertonic models from %s: %w", modelDir, err)
+	}
+	s.tts = tts
+
+	s.steps = int(findOptionInt(opts, optionSteps, 8))
+	s.speed = findOptionFloat(opts, optionSpeed, 1.05)
+	s.silence = findOptionFloat(opts, optionSilence, 0.3)
+	s.defaultVoice = findOptionValue(opts, optionDefaultVoice, "")
+	s.defaultLang = findOptionValue(opts, optionDefaultLang, "na")
+	s.styles = map[string]*Style{}
+	return nil
+}
+
+func (s *SupertonicBackend) TTS(req *pb.TTSRequest) error {
+	wav, sr, err := s.synthesize(req)
+	if err != nil {
+		return err
+	}
+	out := make([]float64, len(wav))
+	for i, v := range wav {
+		out[i] = float64(v)
+	}
+	if err := writeWavFile(req.Dst, out, sr); err != nil {
+		return fmt.Errorf("writing wav to %s: %w", req.Dst, err)
+	}
+	return nil
+}
+
+func (s *SupertonicBackend) TTSStream(req *pb.TTSRequest, results chan []byte) error {
+	defer close(results)
+
+	wav, sr, err := s.synthesize(req)
+	if err != nil {
+		return err
+	}
+
+	results <- streamingWAVHeader(uint32(sr))
+
+	const chunkSamples = 4096
+	for off := 0; off < len(wav); off += chunkSamples {
+		end := off + chunkSamples
+		if end > len(wav) {
+			end = len(wav)
+		}
+		results <- pcmFloatToInt16LE(wav[off:end])
+	}
+	return nil
+}
+
+// synthesize runs the full pipeline and returns the trimmed mono float32
+// PCM and its sample rate.
+func (s *SupertonicBackend) synthesize(req *pb.TTSRequest) ([]float32, int, error) {
+	if s.tts == nil {
+		return nil, 0, fmt.Errorf("supertonic model not loaded")
+	}
+	if strings.TrimSpace(req.Text) == "" {
+		return nil, 0, fmt.Errorf("empty text")
+	}
+
+	style, err := s.loadStyle(s.voiceName(req.Voice))
+	if err != nil {
+		return nil, 0, err
+	}
+
+	lang := s.resolveLang("")
+	if req.Language != nil {
+		lang = s.resolveLang(*req.Language)
+	}
+
+	wav, dur, err := s.tts.Call(req.Text, lang, style, s.steps, s.speed, s.silence)
+	if err != nil {
+		return nil, 0, err
+	}
+
+	sr := s.tts.SampleRate
+	// Call returns concatenated audio; trim to the reported duration.
+	wavLen := int(float32(sr) * dur)
+	if wavLen < 0 {
+		wavLen = 0
+	}
+	if wavLen > len(wav) {
+		wavLen = len(wav)
+	}
+	return wav[:wavLen], sr, nil
+}
+
+// voiceName picks the request voice, falling back to the model default.
+func (s *SupertonicBackend) voiceName(reqVoice string) string {
+	v := strings.TrimSpace(reqVoice)
+	if v == "" {
+		return s.defaultVoice
+	}
+	return v
+}
+
+// resolveLang validates against AvailableLangs, falling back to the model
+// default (then "na").
+func (s *SupertonicBackend) resolveLang(reqLang string) string {
+	l := strings.TrimSpace(reqLang)
+	if l != "" && isValidLang(l) {
+		return l
+	}
+	if s.defaultLang != "" && isValidLang(s.defaultLang) {
+		return s.defaultLang
+	}
+	return "na"
+}
+
+// loadStyle resolves and caches a voice-style. An empty name with no model
+// default is an error (supertonic requires a style embedding).
+func (s *SupertonicBackend) loadStyle(name string) (*Style, error) {
+	if name == "" {
+		return nil, fmt.Errorf("no voice specified and no supertonic.default_voice set")
+	}
+	s.styleMu.Lock()
+	defer s.styleMu.Unlock()
+	if st, ok := s.styles[name]; ok {
+		return st, nil
+	}
+	path := s.voiceStylePath(name)
+	st, err := LoadVoiceStyle([]string{path}, false)
+	if err != nil {
+		return nil, fmt.Errorf("loading voice style %q (%s): %w", name, path, err)
+	}
+	s.styles[name] = st
+	return st, nil
+}
+
+// voiceStylePath maps a voice name to a JSON path. Absolute paths are honored;
+// names containing a separator resolve under modelDir; bare names resolve under
+// the resolved voicesDir (see resolveVoicesDir).
+func (s *SupertonicBackend) voiceStylePath(name string) string {
+	if !strings.HasSuffix(name, ".json") {
+		name += ".json"
+	}
+	if filepath.IsAbs(name) {
+		return name
+	}
+	if strings.ContainsRune(name, filepath.Separator) {
+		return filepath.Join(s.modelDir, name)
+	}
+	return filepath.Join(s.voicesDir, name)
+}
+
+// resolveVoicesDir locates the voice_styles directory. The HF model layout
+// puts the ONNX files in an onnx/ subdir with voice_styles/ as its sibling,
+// so check modelDir/voice_styles first, then the parent's voice_styles.
+func resolveVoicesDir(modelDir string) string {
+	candidates := []string{
+		filepath.Join(modelDir, "voice_styles"),
+		filepath.Join(filepath.Dir(modelDir), "voice_styles"),
+	}
+	for _, c := range candidates {
+		if info, err := os.Stat(c); err == nil && info.IsDir() {
+			return c
+		}
+	}
+	return candidates[0]
+}
+
+// resolveModelDir accepts either a directory (used as-is) or a file (its
+// parent dir is used).
+func resolveModelDir(modelFile string) (string, error) {
+	if modelFile == "" {
+		return "", fmt.Errorf("empty model path")
+	}
+	info, err := os.Stat(modelFile)
+	if err != nil {
+		return "", fmt.Errorf("stat model path %s: %w", modelFile, err)
+	}
+	if info.IsDir() {
+		return modelFile, nil
+	}
+	return filepath.Dir(modelFile), nil
+}
+
+// ---- option helpers (mirrors backend/go/sherpa-onnx/backend.go) ----
+
+func findOptionValue(opts *pb.ModelOptions, prefix, def string) string {
+	for _, o := range opts.Options {
+		if strings.HasPrefix(o, prefix) {
+			return strings.TrimPrefix(o, prefix)
+		}
+	}
+	return def
+}
+
+func findOptionFloat(opts *pb.ModelOptions, prefix string, def float32) float32 {
+	raw := findOptionValue(opts, prefix, "")
+	if raw == "" {
+		return def
+	}
+	v, err := strconv.ParseFloat(raw, 32)
+	if err != nil {
+		return def
+	}
+	return float32(v)
+}
+
+func findOptionInt(opts *pb.ModelOptions, prefix string, def int32) int32 {
+	raw := findOptionValue(opts, prefix, "")
+	if raw == "" {
+		return def
+	}
+	v, err := strconv.ParseInt(raw, 10, 32)
+	if err != nil {
+		return def
+	}
+	return int32(v)
+}
+
+// ---- PCM helpers ----
+
+func pcmFloatToInt16LE(samples []float32) []byte {
+	buf := make([]byte, len(samples)*2)
+	for i, f := range samples {
+		v := int32(f * 32767)
+		if v > 32767 {
+			v = 32767
+		} else if v < -32768 {
+			v = -32768
+		}
+		binary.LittleEndian.PutUint16(buf[2*i:], uint16(int16(v)))
+	}
+	return buf
+}
+
+func streamingWAVHeader(sampleRate uint32) []byte {
+	const streamingSize = 0xFFFFFFFF
+	h := laudio.NewWAVHeaderWithRate(streamingSize, sampleRate)
+	h.ChunkSize = streamingSize
+	var buf bytes.Buffer
+	_ = h.Write(&buf)
+	return buf.Bytes()
+}
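The option helpers above share one fallback rule: the first entry with a matching prefix wins, and a missing or unparsable value falls back to the default. The following is a self-contained Go sketch of that rule over a plain `[]string` instead of `pb.ModelOptions`; `optionValue`, `optionInt` and `optionFloat` are hypothetical stand-ins for the helpers in the diff, and the option values are made up.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// optionValue returns the text after prefix for the first matching entry,
// or def when no entry carries that prefix.
func optionValue(options []string, prefix, def string) string {
	for _, o := range options {
		if strings.HasPrefix(o, prefix) {
			return strings.TrimPrefix(o, prefix)
		}
	}
	return def
}

// optionInt parses the value as a 32-bit integer, falling back to def when
// the option is absent or malformed.
func optionInt(options []string, prefix string, def int32) int32 {
	v, err := strconv.ParseInt(optionValue(options, prefix, ""), 10, 32)
	if err != nil {
		return def
	}
	return int32(v)
}

// optionFloat parses the value as a float32, with the same fallback.
func optionFloat(options []string, prefix string, def float32) float32 {
	v, err := strconv.ParseFloat(optionValue(options, prefix, ""), 32)
	if err != nil {
		return def
	}
	return float32(v)
}

func main() {
	options := []string{
		"supertonic.steps=12",
		"supertonic.speed=fast", // malformed: not a number
		"supertonic.steps=20",   // ignored: the first match wins
	}
	fmt.Println(optionInt(options, "supertonic.steps=", 8))       // 12
	fmt.Println(optionFloat(options, "supertonic.speed=", 1.05))  // falls back to the default
	fmt.Println(optionFloat(options, "supertonic.silence=", 0.3)) // absent, default used
}
```

One consequence of this design is that a typo in a numeric option is silently replaced by the default rather than reported, so a misconfigured model still loads.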
--- a/backend/go/supertonic/backend_test.go
+++ b/backend/go/supertonic/backend_test.go
@@ -0,0 +1,86 @@
+package main
+
+import (
+	"os"
+	"path/filepath"
+
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+)
+
+var _ = Describe("voiceStylePath", func() {
+	s := &SupertonicBackend{modelDir: "/models/st/onnx", voicesDir: "/models/st/voice_styles"}
+
+	It("resolves a bare name under the resolved voicesDir", func() {
+		Expect(s.voiceStylePath("M1")).To(Equal(filepath.Join("/models/st/voice_styles", "M1.json")))
+	})
+	It("keeps an explicit .json suffix", func() {
+		Expect(s.voiceStylePath("M1.json")).To(Equal(filepath.Join("/models/st/voice_styles", "M1.json")))
+	})
+	It("honors absolute paths", func() {
+		Expect(s.voiceStylePath("/abs/v.json")).To(Equal("/abs/v.json"))
+	})
+})
+
+var _ = Describe("resolveVoicesDir", func() {
+	It("prefers voice_styles under modelDir", func() {
+		dir := GinkgoT().TempDir()
+		Expect(os.MkdirAll(filepath.Join(dir, "voice_styles"), 0o755)).To(Succeed())
+		Expect(resolveVoicesDir(dir)).To(Equal(filepath.Join(dir, "voice_styles")))
+	})
+	It("falls back to the sibling voice_styles next to an onnx subdir", func() {
+		root := GinkgoT().TempDir()
+		Expect(os.MkdirAll(filepath.Join(root, "voice_styles"), 0o755)).To(Succeed())
+		Expect(os.MkdirAll(filepath.Join(root, "onnx"), 0o755)).To(Succeed())
+		Expect(resolveVoicesDir(filepath.Join(root, "onnx"))).To(Equal(filepath.Join(root, "voice_styles")))
+	})
+})
+
+var _ = Describe("resolveLang", func() {
+	It("accepts a valid request language", func() {
+		s := &SupertonicBackend{defaultLang: "na"}
+		Expect(s.resolveLang("ko")).To(Equal("ko"))
+	})
+	It("falls back to the model default for an invalid language", func() {
+		s := &SupertonicBackend{defaultLang: "en"}
+		Expect(s.resolveLang("zz")).To(Equal("en"))
+	})
+	It("falls back to na when nothing is valid", func() {
+		s := &SupertonicBackend{defaultLang: ""}
+		Expect(s.resolveLang("")).To(Equal("na"))
+	})
+})
+
+var _ = Describe("pcmFloatToInt16LE", func() {
+	It("clamps and encodes little-endian", func() {
+		out := pcmFloatToInt16LE([]float32{0, 1.0, -1.0, 2.0})
+		Expect(out).To(HaveLen(8))
+		Expect(out[0:2]).To(Equal([]byte{0x00, 0x00})) // 0
+		Expect(out[2:4]).To(Equal([]byte{0xff, 0x7f})) // 32767
+		Expect(out[6:8]).To(Equal([]byte{0xff, 0x7f})) // clamp 2.0 -> 32767
+	})
+})
+
+var _ = Describe("end-to-end synthesis", Ordered, func() {
+	var modelDir string
+	BeforeAll(func() {
+		modelDir = os.Getenv("SUPERTONIC_MODEL_PATH")
+		if modelDir == "" {
+			Skip("set SUPERTONIC_MODEL_PATH to a supertonic model dir to run")
+		}
+		Expect(InitializeONNXRuntime()).To(Succeed())
+	})
+
+	It("synthesizes a wav file", func() {
+		b := &SupertonicBackend{}
+		Expect(b.Load(&pb.ModelOptions{ModelFile: modelDir, Options: []string{"supertonic.default_voice=F1"}})).To(Succeed())
+		dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
+		lang := "en"
+		Expect(b.TTS(&pb.TTSRequest{Text: "Hello from LocalAI.", Dst: dst, Language: &lang})).To(Succeed())
+		info, err := os.Stat(dst)
+		Expect(err).ToNot(HaveOccurred())
+		Expect(info.Size()).To(BeNumerically(">", 44)) // header + PCM
+	})
+})
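The clamping spec above asserts the samples 0, 1.0 and 2.0 but not the negative ones. As a sketch of what the negative range produces, the block below copies the conversion from `backend.go` into a standalone program (renamed `floatToInt16LE` to make clear it is an illustration) and prints the bytes for -1.0 and an out-of-range -2.0.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// floatToInt16LE scales each sample by 32767, clamps it to the int16 range,
// and writes it as two little-endian bytes.
func floatToInt16LE(samples []float32) []byte {
	buf := make([]byte, len(samples)*2)
	for i, f := range samples {
		v := int32(f * 32767)
		if v > 32767 {
			v = 32767
		} else if v < -32768 {
			v = -32768
		}
		binary.LittleEndian.PutUint16(buf[2*i:], uint16(int16(v)))
	}
	return buf
}

func main() {
	// -1.0 scales to -32767 (0x8001); -2.0 clamps to -32768 (0x8000).
	out := floatToInt16LE([]float32{-1.0, -2.0})
	fmt.Printf("% x\n", out) // prints "01 80 00 80"
}
```

Because the scale factor is 32767, an in-range -1.0 maps to -32767 rather than -32768; only clamped samples reach the lowest int16 value.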
--- a/backend/go/supertonic/helper.go
+++ b/backend/go/supertonic/helper.go
--- a/backend/go/supertonic/main.go
+++ b/backend/go/supertonic/main.go
@@ -0,0 +1,27 @@
+package main
+
+// Started internally by LocalAI; a server is allocated per model.
+
+import (
+	"flag"
+
+	grpc "github.com/mudler/LocalAI/pkg/grpc"
+	ort "github.com/yalue/onnxruntime_go"
+)
+
+var addr = flag.String("addr", "localhost:50051", "the address to listen on")
+
+func main() {
+	flag.Parse()
+
+	// InitializeONNXRuntime reads ONNXRUNTIME_LIB_PATH (set by run.sh) and
+	// dlopens libonnxruntime before any session is created in Load().
+	if err := InitializeONNXRuntime(); err != nil {
+		panic(err)
+	}
+	defer func() { _ = ort.DestroyEnvironment() }()
+
+	if err := grpc.StartServer(*addr, &SupertonicBackend{}); err != nil {
+		panic(err)
+	}
+}
--- a/core/services/cloudproxy/ssewire/ssewire_suite_test.go
+++ b/core/services/cloudproxy/ssewire/ssewire_suite_test.go
@@ -1,4 +1,4 @@
-package ssewire
+package main

 import (
 	"testing"
@@ -7,7 +7,7 @@ import (
 	. "github.com/onsi/gomega"
 )

-func TestSsewire(t *testing.T) {
+func TestSupertonic(t *testing.T) {
 	RegisterFailHandler(Fail)
-	RunSpecs(t, "ssewire test suite")
+	RunSpecs(t, "Supertonic backend test suite")
 }
--- a/backend/go/supertonic/package.sh
+++ b/backend/go/supertonic/package.sh
@@ -0,0 +1,49 @@
+#!/bin/bash
+set -e
+
+CURDIR=$(dirname "$(realpath "$0")")
+REPO_ROOT="${CURDIR}/../../.."
+
+mkdir -p "$CURDIR/package/lib"
+
+cp -avf "$CURDIR/supertonic" "$CURDIR/package/"
+cp -avf "$CURDIR/run.sh" "$CURDIR/package/"
+cp -rfLv "$CURDIR"/backend-assets/lib/* "$CURDIR/package/lib/"
+
+if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
+    echo "Detected x86_64 architecture, copying x86_64 libraries..."
+    cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
+    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
+    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
+elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
+    echo "Detected ARM64 architecture, copying ARM64 libraries..."
+    cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
+    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
+    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
+else
+    echo "Error: Could not detect architecture"
+    exit 1
+fi
+
+GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
+if [ -f "$GPU_LIB_SCRIPT" ]; then
+    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
+    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
+    package_gpu_libs
+fi
+
+echo "Packaging completed successfully"
+ls -liah $CURDIR/package/
+ls -liah $CURDIR/package/lib/
--- a/backend/go/supertonic/run.sh
+++ b/backend/go/supertonic/run.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+set -ex
+
+CURDIR=$(dirname "$(realpath "$0")")
+
+export LD_LIBRARY_PATH="$CURDIR/lib:$LD_LIBRARY_PATH"
+export ONNXRUNTIME_LIB_PATH="$CURDIR/lib/libonnxruntime.so"
+
+if [ -f "$CURDIR/lib/ld.so" ]; then
+	echo "Using lib/ld.so"
+	exec "$CURDIR/lib/ld.so" "$CURDIR/supertonic" "$@"
+fi
+
+exec "$CURDIR/supertonic" "$@"
--- a/backend/go/whisper/Makefile
+++ b/backend/go/whisper/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # whisper.cpp version
 WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
-WHISPER_CPP_VERSION?=df7638d8229a243af8a4b5a8ae557e0d74e0a0ae
+WHISPER_CPP_VERSION?=86c40c3bd6fc86f1187fb751d111b49e0fc18e84
 SO_TARGET?=libgowhisper.so

 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -458,6 +458,126 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-locate-anything-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-vulkan-locate-anything-cpp
+- &depthanything
+  name: "depth-anything"
+  alias: "depth-anything"
+  license: apache-2.0
+  description: |
+    Depth Anything 3 monocular metric depth + camera pose estimation in C/C++
+    using GGML. Loads pre-built GGUF weights and, given an image, returns a
+    dense depth map plus the recovered camera extrinsics (3x4) and intrinsics
+    (3x3). No Python at inference (purego, cgo-less).
+  urls:
+    - https://github.com/mudler/depth-anything.cpp
+    - https://huggingface.co/depth-anything/Depth-Anything-V3
+  tags:
+    - depth-estimation
+    - camera-pose
+    - depth-anything
+    - gpu
+    - cpu
+  capabilities:
+    default: "cpu-depth-anything-cpp"
+    nvidia: "cuda12-depth-anything-cpp"
+    nvidia-cuda-12: "cuda12-depth-anything-cpp"
+    nvidia-cuda-13: "cuda13-depth-anything-cpp"
+    nvidia-l4t: "nvidia-l4t-arm64-depth-anything-cpp"
+    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-depth-anything-cpp"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-depth-anything-cpp"
+    intel: "intel-sycl-f32-depth-anything-cpp"
+    vulkan: "vulkan-depth-anything-cpp"
+- !!merge <<: *depthanything
+  name: "depth-anything-development"
+  capabilities:
+    default: "cpu-depth-anything-cpp-development"
+    nvidia: "cuda12-depth-anything-cpp-development"
+    nvidia-cuda-12: "cuda12-depth-anything-cpp-development"
+    nvidia-cuda-13: "cuda13-depth-anything-cpp-development"
+    nvidia-l4t: "nvidia-l4t-arm64-depth-anything-cpp-development"
+    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-depth-anything-cpp-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-depth-anything-cpp-development"
+    intel: "intel-sycl-f32-depth-anything-cpp-development"
+    vulkan: "vulkan-depth-anything-cpp-development"
+- !!merge <<: *depthanything
+  name: "cpu-depth-anything-cpp"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-depth-anything-cpp"
+  mirrors:
+    - localai/localai-backends:latest-cpu-depth-anything-cpp
+- !!merge <<: *depthanything
+  name: "cpu-depth-anything-cpp-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-depth-anything-cpp"
+  mirrors:
+    - localai/localai-backends:master-cpu-depth-anything-cpp
+- !!merge <<: *depthanything
+  name: "cuda12-depth-anything-cpp"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-depth-anything-cpp"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-12-depth-anything-cpp
+- !!merge <<: *depthanything
+  name: "cuda12-depth-anything-cpp-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-depth-anything-cpp"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-12-depth-anything-cpp
+- !!merge <<: *depthanything
+  name: "cuda13-depth-anything-cpp"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-depth-anything-cpp"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-13-depth-anything-cpp
+- !!merge <<: *depthanything
+  name: "cuda13-depth-anything-cpp-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-depth-anything-cpp"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-13-depth-anything-cpp
+- !!merge <<: *depthanything
+  name: "nvidia-l4t-arm64-depth-anything-cpp"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-depth-anything-cpp"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-arm64-depth-anything-cpp
+- !!merge <<: *depthanything
+  name: "nvidia-l4t-arm64-depth-anything-cpp-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-depth-anything-cpp"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-arm64-depth-anything-cpp
+- !!merge <<: *depthanything
+  name: "cuda13-nvidia-l4t-arm64-depth-anything-cpp"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-depth-anything-cpp"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-depth-anything-cpp
+- !!merge <<: *depthanything
+  name: "cuda13-nvidia-l4t-arm64-depth-anything-cpp-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-depth-anything-cpp"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-depth-anything-cpp
+- !!merge <<: *depthanything
+  name: "intel-sycl-f32-depth-anything-cpp"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-depth-anything-cpp"
+  mirrors:
+    - localai/localai-backends:latest-gpu-intel-sycl-f32-depth-anything-cpp
+- !!merge <<: *depthanything
+  name: "intel-sycl-f32-depth-anything-cpp-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-depth-anything-cpp"
+  mirrors:
+    - localai/localai-backends:master-gpu-intel-sycl-f32-depth-anything-cpp
+- !!merge <<: *depthanything
+  name: "intel-sycl-f16-depth-anything-cpp"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-depth-anything-cpp"
+  mirrors:
+    - localai/localai-backends:latest-gpu-intel-sycl-f16-depth-anything-cpp
+- !!merge <<: *depthanything
+  name: "intel-sycl-f16-depth-anything-cpp-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-depth-anything-cpp"
+  mirrors:
+    - localai/localai-backends:master-gpu-intel-sycl-f16-depth-anything-cpp
+- !!merge <<: *depthanything
+  name: "vulkan-depth-anything-cpp"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-depth-anything-cpp"
+  mirrors:
+    - localai/localai-backends:latest-gpu-vulkan-depth-anything-cpp
+- !!merge <<: *depthanything
+  name: "vulkan-depth-anything-cpp-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-depth-anything-cpp"
+  mirrors:
+    - localai/localai-backends:master-gpu-vulkan-depth-anything-cpp
 - &vllm
  name: "vllm"
  license: apache-2.0
@@ -879,6 +999,42 @@
    nvidia-l4t: "vulkan-localvqe"
    nvidia-l4t-cuda-12: "vulkan-localvqe"
    nvidia-l4t-cuda-13: "vulkan-localvqe"
+- &privacyfilter
+  name: "privacy-filter"
+  alias: "privacy-filter"
+  icon: https://cdn-avatars.huggingface.co/v1/production/uploads/5fd5e18a90b6dc4633f6d292/QPiv8pt4JNxr0FdGnpFef.png
+  description: |
+    Standalone GGML engine (privacy-filter.cpp) for the OpenMed privacy-filter
+    PII/NER token-classification model family. It runs the openai-privacy-filter
+    architecture (a gpt-oss-style sparse-MoE bidirectional token classifier) on
+    stock upstream GGML — no llama.cpp coupling and no Python — and serves the
+    TokenClassify RPC (constrained BIOES Viterbi decode into UTF-8 byte-offset
+    entity spans) used by LocalAI's NER PII redaction tier.
+  urls:
+    - https://github.com/localai-org/privacy-filter.cpp
+  tags:
+    - token-classification
+    - ner
+    - pii
+    - privacy
+    - CPU
+    - GPU
+  license: apache-2.0
+  # Builds: CPU (amd64+arm64 manifest), Vulkan (amd64) and CUDA 13 (amd64).
+  # Only a host that actually reports CUDA 13 gets the CUDA image (it bundles
+  # the CUDA 13 runtime and needs a recent driver); every other GPU — including
+  # NVIDIA without a CUDA-13 toolkit, AMD and Intel — routes to the Vulkan
+  # image, which only needs a Vulkan ICD. Everything else (incl. arm64/Jetson,
+  # where Vulkan/CUDA images are a future add) falls back to the CPU build,
+  # already fast for this ~50M-active-param model.
+  capabilities:
+    default: "cpu-privacy-filter"
+    nvidia: "vulkan-privacy-filter"
+    nvidia-cuda-12: "vulkan-privacy-filter"
+    nvidia-cuda-13: "cuda13-privacy-filter"
+    amd: "vulkan-privacy-filter"
+    intel: "vulkan-privacy-filter"
+    vulkan: "vulkan-privacy-filter"
 - &faster-whisper
  icon: https://avatars.githubusercontent.com/u/1520500?s=200&v=4
  description: |
@@ -1368,6 +1524,20 @@
    nvidia: "cuda12-sherpa-onnx"
    nvidia-cuda-12: "cuda12-sherpa-onnx"
    metal: "metal-sherpa-onnx"
+- &supertonic
+  name: "supertonic"
+  alias: "supertonic"
+  urls:
+    - https://github.com/supertone-inc/supertonic
+  description: |
+    Supertonic backend: lightning-fast, on-device multilingual text-to-speech via ONNX Runtime.
+    Runs Supertone's flow-matching TTS model (Supertone/supertonic-3), 44.1kHz output, 31 languages,
+    multiple preset voice styles. No espeak-ng dependency.
+  tags:
+    - text-to-speech
+    - TTS
+  capabilities:
+    default: "cpu-supertonic"
 - !!merge <<: *neutts
  name: "neutts-development"
  capabilities:
@@ -2569,6 +2739,37 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-stablediffusion-ggml"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-13-stablediffusion-ggml
+## privacy-filter
+- !!merge <<: *privacyfilter
+  name: "cpu-privacy-filter"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-privacy-filter"
+  mirrors:
+    - localai/localai-backends:latest-cpu-privacy-filter
+- !!merge <<: *privacyfilter
+  name: "cpu-privacy-filter-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-privacy-filter"
+  mirrors:
+    - localai/localai-backends:master-cpu-privacy-filter
+- !!merge <<: *privacyfilter
+  name: "vulkan-privacy-filter"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-privacy-filter"
+  mirrors:
+    - localai/localai-backends:latest-gpu-vulkan-privacy-filter
+- !!merge <<: *privacyfilter
+  name: "vulkan-privacy-filter-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-privacy-filter"
+  mirrors:
+    - localai/localai-backends:master-gpu-vulkan-privacy-filter
+- !!merge <<: *privacyfilter
+  name: "cuda13-privacy-filter"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-privacy-filter"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-13-privacy-filter
+- !!merge <<: *privacyfilter
+  name: "cuda13-privacy-filter-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-privacy-filter"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-13-privacy-filter
 # vllm
 - !!merge <<: *vllm
  name: "vllm-development"
@@ -5132,3 +5333,18 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-sherpa-onnx"
  mirrors:
    - localai/localai-backends:master-metal-darwin-arm64-sherpa-onnx
+## supertonic
+- !!merge <<: *supertonic
+  name: "supertonic-development"
+  capabilities:
+    default: "cpu-supertonic-development"
+- !!merge <<: *supertonic
+  name: "cpu-supertonic"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-supertonic"
+  mirrors:
+    - localai/localai-backends:latest-cpu-supertonic
+- !!merge <<: *supertonic
+  name: "cpu-supertonic-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-supertonic"
+  mirrors:
+    - localai/localai-backends:master-cpu-supertonic
--- a/backend/python/transformers/backend.py
+++ b/backend/python/transformers/backend.py
@@ -270,10 +270,17 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):

    def TokenClassify(self, request, context):
        # Runs HuggingFace's token-classification pipeline and returns
-        # the aggregated entity spans. The pipeline gives us byte
-        # offsets via aggregation_strategy="simple" (set at load
-        # time), so the caller can slice the original text without
-        # re-tokenising on the Go side.
+        # the aggregated entity spans.
+        #
+        # OFFSET UNITS: the proto contract (TokenClassifyEntity.start/end)
+        # is UTF-8 BYTE offsets into request.text. HuggingFace's pipeline,
+        # however, reports start/end as CODEPOINT offsets into the Python
+        # str (derived from the fast tokenizer's offset_mapping). Those
+        # coincide only for ASCII; for any multi-byte character they
+        # diverge — and this entry point exists to serve the explicitly
+        # multilingual privacy-filter model, so the conversion is
+        # mandatory, not a nicety. We build one prefix table mapping each
+        # codepoint index to its byte offset and translate every span.
        if not getattr(self, "TokenClassification", False):
            context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
            context.set_details("model was not loaded as Type=TokenClassification")
@@ -286,18 +293,50 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
            context.set_details(f"token-classification failed: {err}")
            return backend_pb2.TokenClassifyResponse()

+        text = request.text
+        # byte_at[i] = byte length of text[:i]; len == len(text)+1 so an
+        # exclusive end offset that points one past the last codepoint
+        # maps to len(text.encode("utf-8")). Built in a single O(n) pass.
+        byte_at = [0] * (len(text) + 1)
+        acc = 0
+        for i, ch in enumerate(text):
+            byte_at[i] = acc
+            acc += len(ch.encode("utf-8"))
+        byte_at[len(text)] = acc
+
+        def to_byte(cp_index, default):
+            # Clamp out-of-range codepoint indices into the table rather
+            # than throwing: a span we can't place is better dropped Go-side
+            # than crashing the RPC.
+            if cp_index is None:
+                cp_index = default
+            if cp_index < 0:
+                cp_index = 0
+            elif cp_index > len(text):
+                cp_index = len(text)
+            return byte_at[cp_index]
+
        threshold = request.threshold if request.threshold > 0 else 0.0
        entities = []
        for r in results:
            score = float(r.get("score", 0.0))
            if score < threshold:
                continue
+            cp_start = r.get("start")
+            cp_end = r.get("end")
+            start = to_byte(cp_start, 0)
+            end = to_byte(cp_end, 0)
            entities.append(backend_pb2.TokenClassifyEntity(
                entity_group=str(r.get("entity_group") or r.get("entity") or ""),
-                start=int(r.get("start", 0)),
-                end=int(r.get("end", 0)),
+                start=start,
+                end=end,
                score=score,
-                text=str(r.get("word", "")),
+                # Slice the original text by the (codepoint) span so the
+                # echoed text matches start..end exactly, instead of the
+                # pipeline's reconstructed "word" which can carry wordpiece
+                # artifacts. Falls back to "word" when offsets are absent.
+                text=(text[cp_start:cp_end] if cp_start is not None and cp_end is not None
+                      else str(r.get("word", ""))),
            ))
        return backend_pb2.TokenClassifyResponse(entities=entities)
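
The codepoint-to-byte translation in the hunk above can be restated in Go, which may help when checking what the Go side should expect from `TokenClassifyEntity.start/end`. This is only a sketch under assumptions: `byteOffsets` and `toByte` are hypothetical names that mirror the Python `byte_at` table and `to_byte` helper, and they are not part of the LocalAI codebase.

```go
package main

import "fmt"

// byteOffsets returns a table where table[i] is the UTF-8 byte offset of the
// i-th codepoint of s. The final entry is len(s), so an exclusive end index
// one past the last codepoint still resolves. (Hypothetical helper.)
func byteOffsets(s string) []int {
	table := make([]int, 0, len(s)+1)
	for byteIdx := range s { // ranging over a string yields each rune's byte index
		table = append(table, byteIdx)
	}
	return append(table, len(s))
}

// toByte clamps a codepoint index into the table instead of panicking,
// matching the clamping behaviour of the Python helper. (Hypothetical helper.)
func toByte(table []int, cp int) int {
	if cp < 0 {
		cp = 0
	}
	if cp >= len(table) {
		cp = len(table) - 1
	}
	return table[cp]
}

func main() {
	// "Zoë lives in 東京", written with escapes so the encoding is unambiguous.
	text := "Zo\u00eb lives in \u6771\u4eac"
	table := byteOffsets(text)

	// A pipeline-style codepoint span [13, 15) covers the last two characters.
	start, end := toByte(table, 13), toByte(table, 15)
	fmt.Println(start, end, text[start:end])
}
```

For pure ASCII input the table is the identity mapping, which is why the mismatch only shows up once a multi-byte character precedes the span.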

--- a/backend/python/vllm/requirements.txt
+++ b/backend/python/vllm/requirements.txt
@@ -1,4 +1,4 @@
-grpcio==1.81.0
+grpcio==1.81.1
 protobuf
 certifi
 setuptools
--- a/backend/python/whisperx/backend.py
+++ b/backend/python/whisperx/backend.py
@@ -79,6 +79,7 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):

    def AudioTranscription(self, request, context):
        import whisperx
+        from whisperx.diarize import DiarizationPipeline

        resultSegments = []
        text = ""
@@ -106,8 +107,8 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
            # Diarize if requested and HF token is available
            if request.diarize and self.hf_token:
                if self.diarize_pipeline is None:
-                    self.diarize_pipeline = whisperx.DiarizationPipeline(
-                        use_auth_token=self.hf_token,
+                    self.diarize_pipeline = DiarizationPipeline(
+                        token=self.hf_token,
                        device=self.device,
                    )
                diarize_segments = self.diarize_pipeline(audio)
--- a/cmd/launcher/internal/launcher.go
+++ b/cmd/launcher/internal/launcher.go
@@ -635,8 +635,11 @@ func (l *Launcher) showDownloadProgress(version, title string) {
 		progressBar := widget.NewProgressBar()
 		progressBar.SetValue(0)

-		// Status label
+		// Status label. Truncate with an ellipsis so a long "Download failed:
+		// <url>" message can't stretch the window (and progress bar) to fit the
+		// whole error on one line; the full error is shown in the dialog below.
 		statusLabel := widget.NewLabel("Preparing download...")
+		statusLabel.Truncation = fyne.TextTruncateEllipsis

 		// Release notes button
 		releaseNotesButton := widget.NewButton("View Release Notes", func() {
--- a/cmd/launcher/internal/systray_manager.go
+++ b/cmd/launcher/internal/systray_manager.go
@@ -454,8 +454,11 @@ func (sm *SystrayManager) showDownloadProgress(version string) {
 	progressBar := widget.NewProgressBar()
 	progressBar.SetValue(0)

-	// Status label
+	// Status label. Truncate with an ellipsis so a long "Download failed:
+	// <url>" message can't stretch the window (and progress bar) to fit the
+	// whole error on one line; the full error is shown in the dialog below.
 	statusLabel := widget.NewLabel("Preparing download...")
+	statusLabel.Truncation = fyne.TextTruncateEllipsis

 	// Release notes button
 	releaseNotesButton := widget.NewButton("View Release Notes", func() {
--- a/cmd/launcher/internal/ui.go
+++ b/cmd/launcher/internal/ui.go
@@ -57,8 +57,16 @@ type LauncherUI struct {

 // NewLauncherUI creates a new UI instance
 func NewLauncherUI() *LauncherUI {
+	// Truncate the status text with an ellipsis. Status messages can carry a
+	// download error containing a long, unbreakable URL/path; without this the
+	// label demands the full single-line width and stretches the window (and
+	// the progress bar) arbitrarily wide. The full error is still shown in the
+	// error dialog.
+	statusLabel := widget.NewLabel("Initializing...")
+	statusLabel.Truncation = fyne.TextTruncateEllipsis
+
 	return &LauncherUI{
-		statusLabel:       widget.NewLabel("Initializing..."),
+		statusLabel:       statusLabel,
 		versionLabel:      widget.NewLabel("Version: Unknown"),
 		startStopButton:   widget.NewButton("Start LocalAI", nil),
 		webUIButton:       widget.NewButton("Open WebUI", nil),
@@ -602,8 +610,11 @@ func (ui *LauncherUI) showDownloadProgress(version, title string) {
 		progressBar := widget.NewProgressBar()
 		progressBar.SetValue(0)

-		// Status label
+		// Status label. Truncate with an ellipsis so a long "Download failed:
+		// <url>" message can't stretch the window (and progress bar) to fit the
+		// whole error on one line; the full error is shown in the dialog below.
 		statusLabel := widget.NewLabel("Preparing download...")
+		statusLabel.Truncation = fyne.TextTruncateEllipsis

 		// Release notes button
 		releaseNotesButton := widget.NewButton("View Release Notes", func() {
--- a/core/application/application.go
+++ b/core/application/application.go
@@ -12,14 +12,15 @@ import (
 	"github.com/mudler/LocalAI/core/http/auth"
 	mcpTools "github.com/mudler/LocalAI/core/http/endpoints/mcp"
 	"github.com/mudler/LocalAI/core/services/agentpool"
+	"github.com/mudler/LocalAI/core/services/cloudproxy/mitm"
 	"github.com/mudler/LocalAI/core/services/facerecognition"
 	"github.com/mudler/LocalAI/core/services/galleryop"
 	"github.com/mudler/LocalAI/core/services/monitoring"
 	"github.com/mudler/LocalAI/core/services/nodes"
 	"github.com/mudler/LocalAI/core/services/routing/admission"
 	"github.com/mudler/LocalAI/core/services/routing/billing"
-	"github.com/mudler/LocalAI/core/services/cloudproxy/mitm"
 	"github.com/mudler/LocalAI/core/services/routing/pii"
+	"github.com/mudler/LocalAI/core/services/routing/piidetector"
 	"github.com/mudler/LocalAI/core/services/routing/router"
 	"github.com/mudler/LocalAI/core/services/voicerecognition"
 	"github.com/mudler/LocalAI/core/templates"
@@ -71,15 +72,15 @@ type Application struct {
 	// 1-to-1 host↔model invariant the dispatcher relies on. Read by
 	// /api/middleware/status so the admin UI can surface the cause.
 	mitmHostConflicts atomic.Pointer[map[string][]string]
-	routerDecisions    router.DecisionStore
-	routerRegistry     *router.Registry
-	admissionLimiter   *admission.Limiter
-	watchdogMutex      sync.Mutex
-	watchdogStop       chan bool
-	p2pMutex           sync.Mutex
-	p2pCtx             context.Context
-	p2pCancel          context.CancelFunc
-	agentJobMutex      sync.Mutex
+	routerDecisions   router.DecisionStore
+	routerRegistry    *router.Registry
+	admissionLimiter  *admission.Limiter
+	watchdogMutex     sync.Mutex
+	watchdogStop      chan bool
+	p2pMutex          sync.Mutex
+	p2pCtx            context.Context
+	p2pCancel         context.CancelFunc
+	agentJobMutex     sync.Mutex

 	// Distributed mode services (nil when not in distributed mode)
 	distributed *DistributedServices
@@ -254,6 +255,122 @@ func (a *Application) PIIEvents() pii.EventStore {
 	return a.piiEvents
 }

+// PIINERResolver returns the resolver the chat PII middleware uses to
+// turn a configured detector model name into a ready-to-use NERConfig:
+// a token-classifier bound over the shared model loader (lazy — the
+// model loads on first Detect) plus the detection policy read from that
+// model's own pii_detection block. Unknown names resolve to (zero,
+// false) so the middleware fails closed. Pass it via pii.WithNERResolver.
+func (a *Application) PIINERResolver() pii.NERDetectorResolver {
+	return func(modelName string) (pii.NERConfig, bool) {
+		if modelName == "" {
+			return pii.NERConfig{}, false
+		}
+		cfg, ok := a.ModelConfigLoader().GetModelConfig(modelName)
+		if !ok {
+			return pii.NERConfig{}, false
+		}
+
+		// Pattern detectors match secrets with the restricted-regex tier
+		// in-process (no backend load). Build a pattern matcher instead of the
+		// gRPC token-classifier; on a compile error fail closed with an error
+		// detector so the request is blocked, not silently unscanned.
+		if cfg.IsPatternDetector() {
+			det, err := piidetector.NewPattern(cfg, a.ApplicationConfig())
+			if err != nil {
+				det = pii.NewErrNERDetector(err.Error())
+			}
+			return pii.NERConfigFromRaw(
+				det,
+				0, // patterns are deterministic — no confidence floor
+				cfg.PIIDetectionDefaultAction(),
+				patternEntityActions(cfg),
+				pii.SourcePattern,
+			), true
+		}
+
+		det := piidetector.New(a.ModelLoader(), cfg, a.ApplicationConfig())
+		return pii.NERConfigFromRaw(
+			det,
+			cfg.PIIDetectionMinScore(),
+			cfg.PIIDetectionDefaultAction(),
+			cfg.PIIDetectionEntityActions(),
+			pii.SourceNER,
+		), true
+	}
+}
+
+// patternEntityActions merges a pattern detector's per-pattern Action overrides
+// into its entity_actions map. A pattern reports matches under its Name, so a
+// per-pattern action is just an entity_actions[Name] entry; explicit
+// entity_actions still win if both are set.
+func patternEntityActions(cfg config.ModelConfig) map[string]string {
+	out := cfg.PIIDetectionEntityActions()
+	for _, p := range cfg.PIIDetection.Patterns {
+		if p.Action == "" || p.Name == "" {
+			continue
+		}
+		if out == nil {
+			out = map[string]string{}
+		}
+		if _, exists := out[p.Name]; !exists {
+			out[p.Name] = p.Action
+		}
+	}
+	return out
+}
+
+// ResolvePIIPolicy resolves the effective request-side PII policy for a
+// consuming model: the per-model config decides, and the instance-wide
+// default detectors (PIIDefaultDetectors, set via POST /api/settings) fill in
+// when the model names none. It is the single decision point shared by the
+// chat middleware (via WithPolicyResolver) and the MITM listener so both agree.
+//
+//   - enabled: an explicit pii.enabled on the model always wins (true OR
+//     false). Otherwise PII is on when the backend defaults it on — today
+//     that means cloud-proxy models, which cross the network to a third party.
+//   - detectors: the model's own pii.detectors, or — when it lists none — the
+//     global PIIDefaultDetectors fallback. This is what makes cloud-proxy/MITM
+//     redaction work out of the box.
+//
+// appConfig is read live, so changes via the settings API take effect on the
+// next request without a restart.
+func (a *Application) ResolvePIIPolicy(cfg *config.ModelConfig) (enabled bool, detectors []string) {
+	if cfg == nil {
+		return false, nil
+	}
+	appCfg := a.ApplicationConfig()
+
+	if cfg.PII.Enabled != nil {
+		enabled = *cfg.PII.Enabled
+	} else {
+		enabled = cfg.PIIIsEnabled() // backend default (cloud-proxy)
+	}
+	if !enabled {
+		return false, nil
+	}
+
+	detectors = cfg.PIIDetectors()
+	if len(detectors) == 0 {
+		detectors = append([]string(nil), appCfg.PIIDefaultDetectors...)
+	}
+	return enabled, detectors
+}
+
+// PIIPolicyResolver adapts ResolvePIIPolicy to pii.PolicyResolver for
+// pii.WithPolicyResolver. The middleware carries the resolved model config as
+// `any` (the MODEL_CONFIG context value, a *config.ModelConfig); this asserts
+// it back and applies the instance-wide defaults.
+func (a *Application) PIIPolicyResolver() pii.PolicyResolver {
+	return func(modelCfg any) (bool, []string) {
+		cfg, ok := modelCfg.(*config.ModelConfig)
+		if !ok {
+			return false, nil
+		}
+		return a.ResolvePIIPolicy(cfg)
+	}
+}
+
 // MITMCA returns the cloudproxy MITM proxy's CA, or nil when the
 // MITM listener is disabled.
 func (a *Application) MITMCA() *mitm.CA { return a.mitmCA.Load() }
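
The precedence that `ResolvePIIPolicy` documents (explicit `pii.enabled` first, then the backend default, then the model's own detectors with the global list as a fallback) can be sketched in isolation. The `modelPolicy` type and `resolve` function below are hypothetical stand-ins, not the real `config.ModelConfig` API; they only assume the behaviour shown in the hunk above.

```go
package main

import "fmt"

// modelPolicy is a hypothetical stand-in for the PII-related fields of a
// model config; the real type in the diff is config.ModelConfig.
type modelPolicy struct {
	Enabled        *bool    // explicit pii.enabled; nil means "not set"
	BackendDefault bool     // backend default (true for cloud-proxy in the diff)
	Detectors      []string // the model's own pii.detectors
}

// resolve mirrors the documented precedence: an explicit Enabled wins, then
// the backend default; detectors fall back to the global list when empty.
func resolve(m *modelPolicy, globalDefaults []string) (bool, []string) {
	if m == nil {
		return false, nil
	}
	enabled := m.BackendDefault
	if m.Enabled != nil {
		enabled = *m.Enabled
	}
	if !enabled {
		return false, nil
	}
	detectors := m.Detectors
	if len(detectors) == 0 {
		// Copy so callers cannot mutate the shared global slice.
		detectors = append([]string(nil), globalDefaults...)
	}
	return true, detectors
}

func main() {
	off := false
	global := []string{"pf"}

	fmt.Println(resolve(&modelPolicy{BackendDefault: true}, global))
	fmt.Println(resolve(&modelPolicy{BackendDefault: true, Enabled: &off}, global))
	fmt.Println(resolve(&modelPolicy{BackendDefault: true, Detectors: []string{"own-pf"}}, global))
	fmt.Println(resolve(&modelPolicy{BackendDefault: false}, global))
}
```

Note that a global default detector never switches PII on by itself in this sketch: it only supplies detectors once the model or its backend has already enabled scanning.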
--- a/core/application/mitm.go
+++ b/core/application/mitm.go
@@ -8,6 +8,7 @@ import (

 	"github.com/mudler/LocalAI/core/config"
 	"github.com/mudler/LocalAI/core/services/cloudproxy/mitm"
+	"github.com/mudler/LocalAI/core/services/routing/pii"
 	"github.com/mudler/xlog"
 )

@@ -91,25 +92,41 @@ func startMITMLocked(app *Application, options *config.ApplicationConfig) error
 	}
 	sort.Strings(effectiveHosts)

-	// Per-host PII gate inherits from the owning model's pii.enabled.
-	// A non-cloud-proxy backend with no explicit pii.enabled resolves
-	// to false → host is intercepted but the regex pass is skipped
-	// (audit events still record).
-	var piiDisabled []string
+	// Per-host NER detectors come from the owning model's effective PII
+	// policy (ResolvePIIPolicy), each resolved against that detector
+	// model's pii_detection policy. A host whose model has PII disabled, or
+	// that ends up with no detectors at all, gets no entry: it is
+	// intercepted and forwarded unredacted (audit events still record
+	// traffic). A named detector that can't be resolved is recorded as an
+	// error-detector, so the request fails closed rather than leaking.
+	resolver := app.PIINERResolver()
+	detectorsByHost := map[string][]pii.NERConfig{}
 	for host, modelName := range ownership.Owners {
 		cfg, exists := app.backendLoader.GetModelConfig(modelName)
 		if !exists {
 			continue
 		}
-		if !cfg.PIIIsEnabled() {
-			piiDisabled = append(piiDisabled, host)
+		// Resolve through the shared policy so cloud-proxy hosts inherit the
+		// instance-wide default detector when they name none of their own.
+		enabled, detectors := app.ResolvePIIPolicy(&cfg)
+		if !enabled || len(detectors) == 0 {
+			continue
 		}
+		cfgs := make([]pii.NERConfig, 0, len(detectors))
+		for _, name := range detectors {
+			nc, ok := resolver(name)
+			if !ok {
+				xlog.Error("mitm: detector model not resolvable; requests to host will fail closed", "host", host, "detector", name)
+				nc = pii.NERConfig{Detector: pii.NewErrNERDetector("detector model '" + name + "' not resolvable")}
+			}
+			cfgs = append(cfgs, nc)
+		}
+		detectorsByHost[host] = cfgs
 	}

 	handler := mitm.NewPIIHandler(mitm.PIIHandlerOptions{
-		Redactor:             app.piiRedactor,
-		EventStore:           app.piiEvents,
-		HostsWithPIIDisabled: piiDisabled,
+		EventStore:      app.piiEvents,
+		DetectorsByHost: detectorsByHost,
 	})

 	srv, err := mitm.NewServer(mitm.Config{
@@ -132,7 +149,7 @@ func startMITMLocked(app *Application, options *config.ApplicationConfig) error
 		"ca_dir", caDir,
 		"intercept_hosts", effectiveHosts,
 		"model_owned_hosts", len(ownership.Owners),
-		"pii_disabled_hosts", len(piiDisabled),
+		"pii_detector_hosts", len(detectorsByHost),
 	)
 	return nil
 }
--- a/core/application/pii_policy_test.go
+++ b/core/application/pii_policy_test.go
@@ -0,0 +1,51 @@
+package application
+
+import (
+	"github.com/mudler/LocalAI/core/config"
+
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("ResolvePIIPolicy", func() {
+	chat := config.FLAG_CHAT
+	bp := func(b bool) *bool { return &b }
+	mk := func(c *config.ApplicationConfig) *Application {
+		return &Application{applicationConfig: c}
+	}
+
+	It("lets an explicit pii.enabled=false win over the global default detector", func() {
+		app := mk(&config.ApplicationConfig{PIIDefaultDetectors: []string{"pf"}})
+		cfg := &config.ModelConfig{Backend: "cloud-proxy", KnownUsecases: &chat}
+		cfg.PII.Enabled = bp(false)
+		enabled, dets := app.ResolvePIIPolicy(cfg)
+		Expect(enabled).To(BeFalse())
+		Expect(dets).To(BeNil())
+	})
+
+	It("enables a cloud-proxy model with the global default detector (closes the no-op gap)", func() {
+		// cloud-proxy defaults PIIIsEnabled()==true but lists no detectors, so
+		// without a global default it scans with nothing.
+		app := mk(&config.ApplicationConfig{PIIDefaultDetectors: []string{"pf"}})
+		cfg := &config.ModelConfig{Backend: "cloud-proxy"}
+		enabled, dets := app.ResolvePIIPolicy(cfg)
+		Expect(enabled).To(BeTrue())
+		Expect(dets).To(Equal([]string{"pf"}))
+	})
+
+	It("leaves a non-cloud model off by default (a global default detector does not switch PII on)", func() {
+		app := mk(&config.ApplicationConfig{PIIDefaultDetectors: []string{"pf"}})
+		cfg := &config.ModelConfig{Backend: "llama-cpp", KnownUsecases: &chat}
+		enabled, _ := app.ResolvePIIPolicy(cfg)
+		Expect(enabled).To(BeFalse())
+	})
+
+	It("prefers the model's own detectors over the global default", func() {
+		app := mk(&config.ApplicationConfig{PIIDefaultDetectors: []string{"global-pf"}})
+		cfg := &config.ModelConfig{Backend: "cloud-proxy"}
+		cfg.PII.Detectors = []string{"own-pf"}
+		enabled, dets := app.ResolvePIIPolicy(cfg)
+		Expect(enabled).To(BeTrue())
+		Expect(dets).To(Equal([]string{"own-pf"}))
+	})
+})
--- a/core/application/startup.go
+++ b/core/application/startup.go
@@ -25,6 +25,7 @@ import (
 	"github.com/mudler/LocalAI/core/services/storage"
 	coreStartup "github.com/mudler/LocalAI/core/startup"
 	"github.com/mudler/LocalAI/internal"
+	"github.com/mudler/LocalAI/pkg/downloader"
 	"github.com/mudler/LocalAI/pkg/signals"
 	"github.com/mudler/LocalAI/pkg/vram"

@@ -53,7 +54,6 @@ func New(opts ...config.AppOption) (*Application, error) {
 	caps, err := xsysinfo.CPUCapabilities()
 	if err == nil {
 		xlog.Debug("CPU capabilities", "capabilities", caps)
-
 	}
 	gpus, err := xsysinfo.GPUs()
 	if err == nil {
@@ -68,18 +68,28 @@ func New(opts ...config.AppOption) (*Application, error) {
 		return nil, fmt.Errorf("models path cannot be empty")
 	}

-	err = os.MkdirAll(options.SystemState.Model.ModelsPath, 0750)
+	err = os.MkdirAll(options.SystemState.Model.ModelsPath, 0o750)
 	if err != nil {
 		return nil, fmt.Errorf("unable to create ModelPath: %q", err)
 	}
+
+	// Reap *.partial downloads abandoned by a previous run (killed mid-transfer
+	// by an OOM/restart, or stalled before cleanup could run). The 24h window
+	// is well beyond any legitimate in-flight download, so this never trims an
+	// active transfer; it just stops dead partials accumulating on the volume.
+	if removed, cErr := downloader.CleanupStalePartialFiles(options.SystemState.Model.ModelsPath, 24*time.Hour); cErr != nil {
+		xlog.Warn("Failed to reap stale partial downloads", "error", cErr)
+	} else if removed > 0 {
+		xlog.Info("Reaped stale partial downloads", "count", removed)
+	}
 	if options.GeneratedContentDir != "" {
-		err := os.MkdirAll(options.GeneratedContentDir, 0750)
+		err := os.MkdirAll(options.GeneratedContentDir, 0o750)
 		if err != nil {
 			return nil, fmt.Errorf("unable to create ImageDir: %q", err)
 		}
 	}
 	if options.UploadDir != "" {
-		err := os.MkdirAll(options.UploadDir, 0750)
+		err := os.MkdirAll(options.UploadDir, 0o750)
 		if err != nil {
 			return nil, fmt.Errorf("unable to create UploadDir: %q", err)
 		}
@@ -87,7 +97,7 @@ func New(opts ...config.AppOption) (*Application, error) {

 	// Create and migrate data directory
 	if options.DataPath != "" {
-		if err := os.MkdirAll(options.DataPath, 0750); err != nil {
+		if err := os.MkdirAll(options.DataPath, 0o750); err != nil {
 			return nil, fmt.Errorf("unable to create DataPath: %q", err)
 		}
 		// Migrate data from DynamicConfigsDir to DataPath if needed
@@ -192,44 +202,14 @@ func New(opts ...config.AppOption) (*Application, error) {
 		xlog.Info("stats: disabled by --disable-stats")
 	}

-	// Wire the regex PII filter. Default-on: a single-user box gets
-	// the built-in pattern set the first time it starts, with email/
-	// phone/SSN/credit-card on mask and api_key_prefix on block. If
-	// the operator wants different actions, --pii-config points at a
-	// YAML file that overrides per-id; --disable-pii turns it off
-	// entirely.
-	if !options.DisablePII {
-		patterns, err := pii.LoadConfig(options.PIIConfigPath)
-		if err != nil {
-			return nil, fmt.Errorf("pii config: %w", err)
-		}
-		application.piiRedactor = pii.NewRedactor(patterns)
-		application.piiEvents = pii.NewMemoryEventStore(0)
-		// Apply persisted per-pattern overrides — admins toggling
-		// action/disabled via the UI and clicking "Save to disk" land
-		// here on the next start. Bad ids are warned and ignored so a
-		// stale entry doesn't block startup.
-		for id, ov := range options.PIIPatternOverrides {
-			if ov.Action != nil {
-				if err := application.piiRedactor.SetAction(id, pii.Action(*ov.Action)); err != nil {
-					xlog.Warn("pii: persisted override skipped", "pattern", id, "error", err)
-					continue
-				}
-			}
-			if ov.Disabled != nil {
-				if err := application.piiRedactor.SetDisabled(id, *ov.Disabled); err != nil {
-					xlog.Warn("pii: persisted disable skipped", "pattern", id, "error", err)
-				}
-			}
-		}
-		xlog.Info("pii: filter enabled",
-			"patterns", len(patterns),
-			"config_path", options.PIIConfigPath,
-			"persisted_overrides", len(options.PIIPatternOverrides),
-		)
-	} else {
-		xlog.Info("pii: disabled by --disable-pii")
-	}
+	// Wire the PII filter subsystem. The redactor is now a stateless
+	// handle — detection is driven by per-model NER detectors
+	// (pii.detectors → the detector model's pii_detection policy), run
+	// request-side by the chat middleware and the MITM input path. The
+	// regex tier was removed; redaction is opt-in per model via
+	// PIIIsEnabled(). The event store backs the /api/pii/events audit log.
+	application.piiRedactor = &pii.Redactor{}
+	application.piiEvents = pii.NewMemoryEventStore(0)

 	// Wire the routing decision log. Always-on when stats are enabled —
 	// the per-router admin page reads this as the live activity feed
@@ -517,7 +497,7 @@ func startWatcher(options *config.ApplicationConfig) {
 	if _, err := os.Stat(options.DynamicConfigsDir); err != nil {
 		if os.IsNotExist(err) {
 			// We try to create the directory if it does not exist and was specified
-			if err := os.MkdirAll(options.DynamicConfigsDir, 0700); err != nil {
+			if err := os.MkdirAll(options.DynamicConfigsDir, 0o700); err != nil {
 				xlog.Error("failed creating DynamicConfigsDir", "error", err)
 			}
 		} else {
@@ -764,16 +744,6 @@ func loadRuntimeSettingsFromFile(options *config.ApplicationConfig) {
 		options.MITMListen = *settings.MITMListen
 	}

-	// PII pattern overrides — file is the only source; CLI flags don't
-	// reach into this map. Apply unconditionally when present; the
-	// redactor wiring below sees the result on first construction.
-	if settings.PIIPatternOverrides != nil {
-		options.PIIPatternOverrides = make(map[string]config.PIIPatternRuntimeOverride, len(*settings.PIIPatternOverrides))
-		for id, ov := range *settings.PIIPatternOverrides {
-			options.PIIPatternOverrides[id] = ov
-		}
-	}
-
 	// Backend upgrade flags
 	if settings.AutoUpgradeBackends != nil {
 		if !options.AutoUpgradeBackends {
@@ -924,7 +894,7 @@ func loadOrGenerateHMACSecret(path string) (string, error) {
 	}
 	secret := hex.EncodeToString(b)

-	if err := os.WriteFile(path, []byte(secret), 0600); err != nil {
+	if err := os.WriteFile(path, []byte(secret), 0o600); err != nil {
 		return "", fmt.Errorf("failed to persist HMAC secret: %w", err)
 	}

--- a/core/backend/depth.go
+++ b/core/backend/depth.go
@@ -0,0 +1,66 @@
+package backend
+
+import (
+	"context"
+	"fmt"
+	"time"
+
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/trace"
+	"github.com/mudler/LocalAI/pkg/grpc/proto"
+	"github.com/mudler/LocalAI/pkg/model"
+)
+
+// Depth runs depth estimation (Depth Anything 3) on the supplied image and
+// returns the full DepthResponse: per-pixel metric depth + confidence + sky,
+// camera pose (extrinsics/intrinsics), an optional 3D point cloud and any
+// requested exports (glb/colmap). The include_* flags and exports mirror the
+// DepthRequest proto so callers can ask for less work.
+func Depth(
+	ctx context.Context,
+	in *proto.DepthRequest,
+	loader *model.ModelLoader,
+	appConfig *config.ApplicationConfig,
+	modelConfig config.ModelConfig,
+) (*proto.DepthResponse, error) {
+	opts := ModelOptions(modelConfig, appConfig)
+	depthModel, err := loader.Load(opts...)
+	if err != nil {
+		recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
+		return nil, err
+	}
+
+	if depthModel == nil {
+		return nil, fmt.Errorf("could not load depth model")
+	}
+
+	var startTime time.Time
+	if appConfig.EnableTracing {
+		trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
+		startTime = time.Now()
+	}
+
+	res, err := depthModel.Depth(ctx, in)
+
+	if appConfig.EnableTracing {
+		errStr := ""
+		if err != nil {
+			errStr = err.Error()
+		}
+
+		trace.RecordBackendTrace(trace.BackendTrace{
+			Timestamp: startTime,
+			Duration:  time.Since(startTime),
+			Type:      trace.BackendTraceDepth,
+			ModelName: modelConfig.Name,
+			Backend:   modelConfig.Backend,
+			Summary:   trace.TruncateString(in.GetSrc(), 200),
+			Error:     errStr,
+			Data: map[string]any{
+				"exports": in.GetExports(),
+			},
+		})
+	}
+
+	return res, err
+}
--- a/core/backend/options.go
+++ b/core/backend/options.go
@@ -368,6 +368,25 @@ func gRPCPredictOpts(c config.ModelConfig, modelPath string) *pb.PredictOptions
 	if c.ReasoningEffort != "" {
 		metadata["reasoning_effort"] = c.ReasoningEffort
 	}
+	// Client request metadata overrides the server-derived reasoning levers and
+	// reaches every backend through these standalone string keys (Python backends
+	// read them directly). The reserved blob key is server-owned and skipped.
+	for k, v := range c.RequestMetadata {
+		if k == "chat_template_kwargs" {
+			continue
+		}
+		metadata[k] = v
+	}
+	// Build the generic chat_template_kwargs blob (model config map + coerced
+	// metadata) for llama.cpp and write it LAST so a client cannot clobber it.
+	if blob := c.ResolveChatTemplateKwargs(metadata); len(blob) > 0 {
+		b, err := json.Marshal(blob)
+		if err != nil {
+			xlog.Warn("failed to marshal chat_template_kwargs", "error", err)
+		} else {
+			metadata["chat_template_kwargs"] = string(b)
+		}
+	}
 	pbOpts.Metadata = metadata

 	// Logprobs and TopLogprobs are set by the caller if provided
--- a/core/backend/options_internal_test.go
+++ b/core/backend/options_internal_test.go
@@ -161,3 +161,67 @@ var _ = Describe("grpcModelOpts NBatch", func() {
 		Expect(opts.ContextSize).To(BeEquivalentTo(4096), "n_batch must match the effective n_ctx the backend receives")
 	})
 })
+
+// Guards the generic chat_template_kwargs forwarding: the model config map plus any
+// per-request metadata overrides are merged, coerced, and serialised into the
+// backend metadata blob that llama.cpp reads. Client metadata also overrides the
+// server-derived standalone enable_thinking key (cross-backend consistency).
+var _ = Describe("gRPCPredictOpts chat_template_kwargs metadata", func() {
+	baseCfg := func() config.ModelConfig {
+		cfg := config.ModelConfig{}
+		cfg.SetDefaults()
+		return cfg
+	}
+
+	It("serialises the config map into the chat_template_kwargs blob", func() {
+		cfg := baseCfg()
+		cfg.ChatTemplateKwargs = map[string]any{"preserve_thinking": true}
+		opts := gRPCPredictOpts(cfg, "/tmp/models")
+		Expect(opts.Metadata).To(HaveKey("chat_template_kwargs"))
+		var blob map[string]any
+		Expect(json.Unmarshal([]byte(opts.Metadata["chat_template_kwargs"]), &blob)).To(Succeed())
+		Expect(blob).To(HaveKeyWithValue("preserve_thinking", true))
+	})
+
+	It("serialises reasoning_effort into the blob as a JSON string", func() {
+		cfg := baseCfg()
+		cfg.ReasoningEffort = "high"
+		opts := gRPCPredictOpts(cfg, "/tmp/models")
+		Expect(opts.Metadata).To(HaveKey("chat_template_kwargs"))
+		var blob map[string]any
+		Expect(json.Unmarshal([]byte(opts.Metadata["chat_template_kwargs"]), &blob)).To(Succeed())
+		// reasoning_effort must remain a string in the blob (jinja templates that
+		// key on the level read a string), unlike enable_thinking which is a bool.
+		Expect(blob["reasoning_effort"]).To(BeAssignableToTypeOf(""))
+		Expect(blob).To(HaveKeyWithValue("reasoning_effort", "high"))
+	})
+
+	It("lets client request metadata override the server-derived enable_thinking key", func() {
+		cfg := baseCfg()
+		disable := true
+		cfg.ReasoningConfig = reasoning.Config{DisableReasoning: &disable} // server: enable_thinking=false
+		cfg.RequestMetadata = map[string]string{"enable_thinking": "true"} // client overrides
+		opts := gRPCPredictOpts(cfg, "/tmp/models")
+		// standalone key (Python backends) reflects the client override
+		Expect(opts.Metadata).To(HaveKeyWithValue("enable_thinking", "true"))
+		// blob (llama.cpp) reflects it too, as a real bool
+		var blob map[string]any
+		Expect(json.Unmarshal([]byte(opts.Metadata["chat_template_kwargs"]), &blob)).To(Succeed())
+		Expect(blob).To(HaveKeyWithValue("enable_thinking", true))
+	})
+
+	It("does not let a client clobber the blob via a chat_template_kwargs metadata key", func() {
+		cfg := baseCfg()
+		cfg.ChatTemplateKwargs = map[string]any{"preserve_thinking": true}
+		cfg.RequestMetadata = map[string]string{"chat_template_kwargs": "{\"preserve_thinking\": false}"}
+		opts := gRPCPredictOpts(cfg, "/tmp/models")
+		var blob map[string]any
+		Expect(json.Unmarshal([]byte(opts.Metadata["chat_template_kwargs"]), &blob)).To(Succeed())
+		Expect(blob).To(HaveKeyWithValue("preserve_thinking", true))
+	})
+
+	It("omits the blob when there is nothing to forward", func() {
+		opts := gRPCPredictOpts(baseCfg(), "/tmp/models")
+		Expect(opts.Metadata).ToNot(HaveKey("chat_template_kwargs"))
+	})
+})
--- a/core/backend/token_classify.go
+++ b/core/backend/token_classify.go
@@ -0,0 +1,150 @@
+package backend
+
+import (
+	"context"
+	"time"
+
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/trace"
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+	model "github.com/mudler/LocalAI/pkg/model"
+)
+
+// TokenEntity is one detected span from a token-classification (NER)
+// model. Mirrors pb.TokenClassifyEntity but keeps the proto type out of
+// consumers. Start/End are BYTE offsets into the classified text,
+// half-open (addressing text[Start:End]) — the proto contract. Group is
+// the model's entity label (e.g. "private_person", "EMAIL").
+type TokenEntity struct {
+	Group string  `json:"group"`
+	Start int     `json:"start"`
+	End   int     `json:"end"`
+	Score float32 `json:"score"`
+	Text  string  `json:"text"`
+}
+
+// TokenClassifyOptions controls a single TokenClassify request.
+type TokenClassifyOptions struct {
+	// Threshold drops entities the backend scores below this value at
+	// the source. 0 returns everything the model emits; downstream
+	// callers (e.g. the PII redactor's MinScore) can still filter
+	// further once they know the per-request policy.
+	Threshold float32
+}
+
+// TokenClassifier runs a token-classification model over text and
+// returns the detected entity spans. NewTokenClassifier returns the
+// implementation, backed by a lazily loaded model backend; the PII
+// redactor's encoder/NER tier
+// core/services/routing/piidetector).
+type TokenClassifier interface {
+	TokenClassify(ctx context.Context, text string) ([]TokenEntity, error)
+}
+
+// NewTokenClassifier binds (loader, modelConfig, appConfig) into a
+// TokenClassifier. The underlying backend is resolved lazily on the
+// first call, mirroring NewScorer.
+func NewTokenClassifier(loader *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig, opts TokenClassifyOptions) TokenClassifier {
+	return &modelTokenClassifier{loader: loader, modelConfig: modelConfig, appConfig: appConfig, opts: opts}
+}
+
+type modelTokenClassifier struct {
+	loader      *model.ModelLoader
+	modelConfig config.ModelConfig
+	appConfig   *config.ApplicationConfig
+	opts        TokenClassifyOptions
+}
+
+func (m *modelTokenClassifier) TokenClassify(ctx context.Context, text string) ([]TokenEntity, error) {
+	fn, err := ModelTokenClassify(text, m.opts, m.loader, m.modelConfig, m.appConfig)
+	if err != nil {
+		return nil, err
+	}
+	return fn(ctx)
+}
+
+// ModelTokenClassify loads the backend for modelConfig and returns a
+// closure that classifies `text`. Mirrors ModelScore: the closure is
+// bound to the loaded model so a caller can reuse it within a request
+// without re-resolving the backend.
+//
+// When tracing is enabled it records a BackendTraceTokenClassify row so the
+// detector's output — every entity's group, byte range, confidence and the
+// matched substring — shows in the Traces UI alongside the request it gated.
+// This is the technical view for debugging false positives (e.g. a phone
+// number scored as SSN); the persisted PIIEvent keeps only a hash.
+func ModelTokenClassify(text string, opts TokenClassifyOptions, loader *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (func(ctx context.Context) ([]TokenEntity, error), error) {
+	modelOpts := ModelOptions(modelConfig, appConfig)
+	inferenceModel, err := loader.Load(modelOpts...)
+	if err != nil {
+		recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
+		return nil, err
+	}
+	return func(ctx context.Context) ([]TokenEntity, error) {
+		var startTime time.Time
+		if appConfig.EnableTracing {
+			trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)
+			startTime = time.Now()
+		}
+		resp, err := inferenceModel.TokenClassify(ctx, &pb.TokenClassifyRequest{
+			Text:      text,
+			Threshold: opts.Threshold,
+		})
+		entities := tokenClassifyResponseToEntities(resp)
+		if appConfig.EnableTracing {
+			trace.RecordBackendTrace(tokenClassifyTrace(modelConfig, text, opts.Threshold, entities, startTime, err))
+		}
+		if err != nil {
+			return nil, err
+		}
+		return entities, nil
+	}, nil
+}
+
+// tokenClassifyTrace assembles the Traces-UI row for one NER call: the input
+// preview, the threshold, and every detected entity (group, byte range,
+// confidence, matched text). Split out from the closure so the Data assembly
+// is unit-testable without a live backend.
+func tokenClassifyTrace(modelConfig config.ModelConfig, text string, threshold float32, entities []TokenEntity, start time.Time, callErr error) trace.BackendTrace {
+	errStr := ""
+	if callErr != nil {
+		errStr = callErr.Error()
+	}
+	return trace.BackendTrace{
+		Timestamp: start,
+		Duration:  time.Since(start),
+		Type:      trace.BackendTraceTokenClassify,
+		ModelName: modelConfig.Name,
+		Backend:   modelConfig.Backend,
+		Summary:   trace.TruncateString(text, 200),
+		Error:     errStr,
+		Data: map[string]any{
+			"input_chars": len(text),
+			"threshold":   threshold,
+			"entities":    entities,
+		},
+	}
+}
+
+// tokenClassifyResponseToEntities converts the wire-format response into
+// the value type consumed by callers. Extracted so the conversion can be
+// unit-tested without a real backend (see token_classify_test.go).
+func tokenClassifyResponseToEntities(resp *pb.TokenClassifyResponse) []TokenEntity {
+	if resp == nil {
+		return nil
+	}
+	out := make([]TokenEntity, 0, len(resp.Entities))
+	for _, e := range resp.Entities {
+		if e == nil {
+			continue
+		}
+		out = append(out, TokenEntity{
+			Group: e.EntityGroup,
+			Start: int(e.Start),
+			End:   int(e.End),
+			Score: e.Score,
+			Text:  e.Text,
+		})
+	}
+	return out
+}
--- a/core/backend/token_classify_test.go
+++ b/core/backend/token_classify_test.go
@@ -0,0 +1,61 @@
+package backend
+
+import (
+	"errors"
+	"time"
+
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/trace"
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("tokenClassifyResponseToEntities", func() {
+	It("returns nil for a nil response", func() {
+		Expect(tokenClassifyResponseToEntities(nil)).To(BeNil())
+	})
+
+	It("maps proto entities to TokenEntity, skipping nil rows", func() {
+		resp := &pb.TokenClassifyResponse{
+			Entities: []*pb.TokenClassifyEntity{
+				{EntityGroup: "private_person", Start: 3, End: 8, Score: 0.97, Text: "Alice"},
+				nil,
+				{EntityGroup: "EMAIL", Start: 20, End: 40, Score: 0.5, Text: "a@b.com"},
+			},
+		}
+		Expect(tokenClassifyResponseToEntities(resp)).To(Equal([]TokenEntity{
+			{Group: "private_person", Start: 3, End: 8, Score: 0.97, Text: "Alice"},
+			{Group: "EMAIL", Start: 20, End: 40, Score: 0.5, Text: "a@b.com"},
+		}))
+	})
+
+	It("returns an empty (non-nil) slice for a response with no entities", func() {
+		out := tokenClassifyResponseToEntities(&pb.TokenClassifyResponse{})
+		Expect(out).NotTo(BeNil())
+		Expect(out).To(BeEmpty())
+	})
+})
+
+var _ = Describe("tokenClassifyTrace", func() {
+	cfg := config.ModelConfig{Name: "privacy-filter", Backend: "privacy-filter"}
+	ents := []TokenEntity{{Group: "SSN", Start: 5, End: 16, Score: 0.62, Text: "123-45-6789"}}
+
+	It("captures model, input preview, threshold and per-entity detail", func() {
+		tr := tokenClassifyTrace(cfg, "ssn is 123-45-6789", 0.5, ents, time.Now(), nil)
+		Expect(tr.Type).To(Equal(trace.BackendTraceTokenClassify))
+		Expect(tr.ModelName).To(Equal("privacy-filter"))
+		Expect(tr.Backend).To(Equal("privacy-filter"))
+		Expect(tr.Summary).To(ContainSubstring("ssn is"))
+		Expect(tr.Error).To(BeEmpty())
+		Expect(tr.Data["input_chars"]).To(Equal(len("ssn is 123-45-6789")))
+		Expect(tr.Data["threshold"]).To(BeEquivalentTo(float32(0.5)))
+		Expect(tr.Data["entities"]).To(Equal(ents))
+	})
+
+	It("records the backend error string when the call failed", func() {
+		tr := tokenClassifyTrace(cfg, "x", 0, nil, time.Now(), errors.New("boom"))
+		Expect(tr.Error).To(Equal("boom"))
+	})
+})
--- a/core/config/application_config.go
+++ b/core/config/application_config.go
@@ -57,25 +57,6 @@ type ApplicationConfig struct {
 	// touch disk or memory.
 	DisableStats bool

-	// PIIConfigPath points to an optional YAML file describing the PII
-	// pattern set. When empty, the routing/pii module's DefaultPatterns()
-	// (email, phone, SSN, credit card, IPv4, API key prefixes) are
-	// loaded with their default actions. Each entry overrides the
-	// matching default by ID:
-	//
-	//   patterns:
-	//     - id: email
-	//       action: allow            # downgrade default mask -> allow (log only)
-	//     - id: ssn
-	//       action: block            # upgrade default mask -> block
-	//
-	// Unknown ids are rejected with a clear error at startup.
-	PIIConfigPath string
-
-	// DisablePII turns the regex PII filter off entirely. Default
-	// (false) enables it on the OpenAI chat completions route.
-	DisablePII bool
-
 	// MITMListen is the address (host:port) the cloudproxy MITM
 	// listener binds on. Empty disables the MITM proxy entirely.
 	// Use case: redacting PII from Claude Code / Codex CLI traffic
@@ -84,18 +65,20 @@ type ApplicationConfig struct {
 	// LocalAI exposes at /api/middleware/proxy-ca.crt.
 	MITMListen string

+	// PIIDefaultDetectors lists token-classification (NER) detector model
+	// names applied to any PII-enabled model that does not name its own
+	// pii.detectors. This makes cloud-proxy / MITM redaction work out of the
+	// box (those default to PII-enabled but carry no detector list) and lets
+	// an operator set one detector for the whole instance. Set at runtime via
+	// POST /api/settings; read live by Application.ResolvePIIPolicy.
+	PIIDefaultDetectors []string
+
 	// MITMCADir holds the persisted MITM proxy CA cert and private
 	// key. The CA is generated on first start; subsequent starts
 	// reload it so clients keep trusting the same root. The key
 	// file is mode 0600.
 	MITMCADir string

-	// PIIPatternOverrides applies persisted per-id deltas (action,
-	// disabled) to the live redactor at startup. Loaded from
-	// runtime_settings.json and applied right after pii.NewRedactor.
-	// nil/empty leaves the YAML defaults in place.
-	PIIPatternOverrides map[string]PIIPatternRuntimeOverride
-
 	DisableWebUI                       bool
 	OllamaAPIRootEndpoint              bool
 	EnforcePredownloadScans            bool
@@ -488,6 +471,16 @@ func (o *ApplicationConfig) GetEffectiveMaxActiveBackends() int {
 	return 0
 }

+// WatchdogShouldRun reports whether the live watchdog process should be
+// running for the current config. It mirrors the gating in
+// (*Application).startWatchdog so the /api/settings start/stop decision and
+// the startup path agree on a single source of truth: the watchdog runs when
+// idle/busy checks are enabled (WatchDog), when LRU eviction is active
+// (effective max active backends > 0), or when the memory reclaimer is on.
+func (o *ApplicationConfig) WatchdogShouldRun() bool {
+	return o.WatchDog || o.GetEffectiveMaxActiveBackends() > 0 || o.MemoryReclaimerEnabled
+}
+
 // WithForceEvictionWhenBusy sets whether to force eviction even when models have active API calls
 func WithForceEvictionWhenBusy(enabled bool) AppOption {
 	return func(o *ApplicationConfig) {
@@ -603,6 +596,7 @@ func WithJSONStringPreload(configFile string) AppOption {
 		o.PreloadJSONModels = configFile
 	}
 }
+
 func WithConfigFile(configFile string) AppOption {
 	return func(o *ApplicationConfig) {
 		o.ConfigFile = configFile
@@ -691,21 +685,6 @@ func WithDisableStats(disable bool) AppOption {
 	}
 }

-// WithPIIConfigPath points the routing PII filter at a YAML config
-// file. CLI: --pii-config.
-func WithPIIConfigPath(path string) AppOption {
-	return func(o *ApplicationConfig) {
-		o.PIIConfigPath = path
-	}
-}
-
-// WithDisablePII turns the regex PII filter off. CLI: --disable-pii.
-func WithDisablePII(disable bool) AppOption {
-	return func(o *ApplicationConfig) {
-		o.DisablePII = disable
-	}
-}
-
 // WithMITMListen sets the address the cloudproxy MITM listener
 // binds on. Empty = disabled. CLI: --mitm-listen.
 func WithMITMListen(addr string) AppOption {
@@ -1127,6 +1106,8 @@ func (o *ApplicationConfig) ToRuntimeSettings() RuntimeSettings {

 	mitmListen := o.MITMListen

+	piiDefaultDetectors := append([]string(nil), o.PIIDefaultDetectors...)
+
 	return RuntimeSettings{
 		WatchdogEnabled:           &watchdogEnabled,
 		WatchdogIdleEnabled:       &watchdogIdle,
@@ -1181,6 +1162,7 @@ func (o *ApplicationConfig) ToRuntimeSettings() RuntimeSettings {
 		LogoHorizontalFile:        &logoHorizontalFile,
 		FaviconFile:               &faviconFile,
 		MITMListen:                &mitmListen,
+		PIIDefaultDetectors:       &piiDefaultDetectors,
 	}
 }

@@ -1198,18 +1180,22 @@ func (o *ApplicationConfig) ApplyRuntimeSettings(settings *RuntimeSettings) (req
 	}
 	if settings.WatchdogIdleEnabled != nil {
 		o.WatchDogIdle = *settings.WatchdogIdleEnabled
-		if o.WatchDogIdle {
-			o.WatchDog = true
-		}
 		requireRestart = true
 	}
 	if settings.WatchdogBusyEnabled != nil {
 		o.WatchDogBusy = *settings.WatchdogBusyEnabled
-		if o.WatchDogBusy {
-			o.WatchDog = true
-		}
 		requireRestart = true
 	}
+	// The React Settings "Enable Watchdog" master toggle manages only the
+	// idle/busy checks — watchdog_enabled is vestigial in that UI. Whenever
+	// either idle/busy field is present in the body, derive the run-state from
+	// idle||busy so a cold enable starts the watchdog and a full disable stops
+	// it, instead of trusting the stale watchdog_enabled the UI never updates.
+	// This mirrors the startup invariant in startup.go. An API client posting
+	// only watchdog_enabled (idle/busy absent) keeps its explicit value.
+	if settings.WatchdogIdleEnabled != nil || settings.WatchdogBusyEnabled != nil {
+		o.WatchDog = o.WatchDogIdle || o.WatchDogBusy
+	}
 	if settings.WatchdogIdleTimeout != nil {
 		if dur, err := time.ParseDuration(*settings.WatchdogIdleTimeout); err == nil {
 			o.WatchDogIdleTimeout = dur
@@ -1410,6 +1396,10 @@ func (o *ApplicationConfig) ApplyRuntimeSettings(settings *RuntimeSettings) (req
 		o.MITMListen = *settings.MITMListen
 	}

+	if settings.PIIDefaultDetectors != nil {
+		o.PIIDefaultDetectors = append([]string(nil), (*settings.PIIDefaultDetectors)...)
+	}
+
 	// Note: ApiKeys requires special handling (merging with startup keys) - handled in caller

 	return requireRestart
--- a/core/config/application_config_test.go
+++ b/core/config/application_config_test.go
@@ -223,6 +223,69 @@ var _ = Describe("ApplicationConfig RuntimeSettings Conversion", func() {
 			Expect(appConfig.WatchDogBusy).To(BeTrue())
 		})

+		// Residual #9125: the React Settings "Enable Watchdog" master toggle
+		// manages only watchdog_idle_enabled / watchdog_busy_enabled — it never
+		// touches the vestigial watchdog_enabled field. On a cold enable the
+		// body therefore carries watchdog_enabled=false alongside idle/busy=true.
+		// The derived run-state (WatchDog) must follow idle||busy so the live
+		// watchdog actually starts, not the stale watchdog_enabled=false.
+		It("should derive WatchDog from idle||busy on a cold enable even when watchdog_enabled=false", func() {
+			appConfig := &ApplicationConfig{WatchDog: false}
+
+			watchdogEnabled := false
+			watchdogIdle := true
+			watchdogBusy := true
+			rs := &RuntimeSettings{
+				WatchdogEnabled:     &watchdogEnabled,
+				WatchdogIdleEnabled: &watchdogIdle,
+				WatchdogBusyEnabled: &watchdogBusy,
+			}
+
+			appConfig.ApplyRuntimeSettings(rs)
+
+			Expect(appConfig.WatchDog).To(BeTrue())
+			Expect(appConfig.WatchdogShouldRun()).To(BeTrue())
+		})
+
+		// The disable direction: the master toggle off sends idle=false,
+		// busy=false, but watchdog_enabled may still be the stale true loaded
+		// before the change. WatchDog must follow idle||busy down to false so
+		// the live watchdog is stopped (it stays stopped unless LRU / memory
+		// reclaimer keep it alive, which is gated by WatchdogShouldRun).
+		It("should disable WatchDog when both idle and busy are turned off", func() {
+			appConfig := &ApplicationConfig{WatchDog: true, WatchDogIdle: true, WatchDogBusy: true}
+
+			watchdogEnabled := true
+			watchdogIdle := false
+			watchdogBusy := false
+			rs := &RuntimeSettings{
+				WatchdogEnabled:     &watchdogEnabled,
+				WatchdogIdleEnabled: &watchdogIdle,
+				WatchdogBusyEnabled: &watchdogBusy,
+			}
+
+			appConfig.ApplyRuntimeSettings(rs)
+
+			Expect(appConfig.WatchDog).To(BeFalse())
+			Expect(appConfig.WatchdogShouldRun()).To(BeFalse())
+		})
+
+		// Backward compatibility: an API client that posts only watchdog_enabled
+		// (idle/busy nil) keeps the explicit value — the idle/busy derivation
+		// only kicks in when those fields are actually present in the body.
+		It("should preserve explicit watchdog_enabled when idle/busy are absent", func() {
+			appConfig := &ApplicationConfig{WatchDog: false}
+
+			watchdogEnabled := true
+			rs := &RuntimeSettings{
+				WatchdogEnabled: &watchdogEnabled,
+			}
+
+			appConfig.ApplyRuntimeSettings(rs)
+
+			Expect(appConfig.WatchDog).To(BeTrue())
+		})
+
 		It("should handle MaxActiveBackends and update SingleBackend accordingly", func() {
 			appConfig := &ApplicationConfig{}

--- a/core/config/backend_capabilities.go
+++ b/core/config/backend_capabilities.go
@@ -8,25 +8,27 @@ import (
 // Usecase name constants — the canonical string values used in gallery entries,
 // model configs (known_usecases), and UsecaseInfoMap keys.
 const (
-	UsecaseChat            = "chat"
-	UsecaseCompletion      = "completion"
-	UsecaseEdit            = "edit"
-	UsecaseVision          = "vision"
-	UsecaseEmbeddings      = "embeddings"
-	UsecaseTokenize        = "tokenize"
-	UsecaseImage           = "image"
-	UsecaseVideo           = "video"
-	UsecaseTranscript      = "transcript"
-	UsecaseTTS             = "tts"
-	UsecaseSoundGeneration = "sound_generation"
-	UsecaseRerank          = "rerank"
-	UsecaseDetection       = "detection"
-	UsecaseVAD             = "vad"
-	UsecaseAudioTransform      = "audio_transform"
-	UsecaseDiarization         = "diarization"
-	UsecaseRealtimeAudio       = "realtime_audio"
-	UsecaseFaceRecognition     = "face_recognition"
-	UsecaseSpeakerRecognition  = "speaker_recognition"
+	UsecaseChat               = "chat"
+	UsecaseCompletion         = "completion"
+	UsecaseEdit               = "edit"
+	UsecaseVision             = "vision"
+	UsecaseEmbeddings         = "embeddings"
+	UsecaseTokenize           = "tokenize"
+	UsecaseImage              = "image"
+	UsecaseVideo              = "video"
+	UsecaseTranscript         = "transcript"
+	UsecaseTTS                = "tts"
+	UsecaseSoundGeneration    = "sound_generation"
+	UsecaseRerank             = "rerank"
+	UsecaseDetection          = "detection"
+	UsecaseDepth              = "depth"
+	UsecaseVAD                = "vad"
+	UsecaseAudioTransform     = "audio_transform"
+	UsecaseDiarization        = "diarization"
+	UsecaseRealtimeAudio      = "realtime_audio"
+	UsecaseFaceRecognition    = "face_recognition"
+	UsecaseSpeakerRecognition = "speaker_recognition"
+	UsecaseTokenClassify      = "token_classify"
 )

 // GRPCMethod identifies a Backend service RPC from backend.proto.
@@ -44,6 +46,7 @@ const (
 	MethodSoundGeneration    GRPCMethod = "SoundGeneration"
 	MethodTokenizeString     GRPCMethod = "TokenizeString"
 	MethodDetect             GRPCMethod = "Detect"
+	MethodDepth              GRPCMethod = "Depth"
 	MethodRerank             GRPCMethod = "Rerank"
 	MethodVAD                GRPCMethod = "VAD"
 	MethodAudioTransform     GRPCMethod = "AudioTransform"
@@ -54,6 +57,7 @@ const (
 	MethodVoiceVerify        GRPCMethod = "VoiceVerify"
 	MethodVoiceEmbed         GRPCMethod = "VoiceEmbed"
 	MethodVoiceAnalyze       GRPCMethod = "VoiceAnalyze"
+	MethodTokenClassify      GRPCMethod = "TokenClassify"
 )

 // UsecaseInfo describes a single known_usecase value and how it maps
@@ -141,6 +145,11 @@ var UsecaseInfoMap = map[string]UsecaseInfo{
 		GRPCMethod:  MethodDetect,
 		Description: "Object detection via the Detect RPC with bounding boxes.",
 	},
+	UsecaseDepth: {
+		Flag:        FLAG_DEPTH,
+		GRPCMethod:  MethodDepth,
+		Description: "Per-pixel metric depth, camera pose and 3D point cloud via the Depth RPC (Depth Anything 3).",
+	},
 	UsecaseVAD: {
 		Flag:        FLAG_VAD,
 		GRPCMethod:  MethodVAD,
@@ -171,6 +180,11 @@ var UsecaseInfoMap = map[string]UsecaseInfo{
 		GRPCMethod:  MethodVoiceVerify,
 		Description: "Speaker recognition — verify identity, embed and analyze voice via VoiceVerify, VoiceEmbed and VoiceAnalyze RPCs.",
 	},
+	UsecaseTokenClassify: {
+		Flag:        FLAG_TOKEN_CLASSIFY,
+		GRPCMethod:  MethodTokenClassify,
+		Description: "Per-token classification (NER) via the TokenClassify RPC — the PII detector tier. Declared explicitly via known_usecases; never auto-guessed, since the token-classification head is not useful as general generation or embeddings.",
+	},
 }

 // BackendCapability describes which gRPC methods and usecases a backend supports.
@@ -207,6 +221,17 @@ var BackendCapabilities = map[string]BackendCapability{
 		AcceptsImages:    true, // requires mmproj
 		Description:      "llama.cpp GGUF models — LLM inference with optional vision via mmproj",
 	},
+	// privacy-filter is the standalone GGML engine (backend/cpp/privacy-filter,
+	// wrapping privacy-filter.cpp) for the openai-privacy-filter PII/NER token
+	// classifier — the dedicated TokenClassify path that replaces the
+	// patched-llama.cpp route. Never auto-guessed; declared explicitly via
+	// known_usecases: [token_classify].
+	"privacy-filter": {
+		GRPCMethods:      []GRPCMethod{MethodTokenClassify},
+		PossibleUsecases: []string{UsecaseTokenClassify},
+		DefaultUsecases:  []string{UsecaseTokenClassify},
+		Description:      "privacy-filter.cpp — standalone GGML backend for openai-privacy-filter PII/NER token classification",
+	},
 	"vllm": {
 		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
 		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings, UsecaseVision},
@@ -488,6 +513,13 @@ var BackendCapabilities = map[string]BackendCapability{
 		DefaultUsecases:  []string{UsecaseDetection},
 		Description:      "RF-DETR C++ object detection",
 	},
+	"depth-anything": {
+		GRPCMethods:      []GRPCMethod{MethodDepth, MethodPredict, MethodGenerateImage},
+		PossibleUsecases: []string{UsecaseDepth},
+		DefaultUsecases:  []string{UsecaseDepth},
+		AcceptsImages:    true,
+		Description:      "Depth Anything 3 C++ — per-pixel metric depth, camera pose and 3D point cloud",
+	},

 	// --- Face and speaker recognition backends ---
 	"insightface": {
--- a/core/config/chat_template_kwargs_test.go
+++ b/core/config/chat_template_kwargs_test.go
@@ -0,0 +1,48 @@
+package config_test
+
+import (
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+
+	"github.com/mudler/LocalAI/core/config"
+)
+
+// ResolveChatTemplateKwargs layers the model config map (base) under the coerced
+// backend metadata (server reasoning levers + client request overrides).
+var _ = Describe("ModelConfig.ResolveChatTemplateKwargs", func() {
+	It("returns nil when nothing is set", func() {
+		c := &config.ModelConfig{}
+		Expect(c.ResolveChatTemplateKwargs(nil)).To(BeNil())
+	})
+
+	It("returns the config map when no metadata is present", func() {
+		c := &config.ModelConfig{ChatTemplateKwargs: map[string]any{"preserve_thinking": true}}
+		Expect(c.ResolveChatTemplateKwargs(nil)).To(HaveKeyWithValue("preserve_thinking", true))
+	})
+
+	It("lets metadata override the config map", func() {
+		c := &config.ModelConfig{ChatTemplateKwargs: map[string]any{"enable_thinking": true}}
+		got := c.ResolveChatTemplateKwargs(map[string]string{"enable_thinking": "false"})
+		Expect(got).To(HaveKeyWithValue("enable_thinking", false))
+	})
+
+	It("coerces true/false to bool and leaves other strings as-is", func() {
+		c := &config.ModelConfig{}
+		got := c.ResolveChatTemplateKwargs(map[string]string{
+			"enable_thinking":  "true",
+			"reasoning_effort": "high",
+		})
+		Expect(got).To(HaveKeyWithValue("enable_thinking", true))
+		Expect(got).To(HaveKeyWithValue("reasoning_effort", "high"))
+	})
+
+	It("skips the reserved chat_template_kwargs metadata key but keeps siblings", func() {
+		c := &config.ModelConfig{}
+		got := c.ResolveChatTemplateKwargs(map[string]string{
+			"chat_template_kwargs": "{\"x\":1}",
+			"preserve_thinking":    "true",
+		})
+		Expect(got).ToNot(HaveKey("chat_template_kwargs"))
+		Expect(got).To(HaveKeyWithValue("preserve_thinking", true))
+	})
+})
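The specs above pin down the layering rules for ResolveChatTemplateKwargs without showing its body. What follows is a minimal sketch of one function that would satisfy them, not the actual LocalAI implementation: `resolveKwargs` and `reservedKey` are hypothetical names, and returning nil for an empty result is an assumption taken from the first spec.

```go
package main

import "fmt"

// reservedKey is a hypothetical name for the metadata key the specs say
// must be skipped when folding metadata into the kwargs map.
const reservedKey = "chat_template_kwargs"

// resolveKwargs is an illustrative stand-in, not the real method. It copies
// the model-config map first, then layers request metadata on top, coercing
// the literal strings "true"/"false" to bool and leaving other strings as-is.
func resolveKwargs(base map[string]any, metadata map[string]string) map[string]any {
	out := make(map[string]any, len(base)+len(metadata))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range metadata {
		if k == reservedKey {
			continue // the reserved blob key is never folded in
		}
		switch v {
		case "true":
			out[k] = true
		case "false":
			out[k] = false
		default:
			out[k] = v
		}
	}
	if len(out) == 0 {
		return nil // nothing set anywhere
	}
	return out
}

func main() {
	got := resolveKwargs(
		map[string]any{"enable_thinking": true},
		map[string]string{
			"enable_thinking":      "false",
			"reasoning_effort":     "high",
			"chat_template_kwargs": "{\"x\":1}",
		},
	)
	fmt.Println(got) // map[enable_thinking:false reasoning_effort:high]
}
```

Metadata is applied last so a per-request value beats the model config, which is the ordering the "lets metadata override the config map" spec checks.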
--- a/core/config/gguf.go
+++ b/core/config/gguf.go
@@ -19,8 +19,19 @@ const (
 	defaultNGPULayers  = 99999999
 )

-func guessGGUFFromFile(cfg *ModelConfig, f *gguf.GGUFFile, defaultCtx int) {
+// reservedNonChatModel reports whether the operator reserved this model for an
+// internal primitive — the router score classifier or the PII NER
+// token_classify tier. Such a model has no chat template and must not be
+// given the generative-chat defaults the GGUF importer otherwise applies
+// (FLAG_CHAT, jinja templating): surfacing it in chat pickers defeats the
+// reservation. Operators who do want a combined model declare both usecases
+// explicitly — the combination is valid.
+func reservedNonChatModel(cfg *ModelConfig) bool {
+	return cfg.KnownUsecases != nil &&
+		(*cfg.KnownUsecases&(FLAG_SCORE|FLAG_TOKEN_CLASSIFY)) != 0
+}

+func guessGGUFFromFile(cfg *ModelConfig, f *gguf.GGUFFile, defaultCtx int) {
 	if defaultCtx == 0 && cfg.ContextSize == nil {
 		ctxSize := f.EstimateLLaMACppRun().ContextSize
 		if ctxSize > 0 {
@@ -77,11 +88,19 @@ func guessGGUFFromFile(cfg *ModelConfig, f *gguf.GGUFFile, defaultCtx int) {
 		cfg.Name = f.Metadata().Name
 	}

-	// Instruct to use template from llama.cpp
-	cfg.TemplateConfig.UseTokenizerTemplate = true
-	cfg.FunctionsConfig.GrammarConfig.NoGrammar = true
-	cfg.Options = append(cfg.Options, "use_jinja:true")
-	cfg.KnownUsecaseStrings = append(cfg.KnownUsecaseStrings, "FLAG_CHAT")
+	// A model the operator reserved for an internal primitive (the router
+	// score classifier, or the PII NER token_classify tier) is not a chat
+	// model: it carries no chat template and must not be painted with the
+	// generative-chat defaults — appending FLAG_CHAT here would fold chat
+	// into KnownUsecases on the next sync and surface the model in every
+	// chat picker. Respect the declaration.
+	if !reservedNonChatModel(cfg) {
+		// Instruct to use template from llama.cpp
+		cfg.TemplateConfig.UseTokenizerTemplate = true
+		cfg.FunctionsConfig.GrammarConfig.NoGrammar = true
+		cfg.Options = append(cfg.Options, "use_jinja:true")
+		cfg.KnownUsecaseStrings = append(cfg.KnownUsecaseStrings, "FLAG_CHAT")
+	}

 	// Apply per-model-family inference parameter defaults (temperature, top_p, etc.)
 	ApplyInferenceDefaults(cfg, f.Metadata().Name)
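The reservedNonChatModel guard in this file is a nil-safe bitmask test. The sketch below reproduces only that test, under stated assumptions: the `usecase` type and the `flagChat`, `flagScore` and `flagTokenClassify` constants are hypothetical stand-ins for the real FLAG_* values, whose type and bit positions are not shown in this diff.

```go
package main

import "fmt"

// usecase and the flag constants are illustrative only; the real FLAG_*
// definitions live elsewhere in core/config.
type usecase uint32

const (
	flagChat usecase = 1 << iota
	flagScore
	flagTokenClassify
)

// reserved mirrors the shape of reservedNonChatModel: a nil pointer means
// "nothing declared", otherwise any score/token_classify bit marks the
// model as reserved.
func reserved(known *usecase) bool {
	return known != nil && *known&(flagScore|flagTokenClassify) != 0
}

func main() {
	chatOnly := flagChat
	ner := flagTokenClassify
	combined := flagChat | flagScore

	fmt.Println(reserved(nil))       // false
	fmt.Println(reserved(&chatOnly)) // false
	fmt.Println(reserved(&ner))      // true
	fmt.Println(reserved(&combined)) // true
}
```

In this sketch a combined chat+score value still counts as reserved, because the mask only asks whether a reserved bit is present. That matches the mask in the diff, so a combined model skips the importer's chat defaults and relies on what the operator declared.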
--- a/core/config/hardware_defaults.go
+++ b/core/config/hardware_defaults.go
@@ -0,0 +1,190 @@
+package config
+
+import (
+	"fmt"
+	"strconv"
+	"strings"
+
+	"github.com/mudler/LocalAI/pkg/xsysinfo"
+	"github.com/mudler/xlog"
+)
+
+// Hardware-driven model-config defaults.
+//
+// This sits alongside the other config overriders (ApplyInferenceDefaults for
+// model families, guessDefaultsFromFile for GGUF/NGPULayers): they all
+// heuristically fill ModelConfig values the user left unset. Hardware tuning is
+// the same domain — "adjust the config from the device that will run it" — so
+// it lives here rather than scattered into the backend or a separate package.
+//
+// The heuristics are parameterized on a GPU descriptor (not on direct
+// detection) so they apply in both deployment shapes: SetDefaults passes
+// localGPU() on a single host, and the distributed router passes the *selected
+// node's* reported GPU before loading there (the frontend that loaded the
+// config may have no GPU at all).
+
+// GPU describes the device that will run a model.
+type GPU struct {
+	// Vendor is "nvidia", "amd", … (matches xsysinfo vendor constants).
+	Vendor string
+	// ComputeCapability is the NVIDIA compute capability as "major.minor"
+	// (e.g. "12.1" for GB10 / DGX Spark). Empty for non-NVIDIA / unknown.
+	ComputeCapability string
+	// VRAM is total device memory in bytes (0 = unknown).
+	VRAM uint64
+}
+
+// Physical batch (n_batch / n_ubatch) defaults.
+const (
+	// DefaultPhysicalBatch is the conservative default when no hardware-specific
+	// tuning applies. Matches backend.DefaultBatchSize.
+	DefaultPhysicalBatch = 512
+	// BlackwellPhysicalBatch is the default on NVIDIA Blackwell consumer GPUs
+	// (sm_12x: sm_120 RTX 50-series, sm_121 GB10 / DGX Spark). A larger physical
+	// batch materially lifts MoE prefill there (per-expert GEMM tiles fill
+	// better); measured on a GB10 with Qwen3-30B-A3B to saturate around 2048.
+	BlackwellPhysicalBatch = 2048
+)
+
+// IsNVIDIABlackwell reports whether the GPU is in the NVIDIA Blackwell consumer
+// family (sm_12x); any compute capability with major >= 12 matches. Datacenter
+// Blackwell (B100/B200/GB200, sm_100 / cc 10.0) is intentionally not matched.
+func (g GPU) IsNVIDIABlackwell() bool {
+	maj, _ := parseComputeCapability(g.ComputeCapability)
+	return maj >= 12
+}
+
+// PhysicalBatch returns the canonical physical batch (n_batch/n_ubatch) for the
+// given hardware, used when the model config leaves batch unset.
+func PhysicalBatch(g GPU) int {
+	if g.IsNVIDIABlackwell() {
+		return BlackwellPhysicalBatch
+	}
+	return DefaultPhysicalBatch
+}
+
+// IsManagedPhysicalBatch reports whether n is a value PhysicalBatch assigns.
+// Callers that re-tune a value chosen by an upstream host (the distributed
+// router correcting the frontend's guess) use this to avoid clobbering an
+// explicit user batch such as 1024.
+func IsManagedPhysicalBatch(n int) bool {
+	return n == DefaultPhysicalBatch || n == BlackwellPhysicalBatch
+}
+
+// Parallel-slot (n_parallel) VRAM tiers. llama.cpp serializes requests at
+// n_parallel=1 (the backend default) and only auto-enables continuous batching
+// when n_parallel > 1 — so a single-slot default makes concurrent requests
+// queue. We default a slot count by GPU size so multi-user serving works out of
+// the box. With the backend's unified KV cache the slots SHARE the context
+// budget, so more slots add concurrency without multiplying KV memory.
+const (
+	parallelSlotsVRAMHigh = uint64(32) << 30 // >=32 GiB -> 8 slots
+	parallelSlotsVRAMMid  = uint64(8) << 30  // >=8 GiB  -> 4 slots
+	parallelSlotsVRAMLow  = uint64(4) << 30  // >=4 GiB  -> 2 slots
+)
+
+// DefaultParallelSlots returns the n_parallel default for the given GPU. Returns
+// 1 (no concurrency) when VRAM is unknown or too small, so we never change
+// behavior on CPU-only / tiny devices.
+func DefaultParallelSlots(g GPU) int {
+	switch {
+	case g.VRAM >= parallelSlotsVRAMHigh:
+		return 8
+	case g.VRAM >= parallelSlotsVRAMMid:
+		return 4
+	case g.VRAM >= parallelSlotsVRAMLow:
+		return 2
+	default:
+		return 1
+	}
+}
+
+// EnsureParallelOption appends a VRAM-scaled "parallel:N" backend option when the
+// model doesn't already set one (and the GPU warrants concurrency). Returns the
+// possibly-extended options. Shared by the single-host config path
+// (ApplyHardwareDefaults) and the distributed router (per selected node).
+func EnsureParallelOption(opts []string, gpu GPU) []string {
+	if slots := DefaultParallelSlots(gpu); slots > 1 && !hasParallelOption(opts) {
+		return append(opts, fmt.Sprintf("parallel:%d", slots))
+	}
+	return opts
+}
+
+// hasParallelOption reports whether the model already sets parallel/n_parallel
+// (backend options are "name:value" strings) so we never override an explicit value.
+func hasParallelOption(opts []string) bool {
+	for _, o := range opts {
+		name := o
+		if i := strings.IndexByte(o, ':'); i >= 0 {
+			name = o[:i]
+		}
+		switch strings.TrimSpace(strings.ToLower(name)) {
+		case "parallel", "n_parallel":
+			return true
+		}
+	}
+	return false
+}
+
+// localGPU builds a GPU descriptor from local detection, used by SetDefaults on
+// a single host (the distributed router builds it from the selected node's
+// reported info instead). It is a package var so tests can inject a
+// deterministic device — detection does a live nvidia-smi call.
+var localGPU = func() GPU {
+	vendor, _ := xsysinfo.DetectGPUVendor()
+	vram, _ := xsysinfo.TotalAvailableVRAM()
+	return GPU{
+		Vendor:            vendor,
+		ComputeCapability: xsysinfo.NVIDIAComputeCapability(),
+		VRAM:              vram,
+	}
+}
+
+// ApplyHardwareDefaults fills ModelConfig values that depend on the target GPU
+// and were left unset by the user. Currently: a larger physical batch on
+// Blackwell and a VRAM-scaled parallel-slot option. Explicit config always wins.
+func ApplyHardwareDefaults(cfg *ModelConfig, gpu GPU) {
+	if cfg == nil {
+		return
+	}
+	if cfg.Batch == 0 && gpu.IsNVIDIABlackwell() {
+		cfg.Batch = BlackwellPhysicalBatch
+		xlog.Debug("[hardware_defaults] Blackwell GPU: defaulting physical batch",
+			"batch", cfg.Batch, "compute_cap", gpu.ComputeCapability)
+	}
+
+	// Enable concurrent serving by default on a capable GPU: without this the
+	// llama.cpp backend runs n_parallel=1 and serializes multi-user requests
+	// (continuous batching stays off). Unified KV means the slots share the
+	// context budget, so this is concurrency without extra KV memory. Explicit
+	// parallel/n_parallel in the model options always wins.
+	before := len(cfg.Options)
+	cfg.Options = EnsureParallelOption(cfg.Options, gpu)
+	if len(cfg.Options) > before {
+		// EnsureParallelOption only ever appends, so the new option is last.
+		xlog.Debug("[hardware_defaults] defaulting parallel slots for concurrent serving",
+			"option", cfg.Options[len(cfg.Options)-1], "vram_gib", gpu.VRAM>>30)
+	}
+}
+
+// parseComputeCapability splits a "major.minor" string into integer parts.
+// Returns (-1, -1) if cc is empty or major is unparseable; a bad minor becomes 0.
+func parseComputeCapability(cc string) (int, int) {
+	cc = strings.TrimSpace(cc)
+	if cc == "" {
+		return -1, -1
+	}
+	majStr, minStr := cc, "0"
+	if dot := strings.IndexByte(cc, '.'); dot >= 0 {
+		majStr, minStr = cc[:dot], cc[dot+1:]
+	}
+	maj, err := strconv.Atoi(strings.TrimSpace(majStr))
+	if err != nil {
+		return -1, -1
+	}
+	min, err := strconv.Atoi(strings.TrimSpace(minStr))
+	if err != nil {
+		min = 0
+	}
+	return maj, min
+}
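The parallel-slot logic in this file can be tried in isolation. The sketch below is a standalone copy of the tiering and the explicit-option check, written for illustration rather than taken from the package: it uses lowercase function names, takes a raw VRAM byte count instead of the GPU struct, and uses strings.Cut where the diff uses strings.IndexByte.

```go
package main

import (
	"fmt"
	"strings"
)

const gib = uint64(1) << 30

// defaultParallelSlots mirrors the VRAM tiers from the diff:
// >=32 GiB -> 8, >=8 GiB -> 4, >=4 GiB -> 2, otherwise 1.
func defaultParallelSlots(vram uint64) int {
	switch {
	case vram >= 32*gib:
		return 8
	case vram >= 8*gib:
		return 4
	case vram >= 4*gib:
		return 2
	default:
		return 1
	}
}

// hasParallelOption reports whether opts already names parallel or
// n_parallel; options are "name:value" strings and a bare name also counts.
func hasParallelOption(opts []string) bool {
	for _, o := range opts {
		name, _, _ := strings.Cut(o, ":")
		switch strings.ToLower(strings.TrimSpace(name)) {
		case "parallel", "n_parallel":
			return true
		}
	}
	return false
}

// ensureParallelOption appends "parallel:N" only when the tier yields more
// than one slot and no explicit value is present.
func ensureParallelOption(opts []string, vram uint64) []string {
	if slots := defaultParallelSlots(vram); slots > 1 && !hasParallelOption(opts) {
		return append(opts, fmt.Sprintf("parallel:%d", slots))
	}
	return opts
}

func main() {
	fmt.Println(ensureParallelOption(nil, 24*gib))                        // [parallel:4]
	fmt.Println(ensureParallelOption([]string{"n_parallel:2"}, 119*gib)) // [n_parallel:2]
	fmt.Println(ensureParallelOption(nil, 2*gib))                         // []
}
```

Because the function only appends and never rewrites, a caller can detect whether a default was applied by comparing slice lengths before and after, which is what ApplyHardwareDefaults does for its debug log.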
--- a/core/config/hardware_defaults_internal_test.go
+++ b/core/config/hardware_defaults_internal_test.go
@@ -0,0 +1,37 @@
+package config
+
+import (
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+// Single-instance path: SetDefaults applies hardware defaults from the local
+// GPU. The detection seam (localGPU) is injected so the path is deterministic
+// without a real GPU.
+var _ = Describe("SetDefaults hardware defaults (single-instance)", func() {
+	var orig func() GPU
+	BeforeEach(func() { orig = localGPU })
+	AfterEach(func() { localGPU = orig })
+
+	It("sets the physical batch on a local Blackwell GPU", func() {
+		localGPU = func() GPU { return GPU{ComputeCapability: "12.1"} }
+		cfg := &ModelConfig{}
+		cfg.SetDefaults()
+		Expect(cfg.Batch).To(Equal(BlackwellPhysicalBatch))
+	})
+
+	It("leaves batch unset on a non-Blackwell local GPU", func() {
+		localGPU = func() GPU { return GPU{ComputeCapability: "8.9"} }
+		cfg := &ModelConfig{}
+		cfg.SetDefaults()
+		Expect(cfg.Batch).To(Equal(0))
+	})
+
+	It("never overrides an explicit batch", func() {
+		localGPU = func() GPU { return GPU{ComputeCapability: "12.1"} }
+		cfg := &ModelConfig{}
+		cfg.Batch = 1024
+		cfg.SetDefaults()
+		Expect(cfg.Batch).To(Equal(1024))
+	})
+})
--- a/core/config/hardware_defaults_test.go
+++ b/core/config/hardware_defaults_test.go
@@ -0,0 +1,97 @@
+package config_test
+
+import (
+	. "github.com/mudler/LocalAI/core/config"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("Hardware-driven config defaults", func() {
+	DescribeTable("GPU.IsNVIDIABlackwell (sm_12x consumer family)",
+		func(cc string, want bool) {
+			Expect(GPU{ComputeCapability: cc}.IsNVIDIABlackwell()).To(Equal(want))
+		},
+		Entry("GB10 12.1", "12.1", true),
+		Entry("RTX 50 12.0", "12.0", true),
+		Entry("future 13.0", "13.0", true),
+		Entry("Hopper 9.0", "9.0", false),
+		Entry("Ada 8.9", "8.9", false),
+		Entry("datacenter Blackwell sm_100 10.0", "10.0", false),
+		Entry("unknown", "", false),
+	)
+
+	Describe("PhysicalBatch / IsManagedPhysicalBatch", func() {
+		It("returns the Blackwell batch on Blackwell", func() {
+			Expect(PhysicalBatch(GPU{ComputeCapability: "12.1"})).To(Equal(BlackwellPhysicalBatch))
+		})
+		It("returns the default batch otherwise", func() {
+			Expect(PhysicalBatch(GPU{ComputeCapability: "9.0"})).To(Equal(DefaultPhysicalBatch))
+			Expect(PhysicalBatch(GPU{})).To(Equal(DefaultPhysicalBatch))
+		})
+		It("recognizes managed defaults but not explicit values", func() {
+			Expect(IsManagedPhysicalBatch(DefaultPhysicalBatch)).To(BeTrue())
+			Expect(IsManagedPhysicalBatch(BlackwellPhysicalBatch)).To(BeTrue())
+			Expect(IsManagedPhysicalBatch(1024)).To(BeFalse())
+		})
+	})
+
+	Describe("ApplyHardwareDefaults", func() {
+		It("raises an unset batch to 2048 on Blackwell", func() {
+			cfg := &ModelConfig{}
+			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1"})
+			Expect(cfg.Batch).To(Equal(BlackwellPhysicalBatch))
+		})
+		It("leaves batch unset on non-Blackwell", func() {
+			cfg := &ModelConfig{}
+			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "9.0"})
+			Expect(cfg.Batch).To(Equal(0))
+		})
+		It("never overrides an explicit batch", func() {
+			cfg := &ModelConfig{}
+			cfg.Batch = 1024
+			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1"})
+			Expect(cfg.Batch).To(Equal(1024))
+		})
+		It("no-ops on nil", func() {
+			Expect(func() { ApplyHardwareDefaults(nil, GPU{ComputeCapability: "12.1"}) }).ToNot(Panic())
+		})
+	})
+
+	const gib = uint64(1) << 30
+
+	DescribeTable("DefaultParallelSlots (by VRAM)",
+		func(vramGiB uint64, want int) {
+			Expect(DefaultParallelSlots(GPU{VRAM: vramGiB * gib})).To(Equal(want))
+		},
+		Entry("GB10 119 GiB", uint64(119), 8),
+		Entry("48 GiB", uint64(48), 8),
+		Entry("24 GiB", uint64(24), 4),
+		Entry("8 GiB", uint64(8), 4),
+		Entry("6 GiB", uint64(6), 2),
+		Entry("2 GiB", uint64(2), 1),
+		Entry("unknown 0", uint64(0), 1),
+	)
+
+	Describe("ApplyHardwareDefaults parallel slots", func() {
+		It("adds a VRAM-scaled parallel option on a capable GPU", func() {
+			cfg := &ModelConfig{}
+			ApplyHardwareDefaults(cfg, GPU{ComputeCapability: "12.1", VRAM: 119 * gib})
+			Expect(cfg.Options).To(ContainElement("parallel:8"))
+		})
+		It("scales the slot count down with VRAM", func() {
+			cfg := &ModelConfig{}
+			ApplyHardwareDefaults(cfg, GPU{VRAM: 24 * gib})
+			Expect(cfg.Options).To(ContainElement("parallel:4"))
+		})
+		It("adds no parallel option on small/unknown VRAM", func() {
+			cfg := &ModelConfig{}
+			ApplyHardwareDefaults(cfg, GPU{VRAM: 2 * gib})
+			Expect(cfg.Options).ToNot(ContainElement(ContainSubstring("parallel")))
+		})
+		It("never overrides an explicit parallel option", func() {
+			cfg := &ModelConfig{Options: []string{"parallel:2"}}
+			ApplyHardwareDefaults(cfg, GPU{VRAM: 119 * gib})
+			Expect(cfg.Options).To(Equal([]string{"parallel:2"}))
+		})
+	})
+})
--- a/core/config/inference_defaults.json
+++ b/core/config/inference_defaults.json
@@ -40,6 +40,7 @@
    "glm-5": {"min_p":0.01,"repeat_penalty":1,"temperature":1,"top_k":-1,"top_p":0.95},
    "glm-4": {"min_p":0.01,"repeat_penalty":1,"temperature":1,"top_k":-1,"top_p":0.95},
    "nemotron": {"min_p":0.01,"repeat_penalty":1,"temperature":1,"top_k":-1,"top_p":1},
+    "minimax-m2.7": {"min_p":0.01,"repeat_penalty":1,"temperature":1,"top_k":40,"top_p":0.95},
    "minimax-m2.5": {"min_p":0.01,"repeat_penalty":1,"temperature":1,"top_k":40,"top_p":0.95},
    "minimax": {"min_p":0.01,"repeat_penalty":1,"temperature":1,"top_k":40,"top_p":0.95},
    "gpt-oss": {"min_p":0.01,"repeat_penalty":1,"temperature":1,"top_k":0,"top_p":1},
@@ -55,5 +56,5 @@
    "grok": {"min_p":0.01,"repeat_penalty":1,"temperature":1,"top_k":-1,"top_p":0.95},
    "mimo": {"min_p":0.01,"repeat_penalty":1,"temperature":0.7,"top_k":-1,"top_p":0.95}
  },
-  "patterns": ["qwen3.6","qwen3.5","qwen3-coder","qwen3-next","qwen3-vl","qwen3","qwen2.5-coder","qwen2.5-vl","qwen2.5-omni","qwen2.5-math","qwen2.5","qwen2-vl","qwen2","qwq","gemma-4","gemma-3n","gemma-3","medgemma","gemma-2","llama-4","llama-3.3","llama-3.2","llama-3.1","llama-3","phi-4","phi-3","mistral-nemo","mistral-small","mistral-large","magistral","ministral","devstral","pixtral","deepseek-r1","deepseek-v3","deepseek-ocr","glm-5","glm-4","nemotron","minimax-m2.5","minimax","gpt-oss","granite-4","kimi-k2","kimi","lfm2","smollm","olmo","falcon","ernie","seed","grok","mimo"]
+  "patterns": ["qwen3.6","qwen3.5","qwen3-coder","qwen3-next","qwen3-vl","qwen3","qwen2.5-coder","qwen2.5-vl","qwen2.5-omni","qwen2.5-math","qwen2.5","qwen2-vl","qwen2","qwq","gemma-4","gemma-3n","gemma-3","medgemma","gemma-2","llama-4","llama-3.3","llama-3.2","llama-3.1","llama-3","phi-4","phi-3","mistral-nemo","mistral-small","mistral-large","magistral","ministral","devstral","pixtral","deepseek-r1","deepseek-v3","deepseek-ocr","glm-5","glm-4","nemotron","minimax-m2.7","minimax-m2.5","minimax","gpt-oss","granite-4","kimi-k2","kimi","lfm2","smollm","olmo","falcon","ernie","seed","grok","mimo"]
 }
--- a/core/config/meta/constants.go
+++ b/core/config/meta/constants.go
@@ -64,6 +64,7 @@ var UsecaseOptions = []FieldOption{
 	{Value: "image", Label: "Image"},
 	{Value: "vision", Label: "Vision"},
 	{Value: "detection", Label: "Detection"},
+	{Value: "depth", Label: "Depth"},
 	{Value: "face_recognition", Label: "Face Recognition"},
 	{Value: "transcript", Label: "Transcript"},
 	{Value: "diarization", Label: "Diarization"},
--- a/core/config/meta/pattern_meta_test.go
+++ b/core/config/meta/pattern_meta_test.go
@@ -0,0 +1,41 @@
+package meta_test
+
+import (
+	"reflect"
+	"testing"
+
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/config/meta"
+
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+func TestMeta(t *testing.T) {
+	RegisterFailHandler(Fail)
+	RunSpecs(t, "config/meta suite")
+}
+
+var _ = Describe("pattern detector field metadata", func() {
+	byPath := func() map[string]meta.FieldMeta {
+		md := meta.BuildForTest(reflect.TypeOf(config.ModelConfig{}), meta.DefaultRegistry())
+		out := make(map[string]meta.FieldMeta, len(md.Fields))
+		for _, f := range md.Fields {
+			out[f.Path] = f
+		}
+		return out
+	}
+
+	It("renders builtins as a select with the catalogue as options", func() {
+		f, ok := byPath()["pii_detection.builtins"]
+		Expect(ok).To(BeTrue(), "pii_detection.builtins field should exist")
+		Expect(f.Component).To(Equal("pii-builtins-select"))
+		Expect(f.Options).NotTo(BeEmpty())
+	})
+
+	It("renders custom patterns with the pattern-list editor", func() {
+		f, ok := byPath()["pii_detection.patterns"]
+		Expect(ok).To(BeTrue(), "pii_detection.patterns field should exist")
+		Expect(f.Component).To(Equal("pii-pattern-list"))
+	})
+})
--- a/core/config/meta/registry.go
+++ b/core/config/meta/registry.go
@@ -1,5 +1,19 @@
 package meta

+import "github.com/mudler/LocalAI/core/services/routing/piipattern"
+
+// builtinPatternOptions turns the piipattern built-in catalogue into select
+// options for the editor's built-in-patterns checklist, keeping the catalogue
+// the single source of truth.
+func builtinPatternOptions() []FieldOption {
+	cat := piipattern.BuiltinCatalogue()
+	out := make([]FieldOption, 0, len(cat))
+	for _, b := range cat {
+		out = append(out, FieldOption{Value: b.Name, Label: b.Name + " — " + b.Description})
+	}
+	return out
+}
+
 // DefaultRegistry returns enrichment overrides for the ~30 most commonly used
 // config fields. Fields not listed here still appear with auto-generated
 // labels and type-inferred components.
@@ -434,6 +448,13 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			Component:   "json-editor",
 			Order:       78,
 		},
+		"pipeline.max_history_items": {
+			Section:     "pipeline",
+			Label:       "Max History Items",
+			Description: "Cap how many trailing conversation items are fed to the LLM each realtime turn (0 = unlimited, rely on the LLM's context window). Set it on a composed pipeline (VAD+STT+LLM+TTS) so a long-running session doesn't grow until the context fills. Unset uses the per-model-type default.",
+			Component:   "number",
+			Order:       79,
+		},

 		// --- Functions ---
 		"function.grammar.parallel_calls": {
@@ -497,12 +518,60 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			Component:   "toggle",
 			Order:       200,
 		},
-		"pii.patterns": {
+		"pii.detectors": {
+			Section:              "pii",
+			Label:                "PII Detector Models",
+			Description:          "Token-classification (NER) models that scan this model's requests for PII. The detection policy (which entities, what action, min score) lives on each detector model's own PII Detection block. Multiple detectors union their hits.",
+			Component:            "model-multi-select",
+			AutocompleteProvider: "models:token_classify",
+			Order:                201,
+		},
+
+		// --- PII detection policy (on a token_classify detector model) ---
+		"pii_detection.min_score": {
 			Section:     "pii",
-			Label:       "PII Pattern Overrides",
-			Description: "Override the global default action for specific patterns on this model. Patterns not listed here inherit the global action (Settings → Middleware → Filtering).",
+			Label:       "Detector Min Score",
+			Description: "When this model is used as a PII detector, drop detections scored below this confidence before they are acted on. 0 keeps every detection.",
+			Component:   "slider",
+			Min:         f64(0),
+			Max:         f64(1),
+			Step:        f64(0.01),
+			Order:       210,
+		},
+		"pii_detection.default_action": {
+			Section:     "pii",
+			Label:       "Detector Default Action",
+			Description: "Action applied to detected entity groups with no explicit per-entity override. Defaults to mask — the safe-by-default policy for a PII filter.",
+			Component:   "select",
+			Options: []FieldOption{
+				{Value: "mask", Label: "mask (redact the span)"},
+				{Value: "block", Label: "block (reject the request)"},
+				{Value: "allow", Label: "allow (detect & log only)"},
+			},
+			Default: "mask",
+			Order:   211,
+		},
+		"pii_detection.entity_actions": {
+			Section:     "pii",
+			Label:       "Detector Entity Actions",
+			Description: "Per-entity-group action policy for this detector model (e.g. PASSWORD → block, EMAIL → mask). Groups without an entry use the default action.",
+			Component:   "entity-action-list",
+			Order:       212,
+		},
+		"pii_detection.builtins": {
+			Section:     "pii",
+			Label:       "Built-in Secret Patterns",
+			Description: "Built-in regex patterns for common credentials (API keys, tokens, private keys). Turning any on makes this a pattern detector — it matches high-entropy secrets the NER tier can't, in-process with no model load.",
+			Component:   "pii-builtins-select",
+			Options:     builtinPatternOptions(),
+			Order:       213,
+		},
+		"pii_detection.patterns": {
+			Section:     "pii",
+			Label:       "Custom Secret Patterns",
+			Description: "Operator-defined patterns in a restricted regex subset (e.g. \"sk-prefix-\\w+\"). Each must contain a fixed literal anchor of ≥3 chars; open-ended shapes like emails are rejected (leave those to NER). Matches report under the pattern name as the entity group.",
 			Component:   "pii-pattern-list",
-			Order:       201,
+			Order:       214,
 		},

 		// --- Cloud passthrough proxy ---
--- a/core/config/meta/registry_coverage_test.go
+++ b/core/config/meta/registry_coverage_test.go
@@ -112,6 +112,7 @@ var grandfatheredUnregistered = []string{
 	"agent.max_attempts",
 	"agent.max_iterations",
 	"cfg_scale",
+	"chat_template_kwargs",
 	"concurrency_groups",
 	"cutstrings",
 	"debug",
--- a/core/config/model_config.go
+++ b/core/config/model_config.go
@@ -10,6 +10,7 @@ import (
 	"text/template"

 	"github.com/mudler/LocalAI/core/schema"
+	"github.com/mudler/LocalAI/core/services/routing/piipattern"
 	"github.com/mudler/LocalAI/pkg/downloader"
 	"github.com/mudler/LocalAI/pkg/functions"
 	"github.com/mudler/LocalAI/pkg/reasoning"
@@ -23,7 +24,6 @@ const (

 // @Description TTS configuration
 type TTSConfig struct {
-
 	// Voice wav path or id
 	Voice string `yaml:"voice,omitempty" json:"voice,omitempty"`

@@ -70,6 +70,19 @@ type ModelConfig struct {
 	// (Harmony) or LFM2.5 — honor it; "none" also toggles enable_thinking off.
 	ReasoningEffort string `yaml:"reasoning_effort,omitempty" json:"reasoning_effort,omitempty"`

+	// ChatTemplateKwargs are arbitrary key/values forwarded to the backend's jinja
+	// chat template via chat_template_kwargs (e.g. preserve_thinking: true). The
+	// server-derived reasoning levers (enable_thinking / reasoning_effort) and any
+	// per-request metadata overrides layer on top. See gRPCPredictOpts.
+	ChatTemplateKwargs map[string]any `yaml:"chat_template_kwargs,omitempty" json:"chat_template_kwargs,omitempty"`
+
+	// RequestMetadata holds the raw client request `metadata` map for the current
+	// request. The request middleware stamps it; gRPCPredictOpts merges it into the
+	// backend gRPC metadata (overriding the server-derived enable_thinking /
+	// reasoning_effort) and folds it, coerced, into the chat_template_kwargs blob.
+	// Never persisted to YAML.
+	RequestMetadata map[string]string `yaml:"-" json:"-"`
+
 	FeatureFlag FeatureFlag `yaml:"feature_flags,omitempty" json:"feature_flags,omitempty"` // Feature Flag registry. We move fast, and features may break on a per model/backend basis. Registry for (usually temporary) flags that indicate aborting something early.
 	// LLM configs (GPT4ALL, Llama.cpp, ...)
 	LLMConfig `yaml:",inline" json:",inline"`
@@ -103,13 +116,18 @@ type ModelConfig struct {
 	Options   []string `yaml:"options,omitempty" json:"options,omitempty"`
 	Overrides []string `yaml:"overrides,omitempty" json:"overrides,omitempty"`

-	MCP    MCPConfig       `yaml:"mcp,omitempty" json:"mcp,omitempty"`
-	Agent  AgentConfig     `yaml:"agent,omitempty" json:"agent,omitempty"`
-	PII    PIIConfig       `yaml:"pii,omitempty" json:"pii,omitempty"`
-	Router RouterConfig    `yaml:"router,omitempty" json:"router,omitempty"`
-	Proxy  ProxyConfig     `yaml:"proxy,omitempty" json:"proxy,omitempty"`
-	MITM   MITMModelConfig `yaml:"mitm,omitempty" json:"mitm,omitempty"`
-	Limits LimitsConfig    `yaml:"limits,omitempty" json:"limits,omitempty"`
+	MCP   MCPConfig   `yaml:"mcp,omitempty" json:"mcp,omitempty"`
+	Agent AgentConfig `yaml:"agent,omitempty" json:"agent,omitempty"`
+	PII   PIIConfig   `yaml:"pii,omitempty" json:"pii,omitempty"`
+	// PIIDetection is the detection policy when THIS model is used as a
+	// PII detector (a token_classify model named in another model's
+	// pii.detectors). Ignored on models that aren't referenced as
+	// detectors.
+	PIIDetection PIIDetectionConfig `yaml:"pii_detection,omitempty" json:"pii_detection,omitempty"`
+	Router       RouterConfig       `yaml:"router,omitempty" json:"router,omitempty"`
+	Proxy        ProxyConfig        `yaml:"proxy,omitempty" json:"proxy,omitempty"`
+	MITM         MITMModelConfig    `yaml:"mitm,omitempty" json:"mitm,omitempty"`
+	Limits       LimitsConfig       `yaml:"limits,omitempty" json:"limits,omitempty"`
 }

 // @Description Admission-control limits applied per request. The
@@ -384,18 +402,54 @@ type PIIConfig struct {
 	// the YAML key is distinguishable from explicit false.
 	Enabled *bool `yaml:"enabled,omitempty" json:"enabled,omitempty"`

-	// Patterns lets a model upgrade or downgrade individual pattern
-	// actions (mask | block | allow) relative to the global
-	// defaults loaded from --pii-config / DefaultPatterns. Pattern IDs
-	// not listed inherit the global action. The regex itself stays
-	// global — only the action is settable per-model.
-	Patterns []PIIPatternOverride `yaml:"patterns,omitempty" json:"patterns,omitempty"`
+	// Detectors lists the token-classification (NER) models whose
+	// detections drive PII redaction for this model. The detection policy
+	// (min score, per-entity actions, default action) lives on each named
+	// detector model's own pii_detection block, not here — a consuming
+	// model just opts in by listing detectors. Multiple detectors union
+	// their hits; overlapping spans resolve to the strongest action.
+	Detectors []string `yaml:"detectors,omitempty" json:"detectors,omitempty"`
 }

-// @Description Per-model action override for a single PII pattern.
-type PIIPatternOverride struct {
-	ID     string `yaml:"id" json:"id"`
-	Action string `yaml:"action" json:"action"`
+// @Description Detection policy for a token-classification (NER) model
+// used as a PII detector. Lives on the detector model's own config so the
+// model is a self-describing policy unit: consuming models reference it by
+// name (via pii.detectors) and inherit this policy with no per-consumer
+// overrides.
+type PIIDetectionConfig struct {
+	// MinScore drops detections the model scores below this confidence
+	// before they are acted on. 0 keeps every detection.
+	MinScore float32 `yaml:"min_score,omitempty" json:"min_score,omitempty"`
+	// DefaultAction (mask | block | allow) applies to detected entity
+	// groups with no explicit EntityActions entry. Empty defaults to
+	// "mask" — the safe-by-default policy for a PII filter.
+	DefaultAction string `yaml:"default_action,omitempty" json:"default_action,omitempty"`
+	// EntityActions maps an entity group the model emits (e.g. "EMAIL",
+	// "PASSWORD") to an action, overriding DefaultAction for that group.
+	// This is where an operator says which PII to block vs mask vs
+	// allow-log.
+	EntityActions map[string]string `yaml:"entity_actions,omitempty" json:"entity_actions,omitempty"`
+
+	// Builtins names the built-in pattern groups this (pattern) detector
+	// enables, e.g. "anthropic_api_key", "github_token". Pattern detectors
+	// match high-entropy structured secrets the NER tier can't; see
+	// core/services/routing/piipattern.
+	Builtins []string `yaml:"builtins,omitempty" json:"builtins,omitempty"`
+	// Patterns lists operator-defined secret patterns in the restricted-regex
+	// subset (validated at load). Each match is reported under its Name as the
+	// entity group, so EntityActions/DefaultAction apply by Name.
+	Patterns []PIIPattern `yaml:"patterns,omitempty" json:"patterns,omitempty"`
+}
+
+// PIIPattern is one operator-defined pattern on a pattern detector model. Name
+// is the entity group reported for matches (and the EntityActions key). Match
+// is the restricted-regex source. Action optionally overrides DefaultAction for
+// this pattern. MinLen drops matches shorter than N bytes (0 = no floor).
+type PIIPattern struct {
+	Name   string `yaml:"name" json:"name"`
+	Match  string `yaml:"match" json:"match"`
+	Action string `yaml:"action,omitempty" json:"action,omitempty"`
+	MinLen int    `yaml:"min_len,omitempty" json:"min_len,omitempty"`
 }

 // PIIIsEnabled returns the resolved PII state for this model. Single
@@ -408,27 +462,71 @@ func (c *ModelConfig) PIIIsEnabled() bool {
 	return c.Backend == "cloud-proxy"
 }

-// PIIPatternOverrides returns the per-pattern action overrides as a map
-// keyed by pattern ID. The values are the raw action strings — the pii
-// package validates and converts them.
-//
-// Returned via the documented modelPIIConfig interface in
-// core/services/routing/pii/middleware.go without taking a config
-// dependency on this package.
-func (c *ModelConfig) PIIPatternOverrides() map[string]string {
-	if len(c.PII.Patterns) == 0 {
+// PIIDetectors returns the names of the token-classification models that
+// drive PII redaction for this (consuming) model. Read via the
+// ModelPIIConfig interface in core/services/routing/pii/middleware.go.
+func (c *ModelConfig) PIIDetectors() []string {
+	if len(c.PII.Detectors) == 0 {
 		return nil
 	}
-	out := make(map[string]string, len(c.PII.Patterns))
-	for _, p := range c.PII.Patterns {
-		if p.ID == "" {
-			continue
-		}
-		out[p.ID] = p.Action
+	out := make([]string, len(c.PII.Detectors))
+	copy(out, c.PII.Detectors)
+	return out
+}
+
+// piiCoverableUsecases lists the model usecases whose serving API has a
+// request-side PII filter wired (a piiadapter + the pii middleware). It scopes
+// the Middleware admin list (PIIFilterApplies). Grow it as adapters are added
+// for new endpoints. cloud-proxy carries no usecase flag but is always covered
+// (via the MITM / proxy chat path), so PIIFilterApplies handles it separately.
+var piiCoverableUsecases = []ModelConfigUsecase{FLAG_CHAT, FLAG_COMPLETION, FLAG_EDIT, FLAG_EMBEDDINGS}
+
+// PIIFilterApplies reports whether request-side PII filtering can apply to
+// this model at all — i.e. it is reachable through a text-accepting endpoint
+// that has a PII adapter wired. Used to scope the Middleware admin view so it
+// lists only models PII could protect, not every config (VAD, STT, image,
+// or the token_classify detector models themselves,
+// which are the filters rather than consumers). Detector/score models return
+// false naturally: HasUsecases short-circuits to false for any usecase a
+// declared score/token_classify model did not itself declare.
+func (c *ModelConfig) PIIFilterApplies() bool {
+	if c.Backend == "cloud-proxy" {
+		return true
+	}
+	return slices.ContainsFunc(piiCoverableUsecases, c.HasUsecases)
+}
+
+// PIIDetectionMinScore returns the confidence floor this model applies
+// when used as a PII detector.
+func (c *ModelConfig) PIIDetectionMinScore() float32 { return c.PIIDetection.MinScore }
+
+// PIIDetectionDefaultAction returns the raw default-action string applied
+// to detected entity groups without an explicit override. The pii package
+// validates it and applies the "mask" fallback.
+func (c *ModelConfig) PIIDetectionDefaultAction() string { return c.PIIDetection.DefaultAction }
+
+// PIIDetectionEntityActions returns the per-entity-group action policy as
+// a fresh map of raw action strings (validated by the pii package).
+func (c *ModelConfig) PIIDetectionEntityActions() map[string]string {
+	if len(c.PIIDetection.EntityActions) == 0 {
+		return nil
+	}
+	out := make(map[string]string, len(c.PIIDetection.EntityActions))
+	for k, v := range c.PIIDetection.EntityActions {
+		out[k] = v
 	}
 	return out
 }

+// IsPatternDetector reports whether this detector model matches secrets with
+// regex patterns (built-in and/or operator-defined) rather than a neural NER
+// model. Such a model runs entirely in-process (no backend / GGUF / VRAM); the
+// PII resolver builds an in-process pattern matcher for it instead of loading a
+// gRPC token-classifier.
+func (c *ModelConfig) IsPatternDetector() bool {
+	return len(c.PIIDetection.Builtins) > 0 || len(c.PIIDetection.Patterns) > 0
+}
+
 // @Description MCP configuration
 type MCPConfig struct {
 	Servers string `yaml:"remote,omitempty" json:"remote,omitempty"`
@@ -472,8 +570,10 @@ func (c *MCPConfig) MCPConfigFromYAML() (MCPGenericConfig[MCPRemoteServers], MCP
 type MCPGenericConfig[T any] struct {
 	Servers T `yaml:"mcpServers,omitempty" json:"mcpServers,omitempty"`
 }
-type MCPRemoteServers map[string]MCPRemoteServer
-type MCPSTDIOServers map[string]MCPSTDIOServer
+type (
+	MCPRemoteServers map[string]MCPRemoteServer
+	MCPSTDIOServers  map[string]MCPSTDIOServer
+)

 // @Description MCP remote server configuration
 type MCPRemoteServer struct {
@@ -510,6 +610,13 @@ type Pipeline struct {
 	// LLM model config. Unset leaves the LLM model config in charge.
 	DisableThinking *bool `yaml:"disable_thinking,omitempty" json:"disable_thinking,omitempty"`

+	// MaxHistoryItems caps how many trailing conversation items are fed to the
+	// LLM each realtime turn (0 = unlimited, rely on the LLM's context window).
+	// Unset (nil) uses the per-model-type default. Set it on a composed pipeline
+	// (VAD+STT+LLM+TTS) so a long-running session doesn't grow until the LLM's
+	// context fills.
+	MaxHistoryItems *int `yaml:"max_history_items,omitempty" json:"max_history_items,omitempty"`
+
 	// VoiceRecognition gates the pipeline behind speaker verification. Nil
 	// (block absent) means no gate, preserving existing behavior.
 	VoiceRecognition *PipelineVoiceRecognition `yaml:"voice_recognition,omitempty" json:"voice_recognition,omitempty"`
@@ -544,6 +651,44 @@ func (c *ModelConfig) ApplyReasoningEffort(requestEffort string) {
 	}
 }

+// coerceChatTemplateKwarg coerces a request-metadata string value for use as a
+// jinja chat_template_kwarg. "true"/"false" become real booleans, so a jinja
+// `{% if preserve_thinking %}` sees a real false (the string "false" is truthy,
+// like any non-empty string); everything else stays a string. Numeric/typed
+// per-request values are out of scope: set them in the model YAML chat_template_kwargs (YAML keeps the type).
+func coerceChatTemplateKwarg(v string) any {
+	switch v {
+	case "true":
+		return true
+	case "false":
+		return false
+	default:
+		return v
+	}
+}
+
+// ResolveChatTemplateKwargs builds the final chat_template_kwargs map forwarded to
+// the backend, layered: the model config map (base) < the coerced backend metadata
+// (server reasoning levers + client request overrides). `meta` is the already-merged
+// backend metadata string map. The reserved "chat_template_kwargs" key is skipped so
+// a client cannot smuggle a nested blob. Returns nil when there is nothing to forward.
+func (c *ModelConfig) ResolveChatTemplateKwargs(meta map[string]string) map[string]any {
+	out := map[string]any{}
+	for k, v := range c.ChatTemplateKwargs {
+		out[k] = v
+	}
+	for k, v := range meta {
+		if k == "chat_template_kwargs" {
+			continue
+		}
+		out[k] = coerceChatTemplateKwarg(v)
+	}
+	if len(out) == 0 {
+		return nil
+	}
+	return out
+}
+
 // @Description PipelineStreaming toggles incremental delivery per realtime stage.
 type PipelineStreaming struct {
 	LLM           *bool `yaml:"llm,omitempty" json:"llm,omitempty"`
@@ -966,6 +1111,11 @@ func (cfg *ModelConfig) SetDefaults(opts ...ConfigLoaderOption) {
 	// This ensures gallery-installed and runtime-loaded models get optimal parameters.
 	ApplyInferenceDefaults(cfg, cfg.Name, cfg.Model)

+	// Apply hardware-driven defaults (e.g. a larger physical batch on Blackwell).
+	// Uses the local GPU here; in distributed mode the router re-applies the same
+	// heuristics for the selected node's GPU before loading. Explicit config wins.
+	ApplyHardwareDefaults(cfg, localGPU())
+
 	// https://github.com/ggerganov/llama.cpp/blob/75cd4c77292034ecec587ecb401366f57338f7c0/common/sampling.h#L22
 	defaultTopP := 0.95
 	defaultTopK := 40
@@ -1163,6 +1313,8 @@ func (c *ModelConfig) Validate() (bool, error) {
 	// llama_context against concurrent generation/embedding traffic
 	// (see backend/cpp/llama-cpp/grpc-server.cpp on Score). Reject the
 	// combination here so operators are forced to split the model.
+	// (token_classify is unaffected — it runs on the standalone
+	// privacy-filter backend, not llama-cpp.)
 	const scoreConflicts = FLAG_CHAT | FLAG_COMPLETION | FLAG_EMBEDDINGS
 	if (c.Backend == "llama-cpp" || c.Backend == "llama") &&
 		c.HasUsecases(FLAG_SCORE) && c.KnownUsecases != nil &&
@@ -1172,6 +1324,26 @@ func (c *ModelConfig) Validate() (bool, error) {
 				"with chat/completion/embeddings — split into separate model configs")
 	}

+	// Pattern detector: validate built-in names and that each operator-defined
+	// pattern is a well-formed, anchored, bounded restricted-regex. Reject at
+	// load so a bad pattern surfaces as a clear config error rather than a
+	// silent no-op (or a fail-closed block) at request time.
+	if c.IsPatternDetector() {
+		for _, name := range c.PIIDetection.Builtins {
+			if _, ok := piipattern.LookupBuiltin(name); !ok {
+				return false, fmt.Errorf("pii_detection: unknown built-in pattern %q", name)
+			}
+		}
+		for _, p := range c.PIIDetection.Patterns {
+			if p.Name == "" {
+				return false, fmt.Errorf("pii_detection: pattern is missing a name")
+			}
+			if err := piipattern.ValidatePattern(p.Match); err != nil {
+				return false, fmt.Errorf("pii_detection: pattern %q: %w", p.Name, err)
+			}
+		}
+	}
+
 	// router.score_normalization is consumed lazily by the score
 	// classifier at first-request time; without load-time validation
 	// a typo wouldn't surface until the first router request panicked
@@ -1278,12 +1450,24 @@ const (
 	// Marks a model as wired for the Score gRPC primitive (joint
 	// log-prob of candidate continuations under a shared prompt). Must
 	// be declared explicitly via `known_usecases: [score]` — there's
-	// no heuristic for it. On the llama-cpp backend, Score bypasses
-	// the slot loop and races the llama_context, so Validate() refuses
-	// to load a llama-cpp config that combines FLAG_SCORE with
-	// chat/completion/embeddings.
+	// no heuristic for it. On llama-cpp, Score bypasses the slot loop
+	// (direct llama_decode), so combining score with
+	// chat/completion/embeddings in one config is rejected at validation.
 	FLAG_SCORE ModelConfigUsecase = 0b10000000000000000000

+	// Marks a model as wired for the Depth gRPC primitive (per-pixel
+	// metric depth + camera pose + 3D point cloud via Depth Anything 3).
+	FLAG_DEPTH ModelConfigUsecase = 0b100000000000000000000
+
+	// Marks a model as wired for the TokenClassify gRPC primitive (the
+	// openai-privacy-filter PII NER tier — per-token BIOES classification).
+	// Like FLAG_SCORE it must be declared explicitly via
+	// `known_usecases: [token_classify]`; there's no heuristic. Requires
+	// TOKEN_CLS pooling, which is loaded via the embeddings flag. On
+	// llama-cpp the classification windows ride the embedding task queue,
+	// so it may combine freely with other usecases.
+	FLAG_TOKEN_CLASSIFY ModelConfigUsecase = 0b1000000000000000000000
+
 	// Common Subsets
 	FLAG_LLM ModelConfigUsecase = FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT
 )
@@ -1341,6 +1525,8 @@ func GetAllModelConfigUsecases() map[string]ModelConfigUsecase {
 		"FLAG_DIARIZATION":         FLAG_DIARIZATION,
 		"FLAG_REALTIME_AUDIO":      FLAG_REALTIME_AUDIO,
 		"FLAG_SCORE":               FLAG_SCORE,
+		"FLAG_DEPTH":               FLAG_DEPTH,
+		"FLAG_TOKEN_CLASSIFY":      FLAG_TOKEN_CLASSIFY,
 	}
 }

@@ -1368,19 +1554,20 @@ func GetUsecasesFromYAML(input []string) *ModelConfigUsecase {
 // HasUsecases examines a ModelConfig and determines which endpoints have a chance of success.
 //
 // Declared known_usecases are normally additive — the guessing heuristic
-// still adds whatever it can infer from backend/templates. The one
-// exception is FLAG_SCORE: when the operator declared score, they
-// reserved the model for the router classifier. Letting GuessUsecases
-// paint chat/completion on top would surface it in chat pickers it was
-// deliberately kept out of, and (on llama-cpp) reintroduce the slot
-// contention the score/chat conflict check exists to prevent. So a
-// declared score list is authoritative.
+// still adds whatever it can infer from backend/templates. The exceptions
+// are FLAG_SCORE and FLAG_TOKEN_CLASSIFY: when the operator declared
+// either, they reserved the model for an internal direct-decode primitive
+// (the router classifier, or the PII NER tier). Letting GuessUsecases
+// paint chat/completion/embeddings on top would surface it in pickers it
+// was deliberately kept out of, and (on llama-cpp) reintroduce the slot
+// contention the conflict check exists to prevent. So a declared score or
+// token_classify list is authoritative.
 func (c *ModelConfig) HasUsecases(u ModelConfigUsecase) bool {
 	if c.KnownUsecases != nil {
 		if (u & *c.KnownUsecases) == u {
 			return true
 		}
-		if (*c.KnownUsecases & FLAG_SCORE) == FLAG_SCORE {
+		if (*c.KnownUsecases & (FLAG_SCORE | FLAG_TOKEN_CLASSIFY)) != 0 {
 			return false
 		}
 	}
@@ -1484,6 +1671,13 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
 		}
 	}

+	if (u & FLAG_DEPTH) == FLAG_DEPTH {
+		depthBackends := []string{"depth-anything"}
+		if !slices.Contains(depthBackends, c.Backend) {
+			return false
+		}
+	}
+
 	if (u & FLAG_FACE_RECOGNITION) == FLAG_FACE_RECOGNITION {
 		faceBackends := []string{"insightface"}
 		if !slices.Contains(faceBackends, c.Backend) {
@@ -1553,6 +1747,15 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
 		return false
 	}

+	if (u & FLAG_TOKEN_CLASSIFY) == FLAG_TOKEN_CLASSIFY {
+		// No heuristic: token-classification intent is a deliberate
+		// operator choice (it reserves the model from generation traffic
+		// on llama-cpp, and the model's TOKEN_CLS head isn't useful as
+		// general embeddings), so HasUsecases(FLAG_TOKEN_CLASSIFY) is true
+		// only when KnownUsecases declares it explicitly.
+		return false
+	}
+
 	return true
 }

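Review note on the chat_template_kwargs layering in core/config/model_config.go: the standalone sketch below restates the precedence that ResolveChatTemplateKwargs describes (model config as the base, request metadata on top, reserved key skipped, nil when empty). The names `coerceKwarg` and `resolveKwargs` and the sample keys are hypothetical stand-ins, not the project's API; treat it as an illustration under those assumptions.

```go
package main

import "fmt"

// coerceKwarg is a hypothetical stand-in for the coercion step: only the
// exact strings "true" and "false" become booleans; anything else stays a string.
func coerceKwarg(v string) any {
	switch v {
	case "true":
		return true
	case "false":
		return false
	default:
		return v
	}
}

// resolveKwargs layers request metadata over a base map. The reserved
// "chat_template_kwargs" key is skipped, and nil is returned when the
// merged map is empty.
func resolveKwargs(base map[string]any, meta map[string]string) map[string]any {
	out := map[string]any{}
	for k, v := range base {
		out[k] = v
	}
	for k, v := range meta {
		if k == "chat_template_kwargs" {
			continue // a nested blob from the client is not forwarded
		}
		out[k] = coerceKwarg(v)
	}
	if len(out) == 0 {
		return nil
	}
	return out
}

func main() {
	// Sample values are made up for the illustration.
	base := map[string]any{"preserve_thinking": true, "max_steps": 4}
	meta := map[string]string{
		"preserve_thinking":    "false", // overrides the base value as a real bool
		"reasoning_effort":     "high",  // stays a string
		"chat_template_kwargs": `{"x":1}`,
	}
	got := resolveKwargs(base, meta)
	fmt.Println(got["preserve_thinking"], got["max_steps"], got["reasoning_effort"], len(got))
}
```

The typed base value (`max_steps`) survives untouched because only metadata strings pass through the coercion, which matches the comment's advice to keep numeric kwargs in the model YAML.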
--- a/core/config/model_config_test.go
+++ b/core/config/model_config_test.go
@@ -7,6 +7,7 @@ import (

 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
+	"gopkg.in/yaml.v3"
 )

 var _ = Describe("Test cases for config related functions", func() {
@@ -72,9 +73,10 @@ parameters:
 			Expect(valid).To(BeTrue())

 			// llama-cpp configs can't mix the score usecase with
-			// chat/completion/embeddings — Score bypasses the slot
-			// loop and would race the llama_context. The check fires
-			// at load and save time; here we exercise it directly.
+			// chat/completion/embeddings — Score bypasses the slot loop
+			// and would race the llama_context. (token_classify is exempt:
+			// it runs on the privacy-filter backend, not llama-cpp, so the
+			// token_classify combinations below stay valid.)
 			scoreFlag := FLAG_SCORE | FLAG_CHAT
 			conflicting := ModelConfig{
 				Name:          "router-but-also-chat",
@@ -96,15 +98,23 @@ parameters:
 			Expect(valid).To(BeTrue())
 			Expect(err).NotTo(HaveOccurred())

-			// The constraint is llama-cpp-specific; other backends
-			// may safely combine.
-			scoreAndChat := FLAG_SCORE | FLAG_CHAT
-			otherBackend := ModelConfig{
-				Name:          "vllm-router-and-chat",
-				Backend:       "vllm",
-				KnownUsecases: &scoreAndChat,
+			tcAndChat := FLAG_TOKEN_CLASSIFY | FLAG_CHAT
+			tcCombined := ModelConfig{
+				Name:          "ner-and-chat",
+				Backend:       "llama-cpp",
+				KnownUsecases: &tcAndChat,
 			}
-			valid, err = otherBackend.Validate()
+			valid, err = tcCombined.Validate()
+			Expect(valid).To(BeTrue())
+			Expect(err).NotTo(HaveOccurred())
+
+			tcAndEmbeddings := FLAG_TOKEN_CLASSIFY | FLAG_EMBEDDINGS
+			tcWithEmbeddings := ModelConfig{
+				Name:          "pii-ner",
+				Backend:       "llama-cpp",
+				KnownUsecases: &tcAndEmbeddings,
+			}
+			valid, err = tcWithEmbeddings.Validate()
 			Expect(valid).To(BeTrue())
 			Expect(err).NotTo(HaveOccurred())

@@ -228,7 +238,6 @@ parameters:
 		})
 	})
 	It("Properly handles backend usecase matching", func() {
-
 		a := ModelConfig{
 			Name: "a",
 		}
@@ -336,17 +345,17 @@ parameters:
 		// Declared `known_usecases: [score]` is authoritative — the
 		// guessing heuristic must NOT add chat on top, even though the
 		// inherited chatml template would otherwise satisfy the chat
-		// heuristic. Score means "this model is reserved for the
-		// router classifier"; surfacing it as a chat model defeats the
-		// reservation and reintroduces the slot contention the load-time
-		// score/chat conflict check exists to prevent.
+		// heuristic. A score-only declaration means "this model is
+		// reserved for the router classifier"; surfacing it as a chat
+		// model defeats the reservation. (On llama-cpp, declaring both
+		// is rejected by Validate; see the score/chat conflict case above.)
 		scoreReserved := FLAG_SCORE
 		j := ModelConfig{
 			Name:          "arch-router",
 			Backend:       "llama-cpp",
 			KnownUsecases: &scoreReserved,
 			TemplateConfig: TemplateConfig{
-				Chat:    "inherited from chatml",
+				Chat:        "inherited from chatml",
 				ChatMessage: "inherited from chatml",
 				Completion:  "inherited from chatml",
 			},
@@ -355,6 +364,27 @@ parameters:
 		Expect(j.HasUsecases(FLAG_CHAT)).To(BeFalse())
 		Expect(j.HasUsecases(FLAG_COMPLETION)).To(BeFalse())
 		Expect(j.HasUsecases(FLAG_EMBEDDINGS)).To(BeFalse())
+
+		// Declared `known_usecases: [token_classify]` is likewise
+		// authoritative — a PII NER model is reserved for the redactor's
+		// NER tier and must not surface as chat or as a general embeddings
+		// model, even though it loads with embeddings enabled (its
+		// TOKEN_CLS head produces BIOES logits, not reusable embeddings).
+		tcReserved := FLAG_TOKEN_CLASSIFY
+		embTrue := true
+		k := ModelConfig{
+			Name:          "privacy-filter",
+			Backend:       "llama-cpp",
+			KnownUsecases: &tcReserved,
+			Embeddings:    &embTrue,
+			TemplateConfig: TemplateConfig{
+				Chat:        "inherited from chatml",
+				ChatMessage: "inherited from chatml",
+			},
+		}
+		Expect(k.HasUsecases(FLAG_TOKEN_CLASSIFY)).To(BeTrue())
+		Expect(k.HasUsecases(FLAG_CHAT)).To(BeFalse())
+		Expect(k.HasUsecases(FLAG_EMBEDDINGS)).To(BeFalse())
 	})
 	It("Test Validate with invalid MCP config", func() {
 		tmp, err := os.CreateTemp("", "config.yaml")
@@ -598,3 +628,162 @@ concurrency_groups:
 		})
 	})
 })
+
+var _ = Describe("PII config accessors", func() {
+	It("PIIDetectors returns a fresh copy of the consumer's detector list", func() {
+		cfg := &ModelConfig{PII: PIIConfig{Detectors: []string{"a", "b"}}}
+		got := cfg.PIIDetectors()
+		Expect(got).To(Equal([]string{"a", "b"}))
+		got[0] = "mutated"
+		Expect(cfg.PII.Detectors[0]).To(Equal("a"), "accessor must not alias the underlying slice")
+	})
+
+	It("PIIDetectors is nil when none are configured", func() {
+		Expect((&ModelConfig{}).PIIDetectors()).To(BeNil())
+	})
+
+	It("exposes the detector model's pii_detection policy", func() {
+		cfg := &ModelConfig{PIIDetection: PIIDetectionConfig{
+			MinScore:      0.5,
+			DefaultAction: "mask",
+			EntityActions: map[string]string{"PASSWORD": "block", "EMAIL": "mask"},
+		}}
+		Expect(cfg.PIIDetectionMinScore()).To(BeNumerically("~", 0.5, 1e-6))
+		Expect(cfg.PIIDetectionDefaultAction()).To(Equal("mask"))
+		ea := cfg.PIIDetectionEntityActions()
+		Expect(ea).To(HaveKeyWithValue("PASSWORD", "block"))
+		ea["PASSWORD"] = "mutated"
+		Expect(cfg.PIIDetection.EntityActions["PASSWORD"]).To(Equal("block"), "accessor must return a fresh map")
+	})
+
+	It("unmarshals pii.detectors and pii_detection from YAML", func() {
+		var cfg ModelConfig
+		raw := []byte("name: consumer\npii:\n  enabled: true\n  detectors: [pf]\npii_detection:\n  min_score: 0.4\n  default_action: mask\n  entity_actions:\n    PASSWORD: block\n")
+		Expect(yaml.Unmarshal(raw, &cfg)).To(Succeed())
+		Expect(cfg.PIIDetectors()).To(Equal([]string{"pf"}))
+		Expect(cfg.PIIDetectionDefaultAction()).To(Equal("mask"))
+		Expect(cfg.PIIDetectionEntityActions()).To(HaveKeyWithValue("PASSWORD", "block"))
+	})
+})
+
+var _ = Describe("GGUF importer chat-default guard (reservedNonChatModel)", func() {
+	mk := func(flags ModelConfigUsecase) *ModelConfig {
+		return &ModelConfig{Backend: "llama-cpp", KnownUsecases: &flags}
+	}
+
+	It("treats declared score / token_classify models as reserved (no chat defaults)", func() {
+		Expect(reservedNonChatModel(mk(FLAG_SCORE))).To(BeTrue())
+		Expect(reservedNonChatModel(mk(FLAG_TOKEN_CLASSIFY))).To(BeTrue())
+		// embeddings declared alongside token_classify (the PII NER shape) is
+		// still reserved.
+		Expect(reservedNonChatModel(mk(FLAG_TOKEN_CLASSIFY | FLAG_EMBEDDINGS))).To(BeTrue())
+	})
+
+	It("does not reserve ordinary or undeclared models", func() {
+		Expect(reservedNonChatModel(mk(FLAG_CHAT))).To(BeFalse())
+		Expect(reservedNonChatModel(mk(FLAG_EMBEDDINGS))).To(BeFalse())
+		Expect(reservedNonChatModel(&ModelConfig{Backend: "llama-cpp"})).To(BeFalse())
+	})
+
+	It("keeps a token_classify GGUF config valid by withholding FLAG_CHAT", func() {
+		// The privacy-filter import shape: the GGUF importer appends FLAG_CHAT
+		// to a templateless model, which the next sync folds into
+		// KnownUsecases. token_classify+chat is a VALID combination
+		// (token_classify runs on the privacy-filter backend, not llama-cpp,
+		// so the score/chat conflict check does not apply to it), but the
+		// importer must still not paint a declared-reserved model as chat
+		// — that would surface it in every chat picker.
+		reserved := []string{"token_classify"}
+		withChat := append(append([]string{}, reserved...), "FLAG_CHAT")
+
+		// What the importer would produce WITHOUT the guard: valid (the
+		// score/chat conflict check is score-specific), just undesirable
+		// defaults.
+		combined := &ModelConfig{Backend: "llama-cpp", KnownUsecaseStrings: withChat}
+		combined.syncKnownUsecasesFromString()
+		valid, err := combined.Validate()
+		Expect(valid).To(BeTrue())
+		Expect(err).NotTo(HaveOccurred())
+
+		// With the guard (FLAG_CHAT withheld): the declaration survives and the
+		// config validates.
+		good := &ModelConfig{Backend: "llama-cpp", KnownUsecaseStrings: reserved}
+		good.syncKnownUsecasesFromString()
+		Expect(reservedNonChatModel(good)).To(BeTrue())
+		valid, err = good.Validate()
+		Expect(valid).To(BeTrue())
+		Expect(err).NotTo(HaveOccurred())
+		Expect(good.HasUsecases(FLAG_TOKEN_CLASSIFY)).To(BeTrue())
+	})
+})
+
+var _ = Describe("PIIFilterApplies (Middleware admin list scoping)", func() {
+	withUsecases := func(backend string, flags ModelConfigUsecase) *ModelConfig {
+		return &ModelConfig{Name: "m", Backend: backend, KnownUsecases: &flags}
+	}
+
+	It("includes chat-capable models and cloud-proxy models", func() {
+		Expect(withUsecases("llama-cpp", FLAG_CHAT).PIIFilterApplies()).To(BeTrue())
+		// cloud-proxy is always covered (MITM / proxy chat path), regardless
+		// of declared usecases.
+		Expect((&ModelConfig{Name: "claude", Backend: "cloud-proxy"}).PIIFilterApplies()).To(BeTrue())
+	})
+
+	It("excludes the detector and score models themselves", func() {
+		// token_classify detectors are the filters, not consumers; score
+		// classifiers are internal primitives. Both short-circuit
+		// HasUsecases(FLAG_CHAT) to false.
+		Expect(withUsecases("llama-cpp", FLAG_TOKEN_CLASSIFY).PIIFilterApplies()).To(BeFalse())
+		Expect(withUsecases("llama-cpp", FLAG_SCORE).PIIFilterApplies()).To(BeFalse())
+	})
+
+	It("includes embedding and completion models (their request text is filtered)", func() {
+		// Phase 4 wired PII onto /v1/embeddings, /v1/completions and /v1/edits,
+		// so those usecases are now coverable.
+		emb := withUsecases("llama-cpp", FLAG_EMBEDDINGS)
+		t := true
+		emb.Embeddings = &t
+		Expect(emb.PIIFilterApplies()).To(BeTrue())
+		Expect(withUsecases("llama-cpp", FLAG_COMPLETION).PIIFilterApplies()).To(BeTrue())
+	})
+
+	It("excludes models with no text-accepting, PII-covered endpoint", func() {
+		// VAD / audio-in models carry no coverable usecase.
+		Expect((&ModelConfig{Name: "vad", Backend: "silero-vad"}).PIIFilterApplies()).To(BeFalse())
+		Expect(withUsecases("whisper", FLAG_TRANSCRIPT).PIIFilterApplies()).To(BeFalse())
+	})
+})
+
+var _ = Describe("pattern detector config", func() {
+	patternCfg := func() *ModelConfig {
+		c := &ModelConfig{Name: "secret-filter", Backend: "pattern"}
+		c.PIIDetection.Builtins = []string{"anthropic_api_key"}
+		c.PIIDetection.Patterns = []PIIPattern{{Name: "INTERNAL", Match: `tok-[A-Za-z0-9]{20,}`}}
+		return c
+	}
+
+	It("IsPatternDetector keys off builtins/patterns", func() {
+		Expect(patternCfg().IsPatternDetector()).To(BeTrue())
+		Expect((&ModelConfig{Name: "ner", Backend: "llama-cpp"}).IsPatternDetector()).To(BeFalse())
+	})
+
+	It("Validate accepts a well-formed pattern detector (no model file needed)", func() {
+		ok, err := patternCfg().Validate()
+		Expect(err).NotTo(HaveOccurred())
+		Expect(ok).To(BeTrue())
+	})
+
+	It("Validate rejects an unknown built-in", func() {
+		c := &ModelConfig{Name: "x", Backend: "pattern"}
+		c.PIIDetection.Builtins = []string{"does_not_exist"}
+		_, err := c.Validate()
+		Expect(err).To(MatchError(ContainSubstring("unknown built-in")))
+	})
+
+	It("Validate rejects an unanchored custom pattern", func() {
+		c := &ModelConfig{Name: "x", Backend: "pattern"}
+		c.PIIDetection.Patterns = []PIIPattern{{Name: "EMAILish", Match: `[\w.]+@[\w.]+\.\w+`}}
+		_, err := c.Validate()
+		Expect(err).To(MatchError(ContainSubstring("pattern \"EMAILish\"")))
+	})
+})
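Review note on the usecase tests in core/config/model_config_test.go: the rule they exercise (a declared score or token_classify list is authoritative, so nothing else is guessed on top) can be sketched in isolation as below. The flag values, the `usecase` type, and the `hasUsecase` helper are hypothetical and deliberately simplified; the real bit assignments and the guessing heuristic live in model_config.go.

```go
package main

import "fmt"

// usecase is a hypothetical bitmask type; the bit positions below are
// illustrative and do not match the project's FLAG_* constants.
type usecase uint32

const (
	flagChat usecase = 1 << iota
	flagEmbeddings
	flagScore
	flagTokenClassify
)

// hasUsecase sketches the lookup order:
//  1. a declared flag set that covers the request always matches;
//  2. if a reserving flag (score or token_classify) was declared, nothing
//     beyond the declaration is inferred;
//  3. otherwise the guessing heuristic decides.
func hasUsecase(declared *usecase, want usecase, guess func(usecase) bool) bool {
	if declared != nil {
		if want&*declared == want {
			return true
		}
		if *declared&(flagScore|flagTokenClassify) != 0 {
			return false
		}
	}
	return guess(want)
}

func main() {
	// A heuristic that would happily claim every usecase.
	optimistic := func(usecase) bool { return true }

	reserved := flagTokenClassify
	fmt.Println(hasUsecase(&reserved, flagTokenClassify, optimistic)) // declared, so it matches
	fmt.Println(hasUsecase(&reserved, flagChat, optimistic))          // reserved, so guessing is skipped

	plain := flagEmbeddings
	fmt.Println(hasUsecase(&plain, flagChat, optimistic)) // not reserved, heuristic decides
}
```

Using `!= 0` for the reserving check means either flag alone is enough to make the declaration authoritative, which is the behavior the token_classify test case depends on.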
--- a/core/config/runtime_settings.go
+++ b/core/config/runtime_settings.go
@@ -18,8 +18,8 @@ type RuntimeSettings struct {
 	WatchdogInterval    *string `json:"watchdog_interval,omitempty"` // Interval between watchdog checks (e.g., 2s, 30s)

 	// Backend management
-	SingleBackend           *bool `json:"single_backend,omitempty"`      // Deprecated: use MaxActiveBackends = 1 instead
-	MaxActiveBackends       *int  `json:"max_active_backends,omitempty"` // Maximum number of active backends (0 = unlimited, 1 = single backend mode)
+	SingleBackend             *bool `json:"single_backend,omitempty"`              // Deprecated: use MaxActiveBackends = 1 instead
+	MaxActiveBackends         *int  `json:"max_active_backends,omitempty"`         // Maximum number of active backends (0 = unlimited, 1 = single backend mode)
 	AutoUpgradeBackends       *bool `json:"auto_upgrade_backends,omitempty"`       // Automatically upgrade backends when new versions are detected
 	PreferDevelopmentBackends *bool `json:"prefer_development_backends,omitempty"` // Prefer development backend versions by default in UI
 	// Memory Reclaimer settings (works with GPU if available, otherwise RAM)
@@ -97,19 +97,9 @@ type RuntimeSettings struct {
 	// trusted clients.
 	MITMListen *string `json:"mitm_listen,omitempty"`

-	// PII pattern overrides — keyed by pattern id, applied to the live
-	// redactor at startup and persisted by POST /api/pii/patterns/persist.
-	// Distinguishes from --pii-config (which replaces the entire
-	// pattern set) by only carrying the per-id action/enabled deltas
-	// against the global default catalog.
-	PIIPatternOverrides *map[string]PIIPatternRuntimeOverride `json:"pii_pattern_overrides,omitempty"`
-}
-
-// PIIPatternRuntimeOverride captures the persistable deltas an admin
-// has applied to a single global PII pattern. Both fields are pointers
-// so an override that only flips Disabled doesn't have to also restate
-// Action (and vice versa).
-type PIIPatternRuntimeOverride struct {
-	Action   *string `json:"action,omitempty"`
-	Disabled *bool   `json:"disabled,omitempty"`
+	// PIIDefaultDetectors are the token-classification detector models applied
+	// to any PII-enabled model that names no detectors of its own (so
+	// cloud-proxy/MITM redaction works without per-model config). No omitempty:
+	// an empty array must round-trip so the operator can clear it from the UI.
+	PIIDefaultDetectors *[]string `json:"pii_default_detectors"`
 }
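Review note on PIIDefaultDetectors in core/config/runtime_settings.go: the sketch below shows how a pointer-to-slice field without omitempty serializes with the standard encoding/json package, so "never set" and "cleared" stay distinguishable on the wire. The `settings` struct and the helpers are hypothetical; only the field's type and JSON tag are taken from the diff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// settings is a hypothetical stand-in for the runtime settings struct; only
// the field shape (pointer to slice, no omitempty) mirrors the diff.
type settings struct {
	DefaultDetectors *[]string `json:"pii_default_detectors"`
}

// encode marshals s and returns the JSON text, or an error string.
func encode(s settings) string {
	b, err := json.Marshal(s)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

// decode reports whether the field came back non-nil and how many entries
// it carried.
func decode(raw string) (set bool, n int, err error) {
	var s settings
	if err = json.Unmarshal([]byte(raw), &s); err != nil {
		return false, 0, err
	}
	if s.DefaultDetectors == nil {
		return false, 0, nil
	}
	return true, len(*s.DefaultDetectors), nil
}

func main() {
	cleared := []string{}
	fmt.Println(encode(settings{}))                           // never set: the key is present with null
	fmt.Println(encode(settings{DefaultDetectors: &cleared})) // cleared: the key carries an empty array

	set, n, err := decode(`{"pii_default_detectors":[]}`)
	fmt.Println(set, n, err) // the empty array decodes to a non-nil, zero-length slice
}
```

Because the empty array decodes to a non-nil pointer, a consumer can tell an operator's explicit "no default detectors" apart from a settings file that never mentioned the key.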
--- a/core/gallery/importers/depth-anything.go
+++ b/core/gallery/importers/depth-anything.go
@@ -0,0 +1,181 @@
+package importers
+
+import (
+	"encoding/json"
+	"path/filepath"
+	"strings"
+
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/gallery"
+	"github.com/mudler/LocalAI/core/schema"
+	"github.com/mudler/LocalAI/pkg/downloader"
+	hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
+	"go.yaml.in/yaml/v2"
+)
+
+var _ Importer = &DepthAnythingImporter{}
+
+// DepthAnythingImporter recognises depth-anything.cpp GGUF weights, the
+// C++/ggml port of ByteDance Depth Anything 3. The signal is narrow on
+// purpose: depth-anything.cpp names its weights
+// "depth-anything-<size>-<quant>.gguf" (e.g. depth-anything-small-f32.gguf,
+// depth-anything-large-q4_k.gguf), so we only match a .gguf whose name carries
+// a depth-anything token. That keeps us from claiming arbitrary llama-style
+// GGUFs (the importer is registered before llama-cpp), and it deliberately
+// does NOT match the upstream depth-anything/* PyTorch repos (which ship
+// safetensors checkpoints, not runnable GGUFs).
+// preferences.backend="depth-anything" forces the importer regardless.
+type DepthAnythingImporter struct{}
+
+func (i *DepthAnythingImporter) Name() string      { return "depth-anything" }
+func (i *DepthAnythingImporter) Modality() string  { return "image" }
+func (i *DepthAnythingImporter) AutoDetects() bool { return true }
+
+func (i *DepthAnythingImporter) Match(details Details) bool {
+	preferences, err := details.Preferences.MarshalJSON()
+	if err != nil {
+		return false
+	}
+	preferencesMap := make(map[string]any)
+	if len(preferences) > 0 {
+		if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
+			return false
+		}
+	}
+
+	if b, ok := preferencesMap["backend"].(string); ok && b == "depth-anything" {
+		return true
+	}
+
+	// Direct URL or path to a depth-anything GGUF.
+	if isDepthAnythingGGUF(filepath.Base(details.URI)) {
+		return true
+	}
+
+	// HF repo shipping at least one depth-anything GGUF.
+	if details.HuggingFace != nil {
+		for _, f := range details.HuggingFace.Files {
+			if isDepthAnythingGGUF(filepath.Base(f.Path)) {
+				return true
+			}
+		}
+	}
+
+	return false
+}
+
+func (i *DepthAnythingImporter) Import(details Details) (gallery.ModelConfig, error) {
+	preferences, err := details.Preferences.MarshalJSON()
+	if err != nil {
+		return gallery.ModelConfig{}, err
+	}
+	preferencesMap := make(map[string]any)
+	if len(preferences) > 0 {
+		if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
+			return gallery.ModelConfig{}, err
+		}
+	}
+
+	name, ok := preferencesMap["name"].(string)
+	if !ok {
+		name = filepath.Base(details.URI)
+	}
+
+	description, ok := preferencesMap["description"].(string)
+	if !ok {
+		description = "Imported from " + details.URI
+	}
+
+	// depth-anything quants stay above 0.998 correlation even at q4_k, so
+	// default to the smallest, then fall back up the size ladder; the last
+	// file wins if none match (mirrors whisper / llama-cpp). The ladder lists
+	// both f16 and f32 since the published GGUFs ship f32 rather than f16.
+	preferredQuants, _ := preferencesMap["quantizations"].(string)
+	quants := []string{"q4_k", "q5_k", "q6_k", "q8_0", "f16", "f32"}
+	if preferredQuants != "" {
+		quants = strings.Split(preferredQuants, ",")
+	}
+
+	cfg := gallery.ModelConfig{
+		Name:        name,
+		Description: description,
+	}
+
+	modelConfig := config.ModelConfig{
+		Name:        name,
+		Description: description,
+		Backend:     "depth-anything",
+	}
+
+	uri := downloader.URI(details.URI)
+	directGGUF := isDepthAnythingGGUF(filepath.Base(details.URI))
+	switch {
+	case uri.LooksLikeURL() && directGGUF:
+		// Direct file URL (e.g. .../resolve/main/depth-anything-small-f32.gguf).
+		// The exact file is known, no quant pick.
+		fileName, err := uri.FilenameFromUrl()
+		if err != nil {
+			return gallery.ModelConfig{}, err
+		}
+		target := filepath.Join("depth-anything", "models", name, fileName)
+		cfg.Files = append(cfg.Files, gallery.File{
+			URI:      details.URI,
+			Filename: target,
+		})
+		modelConfig.PredictionOptions = schema.PredictionOptions{
+			BasicModelRequest: schema.BasicModelRequest{Model: target},
+		}
+	case details.HuggingFace != nil:
+		// HF repo: collect every depth-anything GGUF, pick the preferred quant,
+		// and nest under depth-anything/models/<name>/ so a multi-quant repo
+		// doesn't collide on disk.
+		var ggufFiles []hfapi.ModelFile
+		for _, f := range details.HuggingFace.Files {
+			if isDepthAnythingGGUF(filepath.Base(f.Path)) {
+				ggufFiles = append(ggufFiles, f)
+			}
+		}
+		if chosen, ok := pickPreferredGGMLFile(ggufFiles, quants); ok {
+			target := filepath.Join("depth-anything", "models", name, filepath.Base(chosen.Path))
+			cfg.Files = append(cfg.Files, gallery.File{
+				URI:      chosen.URL,
+				Filename: target,
+				SHA256:   chosen.SHA256,
+			})
+			modelConfig.PredictionOptions = schema.PredictionOptions{
+				BasicModelRequest: schema.BasicModelRequest{Model: target},
+			}
+		}
+	default:
+		// Bare URI with no HF metadata (pref-only path): point at the basename
+		// so users can tweak the YAML after import.
+		modelConfig.PredictionOptions = schema.PredictionOptions{
+			BasicModelRequest: schema.BasicModelRequest{Model: filepath.Base(details.URI)},
+		}
+	}
+
+	data, err := yaml.Marshal(modelConfig)
+	if err != nil {
+		return gallery.ModelConfig{}, err
+	}
+	cfg.ConfigFile = string(data)
+
+	return cfg, nil
+}
+
+// isDepthAnythingGGUF reports whether name is a depth-anything.cpp GGUF: a
+// .gguf file whose name carries a depth-anything token. The .gguf check is
+// case-insensitive; the tokens cover the published naming
+// (depth-anything-<size>-<quant>.gguf) and its hyphen/underscore variants.
+func isDepthAnythingGGUF(name string) bool {
+	lower := strings.ToLower(name)
+	if !strings.HasSuffix(lower, ".gguf") {
+		return false
+	}
+	for _, tok := range []string{"depth-anything", "depth_anything", "depthanything"} {
+		if strings.Contains(lower, tok) {
+			return true
+		}
+	}
+	return false
+}
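
Review note: the Import path above relies on pickPreferredGGMLFile, whose body is not part of this diff. As a rough illustration of the behaviour the comment describes (walk the quant ladder in order, and let the last file win if nothing matches), here is a minimal sketch. The helper name pickByQuantLadder and its substring-matching rule are assumptions for illustration only, not the repository's implementation.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// pickByQuantLadder is a HYPOTHETICAL stand-in for pickPreferredGGMLFile.
// It walks the ladder in order and returns the first file whose base name
// contains the quant token; if no token matches, the last file wins.
func pickByQuantLadder(files []string, ladder []string) (string, bool) {
	if len(files) == 0 {
		return "", false
	}
	for _, q := range ladder {
		q = strings.ToLower(strings.TrimSpace(q))
		if q == "" {
			continue
		}
		for _, f := range files {
			if strings.Contains(strings.ToLower(filepath.Base(f)), q) {
				return f, true
			}
		}
	}
	// Nothing on the ladder matched: fall back to the last file.
	return files[len(files)-1], true
}

func main() {
	// Same default ladder as DepthAnythingImporter: smallest quant first.
	ladder := []string{"q4_k", "q5_k", "q6_k", "q8_0", "f16", "f32"}
	files := []string{
		"depth-anything-small-f32.gguf",
		"depth-anything-small-q4_k.gguf",
		"depth-anything-small-q8_0.gguf",
	}
	chosen, ok := pickByQuantLadder(files, ladder)
	fmt.Println(chosen, ok) // depth-anything-small-q4_k.gguf true
}
```

Reversing the ladder (f16 first) gives the PrivacyFilterImporter preference order with the same helper, so the two importers only need to differ in the slice they pass.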
--- a/core/gallery/importers/depth-anything_test.go
+++ b/core/gallery/importers/depth-anything_test.go
@@ -0,0 +1,112 @@
+package importers_test
+
+import (
+	"encoding/json"
+	"fmt"
+
+	"github.com/mudler/LocalAI/core/gallery/importers"
+	hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+// depthAnythingDetails builds Details carrying a synthetic HF file list so
+// detection can be exercised without hitting the network.
+func depthAnythingDetails(uri string, prefs string, files ...hfapi.ModelFile) importers.Details {
+	return importers.Details{
+		URI:         uri,
+		Preferences: json.RawMessage(prefs),
+		HuggingFace: &hfapi.ModelDetails{Files: files},
+	}
+}
+
+var _ = Describe("DepthAnythingImporter", func() {
+	imp := &importers.DepthAnythingImporter{}
+
+	Context("Importer interface metadata", func() {
+		It("exposes name/modality/autodetect", func() {
+			Expect(imp.Name()).To(Equal("depth-anything"))
+			Expect(imp.Modality()).To(Equal("image"))
+			Expect(imp.AutoDetects()).To(BeTrue())
+		})
+	})
+
+	Context("detection (Match)", func() {
+		It("matches an HF repo shipping a depth-anything GGUF", func() {
+			d := depthAnythingDetails("huggingface://mudler/depth-anything.cpp-gguf", `{}`,
+				hfapi.ModelFile{Path: "depth-anything-small-f32.gguf"},
+				hfapi.ModelFile{Path: "README.md"},
+			)
+			Expect(imp.Match(d)).To(BeTrue())
+		})
+
+		It("matches a direct URL to a depth-anything GGUF", func() {
+			d := depthAnythingDetails("https://huggingface.co/mudler/depth-anything.cpp-gguf/resolve/main/depth-anything-large-q4_k.gguf", `{}`)
+			Expect(imp.Match(d)).To(BeTrue())
+		})
+
+		It("honours preferences.backend=depth-anything for arbitrary URIs", func() {
+			d := depthAnythingDetails("https://example.com/whatever", `{"backend": "depth-anything"}`)
+			Expect(imp.Match(d)).To(BeTrue())
+		})
+
+		It("does NOT claim a generic llama-style GGUF", func() {
+			d := depthAnythingDetails("huggingface://someorg/some-llm-gguf", `{}`,
+				hfapi.ModelFile{Path: "llama-3-8b-instruct-q4_k_m.gguf"},
+			)
+			Expect(imp.Match(d)).To(BeFalse())
+		})
+
+		It("does NOT claim the upstream PyTorch repo (safetensors, no GGUF)", func() {
+			d := depthAnythingDetails("huggingface://depth-anything/Depth-Anything-V3", `{}`,
+				hfapi.ModelFile{Path: "model.safetensors"},
+			)
+			Expect(imp.Match(d)).To(BeFalse())
+		})
+	})
+
+	Context("import (Import)", func() {
+		It("picks the default quant (q4_k) from a multi-quant HF repo", func() {
+			d := depthAnythingDetails("huggingface://mudler/depth-anything.cpp-gguf", `{"name":"depth-anything-small"}`,
+				hfapi.ModelFile{Path: "depth-anything-small-f32.gguf", URL: "https://hf/f32", SHA256: "aaa"},
+				hfapi.ModelFile{Path: "depth-anything-small-q4_k.gguf", URL: "https://hf/q4k", SHA256: "bbb"},
+				hfapi.ModelFile{Path: "depth-anything-small-q8_0.gguf", URL: "https://hf/q8", SHA256: "ccc"},
+			)
+			cfg, err := imp.Import(d)
+			Expect(err).ToNot(HaveOccurred())
+			Expect(cfg.ConfigFile).To(ContainSubstring("backend: depth-anything"), fmt.Sprintf("%+v", cfg))
+			Expect(cfg.Files).To(HaveLen(1))
+			Expect(cfg.Files[0].URI).To(Equal("https://hf/q4k"), "default quant should be q4_k")
+			Expect(cfg.Files[0].Filename).To(ContainSubstring("depth-anything/models/depth-anything-small/depth-anything-small-q4_k.gguf"))
+		})
+
+		It("honours a preferred quantization override", func() {
+			d := depthAnythingDetails("huggingface://mudler/depth-anything.cpp-gguf", `{"name":"d","quantizations":"q8_0"}`,
+				hfapi.ModelFile{Path: "depth-anything-small-f32.gguf", URL: "https://hf/f32"},
+				hfapi.ModelFile{Path: "depth-anything-small-q8_0.gguf", URL: "https://hf/q8"},
+			)
+			cfg, err := imp.Import(d)
+			Expect(err).ToNot(HaveOccurred())
+			Expect(cfg.Files).To(HaveLen(1))
+			Expect(cfg.Files[0].URI).To(Equal("https://hf/q8"))
+		})
+
+		It("falls back to f32 when no quantized file is present", func() {
+			d := depthAnythingDetails("huggingface://mudler/depth-anything.cpp-gguf", `{"name":"d"}`,
+				hfapi.ModelFile{Path: "depth-anything-base-f32.gguf", URL: "https://hf/f32"},
+			)
+			cfg, err := imp.Import(d)
+			Expect(err).ToNot(HaveOccurred())
+			Expect(cfg.Files).To(HaveLen(1))
+			Expect(cfg.Files[0].URI).To(Equal("https://hf/f32"))
+		})
+
+		It("uses the exact file for a direct GGUF URL", func() {
+			d := depthAnythingDetails("https://huggingface.co/mudler/depth-anything.cpp-gguf/resolve/main/depth-anything-base-q5_k.gguf", `{"name":"da"}`)
+			cfg, err := imp.Import(d)
+			Expect(err).ToNot(HaveOccurred())
+			Expect(cfg.Files).To(HaveLen(1))
+			Expect(cfg.Files[0].Filename).To(ContainSubstring("depth-anything/models/da/depth-anything-base-q5_k.gguf"))
+		})
+	})
+})
--- a/core/gallery/importers/importers.go
+++ b/core/gallery/importers/importers.go
@@ -163,12 +163,23 @@ var defaultImporters = []Importer{
 	// bundles aren't claimed by the generic .gguf importer; kept next to
 	// RFDetrImporter as both are detection models.
 	&LocateAnythingImporter{},
+	// DepthAnythingImporter (ByteDance Depth Anything 3 metric depth + camera
+	// pose, native C++/ggml port) must run before LlamaCPPImporter so its GGUF
+	// bundles aren't claimed by the generic .gguf importer. It auto-matches
+	// only .gguf names carrying a depth-anything token (e.g.
+	// depth-anything-<size>-<quant>.gguf), so it cannot claim arbitrary GGUFs.
+	&DepthAnythingImporter{},
 	// Existing
 	// DS4Importer must precede LlamaCPPImporter - ds4 weights are GGUFs and
 	// would otherwise be claimed by the generic .gguf-handling llama-cpp
 	// importer. Matches only the antirez/deepseek-v4-gguf repo + filename
 	// pattern, so false-positives against arbitrary GGUFs are impossible.
 	&DS4Importer{},
+	// PrivacyFilterImporter must precede LlamaCPPImporter too — the OpenMed
+	// privacy-filter GGUFs would otherwise be claimed by the generic .gguf
+	// importer. Matches only .gguf names carrying the "privacy-filter" token,
+	// so arbitrary GGUFs are never claimed.
+	&PrivacyFilterImporter{},
 	&LlamaCPPImporter{},
 	&MLXImporter{},
 	&VLLMImporter{},
--- a/core/gallery/importers/privacy-filter.go
+++ b/core/gallery/importers/privacy-filter.go
@@ -0,0 +1,202 @@
+package importers
+
+import (
+	"encoding/json"
+	"path/filepath"
+	"strings"
+
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/gallery"
+	"github.com/mudler/LocalAI/core/schema"
+	"github.com/mudler/LocalAI/pkg/downloader"
+	hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
+	"go.yaml.in/yaml/v2"
+)
+
+var _ Importer = &PrivacyFilterImporter{}
+
+// PrivacyFilterImporter recognises the OpenMed privacy-filter PII/NER model
+// family, served by the standalone privacy-filter.cpp ggml engine (the
+// openai-privacy-filter architecture). Detection is deliberately narrow: the
+// engine can only run a privacy-filter GGUF, so we match a .gguf whose name
+// carries the "privacy-filter" token (e.g. privacy-filter-multilingual-f16.gguf)
+// or an HF repo that ships one. That keeps us from claiming arbitrary
+// llama-style GGUFs (the importer is registered before llama-cpp) and from
+// claiming the upstream OpenMed/privacy-filter-* safetensors repos, which carry
+// no runnable GGUF. preferences.backend="privacy-filter" forces it regardless.
+type PrivacyFilterImporter struct{}
+
+func (i *PrivacyFilterImporter) Name() string { return "privacy-filter" }
+
+// Modality is "text": the filter operates in the text domain and there is no
+// dedicated token-classification chip in the import UI, so it groups with the
+// other text-domain backends (matching how ds4 — another single-family text
+// GGUF — is classified).
+func (i *PrivacyFilterImporter) Modality() string  { return "text" }
+func (i *PrivacyFilterImporter) AutoDetects() bool { return true }
+
+func (i *PrivacyFilterImporter) Match(details Details) bool {
+	preferences, err := details.Preferences.MarshalJSON()
+	if err != nil {
+		return false
+	}
+	preferencesMap := make(map[string]any)
+	if len(preferences) > 0 {
+		if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
+			return false
+		}
+	}
+
+	if b, ok := preferencesMap["backend"].(string); ok && b == "privacy-filter" {
+		return true
+	}
+
+	// Direct URL or path to a privacy-filter GGUF.
+	if isPrivacyFilterGGUF(filepath.Base(details.URI)) {
+		return true
+	}
+
+	// HF repo shipping at least one privacy-filter GGUF.
+	if details.HuggingFace != nil {
+		for _, f := range details.HuggingFace.Files {
+			if isPrivacyFilterGGUF(filepath.Base(f.Path)) {
+				return true
+			}
+		}
+	}
+
+	// Fallback: an hfapi recursion bug may leave HuggingFace nil, so also match
+	// a repo whose name marks it as the privacy-filter GGUF distribution (both
+	// tokens present), e.g. LocalAI-io/privacy-filter-multilingual-GGUF.
+	// Requiring "gguf" keeps the safetensors-only source repo out.
+	if _, repo, ok := HFOwnerRepoFromURI(details.URI); ok {
+		lower := strings.ToLower(repo)
+		if privacyFilterName(lower) && strings.Contains(lower, "gguf") {
+			return true
+		}
+	}
+
+	return false
+}
+
+func (i *PrivacyFilterImporter) Import(details Details) (gallery.ModelConfig, error) {
+	preferences, err := details.Preferences.MarshalJSON()
+	if err != nil {
+		return gallery.ModelConfig{}, err
+	}
+	preferencesMap := make(map[string]any)
+	if len(preferences) > 0 {
+		if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
+			return gallery.ModelConfig{}, err
+		}
+	}
+
+	name, ok := preferencesMap["name"].(string)
+	if !ok {
+		name = filepath.Base(details.URI)
+	}
+
+	description, ok := preferencesMap["description"].(string)
+	if !ok {
+		description = "Imported from " + details.URI
+	}
+
+	// The token classifier's accuracy is parity-sensitive, so prefer the
+	// highest-precision weights first (f16 is what the gallery ships today),
+	// then fall back down the quant ladder; the last file wins if none match.
+	preferredQuants, _ := preferencesMap["quantizations"].(string)
+	quants := []string{"f16", "q8_0", "q6_k", "q5_k", "q4_k"}
+	if preferredQuants != "" {
+		quants = strings.Split(preferredQuants, ",")
+	}
+
+	cfg := gallery.ModelConfig{
+		Name:        name,
+		Description: description,
+	}
+
+	trueV := true
+	modelConfig := config.ModelConfig{
+		Name:        name,
+		Description: description,
+		Backend:     "privacy-filter",
+		// embeddings:true mirrors the gallery entry — the privacy-filter
+		// backend loads in embedding mode to expose per-token logits.
+		Embeddings: &trueV,
+		// token_classify reserves the model for the PII NER tier; another
+		// model opts into redaction by listing this one under pii.detectors.
+		KnownUsecaseStrings: []string{"token_classify"},
+	}
+
+	uri := downloader.URI(details.URI)
+	directGGUF := isPrivacyFilterGGUF(filepath.Base(details.URI))
+	switch {
+	case uri.LooksLikeURL() && directGGUF:
+		// Direct file URL (e.g. .../resolve/main/privacy-filter-multilingual-f16.gguf).
+		// The exact file is known, no quant pick.
+		fileName, err := uri.FilenameFromUrl()
+		if err != nil {
+			return gallery.ModelConfig{}, err
+		}
+		target := filepath.Join("privacy-filter", "models", name, fileName)
+		cfg.Files = append(cfg.Files, gallery.File{
+			URI:      details.URI,
+			Filename: target,
+		})
+		modelConfig.PredictionOptions = schema.PredictionOptions{
+			BasicModelRequest: schema.BasicModelRequest{Model: target},
+		}
+	case details.HuggingFace != nil:
+		// HF repo: collect every privacy-filter GGUF, pick the preferred quant,
+		// and nest under privacy-filter/models/<name>/ so a multi-quant repo
+		// doesn't collide on disk.
+		var ggufFiles []hfapi.ModelFile
+		for _, f := range details.HuggingFace.Files {
+			if isPrivacyFilterGGUF(filepath.Base(f.Path)) {
+				ggufFiles = append(ggufFiles, f)
+			}
+		}
+		if chosen, ok := pickPreferredGGMLFile(ggufFiles, quants); ok {
+			target := filepath.Join("privacy-filter", "models", name, filepath.Base(chosen.Path))
+			cfg.Files = append(cfg.Files, gallery.File{
+				URI:      chosen.URL,
+				Filename: target,
+				SHA256:   chosen.SHA256,
+			})
+			modelConfig.PredictionOptions = schema.PredictionOptions{
+				BasicModelRequest: schema.BasicModelRequest{Model: target},
+			}
+		}
+	default:
+		// Bare URI with no HF metadata (pref-only path): point at the basename
+		// so users can tweak the YAML after import.
+		modelConfig.PredictionOptions = schema.PredictionOptions{
+			BasicModelRequest: schema.BasicModelRequest{Model: filepath.Base(details.URI)},
+		}
+	}
+
+	data, err := yaml.Marshal(modelConfig)
+	if err != nil {
+		return gallery.ModelConfig{}, err
+	}
+	cfg.ConfigFile = string(data)
+
+	return cfg, nil
+}
+
+// privacyFilterName reports whether a lower-cased string carries the
+// privacy-filter token in either separator form.
+func privacyFilterName(lower string) bool {
+	return strings.Contains(lower, "privacy-filter") || strings.Contains(lower, "privacy_filter")
+}
+
+// isPrivacyFilterGGUF reports whether name is a privacy-filter GGUF: a .gguf
+// file whose name carries the privacy-filter token. The .gguf check is
+// case-insensitive.
+func isPrivacyFilterGGUF(name string) bool {
+	lower := strings.ToLower(name)
+	if !strings.HasSuffix(lower, ".gguf") {
+		return false
+	}
+	return privacyFilterName(lower)
+}
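
Review note: the repo-name fallback in Match depends on HFOwnerRepoFromURI, which is defined elsewhere and not shown in this diff. The sketch below reproduces only the decision rule (the repo name must carry both the privacy-filter token and "gguf"). The ownerRepo parser is a hypothetical simplification that handles just the huggingface://owner/repo form; the real helper may accept more URI shapes.

```go
package main

import (
	"fmt"
	"strings"
)

// ownerRepo is a HYPOTHETICAL, simplified stand-in for HFOwnerRepoFromURI.
// It only understands "huggingface://owner/repo[/...]".
func ownerRepo(uri string) (owner, repo string, ok bool) {
	const prefix = "huggingface://"
	if !strings.HasPrefix(uri, prefix) {
		return "", "", false
	}
	parts := strings.Split(strings.TrimPrefix(uri, prefix), "/")
	if len(parts) < 2 || parts[0] == "" || parts[1] == "" {
		return "", "", false
	}
	return parts[0], parts[1], true
}

// looksLikePrivacyFilterGGUFRepo mirrors the fallback rule from Match:
// both the privacy-filter token and "gguf" must appear in the repo name.
func looksLikePrivacyFilterGGUFRepo(uri string) bool {
	_, repo, ok := ownerRepo(uri)
	if !ok {
		return false
	}
	lower := strings.ToLower(repo)
	hasToken := strings.Contains(lower, "privacy-filter") ||
		strings.Contains(lower, "privacy_filter")
	return hasToken && strings.Contains(lower, "gguf")
}

func main() {
	// GGUF distribution repo: matched by name alone.
	fmt.Println(looksLikePrivacyFilterGGUFRepo("huggingface://LocalAI-io/privacy-filter-multilingual-GGUF"))
	// Safetensors-only source repo: no "gguf" in the name, so not matched.
	fmt.Println(looksLikePrivacyFilterGGUFRepo("huggingface://OpenMed/privacy-filter-multilingual"))
}
```

Requiring both tokens is what keeps the name-based fallback from claiming the upstream source repo when no file list is available to inspect.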
--- a/core/gallery/importers/privacy-filter_test.go
+++ b/core/gallery/importers/privacy-filter_test.go
@@ -0,0 +1,104 @@
+package importers_test
+
+import (
+	"encoding/json"
+	"fmt"
+
+	"github.com/mudler/LocalAI/core/gallery/importers"
+	hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+// privacyFilterDetails builds Details carrying a synthetic HF file list so
+// detection can be exercised without hitting the network.
+func privacyFilterDetails(uri string, prefs string, files ...hfapi.ModelFile) importers.Details {
+	return importers.Details{
+		URI:         uri,
+		Preferences: json.RawMessage(prefs),
+		HuggingFace: &hfapi.ModelDetails{Files: files},
+	}
+}
+
+var _ = Describe("PrivacyFilterImporter", func() {
+	imp := &importers.PrivacyFilterImporter{}
+
+	Context("Importer interface metadata", func() {
+		It("exposes name/modality/autodetect", func() {
+			Expect(imp.Name()).To(Equal("privacy-filter"))
+			Expect(imp.Modality()).To(Equal("text"))
+			Expect(imp.AutoDetects()).To(BeTrue())
+		})
+	})
+
+	Context("detection (Match)", func() {
+		It("matches an HF repo shipping a privacy-filter GGUF", func() {
+			d := privacyFilterDetails("huggingface://LocalAI-io/privacy-filter-multilingual-GGUF", "",
+				hfapi.ModelFile{Path: "privacy-filter-multilingual-f16.gguf", URL: "https://hf/f16"})
+			Expect(imp.Match(d)).To(BeTrue())
+		})
+
+		It("matches a direct URL to a privacy-filter GGUF", func() {
+			d := privacyFilterDetails("https://hf/resolve/main/privacy-filter-multilingual-f16.gguf", "")
+			Expect(imp.Match(d)).To(BeTrue())
+		})
+
+		It("matches the GGUF distribution repo by name when HF metadata is absent", func() {
+			d := importers.Details{URI: "huggingface://LocalAI-io/privacy-filter-multilingual-GGUF", Preferences: json.RawMessage("")}
+			Expect(imp.Match(d)).To(BeTrue())
+		})
+
+		It("honours preferences.backend=privacy-filter for arbitrary URIs", func() {
+			d := privacyFilterDetails("huggingface://some/unrelated-repo", `{"backend":"privacy-filter"}`)
+			Expect(imp.Match(d)).To(BeTrue())
+		})
+
+		It("does NOT claim a generic llama-style GGUF", func() {
+			d := privacyFilterDetails("huggingface://TheBloke/Llama-2-7B-GGUF", "",
+				hfapi.ModelFile{Path: "llama-2-7b.Q4_K_M.gguf", URL: "https://hf/llama"})
+			Expect(imp.Match(d)).To(BeFalse())
+		})
+
+		It("does NOT claim the upstream safetensors source repo (no GGUF)", func() {
+			d := privacyFilterDetails("huggingface://OpenMed/privacy-filter-multilingual", "",
+				hfapi.ModelFile{Path: "model.safetensors", URL: "https://hf/st"},
+				hfapi.ModelFile{Path: "config.json", URL: "https://hf/cfg"})
+			Expect(imp.Match(d)).To(BeFalse())
+		})
+	})
+
+	Context("import (Import)", func() {
+		It("emits a privacy-filter token_classify config from an HF GGUF repo", func() {
+			d := privacyFilterDetails("huggingface://LocalAI-io/privacy-filter-multilingual-GGUF", `{"name":"pii"}`,
+				hfapi.ModelFile{Path: "privacy-filter-multilingual-f16.gguf", URL: "https://hf/f16", SHA256: "abc"})
+			cfg, err := imp.Import(d)
+			Expect(err).ToNot(HaveOccurred())
+			Expect(cfg.ConfigFile).To(ContainSubstring("backend: privacy-filter"), fmt.Sprintf("%+v", cfg))
+			Expect(cfg.ConfigFile).To(ContainSubstring("token_classify"))
+			Expect(cfg.ConfigFile).To(ContainSubstring("embeddings: true"))
+			Expect(cfg.Files).To(HaveLen(1))
+			Expect(cfg.Files[0].URI).To(Equal("https://hf/f16"))
+			Expect(cfg.Files[0].SHA256).To(Equal("abc"))
+			Expect(cfg.Files[0].Filename).To(ContainSubstring("privacy-filter/models/pii/privacy-filter-multilingual-f16.gguf"))
+		})
+
+		It("prefers the highest-precision quant (f16) from a multi-quant repo", func() {
+			d := privacyFilterDetails("huggingface://LocalAI-io/privacy-filter-multilingual-GGUF", "",
+				hfapi.ModelFile{Path: "privacy-filter-multilingual-q4_k.gguf", URL: "https://hf/q4k"},
+				hfapi.ModelFile{Path: "privacy-filter-multilingual-f16.gguf", URL: "https://hf/f16"})
+			cfg, err := imp.Import(d)
+			Expect(err).ToNot(HaveOccurred())
+			Expect(cfg.Files).To(HaveLen(1))
+			Expect(cfg.Files[0].URI).To(Equal("https://hf/f16"), "f16 should win over q4_k")
+		})
+
+		It("uses the exact file for a direct GGUF URL", func() {
+			d := privacyFilterDetails("https://hf/resolve/main/privacy-filter-multilingual-f16.gguf", "")
+			cfg, err := imp.Import(d)
+			Expect(err).ToNot(HaveOccurred())
+			Expect(cfg.Files).To(HaveLen(1))
+			Expect(cfg.Files[0].Filename).To(ContainSubstring("privacy-filter/models/"))
+			Expect(cfg.Files[0].Filename).To(ContainSubstring("privacy-filter-multilingual-f16.gguf"))
+		})
+	})
+})
--- a/core/http/app_test.go
+++ b/core/http/app_test.go
@@ -735,6 +735,18 @@ parameters:
 `
 			Expect(os.WriteFile(filepath.Join(modelDir, "mock-model.yaml"), []byte(mockModelYAML), 0644)).To(Succeed())

+			// A second model carrying chat_template_kwargs so the REST->gRPC
+			// metadata-forwarding spec below can assert the model-YAML kwarg is
+			// merged with the per-request override.
+			mockCTKModelYAML := `name: mock-ctk-model
+backend: mock-backend
+parameters:
+  model: mock-model.bin
+chat_template_kwargs:
+  preserve_thinking: true
+`
+			Expect(os.WriteFile(filepath.Join(modelDir, "mock-ctk-model.yaml"), []byte(mockCTKModelYAML), 0644)).To(Succeed())
+
 			systemState, err := system.GetSystemState(
 				system.WithBackendPath(backendDir),
 				system.WithModelPath(modelDir),
@@ -809,6 +821,59 @@ parameters:
 			Expect(string(dat)).To(ContainSubstring("mock-backend"))
 		})

+		It("forwards chat_template_kwargs and reasoning levers to gRPC PredictOptions.Metadata", func() {
+			// True HTTP->gRPC contract guard: drive a real /v1/chat/completions
+			// request and assert the exact metadata the REST layer forwarded to
+			// the backend. The mock-backend echoes PredictOptions.Metadata as JSON
+			// when it sees the ECHO_PREDICT_METADATA marker in the prompt, so this
+			// pins the request->gRPC mapping (model-YAML chat_template_kwargs +
+			// per-request metadata override + type coercion + standalone keys)
+			// without adding a new RPC. The marker rides in the user content and
+			// must survive into the backend prompt; if a future default chat
+			// template drops raw user content, move the marker to /v1/completions.
+			reqBody := map[string]any{
+				"model": "mock-ctk-model",
+				"messages": []map[string]any{
+					{"role": "user", "content": "ECHO_PREDICT_METADATA"},
+				},
+				// Per-request override: replaces the standalone enable_thinking key
+				// and exercises coercion in the blob ("false" -> bool; "low" stays a string).
+				"metadata": map[string]string{
+					"enable_thinking":  "false",
+					"reasoning_effort": "low",
+				},
+			}
+
+			var chatResp struct {
+				Choices []struct {
+					Message struct {
+						Content string `json:"content"`
+					} `json:"message"`
+				} `json:"choices"`
+			}
+			err := postRequestResponseJSON("http://127.0.0.1:9090/v1/chat/completions", &reqBody, &chatResp)
+			Expect(err).ToNot(HaveOccurred())
+			Expect(chatResp.Choices).ToNot(BeEmpty())
+
+			// The assistant content is the JSON snapshot of PredictOptions.Metadata.
+			var meta map[string]string
+			Expect(json.Unmarshal([]byte(chatResp.Choices[0].Message.Content), &meta)).To(Succeed(), "echoed metadata: %s", chatResp.Choices[0].Message.Content)
+
+			// Standalone keys reflect the per-request override (consumed by Python
+			// backends; consistent across backends).
+			Expect(meta).To(HaveKeyWithValue("enable_thinking", "false"))
+			Expect(meta).To(HaveKeyWithValue("reasoning_effort", "low"))
+
+			// The chat_template_kwargs blob (consumed by llama.cpp) merges the
+			// model-YAML kwarg with the coerced request metadata override.
+			Expect(meta).To(HaveKey("chat_template_kwargs"))
+			var ctk map[string]any
+			Expect(json.Unmarshal([]byte(meta["chat_template_kwargs"]), &ctk)).To(Succeed(), "chat_template_kwargs blob: %s", meta["chat_template_kwargs"])
+			Expect(ctk).To(HaveKeyWithValue("preserve_thinking", true)) // bool from model YAML
+			Expect(ctk).To(HaveKeyWithValue("enable_thinking", false))  // coerced "false" -> bool
+			Expect(ctk).To(HaveKeyWithValue("reasoning_effort", "low")) // non-bool stays string
+		})
+
 		// Agent Jobs: HTTP API for task/job scheduling. The underlying AgentPool
 		// service is exercised in core/services/agentpool/agent_jobs_test.go;
 		// these specs cover the /api/agent/* HTTP plumbing on top.
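
Review note: the metadata-forwarding spec in this hunk asserts that the chat_template_kwargs blob merges the model-YAML kwargs with the per-request metadata, coercing "true"/"false" strings to booleans and leaving other values as strings. The merge code itself is outside this diff, so the following is only a sketch of that contract; the function name mergeTemplateKwargs and the exact coercion rule (literal "true"/"false" only) are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mergeTemplateKwargs is a HYPOTHETICAL illustration of the contract the spec
// pins: start from the model-YAML kwargs, then overlay request metadata,
// turning the literal strings "true"/"false" into bools.
func mergeTemplateKwargs(modelKwargs map[string]any, reqMeta map[string]string) map[string]any {
	out := make(map[string]any, len(modelKwargs)+len(reqMeta))
	for k, v := range modelKwargs {
		out[k] = v
	}
	for k, v := range reqMeta {
		switch v {
		case "true":
			out[k] = true
		case "false":
			out[k] = false
		default:
			out[k] = v // non-bool values stay strings
		}
	}
	return out
}

func main() {
	merged := mergeTemplateKwargs(
		map[string]any{"preserve_thinking": true},
		map[string]string{"enable_thinking": "false", "reasoning_effort": "low"},
	)
	blob, err := json.Marshal(merged) // map keys are emitted in sorted order
	if err != nil {
		panic(err)
	}
	fmt.Println(string(blob))
	// {"enable_thinking":false,"preserve_thinking":true,"reasoning_effort":"low"}
}
```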
--- a/core/http/auth/features.go
+++ b/core/http/auth/features.go
@@ -123,6 +123,10 @@ var RouteFeatureRegistry = []RouteFeature{
 	{"GET", "/api/fine-tuning/jobs/:id/download", FeatureFineTuning},
 	{"POST", "/api/fine-tuning/datasets", FeatureFineTuning},

+	// PII analyze/redact service (the events log stays admin-gated in-handler)
+	{"POST", "/api/pii/analyze", FeaturePIIFilter},
+	{"POST", "/api/pii/redact", FeaturePIIFilter},
+
 	// Quantization
 	{"POST", "/api/quantization/jobs", FeatureQuantization},
 	{"GET", "/api/quantization/jobs", FeatureQuantization},
@@ -181,5 +185,6 @@ func APIFeatureMetas() []FeatureMeta {
 		{FeatureFaceRecognition, "Face Recognition", true},
 		{FeatureVoiceRecognition, "Voice Recognition", true},
 		{FeatureAudioTransform, "Audio Transform", true},
+		{FeaturePIIFilter, "PII Analyze / Redact", true},
 	}
 }
--- a/core/http/auth/permissions.go
+++ b/core/http/auth/permissions.go
@@ -56,6 +56,10 @@ const (
 	FeatureFaceRecognition    = "face_recognition"
 	FeatureVoiceRecognition   = "voice_recognition"
 	FeatureAudioTransform     = "audio_transform"
+	// FeaturePIIFilter gates the synchronous PII analyze/redact service
+	// (POST /api/pii/{analyze,redact}). Default ON like the other API
+	// features; the admin-only events log is gated separately in-handler.
+	FeaturePIIFilter = "pii_filter"
 )

 // AgentFeatures lists agent-related features (default OFF).
@@ -71,6 +75,7 @@ var APIFeatures = []string{
 	FeatureVAD, FeatureDetection, FeatureVideo, FeatureEmbeddings, FeatureSound,
 	FeatureRealtime, FeatureRerank, FeatureTokenize, FeatureMCP, FeatureStores,
 	FeatureFaceRecognition, FeatureVoiceRecognition, FeatureAudioTransform,
+	FeaturePIIFilter,
 }

 // AllFeatures lists all known features (used by UI and validation).
--- a/core/http/endpoints/anthropic/messages.go
+++ b/core/http/endpoints/anthropic/messages.go
@@ -10,13 +10,11 @@ import (
 	"github.com/labstack/echo/v4"
 	"github.com/mudler/LocalAI/core/backend"
 	"github.com/mudler/LocalAI/core/config"
-	"github.com/mudler/LocalAI/core/http/auth"
 	mcpTools "github.com/mudler/LocalAI/core/http/endpoints/mcp"
 	openaiEndpoint "github.com/mudler/LocalAI/core/http/endpoints/openai"
 	"github.com/mudler/LocalAI/core/http/middleware"
 	"github.com/mudler/LocalAI/core/schema"
 	"github.com/mudler/LocalAI/core/services/cloudproxy"
-	"github.com/mudler/LocalAI/core/services/routing/pii"
 	"github.com/mudler/LocalAI/core/templates"
 	"github.com/mudler/LocalAI/pkg/functions"
 	"github.com/mudler/LocalAI/pkg/model"
@@ -30,7 +28,7 @@ import (
 // @Param request body schema.AnthropicRequest true "query params"
 // @Success 200 {object} schema.AnthropicResponse "Response"
 // @Router /v1/messages [post]
-func MessagesEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator *templates.Evaluator, appConfig *config.ApplicationConfig, natsClient mcpTools.MCPNATSClient, piiRedactor *pii.Redactor, piiEvents pii.EventStore) echo.HandlerFunc {
+func MessagesEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evaluator *templates.Evaluator, appConfig *config.ApplicationConfig, natsClient mcpTools.MCPNATSClient) echo.HandlerFunc {
 	return func(c echo.Context) error {
 		id := uuid.New().String()

@@ -53,7 +51,7 @@ func MessagesEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evalu
 		// Cloud-proxy bail. Same shape as the OpenAI chat endpoint —
 		// forwards via the cloud-proxy gRPC backend.
 		if cfg.IsCloudProxyBackendPassthrough() {
-			return forwardCloudProxyAnthropicViaBackend(c, cfg, input, piiRedactor, piiEvents, ml, appConfig)
+			return forwardCloudProxyAnthropicViaBackend(c, cfg, input, ml, appConfig)
 		}

 		// Convert Anthropic messages to OpenAI format for internal processing
@@ -141,7 +139,7 @@ func MessagesEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, evalu
 		xlog.Debug("Anthropic Messages - Prompt (after templating)", "prompt", predInput)

 		if input.Stream {
-			return handleAnthropicStream(c, id, input, cfg, ml, cl, appConfig, predInput, openAIReq, funcs, shouldUseFn, mcpExecutor, evaluator, piiRedactor, piiEvents)
+			return handleAnthropicStream(c, id, input, cfg, ml, cl, appConfig, predInput, openAIReq, funcs, shouldUseFn, mcpExecutor, evaluator)
 		}

 		return handleAnthropicNonStream(c, id, input, cfg, ml, cl, appConfig, predInput, openAIReq, funcs, shouldUseFn, mcpExecutor, evaluator)
@@ -330,36 +328,13 @@ func handleAnthropicNonStream(c echo.Context, id string, input *schema.Anthropic
 	return sendAnthropicError(c, 500, "api_error", "MCP iteration limit reached")
 }

-func handleAnthropicStream(c echo.Context, id string, input *schema.AnthropicRequest, cfg *config.ModelConfig, ml *model.ModelLoader, cl *config.ModelConfigLoader, appConfig *config.ApplicationConfig, predInput string, openAIReq *schema.OpenAIRequest, funcs functions.Functions, shouldUseFn bool, mcpExecutor mcpTools.ToolExecutor, evaluator *templates.Evaluator, piiRedactor *pii.Redactor, piiEvents pii.EventStore) error {
+func handleAnthropicStream(c echo.Context, id string, input *schema.AnthropicRequest, cfg *config.ModelConfig, ml *model.ModelLoader, cl *config.ModelConfigLoader, appConfig *config.ApplicationConfig, predInput string, openAIReq *schema.OpenAIRequest, funcs functions.Functions, shouldUseFn bool, mcpExecutor mcpTools.ToolExecutor, evaluator *templates.Evaluator) error {
 	c.Response().Header().Set("Content-Type", "text/event-stream")
 	c.Response().Header().Set("Cache-Control", "no-cache")
 	c.Response().Header().Set("Connection", "keep-alive")

-	// Per-stream PII filter — same gating as the OpenAI chat path. The
-	// filter is wire-format-agnostic; we feed it the text portion of
-	// each text_delta and emit only what's safe to send. The filter
-	// holds back a tail of size MaxPatternLength-1 so a pattern split
-	// across chunk boundaries still gets masked. When PII is disabled
-	// for this model the filter is nil and emits flow unchanged.
-	var streamPIIFilter *pii.StreamFilter
-	if piiRedactor != nil && cfg.PIIIsEnabled() {
-		correlationID := c.Request().Header.Get("x-request-id")
-		userID := ""
-		if u := auth.GetUser(c); u != nil {
-			userID = u.ID
-		}
-		var overrides map[string]pii.Action
-		if raw := cfg.PIIPatternOverrides(); len(raw) > 0 {
-			overrides = make(map[string]pii.Action, len(raw))
-			for ovid, action := range raw {
-				switch pii.Action(action) {
-				case pii.ActionMask, pii.ActionBlock, pii.ActionAllow:
-					overrides[ovid] = pii.Action(action)
-				}
-			}
-		}
-		streamPIIFilter = pii.NewStreamFilter(piiRedactor, overrides, piiEvents, correlationID, userID)
-	}
+	// Response/output PII redaction is out of scope for now — redaction
+	// runs request-side only (the NER middleware).

 	// Send message_start event
 	messageStart := schema.AnthropicStreamEvent{
@@ -440,7 +415,6 @@ func handleAnthropicStream(c echo.Context, id string, input *schema.AnthropicReq

 				if len(toolCalls) > toolCallsEmitted {
 					if !inToolCall && currentBlockIndex == 0 {
-						drainStreamPIIToText(c, streamPIIFilter, intPtr(currentBlockIndex))
 						sendAnthropicSSE(c, schema.AnthropicStreamEvent{
 							Type:  "content_block_stop",
 							Index: intPtr(currentBlockIndex),
@@ -481,20 +455,14 @@ func handleAnthropicStream(c echo.Context, id string, input *schema.AnthropicReq
 			}

 			if !inToolCall && token != "" {
-				out := token
-				if streamPIIFilter != nil {
-					out = streamPIIFilter.Push(token)
-				}
-				if out != "" {
-					sendAnthropicSSE(c, schema.AnthropicStreamEvent{
-						Type:  "content_block_delta",
-						Index: intPtr(0),
-						Delta: &schema.AnthropicStreamDelta{
-							Type: "text_delta",
-							Text: out,
-						},
-					})
-				}
+				sendAnthropicSSE(c, schema.AnthropicStreamEvent{
+					Type:  "content_block_delta",
+					Index: intPtr(0),
+					Delta: &schema.AnthropicStreamDelta{
+						Type: "text_delta",
+						Text: token,
+					},
+				})
 			}
 			return true
 		}
@@ -532,20 +500,14 @@ func handleAnthropicStream(c echo.Context, id string, input *schema.AnthropicReq
 			// didn't already stream it (autoparser clears raw text, so
 			// accumulatedContent will be empty in that case).
 			if deltaContent != "" && !inToolCall && accumulatedContent == "" {
-				out := deltaContent
-				if streamPIIFilter != nil {
-					out = streamPIIFilter.Push(deltaContent)
-				}
-				if out != "" {
-					sendAnthropicSSE(c, schema.AnthropicStreamEvent{
-						Type:  "content_block_delta",
-						Index: intPtr(0),
-						Delta: &schema.AnthropicStreamDelta{
-							Type: "text_delta",
-							Text: out,
-						},
-					})
-				}
+				sendAnthropicSSE(c, schema.AnthropicStreamEvent{
+					Type:  "content_block_delta",
+					Index: intPtr(0),
+					Delta: &schema.AnthropicStreamDelta{
+						Type: "text_delta",
+						Text: deltaContent,
+					},
+				})
 			}

 			// Emit tool_use blocks from ChatDeltas
@@ -553,7 +515,6 @@ func handleAnthropicStream(c echo.Context, id string, input *schema.AnthropicReq
 				collectedToolCalls = deltaToolCalls

 				if !inToolCall && currentBlockIndex == 0 {
-					drainStreamPIIToText(c, streamPIIFilter, intPtr(currentBlockIndex))
 					sendAnthropicSSE(c, schema.AnthropicStreamEvent{
 						Type:  "content_block_stop",
 						Index: intPtr(currentBlockIndex),
@@ -657,9 +618,7 @@ func handleAnthropicStream(c echo.Context, id string, input *schema.AnthropicReq
 		if !shouldUseFn && cfg.FunctionsConfig.AutomaticToolParsingFallback && accumulatedContent != "" && toolCallsEmitted == 0 {
 			parsed := functions.ParseFunctionCall(accumulatedContent, cfg.FunctionsConfig)
 			if len(parsed) > 0 {
-				// Close the text content block (after flushing any
-				// residual the streaming PII filter held back).
-				drainStreamPIIToText(c, streamPIIFilter, intPtr(currentBlockIndex))
+				// Close the text content block.
 				sendAnthropicSSE(c, schema.AnthropicStreamEvent{
 					Type:  "content_block_stop",
 					Index: intPtr(currentBlockIndex),
@@ -699,12 +658,8 @@ func handleAnthropicStream(c echo.Context, id string, input *schema.AnthropicReq
 			}
 		}

-		// No MCP tools to execute, close stream. drainStreamPIIToText
-		// flushes any residual the streaming PII filter held back as
-		// part of its trailing pattern-window before we close the
-		// text content block.
+		// No MCP tools to execute, close the text content block.
 		if !inToolCall {
-			drainStreamPIIToText(c, streamPIIFilter, intPtr(0))
 			sendAnthropicSSE(c, schema.AnthropicStreamEvent{
 				Type:  "content_block_stop",
 				Index: intPtr(0),
@@ -752,30 +707,6 @@ func convertFuncsToOpenAITools(funcs functions.Functions) []functions.Tool {

 func intPtr(i int) *int { return &i }

-// drainStreamPIIToText flushes any residual the streaming PII filter
-// has been holding back as part of its trailing pattern-window, and
-// emits it as one final text_delta into the named block before the
-// caller closes that block. Drain is idempotent: calling it twice on
-// the same filter returns "" the second time. Safe to call with a nil
-// filter (no-op).
-func drainStreamPIIToText(c echo.Context, sf *pii.StreamFilter, index *int) {
-	if sf == nil {
-		return
-	}
-	residual := sf.Drain()
-	if residual == "" {
-		return
-	}
-	sendAnthropicSSE(c, schema.AnthropicStreamEvent{
-		Type:  "content_block_delta",
-		Index: index,
-		Delta: &schema.AnthropicStreamDelta{
-			Type: "text_delta",
-			Text: residual,
-		},
-	})
-}
-
 func sendAnthropicSSE(c echo.Context, event schema.AnthropicStreamEvent) {
 	data, err := json.Marshal(event)
 	if err != nil {
@@ -973,17 +904,14 @@ func convertAnthropicTools(input *schema.AnthropicRequest, cfg *config.ModelConf
 }

 // forwardCloudProxyAnthropicViaBackend marshals the Anthropic request,
-// constructs the streaming PII filter (when applicable), and hands the
-// body off to the cloud-proxy gRPC backend. Model swap + upstream auth
-// headers are applied inside the backend; the filter is built here
-// because the auth/correlation context only exists in the echo handler.
-func forwardCloudProxyAnthropicViaBackend(c echo.Context, cfg *config.ModelConfig, input *schema.AnthropicRequest, piiRedactor *pii.Redactor, piiEvents pii.EventStore, ml *model.ModelLoader, appConfig *config.ApplicationConfig) error {
+// and hands the body off to the cloud-proxy gRPC backend. Model swap +
+// upstream auth headers are applied inside the backend. Request-side PII
+// redaction already ran in the middleware; the response is forwarded
+// unmodified.
+func forwardCloudProxyAnthropicViaBackend(c echo.Context, cfg *config.ModelConfig, input *schema.AnthropicRequest, ml *model.ModelLoader, appConfig *config.ApplicationConfig) error {
 	body, err := json.Marshal(input)
 	if err != nil {
 		return sendAnthropicError(c, 400, "invalid_request_error", "cloudproxy: marshal request: "+err.Error())
 	}
-
-	correlationID := c.Request().Header.Get("x-request-id")
-	streamFilter := cloudproxy.BuildStreamFilter(c, cfg, input.Stream, piiRedactor, piiEvents, correlationID)
-	return cloudproxy.ForwardViaBackend(c, cfg, body, streamFilter, ml, appConfig)
+	return cloudproxy.ForwardViaBackend(c, cfg, body, ml, appConfig)
 }
--- a/core/http/endpoints/anthropic/messages_pii_test.go
+++ b/core/http/endpoints/anthropic/messages_pii_test.go
@@ -1,114 +0,0 @@
-package anthropic
-
-import (
-	"net/http"
-	"net/http/httptest"
-	"strings"
-
-	"github.com/labstack/echo/v4"
-	"github.com/mudler/LocalAI/core/services/routing/pii"
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-// drainStreamPIIToText is called from four sites in messages.go and is
-// the load-bearing primitive for "the streaming filter has buffered
-// some bytes that the request just ended on; flush them as a final
-// text_delta event before closing the content block". A regression
-// here would silently truncate the last few bytes of an assistant
-// response on every PII-enabled stream — invisible without coverage.
-
-// newTestFilter compiles the default patterns and returns a filter
-// that holds back its trailing pattern-window; pushing a short string
-// (shorter than holdLen) keeps the bytes inside Drain.
-func newTestFilter() *pii.StreamFilter {
-	patterns, err := pii.Compile(pii.DefaultPatterns())
-	ExpectWithOffset(1, err).NotTo(HaveOccurred())
-	red := pii.NewRedactor(patterns)
-	return pii.NewStreamFilter(red, nil, nil, "", "")
-}
-
-// newTestContext builds a recording echo context — the recorder
-// captures the SSE bytes drainStreamPIIToText writes.
-func newTestContext() (echo.Context, *httptest.ResponseRecorder) {
-	req := httptest.NewRequest(http.MethodPost, "/v1/messages", strings.NewReader("{}"))
-	rec := httptest.NewRecorder()
-	return echo.New().NewContext(req, rec), rec
-}
-
-var _ = Describe("drainStreamPIIToText", func() {
-	It("is a no-op when the filter is nil", func() {
-		c, rec := newTestContext()
-		drainStreamPIIToText(c, nil, intPtr(0))
-		Expect(rec.Body.Len()).To(Equal(0), "nil filter wrote %d bytes: %q", rec.Body.Len(), rec.Body.String())
-	})
-
-	It("emits nothing when the drain is empty", func() {
-		// A filter with nothing buffered should not emit a phantom event;
-		// otherwise every non-PII response would close with an empty
-		// text_delta that pollutes downstream parsers.
-		sf := newTestFilter()
-		c, rec := newTestContext()
-		drainStreamPIIToText(c, sf, intPtr(0))
-		Expect(rec.Body.Len()).To(Equal(0), "empty drain wrote %d bytes: %q", rec.Body.Len(), rec.Body.String())
-	})
-
-	It("flushes residual buffered bytes as a text_delta event", func() {
-		sf := newTestFilter()
-		// Push less than holdLen so all bytes are retained until Drain.
-		// "tail" is short enough that no pattern is plausible.
-		out := sf.Push("tail")
-		Expect(out).To(Equal(""), "Push of short text emitted %q; want all bytes held", out)
-
-		c, rec := newTestContext()
-		drainStreamPIIToText(c, sf, intPtr(2))
-
-		body := rec.Body.String()
-		// Wire format: "event: content_block_delta\ndata: {…}\n\n"
-		Expect(body).To(ContainSubstring("event: content_block_delta"))
-		Expect(body).To(ContainSubstring(`"type":"content_block_delta"`))
-		Expect(body).To(ContainSubstring(`"index":2`))
-		Expect(body).To(ContainSubstring(`"text":"tail"`))
-		Expect(body).To(ContainSubstring(`"type":"text_delta"`))
-		Expect(strings.HasSuffix(body, "\n\n")).To(BeTrue(), "SSE event missing trailing blank line: %q", body)
-	})
-
-	It("is idempotent across consecutive drains", func() {
-		// Two consecutive Drains: the filter returns "" the second time,
-		// so the second drainStreamPIIToText must emit nothing. The
-		// production path in messages.go has at least four call sites
-		// that may overlap (currentBlockIndex==0 emergency path + the
-		// unconditional drain near the end of the stream); without
-		// idempotence we'd duplicate the residual on the wire.
-		sf := newTestFilter()
-		sf.Push("tail")
-
-		c1, rec1 := newTestContext()
-		drainStreamPIIToText(c1, sf, intPtr(0))
-		first := rec1.Body.Len()
-		Expect(first).NotTo(Equal(0), "first drain emitted nothing")
-
-		c2, rec2 := newTestContext()
-		drainStreamPIIToText(c2, sf, intPtr(0))
-		Expect(rec2.Body.Len()).To(Equal(0), "second drain wrote %d bytes; want idempotent no-op: %q", rec2.Body.Len(), rec2.Body.String())
-	})
-
-	It("masks redacted residual instead of leaking it", func() {
-		// The held tail must travel through the redactor on Drain. If
-		// the bytes happen to form a complete pattern at end-of-stream,
-		// the residual emit must contain the mask placeholder, not the
-		// raw value.
-		sf := newTestFilter()
-		// "alice@example.com" is 17 bytes. holdLen for default patterns
-		// is well above 17, so this stays buffered until Drain, which
-		// then redacts it.
-		out := sf.Push("alice@example.com")
-		Expect(out).To(Equal(""), "Push emitted bytes early: %q", out)
-
-		c, rec := newTestContext()
-		drainStreamPIIToText(c, sf, intPtr(0))
-		body := rec.Body.String()
-		Expect(body).NotTo(ContainSubstring("alice@example.com"), "raw email leaked in residual emit: %q", body)
-		Expect(body).To(ContainSubstring("[REDACTED:email]"), "residual emit missing mask placeholder: %q", body)
-	})
-})
--- a/core/http/endpoints/localai/api_instructions.go
+++ b/core/http/endpoints/localai/api_instructions.go
@@ -100,15 +100,15 @@ var instructionDefs = []instructionDef{
 	},
 	{
 		Name:        "pii-filtering",
-		Description: "Inspect and tune the regex PII filter applied to chat requests",
+		Description: "Inspect the NER-based PII filter applied to chat requests",
 		Tags:        []string{"pii"},
-		Intro:       "GET /api/pii/patterns lists the active pattern set with each one's action (mask, block, allow). GET /api/pii/events returns recent redaction events filtered by correlation_id / user_id / pattern_id (admin or local-user only). POST /api/pii/test dry-runs the redactor against an admin-supplied string. POST /api/pii/decide is the programmatic decision oracle for external routers: send `{text}`, receive `{findings, suggested_action, redacted_preview}` without LocalAI mutating, recording, or acting on the call — caller composes the action with its own policy. Default patterns: email, phone, SSN, credit card (Luhn), IPv4, common API key prefixes (sk-, pk-, ghp_, github_pat_). PII is per-model: by default it is OFF for non-proxy backends and ON for backends starting with proxy-* (cloud passthroughs). Opt in with `pii: { enabled: true }` in a model's YAML; use `pii: { patterns: [{id, action}] }` to upgrade or downgrade individual actions for that model. Override global default actions via --pii-config pii.yaml; --disable-pii turns the filter off entirely.",
+		Intro:       "PII redaction is NER-based and request-side. A consuming model opts in with `pii: { enabled: true, detectors: [<model>] }` where each detector is a token-classification (token_classify) model. The detection policy lives on the detector model itself in a `pii_detection:` block: `{ min_score, default_action (mask|block|allow), entity_actions: { GROUP: action } }`. Multiple detectors union their hits; overlapping spans resolve to the strongest action (block > mask > allow). PII defaults OFF for non-proxy backends and ON for proxy-* (cloud passthroughs). Besides the inline path, two synchronous service endpoints expose the same engine without an inference request: POST /api/pii/analyze returns the detected entity spans (entity_type, source ner|pattern, start/end, score, action) without mutating the text, and POST /api/pii/redact applies the policy — returning redacted_text, or 400 (type pii_blocked) with the offending entities when a block action fires. Both take `{ text, detectors:[<model>...] }` (or `model` to inherit a consuming model's detectors), require the pii_filter feature (any authenticated user), and record audit events with an `origin` of pii_analyze / pii_redact. GET /api/pii/events returns recent redaction events filtered by correlation_id / user_id / pattern_id / origin (middleware|proxy|pii_analyze|pii_redact); events carry `<source>:<GROUP>` ids — e.g. `ner:EMAIL` for the neural detector, `pattern:ANTHROPIC_KEY` for the regex pattern tier — and an 8-char hash prefix, never the matched value (admin or local-user only). The legacy regex pattern tier and its endpoints (/api/pii/patterns, /test, /decide) were removed.",
 	},
 	{
 		Name:        "middleware-admin",
 		Description: "Inspect and configure the routing-module middleware (PII filter and routing)",
 		Tags:        []string{"middleware", "pii", "router"},
-		Intro:       "GET /api/middleware/status is the single round-trip the /app/middleware admin page reads to render the current state: active PII patterns and their actions, every model's resolved enabled/override state, recent event count, and the active routing models with their classifier configurations. Admin-only (the synthetic local user is admin in no-auth mode). PUT /api/pii/patterns/:id changes a pattern's action in-process — TRANSIENT, lost on restart. To persist, edit --pii-config YAML. GET /api/router/decisions returns the routing decision log filtered by correlation_id / user_id / router_model. The same surface is exposed as MCP tools (`get_middleware_status`, `set_pii_pattern_action`, `get_router_decisions`) for agent-driven configuration.",
+		Intro:       "GET /api/middleware/status is the single round-trip the /app/middleware admin page reads to render the current state: every model's resolved PII enabled state and the NER detector models it references, recent event count, and the active routing models with their classifier configurations. Admin-only (the synthetic local user is admin in no-auth mode). PII detection policy is edited on each detector model's `pii_detection:` block via the model-config tools/UI — there is no global pattern set to mutate. GET /api/router/decisions returns the routing decision log filtered by correlation_id / user_id / router_model. The same surface is exposed as MCP tools (`get_middleware_status`, `get_pii_events`, `get_router_decisions`) for agent-driven inspection.",
 	},
 	{
 		Name:        "intelligent-routing",
--- a/core/http/endpoints/localai/backend.go
+++ b/core/http/endpoints/localai/backend.go
@@ -25,6 +25,10 @@ var knownPrefOnlyBackends = []schema.KnownBackend{
 	// Text LLM
 	// ds4: antirez/ds4 - single-model DeepSeek V4 Flash engine; auto-detected via DS4Importer
 	{Name: "ds4", Modality: "text", AutoDetect: false, Description: "antirez/ds4 DeepSeek V4 Flash engine (auto-detected; pref-only fallback)"},
+	// privacy-filter is now auto-detected via PrivacyFilterImporter (see
+	// core/gallery/importers/privacy-filter.go); the importer registry entry
+	// supersedes any pref-only line here, which the /backends/known merge would
+	// dedupe away.
 	{Name: "sglang", Modality: "text", AutoDetect: false, Description: "SGLang runtime (preference-only)"},
 	{Name: "tinygrad", Modality: "text", AutoDetect: false, Description: "tinygrad runtime (preference-only)"},
 	{Name: "trl", Modality: "text", AutoDetect: false, Description: "Transformers Reinforcement Learning (preference-only)"},
@@ -38,6 +42,7 @@ var knownPrefOnlyBackends = []schema.KnownBackend{
 	{Name: "qwen3-tts-cpp", Modality: "tts", AutoDetect: false, Description: "Qwen3 TTS C++ (preference-only)"},
 	{Name: "omnivoice-cpp", Modality: "tts", AutoDetect: false, Description: "OmniVoice C++ TTS with voice cloning and voice design (preference-only)"},
 	{Name: "faster-qwen3-tts", Modality: "tts", AutoDetect: false, Description: "Faster Qwen3 TTS (preference-only)"},
+	{Name: "supertonic", Modality: "tts", AutoDetect: false, Description: "Supertonic multilingual ONNX TTS (preference-only)"},
 	// Detection
 	{Name: "sam3-cpp", Modality: "detection", AutoDetect: false, Description: "SAM3 C++ object detection (preference-only)"},
 	// Audio transform (audio-in / audio-out, optional reference signal)