chore(deps): bump the go_modules group across 1 directory with 8 updates

Bumps the go_modules group with 7 updates in the / directory:

| Package | From | To |
| --- | --- | --- |
| [github.com/containerd/containerd](https://github.com/containerd/containerd) | `1.7.31` | `1.7.32` |
| [github.com/in-toto/in-toto-golang](https://github.com/in-toto/in-toto-golang) | `0.9.0` | `0.11.0` |
| [github.com/sigstore/rekor](https://github.com/sigstore/rekor) | `1.4.3` | `1.5.0` |
| [github.com/sigstore/timestamp-authority/v2](https://github.com/sigstore/timestamp-authority) | `2.0.3` | `2.0.6` |
| [github.com/theupdateframework/go-tuf/v2](https://github.com/theupdateframework/go-tuf) | `2.3.0` | `2.4.1` |
| [github.com/go-git/go-git/v5](https://github.com/go-git/go-git) | `5.19.0` | `5.19.1` |
| [github.com/slack-go/slack](https://github.com/slack-go/slack) | `0.17.3` | `0.23.1` |

Updates `github.com/containerd/containerd` from 1.7.31 to 1.7.32
- [Release notes](https://github.com/containerd/containerd/releases)
- [Changelog](https://github.com/containerd/containerd/blob/main/RELEASES.md)
- [Commits](https://github.com/containerd/containerd/compare/v1.7.31...v1.7.32)

Updates `github.com/in-toto/in-toto-golang` from 0.9.0 to 0.11.0
- [Release notes](https://github.com/in-toto/in-toto-golang/releases)
- [Changelog](https://github.com/in-toto/in-toto-golang/blob/master/CHANGELOG.md)
- [Commits](https://github.com/in-toto/in-toto-golang/compare/v0.9.0...v0.11.0)

Updates `github.com/sigstore/rekor` from 1.4.3 to 1.5.0
- [Release notes](https://github.com/sigstore/rekor/releases)
- [Changelog](https://github.com/sigstore/rekor/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sigstore/rekor/compare/v1.4.3...v1.5.0)

Updates `github.com/sigstore/sigstore` from 1.10.0 to 1.10.3
- [Release notes](https://github.com/sigstore/sigstore/releases)
- [Commits](https://github.com/sigstore/sigstore/compare/v1.10.0...v1.10.3)

Updates `github.com/sigstore/timestamp-authority/v2` from 2.0.3 to 2.0.6
- [Release notes](https://github.com/sigstore/timestamp-authority/releases)
- [Changelog](https://github.com/sigstore/timestamp-authority/blob/main/CHANGELOG.md)
- [Commits](https://github.com/sigstore/timestamp-authority/compare/v2.0.3...v2.0.6)

Updates `github.com/theupdateframework/go-tuf/v2` from 2.3.0 to 2.4.1
- [Release notes](https://github.com/theupdateframework/go-tuf/releases)
- [Commits](https://github.com/theupdateframework/go-tuf/compare/v2.3.0...v2.4.1)

Updates `github.com/go-git/go-git/v5` from 5.19.0 to 5.19.1
- [Release notes](https://github.com/go-git/go-git/releases)
- [Changelog](https://github.com/go-git/go-git/blob/main/HISTORY.md)
- [Commits](https://github.com/go-git/go-git/compare/v5.19.0...v5.19.1)

Updates `github.com/slack-go/slack` from 0.17.3 to 0.23.1
- [Release notes](https://github.com/slack-go/slack/releases)
- [Changelog](https://github.com/slack-go/slack/blob/master/CHANGELOG.md)
- [Commits](https://github.com/slack-go/slack/compare/v0.17.3...v0.23.1)

---
updated-dependencies:
- dependency-name: github.com/containerd/containerd
  dependency-version: 1.7.32
  dependency-type: direct:production
- dependency-name: github.com/go-git/go-git/v5
  dependency-version: 5.19.1
  dependency-type: indirect
- dependency-name: github.com/in-toto/in-toto-golang
  dependency-version: 0.11.0
  dependency-type: indirect
- dependency-name: github.com/sigstore/rekor
  dependency-version: 1.5.0
  dependency-type: indirect
- dependency-name: github.com/sigstore/sigstore
  dependency-version: 1.10.3
  dependency-type: indirect
- dependency-name: github.com/sigstore/timestamp-authority/v2
  dependency-version: 2.0.6
  dependency-type: indirect
- dependency-name: github.com/slack-go/slack
  dependency-version: 0.23.1
  dependency-type: indirect
- dependency-name: github.com/theupdateframework/go-tuf/v2
  dependency-version: 2.4.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-06-28 10:27:30 -04:00 · 2026-06-03 08:45:00 +00:00
904 changed files with 11155 additions and 66250 deletions
--- a/.agents/adding-backends.md
+++ b/.agents/adding-backends.md
@@ -102,24 +102,6 @@ Multi-arch backends are NOT a single matrix entry with `platforms: 'linux/amd64,

 Entries whose `dockerfile` is `./backend/Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}` must also set a `builder-base-image` field pointing at a prebuilt base from `quay.io/go-skynet/ci-cache:base-grpc-*` (CI builds these via `.github/workflows/base-images.yml`). The mapping is by `(build-type, platforms)` — see existing entries for the pattern. CI uses these prebuilt bases to skip the gRPC compile (~25–35 min cold). Local `make backends/<name>` ignores `builder-base-image` and uses the from-source path inside the Dockerfile, so you don't need quay access for local builds.

-### Cover every OS the project supports (Linux **and** Darwin)
-
-`.github/backend-matrix.yml` has two matrices, and they are the source of truth for which OS a backend ships on:
-
- `include:` — the **Linux** matrix (x86_64 + arm64; CPU and CUDA / ROCm / SYCL / Vulkan).
- `includeDarwin:` — the **macOS / Apple Silicon** matrix (arm64; Metal where the engine supports it, otherwise a native arm64 CPU build).
-
-**A new backend must target every OS it can build for — do not ship Linux-only by default.** A backend that appears only under `include:` is silently unavailable on macOS even when its code would run there. Most C/C++/GGML engines build on Darwin out of the box (ggml defaults `GGML_METAL=ON` on Apple, so a plain build is Metal-enabled), and many Python backends do too (CPU / MPS wheels). If a backend genuinely cannot support an OS (e.g. CUDA-only, no CPU variant), state that in the PR description instead of omitting it silently.
-
-Wiring a backend into `includeDarwin:` is more than the matrix entry:
-
-1. **`includeDarwin:` entry** — `tag-suffix: "-metal-darwin-arm64-<backend>"`, `build-type: "metal"`, `lang: "go"` for go+ggml backends; omit `build-type` for the bespoke C++ ones (llama-cpp / ds4 / privacy-filter). Match an existing entry of the same shape.
-2. **`backend/index.yaml`** — add `metal:` to the backend's `capabilities` map (main and `-development`) and concrete `metal-<backend>` / `metal-<backend>-development` image entries pointing at the `-metal-darwin-arm64-<backend>` images.
-3. **C/C++ backends only** — add an `inferBackendPathDarwin` case in `scripts/changed-backends.js` returning `backend/cpp/<backend>/` (the generic fallthrough assumes `backend/<lang>/`, which is wrong for a C++ source tree driven with `lang: go`), and give `run.sh` a Darwin branch that exports `DYLD_LIBRARY_PATH` instead of `LD_LIBRARY_PATH`. If the build is bespoke (single `grpc-server` + dylib bundling), model it on `scripts/build/ds4-darwin.sh` and add a `backends/<backend>-darwin` make target plus a gated step in `.github/workflows/backend_build_darwin.yml`.
-4. **C++ proto gotcha** — if the backend compiles the generated gRPC/protobuf in a separate CMake target (e.g. `hw_grpc_proto`), that target must link `protobuf::libprotobuf` + `gRPC::grpc++` so the Homebrew include dirs propagate; otherwise macOS fails with `google/protobuf/runtime_version.h not found` (Linux hides this because apt headers sit in `/usr/include`).
-
-The CI path filter only builds a backend on a PR when a file under its directory changes, so a darwin-only YAML edit builds nothing — touch a file under `backend/<lang>/<backend>/` (a one-line comment is enough) in the same PR.
-
 ## 3. Add Backend Metadata to `backend/index.yaml`

 **Step 3a: Add Meta Definition**
@@ -216,34 +198,12 @@ docker-build-backends: ... docker-build-<backend-name>
 - If the backend is in `backend/python/<backend-name>/` but uses `.` as context in the workflow file, use `.` context
 - Check similar backends to determine the correct context

-## Documenting the backend (README + docs)
-
-A backend is not "added" until it is discoverable. Update the user-facing docs:
-
- **`docs/content/features/backends.md`** - add the backend to the right
-  category in the "LocalAI supports various types of backends" list (and add a
-  new category if it introduces a new modality, e.g. sound classification).
- If the backend introduces a **new API surface** (a new endpoint or a realtime
-  capability), document it under `docs/content/` where its area lives (audio,
-  vision, etc.) and follow the api-endpoints checklist in
-  [api-endpoints-and-auth.md](api-endpoints-and-auth.md).
-
-**If the backend is a native C/C++/GGML engine created and maintained by the
-LocalAI team** (a from-scratch port like `parakeet.cpp`, `ced.cpp`,
-`vibevoice.cpp`, `rf-detr.cpp`, not a wrapper around a third-party runtime), it
-ALSO belongs in the top-level **`README.md`** table under "native C/C++/GGML
-engines ... developed and maintained by the LocalAI project itself". Add a row
-linking the upstream engine repo with a one-line description. This is the
-project's showcase of its own engines; a new in-house backend that is missing
-from it is a documentation bug.
-
 ## 5. Verification Checklist

 After adding a new backend, verify:

 - [ ] Backend directory structure is complete with all necessary files
 - [ ] Build configurations added to `.github/backend-matrix.yml` for all desired platforms (per-arch entries with `platform-tag` for multi-arch; `builder-base-image` for llama-cpp / ik-llama-cpp / turboquant)
- [ ] **OS coverage considered**: added to `includeDarwin:` (macOS/Apple Silicon) if the backend can build there — with the `backend/index.yaml` `metal:` capability + `metal-<backend>` image entries, a `run.sh` Darwin/DYLD branch and `inferBackendPathDarwin` case for C++ backends — or the PR explains why an OS is unsupported. Do not ship Linux-only by default.
 - [ ] Meta definition added to `backend/index.yaml` in the `## metas` section
 - [ ] Image entries added to `backend/index.yaml` for all build variants (latest + development)
 - [ ] Tag suffixes match between workflow file and index.yaml
@@ -251,8 +211,6 @@ After adding a new backend, verify:
 - [ ] No YAML syntax errors (check with linter)
 - [ ] No Makefile syntax errors (check with linter)
 - [ ] Follows the same pattern as similar backends (e.g., if it's a transcription backend, follow `faster-whisper` pattern)
- [ ] Documented: added to the category list in `docs/content/features/backends.md` (and any new endpoint/realtime capability documented under `docs/content/`)
- [ ] If it is an in-house native C/C++/GGML engine, added to the maintained-engines table in the top-level `README.md`

 ## Bundling runtime shared libraries (`package.sh`)

--- a/.agents/ds4-backend.md
+++ b/.agents/ds4-backend.md
@@ -44,39 +44,6 @@ maps to `DS4_THINK_HIGH`. We pass the chosen mode to `ds4_chat_append_assistant_
 via `ModelOptions.Options[] = "kv_cache_dir:/some/path"`. Format is **our own** -
 NOT bit-compatible with ds4-server's KVC files (interop is a follow-up plan).

-## Engine options (LoadModel)
-
-`LoadModel` maps `ModelOptions.Options[]` (`"key:value"`, from model-YAML
-`options:`) onto `ds4_engine_options` through a **declarative table**
-(`kEngineOptSpecs` + `apply_engine_option` in `grpc-server.cpp`). The struct is
-plain C with no reflection, so the field set is enumerated once in the table;
-adding a future engine knob is a one-line table row, not a new branch. Unknown
-keys are ignored (back-compat). A bare flag (`ssd_streaming` with no value)
-means `true`. Path-type values (`mtp_path`, `expert_profile_path`,
-`directional_steering_file`) resolve **relative to the model directory**, so a
-gallery entry can reference a companion file it downloaded by bare filename;
-absolute values pass through. `ds4_role` / `ds4_layers` / `ds4_listen` /
-`ds4_route_timeout` / `kv_cache_dir` keep their dedicated handling (validation
-+ coordinator wiring) and are not in the table.
-
-Wired keys: `mtp_path`, `mtp_draft`, `mtp_margin`, `prefill_chunk`,
-`power_percent`, `warm_weights`, `quality`, `ssd_streaming`,
-`ssd_streaming_cold`, `ssd_streaming_preload_experts`,
-`ssd_streaming_cache_experts` (count or `NGB`, sets both experts+bytes via
-`ds4_parse_streaming_cache_experts_arg`), `simulate_used_memory` (`NGB` via
-`ds4_parse_gib_arg`), `expert_profile_path`, `directional_steering_file`,
-`directional_steering_attn`, `directional_steering_ffn`.
-
-## SSD streaming (running models larger than RAM)
-
-ds4's **SSD streaming** keeps non-routed weights resident and streams routed MoE
-experts from the GGUF on cache misses, turning "does it fit in RAM" into a speed
-spectrum. **Metal (Darwin) only** - it is a no-op on CUDA/CPU. Enable with
-`options: ["ssd_streaming"]`; size the routed-expert cache with
-`ssd_streaming_cache_experts:NGB` (omit for ds4's automatic 80%-of-working-set
-budget). Gallery entries built on this: `deepseek-v4-flash-q4-ssd` (153 GB Flash
-on a 128 GB Mac) and `deepseek-v4-pro-q2-ssd` (433 GB Pro, experimental).
-
 ## Build matrix

 | Build | Where | Notes |
--- a/.docker/install-base-deps.sh
+++ b/.docker/install-base-deps.sh
@@ -70,12 +70,6 @@ if [ "${BUILD_TYPE:-}" = "vulkan" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; t
        git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
        ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
        clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
-    # Mesa Vulkan ICD drivers (ANV/RADV/lavapipe + Arm SoC) and their ICD
-    # manifests. The LunarG SDK below only provides the loader and shader
-    # tooling, not hardware drivers — without Mesa the packaged Vulkan backend
-    # would ship a loader that finds no GPU. package-gpu-libs.sh bundles these
-    # .so files plus their deps into the backend so it stays self-contained.
-    apt-get install -y mesa-vulkan-drivers libdrm2
    if [ "amd64" = "${TARGETARCH:-}" ]; then
        wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz"
        tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz
--- a/.docker/llama-cpp-compile.sh
+++ b/.docker/llama-cpp-compile.sh
@@ -17,29 +17,19 @@ if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
  rm -rf /LocalAI/backend/cpp/llama-cpp-*-build
 fi

-cd /LocalAI/backend/cpp/llama-cpp
-if [ -z "${BUILD_TYPE:-}" ]; then
-  # Pure CPU image (BUILD_TYPE empty): one build with ggml CPU_ALL_VARIANTS replaces the
-  # per-microarch binaries (x86: avx/avx2/avx512/fallback; arm64: armv8.x/armv9.x). ggml
-  # dlopens the best libggml-cpu-*.so at runtime by probing host CPU features.
-  #
-  # arm64: the CPU_ALL_VARIANTS table includes armv9.2 SME variants whose -march=...+sme is
-  # rejected by the Ubuntu 24.04 default gcc-13. gcc-14 accepts it, so build the arm64
-  # variants with it (the host never *selects* SME unless it has it, but every variant must
-  # still compile).
-  if [ "${TARGETARCH}" = "arm64" ]; then
-    apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
-    export CC=gcc-14 CXX=g++-14
-  fi
-  make llama-cpp-cpu-all
-else
-  # GPU build (cublas/hipblas/sycl/vulkan/...): the accelerator does the compute, so a
-  # single fallback CPU build is enough - no per-microarch CPU variants needed. (This also
-  # keeps the heavy GPU backend compile from also building the whole CPU variant matrix,
-  # and avoids the gcc-14 apt step on GPU base images such as nvidia l4t.)
+if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
+  cd /LocalAI/backend/cpp/llama-cpp
  make llama-cpp-fallback
+  make llama-cpp-grpc
+  make llama-cpp-rpc-server
+else
+  cd /LocalAI/backend/cpp/llama-cpp
+  make llama-cpp-avx
+  make llama-cpp-avx2
+  make llama-cpp-avx512
+  make llama-cpp-fallback
+  make llama-cpp-grpc
+  make llama-cpp-rpc-server
 fi
-make llama-cpp-grpc
-make llama-cpp-rpc-server

 ccache -s || true
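
The `+` side of the hunk above can be read as a mapping from `(TARGETARCH, BUILD_TYPE)` to a list of make targets. The following is a minimal sketch of that mapping, not the script itself: `llama_cpp_targets` is a hypothetical helper name, and it only prints the targets instead of invoking `make`.

```shell
#!/bin/bash
# Hypothetical helper (not part of the repository): prints the make targets
# that the "+" side of llama-cpp-compile.sh runs for a given architecture and
# build type, without invoking make.
llama_cpp_targets() {
  local arch="${1:-}" build_type="${2:-}"
  if [ "$arch" = "arm64" ] || [ "$build_type" = "hipblas" ]; then
    # arm64 and hipblas builds get only the fallback CPU variant.
    echo "llama-cpp-fallback llama-cpp-grpc llama-cpp-rpc-server"
  else
    # Every other combination also builds the avx/avx2/avx512 variants.
    echo "llama-cpp-avx llama-cpp-avx2 llama-cpp-avx512 llama-cpp-fallback llama-cpp-grpc llama-cpp-rpc-server"
  fi
}

llama_cpp_targets arm64 ""
llama_cpp_targets amd64 cublas
```

Under this reading, a GPU build on amd64 other than hipblas (for example `cublas`) still compiles all three x86 variants plus the fallback, which is the behavior the `-` side avoided.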
--- a/.docker/turboquant-compile.sh
+++ b/.docker/turboquant-compile.sh
@@ -19,21 +19,17 @@ fi

 cd /LocalAI/backend/cpp/turboquant

-if [ -z "${BUILD_TYPE:-}" ]; then
-  # Pure CPU image: one ggml CPU_ALL_VARIANTS build replaces the per-microarch binaries.
-  # arm64: the armv9.2 SME variants need gcc-14 (gcc-13 rejects +sme).
-  if [ "${TARGETARCH}" = "arm64" ]; then
-    apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
-    export CC=gcc-14 CXX=g++-14
-  fi
-  make turboquant-cpu-all
-else
-  # GPU build (cublas/hipblas/sycl/vulkan/...): single fallback CPU build, the accelerator
-  # does the compute. Keeps the GPU compile from also building the CPU variant matrix and
-  # avoids the gcc-14 apt step on GPU base images such as nvidia l4t.
+if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
  make turboquant-fallback
+  make turboquant-grpc
+  make turboquant-rpc-server
+else
+  make turboquant-avx
+  make turboquant-avx2
+  make turboquant-avx512
+  make turboquant-fallback
+  make turboquant-grpc
+  make turboquant-rpc-server
 fi
-make turboquant-grpc
-make turboquant-rpc-server

 ccache -s || true
--- a/.dockerignore
+++ b/.dockerignore
@@ -31,15 +31,6 @@ backend/python/**/source
 backend/cpp/llama-cpp/llama.cpp
 backend/cpp/llama-cpp-*-build

-# privacy-filter: same in-place pattern. The Makefile fetches privacy-filter.cpp
-# at the pinned commit (or symlinks a PRIVACY_FILTER_SRC checkout for local dev).
-# A stale dir/symlink COPY'd into the image makes the clone step fail (dangling
-# symlink) or compile against the wrong commit, so keep host build state out.
-backend/cpp/privacy-filter/privacy-filter.cpp
-backend/cpp/privacy-filter/build
-backend/cpp/privacy-filter/grpc-server
-backend/cpp/privacy-filter/package
-
 # Rust backend build output (sources are tracked; target/ is generated)
 backend/rust/*/target

--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
--- a/.github/bump_vllm_metal.sh
+++ b/.github/bump_vllm_metal.sh
@@ -1,55 +0,0 @@
-#!/bin/bash
-# Bump the single vllm-metal pin (VLLM_METAL_VERSION) in the vLLM backend's
-# darwin (Apple Silicon) install path. The macOS/Metal build
-# (backend/python/vllm/install.sh, Darwin branch) installs vllm-metal, which is
-# version-locked to a specific vLLM source release. install.sh derives that vLLM
-# version at build time from vllm-metal's own installer (`vllm_v=`) at the pinned
-# tag, so there is only ONE value to bump here -- mirroring bump_vllm_wheel.sh,
-# which bumps the Linux cu130 wheel pin.
-#
-# This deliberately tracks vllm-project/vllm-metal, NOT vllm-project/vllm: the
-# darwin build can only use the exact vLLM version vllm-metal supports, so it may
-# lag the Linux pin (requirements-cublas13-after.txt) until vllm-metal catches up.
-set -xe
-REPO=$1   # vllm-project/vllm-metal
-FILE=$2   # backend/python/vllm/install.sh
-VAR=$3    # VLLM_METAL_VERSION (used for the workflow's output file names)
-
-if [ -z "$FILE" ] || [ -z "$REPO" ] || [ -z "$VAR" ]; then
-    echo "usage: $0 <repo> <install-file> <var-name>" >&2
-    exit 1
-fi
-
-# vllm-metal ships frequent dev releases, all flagged as non-prerelease, so
-# /releases/latest returns the newest one (with its cp312 wheel asset).
-LATEST_TAG=$(curl -sS -H "Accept: application/vnd.github+json" \
-    "https://api.github.com/repos/$REPO/releases/latest" \
-    | python3 -c "import json,sys; print(json.load(sys.stdin)['tag_name'])")
-
-# The coupled vLLM source version lives in vllm-metal's installer at that tag.
-NEW_VLLM_VERSION=$(curl -fsSL \
-    "https://raw.githubusercontent.com/$REPO/$LATEST_TAG/install.sh" \
-    | grep -oE 'vllm_v="[0-9]+\.[0-9]+\.[0-9]+"' | head -1 | cut -d'"' -f2)
-
-if [ -z "$LATEST_TAG" ] || [ -z "$NEW_VLLM_VERSION" ]; then
-    echo "Could not resolve vllm-metal tag ($LATEST_TAG) or its vllm_v ($NEW_VLLM_VERSION)." >&2
-    exit 1
-fi
-
-set +e
-CURRENT_TAG=$(grep -oE 'VLLM_METAL_VERSION="[^"]*"' "$FILE" | head -1 | cut -d'"' -f2)
-set -e
-
-# Rewrite the single pin. install.sh derives VLLM_VERSION from this tag at build
-# time, so there is nothing else to touch. peter-evans/create-pull-request opens
-# no PR on a clean tree, so a no-op rewrite (already current) is safe.
-sed -i "$FILE" \
-    -e "s|VLLM_METAL_VERSION=\"[^\"]*\"|VLLM_METAL_VERSION=\"$LATEST_TAG\"|"
-
-if [ -z "$CURRENT_TAG" ]; then
-    echo "Could not find VLLM_METAL_VERSION=\"...\" in $FILE." >&2
-    exit 0
-fi
-
-echo "vllm-metal ${CURRENT_TAG} -> ${LATEST_TAG} (builds vLLM ${NEW_VLLM_VERSION}): https://github.com/$REPO/releases/tag/${LATEST_TAG}" >> "${VAR}_message.txt"
-echo "${LATEST_TAG}" >> "${VAR}_commit.txt"
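
The deleted script resolved the coupled vLLM version with a `grep | head | cut` pipeline over the upstream installer. A minimal offline sketch of that extraction step follows; `extract_vllm_version` is a hypothetical wrapper name, and the sample text and its version number are made up for illustration rather than taken from vllm-metal.

```shell
#!/bin/bash
# Hypothetical wrapper around the pipeline used by the deleted script:
# reads installer text on stdin and prints the first vllm_v="X.Y.Z" value.
extract_vllm_version() {
  grep -oE 'vllm_v="[0-9]+\.[0-9]+\.[0-9]+"' | head -1 | cut -d'"' -f2
}

# Made-up stand-in for the installer contents (illustrative only).
sample='# installer excerpt
vllm_v="1.2.3"
other_var="ignored"'

printf '%s\n' "$sample" | extract_vllm_version
```

When no line matches, the function prints nothing, which is the case the script's `-z "$NEW_VLLM_VERSION"` check guarded against.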
--- a/.github/gallery-agent/main.go
+++ b/.github/gallery-agent/main.go
@@ -3,7 +3,6 @@ package main
 import (
 	"context"
 	"encoding/json"
-	"errors"
 	"fmt"
 	"os"
 	"strconv"
@@ -114,17 +113,6 @@ func main() {
 	fmt.Println("Searching for trending models on HuggingFace...")
 	rawModels, err := client.GetTrending(searchTerm, limit)
 	if err != nil {
-		if errors.Is(err, hfapi.ErrRateLimited) {
-			fmt.Printf("HuggingFace API is rate limited after retries, skipping this run: %v\n", err)
-			writeSummary(AddedModelSummary{
-				SearchTerm:     searchTerm,
-				TotalFound:     0,
-				ModelsAdded:    0,
-				Quantization:   quantization,
-				ProcessingTime: time.Since(startTime).String(),
-			})
-			return
-		}
 		fmt.Fprintf(os.Stderr, "Error fetching models: %v\n", err)
 		os.Exit(1)
 	}
@@ -289,3 +277,4 @@ func truncateString(s string, maxLen int) string {
 	}
 	return s[:maxLen] + "..."
 }
+
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
@@ -44,7 +44,7 @@ jobs:
      has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6

      - name: Setup Bun
        uses: oven-sh/setup-bun@v2
--- a/.github/workflows/backend_build.yml
+++ b/.github/workflows/backend_build.yml
@@ -101,7 +101,7 @@ jobs:
    steps:

      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true

--- a/.github/workflows/backend_build_darwin.yml
+++ b/.github/workflows/backend_build_darwin.yml
@@ -57,7 +57,7 @@ jobs:
      HOMEBREW_NO_ANALYTICS: '1'
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true

@@ -98,8 +98,6 @@ jobs:
            /opt/homebrew/Cellar/hiredis
            /opt/homebrew/Cellar/xxhash
            /opt/homebrew/Cellar/zstd
-            /opt/homebrew/Cellar/nlohmann-json
-            /opt/homebrew/Cellar/opus
          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}

      - name: Dependencies
@@ -111,15 +109,7 @@ jobs:
          # Without explicitly installing them, a brew cache-hit run restores
          # ccache's Cellar dir but skips installing those transitive deps,
          # and ccache fails at runtime with `dyld: Library not loaded`.
-          # nlohmann-json is header-only and required by the ds4 backend
-          # (dsml_renderer.cpp includes <nlohmann/json.hpp>); on Linux it comes
-          # from the apt-installed nlohmann-json3-dev in the build image.
-          # opus + pkg-config are required by the opus go backend: its
-          # Makefile/package.sh call `pkg-config --cflags/--libs opus` to build
-          # libopusshim.dylib and to locate libopus.dylib for bundling. brew's
-          # pkg-config defaults its search path to the Homebrew prefix so the
-          # opus.pc is found.
-          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd nlohmann-json opus pkg-config
+          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd
          # Force-reinstall ccache so brew re-validates its full runtime-dep
          # closure on every run. This is the durable fix: when the upstream
          # ccache formula gains a new transitive dep (as it has multiple times
@@ -138,7 +128,7 @@ jobs:
          # and decides "already installed" without re-linking, so on a cache-
          # hit run the formulas aren't on PATH. Force-link them; --overwrite
          # tolerates pre-existing symlinks from earlier installs.
-          brew link --overwrite protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd nlohmann-json opus pkg-config 2>/dev/null || true
+          brew link --overwrite protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd 2>/dev/null || true

      - name: Save Homebrew cache
        if: github.event_name != 'pull_request' && steps.brew-cache.outputs.cache-hit != 'true'
@@ -158,8 +148,6 @@ jobs:
            /opt/homebrew/Cellar/hiredis
            /opt/homebrew/Cellar/xxhash
            /opt/homebrew/Cellar/zstd
-            /opt/homebrew/Cellar/nlohmann-json
-            /opt/homebrew/Cellar/opus
          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}

      # ---- ccache for llama.cpp CMake builds ----
@@ -235,17 +223,8 @@ jobs:
        run: |
          make backends/ds4-darwin

-      # privacy-filter is a C++/ggml backend like ds4 - a single grpc-server with
-      # otool dylib bundling - so it gets its own bespoke darwin script rather than
-      # the generic build-darwin-go-backend path.
-      - name: Build privacy-filter backend (Darwin Metal)
-        if: inputs.backend == 'privacy-filter'
-        run: |
-          make protogen-go
-          make backends/privacy-filter-darwin
-
      - name: Build ${{ inputs.backend }}-darwin
-        if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4' && inputs.backend != 'privacy-filter'
+        if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4'
        run: |
          make protogen-go
          BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend
--- a/.github/workflows/backend_merge.yml
+++ b/.github/workflows/backend_merge.yml
@@ -49,7 +49,7 @@ jobs:
      # Sparse checkout: the merge job needs `.github/scripts/` (for the
      # keepalive cleanup script) but none of the source tree.
      - name: Checkout (.github/scripts only)
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          sparse-checkout: |
            .github/scripts
--- a/.github/workflows/backend_pr.yml
+++ b/.github/workflows/backend_pr.yml
@@ -23,7 +23,7 @@ jobs:
      has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6

      - name: Setup Bun
        uses: oven-sh/setup-bun@v2
--- a/.github/workflows/base-images.yml
+++ b/.github/workflows/base-images.yml
@@ -127,7 +127,7 @@ jobs:
            # the original l4t matrix entry which set skip-drivers: 'true'.
            skip-drivers: 'true'
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
        with:
          submodules: false
      - name: Free disk space
--- a/.github/workflows/build-test.yaml
+++ b/.github/workflows/build-test.yaml
@@ -11,7 +11,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Set up Go
@@ -25,7 +25,7 @@ jobs:
    runs-on: macos-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Set up Go
@@ -47,7 +47,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Configure apt mirror on runner
--- a/.github/workflows/bump-inference-defaults.yml
+++ b/.github/workflows/bump-inference-defaults.yml
@@ -14,7 +14,7 @@ jobs:
  bump:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6

      - uses: actions/setup-go@v5
        with:
--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -26,10 +26,6 @@ jobs:
            variable: "DS4_VERSION"
            branch: "main"
            file: "backend/cpp/ds4/Makefile"
-          - repository: "localai-org/privacy-filter.cpp"
-            variable: "PRIVACY_FILTER_VERSION"
-            branch: "master"
-            file: "backend/cpp/privacy-filter/Makefile"
          - repository: "ggml-org/whisper.cpp"
            variable: "WHISPER_CPP_VERSION"
            branch: "master"
@@ -42,22 +38,6 @@ jobs:
            variable: "PARAKEET_VERSION"
            branch: "master"
            file: "backend/go/parakeet-cpp/Makefile"
-          - repository: "mudler/ced.cpp"
-            variable: "CED_VERSION"
-            branch: "master"
-            file: "backend/go/ced/Makefile"
-          - repository: "mudler/voice-detect.cpp"
-            variable: "VOICEDETECT_VERSION"
-            branch: "master"
-            file: "backend/go/voice-detect/Makefile"
-          - repository: "mudler/face-detect.cpp"
-            variable: "FACEDETECT_VERSION"
-            branch: "master"
-            file: "backend/go/face-detect/Makefile"
-          - repository: "mudler/depth-anything.cpp"
-            variable: "DEPTHANYTHING_VERSION"
-            branch: "master"
-            file: "backend/go/depth-anything-cpp/Makefile"
          - repository: "leejet/stable-diffusion.cpp"
            variable: "STABLEDIFFUSION_GGML_VERSION"
            branch: "master"
@@ -82,25 +62,17 @@ jobs:
            variable: "RFDETR_VERSION"
            branch: "main"
            file: "backend/go/rfdetr-cpp/Makefile"
-          - repository: "mudler/locate-anything.cpp"
-            variable: "LOCATEANYTHING_VERSION"
-            branch: "master"
-            file: "backend/go/locate-anything-cpp/Makefile"
-          - repository: "ServeurpersoCom/qwentts.cpp"
+          - repository: "predict-woo/qwen3-tts.cpp"
            variable: "QWEN3TTS_CPP_VERSION"
-            branch: "master"
+            branch: "main"
            file: "backend/go/qwen3-tts-cpp/Makefile"
-          - repository: "ServeurpersoCom/omnivoice.cpp"
-            variable: "OMNIVOICE_VERSION"
-            branch: "master"
-            file: "backend/go/omnivoice-cpp/Makefile"
          - repository: "localai-org/vibevoice.cpp"
            variable: "VIBEVOICE_CPP_VERSION"
            branch: "master"
            file: "backend/go/vibevoice-cpp/Makefile"
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
      - name: Bump dependencies 🔧
        id: bump
        run: |
@@ -136,7 +108,7 @@ jobs:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
      - name: Bump vLLM cu130 wheel pin 🔧
        id: bump
        run: |
@@ -162,39 +134,3 @@ jobs:
          branch: "update/VLLM_VERSION"
          body: ${{ steps.bump.outputs.message }}
          signoff: true
-
-  bump-vllm-metal:
-    # The darwin (Apple Silicon) vLLM build installs vllm-metal, which is locked
-    # to a specific vLLM source release. install.sh pins both VLLM_METAL_VERSION
-    # (the wheel release) and VLLM_VERSION (the vLLM it builds against); this job
-    # tracks vllm-project/vllm-metal and rewrites both atomically. Separate from
-    # bump-vllm-wheel because darwin follows vllm-metal, not vllm/vllm latest.
-    if: github.repository == 'mudler/LocalAI'
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v7
-      - name: Bump vllm-metal pin 🔧
-        id: bump
-        run: |
-          bash .github/bump_vllm_metal.sh vllm-project/vllm-metal backend/python/vllm/install.sh VLLM_METAL_VERSION
-          {
-            echo 'message<<EOF'
-            cat "VLLM_METAL_VERSION_message.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          {
-            echo 'commit<<EOF'
-            cat "VLLM_METAL_VERSION_commit.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          rm -rfv VLLM_METAL_VERSION_message.txt VLLM_METAL_VERSION_commit.txt
-      - name: Create Pull Request
-        uses: peter-evans/create-pull-request@v8
-        with:
-          token: ${{ secrets.UPDATE_BOT_TOKEN }}
-          push-to-fork: ci-forks/LocalAI
-          commit-message: ':arrow_up: Update vllm-project/vllm-metal (darwin)'
-          title: 'chore: :arrow_up: Update vllm-metal (darwin) to `${{ steps.bump.outputs.commit }}`'
-          branch: "update/VLLM_METAL_VERSION"
-          body: ${{ steps.bump.outputs.message }}
-          signoff: true
--- a/.github/workflows/bump_docs.yaml
+++ b/.github/workflows/bump_docs.yaml
@@ -13,7 +13,7 @@ jobs:
          - repository: "mudler/LocalAI"
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
      - name: Bump dependencies 🔧
        run: |
          bash .github/bump_docs.sh ${{ matrix.repository }}
--- a/.github/workflows/checksum_checker.yaml
+++ b/.github/workflows/checksum_checker.yaml
@@ -8,7 +8,7 @@ jobs:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
      - name: Configure apt mirror on runner
        uses: ./.github/actions/configure-apt-mirror
      - name: Install dependencies
--- a/.github/workflows/deploy-explorer.yaml
+++ b/.github/workflows/deploy-explorer.yaml
@@ -16,7 +16,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - uses: actions/setup-go@v5
--- a/.github/workflows/gallery-agent.yaml
+++ b/.github/workflows/gallery-agent.yaml
@@ -31,7 +31,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          token: ${{ secrets.GITHUB_TOKEN }}

--- a/.github/workflows/generate_intel_image.yaml
+++ b/.github/workflows/generate_intel_image.yaml
@@ -44,7 +44,7 @@ jobs:
        uses: docker/setup-buildx-action@master

      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6

      - name: Cache Intel images
        uses: docker/build-push-action@v7
--- a/.github/workflows/gh-pages.yml
+++ b/.github/workflows/gh-pages.yml
@@ -28,7 +28,7 @@ jobs:
      HUGO_VERSION: "0.146.3"
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0  # needed for enableGitInfo
          submodules: true
--- a/.github/workflows/image_build.yml
+++ b/.github/workflows/image_build.yml
@@ -80,7 +80,7 @@ jobs:
    steps:

      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6

      - name: Configure apt mirror on runner
        id: apt_mirror
--- a/.github/workflows/image_merge.yml
+++ b/.github/workflows/image_merge.yml
@@ -36,7 +36,7 @@ jobs:
      # Sparse checkout: needed for .github/scripts/ (the keepalive cleanup
      # script). Skips the rest of the source tree.
      - name: Checkout (.github/scripts only)
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          sparse-checkout: |
            .github/scripts
--- a/.github/workflows/lint.yml
+++ b/.github/workflows/lint.yml
@@ -20,7 +20,7 @@ jobs:
  golangci-lint:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
        with:
          # Full history so golangci-lint's new-from-merge-base can reach
          # origin/master and compute the diff against it.
--- a/.github/workflows/release.yaml
+++ b/.github/workflows/release.yaml
@@ -10,7 +10,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Set up Go
@@ -24,35 +24,20 @@ jobs:
          args: release --clean
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          MACOS_SIGN_P12: ${{ secrets.MACOS_CERTIFICATE }}
-          MACOS_SIGN_PASSWORD: ${{ secrets.MACOS_CERTIFICATE_PWD }}
-          MACOS_NOTARY_KEY: ${{ secrets.MACOS_NOTARY_KEY }}
-          MACOS_NOTARY_KEY_ID: ${{ secrets.MACOS_NOTARY_KEY_ID }}
-          MACOS_NOTARY_ISSUER_ID: ${{ secrets.MACOS_NOTARY_ISSUER_ID }}
  launcher-build-darwin:
    runs-on: macos-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version: 1.23
-      - name: Import signing certificate
-        env:
-          MACOS_CERTIFICATE: ${{ secrets.MACOS_CERTIFICATE }}
-          MACOS_CERTIFICATE_PWD: ${{ secrets.MACOS_CERTIFICATE_PWD }}
-          MACOS_CI_KEYCHAIN_PWD: ${{ secrets.MACOS_CI_KEYCHAIN_PWD }}
-        run: bash contrib/macos/sign-and-notarize.sh import-cert
-      - name: Build, sign and notarize the DMG
-        env:
-          MACOS_SIGN_IDENTITY: ${{ secrets.MACOS_SIGN_IDENTITY }}
-          MACOS_NOTARY_KEY: ${{ secrets.MACOS_NOTARY_KEY }}
-          MACOS_NOTARY_KEY_ID: ${{ secrets.MACOS_NOTARY_KEY_ID }}
-          MACOS_NOTARY_ISSUER_ID: ${{ secrets.MACOS_NOTARY_ISSUER_ID }}
-        run: make release-launcher-darwin
+      - name: Build launcher for macOS ARM64
+        run: |
+          make build-launcher-darwin
      - name: Upload DMG to Release
        uses: softprops/action-gh-release@v3
        with:
@@ -61,7 +46,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Configure apt mirror on runner
--- a/.github/workflows/secscan.yaml
+++ b/.github/workflows/secscan.yaml
@@ -14,17 +14,14 @@ jobs:
      GO111MODULE: on
    steps:
      - name: Checkout Source
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        if: ${{ github.actor != 'dependabot[bot]' }}
      - name: Run Gosec Security Scanner
        if: ${{ github.actor != 'dependabot[bot]' }}
        uses: securego/gosec@v2.27.1
        with:
          # we let the report trigger content trigger a failure using the GitHub Security features.
-          # backend/go/supertonic is excluded: it vendors upstream supertone-inc/supertonic
-          # (helper.go), whose findings (G304 model-file loads, G404 math/rand for flow-matching
-          # noise, G104 unhandled errors) are inherent to that upstream code, not ours to rewrite.
-          args: '-no-fail -exclude-dir=backend/go/supertonic -fmt sarif -out results.sarif ./...'
+          args: '-no-fail -fmt sarif -out results.sarif ./...'
      - name: Upload SARIF file
        if: ${{ github.actor != 'dependabot[bot]' }}
        uses: github/codeql-action/upload-sarif@v4
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
@@ -38,7 +38,6 @@ jobs:
      acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
      qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
      rfdetr-cpp: ${{ steps.detect.outputs.rfdetr-cpp }}
-      locate-anything-cpp: ${{ steps.detect.outputs.locate-anything-cpp }}
      vibevoice-cpp: ${{ steps.detect.outputs.vibevoice-cpp }}
      localvqe: ${{ steps.detect.outputs.localvqe }}
      voxtral: ${{ steps.detect.outputs.voxtral }}
@@ -50,7 +49,7 @@ jobs:
      parakeet-cpp: ${{ steps.detect.outputs.parakeet-cpp }}
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
      - name: Setup Bun
        uses: oven-sh/setup-bun@v2
      - name: Install dependencies
@@ -67,7 +66,7 @@ jobs:
  #   runs-on: ubuntu-latest
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -90,7 +89,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -113,7 +112,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -137,7 +136,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -158,7 +157,7 @@ jobs:
  #  runs-on: ubuntu-latest
  #  steps:
  #    - name: Clone
-  #      uses: actions/checkout@v7
+  #      uses: actions/checkout@v6
  #      with:
  #        submodules: true
  #    - name: Dependencies
@@ -178,7 +177,7 @@ jobs:
  #   runs-on: ubuntu-latest
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -240,7 +239,7 @@ jobs:
  #           sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
  #           df -h
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -265,7 +264,7 @@ jobs:
  #   runs-on: ubuntu-latest
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -288,7 +287,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -309,7 +308,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -330,7 +329,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -351,7 +350,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -373,7 +372,7 @@ jobs:
  #   timeout-minutes: 45
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -394,7 +393,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -415,7 +414,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -436,7 +435,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -462,7 +461,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -484,7 +483,7 @@ jobs:
    timeout-minutes: 30
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -513,7 +512,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -530,7 +529,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -552,7 +551,7 @@ jobs:
    timeout-minutes: 20
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -564,7 +563,7 @@ jobs:
      - name: Run e2e-backends smoke
        env:
          BACKEND_IMAGE: quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp
-          BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias,tokenize
+          BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias
        run: |
          make test-extra-backend
  # Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked LLM.
@@ -579,7 +578,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -604,7 +603,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -625,7 +624,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -645,7 +644,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -664,7 +663,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -681,7 +680,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -698,7 +697,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -741,7 +740,7 @@ jobs:
  #   timeout-minutes: 90
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -783,7 +782,7 @@ jobs:
  #   timeout-minutes: 90
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -808,7 +807,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -840,7 +839,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -876,7 +875,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -902,45 +901,6 @@ jobs:
      - name: Test rfdetr-cpp
        run: |
          make --jobs=5 --output-sync=target -C backend/go/rfdetr-cpp test
-  # Per-backend e2e for locate-anything-cpp: builds the .so + Go binary and
-  # runs `make -C backend/go/locate-anything-cpp test`. test.sh fetches the
-  # locate-anything-q8_0 GGUF (~6.3 GB, NVIDIA LocateAnything-3B) from the
-  # published mudler/locate-anything.cpp-gguf HF repo + a COCO image, then the
-  # Go wire test loads the model and runs an open-vocabulary Detect, asserting
-  # at least one labeled box. Heavier than the other Go backends (it is a 3B),
-  # so it is gated to changes under backend/go/locate-anything-cpp/.
-  tests-locate-anything-cpp:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.locate-anything-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-        with:
-          submodules: true
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y build-essential cmake curl libopenblas-dev
-      - name: Setup Go
-        uses: actions/setup-go@v5
-      - name: Display Go version
-        run: go version
-      - name: Proto Dependencies
-        run: |
-          # Install protoc
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Build locate-anything-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/locate-anything-cpp
-      - name: Test locate-anything-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/locate-anything-cpp test
  # Per-backend smoke for vibevoice-cpp: builds the .so + Go binary and
  # runs `make -C backend/go/vibevoice-cpp test`. test.sh auto-downloads
  # the published mudler/vibevoice.cpp-models bundle (TTS Q8_0 + ASR Q4_K
@@ -952,7 +912,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -987,7 +947,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -1008,16 +968,12 @@ jobs:
  # image + working dir.
  tests-vibevoice-cpp-grpc-transcription:
    needs: detect-changes
-    # Skip on release tag pushes: the ASR Q4_K model is ~10 GB and cannot be
-    # pulled from HF within the inner `go test -timeout 30m` budget on a CI
-    # runner, so every tag build hung and timed out. Still runs on PRs/branch
-    # pushes that touch vibevoice-cpp so regressions are caught off the release path.
-    if: (needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true') && !startsWith(github.ref, 'refs/tags/')
+    if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: bigger-runner
    timeout-minutes: 150
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -1046,7 +1002,7 @@ jobs:
    timeout-minutes: 60
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -1062,7 +1018,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -1095,7 +1051,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -1118,7 +1074,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -1144,7 +1100,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -21,7 +21,7 @@ jobs:
        go-version: ['1.26.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Free disk space
@@ -84,7 +84,7 @@ jobs:
        go-version: ['1.26.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go ${{ matrix.go-version }}
@@ -121,19 +121,3 @@ jobs:
          detached: true
          connect-timeout-seconds: 180
          limit-access-to-actor: true
-
-  # Fast standalone unit tests for the backends' pure C++ helpers - currently the
-  # llama-cpp message reconstruction (backend/cpp/llama-cpp/message_content.h),
-  # which guards the OpenAI chat content normalization (mudler/LocalAI#10524,
-  # #7324, #7528). The runner discovers every *_test.cpp under backend/cpp/, so
-  # new pure-C++ unit tests are picked up with no CI changes. These need only the
-  # C++ stdlib + nlohmann/json, so they run on every PR without the full
-  # llama.cpp + gRPC backend build. (The same suite is also wired as an opt-in
-  # CMake/ctest target, -DLLAMA_GRPC_BUILD_TESTS=ON, for in-backend-build runs.)
-  tests-backend-cpp:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-      - name: Run backend C++ unit tests
-        run: make test-backend-cpp
--- a/.github/workflows/tests-aio.yml
+++ b/.github/workflows/tests-aio.yml
@@ -62,7 +62,7 @@ jobs:
          sudo rm -rfv build || true
          df -h
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
--- a/.github/workflows/tests-e2e.yml
+++ b/.github/workflows/tests-e2e.yml
@@ -21,7 +21,7 @@ jobs:
        go-version: ['1.25.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Configure apt mirror on runner
--- a/.github/workflows/tests-pii-ner-e2e.yml
+++ b/.github/workflows/tests-pii-ner-e2e.yml
@@ -1,97 +0,0 @@
---
-name: 'PII NER tier E2E (live GGUF, CPU)'
-
-# Runs the real privacy-filter GGUF NER tier end-to-end on CPU — the gap the
-# hermetic tests/e2e suite cannot cover (it only exercises the in-process
-# pattern tier). Heavy (builds the C++ backend image + downloads a ~2.7 GB
-# GGUF), so it is path-filtered on PRs and otherwise runs nightly / on demand.
-#
-# This drives the container-level harness (tests/e2e-backends) via
-# `make test-extra-backend-privacy-filter`: it builds the privacy-filter image,
-# downloads the model, loads it on CPU, and asserts byte-correct, UTF-8-aligned
-# TokenClassify spans. The complementary HTTP-path specs in tests/e2e
-# (e2e_pii_ner_test.go) Skip unless PII_NER_MODEL_GGUF is wired.
-
-on:
-  workflow_dispatch:
-  schedule:
-    - cron: '0 3 * * *'
-  push:
-    branches:
-      - master
-    paths:
-      - 'backend/cpp/privacy-filter/**'
-      - 'backend/Dockerfile.privacy-filter'
-      - 'core/services/routing/pii/**'
-      - 'core/services/routing/piidetector/**'
-      - 'core/backend/token_classify.go'
-      - 'core/http/endpoints/localai/pii.go'
-      - 'core/schema/pii.go'
-      - 'tests/e2e-backends/**'
-      - 'tests/e2e/e2e_pii_ner_test.go'
-      - 'tests/e2e/e2e_suite_test.go'
-      - '.github/workflows/tests-pii-ner-e2e.yml'
-  pull_request:
-    paths:
-      - 'backend/cpp/privacy-filter/**'
-      - 'backend/Dockerfile.privacy-filter'
-      - 'core/services/routing/pii/**'
-      - 'core/services/routing/piidetector/**'
-      - 'core/backend/token_classify.go'
-      - 'core/http/endpoints/localai/pii.go'
-      - 'core/schema/pii.go'
-      - 'tests/e2e-backends/**'
-      - 'tests/e2e/e2e_pii_ner_test.go'
-      - 'tests/e2e/e2e_suite_test.go'
-      - '.github/workflows/tests-pii-ner-e2e.yml'
-
-concurrency:
-  group: ci-tests-pii-ner-e2e-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-
-jobs:
-  tests-pii-ner-e2e:
-    runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        go-version: ['1.25.x']
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-        with:
-          submodules: true
-      - name: Free disk space
-        run: |
-          sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc /opt/hostedtoolcache/CodeQL || true
-          sudo docker image prune --all --force || true
-          df -h
-      - name: Configure apt mirror on runner
-        uses: ./.github/actions/configure-apt-mirror
-      - name: Setup Go ${{ matrix.go-version }}
-        uses: actions/setup-go@v5
-        with:
-          go-version: ${{ matrix.go-version }}
-          cache: false
-      - name: Proto Dependencies
-        run: |
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y build-essential
-      # Builds local-ai-backend:privacy-filter, downloads the GGUF, loads it on
-      # CPU and runs the token_classify capability spec (byte-offset contract).
-      - name: Run live PII NER backend E2E
-        run: PATH="$PATH:$HOME/go/bin" make test-extra-backend-privacy-filter
-      - name: Setup tmate session if tests fail
-        if: ${{ failure() }}
-        uses: mxschmitt/action-tmate@v3.23
-        with:
-          detached: true
-          connect-timeout-seconds: 180
-          limit-access-to-actor: true
--- a/.github/workflows/tests-ui-e2e.yml
+++ b/.github/workflows/tests-ui-e2e.yml
@@ -23,7 +23,7 @@ jobs:
        go-version: ['1.26.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Configure apt mirror on runner
--- a/.github/workflows/update_swagger.yaml
+++ b/.github/workflows/update_swagger.yaml
@@ -10,7 +10,7 @@ jobs:
      fail-fast: false
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
      - name: Configure apt mirror on runner
        uses: ./.github/actions/configure-apt-mirror
      - uses: actions/setup-go@v5
--- a/.gitignore
+++ b/.gitignore
@@ -91,9 +91,3 @@ core/http/react-ui/test-results/

 # Local worktrees
 .worktrees/
-
-# SDD / brainstorm scratch (agent-driven development)
-.superpowers/
-
-# Local Apple signing material (never commit)
-.certs/
--- a/.golangci.yml
+++ b/.golangci.yml
@@ -74,8 +74,6 @@ linters:
    paths:
      # Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
      - 'backend/go/whisper/sources'
-      # Vendored upstream supertonic pipeline (supertone-inc/supertonic go/helper.go).
-      - 'backend/go/supertonic/helper.go'
      - 'docs/'
    rules:
      # CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
--- a/.goreleaser.yaml
+++ b/.goreleaser.yaml
@@ -9,8 +9,7 @@ source:
  enabled: true
  name_template: '{{ .ProjectName }}-{{ .Tag }}-source'
 builds:
-  - id: local-ai
-    main: ./cmd/local-ai
+  - main: ./cmd/local-ai
    env:
      - CGO_ENABLED=0
    ldflags:
@@ -36,19 +35,3 @@ snapshot:
  version_template: "{{ .Tag }}-next"
 changelog:
  use: github-native
-# Sign + notarize the macOS server binary via the quill backend (runs on Linux,
-# no macOS runner needed). Disabled automatically when MACOS_SIGN_P12 is unset
-# (forks / PRs), so those builds stay unsigned and green.
-notarize:
-  macos:
-    - enabled: '{{ isEnvSet "MACOS_SIGN_P12" }}'
-      ids:
-        - local-ai
-      sign:
-        certificate: "{{.Env.MACOS_SIGN_P12}}"
-        password: "{{.Env.MACOS_SIGN_PASSWORD}}"
-      notarize:
-        issuer_id: "{{.Env.MACOS_NOTARY_ISSUER_ID}}"
-        key_id: "{{.Env.MACOS_NOTARY_KEY_ID}}"
-        key: "{{.Env.MACOS_NOTARY_KEY}}"
-        wait: true
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -43,5 +43,4 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 - **New API endpoints**: LocalAI advertises its capability surface in several independent places — swagger `@Tags`, `/api/instructions` registry, auth `RouteFeatureRegistry`, React UI `capabilities.js`, docs. Read [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) and follow its checklist — missing any surface means clients, admins, and the UI won't know the endpoint exists.
 - **Admin endpoints → MCP tool**: every admin endpoint that an admin would manage conversationally (install/list/edit/toggle/upgrade) MUST also be exposed as an MCP tool in `pkg/mcp/localaitools/`. The LocalAI Assistant chat modality and the standalone `local-ai mcp-server` consume that package; drift between REST and MCP is a real risk. Read [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) — the `TestToolHTTPRouteMappingComplete` test fails until you wire the new tool and update the route map.
 - **Build**: Inspect `Makefile` and `.github/workflows/` — ask the user before running long builds
- **Backend OS coverage**: a new backend must target every OS it can build for, not just Linux. `.github/backend-matrix.yml` has two matrices — `include:` (Linux) and `includeDarwin:` (macOS / Apple Silicon). Most C/C++/GGML and many Python backends build on Darwin too — wire the `includeDarwin` entry + `backend/index.yaml` `metal:` entries, or say in the PR why an OS is unsupported. See the darwin checklist in [.agents/adding-backends.md](.agents/adding-backends.md).
 - **UI**: The active UI is the React app in `core/http/react-ui/`. The older Alpine.js/HTML UI in `core/http/static/` is pending deprecation — all new UI work goes in the React UI
--- a/1
+++ b/1
@@ -108,7 +108,6 @@ RUN <<EOT bash
        apt-get update && \
        apt-get install -y --no-install-recommends \
            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
-            cuda-nvrtc-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
--- a/88
+++ b/88
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter backends/privacy-filter-darwin
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio

 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -103,7 +103,7 @@ COVERAGE_E2E_LABELS?=!real-models
 COVERAGE_EXCLUDE_RE?=grpc/proto/.*[.]pb[.]go


-.PHONY: all test test-coverage test-coverage-baseline test-coverage-check test-backend-cpp test-ui test-ui-coverage-baseline test-ui-coverage-check install-hooks build vendor lint lint-all
+.PHONY: all test test-coverage test-coverage-baseline test-coverage-check test-ui test-ui-coverage-baseline test-ui-coverage-check install-hooks build vendor lint lint-all

 all: help

@@ -180,7 +180,7 @@ osx-signed: build

 ## Run
 run: ## run local-ai
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./cmd/local-ai
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./

 prepare-test: protogen-go build-mock-backend

@@ -201,13 +201,6 @@ test: prepare-test
 	OPUS_SHIM_LIBRARY=$(abspath ./pkg/opus/shim/libopusshim.so) \
 	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) --fail-fast -v -r $(TEST_PATHS)

-## Compiles and runs the standalone C++ unit tests for the backends (pure
-## helpers that depend only on the stdlib + nlohmann/json, no full backend
-## build). Discovers every *_test.cpp under backend/cpp/ - see
-## backend/cpp/run-unit-tests.sh. Set NLOHMANN_INCLUDE to skip the header fetch.
-test-backend-cpp:
-	bash backend/cpp/run-unit-tests.sh
-
 ## Runs the core suite ($(TEST_PATHS)) with statement-coverage instrumentation
 ## and writes a merged profile to $(COVERAGE_PROFILE). Deliberately omits
 ## --fail-fast so a single failure doesn't truncate the coverage number, and
@@ -316,20 +309,13 @@ run-e2e-aio: protogen-go
 	@echo 'Running e2e AIO tests'
 	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e-aio

-# Distributed architecture e2e (PostgreSQL + NATS via testcontainers).
-# Includes NatsJWT specs (JWT-enabled NATS). Requires Docker.
-# VLLMMultinode is excluded here; use test-e2e-vllm-multinode for that.
-test-e2e-distributed: protogen-go
-	@echo 'Running distributed e2e tests (label Distributed, incl. NatsJWT)'
-	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter='Distributed && !VLLMMultinode' --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e/distributed
-
 # vLLM multi-node DP smoke (CPU). Builds local-ai:tests and the
 # cpu-vllm backend from the current working tree, then drives a
 # head + headless follower via testcontainers-go and asserts a chat
 # completion. BuildKit caches both images, so re-runs only rebuild
 # what changed. The test lives under tests/e2e/distributed and is
 # selected by the VLLMMultinode label so it doesn't run alongside
-# test-e2e-distributed.
+# the other distributed-suite tests by default.
 test-e2e-vllm-multinode: docker-build-e2e extract-backend-vllm protogen-go
 	@echo 'Running e2e vLLM multi-node DP test'
 	LOCALAI_IMAGE=local-ai \
@@ -573,7 +559,6 @@ prepare-test-extra: protogen-python
 	$(MAKE) -C backend/python/speaker-recognition
 	$(MAKE) -C backend/rust/kokoros kokoros-grpc
 	$(MAKE) -C backend/go/rfdetr-cpp
-	$(MAKE) -C backend/go/locate-anything-cpp

 test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/transformers test
@@ -601,9 +586,6 @@ test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/speaker-recognition test
 	$(MAKE) -C backend/rust/kokoros test
 	$(MAKE) -C backend/go/rfdetr-cpp test
-	$(MAKE) -C backend/go/locate-anything-cpp test
-	$(MAKE) -C backend/go/depth-anything-cpp test
-	$(MAKE) -C backend/go/supertonic test

 ##
 ## End-to-end gRPC tests that exercise a built backend container image.
@@ -697,16 +679,6 @@ test-extra-backend-llama-cpp-transcription: docker-build-llama-cpp
 	BACKEND_TEST_CTX_SIZE=2048 \
 	$(MAKE) test-extra-backend

-## privacy-filter: the PII/NER token-classification backend. Exercises the
-## TokenClassify RPC and asserts byte-correct, UTF-8-aligned span offsets
-## against the openai-privacy-filter multilingual GGUF (CPU-runnable, ~50M
-## active params). This is the live-backend coverage for the PII NER tier.
-test-extra-backend-privacy-filter: docker-build-privacy-filter
-	BACKEND_IMAGE=local-ai-backend:privacy-filter \
-	BACKEND_TEST_MODEL_URL=https://huggingface.co/LocalAI-io/privacy-filter-multilingual-GGUF/resolve/main/privacy-filter-multilingual-f16.gguf \
-	BACKEND_TEST_CAPS=health,load,token_classify \
-	$(MAKE) test-extra-backend
-
 ## vllm is resolved from a HuggingFace model id (no file download) and
 ## exercises Predict + streaming + tool-call extraction via the hermes parser.
 ## Requires a host CPU with the SIMD instructions the prebuilt vllm CPU
@@ -1136,10 +1108,6 @@ backends/ds4-darwin: build
 	bash ./scripts/build/ds4-darwin.sh
 	./local-ai backends install "ocifile://$(abspath ./backend-images/ds4.tar)"

-backends/privacy-filter-darwin: build
-	bash ./scripts/build/privacy-filter-darwin.sh
-	./local-ai backends install "ocifile://$(abspath ./backend-images/privacy-filter.tar)"
-
 build-darwin-python-backend: build
 	bash ./scripts/build/python-darwin.sh

@@ -1185,10 +1153,6 @@ BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
 # Single-model; hardware-only validation lives at tests/e2e-backends/
 # (BACKEND_BINARY mode); see docs/superpowers/plans/2026-05-11-ds4-backend.md.
 BACKEND_DS4 = ds4|ds4|.|false|false
-# privacy-filter wraps the standalone privacy-filter.cpp GGML engine (the
-# openai-privacy-filter PII/NER token classifier) — the TokenClassify RPC for
-# the PII redactor tier, on stock ggml with no llama.cpp carry-patches.
-BACKEND_PRIVACY_FILTER = privacy-filter|privacy-filter|.|false|false

 # Golang backends
 BACKEND_PIPER = piper|golang|.|false|true
@@ -1200,16 +1164,13 @@ BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|tr
 BACKEND_WHISPER = whisper|golang|.|false|true
 BACKEND_CRISPASR = crispasr|golang|.|false|true
 BACKEND_PARAKEET_CPP = parakeet-cpp|golang|.|false|true
-BACKEND_DEPTH_ANYTHING_CPP = depth-anything-cpp|golang|.|false|true
 BACKEND_VOXTRAL = voxtral|golang|.|false|true
 BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
 BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
-BACKEND_OMNIVOICE_CPP = omnivoice-cpp|golang|.|false|true
 BACKEND_VIBEVOICE_CPP = vibevoice-cpp|golang|.|false|true
 BACKEND_LOCALVQE = localvqe|golang|.|false|true
 BACKEND_OPUS = opus|golang|.|false|true
 BACKEND_SHERPA_ONNX = sherpa-onnx|golang|.|false|true
-BACKEND_SUPERTONIC = supertonic|golang|.|false|true

 # Python backends with root context
 BACKEND_RERANKERS = rerankers|python|.|false|true
@@ -1283,7 +1244,6 @@ $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
 $(eval $(call generate-docker-build-target,$(BACKEND_DS4)))
-$(eval $(call generate-docker-build-target,$(BACKEND_PRIVACY_FILTER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_CLOUD_PROXY)))
@@ -1293,7 +1253,6 @@ $(eval $(call generate-docker-build-target,$(BACKEND_STABLEDIFFUSION_GGML)))
 $(eval $(call generate-docker-build-target,$(BACKEND_WHISPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_CRISPASR)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PARAKEET_CPP)))
-$(eval $(call generate-docker-build-target,$(BACKEND_DEPTH_ANYTHING_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VOXTRAL)))
 $(eval $(call generate-docker-build-target,$(BACKEND_OPUS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_RERANKERS)))
@@ -1326,7 +1285,6 @@ $(eval $(call generate-docker-build-target,$(BACKEND_WHISPERX)))
 $(eval $(call generate-docker-build-target,$(BACKEND_ACE_STEP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_ACESTEP_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_QWEN3_TTS_CPP)))
-$(eval $(call generate-docker-build-target,$(BACKEND_OMNIVOICE_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCALVQE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_MLX)))
@@ -1339,13 +1297,12 @@ $(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_RFDETR_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
-$(eval $(call generate-docker-build-target,$(BACKEND_SUPERTONIC)))

 # Pattern rule for docker-save targets
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar

-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-omnivoice-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy docker-build-supertonic docker-build-depth-anything-cpp docker-build-privacy-filter
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy

 ########################################################
 ### Mock Backend for E2E Tests
@@ -1460,32 +1417,13 @@ docs: docs/static/gallery.html
 ########################################################

 ## fyne cross-platform build
-# Build LocalAI.app from the launcher via fyne (metadata read from cmd/launcher/FyneApp.toml).
-# Signing happens via contrib/macos/sign-and-notarize.sh, which is a no-op when the signing
-# secrets are unset, so unsigned local/fork builds keep working.
-build-launcher-darwin:
-	rm -rf dist/LocalAI.app cmd/launcher/LocalAI.app
-	mkdir -p dist
-	cd cmd/launcher && go run fyne.io/tools/cmd/fyne@latest package -os darwin -icon ../../core/http/static/logo.png --executable $(LAUNCHER_BINARY_NAME)
-	mv cmd/launcher/LocalAI.app dist/LocalAI.app
-	bash contrib/macos/sign-and-notarize.sh sign dist/LocalAI.app
-
-# Wrap the (signed) app into a drag-to-Applications DMG via hdiutil, then sign the DMG.
-dmg-launcher-darwin: build-launcher-darwin
-	rm -rf dist/dmg dist/LocalAI.dmg
-	mkdir -p dist/dmg
-	cp -R dist/LocalAI.app dist/dmg/LocalAI.app
-	ln -s /Applications dist/dmg/Applications
-	hdiutil create -volname "LocalAI" -srcfolder dist/dmg -ov -format UDZO dist/LocalAI.dmg
-	bash contrib/macos/sign-and-notarize.sh sign dist/LocalAI.dmg
-
-# Submit the DMG to Apple notarization and staple the ticket (no-op without notary secrets).
-notarize-launcher-darwin: dmg-launcher-darwin
-	bash contrib/macos/sign-and-notarize.sh notarize dist/LocalAI.dmg
-
-# Single entrypoint for CI: build -> sign app -> dmg -> sign dmg -> notarize -> staple.
-release-launcher-darwin: notarize-launcher-darwin
-	@echo "dist/LocalAI.dmg is ready"
+build-launcher-darwin: build-launcher
+	go run github.com/tiagomelo/macos-dmg-creator/cmd/createdmg@latest \
+	--appName "LocalAI" \
+	--appBinaryPath "$(LAUNCHER_BINARY_NAME)" \
+	--bundleIdentifier "com.localai.launcher" \
+	--iconPath "core/http/static/logo.png" \
+	--outputDir "dist/"

 build-launcher-linux:
-	cd cmd/launcher && go run fyne.io/tools/cmd/fyne@latest package -os linux -icon ../../core/http/static/logo.png --executable $(LAUNCHER_BINARY_NAME)-linux && mv LocalAI.tar.xz ../../$(LAUNCHER_BINARY_NAME)-linux.tar.xz
+	cd cmd/launcher && go run fyne.io/tools/cmd/fyne@latest package -os linux -icon ../../core/http/static/logo.png --executable $(LAUNCHER_BINARY_NAME)-linux && mv launcher.tar.xz ../../$(LAUNCHER_BINARY_NAME)-linux.tar.xz
--- a/README.md
+++ b/README.md
@@ -29,18 +29,6 @@
 <a href="https://trendshift.io/repositories/5539" target="_blank"><img src="https://trendshift.io/api/badge/repositories/5539" alt="mudler%2FLocalAI | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 </p>

-<!-- Keep these links, translations synced daily. -->
-<p align="center">
-<a href="https://zdoc.app/de/mudler/LocalAI">Deutsch</a> |
-<a href="https://zdoc.app/es/mudler/LocalAI">Español</a> |
-<a href="https://zdoc.app/fr/mudler/LocalAI">français</a> |
-<a href="https://zdoc.app/ja/mudler/LocalAI">日本語</a> |
-<a href="https://zdoc.app/ko/mudler/LocalAI">한국어</a> |
-<a href="https://zdoc.app/pt/mudler/LocalAI">Português</a> |
-<a href="https://zdoc.app/ru/mudler/LocalAI">Русский</a> |
-<a href="https://zdoc.app/zh/mudler/LocalAI">中文</a>
-</p>
-
 **LocalAI** is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.

 **A small core, not a bundle.** Each backend wraps a best-in-class engine (llama.cpp, vLLM, whisper.cpp, stable-diffusion, MLX...) in its own image, pulled only when a model needs it. You install nothing you don't use.
@@ -161,27 +149,12 @@ local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
 local-ai run oci://localai/phi-2:latest
 ```

-To test a running LocalAI server from the terminal, open an interactive chat session from another shell. Inside the prompt, `/models` lists installed models and `/model <name>` switches between them.
-
-```bash
-# Terminal 1
-local-ai run llama-3.2-1b-instruct:q4_k_m
-
-# Terminal 2
-local-ai chat --model llama-3.2-1b-instruct:q4_k_m
-```
-
 > **Automatic Backend Detection**: LocalAI automatically detects your GPU capabilities and downloads the appropriate backend. For advanced options, see [GPU Acceleration](https://localai.io/features/gpu-acceleration/).

 For more details, see the [Getting Started guide](https://localai.io/basics/getting_started/).

 ## Latest News

-- **June 2026**: New native biometric backends from the LocalAI team: [voice-detect.cpp](https://github.com/mudler/voice-detect.cpp) for speaker recognition and voice analysis (ECAPA-TDNN, WeSpeaker, ERes2Net, CAM++, wav2vec2 age/gender/emotion) and [face-detect.cpp](https://github.com/mudler/face-detect.cpp) for face detection, recognition, demographics and anti-spoofing (SCRFD/ArcFace, YuNet/SFace). Both are from-scratch C++/ggml engines with no Python or onnxruntime at inference, self-contained GGUF weights, bit-exact parity with the reference, and GPU cuDNN parity, replacing the heavier Python `insightface` and `speaker-recognition` backends ([PR #10441](https://github.com/mudler/LocalAI/pull/10441)).
-- **June 2026**: New [realtime voice assistant demo](https://github.com/localai-org/localai-realtime-demo) (a tiny Go client for the Realtime API with a full talk-back voice loop and tool calling), plus [streaming of the realtime LLM / TTS / transcription pipeline stages](https://github.com/mudler/LocalAI/pull/10176) and [configurable WebRTC ICE candidates](https://github.com/mudler/LocalAI/pull/10231).
-- **June 2026**: Big speech push: the [parakeet.cpp](https://github.com/mudler/parakeet.cpp) ASR engine gains [NeMo-faithful segment timestamps](https://github.com/mudler/LocalAI/pull/10207), a [multilingual streaming Nemotron-3.5 model](https://github.com/mudler/LocalAI/pull/10199), [dynamic batching for concurrent transcription](https://github.com/mudler/LocalAI/pull/10112) and [CUDA graphs](https://github.com/mudler/LocalAI/pull/10273); the new [CrispASR backend](https://github.com/mudler/LocalAI/pull/10099) adds multi-architecture ASR + TTS, and [60 Piper TTS voices across 42 languages](https://github.com/mudler/LocalAI/pull/10296) land in the gallery (plus [per-request TTS instructions and params](https://github.com/mudler/LocalAI/pull/10172)).
-- **June 2026**: New backends and models: [locate-anything.cpp](https://github.com/mudler/LocalAI/pull/10264) for open-vocabulary object detection via ggml, [Ideogram4 image generation](https://github.com/mudler/LocalAI/pull/10201) in stablediffusion-ggml, [llama.cpp video input](https://github.com/mudler/LocalAI/pull/10216), and the [Gemma 4 QAT family with MTP speculative-decoding pairs](https://github.com/mudler/LocalAI/pull/10215). Plus an [interactive CLI chat mode](https://github.com/mudler/LocalAI/pull/10226) and [RAG source citations in agent responses](https://github.com/mudler/LocalAI/pull/10228).
-- **June 2026**: Distributed mode hardening: [prefix-cache-aware routing](https://github.com/mudler/LocalAI/pull/10071), a [production-ready request router with auto-sized embedding/rerank batches](https://github.com/mudler/LocalAI/pull/10104), [ds4 layer-split distributed inference](https://github.com/mudler/LocalAI/pull/10098), [NATS JWT auth + TLS/mTLS](https://github.com/mudler/LocalAI/pull/10159), and [resumable file uploads](https://github.com/mudler/LocalAI/pull/10109).
 - **May 2026**: **LocalAI 4.3.0** - `llama.cpp` [prompt cache on by default](https://github.com/mudler/LocalAI/pull/9925) (repeated system prompts collapse from minutes to seconds), [keyless cosign signing of backend OCI images](https://github.com/mudler/LocalAI/pull/9823), [per-API-key + per-user usage attribution](https://github.com/mudler/LocalAI/pull/9920), Distributed v3 with [per-request replica routing](https://github.com/mudler/LocalAI/pull/9968). [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.3.0)
 - **May 2026**: **LocalAI 4.2.0** - LocalAI sees and hears: [voice recognition](https://github.com/mudler/LocalAI/pull/9500), [face recognition + antispoofing liveness](https://github.com/mudler/LocalAI/pull/9480), speaker diarization. Plus [drop-in Ollama API](https://github.com/mudler/LocalAI/pull/9284), [video generation](https://github.com/mudler/LocalAI/pull/9420), redesigned UI with i18n + admin-configurable branding, vLLM at feature parity with llama.cpp, and 11 new backends. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.2.0)
 - **April 2026**: **LocalAI 4.1.0** - LocalAI becomes a control tower: distributed cluster mode with VRAM-aware smart routing + autoscaling, multi-user platform with OIDC and API keys, per-user quotas with predictive analytics, in-UI fine-tuning with TRL (auto-export to GGUF), on-the-fly quantization backend, visual pipeline editor. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.1.0)
@@ -221,29 +194,10 @@ For older news and full release notes, see [GitHub Releases](https://github.com/

 ## Supported Backends & Acceleration

-LocalAI supports **60+ backends** including llama.cpp, vLLM, SGLang, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).
+LocalAI supports **36+ backends** including llama.cpp, vLLM, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).

 See the full [Backend & Model Compatibility Table](https://localai.io/model-compatibility/) and [GPU Acceleration guide](https://localai.io/features/gpu-acceleration/).

-### Backends built by us
-
-Most backends wrap a best-in-class upstream engine. A handful of them are native C/C++/GGML engines (no Python at inference) developed and maintained by the LocalAI project itself:
-
-| Backend | What it does |
-|---------|-------------|
-| [parakeet.cpp](https://github.com/mudler/parakeet.cpp) | C++/GGML port of NVIDIA NeMo Parakeet ASR (tdt/ctc/rnnt/hybrid), with cache-aware streaming transcription |
-| [ced.cpp](https://github.com/mudler/ced.cpp) | C++/GGML port of the CED audio-tagging models: sound-event classification (527-class AudioSet) over REST and the realtime API for live recognition |
-| [voxtral.c](https://github.com/mudler/voxtral.c) | Voxtral Realtime 4B speech-to-text in pure C |
-| [vibevoice.cpp](https://github.com/mudler/vibevoice.cpp) | Native port of Microsoft VibeVoice for TTS (voice cloning) and long-form ASR with speaker diarization |
-| [rf-detr.cpp](https://github.com/mudler/rf-detr.cpp) | Native RF-DETR object detection and instance segmentation |
-| [locate-anything.cpp](https://github.com/mudler/locate-anything.cpp) | Open-vocabulary object detection and visual grounding (LocateAnything-3B) |
-| [depth-anything.cpp](https://github.com/mudler/depth-anything.cpp) | Depth Anything 3 monocular metric depth + camera pose estimation |
-| [privacy-filter.cpp](https://github.com/localai-org/privacy-filter.cpp) | Standalone GGML PII/NER token-classification engine powering LocalAI's PII redaction tier |
-| [LocalVQE](https://github.com/localai-org/LocalVQE) | Joint acoustic echo cancellation, noise suppression, and dereverberation |
-| [local-store](https://github.com/mudler/LocalAI) | Local-first vector database for embeddings (shipped in-tree) |
-
-We also maintain [apex-quant](https://github.com/localai-org/apex-quant), a per-tensor, per-layer quantization recipe for Mixture-of-Experts models that exploits their structural sparsity to produce GGUFs matching or beating Q8_0 quality - and they run out of the box on stock llama.cpp.
-
 ## Resources

 - [Documentation](https://localai.io/)
@@ -253,7 +207,7 @@ We also maintain [apex-quant](https://github.com/localai-org/apex-quant), a per-
 - [Integrations & community projects](https://localai.io/docs/integrations/)
 - [Installation video walkthrough](https://www.youtube.com/watch?v=cMVNnlqwfw4)
 - [Media & blog posts](https://localai.io/basics/news/#media-blogs-social)
-- [Examples](https://github.com/mudler/LocalAI-examples) — including the [realtime voice assistant demo](https://github.com/localai-org/localai-realtime-demo) (Go client for the Realtime API with tool calling)
+- [Examples](https://github.com/mudler/LocalAI-examples)

 ## Team

--- a/backend/Dockerfile.golang
+++ b/backend/Dockerfile.golang
@@ -65,12 +65,7 @@ RUN <<EOT bash
            libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
            git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
            ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
-            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils && \
-        apt-get install -y mesa-vulkan-drivers libdrm2
-        # Mesa Vulkan ICD drivers (ANV/RADV/lavapipe) + their manifests. The
-        # LunarG SDK below only provides the loader and shader tooling, not
-        # hardware drivers — without Mesa, package-gpu-libs.sh has no ICD to
-        # bundle and the packaged backend finds no GPU at runtime.
+            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
        if [ "amd64" = "$TARGETARCH" ]; then
            wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
            tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
@@ -137,7 +132,7 @@ RUN <<EOT bash
            libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
        if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "$TARGETARCH" ]; then
            apt-get install -y --no-install-recommends \
-            libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} libcudnn9-dev-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
+            libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
        fi
        apt-get clean && \
        rm -rf /var/lib/apt/lists/*
@@ -211,16 +206,6 @@ RUN if [ "${BACKEND}" = "opus" ]; then \
    apt-get clean && rm -rf /var/lib/apt/lists/*; \
 fi

-# CrispASR's piper TTS backend dlopens libespeak-ng at runtime to phonemize
-# non-English text (the MIT-clean path; English uses a built-in G2P). Install
-# the espeak-ng runtime + its libpcaudio/libsonic deps + voice data so
-# package.sh can bundle them into the FROM scratch image.
-RUN if [ "${BACKEND}" = "crispasr" ]; then \
-    apt-get update && apt-get install -y --no-install-recommends \
-        espeak-ng-data libespeak-ng1 libpcaudio0 libsonic0 && \
-    apt-get clean && rm -rf /var/lib/apt/lists/*; \
-fi
-
 COPY . /LocalAI

 RUN git config --global --add safe.directory /LocalAI
--- a/backend/Dockerfile.privacy-filter
+++ b/backend/Dockerfile.privacy-filter
@@ -1,109 +0,0 @@
-ARG BASE_IMAGE=ubuntu:24.04
-# BUILDER_BASE_IMAGE defaults to BASE_IMAGE so the Dockerfile parses when no
-# prebuilt base is supplied; the builder-prebuilt stage is only entered when
-# BUILDER_TARGET=builder-prebuilt, so the fallback content is harmless
-# (BuildKit prunes the unreferenced builder).
-ARG BUILDER_BASE_IMAGE=${BASE_IMAGE}
-# BUILDER_TARGET selects which builder stage the scratch image copies from.
-# Declared before any FROM so it is usable in `FROM ${BUILDER_TARGET}`. The
-# backend_build workflow sets it to builder-prebuilt when the matrix entry
-# provides builder-base-image, else builder-fromsource (the local default).
-ARG BUILDER_TARGET=builder-fromsource
-ARG APT_MIRROR=""
-ARG APT_PORTS_MIRROR=""
-
-# privacy-filter: standalone GGML engine for the openai-privacy-filter PII/NER
-# token classifier, wrapped as a LocalAI gRPC backend.
-#
-# Mirrors backend/Dockerfile.llama-cpp: the build toolchain (gRPC + cmake +
-# protoc + conditional CUDA/Vulkan) comes from the shared
-# .docker/install-base-deps.sh (from-source path) or a prebuilt
-# quay.io/go-skynet/ci-cache:base-grpc-* image (CI path) — nothing GPU-specific
-# is hand-rolled here. BUILD_TYPE selects the engine backend in the Makefile:
-# "" = cpu, "cublas" -> -DPF_CUDA=ON, "vulkan" -> -DPF_VULKAN=ON.
-
-# ============================================================================
-# Stage: builder-fromsource — self-contained build. Runs the same install
-# script backend/Dockerfile.base-grpc-builder runs, so this path is
-# bit-equivalent to the prebuilt base. Used when BUILDER_TARGET=builder-fromsource
-# (the default; local `make backends/privacy-filter`).
-# ============================================================================
-FROM ${BASE_IMAGE} AS builder-fromsource
-ARG BUILD_TYPE
-ARG CUDA_MAJOR_VERSION
-ARG CUDA_MINOR_VERSION
-ARG CMAKE_FROM_SOURCE=false
-# CUDA Toolkit 13.x needs CMake 3.31.9+ for correct toolchain/arch detection.
-ARG CMAKE_VERSION=3.31.10
-ARG GRPC_VERSION=v1.65.0
-ARG GRPC_MAKEFLAGS="-j4 -Otarget"
-ARG SKIP_DRIVERS=false
-ARG TARGETARCH
-ARG UBUNTU_VERSION=2404
-ARG APT_MIRROR
-ARG APT_PORTS_MIRROR
-
-ENV BUILD_TYPE=${BUILD_TYPE} \
-    CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
-    CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
-    CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
-    CMAKE_VERSION=${CMAKE_VERSION} \
-    GRPC_VERSION=${GRPC_VERSION} \
-    GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
-    SKIP_DRIVERS=${SKIP_DRIVERS} \
-    TARGETARCH=${TARGETARCH} \
-    UBUNTU_VERSION=${UBUNTU_VERSION} \
-    APT_MIRROR=${APT_MIRROR} \
-    APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
-    DEBIAN_FRONTEND=noninteractive
-# CUDA on PATH (a no-op when CUDA is not installed, e.g. cpu/vulkan builds).
-ENV PATH=/usr/local/cuda/bin:${PATH}
-
-WORKDIR /build
-
-# apt deps + cmake + protoc + gRPC + conditional CUDA/Vulkan, all from the
-# shared script (the source of truth that base-grpc-builder also runs).
-RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
-    --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
-    bash /usr/local/sbin/install-base-deps
-
-# install-base-deps installs gRPC under /opt/grpc; copy it to /usr/local so the
-# backend's find_package(gRPC CONFIG) resolves it at the canonical prefix.
-RUN cp -a /opt/grpc/. /usr/local/
-
-COPY . /LocalAI
-
-RUN --mount=type=cache,target=/root/.ccache,id=privacy-filter-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
-    make -C /LocalAI/backend/cpp/privacy-filter BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
-
-# ============================================================================
-# Stage: builder-prebuilt — FROM a prebuilt
-# quay.io/go-skynet/ci-cache:base-grpc-* image (gRPC at /opt/grpc + apt deps +
-# CUDA/Vulkan already installed). Used in CI when the matrix entry sets
-# builder-base-image.
-# ============================================================================
-FROM ${BUILDER_BASE_IMAGE} AS builder-prebuilt
-ARG BUILD_TYPE
-ARG TARGETARCH
-ENV BUILD_TYPE=${BUILD_TYPE}
-# CUDA on PATH (a no-op for the cpu/vulkan base images).
-ENV PATH=/usr/local/cuda/bin:${PATH}
-
-# Mirror builder-fromsource: the base-grpc image installs gRPC to /opt/grpc but
-# does not copy it to /usr/local.
-RUN cp -a /opt/grpc/. /usr/local/
-
-COPY . /LocalAI
-
-RUN --mount=type=cache,target=/root/.ccache,id=privacy-filter-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
-    make -C /LocalAI/backend/cpp/privacy-filter BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
-
-# ============================================================================
-# Final stage — copy the package output from the selected builder. BuildKit
-# does not expand variables in `COPY --from=`, so alias the chosen builder to a
-# fixed stage name first.
-# ============================================================================
-FROM ${BUILDER_TARGET} AS builder
-
-FROM scratch
-COPY --from=builder /LocalAI/backend/cpp/privacy-filter/package/. ./
--- a/backend/Dockerfile.python
+++ b/backend/Dockerfile.python
@@ -66,12 +66,7 @@ RUN <<EOT bash
            libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
            git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
            ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
-            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils && \
-        apt-get install -y mesa-vulkan-drivers libdrm2
-        # Mesa Vulkan ICD drivers (ANV/RADV/lavapipe) + their manifests. The
-        # LunarG SDK below only provides the loader and shader tooling, not
-        # hardware drivers — without Mesa, package-gpu-libs.sh has no ICD to
-        # bundle and the packaged backend finds no GPU at runtime.
+            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
        if [ "amd64" = "$TARGETARCH" ]; then
            wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
            tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
@@ -131,7 +126,6 @@ RUN <<EOT bash
        apt-get update && \
        apt-get install -y --no-install-recommends \
            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
-            cuda-nvrtc-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
--- a/backend/backend.proto
+++ b/backend/backend.proto
@@ -24,10 +24,6 @@ service Backend {
  rpc TokenizeString(PredictOptions) returns (TokenizationResponse) {}
  rpc Status(HealthMessage) returns (StatusResponse) {}
  rpc Detect(DetectOptions) returns (DetectResponse) {}
-  // SoundDetection runs an audio-tagging / sound-event-classification model
-  // (e.g. CED over the AudioSet ontology) on a clip and returns scored labels.
-  rpc SoundDetection(SoundDetectionRequest) returns (SoundDetectionResponse) {}
-  rpc Depth(DepthRequest) returns (DepthResponse) {}
  rpc FaceVerify(FaceVerifyRequest) returns (FaceVerifyResponse) {}
  rpc FaceAnalyze(FaceAnalyzeRequest) returns (FaceAnalyzeResponse) {}
  rpc VoiceVerify(VoiceVerifyRequest) returns (VoiceVerifyResponse) {}
@@ -541,15 +537,6 @@ message TTSRequest {
  string dst = 3;
  string voice = 4;
  optional string language = 5;
-  // instructions is a free-form, per-request style/voice description (maps to
-  // the OpenAI `instructions` field). Backends that support expressive synthesis
-  // (e.g. Qwen3-TTS CustomVoice/VoiceDesign) prefer this over the static YAML
-  // option when set; backends that don't simply ignore it.
-  optional string instructions = 6;
-  // params carries optional, backend-specific per-request generation parameters
-  // (e.g. Chatterbox exaggeration/cfg_weight/temperature). Values are strings and
-  // coerced by the backend; unset leaves the backend's configured defaults.
-  map<string, string> params = 7;
 }

 message VADRequest {
@@ -674,53 +661,6 @@ message DetectResponse {
  repeated Detection Detections = 1;
 }

-// --- Sound-event classification / audio tagging messages (CED) ---
-
-message SoundDetectionRequest {
-  string src = 1;       // audio file path (LocalAI writes the upload to disk)
-  int32 top_k = 2;      // number of top tags to return (0 = all classes)
-  float threshold = 3;  // optional: drop tags scoring below this
-}
-
-message SoundClass {
-  string label = 1;     // AudioSet class name, e.g. "Baby cry, infant cry"
-  float score = 2;      // per-class probability (multi-label, independent)
-  int32 index = 3;      // class index in the model ontology
-}
-
-message SoundDetectionResponse {
-  repeated SoundClass detections = 1;  // score-descending
-}
-
-// --- Depth estimation messages (Depth Anything 3) ---
-
-message DepthRequest {
-  string src = 1;                  // input image (filesystem path or base64-encoded payload)
-  string dst = 2;                  // optional output directory for exports (glb/colmap)
-  bool include_depth = 3;          // return the per-pixel metric depth map
-  bool include_confidence = 4;     // return the per-pixel confidence map (DualDPT)
-  bool include_pose = 5;           // return camera extrinsics/intrinsics (DualDPT)
-  bool include_sky = 6;            // return the per-pixel sky map (mono models)
-  bool include_points = 7;         // back-project to a 3D point cloud (DualDPT)
-  float points_conf_thresh = 8;    // keep points with confidence >= this threshold
-  repeated string exports = 9;     // requested exports: "glb", "colmap"
-}
-
-message DepthResponse {
-  int32 width = 1;                 // processed depth-map width
-  int32 height = 2;                // processed depth-map height
-  repeated float depth = 3;        // width*height row-major metric depth
-  repeated float confidence = 4;   // width*height row-major confidence (DualDPT)
-  repeated float sky = 5;          // width*height row-major sky map (mono)
-  repeated float extrinsics = 6;   // 12 floats, 3x4 row-major (world-to-camera)
-  repeated float intrinsics = 7;   // 9 floats, 3x3 row-major
-  int32 num_points = 8;            // number of 3D points
-  repeated float points = 9;       // num_points*3 xyz, world space
-  bytes point_colors = 10;         // num_points*3 uint8 rgb
-  repeated string export_paths = 11; // paths written for the requested exports
-  bool is_metric = 12;             // depth is in metric units
-}
-
 // --- Face recognition messages ---

 message FacialArea {
--- a/backend/cpp/ds4/CMakeLists.txt
+++ b/backend/cpp/ds4/CMakeLists.txt
@@ -9,22 +9,6 @@ option(DS4_NATIVE "Compile with -march=native / -mcpu=native" ON)
 set(DS4_GPU "cpu" CACHE STRING "GPU backend: cpu, cuda, or metal")
 set(DS4_DIR "${CMAKE_CURRENT_SOURCE_DIR}/ds4" CACHE PATH "Path to cloned ds4 source")

-if(${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
-    # Homebrew installs protobuf/grpc under a non-default prefix. The generated
-    # backend.pb.cc / backend.grpc.pb.cc pull in google/protobuf and grpcpp
-    # headers, but the hw_grpc_proto library links neither target, so on macOS
-    # the headers (e.g. google/protobuf/runtime_version.h) are never on the
-    # compiler's include path. Add the Homebrew prefix globally, matching the
-    # llama-cpp backend which builds on Darwin CI.
-    if(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "arm64")
-        set(HOMEBREW_DEFAULT_PREFIX "/opt/homebrew")
-    else()
-        set(HOMEBREW_DEFAULT_PREFIX "/usr/local")
-    endif()
-    link_directories("${HOMEBREW_DEFAULT_PREFIX}/lib")
-    include_directories("${HOMEBREW_DEFAULT_PREFIX}/include")
-endif()
-
 find_package(Threads REQUIRED)
 find_package(Protobuf CONFIG QUIET)
 if(NOT Protobuf_FOUND)
@@ -76,12 +60,10 @@ elseif(DS4_GPU STREQUAL "cpu")
    set(DS4_OBJS "${DS4_DIR}/ds4_cpu.o")
 endif()

-# ds4.c now references ds4_distributed.c (distributed inference) and ds4_ssd.c
-# (SSD expert-cache), each split into its own translation unit upstream. Both
-# are GPU-agnostic objects shared by every GPU mode, so link them in regardless
-# of DS4_GPU.
+# ds4.c now references ds4_distributed.c (distributed inference was split into
+# its own translation unit upstream). It is a single GPU-agnostic object shared
+# by every GPU mode, so link it in regardless of DS4_GPU.
 list(APPEND DS4_OBJS "${DS4_DIR}/ds4_distributed.o")
-list(APPEND DS4_OBJS "${DS4_DIR}/ds4_ssd.o")

 add_executable(${TARGET}
    grpc-server.cpp
--- a/backend/cpp/ds4/Makefile
+++ b/backend/cpp/ds4/Makefile
@@ -1,10 +1,10 @@
 # ds4 backend Makefile.
 #
-# Upstream pin lives below as DS4_VERSION?=80ebbc396aee40eedc1d829222f3362d10fa4c6c
+# Upstream pin lives below as DS4_VERSION?=ba00a8a88c4c5810a3d1fed6b7b8fa2b44b82fdc
 # (.github/bump_deps.sh) can find and update it - matches the
 # llama-cpp / ik-llama-cpp / turboquant convention.

-DS4_VERSION?=80ebbc396aee40eedc1d829222f3362d10fa4c6c
+DS4_VERSION?=ba00a8a88c4c5810a3d1fed6b7b8fa2b44b82fdc
 DS4_REPO?=https://github.com/antirez/ds4

 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
@@ -18,20 +18,19 @@ UNAME_S := $(shell uname -s)

 CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release

-# ds4_distributed.o and ds4_ssd.o are GPU-agnostic translation units that
-# ds4.c/ds4_cpu.o now reference (upstream split distributed inference and the
-# SSD expert-cache into their own .c files). Both objects are shared by every
-# GPU mode, so they are appended unconditionally below.
+# ds4_distributed.o is a GPU-agnostic translation unit that ds4.c/ds4_cpu.o now
+# reference (upstream split distributed inference into its own .c). The same
+# object is shared by every GPU mode, so it is appended unconditionally below.
 ifeq ($(BUILD_TYPE),cublas)
    CMAKE_ARGS += -DDS4_GPU=cuda
-    DS4_OBJ_TARGET := ds4.o ds4_cuda.o ds4_distributed.o ds4_ssd.o
+    DS4_OBJ_TARGET := ds4.o ds4_cuda.o ds4_distributed.o
 else ifeq ($(UNAME_S),Darwin)
    CMAKE_ARGS += -DDS4_GPU=metal
-    DS4_OBJ_TARGET := ds4.o ds4_metal.o ds4_distributed.o ds4_ssd.o
+    DS4_OBJ_TARGET := ds4.o ds4_metal.o ds4_distributed.o
 else
    # CPU reference path (Linux only - macOS CPU path is broken by VM bug per ds4 README).
    CMAKE_ARGS += -DDS4_GPU=cpu
-    DS4_OBJ_TARGET := ds4_cpu.o ds4_distributed.o ds4_ssd.o
+    DS4_OBJ_TARGET := ds4_cpu.o ds4_distributed.o
 endif

 ifneq ($(NATIVE),true)
@@ -56,11 +55,11 @@ ds4:
 # the right per-platform compile flags (Objective-C/Metal on Darwin, nvcc on Linux+CUDA).
 ds4/ds4.o: ds4
 ifeq ($(BUILD_TYPE),cublas)
-	+$(MAKE) -C ds4 ds4.o ds4_cuda.o ds4_distributed.o ds4_ssd.o
+	+$(MAKE) -C ds4 ds4.o ds4_cuda.o ds4_distributed.o
 else ifeq ($(UNAME_S),Darwin)
-	+$(MAKE) -C ds4 ds4.o ds4_metal.o ds4_distributed.o ds4_ssd.o
+	+$(MAKE) -C ds4 ds4.o ds4_metal.o ds4_distributed.o
 else
-	+$(MAKE) -C ds4 ds4_cpu.o ds4_distributed.o ds4_ssd.o
+	+$(MAKE) -C ds4 ds4_cpu.o ds4_distributed.o
 endif

 grpc-server: ds4/ds4.o
--- a/backend/cpp/ds4/grpc-server.cpp
+++ b/backend/cpp/ds4/grpc-server.cpp
@@ -25,8 +25,6 @@ extern "C" {
 #include <chrono>
 #include <climits>
 #include <csignal>
-#include <cstddef>
-#include <cstdint>
 #include <cstdlib>
 #include <cstring>
 #include <ctime>
@@ -107,130 +105,6 @@ static bool parse_layers_spec(const std::string &spec, ds4_distributed_layers *o
    return true;
 }

-// Parse a boolean LoadModel option. An empty value (a bare flag-style option
-// like "ssd_streaming" with no colon) means true so model YAMLs can write
-// options: ["ssd_streaming"] to enable a switch.
-static bool parse_bool_option(const std::string &s, bool *out) {
-    if (s.empty() || s == "true" || s == "1" || s == "yes" || s == "on") { *out = true; return true; }
-    if (s == "false" || s == "0" || s == "no" || s == "off") { *out = false; return true; }
-    return false;
-}
-
-// Table-driven mapping from LoadModel option keys to ds4_engine_options fields.
-// ds4_engine_options is a fixed C struct with no reflection, so the field set
-// is enumerated once here; adding a future engine knob is a one-line table
-// entry rather than a new branch in LoadModel. Two fields need ds4's own typed
-// parsers (Gib, CacheExperts) so a plain string passthrough can't cover them.
-enum class DsOptType { Bool, Int, Uint, Float, Str, Gib, CacheExperts };
-
-struct DsOptSpec {
-    const char *key;
-    DsOptType   type;
-    size_t      off;      // byte offset into ds4_engine_options
-    size_t      off2;     // second offset (CacheExperts writes experts + bytes)
-    bool        is_path;  // Str values: resolve a relative value against the model dir
-};
-
-static const DsOptSpec kEngineOptSpecs[] = {
-    {"mtp_path",                      DsOptType::Str,          offsetof(ds4_engine_options, mtp_path),                      0, true},
-    {"mtp_draft",                     DsOptType::Int,          offsetof(ds4_engine_options, mtp_draft_tokens),              0},
-    {"mtp_margin",                    DsOptType::Float,        offsetof(ds4_engine_options, mtp_margin),                    0},
-    {"prefill_chunk",                 DsOptType::Uint,         offsetof(ds4_engine_options, prefill_chunk),                 0},
-    {"power_percent",                 DsOptType::Int,          offsetof(ds4_engine_options, power_percent),                 0},
-    {"warm_weights",                  DsOptType::Bool,         offsetof(ds4_engine_options, warm_weights),                  0},
-    {"quality",                       DsOptType::Bool,         offsetof(ds4_engine_options, quality),                       0},
-    {"ssd_streaming",                 DsOptType::Bool,         offsetof(ds4_engine_options, ssd_streaming),                 0},
-    {"ssd_streaming_cold",            DsOptType::Bool,         offsetof(ds4_engine_options, ssd_streaming_cold),            0},
-    {"ssd_streaming_preload_experts", DsOptType::Uint,         offsetof(ds4_engine_options, ssd_streaming_preload_experts), 0},
-    {"ssd_streaming_cache_experts",   DsOptType::CacheExperts, offsetof(ds4_engine_options, ssd_streaming_cache_experts),
-                                                               offsetof(ds4_engine_options, ssd_streaming_cache_bytes)},
-    {"simulate_used_memory",          DsOptType::Gib,          offsetof(ds4_engine_options, simulate_used_memory_bytes),    0},
-    {"expert_profile_path",           DsOptType::Str,          offsetof(ds4_engine_options, expert_profile_path),           0, true},
-    {"directional_steering_file",     DsOptType::Str,          offsetof(ds4_engine_options, directional_steering_file),     0, true},
-    {"directional_steering_attn",     DsOptType::Float,        offsetof(ds4_engine_options, directional_steering_attn),     0},
-    {"directional_steering_ffn",      DsOptType::Float,        offsetof(ds4_engine_options, directional_steering_ffn),      0},
-};
-
-// Apply a single key:value LoadModel option to the engine options struct.
-// Unknown keys are ignored (back-compat: callers pass mixed option sets).
-// String values are copied into `storage`, whose elements the engine reads by
-// pointer during ds4_engine_open; `storage` MUST have reserved capacity so
-// push_back never reallocates and dangles an earlier c_str(). Returns false
-// with `err` set when a recognized key has an invalid value.
-static bool apply_engine_option(ds4_engine_options *opt, const std::string &key,
-                                const std::string &val, const std::string &model_dir,
-                                std::vector<std::string> &storage, std::string &err) {
-    const DsOptSpec *spec = nullptr;
-    for (const auto &s : kEngineOptSpecs) {
-        if (key == s.key) { spec = &s; break; }
-    }
-    if (!spec) return true; // unknown key: ignore
-
-    char *base = reinterpret_cast<char *>(opt);
-    switch (spec->type) {
-    case DsOptType::Bool: {
-        bool b = false;
-        if (!parse_bool_option(val, &b)) { err = key + " must be true/false"; return false; }
-        *reinterpret_cast<bool *>(base + spec->off) = b;
-        return true;
-    }
-    case DsOptType::Int: {
-        char *end = nullptr;
-        long v = std::strtol(val.c_str(), &end, 10);
-        if (val.empty() || !end || *end != '\0') { err = key + " must be an integer"; return false; }
-        *reinterpret_cast<int *>(base + spec->off) = static_cast<int>(v);
-        return true;
-    }
-    case DsOptType::Uint: {
-        char *end = nullptr;
-        long v = std::strtol(val.c_str(), &end, 10);
-        if (val.empty() || !end || *end != '\0' || v < 0 || v > static_cast<long>(UINT32_MAX)) {
-            err = key + " must be a non-negative integer"; return false;
-        }
-        *reinterpret_cast<uint32_t *>(base + spec->off) = static_cast<uint32_t>(v);
-        return true;
-    }
-    case DsOptType::Float: {
-        char *end = nullptr;
-        float f = std::strtof(val.c_str(), &end);
-        if (val.empty() || !end || *end != '\0') { err = key + " must be a number"; return false; }
-        *reinterpret_cast<float *>(base + spec->off) = f;
-        return true;
-    }
-    case DsOptType::Str: {
-        // Resolve a relative path option (e.g. mtp_path: a sibling GGUF the
-        // gallery downloaded next to the model) against the model directory, so
-        // YAMLs reference companion files by name. Absolute values pass through.
-        if (spec->is_path && !model_dir.empty() && !val.empty() && val.front() != '/') {
-            storage.push_back(model_dir + "/" + val);
-        } else {
-            storage.push_back(val);
-        }
-        *reinterpret_cast<const char **>(base + spec->off) = storage.back().c_str();
-        return true;
-    }
-    case DsOptType::Gib: {
-        uint64_t bytes = 0;
-        if (!ds4_parse_gib_arg(val.c_str(), &bytes)) {
-            err = key + " must be a GiB value, e.g. 64GB"; return false;
-        }
-        *reinterpret_cast<uint64_t *>(base + spec->off) = bytes;
-        return true;
-    }
-    case DsOptType::CacheExperts: {
-        uint32_t experts = 0;
-        uint64_t bytes = 0;
-        if (!ds4_parse_streaming_cache_experts_arg(val.c_str(), &experts, &bytes)) {
-            err = key + " must be a positive expert count or a <number>GB budget"; return false;
-        }
-        *reinterpret_cast<uint32_t *>(base + spec->off)  = experts;
-        *reinterpret_cast<uint64_t *>(base + spec->off2) = bytes;
-        return true;
-    }
-    }
-    return true;
-}
-
 // When acting as a distributed coordinator, block until the worker route
 // covers all layers (ds4_session_distributed_route_ready == 1) or the timeout
 // elapses. Returns an empty string on success, or an error message to return
@@ -602,10 +476,39 @@ public:
            return GStatus::OK;
        }

+        std::string mtp_path;
+        int mtp_draft = 0;
+        float mtp_margin = 3.0f;
+        std::string ds4_role, ds4_layers, ds4_listen;
+        for (const auto &opt : request->options()) {
+            auto [k, v] = split_option(opt);
+            if (k == "mtp_path") mtp_path = v;
+            else if (k == "mtp_draft") mtp_draft = std::stoi(v);
+            else if (k == "mtp_margin") mtp_margin = std::stof(v);
+            else if (k == "kv_cache_dir") g_kv_cache_dir = v;
+            else if (k == "ds4_role") ds4_role = v;
+            else if (k == "ds4_layers") ds4_layers = v;
+            else if (k == "ds4_listen") ds4_listen = v;
+            else if (k == "ds4_route_timeout") {
+                if (!parse_positive_int(v, &g_route_timeout_sec)) {
+                    result->set_success(false);
+                    result->set_message("ds4: ds4_route_timeout must be a positive integer");
+                    return GStatus::OK;
+                }
+            }
+        }
+
+        g_kv_cache.SetDir(g_kv_cache_dir);
+
        ds4_engine_options opt = {};
        opt.model_path = model_path.c_str();
+        opt.mtp_path = mtp_path.empty() ? nullptr : mtp_path.c_str();
        opt.n_threads = request->threads() > 0 ? request->threads() : 0;
-        opt.mtp_margin = 3.0f; // ds4 default; overridable via the mtp_margin option
+        opt.mtp_draft_tokens = mtp_draft;
+        opt.mtp_margin = mtp_margin;
+        opt.directional_steering_file = nullptr;
+        opt.warm_weights = false;
+        opt.quality = false;

 #if defined(DS4_NO_GPU)
        opt.backend = DS4_BACKEND_CPU;
@@ -615,46 +518,6 @@ public:
        opt.backend = DS4_BACKEND_CUDA;
 #endif

-        // Stable storage for string-valued engine options. The engine reads
-        // these by pointer during ds4_engine_open, so the std::string backing
-        // store must outlive the call and not reallocate; reserve up front so
-        // push_back keeps every prior c_str() valid. Static + clear() reuses
-        // the buffer across LoadModel calls (the old engine is closed above).
-        static std::vector<std::string> s_opt_strings;
-        s_opt_strings.clear();
-        s_opt_strings.reserve(sizeof(kEngineOptSpecs) / sizeof(kEngineOptSpecs[0]));
-
-        // Directory of the main model, used to resolve relative path options.
-        std::string model_dir;
-        if (auto slash = model_path.find_last_of('/'); slash != std::string::npos) {
-            model_dir = model_path.substr(0, slash);
-        }
-
-        std::string ds4_role, ds4_layers, ds4_listen;
-        for (const auto &o : request->options()) {
-            auto [k, v] = split_option(o);
-            if (k == "kv_cache_dir") { g_kv_cache_dir = v; continue; }
-            else if (k == "ds4_role") { ds4_role = v; continue; }
-            else if (k == "ds4_layers") { ds4_layers = v; continue; }
-            else if (k == "ds4_listen") { ds4_listen = v; continue; }
-            else if (k == "ds4_route_timeout") {
-                if (!parse_positive_int(v, &g_route_timeout_sec)) {
-                    result->set_success(false);
-                    result->set_message("ds4: ds4_route_timeout must be a positive integer");
-                    return GStatus::OK;
-                }
-                continue;
-            }
-            std::string err;
-            if (!apply_engine_option(&opt, k, v, model_dir, s_opt_strings, err)) {
-                result->set_success(false);
-                result->set_message("ds4: " + err);
-                return GStatus::OK;
-            }
-        }
-
-        g_kv_cache.SetDir(g_kv_cache_dir);
-
        // Coordinator wiring. 'ds4_role:coordinator' enables layer-split
        // distributed inference: this process listens on ds4_listen and owns
        // the ds4_layers slice; workers dial in (see `local-ai worker
--- a/backend/cpp/ik-llama-cpp/CMakeLists.txt
+++ b/backend/cpp/ik-llama-cpp/CMakeLists.txt
@@ -1,6 +1,15 @@
-## Multimodal support is provided by the in-tree `mtmd` library target
-## (examples/mtmd/), which the grpc-server links and includes below. clip/llava
-## were pruned upstream; the high-level mtmd_* / mtmd_helper_* API is used instead.
+## Clip/LLaVA library for multimodal support — built locally from copied sources
+set(TARGET myclip)
+add_library(${TARGET} clip.cpp clip.h llava.cpp llava.h)
+install(TARGETS ${TARGET} LIBRARY)
+target_include_directories(myclip PUBLIC .)
+target_include_directories(myclip PUBLIC ../..)
+target_include_directories(myclip PUBLIC ../../common)
+target_link_libraries(${TARGET} PRIVATE common ggml llama ${CMAKE_THREAD_LIBS_INIT})
+target_compile_features(${TARGET} PRIVATE cxx_std_11)
+if (NOT MSVC)
+    target_compile_options(${TARGET} PRIVATE -Wno-cast-qual)
+endif()

 set(TARGET grpc-server)
 set(CMAKE_CXX_STANDARD 17)
@@ -58,16 +67,12 @@ add_library(hw_grpc_proto
  ${hw_proto_hdrs} )

 add_executable(${TARGET} grpc-server.cpp json.hpp)
-# mtmd public headers (mtmd.h / mtmd-helper.h) live in examples/mtmd/.
-# Linking the mtmd target also propagates this include dir, but we add it
-# explicitly for clarity.
-target_include_directories(${TARGET} PRIVATE ../mtmd)
-target_link_libraries(${TARGET} PRIVATE common llama mtmd ${CMAKE_THREAD_LIBS_INIT} absl::flags hw_grpc_proto
+target_link_libraries(${TARGET} PRIVATE common llama myclip ${CMAKE_THREAD_LIBS_INIT} absl::flags hw_grpc_proto
  absl::flags_parse
  gRPC::${_REFLECTION}
  gRPC::${_GRPC_GRPCPP}
  protobuf::${_PROTOBUF_LIBPROTOBUF})
-target_compile_features(${TARGET} PRIVATE cxx_std_17)
+target_compile_features(${TARGET} PRIVATE cxx_std_11)
 if(TARGET BUILD_INFO)
  add_dependencies(${TARGET} BUILD_INFO)
 endif()
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=f96eaddba8bed6a9a5e628bbf6a566775c70b49c
+IK_LLAMA_VERSION?=3f40e73c367ad9f0c1b1819f28c7348c26aa340d
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/ik-llama-cpp/grpc-server.cpp
+++ b/backend/cpp/ik-llama-cpp/grpc-server.cpp
@@ -11,8 +11,8 @@
 #include <memory>
 #include <string>
 #include <getopt.h>
-#include "mtmd.h"
-#include "mtmd-helper.h"
+#include "clip.h"
+#include "llava.h"
 #include "log.h"
 #include "common.h"
 #include "json.hpp"
@@ -45,9 +45,7 @@ using backend::HealthMessage;

 ///// LLAMA.CPP server code below

-// Match mtmd.h and ik_llama's server/common headers, which all use
-// nlohmann::ordered_json; a plain nlohmann::json alias collides at global scope.
-using json = nlohmann::ordered_json;
+using json = nlohmann::json;

 struct server_params
 {
@@ -221,11 +219,6 @@ struct llama_client_slot

    // multimodal
    std::vector<slot_image> images;
-    // Full prompt with mtmd media markers (mtmd_default_marker()) substituted in
-    // place of the legacy [img-N] tags, covering the text up to and including the
-    // last image. The text after the last image is kept in params.input_suffix and
-    // decoded through the normal token path so the sampling loop is unchanged.
-    std::string mtmd_prompt;

    // stats
    size_t sent_count = 0;
@@ -259,14 +252,14 @@ struct llama_client_slot

        for (slot_image & img : images)
        {
-            if (img.bitmap) {
-                mtmd_bitmap_free(img.bitmap);
-                img.bitmap = nullptr;
+            free(img.image_embedding);
+            if (img.img_data) {
+                clip_image_u8_free(img.img_data);
            }
+            img.prefix_prompt = "";
        }

        images.clear();
-        mtmd_prompt = "";
    }

    bool has_budget(gpt_params &global_params) {
@@ -403,13 +396,46 @@ struct llama_metrics {
    }
 };

+struct llava_embd_batch {
+    std::vector<llama_pos>      pos;
+    std::vector<int32_t>        n_seq_id;
+    std::vector<llama_seq_id>   seq_id_0;
+    std::vector<llama_seq_id *> seq_ids;
+    std::vector<int8_t>         logits;
+    llama_batch batch;
+    llava_embd_batch(float * embd, int32_t n_tokens, llama_pos pos_0, llama_seq_id seq_id) {
+        pos     .resize(n_tokens);
+        n_seq_id.resize(n_tokens);
+        seq_ids .resize(n_tokens + 1);
+        logits  .resize(n_tokens);
+        seq_id_0.resize(1);
+        seq_id_0[0] = seq_id;
+        seq_ids [n_tokens] = nullptr;
+        batch = {
+            /*n_tokens       =*/ n_tokens,
+            /*tokens         =*/ nullptr,
+            /*embd           =*/ embd,
+            /*pos            =*/ pos.data(),
+            /*n_seq_id       =*/ n_seq_id.data(),
+            /*seq_id         =*/ seq_ids.data(),
+            /*logits         =*/ logits.data(),
+        };
+        for (int i = 0; i < n_tokens; i++) {
+            batch.pos     [i] = pos_0 + i;
+            batch.n_seq_id[i] = 1;
+            batch.seq_id  [i] = seq_id_0.data();
+            batch.logits  [i] = false;
+        }
+    }
+};
+
 struct llama_server_context
 {
    llama_model *model = nullptr;
    llama_context *ctx = nullptr;
    const llama_vocab * vocab = nullptr;

-    mtmd_context *mctx = nullptr;
+    clip_ctx *clp_ctx = nullptr;

    gpt_params params;

@@ -465,6 +491,11 @@ struct llama_server_context
        if (!params.mmproj.path.empty()) {
            multimodal = true;
            LOG_INFO("Multi Modal Mode Enabled", {});
+            clp_ctx = clip_model_load(params.mmproj.path.c_str(), /*verbosity=*/ 1);
+            if(clp_ctx == nullptr) {
+                LOG_ERR("unable to load clip model: %s", params.mmproj.path.c_str());
+                return false;
+            }

            if (params.n_ctx < 2048) { // request larger context for the image embedding
                params.n_ctx = 2048;
@@ -481,24 +512,10 @@ struct llama_server_context
        }

        if (multimodal) {
-            // mtmd_init_from_file requires the already-loaded text model, so it must
-            // run AFTER llama_init_from_gpt_params. It validates the projector
-            // against the model internally and returns nullptr on dim mismatch, so
-            // the explicit clip_n_mmproj_embd check is no longer needed.
-            mtmd_context_params mparams = mtmd_context_params_default();
-            mparams.use_gpu         = params.mmproj_use_gpu;
-            mparams.print_timings   = false;
-            mparams.n_threads       = params.n_threads_mtmd != -1 ? params.n_threads_mtmd
-                                      : params.n_threads_batch != -1 ? params.n_threads_batch
-                                                                     : params.n_threads;
-            mparams.verbosity       = GGML_LOG_LEVEL_INFO;
-            mparams.flash_attn_type = params.flash_attn ? LLAMA_FLASH_ATTN_TYPE_ENABLED
-                                                        : LLAMA_FLASH_ATTN_TYPE_DISABLED;
-            mparams.image_min_tokens = params.image_min_tokens;
-            mparams.image_max_tokens = params.image_max_tokens;
-            mctx = mtmd_init_from_file(params.mmproj.path.c_str(), model, mparams);
-            if (mctx == nullptr) {
-                LOG_ERR("unable to load multimodal projector: %s", params.mmproj.path.c_str());
+            const int n_embd_clip = clip_n_mmproj_embd(clp_ctx);
+            const int n_embd_llm  = llama_model_n_embd(model);
+            if (n_embd_clip != n_embd_llm) {
+                LOG("%s: embedding dim of the multimodal projector (%d) is not equal to that of LLaMA (%d). Make sure that you use the correct mmproj file.\n", __func__, n_embd_clip, n_embd_llm);
                llama_free(ctx);
                llama_free_model(model);
                return false;
@@ -848,8 +865,8 @@ struct llama_server_context

                    slot_image img_sl;
                    img_sl.id = img.count("id") != 0 ? img["id"].get<int>() : slot->images.size();
-                    img_sl.bitmap = mtmd_helper_bitmap_init_from_buf(mctx, image_buffer.data(), image_buffer.size());
-                    if (img_sl.bitmap == nullptr)
+                    img_sl.img_data = clip_image_u8_init();
+                    if (!clip_image_load_from_bytes(image_buffer.data(), image_buffer.size(), img_sl.img_data))
                    {
                        LOG_ERR("%s: failed to load image, slot_id: %d, img_sl_id: %d",
                             __func__,
@@ -862,74 +879,50 @@ struct llama_server_context
                        {"slot_id",   slot->id},
                        {"img_sl_id", img_sl.id}
                    });
+                    img_sl.request_encode_image = true;
                    slot->images.push_back(img_sl);
                }
-                // Translate the legacy [img-N] tags into mtmd media markers, in
-                // order, and collect the matching bitmaps in marker order so they
-                // line up with the markers passed to mtmd_tokenize(). The text after
-                // the last image stays in input_suffix and is decoded through the
-                // normal token path, so the sampling loop is unchanged.
-                // example: system prompt [img-102] user [img-103] describe [img-134]
+                // process prompt
+                // example: system prompt [img-102] user [img-103] describe [img-134] -> [{id: 102, prefix: 'system prompt '}, {id: 103, prefix: ' user '}, {id: 134, prefix: ' describe '}]
                if (slot->images.size() > 0 && !slot->prompt.is_array())
                {
-                    const std::string marker = mtmd_default_marker();
                    std::string prompt = slot->prompt.get<std::string>();
-                    std::string built_prompt;
-                    std::vector<slot_image> ordered;
-                    size_t pos = 0, copy_from = 0;
+                    size_t pos = 0, begin_prefix = 0;
                    std::string pattern = "[img-";
-
-                    auto free_images = [&]() {
-                        for (slot_image &img : slot->images) {
-                            if (img.bitmap) {
-                                mtmd_bitmap_free(img.bitmap);
-                                img.bitmap = nullptr;
-                            }
-                        }
-                        slot->images.clear();
-                    };
-
                    while ((pos = prompt.find(pattern, pos)) != std::string::npos) {
-                        size_t tag_begin = pos;
+                        size_t end_prefix = pos;
                        pos += pattern.length();
                        size_t end_pos = prompt.find(']', pos);
-                        if (end_pos == std::string::npos) {
-                            break;
-                        }
-                        std::string image_id = prompt.substr(pos, end_pos - pos);
-                        try
+                        if (end_pos != std::string::npos)
                        {
-                            int img_id = std::stoi(image_id);
-                            bool found = false;
-                            for (slot_image &img : slot->images)
+                            std::string image_id = prompt.substr(pos, end_pos - pos);
+                            try
                            {
-                                if (img.id == img_id) {
-                                    found = true;
-                                    // text before this tag, then the media marker
-                                    built_prompt += prompt.substr(copy_from, tag_begin - copy_from);
-                                    built_prompt += marker;
-                                    copy_from = end_pos + 1;
-                                    ordered.push_back(img);
-                                    break;
+                                int img_id = std::stoi(image_id);
+                                bool found = false;
+                                for (slot_image &img : slot->images)
+                                {
+                                    if (img.id == img_id) {
+                                        found = true;
+                                        img.prefix_prompt = prompt.substr(begin_prefix, end_prefix - begin_prefix);
+                                        begin_prefix = end_pos + 1;
+                                        break;
+                                    }
                                }
-                            }
-                            if (!found) {
-                                LOG("ERROR: Image with id: %i, not found.\n", img_id);
-                                free_images();
+                                if (!found) {
+                                    LOG("ERROR: Image with id: %i, not found.\n", img_id);
+                                    slot->images.clear();
+                                    return false;
+                                }
+                            } catch (const std::invalid_argument& e) {
+                                LOG("Invalid image number id in prompt\n");
+                                slot->images.clear();
                                return false;
                            }
-                        } catch (const std::invalid_argument& e) {
-                            LOG("Invalid image number id in prompt\n");
-                            free_images();
-                            return false;
                        }
-                        pos = end_pos + 1;
                    }
-                    // bitmaps are consumed in marker order by mtmd_tokenize()
-                    slot->images = ordered;
-                    slot->mtmd_prompt = built_prompt;
                    slot->prompt = "";
-                    slot->params.input_suffix = prompt.substr(copy_from);
+                    slot->params.input_suffix = prompt.substr(begin_prefix);
                    slot->params.cache_prompt = false; // multimodal doesn't support cache prompt
                }
            }
@@ -1183,10 +1176,21 @@ struct llama_server_context

    bool process_images(llama_client_slot &slot) const
    {
-        // With the mtmd pipeline, image encoding is no longer eager: the bitmaps
-        // are tokenized and encoded together with the surrounding text inside
-        // ingest_images() via mtmd_tokenize() + mtmd_helper_eval_chunks(). This
-        // just reports whether the slot carries any images to process.
+        for (slot_image &img : slot.images)
+        {
+            if (!img.request_encode_image)
+            {
+                continue;
+            }
+
+            if (!llava_image_embed_make_with_clip_img(clp_ctx, params.n_threads, img.img_data, &img.image_embedding, &img.image_tokens)) {
+                LOG("Error processing the given image\n");
+                return false;
+            }
+
+            img.request_encode_image = false;
+        }
+
        return slot.images.size() > 0;
    }

@@ -1431,70 +1435,69 @@ struct llama_server_context
        }
    }

-    // Tokenize the multimodal prompt (text interleaved with media markers) together
-    // with the slot's bitmaps, then decode the resulting chunks into the llama
-    // context via the high-level mtmd helper. The helper runs llama_decode() on the
-    // text chunks and mtmd_encode() + llama_decode() on the image chunks, handling
-    // batching and any pre/post decode setup (e.g. non-causal attention for gemma3).
-    // Advances slot.n_past by the number of positions consumed, then leaves the
-    // post-image suffix tokens in `batch` so the normal decode + sampling loop
-    // produces the first generated token.
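+    // Evaluate each image in order: decode its queued prefix text, then its embeddings, then queue the next image's prefix (or the final suffix)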
    bool ingest_images(llama_client_slot &slot, int n_batch)
    {
-        if (mctx == nullptr)
-        {
-            LOG("%s : multimodal context is not initialized\n", __func__);
-            return false;
-        }
+        int image_idx = 0;

-        // bitmaps stay owned by slot.images (freed on reset()); pass non-owning ptrs
-        std::vector<const mtmd_bitmap *> bitmaps;
-        bitmaps.reserve(slot.images.size());
-        for (const slot_image &img : slot.images)
+        while (image_idx < (int) slot.images.size())
        {
-            bitmaps.push_back(img.bitmap);
-        }
+            slot_image &img = slot.images[image_idx];

-        mtmd_input_text inp_txt;
-        inp_txt.text          = slot.mtmd_prompt.c_str();
-        inp_txt.add_special   = add_bos_token;
-        inp_txt.parse_special = true;
+            // process prefix prompt
+            for (int32_t i = 0; i < (int32_t) batch.n_tokens; i += n_batch)
+            {
+                const int32_t n_tokens = std::min(n_batch, (int32_t) (batch.n_tokens - i));
+                llama_batch batch_view = {
+                    n_tokens,
+                    batch.token    + i,
+                    nullptr,
+                    batch.pos      + i,
+                    batch.n_seq_id + i,
+                    batch.seq_id   + i,
+                    batch.logits   + i,
+                };
+                if (llama_decode(ctx, batch_view))
+                {
+                    LOG("%s : failed to eval\n", __func__);
+                    return false;
+                }
+            }

-        mtmd::input_chunks chunks(mtmd_input_chunks_init());
-        int32_t res = mtmd_tokenize(mctx,
-                                    chunks.ptr.get(),
-                                    &inp_txt,
-                                    bitmaps.data(),
-                                    bitmaps.size());
-        if (res != 0)
-        {
-            LOG("%s : failed to tokenize multimodal prompt, res = %d\n", __func__, res);
-            return false;
-        }
+            // process image with llm
+            for (int i = 0; i < img.image_tokens; i += n_batch)
+            {
+                int n_eval = img.image_tokens - i;
+                if (n_eval > n_batch)
+                {
+                    n_eval = n_batch;
+                }

-        const llama_pos start_pos = (llama_pos) system_tokens.size() + slot.n_past;
-        llama_pos new_n_past = start_pos;
-        if (mtmd_helper_eval_chunks(mctx,
-                                    ctx,
-                                    chunks.ptr.get(),
-                                    start_pos,
-                                    slot.id,
-                                    n_batch,
-                                    /*logits_last=*/ false,
-                                    &new_n_past) != 0)
-        {
-            LOG("%s : failed to eval multimodal chunks\n", __func__);
-            return false;
-        }
-        slot.n_past += (int32_t) (new_n_past - start_pos);
+                const int n_embd = llama_model_n_embd(model);
+                float * embd = img.image_embedding + i * n_embd;
+                llava_embd_batch llava_batch = llava_embd_batch(embd, n_eval, slot.n_past, 0);
+                if (llama_decode(ctx, llava_batch.batch))
+                {
+                    LOG("%s : failed to eval image\n", __func__);
+                    return false;
+                }
+                slot.n_past += n_eval;
+            }
+            image_idx++;

-        // queue the post-image suffix text for the normal decode + sampling path
-        common_batch_clear(batch);
-        std::vector<llama_token> suffix_tokens = tokenize(slot.params.input_suffix, false);
-        for (llama_token tok : suffix_tokens)
-        {
-            common_batch_add(batch, tok, system_tokens.size() + slot.n_past, { slot.id }, false);
-            slot.n_past += 1;
+            common_batch_clear(batch);
+
+            // append prefix of next image
+            const auto json_prompt = (image_idx >= (int) slot.images.size()) ?
+                slot.params.input_suffix : // no more images, then process suffix prompt
+                (json)(slot.images[image_idx].prefix_prompt);
+
+            std::vector<llama_token> append_tokens = tokenize(json_prompt, false); // next image's prefix, or the suffix after the last image
+            for (int i = 0; i < (int) append_tokens.size(); ++i)
+            {
+                common_batch_add(batch, append_tokens[i], system_tokens.size() + slot.n_past, { slot.id }, true);
+                slot.n_past += 1;
+            }
        }

        return true;
@@ -1881,11 +1884,8 @@ struct llama_server_context

                    const bool has_images = process_images(slot);

-                    // For the multimodal path the whole pre-image / inter-image text is
-                    // tokenized and decoded inside ingest_images() via mtmd, so no prefix
-                    // tokens are queued here; the post-image suffix is appended by
-                    // ingest_images() for the normal decode + sampling loop.
-                    std::vector<llama_token> prefix_tokens = has_images ? std::vector<llama_token>() : prompt_tokens;
+                    // process the prefix of the first image
+                    std::vector<llama_token> prefix_tokens = has_images ? tokenize(slot.images[0].prefix_prompt, add_bos_token) : prompt_tokens;

                    int32_t slot_npast = slot.n_past_se > 0 ? slot.n_past_se : slot.n_past;

--- a/backend/cpp/ik-llama-cpp/patches/0002-clip-ggml-quantize-chunk-user-data.patch
+++ b/backend/cpp/ik-llama-cpp/patches/0002-clip-ggml-quantize-chunk-user-data.patch
@@ -0,0 +1,11 @@
+--- a/examples/llava/clip.cpp
++++ b/examples/llava/clip.cpp
+@@ -2494,7 +2494,7 @@
+             }
+             new_data = work.data();
+
+-            new_size = ggml_quantize_chunk(new_type, f32_data, new_data, 0, n_elms/cur->ne[0], cur->ne[0], nullptr);
++            new_size = ggml_quantize_chunk(new_type, f32_data, new_data, 0, n_elms/cur->ne[0], cur->ne[0], nullptr, nullptr);
+         } else {
+             new_type = cur->type;
+             new_data = cur->data;
--- a/backend/cpp/ik-llama-cpp/prepare.sh
+++ b/backend/cpp/ik-llama-cpp/prepare.sh
@@ -17,9 +17,28 @@ cp -r grpc-server.cpp llama.cpp/examples/grpc-server/
 cp -r utils.hpp llama.cpp/examples/grpc-server/
 cp -rfv llama.cpp/vendor/nlohmann/json.hpp llama.cpp/examples/grpc-server/

-## Multimodal support is provided by the `mtmd` library target (examples/mtmd/),
-## which the grpc-server links and includes directly. No source copy is needed:
-## clip/llava were pruned upstream and the high-level mtmd_* API is used instead.
+## Copy clip/llava files for multimodal support (built as myclip library)
+cp -rfv llama.cpp/examples/llava/clip.h llama.cpp/examples/grpc-server/clip.h
+cp -rfv llama.cpp/examples/llava/clip.cpp llama.cpp/examples/grpc-server/clip.cpp
+cp -rfv llama.cpp/examples/llava/llava.cpp llama.cpp/examples/grpc-server/llava.cpp
+# Prepend llama.h include to llava.h
+echo '#include "llama.h"' > llama.cpp/examples/grpc-server/llava.h
+cat llama.cpp/examples/llava/llava.h >> llama.cpp/examples/grpc-server/llava.h
+# Copy clip-impl.h if it exists
+if [ -f llama.cpp/examples/llava/clip-impl.h ]; then
+    cp -rfv llama.cpp/examples/llava/clip-impl.h llama.cpp/examples/grpc-server/clip-impl.h
+fi
+# Copy stb_image.h
+if [ -f llama.cpp/vendor/stb/stb_image.h ]; then
+    cp -rfv llama.cpp/vendor/stb/stb_image.h llama.cpp/examples/grpc-server/stb_image.h
+elif [ -f llama.cpp/common/stb_image.h ]; then
+    cp -rfv llama.cpp/common/stb_image.h llama.cpp/examples/grpc-server/stb_image.h
+fi
+
+## Fix API compatibility in llava.cpp (llama_n_embd -> llama_model_n_embd)
+if [ -f llama.cpp/examples/grpc-server/llava.cpp ]; then
+    sed -i 's/llama_n_embd(/llama_model_n_embd(/g' llama.cpp/examples/grpc-server/llava.cpp
+fi

 set +e
 if grep -q "grpc-server" llama.cpp/examples/CMakeLists.txt; then
--- a/backend/cpp/ik-llama-cpp/run.sh
+++ b/backend/cpp/ik-llama-cpp/run.sh
@@ -2,7 +2,7 @@
 set -ex

 # Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath "$0")")
+CURDIR=$(dirname "$(realpath $0)")

 cd /

@@ -13,28 +13,28 @@ grep -e "flags" /proc/cpuinfo | head -1
 # ik_llama.cpp requires AVX2 — default to avx2 binary
 BINARY=ik-llama-cpp-avx2

-if [ -e "$CURDIR"/ik-llama-cpp-fallback ] && ! grep -q -e "\savx2\s" /proc/cpuinfo ; then
+if [ -e $CURDIR/ik-llama-cpp-fallback ] && ! grep -q -e "\savx2\s" /proc/cpuinfo ; then
 	echo "CPU:    AVX2   NOT found, using fallback"
 	BINARY=ik-llama-cpp-fallback
 fi

 # Extend ld library path with the dir where this script is located/lib
 if [ "$(uname)" == "Darwin" ]; then
-	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
-	#export DYLD_FALLBACK_LIBRARY_PATH="$CURDIR"/lib:$DYLD_FALLBACK_LIBRARY_PATH
+	export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
+	#export DYLD_FALLBACK_LIBRARY_PATH=$CURDIR/lib:$DYLD_FALLBACK_LIBRARY_PATH
 else
-	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
+	export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 fi

 # If there is a lib/ld.so, use it
-if [ -f "$CURDIR"/lib/ld.so ]; then
+if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using binary: $BINARY"
-	exec "$CURDIR"/lib/ld.so "$CURDIR"/$BINARY "$@"
+	exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
 fi

 echo "Using binary: $BINARY"
-exec "$CURDIR"/$BINARY "$@"
+exec $CURDIR/$BINARY "$@"

 # We should never reach this point, however just in case we do, run fallback
-exec "$CURDIR"/ik-llama-cpp-fallback "$@"
+exec $CURDIR/ik-llama-cpp-fallback "$@"
--- a/backend/cpp/ik-llama-cpp/utils.hpp
+++ b/backend/cpp/ik-llama-cpp/utils.hpp
@@ -11,12 +11,9 @@

 #include "json.hpp"

-#include "mtmd.h"
+#include "clip.h"

-// mtmd.h and ik_llama's entire server/common stack (chat.h, server-common.h,
-// server-task.h, ...) declare `using json = nlohmann::ordered_json`, so match it
-// here: a plain `nlohmann::json` alias collides with mtmd.h's at global scope.
-using json = nlohmann::ordered_json;
+using json = nlohmann::json;

 extern bool server_verbose;

@@ -114,12 +111,13 @@ struct slot_image
 {
    int32_t id;

-    // mtmd bitmap (image/audio) decoded from the request buffer. Owned by the
-    // slot; freed via mtmd_bitmap_free() on reset. The high-level mtmd pipeline
-    // (mtmd_tokenize + mtmd_helper_eval_chunks) consumes these directly, so the
-    // legacy eager-encode fields (embedding/tokens) and per-image prefix prompt
-    // are no longer needed.
-    mtmd_bitmap * bitmap = nullptr;
+    bool request_encode_image = false;
+    float * image_embedding = nullptr;
+    int32_t image_tokens = 0;
+
+    clip_image_u8 * img_data;
+
+    std::string prefix_prompt; // prompt text preceding this image
 };

 // completion token output with probabilities
--- a/backend/cpp/llama-cpp/CMakeLists.txt
+++ b/backend/cpp/llama-cpp/CMakeLists.txt
@@ -50,13 +50,8 @@ add_custom_command(
        "${hw_proto}"
      DEPENDS "${hw_proto}")

-# hw_grpc_proto: force STATIC. Under the CPU_ALL_VARIANTS build BUILD_SHARED_LIBS=ON
-# (ggml/llama become shared), which would otherwise make this glue library a DSO. As a
-# DSO it references the hidden-visibility symbols in the static libprotobuf.a, which the
-# linker cannot satisfy ("hidden symbol ... in libprotobuf.a is referenced by DSO").
-# Keeping it STATIC links protobuf/gRPC directly into the grpc-server executable while
-# only ggml/llama stay shared. No effect on the static variants (already BUILD_SHARED_LIBS=OFF).
-add_library(hw_grpc_proto STATIC
+# hw_grpc_proto
+add_library(hw_grpc_proto
  ${hw_grpc_srcs}
  ${hw_grpc_hdrs}
  ${hw_proto_srcs}
@@ -87,18 +82,3 @@ target_compile_features(${TARGET} PRIVATE cxx_std_11)
 if(TARGET BUILD_INFO)
  add_dependencies(${TARGET} BUILD_INFO)
 endif()
-
-# Unit test for the message-content normalization helper (message_content.h).
-# Off by default so the normal backend build is untouched; enable with
-# -DLLAMA_GRPC_BUILD_TESTS=ON and run via ctest. It reuses llama.cpp's vendored
-# <nlohmann/json.hpp> (propagated by the common helpers library) so it has no
-# extra dependency beyond what the backend already builds against.
-option(LLAMA_GRPC_BUILD_TESTS "Build grpc-server unit tests" OFF)
-if(LLAMA_GRPC_BUILD_TESTS)
-    enable_testing()
-    add_executable(message_content_test message_content_test.cpp message_content.h)
-    target_include_directories(message_content_test PRIVATE ${CMAKE_CURRENT_SOURCE_DIR})
-    target_link_libraries(message_content_test PRIVATE ${_LLAMA_COMMON_TARGET})
-    target_compile_features(message_content_test PRIVATE cxx_std_17)
-    add_test(NAME message_content_test COMMAND message_content_test)
-endif()
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd
+LLAMA_VERSION?=5dcb71166686799f0d873eab7386234302d05ecf
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=
@@ -10,16 +10,8 @@ TARGET?=--target grpc-server
 JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
 ARCH?=$(shell uname -m)

-# Shared libs default to OFF: we link static gRPC and the avx/avx2/avx512/fallback
-# variants are fully static. The CPU_ALL_VARIANTS build flips SHARED_LIBS=ON (ggml/llama
-# become shared so the dynamic CPU backends work; gRPC stays static via its imported
-# targets). SHARED_LIBS is a make variable, not an appended -D, so it survives the
-# recursive sub-make into the VARIANT build dir (which re-parses this Makefile) instead
-# of being re-clobbered by a second -DBUILD_SHARED_LIBS=OFF. EXTRA_CMAKE_ARGS is the hook
-# the CPU_ALL_VARIANTS target uses to inject -DGGML_BACKEND_DL/-DGGML_CPU_ALL_VARIANTS.
-SHARED_LIBS?=OFF
-EXTRA_CMAKE_ARGS?=
-CMAKE_ARGS+=-DBUILD_SHARED_LIBS=$(SHARED_LIBS) -DLLAMA_CURL=OFF $(EXTRA_CMAKE_ARGS)
+# Disable shared libs: we link against static gRPC, and shared and static libraries can't be mixed
+CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=OFF

 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
 ifeq ($(NATIVE),false)
@@ -128,39 +120,15 @@ llama-cpp-fallback: llama.cpp
 	CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) VARIANT="llama-cpp-fallback-build" build-llama-cpp-grpc-server
 	cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-fallback-build/grpc-server llama-cpp-fallback

-# Single-build CPU backend using ggml's CPU_ALL_VARIANTS. Produces ONE grpc-server
-# plus a set of dlopen-able libggml-cpu-*.so (sandybridge/haswell/skylakex/...) that
-# ggml's backend registry selects from at runtime by probing host CPU features.
-# Replaces the avx/avx2/avx512/fallback multi-binary build on x86.
-#
-# CPU_ALL_VARIANTS requires GGML_BACKEND_DL, which requires BUILD_SHARED_LIBS=ON, so we
-# pass SHARED_LIBS=ON and the DL flags as make variables (NOT pre-expanded into the
-# CMAKE_ARGS env string): command-line make variables propagate through every recursive
-# sub-make, so the deepest VARIANT-dir build computes BUILD_SHARED_LIBS=ON consistently.
-# Only ggml/llama go shared - gRPC is found via its static imported targets, so the
-# grpc-server binary keeps static gRPC and only dynamically links ggml.
-#
-# TARGET adds "ggml": the per-microarch backends are runtime-dlopened, not link deps of
-# grpc-server, so they only build because each is an add_dependencies() of the ggml target.
-llama-cpp-cpu-all: llama.cpp
-	cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build
-	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build purge
-	$(info ${GREEN}I llama-cpp build info:cpu-all-variants${RESET})
-	$(MAKE) SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" VARIANT="llama-cpp-cpu-all-build" build-llama-cpp-grpc-server
-	cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build/grpc-server llama-cpp-cpu-all
-	rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs
-	find $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \;
-	@echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/
-
 llama-cpp-grpc: llama.cpp
 	cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build
 	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build purge
 	$(info ${GREEN}I llama-cpp build info:grpc${RESET})
-	CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" TARGET="--target grpc-server --target ggml-rpc-server" $(MAKE) VARIANT="llama-cpp-grpc-build" build-llama-cpp-grpc-server
+	CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" TARGET="--target grpc-server --target rpc-server" $(MAKE) VARIANT="llama-cpp-grpc-build" build-llama-cpp-grpc-server
 	cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build/grpc-server llama-cpp-grpc

 llama-cpp-rpc-server: llama-cpp-grpc
-	cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build/llama.cpp/build/bin/ggml-rpc-server llama-cpp-rpc-server
+	cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build/llama.cpp/build/bin/rpc-server llama-cpp-rpc-server

 llama.cpp:
 	mkdir -p llama.cpp
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
--- a/backend/cpp/llama-cpp/message_content.h
+++ b/backend/cpp/llama-cpp/message_content.h
@@ -1,192 +0,0 @@
-#pragma once
-
-#include <string>
-#include <vector>
-
-#include <nlohmann/json.hpp>
-
-namespace llama_grpc {
-
-// Normalizes a proto message's content string into the JSON value used when
-// reconstructing OpenAI-format messages for the tokenizer (jinja) template.
-//
-// Shared by the streaming (PredictStream) and non-streaming (Predict) message
-// reconstruction paths so the two cannot drift.
-//
-// LocalAI's Go layer (schema.Messages.ToProto) always sends content as a plain
-// text string; multimodal media travels in separate proto fields, never inside
-// content. So user/system/developer content is *only ever* opaque text and must
-// NOT be JSON-sniffed: a prompt that merely looks like JSON (e.g. an ingredient
-// list ["1/4 cup sugar", ...]) would otherwise be reinterpreted as structured
-// content parts and rejected by oaicompat_chat_params_parse with
-// "unsupported content[].type" (https://github.com/mudler/LocalAI/issues/10524).
-// (developer is OpenAI's modern system alias - same "human-authored text" nature.)
-//
-// For assistant/tool messages we still collapse a literal JSON null/object
-// (tool-call bookkeeping) to a string, but we never turn a plain string into an
-// array/scalar. The array defense is therefore role-independent (arrays/scalars
-// fall through for every role); the role gate only governs the null/object case.
-inline nlohmann::ordered_json normalize_message_content(const std::string& role,
-                                                        const std::string& content) {
-    nlohmann::ordered_json content_val = content;
-    if (role != "user" && role != "system" && role != "developer") {
-        try {
-            nlohmann::ordered_json parsed = nlohmann::ordered_json::parse(content);
-            if (parsed.is_null()) {
-                content_val = "";
-            } else if (parsed.is_object()) {
-                content_val = parsed.dump();
-            }
-            // arrays / scalars: keep the original plain-text string as-is
-        } catch (const nlohmann::ordered_json::parse_error&) {
-            // Not JSON, already the plain string
-        }
-    }
-    return content_val;
-}
-
-// Final safety pass applied to each reconstructed OpenAI message right before it
-// is handed to oaicompat_chat_params_parse (jinja templating). Jinja templates
-// assume content is a string: a literal null breaks slicing such as
-// message.content[:N] (#7324), and a tool message with array content is rejected
-// (#7528). A multimodal user message legitimately carries a typed-part array
-// ({type:text}, {type:image_url}, ...), which must be left intact. Shared by the
-// streaming and non-streaming paths so this invariant cannot drift between them.
-inline void normalize_template_message(nlohmann::ordered_json& msg) {
-    if (!msg.contains("content")) {
-        msg["content"] = ""; // templates expect the field to exist
-        return;
-    }
-    nlohmann::ordered_json& content = msg["content"];
-    const std::string role = (msg.contains("role") && msg["role"].is_string())
-                                 ? msg["role"].get<std::string>()
-                                 : std::string();
-    if (content.is_null()) {
-        content = ""; // #7324: null would crash content[:N] slicing
-    } else if (role == "tool" && content.is_array()) {
-        content = content.dump(); // #7528: tool messages must have string content
-    } else if (!content.is_string() && !content.is_array()) {
-        if (content.is_object()) {
-            content = content.dump(); // tool-call bookkeeping object -> string
-        } else {
-            content = ""; // other scalar (number/bool) -> empty
-        }
-    }
-    // string, or a non-tool (multimodal) typed-part array: leave untouched
-}
-
-// One proto message's data, flattened to plain types so the reconstruction logic
-// can be shared and unit-tested without protobuf. The streaming and non-streaming
-// predict paths both populate this from proto::Message + the request's media.
-struct ReconstructedMessageInput {
-    std::string role;
-    std::string content;            // proto.Message.content (always a plain string)
-    std::string name;
-    std::string tool_call_id;
-    std::string reasoning_content;
-    std::string tool_calls;         // tool_calls as a JSON string, or empty
-    bool is_last_user_msg = false;  // attach request media to this message
-    std::vector<std::string> images; // base64 (jpeg)
-    std::vector<std::string> audios; // base64 (wav)
-    std::vector<std::string> videos; // base64
-};
-
-// Appends the request's media as OpenAI typed content parts. Imperative (not
-// brace-init) to avoid nlohmann's object-vs-array initializer-list ambiguity.
-inline void append_media_parts(nlohmann::ordered_json& content_array,
-                               const std::vector<std::string>& images,
-                               const std::vector<std::string>& audios,
-                               const std::vector<std::string>& videos) {
-    for (const auto& img : images) {
-        nlohmann::ordered_json image_chunk;
-        image_chunk["type"] = "image_url";
-        nlohmann::ordered_json image_url;
-        image_url["url"] = "data:image/jpeg;base64," + img;
-        image_chunk["image_url"] = image_url;
-        content_array.push_back(image_chunk);
-    }
-    for (const auto& aud : audios) {
-        nlohmann::ordered_json audio_chunk;
-        audio_chunk["type"] = "input_audio";
-        nlohmann::ordered_json input_audio;
-        input_audio["data"] = aud;
-        input_audio["format"] = "wav"; // default; could be made configurable
-        audio_chunk["input_audio"] = input_audio;
-        content_array.push_back(audio_chunk);
-    }
-    for (const auto& vid : videos) {
-        nlohmann::ordered_json video_chunk;
-        video_chunk["type"] = "input_video";
-        nlohmann::ordered_json input_video;
-        input_video["data"] = vid;
-        video_chunk["input_video"] = input_video;
-        content_array.push_back(video_chunk);
-    }
-}
-
-// Reconstructs a single OpenAI-format message (the object fed to
-// oaicompat_chat_params_parse) from a proto message. Shared by PredictStream and
-// Predict so the content/multimodal/tool_calls handling cannot drift between the
-// two stream modes (it previously lived as two ~150-line copies with a redundant
-// Predict-only tool_calls->" " branch). Guarantees content is always a string or
-// a typed-part array, never null/missing.
-inline nlohmann::ordered_json build_reconstructed_message(const ReconstructedMessageInput& in) {
-    nlohmann::ordered_json msg_json;
-    msg_json["role"] = in.role;
-    const bool has_media = !in.images.empty() || !in.audios.empty() || !in.videos.empty();
-
-    if (!in.content.empty()) {
-        nlohmann::ordered_json content_val = normalize_message_content(in.role, in.content);
-        if (content_val.is_string() && in.is_last_user_msg && has_media) {
-            // Last user message + media: build a typed-part array (text first).
-            nlohmann::ordered_json content_array = nlohmann::ordered_json::array();
-            nlohmann::ordered_json text_part;
-            text_part["type"] = "text";
-            text_part["text"] = content_val.get<std::string>();
-            content_array.push_back(text_part);
-            append_media_parts(content_array, in.images, in.audios, in.videos);
-            msg_json["content"] = content_array;
-        } else if (content_val.is_null()) {
-            msg_json["content"] = "";
-        } else {
-            msg_json["content"] = content_val;
-        }
-    } else if (in.is_last_user_msg && has_media) {
-        // No text but media on the last user message: media-only typed array.
-        nlohmann::ordered_json content_array = nlohmann::ordered_json::array();
-        append_media_parts(content_array, in.images, in.audios, in.videos);
-        msg_json["content"] = content_array;
-    } else {
-        // Empty content (any role, incl. tool/assistant): templates need a string.
-        msg_json["content"] = "";
-    }
-
-    if (!in.name.empty()) {
-        msg_json["name"] = in.name;
-    }
-    if (!in.tool_call_id.empty()) {
-        msg_json["tool_call_id"] = in.tool_call_id;
-    }
-    if (!in.reasoning_content.empty()) {
-        msg_json["reasoning_content"] = in.reasoning_content;
-    }
-    if (!in.tool_calls.empty()) {
-        try {
-            nlohmann::ordered_json tool_calls = nlohmann::ordered_json::parse(in.tool_calls);
-            msg_json["tool_calls"] = tool_calls;
-            // tool_calls + empty/blank content: use " " not "", because llama.cpp's
-            // common_chat_msgs_to_json_oaicompat turns "" into null, which breaks
-            // templates that slice message.content[:tool_start_length] (#7324).
-            if (!msg_json.contains("content") ||
-                (msg_json["content"].is_string() && msg_json["content"].get<std::string>().empty())) {
-                msg_json["content"] = " ";
-            }
-        } catch (const nlohmann::ordered_json::parse_error&) {
-            // Malformed tool_calls JSON: leave content as-is (prior behavior).
-        }
-    }
-
-    return msg_json;
-}
-
-}  // namespace llama_grpc
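
Reviewer note: the `tool_calls` branch of the removed `build_reconstructed_message` boils down to one small rule, sketched below with the standard library only. The name `content_for_template` and its boolean parameter are hypothetical and do not appear in the diff. The sketch assumes, as the removed comment states, that an empty string would later be turned into null by llama.cpp.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of the diff): choose the content string that
// reaches the chat template for a message.
//   content           - message text after normalization (may be empty)
//   tool_calls_parsed - true only when a tool_calls payload was present AND
//                       parsed as valid JSON
std::string content_for_template(const std::string& content, bool tool_calls_parsed) {
    if (tool_calls_parsed && content.empty()) {
        // A single space survives where "" would become null downstream,
        // which the removed comment ties to issue #7324.
        return " ";
    }
    // Malformed or absent tool_calls: content is left exactly as it was.
    return content;
}
```

With this rule, an assistant turn that only carries tool calls yields `" "`, while a turn with real text, or one whose `tool_calls` JSON failed to parse, keeps its content unchanged.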
--- a/backend/cpp/llama-cpp/message_content_test.cpp
+++ b/backend/cpp/llama-cpp/message_content_test.cpp
@@ -1,234 +0,0 @@
-// Unit tests for the shared message-reconstruction helpers (message_content.h).
-//
-// Build & run standalone (nlohmann/json single header on the include path):
-//   g++ -std=c++17 -I<dir-with-nlohmann> message_content_test.cpp -o t && ./t
-// or via CMake: -DLLAMA_GRPC_BUILD_TESTS=ON then ctest.
-//
-// Regression coverage for:
-//   #10524 - a user/system prompt that is itself a JSON-array string must stay
-//            plain text, never be reinterpreted as OpenAI structured parts.
-//   #7324  - assistant/tool null content -> "" (templates slice content[:N]);
-//            assistant+tool_calls+empty content -> " " (not "", which becomes null).
-//   #7528  - tool message array content must reach the template as a string.
-//   multimodal - last user message text + media -> typed-part array, media kept.
-
-#include <cassert>
-#include <iostream>
-#include <string>
-
-#include "message_content.h"
-
-using nlohmann::ordered_json;
-using llama_grpc::normalize_message_content;
-using llama_grpc::normalize_template_message;
-using llama_grpc::build_reconstructed_message;
-using llama_grpc::ReconstructedMessageInput;
-
-static int failures = 0;
-
-static void check(bool ok, const std::string& name, const std::string& detail = "") {
-    if (!ok) {
-        std::cerr << "FAIL " << name << (detail.empty() ? "" : ": " + detail) << "\n";
-        failures++;
-    }
-}
-
-// ---- normalize_message_content -------------------------------------------
-
-static void expect_norm_string(const char* name, const std::string& role,
-                               const std::string& content, const std::string& want) {
-    auto got = normalize_message_content(role, content);
-    if (!got.is_string()) {
-        check(false, name, "expected a JSON string, got " +
-                               std::string(got.is_array() ? "array" : got.is_object() ? "object" : "other") +
-                               " (" + got.dump() + ")");
-        return;
-    }
-    check(got.get<std::string>() == want, name, "expected \"" + want + "\", got \"" + got.get<std::string>() + "\"");
-}
-
-static void test_normalize() {
-    const std::string ingredients = R"(["1/4 cup brown sugar, packed","1 pound ground beef"])";
-
-    // #10524 - JSON-array text must stay a string. Role-INDEPENDENT array defense.
-    for (const char* role : {"user", "system", "developer", "function", "assistant", "tool"}) {
-        expect_norm_string((std::string("json_array_stays_text:") + role).c_str(), role, ingredients, ingredients);
-    }
-
-    // #10524 - user/system/developer JSON-object text stays verbatim (NOT re-dumped).
-    expect_norm_string("user_json_object_verbatim", "user", R"({"a":1})", R"({"a":1})");
-    expect_norm_string("system_json_object_verbatim", "system", R"({"a":1})", R"({"a":1})");
-    expect_norm_string("developer_json_object_verbatim", "developer", R"({"a":1})", R"({"a":1})");
-
-    // Plain text unchanged for all roles.
-    expect_norm_string("user_plain_text", "user", "hello world", "hello world");
-    expect_norm_string("assistant_non_json_text_kept", "assistant", "hi [unclosed", "hi [unclosed");
-
-    // #7324 boundary - user/system/developer literal "null" preserved (never parsed).
-    expect_norm_string("user_literal_null_stays", "user", "null", "null");
-    expect_norm_string("system_literal_null_stays", "system", "null", "null");
-    expect_norm_string("developer_literal_null_stays", "developer", "null", "null");
-
-    // #7324 - assistant/tool literal null collapses to empty string.
-    expect_norm_string("assistant_null_to_empty", "assistant", "null", "");
-    expect_norm_string("tool_null_to_empty", "tool", "null", "");
-
-    // #7324/#7528 - assistant/tool object bookkeeping stringified (stays a string).
-    check(normalize_message_content("assistant", R"({"tool":"x"})").is_string(), "assistant_object_stringified");
-    check(normalize_message_content("tool", R"({"error":"boom"})").is_string(), "tool_object_stringified");
-
-    // #10524-family - a bare scalar that parses as a JSON number stays the string.
-    expect_norm_string("assistant_scalar_number_stays_string", "assistant", "42", "42");
-
-    // baseline - empty content stays empty.
-    expect_norm_string("user_empty_stays_empty", "user", "", "");
-}
-
-// ---- normalize_template_message (BEFORE TEMPLATE sanitizer) ---------------
-
-static void test_template_sanitizer() {
-    // #7528 - a tool message with an ACTUAL array becomes a string.
-    {
-        ordered_json msg = {{"role", "tool"}, {"content", ordered_json::array({{{"type", "text"}, {"text", "r"}}})}};
-        normalize_template_message(msg);
-        check(msg["content"].is_string(), "before_template_tool_array_to_string", "got " + msg["content"].dump());
-    }
-    // #7324 - null content -> "" for any role.
-    {
-        ordered_json msg = {{"role", "assistant"}, {"content", nullptr}};
-        normalize_template_message(msg);
-        check(msg["content"].is_string() && msg["content"] == "", "before_template_null_to_empty");
-    }
-    // object content -> dumped string (would otherwise throw at the template).
-    {
-        ordered_json msg = {{"role", "assistant"}, {"content", {{"x", 1}}}};
-        normalize_template_message(msg);
-        check(msg["content"].is_string(), "before_template_object_to_string", "got " + msg["content"].dump());
-    }
-    // missing content field -> "".
-    {
-        ordered_json msg = {{"role", "user"}};
-        normalize_template_message(msg);
-        check(msg.contains("content") && msg["content"] == "", "before_template_missing_to_empty");
-    }
-    // multimodal: a well-typed user array must be left UNTOUCHED (role!=tool).
-    {
-        ordered_json parts = ordered_json::array();
-        parts.push_back({{"type", "text"}, {"text", "x"}});
-        ordered_json img; img["type"] = "image_url"; img["image_url"] = {{"url", "data:..."}};
-        parts.push_back(img);
-        ordered_json msg = {{"role", "user"}, {"content", parts}};
-        normalize_template_message(msg);
-        check(msg["content"].is_array() && msg["content"].size() == 2, "before_template_user_typed_array_preserved",
-              "got " + msg["content"].dump());
-    }
-    // a plain string is left untouched.
-    {
-        ordered_json msg = {{"role", "user"}, {"content", "hello"}};
-        normalize_template_message(msg);
-        check(msg["content"] == "hello", "before_template_string_untouched");
-    }
-}
-
-// ---- build_reconstructed_message ----------------------------------------
-
-static void test_reconstruction() {
-    const std::string ingredients = R"(["1/4 cup brown sugar","1 pound ground beef"])";
-
-    // #10524 end-state - user JSON-array text, no media -> string content.
-    {
-        ReconstructedMessageInput in;
-        in.role = "user"; in.content = ingredients;
-        auto m = build_reconstructed_message(in);
-        check(m["content"].is_string() && m["content"] == ingredients, "recon_user_json_array_string",
-              "got " + m["content"].dump());
-    }
-    // multimodal - user text + one image on last user msg -> typed array, image kept.
-    {
-        ReconstructedMessageInput in;
-        in.role = "user"; in.content = ingredients; in.is_last_user_msg = true;
-        in.images.push_back("BASE64IMG");
-        auto m = build_reconstructed_message(in);
-        check(m["content"].is_array() && m["content"].size() == 2, "recon_multimodal_text_plus_image",
-              "got " + m["content"].dump());
-        check(m["content"][0]["type"] == "text" && m["content"][0]["text"] == ingredients, "recon_multimodal_text_first");
-        check(m["content"][1]["type"] == "image_url", "recon_multimodal_image_kept");
-    }
-    // multimodal media-only - empty text + image on last user msg.
-    {
-        ReconstructedMessageInput in;
-        in.role = "user"; in.content = ""; in.is_last_user_msg = true;
-        in.images.push_back("BASE64IMG");
-        auto m = build_reconstructed_message(in);
-        check(m["content"].is_array() && m["content"].size() == 1 && m["content"][0]["type"] == "image_url",
-              "recon_media_only", "got " + m["content"].dump());
-    }
-    // #7528 - tool array-string content stays a string.
-    {
-        ReconstructedMessageInput in;
-        in.role = "tool"; in.content = R"(["a","b"])"; in.tool_call_id = "call_1";
-        auto m = build_reconstructed_message(in);
-        check(m["content"].is_string() && m["content"] == R"(["a","b"])", "recon_tool_array_string",
-              "got " + m["content"].dump());
-        check(m["tool_call_id"] == "call_1", "recon_tool_call_id_set");
-    }
-    // tool empty content -> "".
-    {
-        ReconstructedMessageInput in;
-        in.role = "tool"; in.content = "";
-        auto m = build_reconstructed_message(in);
-        check(m["content"].is_string() && m["content"] == "", "recon_tool_empty_to_string");
-    }
-    // #7324 - assistant + tool_calls + empty content -> " " (single space, not "").
-    {
-        ReconstructedMessageInput in;
-        in.role = "assistant"; in.content = "";
-        in.tool_calls = R"([{"id":"c1","type":"function","function":{"name":"f","arguments":"{}"}}])";
-        auto m = build_reconstructed_message(in);
-        check(m["content"].is_string() && m["content"] == " ", "recon_toolcalls_empty_content_space",
-              "got " + m["content"].dump());
-        check(m["tool_calls"].is_array() && m["tool_calls"].size() == 1, "recon_toolcalls_parsed");
-    }
-    // assistant + tool_calls + real content keeps the content.
-    {
-        ReconstructedMessageInput in;
-        in.role = "assistant"; in.content = "I'll call f";
-        in.tool_calls = R"([{"id":"c1","type":"function","function":{"name":"f","arguments":"{}"}}])";
-        auto m = build_reconstructed_message(in);
-        check(m["content"] == "I'll call f", "recon_toolcalls_with_content_kept");
-    }
-    // assistant null content -> "".
-    {
-        ReconstructedMessageInput in;
-        in.role = "assistant"; in.content = "null";
-        auto m = build_reconstructed_message(in);
-        check(m["content"] == "", "recon_assistant_null_to_empty");
-    }
-    // malformed tool_calls JSON must not throw; content preserved.
-    {
-        ReconstructedMessageInput in;
-        in.role = "assistant"; in.content = "hi"; in.tool_calls = "{not json";
-        auto m = build_reconstructed_message(in);
-        check(m["content"] == "hi" && !m.contains("tool_calls"), "recon_malformed_toolcalls_safe");
-    }
-    // optional fields: name + reasoning carried through.
-    {
-        ReconstructedMessageInput in;
-        in.role = "tool"; in.content = "result"; in.name = "get_weather"; in.reasoning_content = "thinking";
-        auto m = build_reconstructed_message(in);
-        check(m["name"] == "get_weather" && m["reasoning_content"] == "thinking", "recon_optional_fields");
-    }
-}
-
-int main() {
-    test_normalize();
-    test_template_sanitizer();
-    test_reconstruction();
-
-    if (failures == 0) {
-        std::cout << "OK: all message_content tests passed\n";
-        return 0;
-    }
-    std::cerr << failures << " test(s) failed\n";
-    return 1;
-}
--- a/backend/cpp/llama-cpp/package.sh
+++ b/backend/cpp/llama-cpp/package.sh
@@ -14,22 +14,6 @@ mkdir -p $CURDIR/package/lib
 cp -avrf $CURDIR/llama-cpp-* $CURDIR/package/
 cp -rfv $CURDIR/run.sh $CURDIR/package/

-# Bundle the ggml shared backends produced by the CPU_ALL_VARIANTS build (libggml-base.so,
-# libggml.so, libllama.so and the per-microarch libggml-cpu-*.so), all into package/lib.
-#
-# Two distinct resolution mechanisms both land here:
-#   - NEEDED deps (libggml-base/libggml/libllama): resolved by the dynamic linker via the
-#     LD_LIBRARY_PATH=$CURDIR/lib that run.sh exports.
-#   - The per-microarch libggml-cpu-*.so are NOT linked; ggml *discovers* them at runtime by
-#     scanning the executable's own directory (readlink /proc/self/exe). run.sh launches via
-#     the bundled $CURDIR/lib/ld.so, so /proc/self/exe -> .../lib/ld.so and ggml scans lib/.
-#     That is why the variants must sit in lib/ (next to ld.so), not just on the link path.
-# No-op on builds (arm64/darwin) that don't produce the all-variants set.
-if [ -d "$CURDIR/ggml-shared-libs" ]; then
-    echo "Bundling ggml shared backends (CPU_ALL_VARIANTS)..."
-    cp -avf $CURDIR/ggml-shared-libs/*.so* $CURDIR/package/lib/
-fi
-
 # Detect architecture and copy appropriate libraries
 if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    # x86_64 architecture
--- a/backend/cpp/llama-cpp/prepare.sh
+++ b/backend/cpp/llama-cpp/prepare.sh
@@ -18,10 +18,6 @@ done

 cp -r CMakeLists.txt llama.cpp/tools/grpc-server/
 cp -r grpc-server.cpp llama.cpp/tools/grpc-server/
-# Shared message-reconstruction helpers (included by grpc-server.cpp) and their
-# unit test (compiled only when -DLLAMA_GRPC_BUILD_TESTS=ON).
-cp -r message_content.h llama.cpp/tools/grpc-server/
-cp -r message_content_test.cpp llama.cpp/tools/grpc-server/
 cp -rfv llama.cpp/vendor/nlohmann/json.hpp llama.cpp/tools/grpc-server/
 cp -rfv llama.cpp/vendor/cpp-httplib/httplib.h llama.cpp/tools/grpc-server/

--- a/backend/cpp/llama-cpp/run.sh
+++ b/backend/cpp/llama-cpp/run.sh
@@ -2,7 +2,7 @@
 set -ex

 # Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath "$0")")
+CURDIR=$(dirname "$(realpath $0)")

 cd /

@@ -12,41 +12,55 @@ grep -e "flags" /proc/cpuinfo | head -1

 BINARY=llama-cpp-fallback

-# CPU images (x86, arm64, darwin) ship a single llama-cpp-cpu-all built with ggml
-# CPU_ALL_VARIANTS: ggml's backend registry dlopens the best libggml-cpu-*.so for this
-# host, so no shell-side AVX probing. GPU images (cublas/sycl/vulkan/hipblas) ship only
-# llama-cpp-fallback (the accelerator does the compute), so fall back to it when absent.
-if [ -e "$CURDIR"/llama-cpp-cpu-all ]; then
-	BINARY=llama-cpp-cpu-all
+if grep -q -e "\savx\s" /proc/cpuinfo ; then
+	echo "CPU:    AVX    found OK"
+	if [ -e $CURDIR/llama-cpp-avx ]; then
+		BINARY=llama-cpp-avx
+	fi
+fi
+
+if grep -q -e "\savx2\s" /proc/cpuinfo ; then
+	echo "CPU:    AVX2   found OK"
+	if [ -e $CURDIR/llama-cpp-avx2 ]; then
+		BINARY=llama-cpp-avx2
+	fi
+fi
+
+# Check avx 512
+if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
+	echo "CPU:    AVX512F found OK"
+	if [ -e $CURDIR/llama-cpp-avx512 ]; then
+		BINARY=llama-cpp-avx512
+	fi
 fi

 if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then
-	if [ -e "$CURDIR"/llama-cpp-grpc ]; then
+	if [ -e $CURDIR/llama-cpp-grpc ]; then
 		BINARY=llama-cpp-grpc
 	fi
 fi
 
 # Extend ld library path with the dir where this script is located/lib
 if [ "$(uname)" == "Darwin" ]; then
-	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
-	#export DYLD_FALLBACK_LIBRARY_PATH="$CURDIR"/lib:$DYLD_FALLBACK_LIBRARY_PATH
+	export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
+	#export DYLD_FALLBACK_LIBRARY_PATH=$CURDIR/lib:$DYLD_FALLBACK_LIBRARY_PATH
 else
-	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
+	export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 	# Tell rocBLAS where to find TensileLibrary data (GPU kernel tuning files)
 	if [ -d "$CURDIR/lib/rocblas/library" ]; then
-		export ROCBLAS_TENSILE_LIBPATH="$CURDIR"/lib/rocblas/library
+		export ROCBLAS_TENSILE_LIBPATH=$CURDIR/lib/rocblas/library
 	fi
 fi

 # If there is a lib/ld.so, use it
-if [ -f "$CURDIR"/lib/ld.so ]; then
+if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using binary: $BINARY"
-	exec "$CURDIR"/lib/ld.so "$CURDIR"/$BINARY "$@"
+	exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
 fi

 echo "Using binary: $BINARY"
-exec "$CURDIR"/$BINARY "$@"
+exec $CURDIR/$BINARY "$@"

 # We should never reach this point, however just in case we do, run fallback
-exec "$CURDIR"/llama-cpp-fallback "$@"
+exec $CURDIR/llama-cpp-fallback "$@"
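
Reviewer note: the probing that this hunk restores in `run.sh` is "last match wins", so the precedence is fallback, then AVX, AVX2, AVX512F, and finally the gRPC binary when `LLAMACPP_GRPC_SERVERS` is non-empty. The following is a minimal model of that selection, not the script itself. `select_binary`, `Probe`, and the two set parameters are hypothetical names. Set membership stands in for the whitespace-delimited `grep` on `/proc/cpuinfo` and for the `[ -e ... ]` file checks.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical pairing of a CPU flag with the binary variant built for it.
struct Probe {
    const char* flag;
    const char* binary;
};

// Model of the restored run.sh selection: each later probe overrides the
// earlier choice, but only when both the CPU flag and the binary are present.
std::string select_binary(const std::set<std::string>& cpu_flags,
                          const std::set<std::string>& shipped,
                          bool grpc_servers_set) {
    std::string binary = "llama-cpp-fallback";
    const Probe probes[] = {
        {"avx", "llama-cpp-avx"},
        {"avx2", "llama-cpp-avx2"},
        {"avx512f", "llama-cpp-avx512"},
    };
    for (const Probe& p : probes) {
        if (cpu_flags.count(p.flag) > 0 && shipped.count(p.binary) > 0) {
            binary = p.binary;
        }
    }
    // The gRPC variant is checked last, so it overrides any CPU variant.
    if (grpc_servers_set && shipped.count("llama-cpp-grpc") > 0) {
        binary = "llama-cpp-grpc";
    }
    return binary;
}
```

One consequence worth checking in review: a host that reports AVX512F but whose image ships only `llama-cpp-avx` ends up on the AVX binary, because a missing file leaves the previous choice in place.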
--- a/backend/cpp/privacy-filter/.gitignore
+++ b/backend/cpp/privacy-filter/.gitignore
@@ -1,9 +0,0 @@
-/privacy-filter.cpp
-build/
-package/
-grpc-server
-*.o
-backend.pb.cc
-backend.pb.h
-backend.grpc.pb.cc
-backend.grpc.pb.h
--- a/backend/cpp/privacy-filter/CMakeLists.txt
+++ b/backend/cpp/privacy-filter/CMakeLists.txt
@@ -1,77 +0,0 @@
-cmake_minimum_required(VERSION 3.21)
-project(privacy-filter-grpc-server LANGUAGES CXX C)
-
-set(CMAKE_CXX_STANDARD 17)
-set(CMAKE_CXX_STANDARD_REQUIRED ON)
-set(TARGET grpc-server)
-
-# Path to the privacy-filter.cpp engine sources. The Makefile arranges for this
-# to exist (clone of a pinned commit, or a symlink to PRIVACY_FILTER_SRC).
-set(PRIVACY_FILTER_DIR "${CMAKE_CURRENT_SOURCE_DIR}/privacy-filter.cpp"
-    CACHE PATH "Path to the privacy-filter.cpp engine source tree")
-
-find_package(Threads REQUIRED)
-find_package(Protobuf CONFIG QUIET)
-if(NOT Protobuf_FOUND)
-    find_package(Protobuf REQUIRED)
-endif()
-find_package(gRPC CONFIG QUIET)
-if(NOT gRPC_FOUND)
-    # Ubuntu's apt-installed grpc++ does not ship a CMake config - fall back.
-    find_library(GRPCPP_LIB grpc++ REQUIRED)
-    find_library(GRPCPP_REFLECTION_LIB grpc++_reflection REQUIRED)
-    add_library(gRPC::grpc++ INTERFACE IMPORTED)
-    set_target_properties(gRPC::grpc++ PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_LIB}")
-    add_library(gRPC::grpc++_reflection INTERFACE IMPORTED)
-    set_target_properties(gRPC::grpc++_reflection PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_REFLECTION_LIB}")
-endif()
-
-find_program(_PROTOC NAMES protoc REQUIRED)
-find_program(_GRPC_CPP_PLUGIN NAMES grpc_cpp_plugin REQUIRED)
-
-get_filename_component(HW_PROTO "${CMAKE_CURRENT_SOURCE_DIR}/../../backend.proto" ABSOLUTE)
-get_filename_component(HW_PROTO_PATH "${HW_PROTO}" PATH)
-
-set(HW_PROTO_SRCS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.cc")
-set(HW_PROTO_HDRS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.h")
-set(HW_GRPC_SRCS  "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.cc")
-set(HW_GRPC_HDRS  "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.h")
-
-add_custom_command(
-    OUTPUT "${HW_PROTO_SRCS}" "${HW_PROTO_HDRS}" "${HW_GRPC_SRCS}" "${HW_GRPC_HDRS}"
-    COMMAND ${_PROTOC}
-    ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
-         --cpp_out  "${CMAKE_CURRENT_BINARY_DIR}"
-         -I "${HW_PROTO_PATH}"
-         --plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN}"
-         "${HW_PROTO}"
-    DEPENDS "${HW_PROTO}")
-
-add_library(hw_grpc_proto STATIC
-    ${HW_GRPC_SRCS} ${HW_GRPC_HDRS}
-    ${HW_PROTO_SRCS} ${HW_PROTO_HDRS})
-target_include_directories(hw_grpc_proto PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
-# The generated proto/grpc sources include protobuf and grpc++ headers, so this
-# library must see their include dirs. Linking the imported targets propagates
-# them. On Linux the apt headers live in /usr/include (default search path) so
-# this was a no-op; on macOS the Homebrew headers are under /opt/homebrew and
-# would otherwise be missed (runtime_version.h not found).
-target_link_libraries(hw_grpc_proto PUBLIC
-    protobuf::libprotobuf
-    gRPC::grpc++)
-
-# Build only the pf static lib (+ ggml) from the engine tree — no CLI/bench/tests.
-# PF_VULKAN is honored when passed on the cmake command line (it lands in the
-# shared cache the engine reads).
-set(PF_BUILD_TOOLS OFF CACHE BOOL "" FORCE)
-set(PF_BUILD_TESTS OFF CACHE BOOL "" FORCE)
-add_subdirectory(${PRIVACY_FILTER_DIR} ${CMAKE_CURRENT_BINARY_DIR}/privacy-filter.cpp)
-
-add_executable(${TARGET} grpc-server.cpp)
-target_link_libraries(${TARGET} PRIVATE
-    pf
-    hw_grpc_proto
-    gRPC::grpc++
-    gRPC::grpc++_reflection
-    protobuf::libprotobuf
-    Threads::Threads)
--- a/backend/cpp/privacy-filter/Makefile
+++ b/backend/cpp/privacy-filter/Makefile
@@ -1,77 +0,0 @@
-# privacy-filter backend Makefile.
-#
-# Wraps the standalone privacy-filter.cpp GGML engine (the openai-privacy-filter
-# PII/NER token classifier) as a LocalAI gRPC backend. The engine source is
-# fetched at the pin below — .github/workflows/bump_deps.yaml finds and updates
-# PRIVACY_FILTER_VERSION, matching the llama-cpp / ds4 convention.
-#
-# Local development: point at a working checkout instead of cloning, e.g.
-#   make PRIVACY_FILTER_SRC=$HOME/c/privacy-filter.cpp grpc-server
-
-PRIVACY_FILTER_VERSION?=98f52c5ef2250f207cc6b9a6aef05393a120cb7c
-PRIVACY_FILTER_REPO?=https://github.com/localai-org/privacy-filter.cpp
-PRIVACY_FILTER_SRC?=
-
-CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
-BUILD_DIR := build
-
-BUILD_TYPE ?=
-NATIVE ?= false
-JOBS ?= $(shell nproc 2>/dev/null || echo 4)
-
-CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release
-
-# GPU backends; the default (cpu) needs no extra flags. 'cublas' is LocalAI's
-# name for the CUDA build (matches llama-cpp / ds4), mapping to the engine's
-# GGML_CUDA path; 'vulkan' selects the ggml Vulkan backend.
-ifeq ($(BUILD_TYPE),cublas)
-    CMAKE_ARGS += -DPF_CUDA=ON
-endif
-ifeq ($(BUILD_TYPE),vulkan)
-    CMAKE_ARGS += -DPF_VULKAN=ON
-endif
-
-# Portable binaries for distribution: disable -march=native unless asked.
-ifneq ($(NATIVE),true)
-    CMAKE_ARGS += -DGGML_NATIVE=OFF
-endif
-
-.PHONY: grpc-server package clean purge test all
-all: grpc-server
-
-# Provide the engine sources at ./privacy-filter.cpp. With PRIVACY_FILTER_SRC
-# set we symlink a local checkout (instant, no network); otherwise we clone the
-# pinned commit and its ggml submodule. The directory/symlink is the target, so
-# make only does this once — run 'make purge && make' to refetch after a bump.
-privacy-filter.cpp:
-ifneq ($(PRIVACY_FILTER_SRC),)
-	ln -sfn $(abspath $(PRIVACY_FILTER_SRC)) privacy-filter.cpp
-else
-	mkdir -p privacy-filter.cpp
-	cd privacy-filter.cpp && \
-	git init -q && \
-	git remote add origin $(PRIVACY_FILTER_REPO) && \
-	git fetch --depth 1 origin $(PRIVACY_FILTER_VERSION) && \
-	git checkout FETCH_HEAD && \
-	git submodule update --init --recursive --depth 1
-endif
-
-grpc-server: privacy-filter.cpp
-	@echo "Building privacy-filter grpc-server ($(BUILD_TYPE)) with $(CMAKE_ARGS)"
-	mkdir -p $(BUILD_DIR)
-	cd $(BUILD_DIR) && cmake $(CMAKE_ARGS) $(CURRENT_MAKEFILE_DIR) && cmake --build . --config Release -j $(JOBS)
-	cp $(BUILD_DIR)/grpc-server grpc-server
-
-package: grpc-server
-	bash package.sh
-
-test:
-	@echo "privacy-filter backend: parity/regression coverage lives in the engine repo"
-
-clean:
-	rm -rf $(BUILD_DIR) grpc-server package
-
-# 'privacy-filter.cpp' may be a symlink (PRIVACY_FILTER_SRC) — rm without a
-# trailing slash removes the link, never the linked-to checkout.
-purge: clean
-	rm -rf privacy-filter.cpp
--- a/backend/cpp/privacy-filter/grpc-server.cpp
+++ b/backend/cpp/privacy-filter/grpc-server.cpp
@@ -1,210 +0,0 @@
-// privacy-filter LocalAI gRPC backend.
-//
-// Thin shim over privacy-filter.cpp's flat C API (include/pf.h): a standalone
-// GGML engine for the openai-privacy-filter token-classification model family
-// (PII NER). It replaces the llama.cpp-patched TokenClassify path for this one
-// model family — same GGUF files, no llama.cpp carry-patches.
-//
-// Only the RPCs the PII tier needs are implemented: LoadModel, TokenClassify,
-// plus Health / Status / Free. Everything else inherits the generated base
-// class default (UNIMPLEMENTED).
-
-#include "backend.pb.h"
-#include "backend.grpc.pb.h"
-
-#include "pf.h"
-
-#include <grpcpp/grpcpp.h>
-#include <grpcpp/server.h>
-#include <grpcpp/server_builder.h>
-#include <grpcpp/ext/proto_server_reflection_plugin.h>
-
-#include <atomic>
-#include <chrono>
-#include <csignal>
-#include <iostream>
-#include <memory>
-#include <mutex>
-#include <string>
-
-using grpc::Server;
-using grpc::ServerBuilder;
-using grpc::ServerContext;
-// NOTE: do NOT alias grpc::Status as Status — the Status RPC method below would
-// shadow the type and break the other method signatures. Use GStatus instead.
-using GStatus = ::grpc::Status;
-using grpc::StatusCode;
-
-namespace {
-
-// The engine is single-model-per-process: LocalAI spawns one backend process
-// per loaded model. g_mu guards (re)load against in-flight classification.
-std::mutex          g_mu;
-pf_ctx *            g_ctx = nullptr;
-std::atomic<Server *> g_server{nullptr};
-
-// Resolve the device string the engine expects ("cpu" / "gpu" / "cuda" /
-// "vulkan", optionally ":N"). Priority: an explicit "device:..." in
-// ModelOptions.Options, then a non-zero NGPULayers as a coarse "use the GPU"
-// signal, else CPU. "gpu" lets the engine pick whichever GPU backend this
-// binary was compiled with (CUDA or Vulkan), so the same config works on
-// either build; pin "device:cuda"/"device:vulkan" to be explicit.
-std::string resolve_device(const backend::ModelOptions * opts) {
-    for (const auto & o : opts->options()) {
-        const std::string prefix = "device:";
-        if (o.rfind(prefix, 0) == 0) {
-            return o.substr(prefix.size());
-        }
-    }
-    if (opts->ngpulayers() > 0) {
-        return "gpu";
-    }
-    return "cpu";
-}
-
-class PrivacyFilterBackend final : public backend::Backend::Service {
-public:
-    GStatus Health(ServerContext *, const backend::HealthMessage *,
-                   backend::Reply * reply) override {
-        reply->set_message("OK");
-        return GStatus::OK;
-    }
-
-    GStatus Status(ServerContext *, const backend::HealthMessage *,
-                   backend::StatusResponse * response) override {
-        std::lock_guard<std::mutex> lock(g_mu);
-        response->set_state(g_ctx ? backend::StatusResponse::READY
-                                  : backend::StatusResponse::UNINITIALIZED);
-        return GStatus::OK;
-    }
-
-    GStatus LoadModel(ServerContext *, const backend::ModelOptions * request,
-                      backend::Result * result) override {
-        std::lock_guard<std::mutex> lock(g_mu);
-
-        // ModelFile is the absolute path LocalAI resolves; Model is the bare
-        // name. Prefer the former, fall back to the latter.
-        const std::string path =
-            !request->modelfile().empty() ? request->modelfile() : request->model();
-        if (path.empty()) {
-            result->set_success(false);
-            result->set_message("no model path supplied");
-            return GStatus::OK;
-        }
-
-        const std::string device = resolve_device(request);
-
-        if (g_ctx) { pf_free(g_ctx); g_ctx = nullptr; }
-
-        pf_ctx * ctx = pf_load(path.c_str(), device.c_str(), request->threads());
-        const char * err = pf_last_error(ctx);
-        if (err) {
-            result->set_success(false);
-            result->set_message(std::string("privacy-filter load failed: ") + err);
-            pf_free(ctx);
-            return GStatus::OK;
-        }
-
-        // ContextSize, when set, becomes the per-forward window. The engine
-        // ignores values that are too small to window (<= 2*halo) and just
-        // runs a single forward, so passing it through is always safe.
-        if (request->contextsize() > 0) {
-            pf_set_window(ctx, request->contextsize());
-        }
-
-        g_ctx = ctx;
-        result->set_success(true);
-        result->set_message("privacy-filter loaded (" + device + ")");
-        return GStatus::OK;
-    }
-
-    GStatus TokenClassify(ServerContext *, const backend::TokenClassifyRequest * request,
-                          backend::TokenClassifyResponse * response) override {
-        std::lock_guard<std::mutex> lock(g_mu);
-        if (!g_ctx) {
-            return GStatus(StatusCode::FAILED_PRECONDITION, "Model not loaded");
-        }
-
-        const std::string & text = request->text();
-        if (text.empty()) {
-            return GStatus::OK;  // no text -> no entities
-        }
-
-        pf_entity * ents = nullptr;
-        size_t      n    = 0;
-        if (pf_classify(g_ctx, text.data(), text.size(), request->threshold(), &ents, &n) != 0) {
-            const char * err = pf_last_error(g_ctx);
-            return GStatus(StatusCode::INTERNAL,
-                           std::string("TokenClassify failed: ") + (err ? err : "unknown"));
-        }
-
-        // Byte offsets are into the original UTF-8 text; the engine already
-        // applied the threshold and whitespace-trimmed span edges.
-        for (size_t i = 0; i < n; i++) {
-            backend::TokenClassifyEntity * ent = response->add_entities();
-            ent->set_entity_group(ents[i].label ? ents[i].label : "");
-            ent->set_start(ents[i].start);
-            ent->set_end(ents[i].end);
-            ent->set_score(ents[i].score);
-            ent->set_text(text.substr((size_t) ents[i].start,
-                                      (size_t) (ents[i].end - ents[i].start)));
-        }
-        pf_entities_free(ents, n);
-        return GStatus::OK;
-    }
-
-    GStatus Free(ServerContext *, const backend::HealthMessage *,
-                 backend::Result * result) override {
-        std::lock_guard<std::mutex> lock(g_mu);
-        if (g_ctx) { pf_free(g_ctx); g_ctx = nullptr; }
-        result->set_success(true);
-        return GStatus::OK;
-    }
-};
-
-void RunServer(const std::string & addr) {
-    PrivacyFilterBackend service;
-    grpc::EnableDefaultHealthCheckService(true);
-    grpc::reflection::InitProtoReflectionServerBuilderPlugin();
-
-    ServerBuilder builder;
-    builder.AddListeningPort(addr, grpc::InsecureServerCredentials());
-    builder.RegisterService(&service);
-    builder.SetMaxReceiveMessageSize(64 * 1024 * 1024);
-    builder.SetMaxSendMessageSize(64 * 1024 * 1024);
-
-    std::unique_ptr<Server> server(builder.BuildAndStart());
-    if (!server) {
-        std::cerr << "privacy-filter grpc-server: failed to bind " << addr << "\n";
-        std::exit(1);
-    }
-    g_server = server.get();
-    std::cerr << "privacy-filter grpc-server listening on " << addr << "\n";
-    server->Wait();
-}
-
-void signal_handler(int) {
-    if (auto * srv = g_server.load()) {
-        srv->Shutdown(std::chrono::system_clock::now() + std::chrono::seconds(3));
-    }
-}
-
-} // namespace
-
-int main(int argc, char * argv[]) {
-    std::string addr = "127.0.0.1:50051";
-    for (int i = 1; i < argc; ++i) {
-        std::string a = argv[i];
-        const std::string addr_flag = "--addr=";
-        if (a.rfind(addr_flag, 0) == 0)      addr = a.substr(addr_flag.size());
-        else if (a == "--addr" && i + 1 < argc) addr = argv[++i];
-        else if (a == "--help" || a == "-h") {
-            std::cout << "Usage: grpc-server --addr=HOST:PORT\n";
-            return 0;
-        }
-    }
-    std::signal(SIGINT,  signal_handler);
-    std::signal(SIGTERM, signal_handler);
-    RunServer(addr);
-    return 0;
-}
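
Reviewer note: the removed `TokenClassify` loop slices entity text with `text.substr(start, end - start)`, which only works because `std::string` indexes bytes and the engine reports byte offsets into the UTF-8 input. A small sketch of that slicing follows. `Span` and `span_text` are hypothetical stand-ins for the engine's `pf_entity` fields, and the half-open `[start, end)` range is inferred from the length expression in the diff rather than confirmed by the engine's header.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the start/end fields of an entity:
// a half-open byte range [start, end) into the original UTF-8 text.
struct Span {
    std::size_t start;
    std::size_t end;
};

// Slice the entity text by bytes, the same way the removed loop did.
// std::string::substr counts bytes, not code points, so multi-byte
// characters inside the range are copied whole as long as the offsets
// fall on character boundaries.
std::string span_text(const std::string& text, Span s) {
    return text.substr(s.start, s.end - s.start);
}
```

For the input "Zoë lives in Paris", where "ë" occupies two bytes, the name spans bytes 0 to 4 and the city spans bytes 14 to 19. Clients that index by code point or UTF-16 unit would need to convert these offsets.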
--- a/backend/cpp/privacy-filter/package.sh
+++ b/backend/cpp/privacy-filter/package.sh
@@ -1,39 +0,0 @@
-#!/bin/bash
-# Assemble package/ for the from-scratch backend image: the grpc-server binary,
-# run.sh, the dynamic loader, and every shared library the binary needs.
-set -e
-CURDIR=$(dirname "$(realpath "$0")")
-REPO_ROOT="${CURDIR}/../../.."
-
-mkdir -p "$CURDIR/package/lib"
-cp -avf "$CURDIR/grpc-server" "$CURDIR/package/"
-cp -rfv "$CURDIR/run.sh"      "$CURDIR/package/"
-
-# The dynamic loader, renamed to lib/ld.so so run.sh can invoke it explicitly
-# (makes the image independent of the host's glibc layout).
-if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
-    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
-elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
-    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
-else
-    echo "package.sh: unknown architecture" >&2; exit 1
-fi
-
-# Bundle the binary's transitive shared deps (libstdc++, libgomp, and the apt
-# grpc++/protobuf/absl stack) by walking ldd — robust to whichever of those are
-# linked shared vs static. The loader line (no "=>") is skipped; ld.so above
-# already covers it.
-ldd "$CURDIR/grpc-server" | awk '$2 == "=>" && $3 ~ /^\// { print $3 }' | sort -u | \
-while read -r so; do
-    [ -f "$so" ] && cp -arfLv "$so" "$CURDIR/package/lib/"
-done
-
-# Vulkan loader / GPU libs when building the GPU variant.
-GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
-if [ -f "$GPU_LIB_SCRIPT" ]; then
-    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
-    package_gpu_libs
-fi
-
-echo "privacy-filter package contents:"
-ls -lah "$CURDIR/package/" "$CURDIR/package/lib/"
--- a/backend/cpp/privacy-filter/run.sh
+++ b/backend/cpp/privacy-filter/run.sh
@@ -1,15 +0,0 @@
-#!/bin/bash
-# Entry point for the privacy-filter backend image / BACKEND_BINARY mode.
-set -e
-CURDIR=$(dirname "$(realpath "$0")")
-# macOS has no bundled ld.so; the darwin package ships only dylibs under lib/,
-# resolved via DYLD_LIBRARY_PATH (the ld.so branch below is skipped there).
-if [ "$(uname)" = "Darwin" ]; then
-    export DYLD_LIBRARY_PATH="$CURDIR/lib:$DYLD_LIBRARY_PATH"
-else
-    export LD_LIBRARY_PATH="$CURDIR/lib:$LD_LIBRARY_PATH"
-fi
-if [ -f "$CURDIR/lib/ld.so" ]; then
-    exec "$CURDIR/lib/ld.so" "$CURDIR/grpc-server" "$@"
-fi
-exec "$CURDIR/grpc-server" "$@"
--- a/backend/cpp/run-unit-tests.sh
+++ b/backend/cpp/run-unit-tests.sh
@@ -1,71 +0,0 @@
-#!/bin/bash
-#
-# Discovers and runs every standalone C++ unit test under backend/cpp/.
-#
-# A "standalone" unit test is a *_test.cpp that depends only on the C++ standard
-# library and nlohmann/json (single header) - i.e. it exercises pure helpers and
-# does not need the full llama.cpp + gRPC backend build. Tests that DO need the
-# backend build use the CMake/ctest path (e.g. -DLLAMA_GRPC_BUILD_TESTS=ON)
-# instead and are skipped here.
-#
-# This keeps CI generic: adding a new pure-C++ unit test file named *_test.cpp in
-# an active backend source dir is picked up automatically, with no CI edits.
-#
-# Env:
-#   NLOHMANN_INCLUDE  include dir that contains nlohmann/json.hpp. If unset, the
-#                     nlohmann/json single header is fetched to a temp dir.
-#   CXX               compiler (default: g++).
-#   JSON_VERSION      nlohmann/json tag to fetch when NLOHMANN_INCLUDE is unset
-#                     (default: v3.11.3).
-set -uo pipefail
-
-ROOT="$(cd "$(dirname "$0")" && pwd)"
-CXX="${CXX:-g++}"
-JSON_VERSION="${JSON_VERSION:-v3.11.3}"
-
-JSON_INC="${NLOHMANN_INCLUDE:-}"
-if [ -z "$JSON_INC" ]; then
-    JSON_INC="$(mktemp -d)"
-    mkdir -p "$JSON_INC/nlohmann"
-    echo "Fetching nlohmann/json ${JSON_VERSION} single header..."
-    if ! curl -L -sf \
-        "https://raw.githubusercontent.com/nlohmann/json/${JSON_VERSION}/single_include/nlohmann/json.hpp" \
-        -o "$JSON_INC/nlohmann/json.hpp"; then
-        echo "ERROR: failed to fetch nlohmann/json header" >&2
-        exit 1
-    fi
-fi
-
-# Active source dirs only - exclude per-variant build copies, dev snapshots and
-# the vendored upstream llama.cpp tree.
-mapfile -t tests < <(find "$ROOT" -name '*_test.cpp' \
-    -not -path '*/llama.cpp/*' \
-    -not -path '*-build/*' \
-    -not -path '*-dev/*' \
-    -not -path '*fallback*' | sort)
-
-if [ "${#tests[@]}" -eq 0 ]; then
-    echo "No standalone C++ unit tests found under $ROOT"
-    exit 0
-fi
-
-fail=0
-for test_src in "${tests[@]}"; do
-    name="$(basename "$test_src" .cpp)"
-    bin="$(mktemp -d)/$name"
-    echo "==> $test_src"
-    if ! "$CXX" -std=c++17 -Wall -Wextra \
-        -I"$JSON_INC" -I"$(dirname "$test_src")" \
-        "$test_src" -o "$bin"; then
-        echo "COMPILE FAILED: $test_src" >&2
-        fail=1
-        continue
-    fi
-    if ! "$bin"; then
-        echo "TEST FAILED: $test_src" >&2
-        fail=1
-    fi
-done
-
-echo "Ran ${#tests[@]} standalone C++ unit test file(s)"
-exit "$fail"
--- a/backend/cpp/turboquant/Makefile
+++ b/backend/cpp/turboquant/Makefile
@@ -1,7 +1,7 @@

 # Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
 # Auto-bumped nightly by .github/workflows/bump_deps.yaml.
-TURBOQUANT_VERSION?=7d9715f1f071fa07c7b2ad3dbfd320b314139e65
+TURBOQUANT_VERSION?=5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403
 LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant

 CMAKE_ARGS?=
@@ -65,29 +65,6 @@ turboquant-avx:
 turboquant-fallback:
 	$(call turboquant-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)

-# Single-build CPU backend via ggml CPU_ALL_VARIANTS (mirrors llama-cpp-cpu-all).
-# turboquant reuses backend/cpp/llama-cpp's CMakeLists.txt (hw_grpc_proto STATIC) and
-# Makefile (SHARED_LIBS make-var + EXTRA_CMAKE_ARGS), so this passes the same overrides
-# through to the copied build: SHARED_LIBS=ON, the DL flags, and --target ggml (which
-# pulls in the per-microarch libggml-cpu-*.so via ggml's add_dependencies). The .so set
-# is collected for package.sh to bundle into package/lib.
-turboquant-cpu-all:
-	rm -rf $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build
-	cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build
-	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build purge
-	bash $(CURRENT_MAKEFILE_DIR)/patch-grpc-server.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/grpc-server.cpp
-	$(info $(GREEN)I turboquant build info:cpu-all-variants$(RESET))
-	LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
-	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build llama.cpp
-	bash $(CURRENT_MAKEFILE_DIR)/apply-patches.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/llama.cpp $(PATCHES_DIR)
-	SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" \
-	LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
-	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build grpc-server
-	cp -rfv $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/grpc-server turboquant-cpu-all
-	rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs
-	find $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \;
-	@echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/
-
 turboquant-grpc:
 	$(call turboquant-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target rpc-server)

--- a/backend/cpp/turboquant/package.sh
+++ b/backend/cpp/turboquant/package.sh
@@ -14,15 +14,6 @@ mkdir -p $CURDIR/package/lib
 cp -avrf $CURDIR/turboquant-* $CURDIR/package/
 cp -rfv $CURDIR/run.sh $CURDIR/package/

-# Bundle the ggml shared backends from the CPU_ALL_VARIANTS build into package/lib. ggml
-# discovers the per-microarch libggml-cpu-*.so by scanning the executable directory, which
-# (via the bundled lib/ld.so that run.sh launches through) resolves to lib/. See the
-# matching comment in backend/cpp/llama-cpp/package.sh. No-op on the fallback/ROCm builds.
-if [ -d "$CURDIR/ggml-shared-libs" ]; then
-    echo "Bundling ggml shared backends (CPU_ALL_VARIANTS)..."
-    cp -avf $CURDIR/ggml-shared-libs/*.so* $CURDIR/package/lib/
-fi
-
 # Detect architecture and copy appropriate libraries
 if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    # x86_64 architecture
--- a/backend/cpp/turboquant/patch-grpc-server.sh
+++ b/backend/cpp/turboquant/patch-grpc-server.sh
@@ -4,19 +4,21 @@
 #
 #   1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
 #      fork-specific `turbo2` / `turbo3` / `turbo4` cache types.
-#   2. Define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top of the file
-#      so the grpc-server option parser skips the two references to
-#      common_params::checkpoint_min_step (the default and the option handler).
-#      That field does not exist in the fork yet; drop this once it does.
-#
-# The fork used to lag upstream on the whole common_params_speculative refactor
-# (ggml-org/llama.cpp#22397/#22838/#22964), the model_tgt rename (#22838) and
-# get_media_marker (#21962), which required a much larger compat shim here
-# (flat-field sed renames + a coarse LOCALAI_LEGACY_LLAMA_CPP_SPEC define). The
-# fork has since rebased past all of those, so the only remaining gap is
-# checkpoint_min_step. If a future bump reintroduces a divergence, add a narrow
-# guard in grpc-server.cpp keyed on a fork-specific macro and inject it here
-# rather than resurrecting the coarse one.
+#   2. Replace `get_media_marker()` (added upstream in ggml-org/llama.cpp#21962,
+#      server-side random per-instance marker) with the legacy "<__media__>"
+#      literal. The fork branched before that PR, so server-common.cpp has no
+#      get_media_marker symbol. The fork's mtmd_default_marker() still returns
+#      "<__media__>", and Go-side tooling falls back to that sentinel when the
+#      backend does not expose media_marker, so substituting the literal keeps
+#      behavior identical on the turboquant path.
+#   3. Revert the `common_params_speculative` field references to the
+#      pre-refactor flat layout. Upstream ggml-org/llama.cpp#22397 split the
+#      struct into nested `draft` / `ngram_simple` / `ngram_mod` / etc. members;
+#      the turboquant fork branched before that PR and still exposes the flat
+#      `n_max`, `mparams_dft`, `ngram_size_n`, ... fields. The substitutions
+#      below map the new nested paths back to the legacy flat names so the
+#      shared grpc-server.cpp keeps compiling against the fork's common.h.
+#      Drop this block once the fork rebases past #22397.
 #
 # We patch the *copy* sitting in turboquant-<flavor>-build/, never the original
 # under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps compiling
@@ -70,20 +72,72 @@ else
    echo "==> KV allow-list patch OK"
 fi

-# 2. Define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top of the file so
-#    the grpc-server option parser skips the two references to
-#    common_params::checkpoint_min_step (the default assignment and the option
-#    handler). That field does not exist in the fork yet. Drop this block once
-#    the fork rebases past the bump that added checkpoint_min_step.
-if grep -q '^#define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP' "$SRC"; then
-    echo "==> $SRC already defines LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP, skipping"
+if grep -q 'get_media_marker()' "$SRC"; then
+    echo "==> patching $SRC to replace get_media_marker() with legacy \"<__media__>\" literal"
+    # Only one call site today (ModelMetadata), but replace all occurrences to
+    # stay robust if upstream adds more. Write to a temp file and rename it
+    # instead of relying on non-portable `sed -i` (the builder image uses GNU
+    # sed, but this stays consistent with the awk block above).
+    sed 's/get_media_marker()/"<__media__>"/g' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+    echo "==> get_media_marker() substitution OK"
 else
-    echo "==> patching $SRC to define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top"
-    # Insert the define before the very first `#include` so it precedes the
-    # checkpoint_min_step references.
+    echo "==> $SRC has no get_media_marker() call, skipping media-marker patch"
+fi
+
+if grep -q 'params\.speculative\.draft\.\|params\.speculative\.ngram_simple\.' "$SRC"; then
+    echo "==> patching $SRC to revert common_params_speculative refs to pre-#22397 flat layout"
+    # Each substitution is the exact post-refactor path → legacy flat field.
+    # Order doesn't matter because the source paths are disjoint, but we keep
+    # the most-specific (mparams.path) first for readability.
+    sed -E \
+        -e 's/params\.speculative\.draft\.mparams\.path/params.speculative.mparams_dft.path/g' \
+        -e 's/params\.speculative\.draft\.n_max/params.speculative.n_max/g' \
+        -e 's/params\.speculative\.draft\.n_min/params.speculative.n_min/g' \
+        -e 's/params\.speculative\.draft\.p_min/params.speculative.p_min/g' \
+        -e 's/params\.speculative\.draft\.p_split/params.speculative.p_split/g' \
+        -e 's/params\.speculative\.draft\.n_gpu_layers/params.speculative.n_gpu_layers/g' \
+        -e 's/params\.speculative\.draft\.n_ctx/params.speculative.n_ctx/g' \
+        -e 's/params\.speculative\.ngram_simple\.size_n/params.speculative.ngram_size_n/g' \
+        -e 's/params\.speculative\.ngram_simple\.size_m/params.speculative.ngram_size_m/g' \
+        -e 's/params\.speculative\.ngram_simple\.min_hits/params.speculative.ngram_min_hits/g' \
+        "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+    echo "==> speculative field rename OK"
+else
+    echo "==> $SRC has no post-#22397 speculative field refs, skipping spec rename patch"
+fi
+
+# 4. Revert the `ctx_server.impl->model_tgt` rename introduced by upstream
+#    ggml-org/llama.cpp#22838 (parallel drafting). The turboquant fork still
+#    exposes the field as `model` on `server_context_impl`. The two call sites
+#    are in the Rerank and ModelMetadata RPC handlers.
+if grep -q 'ctx_server\.impl->model_tgt' "$SRC"; then
+    echo "==> patching $SRC to revert ctx_server.impl->model_tgt -> ctx_server.impl->model"
+    sed -E 's/ctx_server\.impl->model_tgt/ctx_server.impl->model/g' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+    echo "==> model_tgt rename OK"
+else
+    echo "==> $SRC has no ctx_server.impl->model_tgt refs, skipping model_tgt rename patch"
+fi
+
+# 5. Define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top of the file so the
+#    grpc-server option parser skips the new option-handler blocks (ngram_mod,
+#    ngram_map_k, ngram_map_k4v, ngram_cache, draft.cache_type_*, draft.cpuparams*,
+#    draft.tensor_buft_overrides) introduced for the post-#22838 layout, the
+#    draft.tensor_buft_overrides sentinel termination, and the
+#    common_params::checkpoint_min_step default/option (added with the
+#    35c9b1f3 bump). Those blocks reference struct fields that simply do not
+#    exist in the fork.
+if grep -q '^#define LOCALAI_LEGACY_LLAMA_CPP_SPEC' "$SRC"; then
+    echo "==> $SRC already defines LOCALAI_LEGACY_LLAMA_CPP_SPEC, skipping"
+else
+    echo "==> patching $SRC to define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top"
+    # Insert the define before the very first `#include` so it precedes all the
+    # speculative-decoding code paths.
    awk '
        !done && /^#include/ {
-            print "#define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP 1"
+            print "#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1"
            print "// ^ injected by backend/cpp/turboquant/patch-grpc-server.sh"
            print ""
            done = 1
@@ -91,13 +145,13 @@ else
        { print }
        END {
            if (!done) {
-                print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP" > "/dev/stderr"
+                print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_LEGACY_LLAMA_CPP_SPEC" > "/dev/stderr"
                exit 1
            }
        }
    ' "$SRC" > "$SRC.tmp"
    mv "$SRC.tmp" "$SRC"
-    echo "==> LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP define OK"
+    echo "==> LOCALAI_LEGACY_LLAMA_CPP_SPEC define OK"
 fi

 echo "==> all patches applied"
--- a/backend/cpp/turboquant/patches/0001-hip-guard-copy2d-peer-fastpath.patch
+++ b/backend/cpp/turboquant/patches/0001-hip-guard-copy2d-peer-fastpath.patch
@@ -1,55 +0,0 @@
-hip: port the turboquant CUDA additions that ggml's HIP shim doesn't cover
-
-The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs
-that ggml's HIP (and MUSA) compatibility layer does not provide, breaking
-the -gpu-rocm-hipblas-turboquant build:
-
-  1. ggml_cuda_copy2d_across_devices() (host-staged cross-device copy for
-     split mul_mat output) uses the CUDA 3D-peer copy APIs
-     cudaMemcpy3DPeerParms / make_cudaPitchedPtr / make_cudaExtent /
-     cudaMemcpy3DPeerAsync. HIP genuinely does not support these (see the
-     fork's own comment "HIP does not support cudaMemcpy3DPeerAsync"), so
-     guard the peer fast path with #if !defined(GGML_USE_HIP) &&
-     !defined(GGML_USE_MUSA) -- matching how the fork already guards the
-     same API for the sibling 2D copy -- and fall through to the existing
-     cudaMemcpyAsync staging fallback below (functionally identical,
-     slightly slower on multi-GPU ROCm).
-
-  2. ggml_backend_cuda_device_event_new() creates its event with plain
-     cudaEventCreate, which ggml's HIP shim does not alias (it only aliases
-     cudaEventCreateWithFlags). Use cudaEventCreateWithFlags(..., 
-     cudaEventDisableTiming) -- exactly what the rest of this file already
-     does (cf. lines ~1034, ~3461) and HIP-safe.
-
-CUDA builds are unaffected. Drop the relevant hunk once the fork HIP-ports
-these; apply-patches.sh fails fast if an anchor goes stale.
-
-diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
-index 0427e6b..6352e6a 100644
---- a/ggml/src/ggml-cuda/ggml-cuda.cu
-+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
-@@ -1933,6 +1933,7 @@ static cudaError_t ggml_cuda_copy2d_across_devices(
-     size_t width, size_t height, cudaStream_t dst_stream, cudaStream_t src_stream) {
- 
-     const auto & info = ggml_cuda_info();
-+#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)  // 3D-peer copy types unmapped by ggml's HIP/MUSA shim; use staging fallback below
-     if (info.peer_access[src_device][dst_device]) {
-         cudaMemcpy3DPeerParms p = {};
-         p.dstDevice = dst_device;
-@@ -1942,6 +1943,7 @@ static cudaError_t ggml_cuda_copy2d_across_devices(
-         p.extent = make_cudaExtent(width, height, 1);
-         return cudaMemcpy3DPeerAsync(&p, dst_stream);
-     }
-+#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
- 
-     // Fallback: stage all rows through a single contiguous pinned buffer
-     int prev_device = ggml_cuda_get_device();
-@@ -5714,7 +5716,7 @@ static ggml_backend_event_t ggml_backend_cuda_device_event_new(ggml_backend_dev_
-     ggml_cuda_set_device(dev_ctx->device);
- 
-     cudaEvent_t event;
--    CUDA_CHECK(cudaEventCreate(&event));
-+    CUDA_CHECK(cudaEventCreateWithFlags(&event, cudaEventDisableTiming));
- 
-     return new ggml_backend_event {
-         /* .device  = */ dev,
--- a/backend/cpp/turboquant/run.sh
+++ b/backend/cpp/turboquant/run.sh
@@ -2,7 +2,7 @@
 set -ex

 # Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath "$0")")
+CURDIR=$(dirname "$(realpath $0)")

 cd /

@@ -12,39 +12,54 @@ grep -e "flags" /proc/cpuinfo | head -1

 BINARY=turboquant-fallback

-# x86/arm64 ship a single turboquant-cpu-all built with ggml CPU_ALL_VARIANTS: ggml's
-# backend registry dlopens the best libggml-cpu-*.so for this host, so no shell-side
-# probing. ROCm ships only turboquant-fallback, so fall back to it when cpu-all is absent.
-if [ -e "$CURDIR"/turboquant-cpu-all ]; then
-	BINARY=turboquant-cpu-all
+if grep -q -e "\savx\s" /proc/cpuinfo ; then
+	echo "CPU:    AVX    found OK"
+	if [ -e $CURDIR/turboquant-avx ]; then
+		BINARY=turboquant-avx
+	fi
+fi
+
+if grep -q -e "\savx2\s" /proc/cpuinfo ; then
+	echo "CPU:    AVX2   found OK"
+	if [ -e $CURDIR/turboquant-avx2 ]; then
+		BINARY=turboquant-avx2
+	fi
+fi
+
+# Check avx 512
+if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
+	echo "CPU:    AVX512F found OK"
+	if [ -e $CURDIR/turboquant-avx512 ]; then
+		BINARY=turboquant-avx512
+	fi
 fi

 if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then
-	if [ -e "$CURDIR"/turboquant-grpc ]; then
+	if [ -e $CURDIR/turboquant-grpc ]; then
 		BINARY=turboquant-grpc
 	fi
 fi

 # Extend ld library path with the dir where this script is located/lib
 if [ "$(uname)" == "Darwin" ]; then
-	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
+	export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
 else
-	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
+	export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 	# Tell rocBLAS where to find TensileLibrary data (GPU kernel tuning files)
 	if [ -d "$CURDIR/lib/rocblas/library" ]; then
-		export ROCBLAS_TENSILE_LIBPATH="$CURDIR"/lib/rocblas/library
+		export ROCBLAS_TENSILE_LIBPATH=$CURDIR/lib/rocblas/library
 	fi
 fi

 # If there is a lib/ld.so, use it
-if [ -f "$CURDIR"/lib/ld.so ]; then
+if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using binary: $BINARY"
-	exec "$CURDIR"/lib/ld.so "$CURDIR"/$BINARY "$@"
+	exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
 fi

 echo "Using binary: $BINARY"
-exec "$CURDIR"/$BINARY "$@"
+exec $CURDIR/$BINARY "$@"

 # We should never reach this point, however just in case we do, run fallback
-exec "$CURDIR"/turboquant-fallback "$@"
+exec $CURDIR/turboquant-fallback "$@"
--- a/backend/go/acestep-cpp/Makefile
+++ b/backend/go/acestep-cpp/Makefile
@@ -117,8 +117,7 @@ libgoacestepcpp-custom: CMakeLists.txt cpp/goacestepcpp.cpp cpp/goacestepcpp.h
 	cmake .. $(CMAKE_ARGS) && \
 	cmake --build . --config Release -j$(JOBS) --target goacestepcpp && \
 	cd .. && \
-	(mv build-$(SO_TARGET)/libgoacestepcpp.so ./$(SO_TARGET) 2>/dev/null || \
-	 mv build-$(SO_TARGET)/libgoacestepcpp.dylib ./$(SO_TARGET) 2>/dev/null)
+	mv build-$(SO_TARGET)/libgoacestepcpp.so ./$(SO_TARGET)

 test: acestep-cpp
 	@echo "Running acestep-cpp tests..."
--- a/backend/go/acestep-cpp/main.go
+++ b/backend/go/acestep-cpp/main.go
@@ -4,7 +4,6 @@ package main
 import (
 	"flag"
 	"os"
-	"runtime"

 	"github.com/ebitengine/purego"
 	grpc "github.com/mudler/LocalAI/pkg/grpc"
@@ -23,11 +22,7 @@ func main() {
 	// Get library name from environment variable, default to fallback
 	libName := os.Getenv("ACESTEP_LIBRARY")
 	if libName == "" {
-		if runtime.GOOS == "darwin" {
-			libName = "./libgoacestepcpp-fallback.dylib"
-		} else {
-			libName = "./libgoacestepcpp-fallback.so"
-		}
+		libName = "./libgoacestepcpp-fallback.so"
 	}

 	gosd, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
--- a/backend/go/acestep-cpp/package.sh
+++ b/backend/go/acestep-cpp/package.sh
@@ -13,7 +13,6 @@ mkdir -p $CURDIR/package/lib

 cp -avf $CURDIR/acestep-cpp $CURDIR/package/
 cp -fv $CURDIR/libgoacestepcpp-*.so $CURDIR/package/
-cp -fv $CURDIR/libgoacestepcpp-*.dylib $CURDIR/package/ 2>/dev/null || true
 cp -fv $CURDIR/run.sh $CURDIR/package/

 # Detect architecture and copy appropriate libraries
--- a/backend/go/acestep-cpp/run.sh
+++ b/backend/go/acestep-cpp/run.sh
@@ -2,7 +2,7 @@
 set -ex

 # Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath "$0")")
+CURDIR=$(dirname "$(realpath $0)")

 cd /

@@ -12,29 +12,19 @@ if [ "$(uname)" != "Darwin" ]; then
 	grep -e "flags" /proc/cpuinfo | head -1
 fi

-if [ "$(uname)" = "Darwin" ]; then
-	# macOS: single library variant (Metal or Accelerate). The goacestepcpp
-	# target is built as a CMake MODULE, which emits a .dylib for a SHARED
-	# build but a .so for a MODULE build on Apple, so prefer .dylib and fall
-	# back to .so.
-	LIBRARY="$CURDIR/libgoacestepcpp-fallback.dylib"
-	if [ ! -e "$LIBRARY" ]; then
-		LIBRARY="$CURDIR/libgoacestepcpp-fallback.so"
-	fi
-	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
-else
-	LIBRARY="$CURDIR/libgoacestepcpp-fallback.so"
+LIBRARY="$CURDIR/libgoacestepcpp-fallback.so"

+if [ "$(uname)" != "Darwin" ]; then
 	if grep -q -e "\savx\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX    found OK"
-		if [ -e "$CURDIR"/libgoacestepcpp-avx.so ]; then
+		if [ -e $CURDIR/libgoacestepcpp-avx.so ]; then
 			LIBRARY="$CURDIR/libgoacestepcpp-avx.so"
 		fi
 	fi

 	if grep -q -e "\savx2\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX2   found OK"
-		if [ -e "$CURDIR"/libgoacestepcpp-avx2.so ]; then
+		if [ -e $CURDIR/libgoacestepcpp-avx2.so ]; then
 			LIBRARY="$CURDIR/libgoacestepcpp-avx2.so"
 		fi
 	fi
@@ -42,22 +32,21 @@ else
 	# Check avx 512
 	if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX512F found OK"
-		if [ -e "$CURDIR"/libgoacestepcpp-avx512.so ]; then
+		if [ -e $CURDIR/libgoacestepcpp-avx512.so ]; then
 			LIBRARY="$CURDIR/libgoacestepcpp-avx512.so"
 		fi
 	fi
-
-	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
 fi

+export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 export ACESTEP_LIBRARY=$LIBRARY

 # If there is a lib/ld.so, use it
-if [ -f "$CURDIR"/lib/ld.so ]; then
+if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using library: $LIBRARY"
-	exec "$CURDIR"/lib/ld.so "$CURDIR"/acestep-cpp "$@"
+	exec $CURDIR/lib/ld.so $CURDIR/acestep-cpp "$@"
 fi

 echo "Using library: $LIBRARY"
-exec "$CURDIR"/acestep-cpp "$@"
+exec $CURDIR/acestep-cpp "$@"
--- a/backend/go/ced/.gitignore
+++ b/backend/go/ced/.gitignore
@@ -1,11 +0,0 @@
-.cache/
-sources/
-build/
-package/
-ced-grpc
-# build artifacts staged in-tree by the Makefile (cp from sources/) or
-# symlinked for local dev; the real sources live in ced.cpp upstream.
-*.so
-*.so.*
-ced_capi.h
-compile_commands.json
--- a/backend/go/ced/Makefile
+++ b/backend/go/ced/Makefile
@@ -1,78 +0,0 @@
-# ced sound-classification backend Makefile.
-#
-# Upstream pin lives below as CED_VERSION?=<sha> so .github/bump_deps.sh can find
-# and update it (matches the parakeet-cpp / whisper.cpp convention).
-#
-# Local dev shortcut: symlink an out-of-tree ced.cpp shared build + header and
-# skip the clone/cmake steps entirely:
-#   ln -sf /path/to/ced.cpp/build-shared/libced.so .
-#   ln -sf /path/to/ced.cpp/include/ced_capi.h .
-#   go build -o ced-grpc .
-
-CED_VERSION?=c04ac14b7992d00584d9e812c9bb6268598a6ce7
-CED_REPO?=https://github.com/mudler/ced.cpp
-
-GOCMD?=go
-GO_TAGS?=
-JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
-
-BUILD_TYPE?=
-NATIVE?=false
-
-# Static-link ggml into libced.so (PIC) so the shared lib is self-contained:
-# dlopen needs no libggml*.so alongside it, only system libs the runtime image
-# already provides.
-CMAKE_ARGS?=-DCMAKE_BUILD_TYPE=Release -DCED_SHARED=ON -DCED_BUILD_CLI=OFF -DCED_BUILD_TESTS=OFF -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON
-
-ifeq ($(NATIVE),false)
-	CMAKE_ARGS+=-DGGML_NATIVE=OFF
-endif
-
-# ced.cpp gates its ggml backends behind CED_GGML_* options (set(... CACHE BOOL
-# "" FORCE)), so forward those instead of a bare -DGGML_CUDA=ON.
-ifeq ($(BUILD_TYPE),cublas)
-	CMAKE_ARGS+=-DCED_GGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON
-else ifeq ($(BUILD_TYPE),openblas)
-	CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
-else ifeq ($(BUILD_TYPE),hipblas)
-	CMAKE_ARGS+=-DCED_GGML_HIP=ON
-else ifeq ($(BUILD_TYPE),vulkan)
-	CMAKE_ARGS+=-DCED_GGML_VULKAN=ON
-endif
-
-.PHONY: ced-grpc package build clean purge test all
-
-all: ced-grpc
-
-sources/ced.cpp:
-	mkdir -p sources/ced.cpp
-	cd sources/ced.cpp && \
-	git init -q && \
-	git remote add origin $(CED_REPO) && \
-	git fetch --depth 1 origin $(CED_VERSION) && \
-	git checkout FETCH_HEAD && \
-	git submodule update --init --recursive --depth 1 --single-branch
-
-libced.so: sources/ced.cpp
-	cmake -B sources/ced.cpp/build-shared -S sources/ced.cpp $(CMAKE_ARGS)
-	cmake --build sources/ced.cpp/build-shared --config Release -j$(JOBS)
-	cp -fv sources/ced.cpp/build-shared/libced.so* ./ 2>/dev/null || true
-	cp -fv sources/ced.cpp/build-shared/libced.dylib ./ 2>/dev/null || true
-	cp -fv sources/ced.cpp/include/ced_capi.h ./
-
-ced-grpc: libced.so main.go goced.go
-	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o ced-grpc .
-
-package: ced-grpc
-	bash package.sh
-
-build: package
-
-test:
-	LD_LIBRARY_PATH=$(CURDIR):$$LD_LIBRARY_PATH $(GOCMD) test ./... -count=1
-
-clean: purge
-	rm -rf libced.so* ced_capi.h package ced-grpc
-
-purge:
-	rm -rf sources/ced.cpp
--- a/backend/go/ced/goced.go
+++ b/backend/go/ced/goced.go
@@ -1,130 +0,0 @@
-package main
-
-// Go side of the ced backend: purego bindings over ced_capi.h plus the gRPC
-// SoundDetection implementation.
-//
-// SKETCH: the pb.SoundDetection* types come from backend.proto (regenerate with
-// `make protogen-go`). The C side is single-threaded per ctx, so we guard the
-// engine with engineMu; LocalAI also serializes via base.SingleThread.
-import (
-	"context"
-	"encoding/json"
-	"errors"
-	"fmt"
-	"sort"
-	"sync"
-	"unsafe"
-
-	"github.com/mudler/LocalAI/pkg/grpc/base"
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-)
-
-// purego-bound entry points from libced.so. Names match ced_capi.h exactly.
-var (
-	CppAbiVersion       func() int32
-	CppLoad             func(ggufPath string) uintptr
-	CppFree             func(ctx uintptr)
-	CppLastError        func(ctx uintptr) string
-	CppNumClasses       func(ctx uintptr) int32
-	CppSampleRate       func(ctx uintptr) int32
-	CppClassifyPathJSON func(ctx uintptr, wavPath string, topK int32) uintptr
-	CppClassifyPcmJSON  func(ctx uintptr, pcm []float32, nSamples int32, sampleRate int32, topK int32) uintptr
-	CppFreeString       func(s uintptr)
-)
-
-// cstr copies a malloc'd C string (returned as uintptr) into a Go string and
-// frees the original via ced_capi_free_string. Empty/0 -> "".
-func cstr(p uintptr) string {
-	if p == 0 {
-		return ""
-	}
-	defer CppFreeString(p)
-	var b []byte
-	for i := 0; ; i++ {
-		ch := *(*byte)(unsafe.Pointer(p + uintptr(i))) //nolint:govet // #nosec G103 -- C-owned NUL-terminated string from libced (not Go-GC memory)
-		if ch == 0 {
-			break
-		}
-		b = append(b, ch)
-	}
-	return string(b)
-}
-
-// Ced is the gRPC backend. One loaded CED model per instance.
-type Ced struct {
-	base.Base
-	ctxPtr   uintptr
-	engineMu sync.Mutex
-}
-
-// Load resolves the GGUF and opens the C-API context.
-func (c *Ced) Load(opts *pb.ModelOptions) error {
-	if opts.ModelFile == "" {
-		return errors.New("ced: ModelFile is required")
-	}
-	ctx := CppLoad(opts.ModelFile)
-	if ctx == 0 {
-		return fmt.Errorf("ced: ced_capi_load failed for %q: %s", opts.ModelFile, CppLastError(0))
-	}
-	c.ctxPtr = ctx
-	return nil
-}
-
-// jsonTag mirrors the ced_capi JSON tag objects.
-type jsonTag struct {
-	Index int     `json:"index"`
-	Score float32 `json:"score"`
-	Label string  `json:"label"`
-}
-
-// SoundDetection classifies the clip at req.Src and returns scored AudioSet tags.
-func (c *Ced) SoundDetection(ctx context.Context, req *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) {
-	if c.ctxPtr == 0 {
-		return nil, errors.New("ced: model not loaded")
-	}
-	if req.GetSrc() == "" {
-		return nil, errors.New("ced: SoundDetectionRequest.src (audio path) is required")
-	}
-	topK := req.GetTopK()
-	if topK <= 0 {
-		topK = 10 // sensible default for a tagging response
-	}
-
-	c.engineMu.Lock()
-	out := cstr(CppClassifyPathJSON(c.ctxPtr, req.GetSrc(), topK))
-	lastErr := CppLastError(c.ctxPtr)
-	c.engineMu.Unlock()
-
-	if out == "" {
-		return nil, fmt.Errorf("ced: classification failed: %s", lastErr)
-	}
-	var tags []jsonTag
-	if err := json.Unmarshal([]byte(out), &tags); err != nil {
-		return nil, fmt.Errorf("ced: bad classifier JSON: %w", err)
-	}
-
-	thr := req.GetThreshold()
-	resp := &pb.SoundDetectionResponse{}
-	for _, t := range tags {
-		if t.Score < thr {
-			continue
-		}
-		resp.Detections = append(resp.Detections, &pb.SoundClass{
-			Label: t.Label, Score: t.Score, Index: int32(t.Index),
-		})
-	}
-	sort.Slice(resp.Detections, func(i, j int) bool {
-		return resp.Detections[i].Score > resp.Detections[j].Score
-	})
-	return resp, nil
-}
-
-func (c *Ced) Free() error {
-	c.engineMu.Lock()
-	defer c.engineMu.Unlock()
-	if c.ctxPtr != 0 {
-		CppFree(c.ctxPtr)
-		c.ctxPtr = 0
-	}
-	return nil
-}
--- a/backend/go/ced/main.go
+++ b/backend/go/ced/main.go
@@ -1,64 +0,0 @@
-package main
-
-// ced sound-classification backend. Started internally by LocalAI: one gRPC
-// server per loaded model. Loads libced.so via purego and registers the flat
-// C-API declared in ced_capi.h. The library name can be overridden with
-// CED_LIBRARY (mirrors PARAKEET_LIBRARY / WHISPER_LIBRARY); the default looks
-// for the .so next to this binary.
-//
-// SKETCH: requires `make protogen-go` after the backend.proto SoundDetection
-// addition, and a built libced.so (see Makefile). See DESIGN.md.
-import (
-	"flag"
-	"fmt"
-	"os"
-	"runtime"
-
-	"github.com/ebitengine/purego"
-	grpc "github.com/mudler/LocalAI/pkg/grpc"
-)
-
-var addr = flag.String("addr", "localhost:50051", "the address to connect to")
-
-type libFunc struct {
-	ptr  any
-	name string
-}
-
-func main() {
-	libName := os.Getenv("CED_LIBRARY")
-	if libName == "" {
-		if runtime.GOOS == "darwin" {
-			libName = "libced.dylib"
-		} else {
-			libName = "libced.so"
-		}
-	}
-	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
-	if err != nil {
-		panic(fmt.Errorf("ced: dlopen %q: %w", libName, err))
-	}
-
-	// Bound 1:1 to ced_capi.h. char*-returning functions are declared uintptr
-	// so we can free the same pointer with ced_capi_free_string after copying
-	// (purego's string return would copy and leak the original).
-	for _, lf := range []libFunc{
-		{&CppAbiVersion, "ced_capi_abi_version"},
-		{&CppLoad, "ced_capi_load"},
-		{&CppFree, "ced_capi_free"},
-		{&CppLastError, "ced_capi_last_error"},
-		{&CppNumClasses, "ced_capi_num_classes"},
-		{&CppSampleRate, "ced_capi_sample_rate"},
-		{&CppClassifyPathJSON, "ced_capi_classify_path_json"},
-		{&CppClassifyPcmJSON, "ced_capi_classify_pcm_json"},
-		{&CppFreeString, "ced_capi_free_string"},
-	} {
-		purego.RegisterLibFunc(lf.ptr, lib, lf.name)
-	}
-
-	fmt.Fprintf(os.Stderr, "[ced] ABI=%d\n", CppAbiVersion())
-	flag.Parse()
-	if err := grpc.StartServer(*addr, &Ced{}); err != nil {
-		panic(err)
-	}
-}
--- a/backend/go/ced/package.sh
+++ b/backend/go/ced/package.sh
@@ -1,62 +0,0 @@
-#!/bin/bash
-#
-# Bundle the ced-grpc binary, libced.so, the core runtime libs (libc/libstdc++/
-# libgomp + ld.so) and the GPU runtime for the active BUILD_TYPE so the package
-# is self-contained. Mirrors backend/go/parakeet-cpp/package.sh; run.sh routes
-# the (CGO_ENABLED=0) binary through lib/ld.so so the packaged libc is used.
-
-set -e
-
-CURDIR=$(dirname "$(realpath "$0")")
-REPO_ROOT="${CURDIR}/../../.."
-
-mkdir -p "$CURDIR/package/lib"
-
-cp -avf "$CURDIR/ced-grpc" "$CURDIR/package/"
-cp -avf "$CURDIR/run.sh" "$CURDIR/package/"
-
-cp -avf "$CURDIR"/libced.so* "$CURDIR/package/lib/" 2>/dev/null || true
-cp -avf "$CURDIR"/libced.dylib "$CURDIR/package/lib/" 2>/dev/null || true
-if ! ls "$CURDIR"/package/lib/libced.* >/dev/null 2>&1; then
-	echo "ERROR: libced shared library not found in $CURDIR, run 'make' first" >&2
-	exit 1
-fi
-
-if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
-    echo "Detected x86_64 architecture, copying x86_64 libraries..."
-    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
-    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
-    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
-    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
-    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
-    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
-    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
-    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
-    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
-elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
-    echo "Detected ARM64 architecture, copying ARM64 libraries..."
-    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
-    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
-    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
-    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
-    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
-    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
-    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
-    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
-    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
-elif [ "$(uname -s)" = "Darwin" ]; then
-    echo "Detected Darwin"
-else
-    echo "Error: Could not detect architecture"
-    exit 1
-fi
-
-GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
-if [ -f "$GPU_LIB_SCRIPT" ]; then
-    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
-    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
-    package_gpu_libs
-fi
-
-echo "Packaging completed successfully"
-ls -liah "$CURDIR/package/" "$CURDIR/package/lib/"
--- a/backend/go/ced/run.sh
+++ b/backend/go/ced/run.sh
@@ -1,20 +0,0 @@
-#!/bin/bash
-set -e
-
-CURDIR=$(dirname "$(realpath "$0")")
-
-if [ "$(uname)" = "Darwin" ]; then
-	export DYLD_LIBRARY_PATH="$CURDIR/lib:"$CURDIR":${DYLD_LIBRARY_PATH:-}"
-	export CED_LIBRARY="$CURDIR/lib/libced.dylib"
-else
-	export LD_LIBRARY_PATH="$CURDIR/lib:"$CURDIR":${LD_LIBRARY_PATH:-}"
-fi
-
-# If a self-contained ld.so was packaged, route through it so the packaged
-# libc / libstdc++ are used instead of the host's (matches the sibling backends).
-if [ -f "$CURDIR/lib/ld.so" ]; then
-	echo "Using lib/ld.so"
-	exec "$CURDIR/lib/ld.so" "$CURDIR/ced-grpc" "$@"
-fi
-
-exec "$CURDIR/ced-grpc" "$@"
--- a/backend/go/cloud-proxy/proxy.go
+++ b/backend/go/cloud-proxy/proxy.go
@@ -14,7 +14,6 @@ import (
 	"github.com/mudler/xlog"

 	"github.com/mudler/LocalAI/pkg/grpc/base"
-	"github.com/mudler/LocalAI/pkg/grpc/grpcerrors"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 	"github.com/mudler/LocalAI/pkg/httpclient"
 )
@@ -146,7 +145,7 @@ func resolveAPIKey(envName, filePath string) (string, error) {
 func (c *CloudProxy) PredictRich(opts *pb.PredictOptions) (reply *pb.Reply, err error) {
 	cfg := c.cfg.Load()
 	if cfg == nil {
-		return nil, grpcerrors.ModelNotLoaded("cloud-proxy")
+		return nil, errors.New("cloud-proxy: model not loaded")
 	}
 	if cfg.mode != modeTranslate {
 		return nil, fmt.Errorf("cloud-proxy: Predict only valid in translate mode (have %s)", cfg.mode)
@@ -176,7 +175,7 @@ func (c *CloudProxy) PredictRich(opts *pb.PredictOptions) (reply *pb.Reply, err
 func (c *CloudProxy) PredictStreamRich(opts *pb.PredictOptions, results chan<- *pb.Reply) (err error) {
 	cfg := c.cfg.Load()
 	if cfg == nil {
-		return grpcerrors.ModelNotLoaded("cloud-proxy")
+		return errors.New("cloud-proxy: model not loaded")
 	}
 	if cfg.mode != modeTranslate {
 		return fmt.Errorf("cloud-proxy: PredictStream only valid in translate mode (have %s)", cfg.mode)
@@ -270,7 +269,7 @@ func (c *CloudProxy) Forward(ctx context.Context, in <-chan *pb.ForwardRequest,

 	cfg := c.cfg.Load()
 	if cfg == nil {
-		return grpcerrors.ModelNotLoaded("cloud-proxy")
+		return errors.New("cloud-proxy: model not loaded")
 	}
 	if cfg.mode != modePassthrough {
 		return fmt.Errorf("cloud-proxy: Forward only valid in passthrough mode (have %s)", cfg.mode)
--- a/backend/go/cloud-proxy/run.sh
+++ b/backend/go/cloud-proxy/run.sh
@@ -1,6 +1,6 @@
 #!/bin/bash
 set -ex

-CURDIR=$(dirname "$(realpath "$0")")
+CURDIR=$(dirname "$(realpath $0)")

-exec "$CURDIR"/cloud-proxy "$@"
+exec $CURDIR/cloud-proxy "$@"
--- a/backend/go/crispasr/CMakeLists.txt
+++ b/backend/go/crispasr/CMakeLists.txt
@@ -14,7 +14,7 @@ target_include_directories(gocrispasr PRIVATE
 # whisper. crispasr is the referencer; the backend static libs supply the
 # per-architecture symbols; ggml is the math/runtime base.
 target_link_libraries(gocrispasr PRIVATE
-    crispasr-lib
+    crispasr
    parakeet canary canary_ctc cohere granite_speech granite_nle
    voxtral voxtral4b qwen3_asr qwen3_tts orpheus chatterbox indextts
    kokoro voxcpm2_tts m2m100 t5_translate wav2vec2-ggml vibevoice
--- a/backend/go/crispasr/Makefile
+++ b/backend/go/crispasr/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # CrispASR version (release tag)
 CRISPASR_REPO?=https://github.com/CrispStrobe/CrispASR
-CRISPASR_VERSION?=6514c9da00b03a2f0f1b49a43fae4f3a01a41844
+CRISPASR_VERSION?=05e60432bcb5bc2113f8c395a41e86497c11504a
 SO_TARGET?=libgocrispasr.so

 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
@@ -67,7 +67,7 @@ sources/CrispASR:
 	# it, so ${CMAKE_SOURCE_DIR} is THIS backend dir and the talk-llama sources
 	# aren't found. Rewrite to ${PROJECT_SOURCE_DIR} (the crispasr project root),
 	# which is correct both standalone and as a subproject. Idempotent.
-	sed -i.bak 's#\$${CMAKE_SOURCE_DIR}/examples/talk-llama#\$${PROJECT_SOURCE_DIR}/examples/talk-llama#' sources/CrispASR/src/CMakeLists.txt && rm -f sources/CrispASR/src/CMakeLists.txt.bak
+	sed -i 's#\$${CMAKE_SOURCE_DIR}/examples/talk-llama#\$${PROJECT_SOURCE_DIR}/examples/talk-llama#' sources/CrispASR/src/CMakeLists.txt

 # Detect OS
 UNAME_S := $(shell uname -s)
@@ -75,8 +75,7 @@ UNAME_S := $(shell uname -s)
 ifeq ($(UNAME_S),Linux)
 	VARIANT_TARGETS = libgocrispasr-avx.so libgocrispasr-avx2.so libgocrispasr-avx512.so libgocrispasr-fallback.so
 else
-	# On non-Linux (e.g., Darwin), build only fallback variant (as a dylib)
-	VARIANT_TARGETS = libgocrispasr-fallback.dylib
+	VARIANT_TARGETS = libgocrispasr-fallback.so
 endif

 crispasr: main.go gocrispasr.go $(VARIANT_TARGETS)
@@ -88,7 +87,7 @@ package: crispasr
 build: package

 clean: purge
-	rm -rf libgocrispasr*.so libgocrispasr*.dylib package sources/CrispASR crispasr
+	rm -rf libgocrispasr*.so package sources/CrispASR crispasr

 purge:
 	rm -rf build*
@@ -119,21 +118,13 @@ libgocrispasr-fallback.so: sources/CrispASR
 	SO_TARGET=libgocrispasr-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgocrispasr-custom
 	rm -rfv build*

-# Build fallback variant as a dylib (Darwin)
-libgocrispasr-fallback.dylib: sources/CrispASR
-	$(MAKE) purge
-	$(info ${GREEN}I crispasr build info:fallback (dylib)${RESET})
-	SO_TARGET=libgocrispasr-fallback.dylib CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgocrispasr-custom
-	rm -rfv build*
-
 libgocrispasr-custom: CMakeLists.txt cpp/crispasr_shim.cpp cpp/crispasr_shim.h
 	mkdir -p build-$(SO_TARGET) && \
 	cd build-$(SO_TARGET) && \
 	cmake .. $(CMAKE_ARGS) && \
 	cmake --build . --config Release -j$(JOBS) && \
 	cd .. && \
-	(mv build-$(SO_TARGET)/libgocrispasr.so ./$(SO_TARGET) 2>/dev/null || \
-	 mv build-$(SO_TARGET)/libgocrispasr.dylib ./$(SO_TARGET) 2>/dev/null)
+	mv build-$(SO_TARGET)/libgocrispasr.so ./$(SO_TARGET)

 test: crispasr
 	CGO_ENABLED=0 $(GOCMD) test -v ./...
--- a/backend/go/crispasr/cpp/crispasr_shim.cpp
+++ b/backend/go/crispasr/cpp/crispasr_shim.cpp
@@ -47,74 +47,6 @@ extern "C" void set_abort(int v) {
  g_abort.store(v, std::memory_order_relaxed);
 }

-// --- word-level timestamp accessors ---
-extern "C" {
-int crispasr_session_result_n_words(crispasr_session_result *r, int seg_i);
-const char *crispasr_session_result_word_text(crispasr_session_result *r,
-                                               int seg_i, int word_i);
-int64_t crispasr_session_result_word_t0(crispasr_session_result *r, int seg_i,
-                                         int word_i);
-int64_t crispasr_session_result_word_t1(crispasr_session_result *r, int seg_i,
-                                         int word_i);
-
-// Parakeet-specific word accessors
-int crispasr_parakeet_result_n_words(void *r);
-const char *crispasr_parakeet_result_word_text(void *r, int word_i);
-int64_t crispasr_parakeet_result_word_t0(void *r, int word_i);
-int64_t crispasr_parakeet_result_word_t1(void *r, int word_i);
-}
-
-void *get_result(void) { return g_result; }
-
-int get_word_count(int seg_i) {
-  if (!g_result)
-    return 0;
-  return crispasr_session_result_n_words(g_result, seg_i);
-}
-
-const char *get_word_text(int seg_i, int word_i) {
-  if (!g_result)
-    return "";
-  return crispasr_session_result_word_text(g_result, seg_i, word_i);
-}
-
-int64_t get_word_t0(int seg_i, int word_i) {
-  if (!g_result)
-    return 0;
-  return crispasr_session_result_word_t0(g_result, seg_i, word_i);
-}
-
-int64_t get_word_t1(int seg_i, int word_i) {
-  if (!g_result)
-    return 0;
-  return crispasr_session_result_word_t1(g_result, seg_i, word_i);
-}
-
-// Parakeet-specific word accessors
-int get_parakeet_word_count(void) {
-  if (!g_result)
-    return 0;
-  return crispasr_parakeet_result_n_words(g_result);
-}
-
-const char *get_parakeet_word_text(int word_i) {
-  if (!g_result)
-    return "";
-  return crispasr_parakeet_result_word_text(g_result, word_i);
-}
-
-int64_t get_parakeet_word_t0(int word_i) {
-  if (!g_result)
-    return 0;
-  return crispasr_parakeet_result_word_t0(g_result, word_i);
-}
-
-int64_t get_parakeet_word_t1(int word_i) {
-  if (!g_result)
-    return 0;
-  return crispasr_parakeet_result_word_t1(g_result, word_i);
-}
-
 static void ggml_log_cb(enum ggml_log_level level, const char *log,
                        void *data) {
  const char *level_str;
--- a/backend/go/crispasr/cpp/crispasr_shim.h
+++ b/backend/go/crispasr/cpp/crispasr_shim.h
@@ -20,18 +20,4 @@ float *tts_synthesize(const char *text, int *out_n_samples); // 24kHz mono float
 void tts_free(float *pcm);
 int tts_set_voice(const char *name); // best-effort speaker selection; 0 ok
 int tts_set_voice_file(const char *path, const char *ref_text); // load voice pack (.gguf) or zero-shot clone (.wav + ref_text)
-
-// --- word-level timestamp accessors ---
-// Session-based (works for whisper-like backends)
-void *get_result(void);
-int get_word_count(int seg_i);
-const char *get_word_text(int seg_i, int word_i);
-int64_t get_word_t0(int seg_i, int word_i);
-int64_t get_word_t1(int seg_i, int word_i);
-
-// Parakeet-specific (global word list, no segment index)
-int get_parakeet_word_count(void);
-const char *get_parakeet_word_text(int word_i);
-int64_t get_parakeet_word_t0(int word_i);
-int64_t get_parakeet_word_t1(int word_i);
 }
--- a/backend/go/crispasr/gocrispasr.go
+++ b/backend/go/crispasr/gocrispasr.go
@@ -11,7 +11,6 @@ import (

 	"github.com/go-audio/audio"
 	"github.com/go-audio/wav"
-	gguf "github.com/gpustack/gguf-parser-go"
 	"github.com/mudler/LocalAI/pkg/grpc/base"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 	"github.com/mudler/LocalAI/pkg/utils"
@@ -34,55 +33,10 @@ var (
 	CppTTSFree         func(ptr uintptr)
 	CppTTSSetVoice     func(name string) int
 	CppTTSSetVoiceFile func(path string, refText string) int
-
-	// Word-level timestamp accessors (session-based, per-segment)
-	CppGetWordCount func(segI int) int
-	CppGetWordText  func(segI int, wordI int) string
-	CppGetWordT0    func(segI int, wordI int) int64
-	CppGetWordT1    func(segI int, wordI int) int64
-
-	// Parakeet-specific word accessors (global, no segment index)
-	CppGetParakeetWordCount func() int
-	CppGetParakeetWordText  func(wordI int) string
-	CppGetParakeetWordT0    func(wordI int) int64
-	CppGetParakeetWordT1    func(wordI int) int64
 )

 type CrispASR struct {
 	base.SingleThread
-	// sampleRate is the output rate (Hz) of the loaded TTS engine's PCM, used to
-	// write a correct WAV header. Most CrispASR TTS backends emit 24 kHz, but
-	// piper returns its model's native rate (16 kHz for x_low/low voices,
-	// 22.05 kHz for medium/high), so it is read from the GGUF metadata at Load.
-	sampleRate int
-}
-
-// defaultTTSSampleRate is the output rate assumed for CrispASR TTS engines that
-// don't advertise one in GGUF metadata (vibevoice/orpheus/chatterbox/qwen3-tts
-// all emit 24 kHz). piper is the exception and carries piper.sample_rate.
-const defaultTTSSampleRate = 24000
-
-// piperSampleRate reads the piper.sample_rate metadata key from a GGUF model.
-// CrispASR's piper backend returns PCM at the model's native rate without
-// resampling, so the WAV header must match it. Returns ok=false for non-piper
-// models (key absent) or an unreadable file, letting the caller fall back to
-// defaultTTSSampleRate.
-func piperSampleRate(modelPath string) (int, bool) {
-	// Only scalar architecture keys are read, so skip the large array metadata
-	// (phoneme map) and mmap the header - same rationale as pkg/vram's reader.
-	f, err := gguf.ParseGGUFFile(modelPath, gguf.UseMMap(), gguf.SkipLargeMetadata())
-	if err != nil {
-		return 0, false
-	}
-	kv, ok := f.Header.MetadataKV.Get("piper.sample_rate")
-	if !ok || kv.ValueType != gguf.GGUFMetadataValueTypeUint32 {
-		return 0, false
-	}
-	rate := int(kv.ValueUint32())
-	if rate <= 0 {
-		return 0, false
-	}
-	return rate, true
 }

 // splitOption splits a "prefix:value" model option into its key and value,
@@ -149,14 +103,6 @@ func (w *CrispASR) Load(opts *pb.ModelOptions) error {
 		return fmt.Errorf("Failed to load CrispASR transcription model")
 	}

-	// Determine the TTS output sample rate for the WAV header. piper voices
-	// carry their native rate in GGUF metadata and CrispASR does not resample;
-	// every other engine emits the 24 kHz default.
-	w.sampleRate = defaultTTSSampleRate
-	if rate, ok := piperSampleRate(opts.ModelFile); ok {
-		w.sampleRate = rate
-	}
-
 	// Load the companion file (codec/tokenizer/s3gen) after the session is open.
 	// rc==0 means success or "not applicable" for the active backend; only a
 	// negative code is fatal.
@@ -224,28 +170,6 @@ func (w *CrispASR) VAD(req *pb.VADRequest) (pb.VADResponse, error) {
 	}, nil
 }

-// isValidWord reports whether a TranscriptWord contains recognisable speech
-// content. The parakeet-specific word accessors can return stale initialisation
-// data (model name, binary blobs) when a segment has no real speech. A word is
-// considered valid only when:
-//   - the text is non-empty after trimming,
-//   - it contains no U+FFFD replacement characters (from binary data scrubbing),
-//   - both timestamps are non-negative,
-//   - the word has positive duration (end > start).
-func isValidWord(w *pb.TranscriptWord) bool {
-	txt := strings.TrimSpace(w.Text)
-	if txt == "" {
-		return false
-	}
-	if strings.ContainsRune(txt, '\uFFFD') {
-		return false
-	}
-	if w.Start < 0 || w.End < 0 || w.End <= w.Start {
-		return false
-	}
-	return true
-}
-
 func (w *CrispASR) AudioTranscription(ctx context.Context, opts *pb.TranscriptRequest) (pb.TranscriptResult, error) {
 	if err := ctx.Err(); err != nil {
 		return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
@@ -324,54 +248,15 @@ func (w *CrispASR) AudioTranscription(ctx context.Context, opts *pb.TranscriptRe
 		// IDs, so Tokens is left empty.
 		txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "<22>")

-		// Populate word-level timestamps. Try session-based functions first
-		// (per-segment); fall back to parakeet-specific functions (global word
-		// list with no segment index — only populated on the first segment to
-		// avoid duplication).
-		words := []*pb.TranscriptWord{}
-		wordCount := CppGetWordCount(i)
-		if wordCount == 0 && i == 0 {
-			wordCount = CppGetParakeetWordCount()
-			for j := 0; j < wordCount; j++ {
-				w := &pb.TranscriptWord{
-					Start: CppGetParakeetWordT0(j) * (10000000),
-					End:   CppGetParakeetWordT1(j) * (10000000),
-					Text:  strings.ToValidUTF8(strings.Clone(CppGetParakeetWordText(j)), "<22>"),
-				}
-				if isValidWord(w) {
-					words = append(words, w)
-				}
-			}
-		} else {
-			for j := 0; j < wordCount; j++ {
-				w := &pb.TranscriptWord{
-					Start: CppGetWordT0(i, j) * (10000000),
-					End:   CppGetWordT1(i, j) * (10000000),
-					Text:  strings.ToValidUTF8(strings.Clone(CppGetWordText(i, j)), "<22>"),
-				}
-				if isValidWord(w) {
-					words = append(words, w)
-				}
-			}
-		}
-
-		// Skip empty segments with no recognisable content (e.g. trailing
-		// silence segments that parakeet emits with stale init data).
-		trimmed := strings.TrimSpace(txt)
-		if trimmed == "" && len(words) == 0 {
-			continue
-		}
-
 		segment := &pb.TranscriptSegment{
 			Id:    int32(i),
 			Text:  txt,
 			Start: s, End: t,
-			Words: words,
 		}

 		segments = append(segments, segment)

-		text += " " + trimmed
+		text += " " + strings.TrimSpace(txt)
 	}

 	return pb.TranscriptResult{
@@ -463,20 +348,13 @@ func (w *CrispASR) AudioTranscriptionStream(ctx context.Context, opts *pb.Transc
 		s := CppGetSegmentStart(i) * 10000000
 		t := CppGetSegmentEnd(i) * 10000000
 		txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "<22>")
-
-		// Skip empty segments (e.g. trailing silence that parakeet emits
-		// with stale init data).
-		trimmed := strings.TrimSpace(txt)
-		if trimmed == "" && s == t {
-			continue
-		}
-
 		segments = append(segments, &pb.TranscriptSegment{
 			Id:    int32(i),
 			Text:  txt,
 			Start: s, End: t,
 		})

+		trimmed := strings.TrimSpace(txt)
 		if trimmed == "" {
 			continue
 		}
@@ -512,7 +390,7 @@ func (w *CrispASR) synthesize(text string) ([]float32, error) {
 	}
 	defer CppTTSFree(ptr)
 	src := unsafe.Slice((*float32)(unsafe.Pointer(ptr)), int(n)) //nolint:govet // ptr addresses C-allocated PCM returned across the purego boundary; copied out immediately below, before tts_free.
-	out := make([]float32, int(n))                               // copy out of C memory before free
+	out := make([]float32, int(n)) // copy out of C memory before free
 	copy(out, src)
 	return out, nil
 }
@@ -539,7 +417,7 @@ func (w *CrispASR) TTS(req *pb.TTSRequest) error {
 	if err != nil {
 		return err
 	}
-	return writeWAV(req.Dst, pcm, w.sampleRate)
+	return writeWAV24k(req.Dst, pcm)
 }

 // TTSStream is the streaming counterpart to TTS. CrispASR has no progressive
@@ -569,7 +447,7 @@ func (w *CrispASR) TTSStream(req *pb.TTSRequest, results chan []byte) error {
 	}
 	defer func() { _ = os.Remove(dst) }()

-	if err := writeWAV(dst, pcm, w.sampleRate); err != nil {
+	if err := writeWAV24k(dst, pcm); err != nil {
 		return err
 	}

@@ -581,14 +459,14 @@ func (w *CrispASR) TTSStream(req *pb.TTSRequest, results chan []byte) error {
 	return nil
 }

-// writeWAV writes pcm as a sampleRate Hz, mono, 16-bit PCM WAV at dst.
-func writeWAV(dst string, pcm []float32, sampleRate int) error {
+// writeWAV24k writes pcm as a 24000 Hz, mono, 16-bit PCM WAV at dst.
+func writeWAV24k(dst string, pcm []float32) error {
 	f, err := os.Create(dst)
 	if err != nil {
 		return fmt.Errorf("crispasr: create %q: %w", dst, err)
 	}

-	enc := wav.NewEncoder(f, sampleRate, 16, 1, 1)
+	enc := wav.NewEncoder(f, 24000, 16, 1, 1)
 	ints := make([]int, len(pcm))
 	for i, s := range pcm {
 		if s > 1 {
@@ -599,7 +477,7 @@ func writeWAV(dst string, pcm []float32, sampleRate int) error {
 		ints[i] = int(s * 32767)
 	}
 	buf := &audio.IntBuffer{
-		Format:         &audio.Format{NumChannels: 1, SampleRate: sampleRate},
+		Format:         &audio.Format{NumChannels: 1, SampleRate: 24000},
 		Data:           ints,
 		SourceBitDepth: 16,
 	}
--- a/backend/go/crispasr/gocrispasr_samplerate_test.go
+++ b/backend/go/crispasr/gocrispasr_samplerate_test.go
@@ -1,164 +0,0 @@
-package main
-
-import (
-	"bytes"
-	"encoding/binary"
-	"os"
-	"path/filepath"
-
-	"github.com/go-audio/wav"
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-// GGUF metadata value type tags (subset) from the GGUF spec.
-const (
-	ggufTypeUint32 uint32 = 4
-	ggufTypeString uint32 = 8
-)
-
-type ggufKV struct {
-	key   string
-	vtype uint32
-	val   any
-}
-
-// writeMinimalGGUF emits a valid, tensor-less GGUF file carrying only the given
-// metadata key-values. Enough for the header-only parse path piperSampleRate
-// uses; avoids pulling a real multi-MB voice into the test.
-func writeMinimalGGUF(path string, kvs []ggufKV) error {
-	var b bytes.Buffer
-	b.WriteString("GGUF")                                // magic
-	_ = binary.Write(&b, binary.LittleEndian, uint32(3)) // version
-	_ = binary.Write(&b, binary.LittleEndian, uint64(0)) // tensor count
-	_ = binary.Write(&b, binary.LittleEndian, uint64(len(kvs)))
-	for _, kv := range kvs {
-		_ = binary.Write(&b, binary.LittleEndian, uint64(len(kv.key)))
-		b.WriteString(kv.key)
-		_ = binary.Write(&b, binary.LittleEndian, kv.vtype)
-		switch v := kv.val.(type) {
-		case uint32:
-			_ = binary.Write(&b, binary.LittleEndian, v)
-		case string:
-			_ = binary.Write(&b, binary.LittleEndian, uint64(len(v)))
-			b.WriteString(v)
-		}
-	}
-	return os.WriteFile(path, b.Bytes(), 0o644)
-}
-
-// wavSampleRate decodes the WAV header at path and returns its sample rate.
-func wavSampleRate(path string) (int, error) {
-	f, err := os.Open(path)
-	if err != nil {
-		return 0, err
-	}
-	defer func() { _ = f.Close() }()
-	dec := wav.NewDecoder(f)
-	dec.ReadInfo()
-	return int(dec.SampleRate), nil
-}
-
-var _ = Describe("piper sample rate", func() {
-	Context("piperSampleRate", func() {
-		It("reads piper.sample_rate from a piper GGUF (medium = 22050)", func() {
-			p := filepath.Join(GinkgoT().TempDir(), "voice.gguf")
-			Expect(writeMinimalGGUF(p, []ggufKV{
-				{key: "general.architecture", vtype: ggufTypeString, val: "piper"},
-				{key: "piper.sample_rate", vtype: ggufTypeUint32, val: uint32(22050)},
-			})).To(Succeed())
-
-			rate, ok := piperSampleRate(p)
-			Expect(ok).To(BeTrue(), "piper.sample_rate should be found")
-			Expect(rate).To(Equal(22050))
-		})
-
-		It("reads the low-quality rate (16000)", func() {
-			p := filepath.Join(GinkgoT().TempDir(), "voice.gguf")
-			Expect(writeMinimalGGUF(p, []ggufKV{
-				{key: "piper.sample_rate", vtype: ggufTypeUint32, val: uint32(16000)},
-			})).To(Succeed())
-
-			rate, ok := piperSampleRate(p)
-			Expect(ok).To(BeTrue())
-			Expect(rate).To(Equal(16000))
-		})
-
-		It("returns ok=false for a non-piper GGUF (no piper.sample_rate key)", func() {
-			p := filepath.Join(GinkgoT().TempDir(), "other.gguf")
-			Expect(writeMinimalGGUF(p, []ggufKV{
-				{key: "general.architecture", vtype: ggufTypeString, val: "vibevoice"},
-			})).To(Succeed())
-
-			_, ok := piperSampleRate(p)
-			Expect(ok).To(BeFalse())
-		})
-
-		It("returns ok=false for an unreadable/non-GGUF file", func() {
-			p := filepath.Join(GinkgoT().TempDir(), "garbage.gguf")
-			Expect(os.WriteFile(p, []byte("not a gguf"), 0o644)).To(Succeed())
-
-			_, ok := piperSampleRate(p)
-			Expect(ok).To(BeFalse())
-		})
-	})
-
-	// End-to-end through the built .so. Gated on CRISPASR_PIPER_MODEL_PATH (a
-	// real piper voice GGUF) like the other model-backed specs; never runs in
-	// default CI. Proves CrispASR's piper backend output rate flows into the
-	// WAV header instead of the hardcoded 24 kHz default.
-	Context("piper TTS end-to-end", func() {
-		It("writes the WAV at the model's native piper.sample_rate", func() {
-			model := os.Getenv("CRISPASR_PIPER_MODEL_PATH")
-			if model == "" {
-				Skip("set CRISPASR_PIPER_MODEL_PATH to run the piper e2e spec")
-			}
-			ensureLibLoaded()
-
-			expected, ok := piperSampleRate(model)
-			Expect(ok).To(BeTrue(), "model should carry piper.sample_rate metadata")
-
-			w := &CrispASR{}
-			Expect(w.Load(&pb.ModelOptions{
-				ModelFile: model,
-				Options:   []string{"backend:piper"},
-				Threads:   4,
-			})).To(Succeed())
-
-			dst := filepath.Join(GinkgoT().TempDir(), "piper.wav")
-			Expect(w.TTS(&pb.TTSRequest{Text: "Hello from CrispASR piper.", Dst: dst})).To(Succeed())
-
-			info, err := os.Stat(dst)
-			Expect(err).ToNot(HaveOccurred())
-			Expect(info.Size()).To(BeNumerically(">", 1024), "expected a non-trivial WAV")
-
-			rate, err := wavSampleRate(dst)
-			Expect(err).ToNot(HaveOccurred())
-			Expect(rate).To(Equal(expected),
-				"WAV header rate must equal the model's native piper.sample_rate, not the 24k default")
-		})
-	})
-
-	Context("writeWAV", func() {
-		It("writes the WAV header at the given sample rate (22050 for piper, not the 24k default)", func() {
-			dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
-			pcm := make([]float32, 220) // 10 ms of silence is enough for a header
-			Expect(writeWAV(dst, pcm, 22050)).To(Succeed())
-
-			rate, err := wavSampleRate(dst)
-			Expect(err).ToNot(HaveOccurred())
-			Expect(rate).To(Equal(22050))
-		})
-
-		It("writes a 16000 Hz header for low-quality piper voices", func() {
-			dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
-			pcm := make([]float32, 160)
-			Expect(writeWAV(dst, pcm, 16000)).To(Succeed())
-
-			rate, err := wavSampleRate(dst)
-			Expect(err).ToNot(HaveOccurred())
-			Expect(rate).To(Equal(16000))
-		})
-	})
-})
--- a/backend/go/crispasr/main.go
+++ b/backend/go/crispasr/main.go
@@ -4,7 +4,6 @@ package main
 import (
 	"flag"
 	"os"
-	"runtime"

 	"github.com/ebitengine/purego"
 	grpc "github.com/mudler/LocalAI/pkg/grpc"
@@ -22,11 +21,7 @@ type LibFuncs struct {
 func main() {
 	libName := os.Getenv("CRISPASR_LIBRARY")
 	if libName == "" {
-		if runtime.GOOS == "darwin" {
-			libName = "./libgocrispasr-fallback.dylib"
-		} else {
-			libName = "./libgocrispasr-fallback.so"
-		}
+		libName = "./libgocrispasr-fallback.so"
 	}

 	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
@@ -49,14 +44,6 @@ func main() {
 		{&CppTTSFree, "tts_free"},
 		{&CppTTSSetVoice, "tts_set_voice"},
 		{&CppTTSSetVoiceFile, "tts_set_voice_file"},
-		{&CppGetWordCount, "get_word_count"},
-		{&CppGetWordText, "get_word_text"},
-		{&CppGetWordT0, "get_word_t0"},
-		{&CppGetWordT1, "get_word_t1"},
-		{&CppGetParakeetWordCount, "get_parakeet_word_count"},
-		{&CppGetParakeetWordText, "get_parakeet_word_text"},
-		{&CppGetParakeetWordT0, "get_parakeet_word_t0"},
-		{&CppGetParakeetWordT1, "get_parakeet_word_t1"},
 	}

 	for _, lf := range libFuncs {
--- a/backend/go/crispasr/package.sh
+++ b/backend/go/crispasr/package.sh
@@ -12,8 +12,7 @@ REPO_ROOT="${CURDIR}/../../.."
 mkdir -p $CURDIR/package/lib

 cp -avf $CURDIR/crispasr $CURDIR/package/
-cp -fv $CURDIR/libgocrispasr-*.so $CURDIR/package/ 2>/dev/null || true
-cp -fv $CURDIR/libgocrispasr-*.dylib $CURDIR/package/ 2>/dev/null || true
+cp -fv $CURDIR/libgocrispasr-*.so $CURDIR/package/
 cp -fv $CURDIR/run.sh $CURDIR/package/

 # Detect architecture and copy appropriate libraries
@@ -52,32 +51,6 @@ else
    exit 1
 fi

-# Bundle espeak-ng (+ its libpcaudio/libsonic runtime deps) and its voice data so
-# the piper TTS backend can phonemize non-English text. CrispASR dlopens
-# libespeak-ng.so.1 at runtime (the MIT-clean path); the dlopen succeeds loading
-# libespeak-ng but FAILS if libpcaudio/libsonic are absent, so all three .so are
-# required. run.sh points CRISPASR_ESPEAK_DATA_PATH at the bundled data dir.
-# Best-effort: only copied when present, so a local dev build without espeak-ng
-# installed still packages the rest (English voices keep working).
-ESPEAK_LIBDIR=""
-for d in /usr/lib/x86_64-linux-gnu /usr/lib/aarch64-linux-gnu; do
-    if [ -f "$d/libespeak-ng.so.1" ]; then
-        ESPEAK_LIBDIR="$d"
-        break
-    fi
-done
-if [ -n "$ESPEAK_LIBDIR" ]; then
-    echo "Bundling espeak-ng from $ESPEAK_LIBDIR ..."
-    cp -arfLv "$ESPEAK_LIBDIR/libespeak-ng.so.1" $CURDIR/package/lib/
-    cp -arfLv "$ESPEAK_LIBDIR/libpcaudio.so.0" $CURDIR/package/lib/
-    cp -arfLv "$ESPEAK_LIBDIR/libsonic.so.0" $CURDIR/package/lib/
-    if [ -d "$ESPEAK_LIBDIR/espeak-ng-data" ]; then
-        cp -arfLv "$ESPEAK_LIBDIR/espeak-ng-data" $CURDIR/package/
-    fi
-else
-    echo "espeak-ng not found; non-English piper voices will not phonemize"
-fi
-
 # Package GPU libraries based on BUILD_TYPE
 # The GPU library packaging script will detect BUILD_TYPE and copy appropriate GPU libraries
 GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
--- a/backend/go/crispasr/run.sh
+++ b/backend/go/crispasr/run.sh
@@ -2,7 +2,7 @@
 set -ex

 # Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath "$0")")
+CURDIR=$(dirname "$(realpath $0)")

 cd /

@@ -12,23 +12,19 @@ if [ "$(uname)" != "Darwin" ]; then
 	grep -e "flags" /proc/cpuinfo | head -1
 fi

-if [ "$(uname)" = "Darwin" ]; then
-	# macOS: single dylib variant (Metal or Accelerate)
-	LIBRARY="$CURDIR/libgocrispasr-fallback.dylib"
-	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
-else
-	LIBRARY="$CURDIR/libgocrispasr-fallback.so"
+LIBRARY="$CURDIR/libgocrispasr-fallback.so"

+if [ "$(uname)" != "Darwin" ]; then
 	if grep -q -e "\savx\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX    found OK"
-		if [ -e "$CURDIR"/libgocrispasr-avx.so ]; then
+		if [ -e $CURDIR/libgocrispasr-avx.so ]; then
 			LIBRARY="$CURDIR/libgocrispasr-avx.so"
 		fi
 	fi

 	if grep -q -e "\savx2\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX2   found OK"
-		if [ -e "$CURDIR"/libgocrispasr-avx2.so ]; then
+		if [ -e $CURDIR/libgocrispasr-avx2.so ]; then
 			LIBRARY="$CURDIR/libgocrispasr-avx2.so"
 		fi
 	fi
@@ -36,27 +32,21 @@ else
 	# Check avx 512
 	if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX512F found OK"
-		if [ -e "$CURDIR"/libgocrispasr-avx512.so ]; then
+		if [ -e $CURDIR/libgocrispasr-avx512.so ]; then
 			LIBRARY="$CURDIR/libgocrispasr-avx512.so"
 		fi
 	fi
-
-	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
 fi

+export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 export CRISPASR_LIBRARY=$LIBRARY

-# Point piper's espeak-ng phonemizer at the bundled voice data. The variable
-# names the directory CONTAINING espeak-ng-data (package.sh drops it next to
-# this script). Harmless when espeak-ng wasn't bundled.
-export CRISPASR_ESPEAK_DATA_PATH="$CURDIR"
-
 # If there is a lib/ld.so, use it
-if [ -f "$CURDIR"/lib/ld.so ]; then
+if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using library: $LIBRARY"
-	exec "$CURDIR"/lib/ld.so "$CURDIR"/crispasr "$@"
+	exec $CURDIR/lib/ld.so $CURDIR/crispasr "$@"
 fi

 echo "Using library: $LIBRARY"
-exec "$CURDIR"/crispasr "$@"
+exec $CURDIR/crispasr "$@"
--- a/backend/go/depth-anything-cpp/.gitignore
+++ b/backend/go/depth-anything-cpp/.gitignore
@@ -1,7 +0,0 @@
-sources/
-build*/
-package/
-libdepthanythingcpp*.so
-depth-anything-cpp
-test-models/
-test-data/
--- a/backend/go/depth-anything-cpp/CMakeLists.txt
+++ b/backend/go/depth-anything-cpp/CMakeLists.txt
@@ -1,28 +0,0 @@
-cmake_minimum_required(VERSION 3.18)
-project(libdepthanythingcpp LANGUAGES C CXX)
-
-set(CMAKE_POSITION_INDEPENDENT_CODE ON)
-set(CMAKE_CXX_STANDARD 17)
-set(CMAKE_CXX_STANDARD_REQUIRED ON)
-
-# Static-link ggml into the depth-anything shared library so the resulting .so
-# has no runtime dependency on an external libggml — only on
-# libc/libstdc++/libgomp, which the LocalAI package step bundles into the
-# docker image.
-set(BUILD_SHARED_LIBS OFF CACHE BOOL "Build static libraries" FORCE)
-
-# depth-anything.cpp build switches: skip CLI/tests, but build libdepthanything
-# itself as a SHARED library (DA_SHARED) while ggml stays static
-# (BUILD_SHARED_LIBS OFF above). The da_capi_* C ABI is compiled into
-# src/da_capi.cpp and re-exported by that shared library, so no extra MODULE
-# wrapper is needed (unlike locate-anything.cpp).
-set(DA_BUILD_CLI OFF CACHE BOOL "Disable depth-anything CLI" FORCE)
-set(DA_BUILD_TESTS OFF CACHE BOOL "Disable depth-anything tests" FORCE)
-set(DA_SHARED ON CACHE BOOL "Build libdepthanything as a shared lib" FORCE)
-
-add_subdirectory(./sources/depth-anything.cpp)
-
-# Emit libdepthanything.so into the top-level build dir so the Makefile can
-# rename it to the per-variant libdepthanythingcpp-<variant>.so.
-set_target_properties(depthanything PROPERTIES
-    LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})