chore(dllm): bump dllm.cpp pin to P5 head

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
feat(dllm): image input through the backend (multimodal C-ABI)
2026-06-30 03:17:01 -04:00 · 2026-06-13 00:00:42 +00:00 · 2026-06-12 00:41:04 +00:00 · 2026-06-11 20:24:26 +00:00 · 2026-06-11 19:22:02 +00:00 · 2026-06-11 17:50:04 +00:00
823 changed files with 16904 additions and 57968 deletions
--- a/.agents/adding-backends.md
+++ b/.agents/adding-backends.md
@@ -102,24 +102,6 @@ Multi-arch backends are NOT a single matrix entry with `platforms: 'linux/amd64,
 Entries whose `dockerfile` is `./backend/Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}` must also set a `builder-base-image` field pointing at a prebuilt base from `quay.io/go-skynet/ci-cache:base-grpc-*` (CI builds these via `.github/workflows/base-images.yml`). The mapping is by `(build-type, platforms)` — see existing entries for the pattern. CI uses these prebuilt bases to skip the gRPC compile (~25–35 min cold). Local `make backends/<name>` ignores `builder-base-image` and uses the from-source path inside the Dockerfile, so you don't need quay access for local builds.
 ### Cover every OS the project supports (Linux **and** Darwin)
 `.github/backend-matrix.yml` has two matrices, and they are the source of truth for which OS a backend ships on:
 - `include:` — the **Linux** matrix (x86_64 + arm64; CPU and CUDA / ROCm / SYCL / Vulkan).
 - `includeDarwin:` — the **macOS / Apple Silicon** matrix (arm64; Metal where the engine supports it, otherwise a native arm64 CPU build).
 **A new backend must target every OS it can build for — do not ship Linux-only by default.** A backend that appears only under `include:` is silently unavailable on macOS even when its code would run there. Most C/C++/GGML engines build on Darwin out of the box (ggml defaults `GGML_METAL=ON` on Apple, so a plain build is Metal-enabled), and many Python backends do too (CPU / MPS wheels). If a backend genuinely cannot support an OS (e.g. CUDA-only, no CPU variant), state that in the PR description instead of omitting it silently.
 Wiring a backend into `includeDarwin:` is more than the matrix entry:
 1. **`includeDarwin:` entry** — `tag-suffix: "-metal-darwin-arm64-<backend>"`, `build-type: "metal"`, `lang: "go"` for go+ggml backends; omit `build-type` for the bespoke C++ ones (llama-cpp / ds4 / privacy-filter). Match an existing entry of the same shape.
 2. **`backend/index.yaml`** — add `metal:` to the backend's `capabilities` map (main and `-development`) and concrete `metal-<backend>` / `metal-<backend>-development` image entries pointing at the `-metal-darwin-arm64-<backend>` images.
 3. **C/C++ backends only** — add an `inferBackendPathDarwin` case in `scripts/changed-backends.js` returning `backend/cpp/<backend>/` (the generic fallthrough assumes `backend/<lang>/`, which is wrong for a C++ source tree driven with `lang: go`), and give `run.sh` a Darwin branch that exports `DYLD_LIBRARY_PATH` instead of `LD_LIBRARY_PATH`. If the build is bespoke (single `grpc-server` + dylib bundling), model it on `scripts/build/ds4-darwin.sh` and add a `backends/<backend>-darwin` make target plus a gated step in `.github/workflows/backend_build_darwin.yml`.
 4. **C++ proto gotcha** — if the backend compiles the generated gRPC/protobuf in a separate CMake target (e.g. `hw_grpc_proto`), that target must link `protobuf::libprotobuf` + `gRPC::grpc++` so the Homebrew include dirs propagate; otherwise macOS fails with `google/protobuf/runtime_version.h not found` (Linux hides this because apt headers sit in `/usr/include`).
 The CI path filter only builds a backend on a PR when a file under its directory changes, so a darwin-only YAML edit builds nothing — touch a file under `backend/<lang>/<backend>/` (a one-line comment is enough) in the same PR.
 ## 3. Add Backend Metadata to `backend/index.yaml`
 **Step 3a: Add Meta Definition**
@@ -216,34 +198,12 @@ docker-build-backends: ... docker-build-<backend-name>
 - If the backend is in `backend/python/<backend-name>/` but uses `.` as context in the workflow file, use `.` context
 - Check similar backends to determine the correct context
 ## Documenting the backend (README + docs)
 A backend is not "added" until it is discoverable. Update the user-facing docs:
 - **`docs/content/features/backends.md`** - add the backend to the right
  category in the "LocalAI supports various types of backends" list (and add a
  new category if it introduces a new modality, e.g. sound classification).
 - If the backend introduces a **new API surface** (a new endpoint or a realtime
  capability), document it under `docs/content/` where its area lives (audio,
  vision, etc.) and follow the api-endpoints checklist in
  [api-endpoints-and-auth.md](api-endpoints-and-auth.md).
 **If the backend is a native C/C++/GGML engine created and maintained by the
 LocalAI team** (a from-scratch port like `parakeet.cpp`, `ced.cpp`,
 `vibevoice.cpp`, `rf-detr.cpp`, not a wrapper around a third-party runtime), it
 ALSO belongs in the top-level **`README.md`** table under "native C/C++/GGML
 engines ... developed and maintained by the LocalAI project itself". Add a row
 linking the upstream engine repo with a one-line description. This is the
 project's showcase of its own engines; a new in-house backend that is missing
 from it is a documentation bug.
 ## 5. Verification Checklist
 After adding a new backend, verify:
 - [ ] Backend directory structure is complete with all necessary files
 - [ ] Build configurations added to `.github/backend-matrix.yml` for all desired platforms (per-arch entries with `platform-tag` for multi-arch; `builder-base-image` for llama-cpp / ik-llama-cpp / turboquant)
 - [ ] **OS coverage considered**: added to `includeDarwin:` (macOS/Apple Silicon) if the backend can build there — with the `backend/index.yaml` `metal:` capability + `metal-<backend>` image entries, a `run.sh` Darwin/DYLD branch and `inferBackendPathDarwin` case for C++ backends — or the PR explains why an OS is unsupported. Do not ship Linux-only by default.
 - [ ] Meta definition added to `backend/index.yaml` in the `## metas` section
 - [ ] Image entries added to `backend/index.yaml` for all build variants (latest + development)
 - [ ] Tag suffixes match between workflow file and index.yaml
@@ -251,8 +211,6 @@ After adding a new backend, verify:
 - [ ] No YAML syntax errors (check with linter)
 - [ ] No Makefile syntax errors (check with linter)
 - [ ] Follows the same pattern as similar backends (e.g., if it's a transcription backend, follow `faster-whisper` pattern)
 - [ ] Documented: added to the category list in `docs/content/features/backends.md` (and any new endpoint/realtime capability documented under `docs/content/`)
 - [ ] If it is an in-house native C/C++/GGML engine, added to the maintained-engines table in the top-level `README.md`
 ## Bundling runtime shared libraries (`package.sh`)
--- a/.agents/dllm-backend.md
+++ b/.agents/dllm-backend.md
@@ -0,0 +1,138 @@
 # Working on the dllm Backend
 `mudler/dllm.cpp` is a standalone C++/ggml engine for DiffusionGemma
 block-diffusion models. LocalAI wraps it with a **pure-Go** backend at
 `backend/go/dllm/` that dlopens `libdllm.so` via purego (ebitengine/purego) -
 NOT cgo, and NOT a C++ grpc-server fork. The Go side owns chat templating
 (gemma4 renderer) and output parsing (gemma4 streaming parser) and implements
 the rich gRPC interface (`PredictRich`/`PredictStreamRich`, ChatDelta replies).
 > NOTE: github.com/mudler/dllm.cpp is still **private** (publishing is
 > planned). Until then the Makefile's anonymous clone fails; use the local-dev
 > symlink shortcut documented at the top of `backend/go/dllm/Makefile`
 > (symlink an out-of-tree `build/libdllm.so` into the backend dir and skip the
 > clone), or a git credential helper with repo access.
 ## Pin
 `backend/go/dllm/Makefile` pins `DLLM_VERSION?=<sha>` at the top
 (whisper / parakeet-cpp / ds4 convention). The bump-deps bot
 (`.github/workflows/bump_deps.yaml`) tracks `mudler/dllm.cpp` `main` and
 rewrites that variable. After a manual bump: `make -C backend/go/dllm purge &&
 make -C backend/go/dllm` (the clone is keyed on the directory existing, not
 the sha).
 ## C-ABI and the serialization contract
 The binding covers the 9-symbol flat C-ABI from dllm.cpp's
 `include/dllm_capi.h` (ABI v1; `main.go` hard-fails on a version mismatch):
 `abi_version, load, free, last_error, free_string, tokenize_json, generate,
 generate_stream, cancel`. Contract points the Go wiring encodes (`capi.go`
 header comment has the full list):
 - **One ctx = at most one in-flight generate/tokenize.** A per-model worker
  goroutine (`Dllm.jobs` in `dllm.go`) owns ALL C calls, so serialization is
  enforced structurally rather than by lock discipline.
 - **`dllm_capi_cancel` is the ONE exception**: it only flips an atomic and may
  be called from any goroutine mid-generate, so `Dllm.Cancel` bypasses the
  worker queue. The flag resets at the start of each generate, so a watchdog
  racing a new generate must re-issue cancel.
 - **`last_error` is a borrowed pointer** and must only be read AFTER the
  failing call has returned (never while a generate is in flight on the same ctx).
 - **Free vs in-flight requests**: requests hold `genMu.RLock` for their full
  duration; `Free` takes the write lock, so it only runs when nothing is in
  flight, then drains and closes the worker. Post-Free requests get a clean
  "model not loaded" error.
 - `tokenize_json`/`generate` return malloc'd `char*` (bound as `uintptr`,
  copied, then released with `dllm_capi_free_string`); opts/params JSON must be
  a FLAT object of scalars (`buildOptsJSON` rejects anything else).
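
The worker-goroutine and Free-vs-in-flight points above can be sketched as follows. This is a minimal illustration, not the code in `dllm.go`: the names `worker`, `job`, `do` and `free` are hypothetical, and the closures stand in for calls into the C-ABI.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// result carries the outcome of one job back to the caller.
type result struct {
	out string
	err error
}

// job is a hypothetical unit of work; fn stands in for a call into the C library.
type job struct {
	fn   func() (string, error)
	done chan result
}

// worker runs every job on a single goroutine, so calls can never overlap.
type worker struct {
	mu     sync.RWMutex // requests hold RLock for their duration; free takes the write lock
	jobs   chan job
	closed bool
}

var errNotLoaded = errors.New("model not loaded")

func newWorker() *worker {
	w := &worker{jobs: make(chan job)}
	go func() {
		// The only goroutine that ever executes fn.
		for j := range w.jobs {
			out, err := j.fn()
			j.done <- result{out, err}
		}
	}()
	return w
}

// do queues fn on the worker and waits for its result.
func (w *worker) do(fn func() (string, error)) (string, error) {
	w.mu.RLock()
	defer w.mu.RUnlock()
	if w.closed {
		return "", errNotLoaded
	}
	j := job{fn: fn, done: make(chan result, 1)}
	w.jobs <- j
	r := <-j.done
	return r.out, r.err
}

// free waits until no request holds the read lock, then closes the queue.
func (w *worker) free() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if !w.closed {
		w.closed = true
		close(w.jobs)
	}
}

func main() {
	w := newWorker()
	out, err := w.do(func() (string, error) { return "generated", nil })
	fmt.Println(out, err)
	w.free()
	_, err = w.do(func() (string, error) { return "", nil })
	fmt.Println(err)
}
```

Because only the worker goroutine ever runs a job, callers need no lock around the engine calls themselves; the `RWMutex` exists only to keep `free` from closing the queue under an in-flight request.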
 ## Wire shape
 | RPC | Implementation |
 |---|---|
 | LoadModel | `dllm_capi_load` (params: `n_gpu_layers`, `n_threads`, `ctx_len`); `Options[]` parsed into per-request gen opts (`eb_*`, `blocks`, `kv_cache`) by `parseModelGenOpts` |
 | PredictRich | render (if templated) → `dllm_capi_generate` → parse → ONE Reply with aggregated ChatDeltas + legacy `Message` bytes |
 | PredictStreamRich | `dllm_capi_generate_stream`; per committed diffusion block → UTF-8 holdback → parser.Feed → one Reply per non-empty delta batch (channel closed by the CALLER, per `pkg/grpc/interface.go`) |
 | Predict / PredictStream | Legacy paths, delegate to the rich pair (legacy stream INVERTS channel ownership: the impl closes) |
 | TokenizeString | `dllm_capi_tokenize_json` (C side prepends BOS per `vocab.add_bos`) |
 | Cancel | `dllm_capi_cancel`, exposed as the `grpc.Cancellable` capability (`pkg/grpc/interface.go`): the gRPC server arms it via `context.AfterFunc` on the Predict/PredictStream context, so client disconnects/timeouts abort the in-flight generate - llama.cpp `IsCancelled()` parity for Go backends |
 `n_threads` and `ctx_len` are accepted-but-ignored by the engine at the
 current pin (the context bound comes from GGUF `n_ctx_train`); they are sent
 for forward compatibility.
 ## Renderer / parser (the templated chat path)
 With `use_tokenizer_template` + raw Messages, the backend owns templating and
 parsing (the ds4 precedent, but in Go):
 - `gemma4_renderer.go` - `RenderGemma4(msgs, toolsJSON, enableThinking,
  addGenerationPrompt)`. The file embeds the FULL `tokenizer.chat_template`
  jinja (17466 bytes, md5 `8c34cf93c7a7815b3fdb300a009c4c17`) extracted
  verbatim from `diffusiongemma-26B-A4B-it-BF16.gguf` via gguf-py - e.g.
  `python scripts/dump_gguf.py model.gguf | grep -A400 chat_template` in the
  dllm.cpp checkout - as a numbered comment block; every Go rule cites its
  "tpl L<n>" line. Re-verify the md5 before blaming the renderer for a
  mismatch with a new GGUF. **BOS exception**: the template emits
  `{{- bos_token -}}` but the renderer deliberately does NOT - dllm.cpp's
  `run_generate` tokenizes with `prepend_bos = vocab.add_bos` (true for
  gemma4), so a literal `<bos>` would double it.
 - `gemma4_parser.go` - streaming state machine turning raw model text
  (fragments can split anywhere, including mid-marker) into ChatDeltas:
  thought channels → `reasoning_content`, `<|tool_call>call:name{...}` →
  ToolCallDelta, `<turn|>` → done. Marker grammar cross-checked against vLLM
  PR #45163's gemma4 tool/reasoning parsers. Malformed payloads are re-emitted
  raw as content, never dropped.
 - Thinking is **opt-in** for this family (`Metadata["enable_thinking"]`,
  default OFF - the inverse of ds4): the template gates every thinking branch
  on `enable_thinking`, and the no-thinking render pre-closes an empty thought
  channel, so the parser always starts in content state.
 - **UTF-8 boundary holdback** (`splitValidUTF8` in `dllm.go`): per-block
  detokenization can split a multi-byte character across block boundaries, and
  grpc-go refuses to marshal invalid UTF-8 in proto3 strings. An incomplete
  trailing sequence (at most 3 bytes) is carried into the next block; genuinely
  undecodable bytes become U+FFFD.
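
The UTF-8 boundary holdback can be sketched with the standard `unicode/utf8` and `strings` packages. The function name `holdback` is hypothetical and this is not the `splitValidUTF8` implementation; it only demonstrates the behaviour described in the bullet above: carry an incomplete trailing sequence of at most 3 bytes and replace undecodable bytes with U+FFFD.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// holdback splits buf into text that is safe to emit now and a trailing
// incomplete UTF-8 sequence (at most 3 bytes) to prepend to the next block.
func holdback(buf []byte) (emit string, carry []byte) {
	cut := len(buf)
	// Look back at most 3 bytes for the start byte of the last rune.
	for i := 1; i <= 3 && i <= len(buf); i++ {
		if utf8.RuneStart(buf[len(buf)-i]) {
			// Hold the tail back only if that rune is still incomplete.
			if !utf8.FullRune(buf[len(buf)-i:]) {
				cut = len(buf) - i
			}
			break
		}
	}
	// Copy the carry so it does not alias the caller's buffer.
	carry = append([]byte(nil), buf[cut:]...)
	// Anything still invalid in the emitted part becomes U+FFFD.
	return strings.ToValidUTF8(string(buf[:cut]), "\uFFFD"), carry
}

func main() {
	euro := []byte("€") // three bytes: E2 82 AC
	first := append([]byte("price: "), euro[:2]...)
	emit1, carry := holdback(first)
	emit2, rest := holdback(append(carry, euro[2:]...))
	fmt.Printf("%q %q %d\n", emit1, emit2, len(rest))
}
```

`utf8.FullRune` treats an invalid encoding as a full (width-1 error) rune, so stray bytes are never held back indefinitely; they fall through to the U+FFFD replacement.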
 Without `use_tokenizer_template`, the prompt passes through verbatim and the
 output is NOT gemma4-parsed (plain content, like any non-autoparsing backend).
 ## Tests
 | Layer | Gate | What |
 |---|---|---|
 | `backend/go/dllm/*_test.go` (renderer/parser/wiring) | none - run in plain `go test ./backend/go/dllm/...` | Ginkgo specs over a fake `generator` seam; canonical renderer fixtures from transformers' `test_modeling_diffusion_gemma.py`, parser tables from the vLLM gemma4 parsers |
 | `backend/go/dllm/dllm_test.go` C-ABI smoke | `DLLM_TEST_LIBRARY` + `DLLM_TEST_TINY_MODEL` (dllm.cpp's `tests/fixtures/tiny_with_vocab.gguf`); Skips when unset | Drives the real `libdllm.so`: ABI check, load, tokenize `[2,18]`, deterministic generate, cancel (incl. mid-stream `Dllm.Cancel` aborting a deliberately slow `eb_max_steps:256` run in ~10ms) |
 | `tests/e2e-backends/dllm_test.go` | `BACKEND_TEST_DLLM=1` + `BACKEND_BINARY` (packaged run.sh) + `BACKEND_TEST_MODEL_FILE` (tiny fixture) | Templated chat round trip (Messages + UseTokenizerTemplate) over the real gRPC binary, non-streaming + streaming; plus client-context cancellation mid-stream (proves the `Cancellable` server plumbing end to end) |
 | Real-model e2e | `BACKEND_TEST_DLLM_REAL_MODEL_FILE` (26B BF16, ~50 GB) + `BACKEND_TEST_DLLM_REAL_GPU_LAYERS` | CUDA-13-class hardware only |
 Tool-call e2e is deliberately absent from the tiny-model spec: the fixture has
 random weights and cannot be coaxed into emitting tool markup; the unit tables
 carry that coverage.
 ## Build matrix
 `cpu-dllm` (amd64 + arm64), `cuda13-dllm` (amd64), and
 `cuda13-nvidia-l4t-arm64-dllm` (arm64 CUDA: Jetson / DGX Spark GB10), via
 `.github/backend-matrix.yml`. No darwin/Metal. CUDA builds forward
 `-DDLLM_CUDA=ON` (dllm.cpp gates ggml's CUDA behind its own flag - a bare
 `-DGGML_CUDA=ON` is overridden by the cache FORCE). `libdllm.so` is
 self-contained (ggml statically absorbed, PIC), so `package.sh` only ships
 the binary, `run.sh` and that one .so (the parakeet-cpp-style stub layout;
 no ldd walk yet).
 ## Known limitations
 - **Cancel granularity**: the C-ABI cancel flag is per-ctx and resets on
  every generate entry, so a Cancel racing a NEW generate can be lost, and
  with requests queued on the worker it aborts whichever generate is
  currently running (acceptable: the server de-registers the hook on normal
  completion, one process serves one model).
 - **Throughput**: ~0.15 tok/s on the 26B at default settings (GB10), because
  every denoise step recomputes the full prompt+canvas. The upstream prefix-KV
  cache (dllm.cpp P3) is the fix; `kv_cache:on` returns an error until it lands
  (`auto`/`off` are accepted no-ops).
 - **Repo privacy**: see the note at the top - CI clone of dllm.cpp needs the
  repo published (or credentials) before the backend images can build.
 - Engine spec/validation references: dllm.cpp `docs/validation.md` and
  LocalAI `docs/superpowers/specs/2026-06-10-dllm-cpp-design.md`.
--- a/.agents/ds4-backend.md
+++ b/.agents/ds4-backend.md
@@ -44,39 +44,6 @@ maps to `DS4_THINK_HIGH`. We pass the chosen mode to `ds4_chat_append_assistant_
 via `ModelOptions.Options[] = "kv_cache_dir:/some/path"`. Format is **our own** -
 NOT bit-compatible with ds4-server's KVC files (interop is a follow-up plan).
 ## Engine options (LoadModel)
 `LoadModel` maps `ModelOptions.Options[]` (`"key:value"`, from model-YAML
 `options:`) onto `ds4_engine_options` through a **declarative table**
 (`kEngineOptSpecs` + `apply_engine_option` in `grpc-server.cpp`). The struct is
 plain C with no reflection, so the field set is enumerated once in the table;
 adding a future engine knob is a one-line table row, not a new branch. Unknown
 keys are ignored (back-compat). A bare flag (`ssd_streaming` with no value)
 means `true`. Path-type values (`mtp_path`, `expert_profile_path`,
 `directional_steering_file`) resolve **relative to the model directory**, so a
 gallery entry can reference a companion file it downloaded by bare filename;
 absolute values pass through. `ds4_role` / `ds4_layers` / `ds4_listen` /
 `ds4_route_timeout` / `kv_cache_dir` keep their dedicated handling (validation
 + coordinator wiring) and are not in the table.
 Wired keys: `mtp_path`, `mtp_draft`, `mtp_margin`, `prefill_chunk`,
 `power_percent`, `warm_weights`, `quality`, `ssd_streaming`,
 `ssd_streaming_cold`, `ssd_streaming_preload_experts`,
 `ssd_streaming_cache_experts` (count or `NGB`, sets both experts+bytes via
 `ds4_parse_streaming_cache_experts_arg`), `simulate_used_memory` (`NGB` via
 `ds4_parse_gib_arg`), `expert_profile_path`, `directional_steering_file`,
 `directional_steering_attn`, `directional_steering_ffn`.
 ## SSD streaming (running models larger than RAM)
 ds4's **SSD streaming** keeps non-routed weights resident and streams routed MoE
 experts from the GGUF on cache misses, turning "does it fit in RAM" into a speed
 spectrum. **Metal (Darwin) only** - it is a no-op on CUDA/CPU. Enable with
 `options: ["ssd_streaming"]`; size the routed-expert cache with
 `ssd_streaming_cache_experts:NGB` (omit for ds4's automatic 80%-of-working-set
 budget). Gallery entries built on this: `deepseek-v4-flash-q4-ssd` (153 GB Flash
 on a 128 GB Mac) and `deepseek-v4-pro-q2-ssd` (433 GB Pro, experimental).
 ## Build matrix
 | Build | Where | Notes |
--- a/.docker/install-base-deps.sh
+++ b/.docker/install-base-deps.sh
@@ -70,12 +70,6 @@ if [ "${BUILD_TYPE:-}" = "vulkan" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; t
        git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
        ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
        clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
    # Mesa Vulkan ICD drivers (ANV/RADV/lavapipe + Arm SoC) and their ICD
    # manifests. The LunarG SDK below only provides the loader and shader
    # tooling, not hardware drivers — without Mesa the packaged Vulkan backend
    # would ship a loader that finds no GPU. package-gpu-libs.sh bundles these
    # .so files plus their deps into the backend so it stays self-contained.
    apt-get install -y mesa-vulkan-drivers libdrm2
    if [ "amd64" = "${TARGETARCH:-}" ]; then
        wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz"
        tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz
--- a/.docker/llama-cpp-compile.sh
+++ b/.docker/llama-cpp-compile.sh
@@ -17,29 +17,19 @@ if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
  rm -rf /LocalAI/backend/cpp/llama-cpp-*-build
 fi
-cd /LocalAI/backend/cpp/llama-cpp
+if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
-if [ -z "${BUILD_TYPE:-}" ]; then
+  cd /LocalAI/backend/cpp/llama-cpp
  # Pure CPU image (BUILD_TYPE empty): one build with ggml CPU_ALL_VARIANTS replaces the
  # per-microarch binaries (x86: avx/avx2/avx512/fallback; arm64: armv8.x/armv9.x). ggml
  # dlopens the best libggml-cpu-*.so at runtime by probing host CPU features.
  #
  # arm64: the CPU_ALL_VARIANTS table includes armv9.2 SME variants whose -march=...+sme is
  # rejected by the Ubuntu 24.04 default gcc-13. gcc-14 accepts it, so build the arm64
  # variants with it (the host never *selects* SME unless it has it, but every variant must
  # still compile).
  if [ "${TARGETARCH}" = "arm64" ]; then
    apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
    export CC=gcc-14 CXX=g++-14
  fi
  make llama-cpp-cpu-all
 else
  # GPU build (cublas/hipblas/sycl/vulkan/...): the accelerator does the compute, so a
  # single fallback CPU build is enough - no per-microarch CPU variants needed. (This also
  # keeps the heavy GPU backend compile from also building the whole CPU variant matrix,
  # and avoids the gcc-14 apt step on GPU base images such as nvidia l4t.)
  make llama-cpp-fallback
  make llama-cpp-grpc
  make llama-cpp-rpc-server
 else
  cd /LocalAI/backend/cpp/llama-cpp
  make llama-cpp-avx
  make llama-cpp-avx2
  make llama-cpp-avx512
  make llama-cpp-fallback
  make llama-cpp-grpc
  make llama-cpp-rpc-server
 fi
 make llama-cpp-grpc
 make llama-cpp-rpc-server
 ccache -s || true
--- a/.docker/turboquant-compile.sh
+++ b/.docker/turboquant-compile.sh
@@ -19,21 +19,17 @@ fi
 cd /LocalAI/backend/cpp/turboquant
-if [ -z "${BUILD_TYPE:-}" ]; then
+if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
  # Pure CPU image: one ggml CPU_ALL_VARIANTS build replaces the per-microarch binaries.
  # arm64: the armv9.2 SME variants need gcc-14 (gcc-13 rejects +sme).
  if [ "${TARGETARCH}" = "arm64" ]; then
    apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
    export CC=gcc-14 CXX=g++-14
  fi
  make turboquant-cpu-all
 else
  # GPU build (cublas/hipblas/sycl/vulkan/...): single fallback CPU build, the accelerator
  # does the compute. Keeps the GPU compile from also building the CPU variant matrix and
  # avoids the gcc-14 apt step on GPU base images such as nvidia l4t.
  make turboquant-fallback
  make turboquant-grpc
  make turboquant-rpc-server
 else
  make turboquant-avx
  make turboquant-avx2
  make turboquant-avx512
  make turboquant-fallback
  make turboquant-grpc
  make turboquant-rpc-server
 fi
 make turboquant-grpc
 make turboquant-rpc-server
 ccache -s || true
--- a/.dockerignore
+++ b/.dockerignore
@@ -31,15 +31,6 @@ backend/python/**/source
 backend/cpp/llama-cpp/llama.cpp
 backend/cpp/llama-cpp-*-build
 # privacy-filter: same in-place pattern. The Makefile fetches privacy-filter.cpp
 # at the pinned commit (or symlinks a PRIVACY_FILTER_SRC checkout for local dev).
 # A stale dir/symlink COPY'd into the image makes the clone step fail (dangling
 # symlink) or compile against the wrong commit, so keep host build state out.
 backend/cpp/privacy-filter/privacy-filter.cpp
 backend/cpp/privacy-filter/build
 backend/cpp/privacy-filter/grpc-server
 backend/cpp/privacy-filter/package
 # Rust backend build output (sources are tracked; target/ is generated)
 backend/rust/*/target
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
--- a/.github/bump_vllm_metal.sh
+++ b/.github/bump_vllm_metal.sh
@@ -1,55 +0,0 @@
 #!/bin/bash
 # Bump the single vllm-metal pin (VLLM_METAL_VERSION) in the vLLM backend's
 # darwin (Apple Silicon) install path. The macOS/Metal build
 # (backend/python/vllm/install.sh, Darwin branch) installs vllm-metal, which is
 # version-locked to a specific vLLM source release. install.sh derives that vLLM
 # version at build time from vllm-metal's own installer (`vllm_v=`) at the pinned
 # tag, so there is only ONE value to bump here -- mirroring bump_vllm_wheel.sh,
 # which bumps the Linux cu130 wheel pin.
 #
 # This deliberately tracks vllm-project/vllm-metal, NOT vllm-project/vllm: the
 # darwin build can only use the exact vLLM version vllm-metal supports, so it may
 # lag the Linux pin (requirements-cublas13-after.txt) until vllm-metal catches up.
 set -xe
 REPO=$1   # vllm-project/vllm-metal
 FILE=$2   # backend/python/vllm/install.sh
 VAR=$3    # VLLM_METAL_VERSION (used for the workflow's output file names)
 if [ -z "$FILE" ] || [ -z "$REPO" ] || [ -z "$VAR" ]; then
    echo "usage: $0 <repo> <install-file> <var-name>" >&2
    exit 1
 fi
 # vllm-metal ships frequent dev releases, all flagged as non-prerelease, so
 # /releases/latest returns the newest one (with its cp312 wheel asset).
 LATEST_TAG=$(curl -sS -H "Accept: application/vnd.github+json" \
    "https://api.github.com/repos/$REPO/releases/latest" \
    | python3 -c "import json,sys; print(json.load(sys.stdin)['tag_name'])")
 # The coupled vLLM source version lives in vllm-metal's installer at that tag.
 NEW_VLLM_VERSION=$(curl -fsSL \
    "https://raw.githubusercontent.com/$REPO/$LATEST_TAG/install.sh" \
    | grep -oE 'vllm_v="[0-9]+\.[0-9]+\.[0-9]+"' | head -1 | cut -d'"' -f2)
 if [ -z "$LATEST_TAG" ] || [ -z "$NEW_VLLM_VERSION" ]; then
    echo "Could not resolve vllm-metal tag ($LATEST_TAG) or its vllm_v ($NEW_VLLM_VERSION)." >&2
    exit 1
 fi
 set +e
 CURRENT_TAG=$(grep -oE 'VLLM_METAL_VERSION="[^"]*"' "$FILE" | head -1 | cut -d'"' -f2)
 set -e
 # Rewrite the single pin. install.sh derives VLLM_VERSION from this tag at build
 # time, so there is nothing else to touch. peter-evans/create-pull-request opens
 # no PR on a clean tree, so a no-op rewrite (already current) is safe.
 sed -i "$FILE" \
    -e "s|VLLM_METAL_VERSION=\"[^\"]*\"|VLLM_METAL_VERSION=\"$LATEST_TAG\"|"
 if [ -z "$CURRENT_TAG" ]; then
    echo "Could not find VLLM_METAL_VERSION=\"...\" in $FILE." >&2
    exit 0
 fi
 echo "vllm-metal ${CURRENT_TAG} -> ${LATEST_TAG} (builds vLLM ${NEW_VLLM_VERSION}): https://github.com/$REPO/releases/tag/${LATEST_TAG}" >> "${VAR}_message.txt"
 echo "${LATEST_TAG}" >> "${VAR}_commit.txt"
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
@@ -44,7 +44,7 @@ jobs:
      has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
      - name: Setup Bun
        uses: oven-sh/setup-bun@v2
--- a/.github/workflows/backend_build.yml
+++ b/.github/workflows/backend_build.yml
@@ -101,7 +101,7 @@ jobs:
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
--- a/.github/workflows/backend_build_darwin.yml
+++ b/.github/workflows/backend_build_darwin.yml
@@ -57,7 +57,7 @@ jobs:
      HOMEBREW_NO_ANALYTICS: '1'
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
@@ -82,7 +82,7 @@ jobs:
      # as the Linux registry cache.
      - name: Restore Homebrew cache
        id: brew-cache
-        uses: actions/cache/restore@v6
+        uses: actions/cache/restore@v4
        with:
          path: |
            ~/Library/Caches/Homebrew/downloads
@@ -98,8 +98,6 @@ jobs:
            /opt/homebrew/Cellar/hiredis
            /opt/homebrew/Cellar/xxhash
            /opt/homebrew/Cellar/zstd
            /opt/homebrew/Cellar/nlohmann-json
            /opt/homebrew/Cellar/opus
          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
      - name: Dependencies
@@ -111,15 +109,7 @@ jobs:
          # Without explicitly installing them, a brew cache-hit run restores
          # ccache's Cellar dir but skips installing those transitive deps,
          # and ccache fails at runtime with `dyld: Library not loaded`.
-          # nlohmann-json is header-only and required by the ds4 backend
+          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd
          # (dsml_renderer.cpp includes <nlohmann/json.hpp>); on Linux it comes
          # from the apt-installed nlohmann-json3-dev in the build image.
          # opus + pkg-config are required by the opus go backend: its
          # Makefile/package.sh call `pkg-config --cflags/--libs opus` to build
          # libopusshim.dylib and to locate libopus.dylib for bundling. brew's
          # pkg-config defaults its search path to the Homebrew prefix so the
          # opus.pc is found.
          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd nlohmann-json opus pkg-config
          # Force-reinstall ccache so brew re-validates its full runtime-dep
          # closure on every run. This is the durable fix: when the upstream
          # ccache formula gains a new transitive dep (as it has multiple times
@@ -138,11 +128,11 @@ jobs:
          # and decides "already installed" without re-linking, so on a cache-
          # hit run the formulas aren't on PATH. Force-link them; --overwrite
          # tolerates pre-existing symlinks from earlier installs.
-          brew link --overwrite protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd nlohmann-json opus pkg-config 2>/dev/null || true
+          brew link --overwrite protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd 2>/dev/null || true
      - name: Save Homebrew cache
        if: github.event_name != 'pull_request' && steps.brew-cache.outputs.cache-hit != 'true'
-        uses: actions/cache/save@v6
+        uses: actions/cache/save@v4
        with:
          path: |
            ~/Library/Caches/Homebrew/downloads
@@ -158,8 +148,6 @@ jobs:
            /opt/homebrew/Cellar/hiredis
            /opt/homebrew/Cellar/xxhash
            /opt/homebrew/Cellar/zstd
            /opt/homebrew/Cellar/nlohmann-json
            /opt/homebrew/Cellar/opus
          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
      # ---- ccache for llama.cpp CMake builds ----
@@ -178,7 +166,7 @@ jobs:
      - name: Restore ccache
        if: inputs.backend == 'llama-cpp'
        id: ccache-cache
-        uses: actions/cache/restore@v6
+        uses: actions/cache/restore@v4
        with:
          path: ~/Library/Caches/ccache
          key: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-${{ github.run_id }}
@@ -211,7 +199,7 @@ jobs:
      - name: Restore Python wheel cache
        if: inputs.lang == 'python'
        id: pyenv-cache
-        uses: actions/cache/restore@v6
+        uses: actions/cache/restore@v4
        with:
          path: |
            ~/Library/Caches/pip
@@ -235,17 +223,8 @@ jobs:
        run: |
          make backends/ds4-darwin
      # privacy-filter is a C++/ggml backend like ds4 - a single grpc-server with
      # otool dylib bundling - so it gets its own bespoke darwin script rather than
      # the generic build-darwin-go-backend path.
      - name: Build privacy-filter backend (Darwin Metal)
        if: inputs.backend == 'privacy-filter'
        run: |
          make protogen-go
          make backends/privacy-filter-darwin
      - name: Build ${{ inputs.backend }}-darwin
-        if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4' && inputs.backend != 'privacy-filter'
+        if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4'
        run: |
          make protogen-go
          BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend
@@ -256,14 +235,14 @@ jobs:
      - name: Save ccache
        if: inputs.backend == 'llama-cpp' && github.event_name != 'pull_request'
-        uses: actions/cache/save@v6
+        uses: actions/cache/save@v4
        with:
          path: ~/Library/Caches/ccache
          key: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-${{ github.run_id }}
      - name: Save Python wheel cache
        if: inputs.lang == 'python' && github.event_name != 'pull_request' && steps.pyenv-cache.outputs.cache-hit != 'true'
-        uses: actions/cache/save@v6
+        uses: actions/cache/save@v4
        with:
          path: |
            ~/Library/Caches/pip
--- a/.github/workflows/backend_merge.yml
+++ b/.github/workflows/backend_merge.yml
@@ -49,7 +49,7 @@ jobs:
      # Sparse checkout: the merge job needs `.github/scripts/` (for the
      # keepalive cleanup script) but none of the source tree.
      - name: Checkout (.github/scripts only)
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          sparse-checkout: |
            .github/scripts
--- a/.github/workflows/backend_pr.yml
+++ b/.github/workflows/backend_pr.yml
@@ -23,7 +23,7 @@ jobs:
      has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
      - name: Setup Bun
        uses: oven-sh/setup-bun@v2
--- a/.github/workflows/base-images.yml
+++ b/.github/workflows/base-images.yml
@@ -127,7 +127,7 @@ jobs:
            # the original l4t matrix entry which set skip-drivers: 'true'.
            skip-drivers: 'true'
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
        with:
          submodules: false
      - name: Free disk space
--- a/.github/workflows/build-test.yaml
+++ b/.github/workflows/build-test.yaml
@@ -11,7 +11,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Set up Go
@@ -25,7 +25,7 @@ jobs:
    runs-on: macos-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Set up Go
@@ -47,7 +47,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Configure apt mirror on runner
--- a/.github/workflows/bump-inference-defaults.yml
+++ b/.github/workflows/bump-inference-defaults.yml
@@ -14,7 +14,7 @@ jobs:
  bump:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
      - uses: actions/setup-go@v5
        with:
--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -26,10 +26,6 @@ jobs:
            variable: "DS4_VERSION"
            branch: "main"
            file: "backend/cpp/ds4/Makefile"
-          - repository: "localai-org/privacy-filter.cpp"
-            variable: "PRIVACY_FILTER_VERSION"
-            branch: "master"
-            file: "backend/cpp/privacy-filter/Makefile"
          - repository: "ggml-org/whisper.cpp"
            variable: "WHISPER_CPP_VERSION"
            branch: "master"
@@ -42,22 +38,10 @@ jobs:
            variable: "PARAKEET_VERSION"
            branch: "master"
            file: "backend/go/parakeet-cpp/Makefile"
-          - repository: "mudler/ced.cpp"
-            variable: "CED_VERSION"
-            branch: "master"
-            file: "backend/go/ced/Makefile"
+          - repository: "mudler/dllm.cpp"
+            variable: "DLLM_VERSION"
+            branch: "main"
+            file: "backend/go/dllm/Makefile"
-          - repository: "mudler/voice-detect.cpp"
-            variable: "VOICEDETECT_VERSION"
-            branch: "master"
-            file: "backend/go/voice-detect/Makefile"
-          - repository: "mudler/face-detect.cpp"
-            variable: "FACEDETECT_VERSION"
-            branch: "master"
-            file: "backend/go/face-detect/Makefile"
-          - repository: "mudler/depth-anything.cpp"
-            variable: "DEPTHANYTHING_VERSION"
-            branch: "master"
-            file: "backend/go/depth-anything-cpp/Makefile"
          - repository: "leejet/stable-diffusion.cpp"
            variable: "STABLEDIFFUSION_GGML_VERSION"
            branch: "master"
@@ -82,25 +66,17 @@ jobs:
            variable: "RFDETR_VERSION"
            branch: "main"
            file: "backend/go/rfdetr-cpp/Makefile"
-          - repository: "mudler/locate-anything.cpp"
-            variable: "LOCATEANYTHING_VERSION"
-            branch: "master"
-            file: "backend/go/locate-anything-cpp/Makefile"
-          - repository: "ServeurpersoCom/qwentts.cpp"
+          - repository: "predict-woo/qwen3-tts.cpp"
            variable: "QWEN3TTS_CPP_VERSION"
-            branch: "master"
+            branch: "main"
            file: "backend/go/qwen3-tts-cpp/Makefile"
          - repository: "ServeurpersoCom/omnivoice.cpp"
            variable: "OMNIVOICE_VERSION"
            branch: "master"
            file: "backend/go/omnivoice-cpp/Makefile"
-          - repository: "localai-org/vibevoice.cpp"
-            variable: "VIBEVOICE_CPP_VERSION"
-            branch: "master"
-            file: "backend/go/vibevoice-cpp/Makefile"
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
      - name: Bump dependencies 🔧
        id: bump
        run: |
@@ -136,7 +112,7 @@ jobs:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
      - name: Bump vLLM cu130 wheel pin 🔧
        id: bump
        run: |
@@ -162,39 +138,3 @@ jobs:
          branch: "update/VLLM_VERSION"
          body: ${{ steps.bump.outputs.message }}
          signoff: true
-  bump-vllm-metal:
-    # The darwin (Apple Silicon) vLLM build installs vllm-metal, which is locked
-    # to a specific vLLM source release. install.sh pins both VLLM_METAL_VERSION
-    # (the wheel release) and VLLM_VERSION (the vLLM it builds against); this job
-    # tracks vllm-project/vllm-metal and rewrites both atomically. Separate from
-    # bump-vllm-wheel because darwin follows vllm-metal, not vllm/vllm latest.
-    if: github.repository == 'mudler/LocalAI'
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v7
-      - name: Bump vllm-metal pin 🔧
-        id: bump
-        run: |
-          bash .github/bump_vllm_metal.sh vllm-project/vllm-metal backend/python/vllm/install.sh VLLM_METAL_VERSION
-          {
-            echo 'message<<EOF'
-            cat "VLLM_METAL_VERSION_message.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          {
-            echo 'commit<<EOF'
-            cat "VLLM_METAL_VERSION_commit.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          rm -rfv VLLM_METAL_VERSION_message.txt VLLM_METAL_VERSION_commit.txt
-      - name: Create Pull Request
-        uses: peter-evans/create-pull-request@v8
-        with:
-          token: ${{ secrets.UPDATE_BOT_TOKEN }}
-          push-to-fork: ci-forks/LocalAI
-          commit-message: ':arrow_up: Update vllm-project/vllm-metal (darwin)'
-          title: 'chore: :arrow_up: Update vllm-metal (darwin) to `${{ steps.bump.outputs.commit }}`'
-          branch: "update/VLLM_METAL_VERSION"
-          body: ${{ steps.bump.outputs.message }}
-          signoff: true
--- a/.github/workflows/bump_docs.yaml
+++ b/.github/workflows/bump_docs.yaml
@@ -13,7 +13,7 @@ jobs:
          - repository: "mudler/LocalAI"
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
      - name: Bump dependencies 🔧
        run: |
          bash .github/bump_docs.sh ${{ matrix.repository }}
--- a/.github/workflows/checksum_checker.yaml
+++ b/.github/workflows/checksum_checker.yaml
@@ -8,7 +8,7 @@ jobs:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
      - name: Configure apt mirror on runner
        uses: ./.github/actions/configure-apt-mirror
      - name: Install dependencies
--- a/.github/workflows/deploy-explorer.yaml
+++ b/.github/workflows/deploy-explorer.yaml
@@ -16,7 +16,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - uses: actions/setup-go@v5
--- a/.github/workflows/gallery-agent.yaml
+++ b/.github/workflows/gallery-agent.yaml
@@ -31,7 +31,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
--- a/.github/workflows/generate_intel_image.yaml
+++ b/.github/workflows/generate_intel_image.yaml
@@ -44,7 +44,7 @@ jobs:
        uses: docker/setup-buildx-action@master
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
      - name: Cache Intel images
        uses: docker/build-push-action@v7
--- a/.github/workflows/gh-pages.yml
+++ b/.github/workflows/gh-pages.yml
@@ -28,7 +28,7 @@ jobs:
      HUGO_VERSION: "0.146.3"
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0  # needed for enableGitInfo
          submodules: true
--- a/.github/workflows/image_build.yml
+++ b/.github/workflows/image_build.yml
@@ -80,7 +80,7 @@ jobs:
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
      - name: Configure apt mirror on runner
        id: apt_mirror
--- a/.github/workflows/image_merge.yml
+++ b/.github/workflows/image_merge.yml
@@ -36,7 +36,7 @@ jobs:
      # Sparse checkout: needed for .github/scripts/ (the keepalive cleanup
      # script). Skips the rest of the source tree.
      - name: Checkout (.github/scripts only)
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          sparse-checkout: |
            .github/scripts
--- a/.github/workflows/lint.yml
+++ b/.github/workflows/lint.yml
@@ -20,7 +20,7 @@ jobs:
  golangci-lint:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
        with:
          # Full history so golangci-lint's new-from-merge-base can reach
          # origin/master and compute the diff against it.
--- a/.github/workflows/release.yaml
+++ b/.github/workflows/release.yaml
@@ -10,7 +10,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Set up Go
@@ -24,35 +24,20 @@ jobs:
          args: release --clean
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          MACOS_SIGN_P12: ${{ secrets.MACOS_CERTIFICATE }}
-          MACOS_SIGN_PASSWORD: ${{ secrets.MACOS_CERTIFICATE_PWD }}
-          MACOS_NOTARY_KEY: ${{ secrets.MACOS_NOTARY_KEY }}
-          MACOS_NOTARY_KEY_ID: ${{ secrets.MACOS_NOTARY_KEY_ID }}
-          MACOS_NOTARY_ISSUER_ID: ${{ secrets.MACOS_NOTARY_ISSUER_ID }}
  launcher-build-darwin:
    runs-on: macos-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version: 1.23
-      - name: Import signing certificate
-        env:
-          MACOS_CERTIFICATE: ${{ secrets.MACOS_CERTIFICATE }}
-          MACOS_CERTIFICATE_PWD: ${{ secrets.MACOS_CERTIFICATE_PWD }}
-          MACOS_CI_KEYCHAIN_PWD: ${{ secrets.MACOS_CI_KEYCHAIN_PWD }}
-        run: bash contrib/macos/sign-and-notarize.sh import-cert
-      - name: Build, sign and notarize the DMG
-        env:
-          MACOS_SIGN_IDENTITY: ${{ secrets.MACOS_SIGN_IDENTITY }}
-          MACOS_NOTARY_KEY: ${{ secrets.MACOS_NOTARY_KEY }}
-          MACOS_NOTARY_KEY_ID: ${{ secrets.MACOS_NOTARY_KEY_ID }}
-          MACOS_NOTARY_ISSUER_ID: ${{ secrets.MACOS_NOTARY_ISSUER_ID }}
-        run: make release-launcher-darwin
+      - name: Build launcher for macOS ARM64
+        run: |
+          make build-launcher-darwin
      - name: Upload DMG to Release
        uses: softprops/action-gh-release@v3
        with:
@@ -61,7 +46,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Configure apt mirror on runner
--- a/.github/workflows/secscan.yaml
+++ b/.github/workflows/secscan.yaml
@@ -14,17 +14,14 @@ jobs:
      GO111MODULE: on
    steps:
      - name: Checkout Source
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        if: ${{ github.actor != 'dependabot[bot]' }}
      - name: Run Gosec Security Scanner
        if: ${{ github.actor != 'dependabot[bot]' }}
        uses: securego/gosec@v2.27.1
        with:
          # we let the report trigger content trigger a failure using the GitHub Security features.
-          # backend/go/supertonic is excluded: it vendors upstream supertone-inc/supertonic
-          # (helper.go), whose findings (G304 model-file loads, G404 math/rand for flow-matching
-          # noise, G104 unhandled errors) are inherent to that upstream code, not ours to rewrite.
-          args: '-no-fail -exclude-dir=backend/go/supertonic -fmt sarif -out results.sarif ./...'
+          args: '-no-fail -fmt sarif -out results.sarif ./...'
      - name: Upload SARIF file
        if: ${{ github.actor != 'dependabot[bot]' }}
        uses: github/codeql-action/upload-sarif@v4
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
@@ -38,7 +38,6 @@ jobs:
      acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
      qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
      rfdetr-cpp: ${{ steps.detect.outputs.rfdetr-cpp }}
-      locate-anything-cpp: ${{ steps.detect.outputs.locate-anything-cpp }}
      vibevoice-cpp: ${{ steps.detect.outputs.vibevoice-cpp }}
      localvqe: ${{ steps.detect.outputs.localvqe }}
      voxtral: ${{ steps.detect.outputs.voxtral }}
@@ -50,7 +49,7 @@ jobs:
      parakeet-cpp: ${{ steps.detect.outputs.parakeet-cpp }}
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
      - name: Setup Bun
        uses: oven-sh/setup-bun@v2
      - name: Install dependencies
@@ -67,7 +66,7 @@ jobs:
  #   runs-on: ubuntu-latest
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -90,7 +89,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -113,7 +112,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -137,7 +136,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -158,7 +157,7 @@ jobs:
  #  runs-on: ubuntu-latest
  #  steps:
  #    - name: Clone
-  #      uses: actions/checkout@v7
+  #      uses: actions/checkout@v6
  #      with:
  #        submodules: true
  #    - name: Dependencies
@@ -178,7 +177,7 @@ jobs:
  #   runs-on: ubuntu-latest
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -240,7 +239,7 @@ jobs:
  #           sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
  #           df -h
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -265,7 +264,7 @@ jobs:
  #   runs-on: ubuntu-latest
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -288,7 +287,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -309,7 +308,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -330,7 +329,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -351,7 +350,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -373,7 +372,7 @@ jobs:
  #   timeout-minutes: 45
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -394,7 +393,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -415,7 +414,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -436,7 +435,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -462,7 +461,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -484,7 +483,7 @@ jobs:
    timeout-minutes: 30
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -513,7 +512,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -530,7 +529,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -552,7 +551,7 @@ jobs:
    timeout-minutes: 20
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -564,7 +563,7 @@ jobs:
      - name: Run e2e-backends smoke
        env:
          BACKEND_IMAGE: quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp
-          BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias,tokenize
+          BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias
        run: |
          make test-extra-backend
  # Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked LLM.
@@ -579,7 +578,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -604,7 +603,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -625,7 +624,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -645,7 +644,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -664,7 +663,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -681,7 +680,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -698,7 +697,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -741,7 +740,7 @@ jobs:
  #   timeout-minutes: 90
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -783,7 +782,7 @@ jobs:
  #   timeout-minutes: 90
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -808,7 +807,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -840,7 +839,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -876,7 +875,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -902,45 +901,6 @@ jobs:
      - name: Test rfdetr-cpp
        run: |
          make --jobs=5 --output-sync=target -C backend/go/rfdetr-cpp test
-  # Per-backend e2e for locate-anything-cpp: builds the .so + Go binary and
-  # runs `make -C backend/go/locate-anything-cpp test`. test.sh fetches the
-  # locate-anything-q8_0 GGUF (~6.3 GB, NVIDIA LocateAnything-3B) from the
-  # published mudler/locate-anything.cpp-gguf HF repo + a COCO image, then the
-  # Go wire test loads the model and runs an open-vocabulary Detect, asserting
-  # at least one labeled box. Heavier than the other Go backends (it is a 3B),
-  # so it is gated to changes under backend/go/locate-anything-cpp/.
-  tests-locate-anything-cpp:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.locate-anything-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-        with:
-          submodules: true
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y build-essential cmake curl libopenblas-dev
-      - name: Setup Go
-        uses: actions/setup-go@v5
-      - name: Display Go version
-        run: go version
-      - name: Proto Dependencies
-        run: |
-          # Install protoc
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Build locate-anything-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/locate-anything-cpp
-      - name: Test locate-anything-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/locate-anything-cpp test
  # Per-backend smoke for vibevoice-cpp: builds the .so + Go binary and
  # runs `make -C backend/go/vibevoice-cpp test`. test.sh auto-downloads
  # the published mudler/vibevoice.cpp-models bundle (TTS Q8_0 + ASR Q4_K
@@ -952,7 +912,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -987,7 +947,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -1008,16 +968,12 @@ jobs:
  # image + working dir.
  tests-vibevoice-cpp-grpc-transcription:
    needs: detect-changes
-    # Skip on release tag pushes: the ASR Q4_K model is ~10 GB and cannot be
-    # pulled from HF within the inner `go test -timeout 30m` budget on a CI
-    # runner, so every tag build hung and timed out. Still runs on PRs/branch
-    # pushes that touch vibevoice-cpp so regressions are caught off the release path.
-    if: (needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true') && !startsWith(github.ref, 'refs/tags/')
+    if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: bigger-runner
    timeout-minutes: 150
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -1046,7 +1002,7 @@ jobs:
    timeout-minutes: 60
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -1062,7 +1018,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -1095,7 +1051,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -1118,7 +1074,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -1144,7 +1100,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -21,7 +21,7 @@ jobs:
        go-version: ['1.26.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Free disk space
@@ -84,7 +84,7 @@ jobs:
        go-version: ['1.26.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go ${{ matrix.go-version }}
@@ -121,19 +121,3 @@ jobs:
          detached: true
          connect-timeout-seconds: 180
          limit-access-to-actor: true
-  # Fast standalone unit tests for the backends' pure C++ helpers - currently the
-  # llama-cpp message reconstruction (backend/cpp/llama-cpp/message_content.h),
-  # which guards the OpenAI chat content normalization (mudler/LocalAI#10524,
-  # #7324, #7528). The runner discovers every *_test.cpp under backend/cpp/, so
-  # new pure-C++ unit tests are picked up with no CI changes. These need only the
-  # C++ stdlib + nlohmann/json, so they run on every PR without the full
-  # llama.cpp + gRPC backend build. (The same suite is also wired as an opt-in
-  # CMake/ctest target, -DLLAMA_GRPC_BUILD_TESTS=ON, for in-backend-build runs.)
-  tests-backend-cpp:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-      - name: Run backend C++ unit tests
-        run: make test-backend-cpp
--- a/.github/workflows/tests-aio.yml
+++ b/.github/workflows/tests-aio.yml
@@ -62,7 +62,7 @@ jobs:
          sudo rm -rfv build || true
          df -h
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
--- a/.github/workflows/tests-e2e.yml
+++ b/.github/workflows/tests-e2e.yml
@@ -21,7 +21,7 @@ jobs:
        go-version: ['1.25.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Configure apt mirror on runner
--- a/.github/workflows/tests-pii-ner-e2e.yml
+++ b/.github/workflows/tests-pii-ner-e2e.yml
@@ -1,97 +0,0 @@
 ---
 name: 'PII NER tier E2E (live GGUF, CPU)'
 # Runs the real privacy-filter GGUF NER tier end-to-end on CPU — the gap the
 # hermetic tests/e2e suite cannot cover (it only exercises the in-process
 # pattern tier). Heavy (builds the C++ backend image + downloads a ~2.7 GB
 # GGUF), so it is path-filtered on PRs and otherwise runs nightly / on demand.
 #
 # This drives the container-level harness (tests/e2e-backends) via
 # `make test-extra-backend-privacy-filter`: it builds the privacy-filter image,
 # downloads the model, loads it on CPU, and asserts byte-correct, UTF-8-aligned
 # TokenClassify spans. The complementary HTTP-path specs in tests/e2e
 # (e2e_pii_ner_test.go) Skip unless PII_NER_MODEL_GGUF is wired.
 on:
  workflow_dispatch:
  schedule:
    - cron: '0 3 * * *'
  push:
    branches:
      - master
    paths:
      - 'backend/cpp/privacy-filter/**'
      - 'backend/Dockerfile.privacy-filter'
      - 'core/services/routing/pii/**'
      - 'core/services/routing/piidetector/**'
      - 'core/backend/token_classify.go'
      - 'core/http/endpoints/localai/pii.go'
      - 'core/schema/pii.go'
      - 'tests/e2e-backends/**'
      - 'tests/e2e/e2e_pii_ner_test.go'
      - 'tests/e2e/e2e_suite_test.go'
      - '.github/workflows/tests-pii-ner-e2e.yml'
  pull_request:
    paths:
      - 'backend/cpp/privacy-filter/**'
      - 'backend/Dockerfile.privacy-filter'
      - 'core/services/routing/pii/**'
      - 'core/services/routing/piidetector/**'
      - 'core/backend/token_classify.go'
      - 'core/http/endpoints/localai/pii.go'
      - 'core/schema/pii.go'
      - 'tests/e2e-backends/**'
      - 'tests/e2e/e2e_pii_ner_test.go'
      - 'tests/e2e/e2e_suite_test.go'
      - '.github/workflows/tests-pii-ner-e2e.yml'
 concurrency:
  group: ci-tests-pii-ner-e2e-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
 jobs:
  tests-pii-ner-e2e:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        go-version: ['1.25.x']
    steps:
      - name: Clone
        uses: actions/checkout@v7
        with:
          submodules: true
      - name: Free disk space
        run: |
          sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc /opt/hostedtoolcache/CodeQL || true
          sudo docker image prune --all --force || true
          df -h
      - name: Configure apt mirror on runner
        uses: ./.github/actions/configure-apt-mirror
      - name: Setup Go ${{ matrix.go-version }}
        uses: actions/setup-go@v5
        with:
          go-version: ${{ matrix.go-version }}
          cache: false
      - name: Proto Dependencies
        run: |
          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
          rm protoc.zip
          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
          PATH="$PATH:$HOME/go/bin" make protogen-go
      - name: Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential
      # Builds local-ai-backend:privacy-filter, downloads the GGUF, loads it on
      # CPU and runs the token_classify capability spec (byte-offset contract).
      - name: Run live PII NER backend E2E
        run: PATH="$PATH:$HOME/go/bin" make test-extra-backend-privacy-filter
      - name: Setup tmate session if tests fail
        if: ${{ failure() }}
        uses: mxschmitt/action-tmate@v3.23
        with:
          detached: true
          connect-timeout-seconds: 180
          limit-access-to-actor: true
--- a/.github/workflows/tests-ui-e2e.yml
+++ b/.github/workflows/tests-ui-e2e.yml
@@ -23,7 +23,7 @@ jobs:
        go-version: ['1.26.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Configure apt mirror on runner
--- a/.github/workflows/update_swagger.yaml
+++ b/.github/workflows/update_swagger.yaml
@@ -10,7 +10,7 @@ jobs:
      fail-fast: false
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
      - name: Configure apt mirror on runner
        uses: ./.github/actions/configure-apt-mirror
      - uses: actions/setup-go@v5
--- a/.gitignore
+++ b/.gitignore
@@ -91,9 +91,3 @@ core/http/react-ui/test-results/
 # Local worktrees
 .worktrees/
 # SDD / brainstorm scratch (agent-driven development)
 .superpowers/
 # Local Apple signing material (never commit)
 .certs/
--- a/.golangci.yml
+++ b/.golangci.yml
@@ -74,8 +74,6 @@ linters:
    paths:
      # Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
      - 'backend/go/whisper/sources'
      # Vendored upstream supertonic pipeline (supertone-inc/supertonic go/helper.go).
      - 'backend/go/supertonic/helper.go'
      - 'docs/'
    rules:
      # CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
--- a/.goreleaser.yaml
+++ b/.goreleaser.yaml
@@ -9,8 +9,7 @@ source:
  enabled: true
  name_template: '{{ .ProjectName }}-{{ .Tag }}-source'
 builds:
-  - id: local-ai
+  - main: ./cmd/local-ai
    main: ./cmd/local-ai
    env:
      - CGO_ENABLED=0
    ldflags:
@@ -36,19 +35,3 @@ snapshot:
  version_template: "{{ .Tag }}-next"
 changelog:
  use: github-native
 # Sign + notarize the macOS server binary via the quill backend (runs on Linux,
 # no macOS runner needed). Disabled automatically when MACOS_SIGN_P12 is unset
 # (forks / PRs), so those builds stay unsigned and green.
 notarize:
  macos:
    - enabled: '{{ isEnvSet "MACOS_SIGN_P12" }}'
      ids:
        - local-ai
      sign:
        certificate: "{{.Env.MACOS_SIGN_P12}}"
        password: "{{.Env.MACOS_SIGN_PASSWORD}}"
      notarize:
        issuer_id: "{{.Env.MACOS_NOTARY_ISSUER_ID}}"
        key_id: "{{.Env.MACOS_NOTARY_KEY_ID}}"
        key: "{{.Env.MACOS_NOTARY_KEY}}"
        wait: true
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -26,6 +26,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 | [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
 | [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
 | [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
 | [.agents/dllm-backend.md](.agents/dllm-backend.md) | Working on the dllm backend (DiffusionGemma block-diffusion) - purego C-ABI binding, per-ctx serialization contract, gemma4 renderer/parser, gated test layers |
 | [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
 | [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
 | [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
@@ -43,5 +44,4 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 - **New API endpoints**: LocalAI advertises its capability surface in several independent places — swagger `@Tags`, `/api/instructions` registry, auth `RouteFeatureRegistry`, React UI `capabilities.js`, docs. Read [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) and follow its checklist — missing any surface means clients, admins, and the UI won't know the endpoint exists.
 - **Admin endpoints → MCP tool**: every admin endpoint that an admin would manage conversationally (install/list/edit/toggle/upgrade) MUST also be exposed as an MCP tool in `pkg/mcp/localaitools/`. The LocalAI Assistant chat modality and the standalone `local-ai mcp-server` consume that package; drift between REST and MCP is a real risk. Read [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) — the `TestToolHTTPRouteMappingComplete` test fails until you wire the new tool and update the route map.
 - **Build**: Inspect `Makefile` and `.github/workflows/` — ask the user before running long builds
 - **Backend OS coverage**: a new backend must target every OS it can build for, not just Linux. `.github/backend-matrix.yml` has two matrices — `include:` (Linux) and `includeDarwin:` (macOS / Apple Silicon). Most C/C++/GGML and many Python backends build on Darwin too — wire the `includeDarwin` entry + `backend/index.yaml` `metal:` entries, or say in the PR why an OS is unsupported. See the darwin checklist in [.agents/adding-backends.md](.agents/adding-backends.md).
 - **UI**: The active UI is the React app in `core/http/react-ui/`. The older Alpine.js/HTML UI in `core/http/static/` is pending deprecation — all new UI work goes in the React UI
--- a/1
+++ b/1
@@ -108,7 +108,6 @@ RUN <<EOT bash
        apt-get update && \
        apt-get install -y --no-install-recommends \
            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            cuda-nvrtc-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
--- a/81
+++ b/81
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter backends/privacy-filter-darwin
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/dllm backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -103,7 +103,7 @@ COVERAGE_E2E_LABELS?=!real-models
 COVERAGE_EXCLUDE_RE?=grpc/proto/.*[.]pb[.]go
-.PHONY: all test test-coverage test-coverage-baseline test-coverage-check test-backend-cpp test-ui test-ui-coverage-baseline test-ui-coverage-check install-hooks build vendor lint lint-all
+.PHONY: all test test-coverage test-coverage-baseline test-coverage-check test-ui test-ui-coverage-baseline test-ui-coverage-check install-hooks build vendor lint lint-all
 all: help
@@ -201,13 +201,6 @@ test: prepare-test
 	OPUS_SHIM_LIBRARY=$(abspath ./pkg/opus/shim/libopusshim.so) \
 	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) --fail-fast -v -r $(TEST_PATHS)
 ## Compiles and runs the standalone C++ unit tests for the backends (pure
 ## helpers that depend only on the stdlib + nlohmann/json, no full backend
 ## build). Discovers every *_test.cpp under backend/cpp/ - see
 ## backend/cpp/run-unit-tests.sh. Set NLOHMANN_INCLUDE to skip the header fetch.
 test-backend-cpp:
 	bash backend/cpp/run-unit-tests.sh
 ## Runs the core suite ($(TEST_PATHS)) with statement-coverage instrumentation
 ## and writes a merged profile to $(COVERAGE_PROFILE). Deliberately omits
 ## --fail-fast so a single failure doesn't truncate the coverage number, and
@@ -573,7 +566,6 @@ prepare-test-extra: protogen-python
 	$(MAKE) -C backend/python/speaker-recognition
 	$(MAKE) -C backend/rust/kokoros kokoros-grpc
 	$(MAKE) -C backend/go/rfdetr-cpp
 	$(MAKE) -C backend/go/locate-anything-cpp
 test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/transformers test
@@ -601,9 +593,6 @@ test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/speaker-recognition test
 	$(MAKE) -C backend/rust/kokoros test
 	$(MAKE) -C backend/go/rfdetr-cpp test
 	$(MAKE) -C backend/go/locate-anything-cpp test
 	$(MAKE) -C backend/go/depth-anything-cpp test
 	$(MAKE) -C backend/go/supertonic test
 ##
 ## End-to-end gRPC tests that exercise a built backend container image.
@@ -697,16 +686,6 @@ test-extra-backend-llama-cpp-transcription: docker-build-llama-cpp
 	BACKEND_TEST_CTX_SIZE=2048 \
 	$(MAKE) test-extra-backend
 ## privacy-filter: the PII/NER token-classification backend. Exercises the
 ## TokenClassify RPC and asserts byte-correct, UTF-8-aligned span offsets
 ## against the openai-privacy-filter multilingual GGUF (CPU-runnable, ~50M
 ## active params). This is the live-backend coverage for the PII NER tier.
 test-extra-backend-privacy-filter: docker-build-privacy-filter
 	BACKEND_IMAGE=local-ai-backend:privacy-filter \
 	BACKEND_TEST_MODEL_URL=https://huggingface.co/LocalAI-io/privacy-filter-multilingual-GGUF/resolve/main/privacy-filter-multilingual-f16.gguf \
 	BACKEND_TEST_CAPS=health,load,token_classify \
 	$(MAKE) test-extra-backend
 ## vllm is resolved from a HuggingFace model id (no file download) and
 ## exercises Predict + streaming + tool-call extraction via the hermes parser.
 ## Requires a host CPU with the SIMD instructions the prebuilt vllm CPU
@@ -1136,10 +1115,6 @@ backends/ds4-darwin: build
 	bash ./scripts/build/ds4-darwin.sh
 	./local-ai backends install "ocifile://$(abspath ./backend-images/ds4.tar)"
 backends/privacy-filter-darwin: build
 	bash ./scripts/build/privacy-filter-darwin.sh
 	./local-ai backends install "ocifile://$(abspath ./backend-images/privacy-filter.tar)"
 build-darwin-python-backend: build
 	bash ./scripts/build/python-darwin.sh
@@ -1185,10 +1160,6 @@ BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
 # Single-model; hardware-only validation lives at tests/e2e-backends/
 # (BACKEND_BINARY mode); see docs/superpowers/plans/2026-05-11-ds4-backend.md.
 BACKEND_DS4 = ds4|ds4|.|false|false
 # privacy-filter wraps the standalone privacy-filter.cpp GGML engine (the
 # openai-privacy-filter PII/NER token classifier) — the TokenClassify RPC for
 # the PII redactor tier, on stock ggml with no llama.cpp carry-patches.
 BACKEND_PRIVACY_FILTER = privacy-filter|privacy-filter|.|false|false
 # Golang backends
 BACKEND_PIPER = piper|golang|.|false|true
@@ -1200,16 +1171,16 @@ BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|tr
 BACKEND_WHISPER = whisper|golang|.|false|true
 BACKEND_CRISPASR = crispasr|golang|.|false|true
 BACKEND_PARAKEET_CPP = parakeet-cpp|golang|.|false|true
-BACKEND_DEPTH_ANYTHING_CPP = depth-anything-cpp|golang|.|false|true
+# dllm is mudler/dllm.cpp, the DiffusionGemma block-diffusion engine,
 # wrapped by the purego backend at backend/go/dllm.
 BACKEND_DLLM = dllm|golang|.|false|true
 BACKEND_VOXTRAL = voxtral|golang|.|false|true
 BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
 BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
 BACKEND_OMNIVOICE_CPP = omnivoice-cpp|golang|.|false|true
 BACKEND_VIBEVOICE_CPP = vibevoice-cpp|golang|.|false|true
 BACKEND_LOCALVQE = localvqe|golang|.|false|true
 BACKEND_OPUS = opus|golang|.|false|true
 BACKEND_SHERPA_ONNX = sherpa-onnx|golang|.|false|true
 BACKEND_SUPERTONIC = supertonic|golang|.|false|true
 # Python backends with root context
 BACKEND_RERANKERS = rerankers|python|.|false|true
@@ -1283,7 +1254,6 @@ $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
 $(eval $(call generate-docker-build-target,$(BACKEND_DS4)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PRIVACY_FILTER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_CLOUD_PROXY)))
@@ -1293,7 +1263,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_STABLEDIFFUSION_GGML)))
 $(eval $(call generate-docker-build-target,$(BACKEND_WHISPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_CRISPASR)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PARAKEET_CPP)))
-$(eval $(call generate-docker-build-target,$(BACKEND_DEPTH_ANYTHING_CPP)))
+$(eval $(call generate-docker-build-target,$(BACKEND_DLLM)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VOXTRAL)))
 $(eval $(call generate-docker-build-target,$(BACKEND_OPUS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_RERANKERS)))
@@ -1326,7 +1296,6 @@ $(eval $(call generate-docker-build-target,$(BACKEND_WHISPERX)))
 $(eval $(call generate-docker-build-target,$(BACKEND_ACE_STEP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_ACESTEP_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_QWEN3_TTS_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_OMNIVOICE_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCALVQE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_MLX)))
@@ -1339,13 +1308,12 @@ $(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_RFDETR_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SUPERTONIC)))
 # Pattern rule for docker-save targets
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar
-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-omnivoice-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy docker-build-supertonic docker-build-depth-anything-cpp docker-build-privacy-filter
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy
 ########################################################
 ### Mock Backend for E2E Tests
@@ -1460,32 +1428,13 @@ docs: docs/static/gallery.html
 ########################################################
 ## fyne cross-platform build
-# Build LocalAI.app from the launcher via fyne (metadata read from cmd/launcher/FyneApp.toml).
+build-launcher-darwin: build-launcher
-# Signing happens via contrib/macos/sign-and-notarize.sh, which is a no-op when the signing
+	go run github.com/tiagomelo/macos-dmg-creator/cmd/createdmg@latest \
-# secrets are unset, so unsigned local/fork builds keep working.
+	--appName "LocalAI" \
-build-launcher-darwin:
+	--appBinaryPath "$(LAUNCHER_BINARY_NAME)" \
-	rm -rf dist/LocalAI.app cmd/launcher/LocalAI.app
+	--bundleIdentifier "com.localai.launcher" \
-	mkdir -p dist
+	--iconPath "core/http/static/logo.png" \
-	cd cmd/launcher && go run fyne.io/tools/cmd/fyne@latest package -os darwin -icon ../../core/http/static/logo.png --executable $(LAUNCHER_BINARY_NAME)
+	--outputDir "dist/"
 	mv cmd/launcher/LocalAI.app dist/LocalAI.app
 	bash contrib/macos/sign-and-notarize.sh sign dist/LocalAI.app
 # Wrap the (signed) app into a drag-to-Applications DMG via hdiutil, then sign the DMG.
 dmg-launcher-darwin: build-launcher-darwin
 	rm -rf dist/dmg dist/LocalAI.dmg
 	mkdir -p dist/dmg
 	cp -R dist/LocalAI.app dist/dmg/LocalAI.app
 	ln -s /Applications dist/dmg/Applications
 	hdiutil create -volname "LocalAI" -srcfolder dist/dmg -ov -format UDZO dist/LocalAI.dmg
 	bash contrib/macos/sign-and-notarize.sh sign dist/LocalAI.dmg
 # Submit the DMG to Apple notarization and staple the ticket (no-op without notary secrets).
 notarize-launcher-darwin: dmg-launcher-darwin
 	bash contrib/macos/sign-and-notarize.sh notarize dist/LocalAI.dmg
 # Single entrypoint for CI: build -> sign app -> dmg -> sign dmg -> notarize -> staple.
 release-launcher-darwin: notarize-launcher-darwin
 	@echo "dist/LocalAI.dmg is ready"
 build-launcher-linux:
-	cd cmd/launcher && go run fyne.io/tools/cmd/fyne@latest package -os linux -icon ../../core/http/static/logo.png --executable $(LAUNCHER_BINARY_NAME)-linux && mv LocalAI.tar.xz ../../$(LAUNCHER_BINARY_NAME)-linux.tar.xz
+	cd cmd/launcher && go run fyne.io/tools/cmd/fyne@latest package -os linux -icon ../../core/http/static/logo.png --executable $(LAUNCHER_BINARY_NAME)-linux && mv launcher.tar.xz ../../$(LAUNCHER_BINARY_NAME)-linux.tar.xz
--- a/README.md
+++ b/README.md
@@ -29,18 +29,6 @@
 <a href="https://trendshift.io/repositories/5539" target="_blank"><img src="https://trendshift.io/api/badge/repositories/5539" alt="mudler%2FLocalAI | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 </p>
 <!-- Keep these links, translations synced daily. -->
 <p align="center">
 <a href="https://zdoc.app/de/mudler/LocalAI">Deutsch</a> |
 <a href="https://zdoc.app/es/mudler/LocalAI">Español</a> |
 <a href="https://zdoc.app/fr/mudler/LocalAI">français</a> |
 <a href="https://zdoc.app/ja/mudler/LocalAI">日本語</a> |
 <a href="https://zdoc.app/ko/mudler/LocalAI">한국어</a> |
 <a href="https://zdoc.app/pt/mudler/LocalAI">Português</a> |
 <a href="https://zdoc.app/ru/mudler/LocalAI">Русский</a> |
 <a href="https://zdoc.app/zh/mudler/LocalAI">中文</a>
 </p>
 **LocalAI** is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.
 **A small core, not a bundle.** Each backend wraps a best-in-class engine (llama.cpp, vLLM, whisper.cpp, stable-diffusion, MLX...) in its own image, pulled only when a model needs it. You install nothing you don't use.
@@ -177,11 +165,6 @@ For more details, see the [Getting Started guide](https://localai.io/basics/gett
 ## Latest News
 - **June 2026**: New native biometric backends from the LocalAI team: [voice-detect.cpp](https://github.com/mudler/voice-detect.cpp) for speaker recognition and voice analysis (ECAPA-TDNN, WeSpeaker, ERes2Net, CAM++, wav2vec2 age/gender/emotion) and [face-detect.cpp](https://github.com/mudler/face-detect.cpp) for face detection, recognition, demographics and anti-spoofing (SCRFD/ArcFace, YuNet/SFace). Both are from-scratch C++/ggml engines with no Python or onnxruntime at inference, self-contained GGUF weights, bit-exact parity with the reference, and GPU cuDNN parity, replacing the heavier Python `insightface` and `speaker-recognition` backends ([PR #10441](https://github.com/mudler/LocalAI/pull/10441)).
 - **June 2026**: New [realtime voice assistant demo](https://github.com/localai-org/localai-realtime-demo) (a tiny Go client for the Realtime API with a full talk-back voice loop and tool calling), plus [streaming of the realtime LLM / TTS / transcription pipeline stages](https://github.com/mudler/LocalAI/pull/10176) and [configurable WebRTC ICE candidates](https://github.com/mudler/LocalAI/pull/10231).
 - **June 2026**: Big speech push: the [parakeet.cpp](https://github.com/mudler/parakeet.cpp) ASR engine gains [NeMo-faithful segment timestamps](https://github.com/mudler/LocalAI/pull/10207), a [multilingual streaming Nemotron-3.5 model](https://github.com/mudler/LocalAI/pull/10199), [dynamic batching for concurrent transcription](https://github.com/mudler/LocalAI/pull/10112) and [CUDA graphs](https://github.com/mudler/LocalAI/pull/10273); the new [CrispASR backend](https://github.com/mudler/LocalAI/pull/10099) adds multi-architecture ASR + TTS, and [60 Piper TTS voices across 42 languages](https://github.com/mudler/LocalAI/pull/10296) land in the gallery (plus [per-request TTS instructions and params](https://github.com/mudler/LocalAI/pull/10172)).
 - **June 2026**: New backends and models: [locate-anything.cpp](https://github.com/mudler/LocalAI/pull/10264) for open-vocabulary object detection via ggml, [Ideogram4 image generation](https://github.com/mudler/LocalAI/pull/10201) in stablediffusion-ggml, [llama.cpp video input](https://github.com/mudler/LocalAI/pull/10216), and the [Gemma 4 QAT family with MTP speculative-decoding pairs](https://github.com/mudler/LocalAI/pull/10215). Plus an [interactive CLI chat mode](https://github.com/mudler/LocalAI/pull/10226) and [RAG source citations in agent responses](https://github.com/mudler/LocalAI/pull/10228).
 - **June 2026**: Distributed mode hardening: [prefix-cache-aware routing](https://github.com/mudler/LocalAI/pull/10071), a [production-ready request router with auto-sized embedding/rerank batches](https://github.com/mudler/LocalAI/pull/10104), [ds4 layer-split distributed inference](https://github.com/mudler/LocalAI/pull/10098), [NATS JWT auth + TLS/mTLS](https://github.com/mudler/LocalAI/pull/10159), and [resumable file uploads](https://github.com/mudler/LocalAI/pull/10109).
 - **May 2026**: **LocalAI 4.3.0** - `llama.cpp` [prompt cache on by default](https://github.com/mudler/LocalAI/pull/9925) (repeated system prompts collapse from minutes to seconds), [keyless cosign signing of backend OCI images](https://github.com/mudler/LocalAI/pull/9823), [per-API-key + per-user usage attribution](https://github.com/mudler/LocalAI/pull/9920), Distributed v3 with [per-request replica routing](https://github.com/mudler/LocalAI/pull/9968). [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.3.0)
 - **May 2026**: **LocalAI 4.2.0** - LocalAI sees and hears: [voice recognition](https://github.com/mudler/LocalAI/pull/9500), [face recognition + antispoofing liveness](https://github.com/mudler/LocalAI/pull/9480), speaker diarization. Plus [drop-in Ollama API](https://github.com/mudler/LocalAI/pull/9284), [video generation](https://github.com/mudler/LocalAI/pull/9420), redesigned UI with i18n + admin-configurable branding, vLLM at feature parity with llama.cpp, and 11 new backends. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.2.0)
 - **April 2026**: **LocalAI 4.1.0** - LocalAI becomes a control tower: distributed cluster mode with VRAM-aware smart routing + autoscaling, multi-user platform with OIDC and API keys, per-user quotas with predictive analytics, in-UI fine-tuning with TRL (auto-export to GGUF), on-the-fly quantization backend, visual pipeline editor. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.1.0)
@@ -221,29 +204,10 @@ For older news and full release notes, see [GitHub Releases](https://github.com/
 ## Supported Backends & Acceleration
-LocalAI supports **60+ backends** including llama.cpp, vLLM, SGLang, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).
+LocalAI supports **36+ backends** including llama.cpp, vLLM, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).
 See the full [Backend & Model Compatibility Table](https://localai.io/model-compatibility/) and [GPU Acceleration guide](https://localai.io/features/gpu-acceleration/).
 ### Backends built by us
 Most backends wrap a best-in-class upstream engine. A handful of them are native C/C++/GGML engines (no Python at inference) developed and maintained by the LocalAI project itself:
 | Backend | What it does |
 |---------|-------------|
 | [parakeet.cpp](https://github.com/mudler/parakeet.cpp) | C++/GGML port of NVIDIA NeMo Parakeet ASR (tdt/ctc/rnnt/hybrid), with cache-aware streaming transcription |
 | [ced.cpp](https://github.com/mudler/ced.cpp) | C++/GGML port of the CED audio-tagging models: sound-event classification (527-class AudioSet) over REST and the realtime API for live recognition |
 | [voxtral.c](https://github.com/mudler/voxtral.c) | Voxtral Realtime 4B speech-to-text in pure C |
 | [vibevoice.cpp](https://github.com/mudler/vibevoice.cpp) | Native port of Microsoft VibeVoice for TTS (voice cloning) and long-form ASR with speaker diarization |
 | [rf-detr.cpp](https://github.com/mudler/rf-detr.cpp) | Native RF-DETR object detection and instance segmentation |
 | [locate-anything.cpp](https://github.com/mudler/locate-anything.cpp) | Open-vocabulary object detection and visual grounding (LocateAnything-3B) |
 | [depth-anything.cpp](https://github.com/mudler/depth-anything.cpp) | Depth Anything 3 monocular metric depth + camera pose estimation |
 | [privacy-filter.cpp](https://github.com/localai-org/privacy-filter.cpp) | Standalone GGML PII/NER token-classification engine powering LocalAI's PII redaction tier |
 | [LocalVQE](https://github.com/localai-org/LocalVQE) | Joint acoustic echo cancellation, noise suppression, and dereverberation |
 | [local-store](https://github.com/mudler/LocalAI) | Local-first vector database for embeddings (shipped in-tree) |
 We also maintain [apex-quant](https://github.com/localai-org/apex-quant), a per-tensor, per-layer quantization recipe for Mixture-of-Experts models that exploits their structural sparsity to produce GGUFs matching or beating Q8_0 quality - and they run out of the box on stock llama.cpp.
 ## Resources
 - [Documentation](https://localai.io/)
@@ -253,7 +217,7 @@ We also maintain [apex-quant](https://github.com/localai-org/apex-quant), a per-
 - [Integrations & community projects](https://localai.io/docs/integrations/)
 - [Installation video walkthrough](https://www.youtube.com/watch?v=cMVNnlqwfw4)
 - [Media & blog posts](https://localai.io/basics/news/#media-blogs-social)
-- [Examples](https://github.com/mudler/LocalAI-examples) — including the [realtime voice assistant demo](https://github.com/localai-org/localai-realtime-demo) (Go client for the Realtime API with tool calling)
+- [Examples](https://github.com/mudler/LocalAI-examples)
 ## Team
--- a/backend/Dockerfile.golang
+++ b/backend/Dockerfile.golang
@@ -65,12 +65,7 @@ RUN <<EOT bash
            libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
            git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
            ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
-            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils && \
+            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
        apt-get install -y mesa-vulkan-drivers libdrm2
        # Mesa Vulkan ICD drivers (ANV/RADV/lavapipe) + their manifests. The
        # LunarG SDK below only provides the loader and shader tooling, not
        # hardware drivers — without Mesa, package-gpu-libs.sh has no ICD to
        # bundle and the packaged backend finds no GPU at runtime.
        if [ "amd64" = "$TARGETARCH" ]; then
            wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
            tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
@@ -137,7 +132,7 @@ RUN <<EOT bash
            libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
        if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "$TARGETARCH" ]; then
            apt-get install -y --no-install-recommends \
-            libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} libcudnn9-dev-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
+            libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
        fi
        apt-get clean && \
        rm -rf /var/lib/apt/lists/*
@@ -211,16 +206,6 @@ RUN if [ "${BACKEND}" = "opus" ]; then \
    apt-get clean && rm -rf /var/lib/apt/lists/*; \
 fi
 # CrispASR's piper TTS backend dlopens libespeak-ng at runtime to phonemize
 # non-English text (the MIT-clean path; English uses a built-in G2P). Install
 # the espeak-ng runtime + its libpcaudio/libsonic deps + voice data so
 # package.sh can bundle them into the FROM scratch image.
 RUN if [ "${BACKEND}" = "crispasr" ]; then \
    apt-get update && apt-get install -y --no-install-recommends \
        espeak-ng-data libespeak-ng1 libpcaudio0 libsonic0 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*; \
 fi
 COPY . /LocalAI
 RUN git config --global --add safe.directory /LocalAI
--- a/backend/Dockerfile.privacy-filter
+++ b/backend/Dockerfile.privacy-filter
@@ -1,109 +0,0 @@
 ARG BASE_IMAGE=ubuntu:24.04
 # BUILDER_BASE_IMAGE defaults to BASE_IMAGE so the Dockerfile parses when no
 # prebuilt base is supplied; the builder-prebuilt stage is only entered when
 # BUILDER_TARGET=builder-prebuilt, so the fallback content is harmless
 # (BuildKit prunes the unreferenced builder).
 ARG BUILDER_BASE_IMAGE=${BASE_IMAGE}
 # BUILDER_TARGET selects which builder stage the scratch image copies from.
 # Declared before any FROM so it is usable in `FROM ${BUILDER_TARGET}`. The
 # backend_build workflow sets it to builder-prebuilt when the matrix entry
 # provides builder-base-image, else builder-fromsource (the local default).
 ARG BUILDER_TARGET=builder-fromsource
 ARG APT_MIRROR=""
 ARG APT_PORTS_MIRROR=""
 # privacy-filter: standalone GGML engine for the openai-privacy-filter PII/NER
 # token classifier, wrapped as a LocalAI gRPC backend.
 #
 # Mirrors backend/Dockerfile.llama-cpp: the build toolchain (gRPC + cmake +
 # protoc + conditional CUDA/Vulkan) comes from the shared
 # .docker/install-base-deps.sh (from-source path) or a prebuilt
 # quay.io/go-skynet/ci-cache:base-grpc-* image (CI path) — nothing GPU-specific
 # is hand-rolled here. BUILD_TYPE selects the engine backend in the Makefile:
 # "" = cpu, "cublas" -> -DPF_CUDA=ON, "vulkan" -> -DPF_VULKAN=ON.
 # ============================================================================
 # Stage: builder-fromsource — self-contained build. Runs the same install
 # script backend/Dockerfile.base-grpc-builder runs, so this path is
 # bit-equivalent to the prebuilt base. Used when BUILDER_TARGET=builder-fromsource
 # (the default; local `make backends/privacy-filter`).
 # ============================================================================
 FROM ${BASE_IMAGE} AS builder-fromsource
 ARG BUILD_TYPE
 ARG CUDA_MAJOR_VERSION
 ARG CUDA_MINOR_VERSION
 ARG CMAKE_FROM_SOURCE=false
 # CUDA Toolkit 13.x needs CMake 3.31.9+ for correct toolchain/arch detection.
 ARG CMAKE_VERSION=3.31.10
 ARG GRPC_VERSION=v1.65.0
 ARG GRPC_MAKEFLAGS="-j4 -Otarget"
 ARG SKIP_DRIVERS=false
 ARG TARGETARCH
 ARG UBUNTU_VERSION=2404
 ARG APT_MIRROR
 ARG APT_PORTS_MIRROR
 ENV BUILD_TYPE=${BUILD_TYPE} \
    CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
    CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
    CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
    CMAKE_VERSION=${CMAKE_VERSION} \
    GRPC_VERSION=${GRPC_VERSION} \
    GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
    SKIP_DRIVERS=${SKIP_DRIVERS} \
    TARGETARCH=${TARGETARCH} \
    UBUNTU_VERSION=${UBUNTU_VERSION} \
    APT_MIRROR=${APT_MIRROR} \
    APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
    DEBIAN_FRONTEND=noninteractive
 # CUDA on PATH (a no-op when CUDA is not installed, e.g. cpu/vulkan builds).
 ENV PATH=/usr/local/cuda/bin:${PATH}
 WORKDIR /build
 # apt deps + cmake + protoc + gRPC + conditional CUDA/Vulkan, all from the
 # shared script (the source of truth that base-grpc-builder also runs).
 RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
    --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
    bash /usr/local/sbin/install-base-deps
 # install-base-deps installs gRPC under /opt/grpc; copy it to /usr/local so the
 # backend's find_package(gRPC CONFIG) resolves it at the canonical prefix.
 RUN cp -a /opt/grpc/. /usr/local/
 COPY . /LocalAI
 RUN --mount=type=cache,target=/root/.ccache,id=privacy-filter-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
    make -C /LocalAI/backend/cpp/privacy-filter BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
 # ============================================================================
 # Stage: builder-prebuilt — FROM a prebuilt
 # quay.io/go-skynet/ci-cache:base-grpc-* image (gRPC at /opt/grpc + apt deps +
 # CUDA/Vulkan already installed). Used in CI when the matrix entry sets
 # builder-base-image.
 # ============================================================================
 FROM ${BUILDER_BASE_IMAGE} AS builder-prebuilt
 ARG BUILD_TYPE
 ARG TARGETARCH
 ENV BUILD_TYPE=${BUILD_TYPE}
 # CUDA on PATH (a no-op for the cpu/vulkan base images).
 ENV PATH=/usr/local/cuda/bin:${PATH}
 # Mirror builder-fromsource: the base-grpc image installs gRPC to /opt/grpc but
 # does not copy it to /usr/local.
 RUN cp -a /opt/grpc/. /usr/local/
 COPY . /LocalAI
 RUN --mount=type=cache,target=/root/.ccache,id=privacy-filter-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
    make -C /LocalAI/backend/cpp/privacy-filter BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
 # ============================================================================
 # Final stage — copy the package output from the selected builder. BuildKit
 # does not expand variables in `COPY --from=`, so alias the chosen builder to a
 # fixed stage name first.
 # ============================================================================
 FROM ${BUILDER_TARGET} AS builder
 FROM scratch
 COPY --from=builder /LocalAI/backend/cpp/privacy-filter/package/. ./
--- a/backend/Dockerfile.python
+++ b/backend/Dockerfile.python
@@ -66,12 +66,7 @@ RUN <<EOT bash
            libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
            git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
            ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
-            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils && \
+            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
        apt-get install -y mesa-vulkan-drivers libdrm2
        # Mesa Vulkan ICD drivers (ANV/RADV/lavapipe) + their manifests. The
        # LunarG SDK below only provides the loader and shader tooling, not
        # hardware drivers — without Mesa, package-gpu-libs.sh has no ICD to
        # bundle and the packaged backend finds no GPU at runtime.
        if [ "amd64" = "$TARGETARCH" ]; then
            wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
            tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
@@ -131,7 +126,6 @@ RUN <<EOT bash
        apt-get update && \
        apt-get install -y --no-install-recommends \
            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            cuda-nvrtc-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
--- a/backend/backend.proto
+++ b/backend/backend.proto
@@ -24,10 +24,6 @@ service Backend {
  rpc TokenizeString(PredictOptions) returns (TokenizationResponse) {}
  rpc Status(HealthMessage) returns (StatusResponse) {}
  rpc Detect(DetectOptions) returns (DetectResponse) {}
  // SoundDetection runs an audio-tagging / sound-event-classification model
  // (e.g. CED over the AudioSet ontology) on a clip and returns scored labels.
  rpc SoundDetection(SoundDetectionRequest) returns (SoundDetectionResponse) {}
  rpc Depth(DepthRequest) returns (DepthResponse) {}
  rpc FaceVerify(FaceVerifyRequest) returns (FaceVerifyResponse) {}
  rpc FaceAnalyze(FaceAnalyzeRequest) returns (FaceAnalyzeResponse) {}
  rpc VoiceVerify(VoiceVerifyRequest) returns (VoiceVerifyResponse) {}
@@ -674,53 +670,6 @@ message DetectResponse {
  repeated Detection Detections = 1;
 }
 // --- Sound-event classification / audio tagging messages (CED) ---
 message SoundDetectionRequest {
  string src = 1;       // audio file path (LocalAI writes the upload to disk)
  int32 top_k = 2;      // number of top tags to return (0 = all classes)
  float threshold = 3;  // optional: drop tags scoring below this
 }
 message SoundClass {
  string label = 1;     // AudioSet class name, e.g. "Baby cry, infant cry"
  float score = 2;      // per-class probability (multi-label, independent)
  int32 index = 3;      // class index in the model ontology
 }
 message SoundDetectionResponse {
  repeated SoundClass detections = 1;  // score-descending
 }
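Review note: a minimal sketch of the `top_k` / `threshold` semantics that the removed `SoundDetectionRequest` comments describe (drop low-scoring tags, return the rest score-descending, `top_k = 0` meaning all classes). `Tag` and `select_tags` are hypothetical stand-ins, not the generated protobuf types or any backend code, and treating a score exactly equal to the threshold as "kept" is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical plain-struct mirror of the SoundClass message fields.
struct Tag {
    std::string label;
    float       score;
    int32_t     index;
};

// Keep tags scoring at or above `threshold`, order them score-descending,
// then truncate to `top_k` entries (top_k <= 0 keeps every remaining tag).
static std::vector<Tag> select_tags(std::vector<Tag> tags, int32_t top_k, float threshold) {
    tags.erase(std::remove_if(tags.begin(), tags.end(),
                              [threshold](const Tag &t) { return t.score < threshold; }),
               tags.end());
    // stable_sort keeps the model's class order among equal scores.
    std::stable_sort(tags.begin(), tags.end(),
                     [](const Tag &a, const Tag &b) { return a.score > b.score; });
    if (top_k > 0 && static_cast<size_t>(top_k) < tags.size()) {
        tags.erase(tags.begin() + top_k, tags.end());
    }
    return tags;
}
```

Filtering before truncating means `top_k` counts only tags that passed the threshold. A backend could equally truncate first, and the proto comments do not say which order is intended.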
 // --- Depth estimation messages (Depth Anything 3) ---
 message DepthRequest {
  string src = 1;                  // input image (filesystem path or base64-encoded payload)
  string dst = 2;                  // optional output directory for exports (glb/colmap)
  bool include_depth = 3;          // return the per-pixel metric depth map
  bool include_confidence = 4;     // return the per-pixel confidence map (DualDPT)
  bool include_pose = 5;           // return camera extrinsics/intrinsics (DualDPT)
  bool include_sky = 6;            // return the per-pixel sky map (mono models)
  bool include_points = 7;         // back-project to a 3D point cloud (DualDPT)
  float points_conf_thresh = 8;    // keep points with confidence >= this threshold
  repeated string exports = 9;     // requested exports: "glb", "colmap"
 }
 message DepthResponse {
  int32 width = 1;                 // processed depth-map width
  int32 height = 2;                // processed depth-map height
  repeated float depth = 3;        // width*height row-major metric depth
  repeated float confidence = 4;   // width*height row-major confidence (DualDPT)
  repeated float sky = 5;          // width*height row-major sky map (mono)
  repeated float extrinsics = 6;   // 12 floats, 3x4 row-major (world-to-camera)
  repeated float intrinsics = 7;   // 9 floats, 3x3 row-major
  int32 num_points = 8;            // number of 3D points
  repeated float points = 9;       // num_points*3 xyz, world space
  bytes point_colors = 10;         // num_points*3 uint8 rgb
  repeated string export_paths = 11; // paths written for the requested exports
  bool is_metric = 12;             // depth is in metric units
 }
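Review note: a small sketch of how a client might read the flattened `DepthResponse` fields documented above. It assumes only what the field comments state: `depth` is `width*height` row-major, and `extrinsics` is a 3x4 row-major world-to-camera matrix. `depth_at` and `world_to_camera` are hypothetical helpers over plain vectors, not part of the generated gRPC API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Depth at pixel (x, y) from a width*height row-major buffer.
static float depth_at(const std::vector<float> &depth, int width, int height, int x, int y) {
    assert(x >= 0 && x < width && y >= 0 && y < height);
    assert(depth.size() == static_cast<size_t>(width) * static_cast<size_t>(height));
    return depth[static_cast<size_t>(y) * static_cast<size_t>(width) + static_cast<size_t>(x)];
}

// Apply a 3x4 row-major world-to-camera matrix [R | t] to a world-space point:
// each output component is one matrix row dotted with (px, py, pz, 1).
static std::array<float, 3> world_to_camera(const std::vector<float> &extrinsics,
                                            const std::array<float, 3> &p) {
    assert(extrinsics.size() == 12);
    std::array<float, 3> out{};
    for (size_t r = 0; r < 3; ++r) {
        const float *row = &extrinsics[r * 4];
        out[r] = row[0] * p[0] + row[1] * p[1] + row[2] * p[2] + row[3];
    }
    return out;
}
```

The same `y * width + x` indexing would apply to the `confidence` and `sky` maps, since their comments give the same `width*height` row-major layout.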
 // --- Face recognition messages ---
 message FacialArea {
--- a/backend/cpp/ds4/CMakeLists.txt
+++ b/backend/cpp/ds4/CMakeLists.txt
@@ -9,22 +9,6 @@ option(DS4_NATIVE "Compile with -march=native / -mcpu=native" ON)
 set(DS4_GPU "cpu" CACHE STRING "GPU backend: cpu, cuda, or metal")
 set(DS4_DIR "${CMAKE_CURRENT_SOURCE_DIR}/ds4" CACHE PATH "Path to cloned ds4 source")
 if(${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
    # Homebrew installs protobuf/grpc under a non-default prefix. The generated
    # backend.pb.cc / backend.grpc.pb.cc pull in google/protobuf and grpcpp
    # headers, but the hw_grpc_proto library links neither target, so on macOS
    # the headers (e.g. google/protobuf/runtime_version.h) are never on the
    # compiler's include path. Add the Homebrew prefix globally, matching the
    # llama-cpp backend which builds on Darwin CI.
    if(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "arm64")
        set(HOMEBREW_DEFAULT_PREFIX "/opt/homebrew")
    else()
        set(HOMEBREW_DEFAULT_PREFIX "/usr/local")
    endif()
    link_directories("${HOMEBREW_DEFAULT_PREFIX}/lib")
    include_directories("${HOMEBREW_DEFAULT_PREFIX}/include")
 endif()
 find_package(Threads REQUIRED)
 find_package(Protobuf CONFIG QUIET)
 if(NOT Protobuf_FOUND)
--- a/backend/cpp/ds4/Makefile
+++ b/backend/cpp/ds4/Makefile
@@ -1,10 +1,10 @@
 # ds4 backend Makefile.
 #
-# Upstream pin lives below as DS4_VERSION?=80ebbc396aee40eedc1d829222f3362d10fa4c6c
+# Upstream pin lives below as DS4_VERSION?=8384adf0f9fa0f3bb342dd925372de778b95b263
 # (.github/bump_deps.sh) can find and update it - matches the
 # llama-cpp / ik-llama-cpp / turboquant convention.
-DS4_VERSION?=80ebbc396aee40eedc1d829222f3362d10fa4c6c
+DS4_VERSION?=8384adf0f9fa0f3bb342dd925372de778b95b263
 DS4_REPO?=https://github.com/antirez/ds4
 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
--- a/backend/cpp/ds4/grpc-server.cpp
+++ b/backend/cpp/ds4/grpc-server.cpp
@@ -25,8 +25,6 @@ extern "C" {
 #include <chrono>
 #include <climits>
 #include <csignal>
 #include <cstddef>
 #include <cstdint>
 #include <cstdlib>
 #include <cstring>
 #include <ctime>
@@ -107,130 +105,6 @@ static bool parse_layers_spec(const std::string &spec, ds4_distributed_layers *o
    return true;
 }
 // Parse a boolean LoadModel option. An empty value (a bare flag-style option
 // like "ssd_streaming" with no colon) means true so model YAMLs can write
 // options: ["ssd_streaming"] to enable a switch.
 static bool parse_bool_option(const std::string &s, bool *out) {
    if (s.empty() || s == "true" || s == "1" || s == "yes" || s == "on") { *out = true; return true; }
    if (s == "false" || s == "0" || s == "no" || s == "off") { *out = false; return true; }
    return false;
 }
 // Table-driven mapping from LoadModel option keys to ds4_engine_options fields.
 // ds4_engine_options is a fixed C struct with no reflection, so the field set
 // is enumerated once here; adding a future engine knob is a one-line table
 // entry rather than a new branch in LoadModel. Two fields need ds4's own typed
 // parsers (Gib, CacheExperts) so a plain string passthrough can't cover them.
 enum class DsOptType { Bool, Int, Uint, Float, Str, Gib, CacheExperts };
 struct DsOptSpec {
    const char *key;
    DsOptType   type;
    size_t      off;      // byte offset into ds4_engine_options
    size_t      off2;     // second offset (CacheExperts writes experts + bytes)
    bool        is_path;  // Str values: resolve a relative value against the model dir
 };
 static const DsOptSpec kEngineOptSpecs[] = {
    {"mtp_path",                      DsOptType::Str,          offsetof(ds4_engine_options, mtp_path),                      0, true},
    {"mtp_draft",                     DsOptType::Int,          offsetof(ds4_engine_options, mtp_draft_tokens),              0},
    {"mtp_margin",                    DsOptType::Float,        offsetof(ds4_engine_options, mtp_margin),                    0},
    {"prefill_chunk",                 DsOptType::Uint,         offsetof(ds4_engine_options, prefill_chunk),                 0},
    {"power_percent",                 DsOptType::Int,          offsetof(ds4_engine_options, power_percent),                 0},
    {"warm_weights",                  DsOptType::Bool,         offsetof(ds4_engine_options, warm_weights),                  0},
    {"quality",                       DsOptType::Bool,         offsetof(ds4_engine_options, quality),                       0},
    {"ssd_streaming",                 DsOptType::Bool,         offsetof(ds4_engine_options, ssd_streaming),                 0},
    {"ssd_streaming_cold",            DsOptType::Bool,         offsetof(ds4_engine_options, ssd_streaming_cold),            0},
    {"ssd_streaming_preload_experts", DsOptType::Uint,         offsetof(ds4_engine_options, ssd_streaming_preload_experts), 0},
    {"ssd_streaming_cache_experts",   DsOptType::CacheExperts, offsetof(ds4_engine_options, ssd_streaming_cache_experts),
                                                               offsetof(ds4_engine_options, ssd_streaming_cache_bytes)},
    {"simulate_used_memory",          DsOptType::Gib,          offsetof(ds4_engine_options, simulate_used_memory_bytes),    0},
    {"expert_profile_path",           DsOptType::Str,          offsetof(ds4_engine_options, expert_profile_path),           0, true},
    {"directional_steering_file",     DsOptType::Str,          offsetof(ds4_engine_options, directional_steering_file),     0, true},
    {"directional_steering_attn",     DsOptType::Float,        offsetof(ds4_engine_options, directional_steering_attn),     0},
    {"directional_steering_ffn",      DsOptType::Float,        offsetof(ds4_engine_options, directional_steering_ffn),      0},
 };
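Review note: for readers unfamiliar with the pattern, here is a self-contained sketch of a table-driven `offsetof` option mapper like the one this hunk removes. `ToyOptions`, `kToySpecs`, and `apply_toy_option` are hypothetical and much smaller than the real table. The sketch writes fields with `std::memcpy` where the original uses `reinterpret_cast`, and it parses integers through `long long`. Both are choices made for this illustration, not something taken from ds4.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <string>

// Toy stand-in for a fixed C options struct (NOT ds4_engine_options).
struct ToyOptions {
    int      draft_tokens;
    float    margin;
    uint32_t prefill_chunk;
};

enum class ToyType { Int, Uint, Float };

struct ToySpec {
    const char *key;
    ToyType     type;
    size_t      off;  // byte offset of the target field inside ToyOptions
};

// One row per option key; a new knob is a new row, not a new branch.
static const ToySpec kToySpecs[] = {
    {"draft",         ToyType::Int,   offsetof(ToyOptions, draft_tokens)},
    {"margin",        ToyType::Float, offsetof(ToyOptions, margin)},
    {"prefill_chunk", ToyType::Uint,  offsetof(ToyOptions, prefill_chunk)},
};

// Returns true when the key is unknown (ignored) or applied; false on a bad value.
static bool apply_toy_option(ToyOptions *opt, const std::string &key, const std::string &val) {
    const ToySpec *spec = nullptr;
    for (const ToySpec &s : kToySpecs) {
        if (key == s.key) { spec = &s; break; }
    }
    if (!spec) return true;         // unknown key: ignored
    if (val.empty()) return false;  // every toy type needs a value

    char *field = reinterpret_cast<char *>(opt) + spec->off;
    char *end = nullptr;
    if (spec->type == ToyType::Float) {
        const float f = std::strtof(val.c_str(), &end);
        if (*end != '\0') return false;  // trailing garbage
        std::memcpy(field, &f, sizeof f);
        return true;
    }
    errno = 0;
    const long long v = std::strtoll(val.c_str(), &end, 10);
    if (errno == ERANGE || *end != '\0') return false;
    if (spec->type == ToyType::Int) {
        if (v < INT_MIN || v > INT_MAX) return false;
        const int i = static_cast<int>(v);
        std::memcpy(field, &i, sizeof i);
    } else {
        if (v < 0 || v > static_cast<long long>(UINT32_MAX)) return false;
        const uint32_t u = static_cast<uint32_t>(v);
        std::memcpy(field, &u, sizeof u);
    }
    return true;
}
```

Parsing through `long long` keeps the `UINT32_MAX` bound representable on platforms where `long` is 32 bits wide (for example 64-bit Windows), which may matter if a mapper like this is ever built outside Linux and macOS.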
 // Apply a single key:value LoadModel option to the engine options struct.
 // Unknown keys are ignored (back-compat: callers pass mixed option sets).
 // String values are copied into `storage`, whose elements the engine reads by
 // pointer during ds4_engine_open; `storage` MUST have reserved capacity so
 // push_back never reallocates and dangles an earlier c_str(). Returns false
 // with `err` set when a recognized key has an invalid value.
 static bool apply_engine_option(ds4_engine_options *opt, const std::string &key,
                                const std::string &val, const std::string &model_dir,
                                std::vector<std::string> &storage, std::string &err) {
    const DsOptSpec *spec = nullptr;
    for (const auto &s : kEngineOptSpecs) {
        if (key == s.key) { spec = &s; break; }
    }
    if (!spec) return true; // unknown key: ignore
    char *base = reinterpret_cast<char *>(opt);
    switch (spec->type) {
    case DsOptType::Bool: {
        bool b = false;
        if (!parse_bool_option(val, &b)) { err = key + " must be true/false"; return false; }
        *reinterpret_cast<bool *>(base + spec->off) = b;
        return true;
    }
    case DsOptType::Int: {
        char *end = nullptr;
        long v = std::strtol(val.c_str(), &end, 10);
        if (val.empty() || !end || *end != '\0') { err = key + " must be an integer"; return false; }
        *reinterpret_cast<int *>(base + spec->off) = static_cast<int>(v);
        return true;
    }
    case DsOptType::Uint: {
        char *end = nullptr;
        long v = std::strtol(val.c_str(), &end, 10);
        if (val.empty() || !end || *end != '\0' || v < 0 || v > static_cast<long>(UINT32_MAX)) {
            err = key + " must be a non-negative integer"; return false;
        }
        *reinterpret_cast<uint32_t *>(base + spec->off) = static_cast<uint32_t>(v);
        return true;
    }
    case DsOptType::Float: {
        char *end = nullptr;
        float f = std::strtof(val.c_str(), &end);
        if (val.empty() || !end || *end != '\0') { err = key + " must be a number"; return false; }
        *reinterpret_cast<float *>(base + spec->off) = f;
        return true;
    }
    case DsOptType::Str: {
        // Resolve a relative path option (e.g. mtp_path: a sibling GGUF the
        // gallery downloaded next to the model) against the model directory, so
        // YAMLs reference companion files by name. Absolute values pass through.
        if (spec->is_path && !model_dir.empty() && !val.empty() && val.front() != '/') {
            storage.push_back(model_dir + "/" + val);
        } else {
            storage.push_back(val);
        }
        *reinterpret_cast<const char **>(base + spec->off) = storage.back().c_str();
        return true;
    }
    case DsOptType::Gib: {
        uint64_t bytes = 0;
        if (!ds4_parse_gib_arg(val.c_str(), &bytes)) {
            err = key + " must be a GiB value, e.g. 64GB"; return false;
        }
        *reinterpret_cast<uint64_t *>(base + spec->off) = bytes;
        return true;
    }
    case DsOptType::CacheExperts: {
        uint32_t experts = 0;
        uint64_t bytes = 0;
        if (!ds4_parse_streaming_cache_experts_arg(val.c_str(), &experts, &bytes)) {
            err = key + " must be a positive expert count or a <number>GB budget"; return false;
        }
        *reinterpret_cast<uint32_t *>(base + spec->off)  = experts;
        *reinterpret_cast<uint64_t *>(base + spec->off2) = bytes;
        return true;
    }
    }
    return true;
 }
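Review note: the comment on `apply_engine_option` requires `storage` to have reserved capacity so that `push_back` never reallocates. A sketch of one way to enforce that invariant instead of relying on callers follows. `StringArena` is a hypothetical helper, not part of this backend.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical holder: owns option strings and hands out C pointers that stay
// valid for the arena's lifetime.
class StringArena {
public:
    explicit StringArena(size_t max_strings) { storage_.reserve(max_strings); }

    // Refuses to grow past the reserved capacity: a reallocation would move
    // the strings and may invalidate pointers returned earlier (short strings
    // usually keep their bytes inside the std::string object itself).
    const char *intern(const std::string &s) {
        if (storage_.size() == storage_.capacity()) return nullptr;
        storage_.push_back(s);
        return storage_.back().c_str();
    }

private:
    std::vector<std::string> storage_;
};
```

With a wrapper like this, a caller that adds more string options than the table has rows gets a `nullptr` it can report as an error rather than a dangling pointer. `std::deque<std::string>` is another option, since `push_back` on a deque does not invalidate references to existing elements.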
 // When acting as a distributed coordinator, block until the worker route
 // covers all layers (ds4_session_distributed_route_ready == 1) or the timeout
 // elapses. Returns an empty string on success, or an error message to return
@@ -602,10 +476,39 @@ public:
            return GStatus::OK;
        }
        std::string mtp_path;
        int mtp_draft = 0;
        float mtp_margin = 3.0f;
        std::string ds4_role, ds4_layers, ds4_listen;
        for (const auto &opt : request->options()) {
            auto [k, v] = split_option(opt);
            if (k == "mtp_path") mtp_path = v;
            else if (k == "mtp_draft") mtp_draft = std::stoi(v);
            else if (k == "mtp_margin") mtp_margin = std::stof(v);
            else if (k == "kv_cache_dir") g_kv_cache_dir = v;
            else if (k == "ds4_role") ds4_role = v;
            else if (k == "ds4_layers") ds4_layers = v;
            else if (k == "ds4_listen") ds4_listen = v;
            else if (k == "ds4_route_timeout") {
                if (!parse_positive_int(v, &g_route_timeout_sec)) {
                    result->set_success(false);
                    result->set_message("ds4: ds4_route_timeout must be a positive integer");
                    return GStatus::OK;
                }
            }
        }
        g_kv_cache.SetDir(g_kv_cache_dir);
        ds4_engine_options opt = {};
        opt.model_path = model_path.c_str();
        opt.mtp_path = mtp_path.empty() ? nullptr : mtp_path.c_str();
        opt.n_threads = request->threads() > 0 ? request->threads() : 0;
-        opt.mtp_margin = 3.0f; // ds4 default; overridable via the mtp_margin option
+        opt.mtp_draft_tokens = mtp_draft;
        opt.mtp_margin = mtp_margin;
        opt.directional_steering_file = nullptr;
        opt.warm_weights = false;
        opt.quality = false;
 #if defined(DS4_NO_GPU)
        opt.backend = DS4_BACKEND_CPU;
@@ -615,46 +518,6 @@ public:
        opt.backend = DS4_BACKEND_CUDA;
 #endif
        // Stable storage for string-valued engine options. The engine reads
        // these by pointer during ds4_engine_open, so the std::string backing
        // store must outlive the call and not reallocate; reserve up front so
        // push_back keeps every prior c_str() valid. Static + clear() reuses
        // the buffer across LoadModel calls (the old engine is closed above).
        static std::vector<std::string> s_opt_strings;
        s_opt_strings.clear();
        s_opt_strings.reserve(sizeof(kEngineOptSpecs) / sizeof(kEngineOptSpecs[0]));
        // Directory of the main model, used to resolve relative path options.
        std::string model_dir;
        if (auto slash = model_path.find_last_of('/'); slash != std::string::npos) {
            model_dir = model_path.substr(0, slash);
        }
        std::string ds4_role, ds4_layers, ds4_listen;
        for (const auto &o : request->options()) {
            auto [k, v] = split_option(o);
            if (k == "kv_cache_dir") { g_kv_cache_dir = v; continue; }
            else if (k == "ds4_role") { ds4_role = v; continue; }
            else if (k == "ds4_layers") { ds4_layers = v; continue; }
            else if (k == "ds4_listen") { ds4_listen = v; continue; }
            else if (k == "ds4_route_timeout") {
                if (!parse_positive_int(v, &g_route_timeout_sec)) {
                    result->set_success(false);
                    result->set_message("ds4: ds4_route_timeout must be a positive integer");
                    return GStatus::OK;
                }
                continue;
            }
            std::string err;
            if (!apply_engine_option(&opt, k, v, model_dir, s_opt_strings, err)) {
                result->set_success(false);
                result->set_message("ds4: " + err);
                return GStatus::OK;
            }
        }
        g_kv_cache.SetDir(g_kv_cache_dir);
        // Coordinator wiring. 'ds4_role:coordinator' enables layer-split
        // distributed inference: this process listens on ds4_listen and owns
        // the ds4_layers slice; workers dial in (see `local-ai worker
--- a/backend/cpp/ik-llama-cpp/CMakeLists.txt
+++ b/backend/cpp/ik-llama-cpp/CMakeLists.txt
@@ -1,6 +1,15 @@
-## Multimodal support is provided by the in-tree `mtmd` library target
+## Clip/LLaVA library for multimodal support — built locally from copied sources
-## (examples/mtmd/), which the grpc-server links and includes below. clip/llava
+set(TARGET myclip)
-## were pruned upstream; the high-level mtmd_* / mtmd_helper_* API is used instead.
+add_library(${TARGET} clip.cpp clip.h llava.cpp llava.h)
 install(TARGETS ${TARGET} LIBRARY)
 target_include_directories(myclip PUBLIC .)
 target_include_directories(myclip PUBLIC ../..)
 target_include_directories(myclip PUBLIC ../../common)
 target_link_libraries(${TARGET} PRIVATE common ggml llama ${CMAKE_THREAD_LIBS_INIT})
 target_compile_features(${TARGET} PRIVATE cxx_std_11)
 if (NOT MSVC)
    target_compile_options(${TARGET} PRIVATE -Wno-cast-qual)
 endif()
 set(TARGET grpc-server)
 set(CMAKE_CXX_STANDARD 17)
@@ -58,16 +67,12 @@ add_library(hw_grpc_proto
  ${hw_proto_hdrs} )
 add_executable(${TARGET} grpc-server.cpp json.hpp)
-# mtmd public headers (mtmd.h / mtmd-helper.h) live in examples/mtmd/.
+target_link_libraries(${TARGET} PRIVATE common llama myclip ${CMAKE_THREAD_LIBS_INIT} absl::flags hw_grpc_proto
 # Linking the mtmd target also propagates this include dir, but we add it
 # explicitly for clarity.
 target_include_directories(${TARGET} PRIVATE ../mtmd)
 target_link_libraries(${TARGET} PRIVATE common llama mtmd ${CMAKE_THREAD_LIBS_INIT} absl::flags hw_grpc_proto
  absl::flags_parse
  gRPC::${_REFLECTION}
  gRPC::${_GRPC_GRPCPP}
  protobuf::${_PROTOBUF_LIBPROTOBUF})
-target_compile_features(${TARGET} PRIVATE cxx_std_17)
+target_compile_features(${TARGET} PRIVATE cxx_std_11)
 if(TARGET BUILD_INFO)
  add_dependencies(${TARGET} BUILD_INFO)
 endif()
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@
-IK_LLAMA_VERSION?=f96eaddba8bed6a9a5e628bbf6a566775c70b49c
+IK_LLAMA_VERSION?=e6f8112f3ba126eed3ff5b30cdd08085414a7516
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
 CMAKE_ARGS?=
--- a/backend/cpp/ik-llama-cpp/grpc-server.cpp
+++ b/backend/cpp/ik-llama-cpp/grpc-server.cpp
@@ -11,8 +11,8 @@
 #include <memory>
 #include <string>
 #include <getopt.h>
-#include "mtmd.h"
+#include "clip.h"
-#include "mtmd-helper.h"
+#include "llava.h"
 #include "log.h"
 #include "common.h"
 #include "json.hpp"
@@ -45,9 +45,7 @@ using backend::HealthMessage;
 ///// LLAMA.CPP server code below
-// Match mtmd.h and ik_llama's server/common headers, which all use
+using json = nlohmann::json;
 // nlohmann::ordered_json; a plain nlohmann::json alias collides at global scope.
 using json = nlohmann::ordered_json;
 struct server_params
 {
@@ -221,11 +219,6 @@ struct llama_client_slot
    // multimodal
    std::vector<slot_image> images;
    // Full prompt with mtmd media markers (mtmd_default_marker()) substituted in
    // place of the legacy [img-N] tags, covering the text up to and including the
    // last image. The text after the last image is kept in params.input_suffix and
    // decoded through the normal token path so the sampling loop is unchanged.
    std::string mtmd_prompt;
    // stats
    size_t sent_count = 0;
@@ -259,14 +252,14 @@ struct llama_client_slot
        for (slot_image & img : images)
        {
-            if (img.bitmap) {
+            free(img.image_embedding);
-                mtmd_bitmap_free(img.bitmap);
+            if (img.img_data) {
-                img.bitmap = nullptr;
+                clip_image_u8_free(img.img_data);
            }
            img.prefix_prompt = "";
        }
        images.clear();
        mtmd_prompt = "";
    }
    bool has_budget(gpt_params &global_params) {
@@ -403,13 +396,46 @@ struct llama_metrics {
    }
 };
 struct llava_embd_batch {
    std::vector<llama_pos>      pos;
    std::vector<int32_t>        n_seq_id;
    std::vector<llama_seq_id>   seq_id_0;
    std::vector<llama_seq_id *> seq_ids;
    std::vector<int8_t>         logits;
    llama_batch batch;
    llava_embd_batch(float * embd, int32_t n_tokens, llama_pos pos_0, llama_seq_id seq_id) {
        pos     .resize(n_tokens);
        n_seq_id.resize(n_tokens);
        seq_ids .resize(n_tokens + 1);
        logits  .resize(n_tokens);
        seq_id_0.resize(1);
        seq_id_0[0] = seq_id;
        seq_ids [n_tokens] = nullptr;
        batch = {
            /*n_tokens       =*/ n_tokens,
            /*tokens         =*/ nullptr,
            /*embd           =*/ embd,
            /*pos            =*/ pos.data(),
            /*n_seq_id       =*/ n_seq_id.data(),
            /*seq_id         =*/ seq_ids.data(),
            /*logits         =*/ logits.data(),
        };
        for (int i = 0; i < n_tokens; i++) {
            batch.pos     [i] = pos_0 + i;
            batch.n_seq_id[i] = 1;
            batch.seq_id  [i] = seq_id_0.data();
            batch.logits  [i] = false;
        }
    }
 };
 struct llama_server_context
 {
    llama_model *model = nullptr;
    llama_context *ctx = nullptr;
    const llama_vocab * vocab = nullptr;
-    mtmd_context *mctx = nullptr;
+    clip_ctx *clp_ctx = nullptr;
    gpt_params params;
@@ -465,6 +491,11 @@ struct llama_server_context
        if (!params.mmproj.path.empty()) {
            multimodal = true;
            LOG_INFO("Multi Modal Mode Enabled", {});
            clp_ctx = clip_model_load(params.mmproj.path.c_str(), /*verbosity=*/ 1);
            if(clp_ctx == nullptr) {
                LOG_ERR("unable to load clip model: %s", params.mmproj.path.c_str());
                return false;
            }
            if (params.n_ctx < 2048) { // request larger context for the image embedding
                params.n_ctx = 2048;
@@ -481,24 +512,10 @@ struct llama_server_context
        }
        if (multimodal) {
-            // mtmd_init_from_file requires the already-loaded text model, so it must
+            const int n_embd_clip = clip_n_mmproj_embd(clp_ctx);
-            // run AFTER llama_init_from_gpt_params. It validates the projector
+            const int n_embd_llm  = llama_model_n_embd(model);
-            // against the model internally and returns nullptr on dim mismatch, so
+            if (n_embd_clip != n_embd_llm) {
-            // the explicit clip_n_mmproj_embd check is no longer needed.
+                LOG("%s: embedding dim of the multimodal projector (%d) is not equal to that of LLaMA (%d). Make sure that you use the correct mmproj file.\n", __func__, n_embd_clip, n_embd_llm);
            mtmd_context_params mparams = mtmd_context_params_default();
            mparams.use_gpu         = params.mmproj_use_gpu;
            mparams.print_timings   = false;
            mparams.n_threads       = params.n_threads_mtmd != -1 ? params.n_threads_mtmd
                                      : params.n_threads_batch != -1 ? params.n_threads_batch
                                                                     : params.n_threads;
            mparams.verbosity       = GGML_LOG_LEVEL_INFO;
            mparams.flash_attn_type = params.flash_attn ? LLAMA_FLASH_ATTN_TYPE_ENABLED
                                                        : LLAMA_FLASH_ATTN_TYPE_DISABLED;
            mparams.image_min_tokens = params.image_min_tokens;
            mparams.image_max_tokens = params.image_max_tokens;
            mctx = mtmd_init_from_file(params.mmproj.path.c_str(), model, mparams);
            if (mctx == nullptr) {
                LOG_ERR("unable to load multimodal projector: %s", params.mmproj.path.c_str());
                llama_free(ctx);
                llama_free_model(model);
                return false;
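The thread-count selection for `mparams.n_threads` above is a nested ternary that falls back from the multimodal setting to the batch setting to the general one. A minimal sketch of that fallback chain as a standalone function; `resolve_mtmd_threads` is a hypothetical name for illustration and is not part of the backend:

```cpp
#include <cassert>

// Hypothetical helper (not in the source): mirrors the fallback order used for
// mparams.n_threads, where -1 means "not set".
int resolve_mtmd_threads(int n_threads_mtmd, int n_threads_batch, int n_threads) {
    if (n_threads_mtmd != -1) {
        return n_threads_mtmd;   // explicit multimodal thread count wins
    }
    if (n_threads_batch != -1) {
        return n_threads_batch;  // otherwise reuse the batch thread count
    }
    return n_threads;            // finally fall back to the general thread count
}
```

Written this way, the precedence is explicit and each branch can be exercised in isolation.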
@@ -848,8 +865,8 @@ struct llama_server_context
                    slot_image img_sl;
                    img_sl.id = img.count("id") != 0 ? img["id"].get<int>() : slot->images.size();
-                    img_sl.bitmap = mtmd_helper_bitmap_init_from_buf(mctx, image_buffer.data(), image_buffer.size());
+                    img_sl.img_data = clip_image_u8_init();
-                    if (img_sl.bitmap == nullptr)
+                    if (!clip_image_load_from_bytes(image_buffer.data(), image_buffer.size(), img_sl.img_data))
                    {
                        LOG_ERR("%s: failed to load image, slot_id: %d, img_sl_id: %d",
                             __func__,
@@ -862,74 +879,50 @@ struct llama_server_context
                        {"slot_id",   slot->id},
                        {"img_sl_id", img_sl.id}
                    });
                    img_sl.request_encode_image = true;
                    slot->images.push_back(img_sl);
                }
-                // Translate the legacy [img-N] tags into mtmd media markers, in
+                // process prompt
-                // order, and collect the matching bitmaps in marker order so they
+                // example: system prompt [img-102] user [img-103] describe [img-134] -> [{id: 102, prefix: 'system prompt '}, {id: 103, prefix: ' user '}, {id: 134, prefix: ' describe '}]}
                // line up with the markers passed to mtmd_tokenize(). The text after
                // the last image stays in input_suffix and is decoded through the
                // normal token path, so the sampling loop is unchanged.
                // example: system prompt [img-102] user [img-103] describe [img-134]
                if (slot->images.size() > 0 && !slot->prompt.is_array())
                {
                    const std::string marker = mtmd_default_marker();
                    std::string prompt = slot->prompt.get<std::string>();
-                    std::string built_prompt;
+                    size_t pos = 0, begin_prefix = 0;
                    std::vector<slot_image> ordered;
                    size_t pos = 0, copy_from = 0;
                    std::string pattern = "[img-";
                    auto free_images = [&]() {
                        for (slot_image &img : slot->images) {
                            if (img.bitmap) {
                                mtmd_bitmap_free(img.bitmap);
                                img.bitmap = nullptr;
                            }
                        }
                        slot->images.clear();
                    };
                    while ((pos = prompt.find(pattern, pos)) != std::string::npos) {
-                        size_t tag_begin = pos;
+                        size_t end_prefix = pos;
                        pos += pattern.length();
                        size_t end_pos = prompt.find(']', pos);
-                        if (end_pos == std::string::npos) {
+                        if (end_pos != std::string::npos)
                            break;
                        }
                        std::string image_id = prompt.substr(pos, end_pos - pos);
                        try
                        {
-                            int img_id = std::stoi(image_id);
+                            std::string image_id = prompt.substr(pos, end_pos - pos);
-                            bool found = false;
+                            try
                            for (slot_image &img : slot->images)
                            {
-                                if (img.id == img_id) {
+                                int img_id = std::stoi(image_id);
-                                    found = true;
+                                bool found = false;
-                                    // text before this tag, then the media marker
+                                for (slot_image &img : slot->images)
-                                    built_prompt += prompt.substr(copy_from, tag_begin - copy_from);
+                                {
-                                    built_prompt += marker;
+                                    if (img.id == img_id) {
-                                    copy_from = end_pos + 1;
+                                        found = true;
-                                    ordered.push_back(img);
+                                        img.prefix_prompt = prompt.substr(begin_prefix, end_prefix - begin_prefix);
-                                    break;
+                                        begin_prefix = end_pos + 1;
                                        break;
                                    }
                                }
-                            }
+                                if (!found) {
-                            if (!found) {
+                                    LOG("ERROR: Image with id: %i, not found.\n", img_id);
-                                LOG("ERROR: Image with id: %i, not found.\n", img_id);
+                                    slot->images.clear();
-                                free_images();
+                                    return false;
                                }
                            } catch (const std::invalid_argument& e) {
                                LOG("Invalid image number id in prompt\n");
                                slot->images.clear();
                                return false;
                            }
                        } catch (const std::invalid_argument& e) {
                            LOG("Invalid image number id in prompt\n");
                            free_images();
                            return false;
                        }
                        pos = end_pos + 1;
                    }
                    // bitmaps are consumed in marker order by mtmd_tokenize()
                    slot->images = ordered;
                    slot->mtmd_prompt = built_prompt;
                    slot->prompt = "";
-                    slot->params.input_suffix = prompt.substr(copy_from);
+                    slot->params.input_suffix = prompt.substr(begin_prefix);
                    slot->params.cache_prompt = false; // multimodal doesn't support cache prompt
                }
            }
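The `[img-N]` handling above splits a prompt into the text that precedes each image tag plus a trailing suffix. A minimal, self-contained sketch of that splitting, assuming hypothetical `ImagePrefix` / `SplitPrompt` types and a `split_image_prompt` function that do not exist in the backend; unlike the server code, it does not look the id up in `slot->images`:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
struct ImagePrefix {
    int id;              // N from the [img-N] tag
    std::string prefix;  // text between the previous tag (or prompt start) and this tag
};

struct SplitPrompt {
    std::vector<ImagePrefix> images;  // one entry per tag, in prompt order
    std::string suffix;               // text after the last complete tag
};

// std::stoi throws std::invalid_argument when the id does not start with a digit.
SplitPrompt split_image_prompt(const std::string & prompt) {
    const std::string pattern = "[img-";
    SplitPrompt out;
    std::size_t pos = 0, begin_prefix = 0;
    while ((pos = prompt.find(pattern, pos)) != std::string::npos) {
        const std::size_t end_prefix = pos;
        pos += pattern.length();
        const std::size_t end_pos = prompt.find(']', pos);
        if (end_pos == std::string::npos) {
            break;  // unterminated tag: the remainder stays in the suffix
        }
        const int id = std::stoi(prompt.substr(pos, end_pos - pos));
        out.images.push_back({id, prompt.substr(begin_prefix, end_prefix - begin_prefix)});
        begin_prefix = end_pos + 1;
        pos = end_pos + 1;
    }
    out.suffix = prompt.substr(begin_prefix);
    return out;
}
```

For the example prompt in the comment, `system prompt [img-102] user [img-103] describe [img-134]`, this yields three entries with prefixes `"system prompt "`, `" user "` and `" describe "`, and an empty suffix.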
@@ -1183,10 +1176,21 @@ struct llama_server_context
    bool process_images(llama_client_slot &slot) const
    {
-        // With the mtmd pipeline, image encoding is no longer eager: the bitmaps
+        for (slot_image &img : slot.images)
-        // are tokenized and encoded together with the surrounding text inside
+        {
-        // ingest_images() via mtmd_tokenize() + mtmd_helper_eval_chunks(). This
+            if (!img.request_encode_image)
-        // just reports whether the slot carries any images to process.
+            {
                continue;
            }
            if (!llava_image_embed_make_with_clip_img(clp_ctx, params.n_threads, img.img_data, &img.image_embedding, &img.image_tokens)) {
                LOG("Error processing the given image");
                return false;
            }
            img.request_encode_image = false;
        }
        return slot.images.size() > 0;
    }
@@ -1431,70 +1435,69 @@ struct llama_server_context
        }
    }
-    // Tokenize the multimodal prompt (text interleaved with media markers) together
+    // process multiple images
    // with the slot's bitmaps, then decode the resulting chunks into the llama
    // context via the high-level mtmd helper. The helper runs llama_decode() on the
    // text chunks and mtmd_encode() + llama_decode() on the image chunks, handling
    // batching and any pre/post decode setup (e.g. non-causal attention for gemma3).
    // Advances slot.n_past by the number of positions consumed, then leaves the
    // post-image suffix tokens in `batch` so the normal decode + sampling loop
    // produces the first generated token.
    bool ingest_images(llama_client_slot &slot, int n_batch)
    {
-        if (mctx == nullptr)
+        int image_idx = 0;
        {
            LOG("%s : multimodal context is not initialized\n", __func__);
            return false;
        }
-        // bitmaps stay owned by slot.images (freed on reset()); pass non-owning ptrs
+        while (image_idx < (int) slot.images.size())
        std::vector<const mtmd_bitmap *> bitmaps;
        bitmaps.reserve(slot.images.size());
        for (const slot_image &img : slot.images)
        {
-            bitmaps.push_back(img.bitmap);
+            slot_image &img = slot.images[image_idx];
        }
-        mtmd_input_text inp_txt;
+            // process prefix prompt
-        inp_txt.text          = slot.mtmd_prompt.c_str();
+            for (int32_t i = 0; i < (int32_t) batch.n_tokens; i += n_batch)
-        inp_txt.add_special   = add_bos_token;
+            {
-        inp_txt.parse_special = true;
+                const int32_t n_tokens = std::min(n_batch, (int32_t) (batch.n_tokens - i));
                llama_batch batch_view = {
                    n_tokens,
                    batch.token    + i,
                    nullptr,
                    batch.pos      + i,
                    batch.n_seq_id + i,
                    batch.seq_id   + i,
                    batch.logits   + i,
                };
                if (llama_decode(ctx, batch_view))
                {
                    LOG("%s : failed to eval\n", __func__);
                    return false;
                }
            }
-        mtmd::input_chunks chunks(mtmd_input_chunks_init());
+            // process image with llm
-        int32_t res = mtmd_tokenize(mctx,
+            for (int i = 0; i < img.image_tokens; i += n_batch)
-                                    chunks.ptr.get(),
+            {
-                                    &inp_txt,
+                int n_eval = img.image_tokens - i;
-                                    bitmaps.data(),
+                if (n_eval > n_batch)
-                                    bitmaps.size());
+                {
-        if (res != 0)
+                    n_eval = n_batch;
-        {
+                }
            LOG("%s : failed to tokenize multimodal prompt, res = %d\n", __func__, res);
            return false;
        }
-        const llama_pos start_pos = (llama_pos) system_tokens.size() + slot.n_past;
+                const int n_embd = llama_model_n_embd(model);
-        llama_pos new_n_past = start_pos;
+                float * embd = img.image_embedding + i * n_embd;
-        if (mtmd_helper_eval_chunks(mctx,
+                llava_embd_batch llava_batch = llava_embd_batch(embd, n_eval, slot.n_past, 0);
-                                    ctx,
+                if (llama_decode(ctx, llava_batch.batch))
-                                    chunks.ptr.get(),
+                {
-                                    start_pos,
+                    LOG("%s : failed to eval image\n", __func__);
-                                    slot.id,
+                    return false;
-                                    n_batch,
+                }
-                                    /*logits_last=*/ false,
+                slot.n_past += n_eval;
-                                    &new_n_past) != 0)
+            }
-        {
+            image_idx++;
            LOG("%s : failed to eval multimodal chunks\n", __func__);
            return false;
        }
        slot.n_past += (int32_t) (new_n_past - start_pos);
-        // queue the post-image suffix text for the normal decode + sampling path
+            common_batch_clear(batch);
-        common_batch_clear(batch);
+
-        std::vector<llama_token> suffix_tokens = tokenize(slot.params.input_suffix, false);
+            // append prefix of next image
-        for (llama_token tok : suffix_tokens)
+            const auto json_prompt = (image_idx >= (int) slot.images.size()) ?
-        {
+                slot.params.input_suffix : // no more images, then process suffix prompt
-            common_batch_add(batch, tok, system_tokens.size() + slot.n_past, { slot.id }, false);
+                (json)(slot.images[image_idx].prefix_prompt);
-            slot.n_past += 1;
+
            std::vector<llama_token> append_tokens = tokenize(json_prompt, false); // has next image
            for (int i = 0; i < (int) append_tokens.size(); ++i)
            {
                common_batch_add(batch, append_tokens[i], system_tokens.size() + slot.n_past, { slot.id }, true);
                slot.n_past += 1;
            }
        }
        return true;
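Both the prefix-token loop and the image-embedding loop above walk their input in steps of `n_batch` and clamp the final step to whatever remains. A minimal sketch of that chunking arithmetic, using a hypothetical `batch_chunks` helper that is not part of the backend and does not call `llama_decode`:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: returns (offset, count) pairs covering [0, n_total)
// in steps of n_batch, with the last chunk shortened to fit.
std::vector<std::pair<int, int>> batch_chunks(int n_total, int n_batch) {
    std::vector<std::pair<int, int>> chunks;
    if (n_batch <= 0) {
        return chunks;  // guard against a non-terminating loop
    }
    for (int i = 0; i < n_total; i += n_batch) {
        chunks.emplace_back(i, std::min(n_batch, n_total - i));
    }
    return chunks;
}
```

In the image loop, each `offset` would correspond to `i` in `img.image_embedding + i * n_embd`, and each `count` to `n_eval`.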
@@ -1881,11 +1884,8 @@ struct llama_server_context
                    const bool has_images = process_images(slot);
-                    // For the multimodal path the whole pre-image / inter-image text is
+                    // process the prefix of first image
-                    // tokenized and decoded inside ingest_images() via mtmd, so no prefix
+                    std::vector<llama_token> prefix_tokens = has_images ? tokenize(slot.images[0].prefix_prompt, add_bos_token) : prompt_tokens;
                    // tokens are queued here; the post-image suffix is appended by
                    // ingest_images() for the normal decode + sampling loop.
                    std::vector<llama_token> prefix_tokens = has_images ? std::vector<llama_token>() : prompt_tokens;
                    int32_t slot_npast = slot.n_past_se > 0 ? slot.n_past_se : slot.n_past;
--- a/backend/cpp/ik-llama-cpp/patches/0002-clip-ggml-quantize-chunk-user-data.patch
+++ b/backend/cpp/ik-llama-cpp/patches/0002-clip-ggml-quantize-chunk-user-data.patch
@@ -0,0 +1,11 @@
 --- a/examples/llava/clip.cpp
 +++ b/examples/llava/clip.cpp
@@ -2494,7 +2494,7 @@
             }
             new_data = work.data();
 -            new_size = ggml_quantize_chunk(new_type, f32_data, new_data, 0, n_elms/cur->ne[0], cur->ne[0], nullptr);
 +            new_size = ggml_quantize_chunk(new_type, f32_data, new_data, 0, n_elms/cur->ne[0], cur->ne[0], nullptr, nullptr);
         } else {
             new_type = cur->type;
             new_data = cur->data;
--- a/backend/cpp/ik-llama-cpp/prepare.sh
+++ b/backend/cpp/ik-llama-cpp/prepare.sh
@@ -17,9 +17,28 @@ cp -r grpc-server.cpp llama.cpp/examples/grpc-server/
 cp -r utils.hpp llama.cpp/examples/grpc-server/
 cp -rfv llama.cpp/vendor/nlohmann/json.hpp llama.cpp/examples/grpc-server/
-## Multimodal support is provided by the `mtmd` library target (examples/mtmd/),
+## Copy clip/llava files for multimodal support (built as myclip library)
-## which the grpc-server links and includes directly. No source copy is needed:
+cp -rfv llama.cpp/examples/llava/clip.h llama.cpp/examples/grpc-server/clip.h
-## clip/llava were pruned upstream and the high-level mtmd_* API is used instead.
+cp -rfv llama.cpp/examples/llava/clip.cpp llama.cpp/examples/grpc-server/clip.cpp
 cp -rfv llama.cpp/examples/llava/llava.cpp llama.cpp/examples/grpc-server/llava.cpp
 # Prepend llama.h include to llava.h
 echo '#include "llama.h"' > llama.cpp/examples/grpc-server/llava.h
 cat llama.cpp/examples/llava/llava.h >> llama.cpp/examples/grpc-server/llava.h
 # Copy clip-impl.h if it exists
 if [ -f llama.cpp/examples/llava/clip-impl.h ]; then
    cp -rfv llama.cpp/examples/llava/clip-impl.h llama.cpp/examples/grpc-server/clip-impl.h
 fi
 # Copy stb_image.h
 if [ -f llama.cpp/vendor/stb/stb_image.h ]; then
    cp -rfv llama.cpp/vendor/stb/stb_image.h llama.cpp/examples/grpc-server/stb_image.h
 elif [ -f llama.cpp/common/stb_image.h ]; then
    cp -rfv llama.cpp/common/stb_image.h llama.cpp/examples/grpc-server/stb_image.h
 fi
 ## Fix API compatibility in llava.cpp (llama_n_embd -> llama_model_n_embd)
 if [ -f llama.cpp/examples/grpc-server/llava.cpp ]; then
    sed -i 's/llama_n_embd(/llama_model_n_embd(/g' llama.cpp/examples/grpc-server/llava.cpp
 fi
 set +e
 if grep -q "grpc-server" llama.cpp/examples/CMakeLists.txt; then
--- a/backend/cpp/ik-llama-cpp/run.sh
+++ b/backend/cpp/ik-llama-cpp/run.sh
@@ -2,7 +2,7 @@
 set -ex
 # Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath "$0")")
+CURDIR=$(dirname "$(realpath $0)")
 cd /
@@ -13,28 +13,28 @@ grep -e "flags" /proc/cpuinfo | head -1
 # ik_llama.cpp requires AVX2 — default to avx2 binary
 BINARY=ik-llama-cpp-avx2
-if [ -e "$CURDIR"/ik-llama-cpp-fallback ] && ! grep -q -e "\savx2\s" /proc/cpuinfo ; then
+if [ -e $CURDIR/ik-llama-cpp-fallback ] && ! grep -q -e "\savx2\s" /proc/cpuinfo ; then
 	echo "CPU:    AVX2   NOT found, using fallback"
 	BINARY=ik-llama-cpp-fallback
 fi
 # Extend ld library path with the dir where this script is located/lib
 if [ "$(uname)" == "Darwin" ]; then
-	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
+	export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
-	#export DYLD_FALLBACK_LIBRARY_PATH="$CURDIR"/lib:$DYLD_FALLBACK_LIBRARY_PATH
+	#export DYLD_FALLBACK_LIBRARY_PATH=$CURDIR/lib:$DYLD_FALLBACK_LIBRARY_PATH
 else
-	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
+	export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 fi
 # If there is a lib/ld.so, use it
-if [ -f "$CURDIR"/lib/ld.so ]; then
+if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using binary: $BINARY"
-	exec "$CURDIR"/lib/ld.so "$CURDIR"/$BINARY "$@"
+	exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
 fi
 echo "Using binary: $BINARY"
-exec "$CURDIR"/$BINARY "$@"
+exec $CURDIR/$BINARY "$@"
 # We should never reach this point; if we do, run the fallback binary
-exec "$CURDIR"/ik-llama-cpp-fallback "$@"
+exec $CURDIR/ik-llama-cpp-fallback "$@"
--- a/backend/cpp/ik-llama-cpp/utils.hpp
+++ b/backend/cpp/ik-llama-cpp/utils.hpp
@@ -11,12 +11,9 @@
 #include "json.hpp"
-#include "mtmd.h"
+#include "clip.h"
-// mtmd.h and ik_llama's entire server/common stack (chat.h, server-common.h,
+using json = nlohmann::json;
 // server-task.h, ...) declare `using json = nlohmann::ordered_json`, so match it
 // here: a plain `nlohmann::json` alias collides with mtmd.h's at global scope.
 using json = nlohmann::ordered_json;
 extern bool server_verbose;
@@ -114,12 +111,13 @@ struct slot_image
 {
    int32_t id;
-    // mtmd bitmap (image/audio) decoded from the request buffer. Owned by the
+    bool request_encode_image = false;
-    // slot; freed via mtmd_bitmap_free() on reset. The high-level mtmd pipeline
+    float * image_embedding = nullptr;
-    // (mtmd_tokenize + mtmd_helper_eval_chunks) consumes these directly, so the
+    int32_t image_tokens = 0;
-    // legacy eager-encode fields (embedding/tokens) and per-image prefix prompt
+
-    // are no longer needed.
+    clip_image_u8 * img_data;
-    mtmd_bitmap * bitmap = nullptr;
+
    std::string prefix_prompt; // prompt text that precedes this image
 };
 // completion token output with probabilities
--- a/backend/cpp/llama-cpp/CMakeLists.txt
+++ b/backend/cpp/llama-cpp/CMakeLists.txt
@@ -50,13 +50,8 @@ add_custom_command(
        "${hw_proto}"
      DEPENDS "${hw_proto}")
-# hw_grpc_proto: force STATIC. Under the CPU_ALL_VARIANTS build BUILD_SHARED_LIBS=ON
+# hw_grpc_proto
-# (ggml/llama become shared), which would otherwise make this glue library a DSO. As a
+add_library(hw_grpc_proto
 # DSO it references the hidden-visibility symbols in the static libprotobuf.a, which the
 # linker cannot satisfy ("hidden symbol ... in libprotobuf.a is referenced by DSO").
 # Keeping it STATIC links protobuf/gRPC directly into the grpc-server executable while
 # only ggml/llama stay shared. No effect on the static variants (already BUILD_SHARED_LIBS=OFF).
 add_library(hw_grpc_proto STATIC
  ${hw_grpc_srcs}
  ${hw_grpc_hdrs}
  ${hw_proto_srcs}
@@ -87,18 +82,3 @@ target_compile_features(${TARGET} PRIVATE cxx_std_11)
 if(TARGET BUILD_INFO)
  add_dependencies(${TARGET} BUILD_INFO)
 endif()
 # Unit test for the message-content normalization helper (message_content.h).
 # Off by default so the normal backend build is untouched; enable with
 # -DLLAMA_GRPC_BUILD_TESTS=ON and run via ctest. It reuses llama.cpp's vendored
 # <nlohmann/json.hpp> (propagated by the common helpers library) so it has no
 # extra dependency beyond what the backend already builds against.
 option(LLAMA_GRPC_BUILD_TESTS "Build grpc-server unit tests" OFF)
 if(LLAMA_GRPC_BUILD_TESTS)
    enable_testing()
    add_executable(message_content_test message_content_test.cpp message_content.h)
    target_include_directories(message_content_test PRIVATE ${CMAKE_CURRENT_SOURCE_DIR})
    target_link_libraries(message_content_test PRIVATE ${_LLAMA_COMMON_TARGET})
    target_compile_features(message_content_test PRIVATE cxx_std_17)
    add_test(NAME message_content_test COMMAND message_content_test)
 endif()
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@
-LLAMA_VERSION?=dbdaece23de9ac63f2e7ca9e6bfcdc4fc156a3fa
+LLAMA_VERSION?=039e20a2db9e87b2477c76cc04905f3e1acad77f
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
 CMAKE_ARGS?=
@@ -10,16 +10,8 @@ TARGET?=--target grpc-server
 JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
 ARCH?=$(shell uname -m)
-# Shared libs default to OFF: we link static gRPC and the avx/avx2/avx512/fallback
+# Disable shared libs: we link against static gRPC and cannot mix shared and static
-# variants are fully static. The CPU_ALL_VARIANTS build flips SHARED_LIBS=ON (ggml/llama
+CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=OFF
 # become shared so the dynamic CPU backends work; gRPC stays static via its imported
 # targets). SHARED_LIBS is a make variable, not an appended -D, so it survives the
 # recursive sub-make into the VARIANT build dir (which re-parses this Makefile) instead
 # of being re-clobbered by a second -DBUILD_SHARED_LIBS=OFF. EXTRA_CMAKE_ARGS is the hook
 # the CPU_ALL_VARIANTS target uses to inject -DGGML_BACKEND_DL/-DGGML_CPU_ALL_VARIANTS.
 SHARED_LIBS?=OFF
 EXTRA_CMAKE_ARGS?=
 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=$(SHARED_LIBS) -DLLAMA_CURL=OFF $(EXTRA_CMAKE_ARGS)
 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
 ifeq ($(NATIVE),false)
@@ -128,39 +120,15 @@ llama-cpp-fallback: llama.cpp
 	CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) VARIANT="llama-cpp-fallback-build" build-llama-cpp-grpc-server
 	cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-fallback-build/grpc-server llama-cpp-fallback
 # Single-build CPU backend using ggml's CPU_ALL_VARIANTS. Produces ONE grpc-server
 # plus a set of dlopen-able libggml-cpu-*.so (sandybridge/haswell/skylakex/...) that
 # ggml's backend registry selects from at runtime by probing host CPU features.
 # Replaces the avx/avx2/avx512/fallback multi-binary build on x86.
 #
 # CPU_ALL_VARIANTS requires GGML_BACKEND_DL, which requires BUILD_SHARED_LIBS=ON, so we
 # pass SHARED_LIBS=ON and the DL flags as make variables (NOT pre-expanded into the
 # CMAKE_ARGS env string): command-line make variables propagate through every recursive
 # sub-make, so the deepest VARIANT-dir build computes BUILD_SHARED_LIBS=ON consistently.
 # Only ggml/llama go shared - gRPC is found via its static imported targets, so the
 # grpc-server binary keeps static gRPC and only dynamically links ggml.
 #
 # TARGET adds "ggml": the per-microarch backends are runtime-dlopened, not link deps of
 # grpc-server, so they only build because each is an add_dependencies() of the ggml target.
 llama-cpp-cpu-all: llama.cpp
 	cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build
 	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build purge
 	$(info ${GREEN}I llama-cpp build info:cpu-all-variants${RESET})
 	$(MAKE) SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" VARIANT="llama-cpp-cpu-all-build" build-llama-cpp-grpc-server
 	cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build/grpc-server llama-cpp-cpu-all
 	rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs
 	find $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \;
 	@echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/
 llama-cpp-grpc: llama.cpp
 	cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build
 	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build purge
 	$(info ${GREEN}I llama-cpp build info:grpc${RESET})
-	CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" TARGET="--target grpc-server --target ggml-rpc-server" $(MAKE) VARIANT="llama-cpp-grpc-build" build-llama-cpp-grpc-server
+	CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" TARGET="--target grpc-server --target rpc-server" $(MAKE) VARIANT="llama-cpp-grpc-build" build-llama-cpp-grpc-server
 	cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build/grpc-server llama-cpp-grpc
 llama-cpp-rpc-server: llama-cpp-grpc
-	cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build/llama.cpp/build/bin/ggml-rpc-server llama-cpp-rpc-server
+	cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build/llama.cpp/build/bin/rpc-server llama-cpp-rpc-server
 llama.cpp:
 	mkdir -p llama.cpp
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -18,31 +18,6 @@
 #if __has_include("server-chat.cpp")
 #include "server-chat.cpp"
 #endif
 // server-schema.cpp exists only in llama.cpp after the upstream refactor that
 // extracted the JSON request-schema evaluation (previously the static
 // server_task::params_from_json_cmpl) into server_schema::eval_llama_cmpl_schema.
 // server-context.cpp and grpc-server.cpp both call into it, so its definitions
 // must be part of this translation unit or the link fails. __has_include keeps
 // the source compatible with older pins/forks (e.g. llama-cpp-turboquant) that
 // predate the split and still expose params_from_json_cmpl (see the guarded
 // call sites below).
 #if __has_include("server-schema.cpp")
 #define LOCALAI_HAS_SERVER_SCHEMA 1
 #include "server-schema.cpp"
 #endif
 // server-stream.cpp exists only in llama.cpp after the upstream refactor that
 // added the SSE stream-resumption layer (stream_session/stream_pipe_producer).
 // server-context.cpp calls into it (spipe->cleanup(), stream_aware_should_stop,
 // stream_session_attach_pipe), so its definitions must be part of this
 // translation unit or the link fails with "undefined reference to
 // stream_pipe_producer::cleanup()". The file is self-contained (its only
 // external symbols come from server-common, already pulled in above) and the
 // http route-handler factories it also defines are unused here but harmless.
 // __has_include keeps the source compatible with older pins/forks that predate
 // the split.
 #if __has_include("server-stream.cpp")
 #include "server-stream.cpp"
 #endif
 #include "server-context.cpp"
 // LocalAI
@@ -50,9 +25,7 @@
 #include "backend.pb.h"
 #include "backend.grpc.pb.h"
 #include "common.h"
 #include "arg.h"
 #include "chat-auto-parser.h"
 #include "message_content.h"
 #include <getopt.h>
 #include <grpcpp/ext/proto_server_reflection_plugin.h>
 #include <grpcpp/grpcpp.h>
@@ -607,10 +580,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
    params.checkpoint_min_step = 256;
 #endif
    // Raw upstream llama-server flags collected from any option entry that
    // starts with '-'. Applied once after the loop via common_params_parse.
    std::vector<std::string> extra_argv;
     // decode options. Options are in the form optname:optvalue, or just optname for booleans.
    for (int i = 0; i < request->options_size(); i++) {
        std::string opt = request->options(i);
@@ -1099,31 +1068,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                } catch (...) {}
            }
        // --- main model MoE on CPU (upstream --cpu-moe / --n-cpu-moe) ---
        } else if (!strcmp(optname, "cpu_moe")) {
            // Bool-style flag: keep all MoE expert weights on CPU.
            const bool enable = (optval == NULL) ||
                optval_str == "true" || optval_str == "1" || optval_str == "yes" ||
                optval_str == "on" || optval_str == "enabled";
            if (enable) {
                params.tensor_buft_overrides.push_back(llm_ffn_exps_cpu_override());
            }
        } else if (!strcmp(optname, "n_cpu_moe")) {
            if (optval != NULL) {
                try {
                    int n = std::stoi(optval_str);
                    if (n < 0) n = 0;
                    // Keep override-name storage alive for the lifetime of the
                    // params struct (mirrors upstream arg.cpp's function-local static).
                    static std::list<std::string> buft_overrides_main;
                    for (int i = 0; i < n; ++i) {
                        buft_overrides_main.push_back(llm_ffn_exps_block_regex(i));
                        params.tensor_buft_overrides.push_back(
                            {buft_overrides_main.back().c_str(), ggml_backend_cpu_buffer_type()});
                    }
                } catch (...) {}
            }
        // --- draft model tensor buffer overrides (upstream --spec-draft-override-tensor) ---
        } else if (!strcmp(optname, "draft_override_tensor") || !strcmp(optname, "spec_draft_override_tensor")) {
            // Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...
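The `cpu_moe` branch above treats the option as a bool-style flag: it is enabled when given with no value or with one of a fixed set of truthy strings. A minimal sketch of that check in isolation; `flag_enabled` is a hypothetical name, and the comparison is case-sensitive, as in the branch it mirrors:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: optval == nullptr models an option given without ":value".
bool flag_enabled(const char * optval) {
    if (optval == nullptr) {
        return true;  // bare option name means "on"
    }
    const std::string v = optval;
    return v == "true" || v == "1" || v == "yes" || v == "on" || v == "enabled";
}
```

Any other value, including differently cased spellings such as `TRUE`, leaves the flag disabled in this sketch.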
@@ -1155,30 +1099,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                else { cur.push_back(c); }
            }
            if (!cur.empty()) flush(cur);
        // --- generic passthrough: any entry starting with '-' is a raw
        //     upstream llama-server flag, forwarded verbatim to the parser. ---
        } else if (optname[0] == '-') {
            std::string flag = optname;
            // These flags make upstream's parser exit() (printing usage /
            // completion), which would kill the backend process. Skip them.
            if (flag == "-h" || flag == "--help" || flag == "--usage" ||
                flag == "--version" || flag == "--license" ||
                flag == "--list-devices" || flag == "-cl" ||
                flag == "--cache-list" ||
                flag.rfind("--completion", 0) == 0) {
                fprintf(stderr,
                    "[llama-cpp] ignoring passthrough flag that would exit: %s\n",
                    flag.c_str());
            } else {
                extra_argv.push_back(flag);
                // Preserve the whole value after the first ':' so embedded
                // colons (e.g. host:port) survive strtok's truncation of optval.
                auto colon = opt.find(':');
                if (colon != std::string::npos) {
                    extra_argv.push_back(opt.substr(colon + 1));
                }
            }
        }
    }
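The passthrough branch above forwards any option starting with `-`, skips flags that would make the upstream parser exit, and keeps everything after the first `:` as the value so embedded colons survive. A minimal sketch of that logic over `std::string`, assuming hypothetical helper names (`is_exit_flag`, `collect_passthrough`) that are not in the backend:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: flags that the branch above refuses to forward.
bool is_exit_flag(const std::string & flag) {
    return flag == "-h" || flag == "--help" || flag == "--usage" ||
           flag == "--version" || flag == "--license" ||
           flag == "--list-devices" || flag == "-cl" ||
           flag == "--cache-list" ||
           flag.rfind("--completion", 0) == 0;  // prefix match
}

// Hypothetical helper: returns true when opt was handled as a passthrough flag
// (either appended to extra_argv or deliberately ignored).
bool collect_passthrough(const std::string & opt, std::vector<std::string> & extra_argv) {
    if (opt.empty() || opt[0] != '-') {
        return false;  // not a raw flag; left to the named-option handling
    }
    const std::size_t colon = opt.find(':');
    const std::string flag = opt.substr(0, colon);  // whole string when there is no ':'
    if (is_exit_flag(flag)) {
        return true;   // swallowed so the backend process is not terminated
    }
    extra_argv.push_back(flag);
    if (colon != std::string::npos) {
        extra_argv.push_back(opt.substr(colon + 1));  // value keeps any further ':'
    }
    return true;
}
```

With this split, an entry such as `--host:127.0.0.1:8080` would become the two arguments `--host` and `127.0.0.1:8080`.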
@@ -1214,6 +1134,27 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
        }
    }
    if (!params.kv_overrides.empty()) {
        params.kv_overrides.emplace_back();
        params.kv_overrides.back().key[0] = 0;
    }
    // tensor_buft_overrides sentinel termination (mirrors upstream common/arg.cpp).
    // Real entries are pushed during option parsing; here we pad/terminate so the
    // model loader sees back().pattern == nullptr (GGML_ASSERT at common.cpp:1543)
    // and so llama_params_fit has the placeholder slots it requires.
    {
        const size_t ntbo = llama_max_tensor_buft_overrides();
        while (params.tensor_buft_overrides.size() < ntbo) {
            params.tensor_buft_overrides.push_back({nullptr, nullptr});
        }
    }
    // Terminate the draft tensor_buft_overrides list with a sentinel, mirroring
    // the main-model handling above.
    if (!params.speculative.draft.tensor_buft_overrides.empty()) {
        params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
    }
    // TODO: Add yarn
    if (!request->tensorsplit().empty()) {
@@ -1306,69 +1247,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            params.sampling.grammar_triggers.push_back(std::move(trigger));
        }
    }
    // Apply any raw upstream flags last so an explicit passthrough flag wins
    // over the LocalAI-resolved field it maps to (e.g. --ctx-size beats
    // context_size). This is the same parser llama-server itself uses.
    if (!extra_argv.empty()) {
        // common_params_parser_init resets a few fields for the SERVER example
        // (n_parallel -> -1, use_color). Snapshot n_parallel so an unrelated
        // passthrough flag can't silently clobber LocalAI's resolved value.
        const int saved_n_parallel = params.n_parallel;
        std::vector<char *> argv;
        std::string prog = "llama-server";
        argv.push_back(prog.data());
        for (auto & a : extra_argv) {
            argv.push_back(a.data());
        }
        // ctx_arg.params is a reference, so this overlays the given flags onto
        // `params` in place. Returns false on a recoverable parse error (and
        // self-restores params); may exit() on a hard error, exactly as
        // passing the same bad flag to llama-server would.
        if (!common_params_parse((int)argv.size(), argv.data(), params,
                                 LLAMA_EXAMPLE_SERVER)) {
            fprintf(stderr,
                "[llama-cpp] failed to parse passthrough options; ignoring them\n");
        }
        // Restore n_parallel unless a passthrough flag explicitly set it
        // (parser_init's reset sentinel for SERVER is -1).
        if (params.n_parallel == -1) {
            params.n_parallel = saved_n_parallel;
        }
    }
    // Terminate/pad the override vectors only after BOTH the named-option loop
    // and the generic passthrough (common_params_parse above) have pushed their
    // real entries, so back() is the null sentinel the model loader asserts on.
    // Running these before the passthrough let a passthrough flag (--cpu-moe,
    // --override-tensor, --override-kv, ...) append a real entry after the
    // sentinel: a GGML_ASSERT crash for tensor_buft_overrides, a silent drop for
    // kv_overrides. Double-termination is harmless (the while is a no-op if the
    // passthrough parse already padded; an extra trailing null is ignored).
    if (!params.kv_overrides.empty()) {
        params.kv_overrides.emplace_back();
        params.kv_overrides.back().key[0] = 0;
    }
    // tensor_buft_overrides sentinel termination (mirrors upstream common/arg.cpp).
    // Real entries are pushed during option parsing; here we pad/terminate so the
    // model loader sees back().pattern == nullptr (GGML_ASSERT at common.cpp:1543)
    // and so llama_params_fit has the placeholder slots it requires.
    {
        const size_t ntbo = llama_max_tensor_buft_overrides();
        while (params.tensor_buft_overrides.size() < ntbo) {
            params.tensor_buft_overrides.push_back({nullptr, nullptr});
        }
    }
    // Terminate the draft tensor_buft_overrides list with a sentinel, mirroring
    // the main-model handling above.
    if (!params.speculative.draft.tensor_buft_overrides.empty()) {
        params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
    }
 }
@@ -1630,20 +1508,242 @@ public:
                for (int i = 0; i < request->messages_size(); i++) {
                    const auto& msg = request->messages(i);
-                    llama_grpc::ReconstructedMessageInput rin;
+                    json msg_json;
-                    rin.role = msg.role();
+                    msg_json["role"] = msg.role();
-                    rin.content = msg.content();
+
-                    rin.name = msg.name();
+                    bool is_last_user_msg = (i == last_user_msg_idx);
-                    rin.tool_call_id = msg.tool_call_id();
+                    bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0 || request->videos_size() > 0);
-                    rin.reasoning_content = msg.reasoning_content();
+
-                    rin.tool_calls = msg.tool_calls();
+                    // Handle content - can be string, null, or array
-                    rin.is_last_user_msg = (i == last_user_msg_idx);
+                    // For multimodal content, we'll embed images/audio from separate fields
-                    if (rin.is_last_user_msg) {
+                    if (!msg.content().empty()) {
-                        for (int j = 0; j < request->images_size(); j++) rin.images.push_back(request->images(j));
+                        // Try to parse content as JSON to see if it's already an array
-                        for (int j = 0; j < request->audios_size(); j++) rin.audios.push_back(request->audios(j));
+                        json content_val;
-                        for (int j = 0; j < request->videos_size(); j++) rin.videos.push_back(request->videos(j));
+                        try {
                            content_val = json::parse(msg.content());
                            // Handle null values - convert to empty string to avoid template errors
                            if (content_val.is_null()) {
                                content_val = "";
                            }
                        } catch (const json::parse_error&) {
                            // Not JSON, treat as plain string
                            content_val = msg.content();
                        }
                        // If content is an object (e.g., from tool call failures), convert to string
                        if (content_val.is_object()) {
                            content_val = content_val.dump();
                        }
                        // If content is a string and this is the last user message with images/audio/video, combine them into a content array
                        if (content_val.is_string() && is_last_user_msg && has_images_or_audio) {
                            json content_array = json::array();
                            // Add text first
                            content_array.push_back({{"type", "text"}, {"text", content_val.get<std::string>()}});
                            // Add images
                            if (request->images_size() > 0) {
                                for (int j = 0; j < request->images_size(); j++) {
                                    json image_chunk;
                                    image_chunk["type"] = "image_url";
                                    json image_url;
                                    image_url["url"] = "data:image/jpeg;base64," + request->images(j);
                                    image_chunk["image_url"] = image_url;
                                    content_array.push_back(image_chunk);
                                }
                            }
                            // Add audios
                            if (request->audios_size() > 0) {
                                for (int j = 0; j < request->audios_size(); j++) {
                                    json audio_chunk;
                                    audio_chunk["type"] = "input_audio";
                                    json input_audio;
                                    input_audio["data"] = request->audios(j);
                                    input_audio["format"] = "wav"; // default, could be made configurable
                                    audio_chunk["input_audio"] = input_audio;
                                    content_array.push_back(audio_chunk);
                                }
                            }
                            if (request->videos_size() > 0) {
                                for (int j = 0; j < request->videos_size(); j++) {
                                    json video_chunk;
                                    video_chunk["type"] = "input_video";
                                    json input_video;
                                    input_video["data"] = request->videos(j);
                                    video_chunk["input_video"] = input_video;
                                    content_array.push_back(video_chunk);
                                }
                            }
                            msg_json["content"] = content_array;
                        } else {
                            // Use content as-is (already an array, not the last user message, or no media attached)
                            // Ensure null values are converted to empty string
                            if (content_val.is_null()) {
                                msg_json["content"] = "";
                            } else {
                                msg_json["content"] = content_val;
                            }
                        }
                    } else if (is_last_user_msg && has_images_or_audio) {
                        // If there is no text content but this is the last user message with images/audio/video, build a media-only content array
                        json content_array = json::array();
                        if (request->images_size() > 0) {
                            for (int j = 0; j < request->images_size(); j++) {
                                json image_chunk;
                                image_chunk["type"] = "image_url";
                                json image_url;
                                image_url["url"] = "data:image/jpeg;base64," + request->images(j);
                                image_chunk["image_url"] = image_url;
                                content_array.push_back(image_chunk);
                            }
                        }
                        if (request->audios_size() > 0) {
                            for (int j = 0; j < request->audios_size(); j++) {
                                json audio_chunk;
                                audio_chunk["type"] = "input_audio";
                                json input_audio;
                                input_audio["data"] = request->audios(j);
                                input_audio["format"] = "wav"; // default, could be made configurable
                                audio_chunk["input_audio"] = input_audio;
                                content_array.push_back(audio_chunk);
                            }
                        }
                        if (request->videos_size() > 0) {
                            for (int j = 0; j < request->videos_size(); j++) {
                                json video_chunk;
                                video_chunk["type"] = "input_video";
                                json input_video;
                                input_video["data"] = request->videos(j);
                                video_chunk["input_video"] = input_video;
                                content_array.push_back(video_chunk);
                            }
                        }
                        msg_json["content"] = content_array;
                    } else if (msg.role() == "tool") {
                        // Tool role messages must have content field set, even if empty
                        // Jinja templates expect content to be a string, not null or object
                        SRV_INF("[CONTENT DEBUG] PredictStream: Message %d is tool role, content_empty=%d\n", i, msg.content().empty() ? 1 : 0);
                        if (msg.content().empty()) {
                            msg_json["content"] = "";
                            SRV_INF("[CONTENT DEBUG] PredictStream: Message %d (tool): empty content, set to empty string\n", i);
                        } else {
                            SRV_INF("[CONTENT DEBUG] PredictStream: Message %d (tool): content exists: %s\n",
                                    i, msg.content().substr(0, std::min<size_t>(200, msg.content().size())).c_str());
                            // Content exists, parse and ensure it's a string
                            json content_val;
                            try {
                                content_val = json::parse(msg.content());
                                SRV_INF("[CONTENT DEBUG] PredictStream: Message %d (tool): parsed JSON, type=%s\n",
                                        i, content_val.is_null() ? "null" :
                                           content_val.is_object() ? "object" :
                                           content_val.is_string() ? "string" :
                                           content_val.is_array() ? "array" : "other");
                                // Handle null values - Jinja templates expect content to be a string, not null
                                if (content_val.is_null()) {
                                    msg_json["content"] = "";
                                    SRV_INF("[CONTENT DEBUG] PredictStream: Message %d (tool): null content, converted to empty string\n", i);
                                } else if (content_val.is_object()) {
                                    // If content is an object (e.g., from tool call failures/errors), convert to string
                                    msg_json["content"] = content_val.dump();
                                    SRV_INF("[CONTENT DEBUG] PredictStream: Message %d (tool): object content, converted to string: %s\n",
                                            i, content_val.dump().substr(0, std::min<size_t>(200, content_val.dump().size())).c_str());
                                } else if (content_val.is_string()) {
                                    msg_json["content"] = content_val.get<std::string>();
                                    SRV_INF("[CONTENT DEBUG] PredictStream: Message %d (tool): string content, using as-is\n", i);
                                } else {
                                    // For arrays or other types, convert to string
                                    msg_json["content"] = content_val.dump();
                                    SRV_INF("[CONTENT DEBUG] PredictStream: Message %d (tool): %s content, converted to string\n",
                                            i, content_val.is_array() ? "array" : "other type");
                                }
                            } catch (const json::parse_error&) {
                                // Not JSON, treat as plain string
                                msg_json["content"] = msg.content();
                                SRV_INF("[CONTENT DEBUG] PredictStream: Message %d (tool): not JSON, using as string\n", i);
                            }
                        }
                    } else {
                        // Ensure all messages have content set (fallback for any unhandled cases)
                        // Jinja templates expect content to be present, default to empty string if not set
                        if (!msg_json.contains("content")) {
                            SRV_INF("[CONTENT DEBUG] PredictStream: Message %d (role=%s): no content field, adding empty string\n",
                                    i, msg.role().c_str());
                            msg_json["content"] = "";
                        }
                    }
-                    messages_json.push_back(llama_grpc::build_reconstructed_message(rin));
+
                    // Add optional fields for OpenAI-compatible message format
                    if (!msg.name().empty()) {
                        msg_json["name"] = msg.name();
                    }
                    if (!msg.tool_call_id().empty()) {
                        msg_json["tool_call_id"] = msg.tool_call_id();
                    }
                    if (!msg.reasoning_content().empty()) {
                        msg_json["reasoning_content"] = msg.reasoning_content();
                    }
                    if (!msg.tool_calls().empty()) {
                        // Parse tool_calls JSON string and add to message
                        try {
                            json tool_calls = json::parse(msg.tool_calls());
                            msg_json["tool_calls"] = tool_calls;
                            SRV_INF("[TOOL CALLS DEBUG] PredictStream: Message %d has tool_calls: %s\n", i, tool_calls.dump().c_str());
                            // IMPORTANT: If message has tool_calls but content is empty or not set,
                            // set content to space " " instead of empty string "", because llama.cpp's
                            // common_chat_msgs_to_json_oaicompat converts empty strings to null (line 312),
                            // which causes template errors when accessing message.content[:tool_start_length]
                            if (!msg_json.contains("content") || (msg_json.contains("content") && msg_json["content"].is_string() && msg_json["content"].get<std::string>().empty())) {
                                SRV_INF("[CONTENT DEBUG] PredictStream: Message %d has tool_calls but empty content, setting to space\n", i);
                                msg_json["content"] = " ";
                            }
                            // Log each tool call with name and arguments
                            if (tool_calls.is_array()) {
                                for (size_t tc_idx = 0; tc_idx < tool_calls.size(); tc_idx++) {
                                    const auto& tc = tool_calls[tc_idx];
                                    std::string tool_name = "unknown";
                                    std::string tool_args = "{}";
                                    if (tc.contains("function")) {
                                        const auto& func = tc["function"];
                                        if (func.contains("name")) {
                                            tool_name = func["name"].get<std::string>();
                                        }
                                        if (func.contains("arguments")) {
                                            tool_args = func["arguments"].is_string() ?
                                                func["arguments"].get<std::string>() :
                                                func["arguments"].dump();
                                        }
                                    } else if (tc.contains("name")) {
                                        tool_name = tc["name"].get<std::string>();
                                        if (tc.contains("arguments")) {
                                            tool_args = tc["arguments"].is_string() ?
                                                tc["arguments"].get<std::string>() :
                                                tc["arguments"].dump();
                                        }
                                    }
                                    SRV_INF("[TOOL CALLS DEBUG] PredictStream: Message %d, tool_call %zu: name=%s, arguments=%s\n",
                                            i, tc_idx, tool_name.c_str(), tool_args.c_str());
                                }
                            }
                        } catch (const json::parse_error& e) {
                            SRV_WRN("Failed to parse tool_calls JSON: %s\n", e.what());
                        }
                    }
                    // Debug: Log final content state before adding to array
                    if (msg_json.contains("content")) {
                        if (msg_json["content"].is_null()) {
                            SRV_INF("[CONTENT DEBUG] PredictStream: Message %d FINAL STATE: content is NULL - THIS WILL CAUSE ERROR!\n", i);
                        } else {
                            SRV_INF("[CONTENT DEBUG] PredictStream: Message %d FINAL STATE: content type=%s, has_value=%d\n",
                                    i, msg_json["content"].is_string() ? "string" :
                                       msg_json["content"].is_array() ? "array" :
                                       msg_json["content"].is_object() ? "object" : "other",
                                    msg_json["content"].is_null() ? 0 : 1);
                        }
                    } else {
                        SRV_INF("[CONTENT DEBUG] PredictStream: Message %d FINAL STATE: NO CONTENT FIELD - THIS WILL CAUSE ERROR!\n", i);
                    }
                    messages_json.push_back(msg_json);
                }
                // Final safety check: Ensure no message has null content (Jinja templates require strings)
@@ -1822,27 +1922,25 @@ public:
                    body_json["min_p"] = data["min_p"];
                }
-                // Forward the chat_template_kwargs the Go layer resolved (model config
+                // Pass enable_thinking via chat_template_kwargs (where oaicompat_chat_params_parse reads it)
                // chat_template_kwargs + per-request metadata: enable_thinking,
                // reasoning_effort, preserve_thinking, ...). One generic merge replaces
                // the previous per-key handling - new template levers need no C++ change.
                // oaicompat_chat_params_parse reads these from body_json.
                const auto& metadata = request->metadata();
-                auto ctk_it = metadata.find("chat_template_kwargs");
+                auto et_it = metadata.find("enable_thinking");
-                if (ctk_it != metadata.end() && !ctk_it->second.empty()) {
+                if (et_it != metadata.end()) {
-                    try {
+                    if (!body_json.contains("chat_template_kwargs")) {
-                        json ctk = json::parse(ctk_it->second);
+                        body_json["chat_template_kwargs"] = json::object();
                        if (ctk.is_object()) {
                            if (!body_json.contains("chat_template_kwargs")) {
                                body_json["chat_template_kwargs"] = json::object();
                            }
                            for (auto& el : ctk.items()) {
                                body_json["chat_template_kwargs"][el.key()] = el.value();
                            }
                        }
                    } catch (const std::exception & e) {
                        SRV_WRN("failed to parse chat_template_kwargs metadata: %s\n", e.what());
                    }
                    body_json["chat_template_kwargs"]["enable_thinking"] = (et_it->second == "true");
                }
                // Pass reasoning_effort via chat_template_kwargs too: the lever
                // jinja templates like gpt-oss (Harmony) / LFM2.5 read, distinct
                // from enable_thinking which those templates ignore.
                auto re_it = metadata.find("reasoning_effort");
                if (re_it != metadata.end() && !re_it->second.empty()) {
                    if (!body_json.contains("chat_template_kwargs")) {
                        body_json["chat_template_kwargs"] = json::object();
                    }
                    body_json["chat_template_kwargs"]["reasoning_effort"] = re_it->second;
                }
                // Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
@@ -1864,7 +1962,36 @@ public:
                if (body_json.contains("messages") && body_json["messages"].is_array()) {
                    SRV_INF("[CONTENT DEBUG] PredictStream: Before oaicompat_chat_params_parse - checking %zu messages\n", body_json["messages"].size());
                    for (size_t idx = 0; idx < body_json["messages"].size(); idx++) {
-                        llama_grpc::normalize_template_message(body_json["messages"][idx]);
+                        auto& msg = body_json["messages"][idx];
                        std::string role_str = msg.contains("role") ? msg["role"].get<std::string>() : "unknown";
                        if (msg.contains("content")) {
                            if (msg["content"].is_null()) {
                                SRV_INF("[CONTENT DEBUG] PredictStream: BEFORE TEMPLATE - Message %zu (role=%s) has NULL content - FIXING!\n", idx, role_str.c_str());
                                msg["content"] = ""; // Fix null content
                            } else if (role_str == "tool" && msg["content"].is_array()) {
                                // Tool messages must have string content, not array
                                // oaicompat_chat_params_parse expects tool messages to have string content
                                SRV_INF("[CONTENT DEBUG] PredictStream: BEFORE TEMPLATE - Message %zu (role=tool) has array content, converting to string\n", idx);
                                msg["content"] = msg["content"].dump();
                            } else if (!msg["content"].is_string() && !msg["content"].is_array()) {
                                // If content is object or other non-string type, convert to string for templates
                                SRV_INF("[CONTENT DEBUG] PredictStream: BEFORE TEMPLATE - Message %zu (role=%s) content is not string/array, converting\n", idx, role_str.c_str());
                                if (msg["content"].is_object()) {
                                    msg["content"] = msg["content"].dump();
                                } else {
                                    msg["content"] = "";
                                }
                            } else {
                                SRV_INF("[CONTENT DEBUG] PredictStream: BEFORE TEMPLATE - Message %zu (role=%s): content type=%s\n",
                                        idx, role_str.c_str(),
                                        msg["content"].is_string() ? "string" :
                                        msg["content"].is_array() ? "array" :
                                        msg["content"].is_object() ? "object" : "other");
                            }
                        } else {
                            SRV_INF("[CONTENT DEBUG] PredictStream: BEFORE TEMPLATE - Message %zu (role=%s) MISSING content field - ADDING!\n", idx, role_str.c_str());
                            msg["content"] = ""; // Add missing content
                        }
                    }
                }
@@ -1973,11 +2100,7 @@ public:
                task.index = i;
                task.tokens    = std::move(inputs[i]);
 #ifdef LOCALAI_HAS_SERVER_SCHEMA
                task.params           = server_schema::eval_llama_cmpl_schema(
 #else
                task.params           = server_task::params_from_json_cmpl(
 #endif
                        ctx_server.impl->vocab,
                        params_base,
                        ctx_server.get_meta().slot_n_ctx,
@@ -1991,7 +2114,7 @@ public:
                // cannot detect tool calls or separate reasoning from content.
                task.params.res_type                 = TASK_RESPONSE_TYPE_OAI_CHAT;
                task.params.oaicompat_cmpl_id         = completion_id;
-                // oaicompat_model is already populated by eval_llama_cmpl_schema
+                // oaicompat_model is already populated by params_from_json_cmpl
                tasks.push_back(std::move(task));
            }
@@ -2196,20 +2319,264 @@ public:
                SRV_INF("[CONTENT DEBUG] Predict: Processing %d messages\n", request->messages_size());
                for (int i = 0; i < request->messages_size(); i++) {
                    const auto& msg = request->messages(i);
-                    llama_grpc::ReconstructedMessageInput rin;
+                    json msg_json;
-                    rin.role = msg.role();
+                    msg_json["role"] = msg.role();
-                    rin.content = msg.content();
+
-                    rin.name = msg.name();
+                    SRV_INF("[CONTENT DEBUG] Predict: Message %d: role=%s, content_empty=%d, content_length=%zu\n",
-                    rin.tool_call_id = msg.tool_call_id();
+                            i, msg.role().c_str(), msg.content().empty() ? 1 : 0, msg.content().size());
-                    rin.reasoning_content = msg.reasoning_content();
+                    if (!msg.content().empty()) {
-                    rin.tool_calls = msg.tool_calls();
+                        SRV_INF("[CONTENT DEBUG] Predict: Message %d content (first 200 chars): %s\n",
-                    rin.is_last_user_msg = (i == last_user_msg_idx);
+                                i, msg.content().substr(0, std::min<size_t>(200, msg.content().size())).c_str());
                    if (rin.is_last_user_msg) {
                        for (int j = 0; j < request->images_size(); j++) rin.images.push_back(request->images(j));
                        for (int j = 0; j < request->audios_size(); j++) rin.audios.push_back(request->audios(j));
                        for (int j = 0; j < request->videos_size(); j++) rin.videos.push_back(request->videos(j));
                    }
-                    messages_json.push_back(llama_grpc::build_reconstructed_message(rin));
+
                    bool is_last_user_msg = (i == last_user_msg_idx);
                    bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0 || request->videos_size() > 0);
                    // Handle content - can be string, null, or array
                    // For multimodal content, embed the images/audio/video carried in the request's separate fields
                    if (!msg.content().empty()) {
                        // Try to parse content as JSON to see if it's already an array
                        json content_val;
                        try {
                            content_val = json::parse(msg.content());
                            // Handle null values - convert to empty string to avoid template errors
                            if (content_val.is_null()) {
                                SRV_INF("[CONTENT DEBUG] Predict: Message %d parsed JSON is null, converting to empty string\n", i);
                                content_val = "";
                            }
                        } catch (const json::parse_error&) {
                            // Not JSON, treat as plain string
                            content_val = msg.content();
                        }
                        // If content is an object (e.g., from tool call failures), convert to string
                        if (content_val.is_object()) {
                            SRV_INF("[CONTENT DEBUG] Predict: Message %d content is object, converting to string\n", i);
                            content_val = content_val.dump();
                        }
                        // If content is a string and this is the last user message with images/audio/video, combine them into a content array
                        if (content_val.is_string() && is_last_user_msg && has_images_or_audio) {
                            json content_array = json::array();
                            // Add text first
                            content_array.push_back({{"type", "text"}, {"text", content_val.get<std::string>()}});
                            // Add images
                            if (request->images_size() > 0) {
                                for (int j = 0; j < request->images_size(); j++) {
                                    json image_chunk;
                                    image_chunk["type"] = "image_url";
                                    json image_url;
                                    image_url["url"] = "data:image/jpeg;base64," + request->images(j);
                                    image_chunk["image_url"] = image_url;
                                    content_array.push_back(image_chunk);
                                }
                            }
                            // Add audios
                            if (request->audios_size() > 0) {
                                for (int j = 0; j < request->audios_size(); j++) {
                                    json audio_chunk;
                                    audio_chunk["type"] = "input_audio";
                                    json input_audio;
                                    input_audio["data"] = request->audios(j);
                                    input_audio["format"] = "wav"; // default, could be made configurable
                                    audio_chunk["input_audio"] = input_audio;
                                    content_array.push_back(audio_chunk);
                                }
                            }
                            if (request->videos_size() > 0) {
                                for (int j = 0; j < request->videos_size(); j++) {
                                    json video_chunk;
                                    video_chunk["type"] = "input_video";
                                    json input_video;
                                    input_video["data"] = request->videos(j);
                                    video_chunk["input_video"] = input_video;
                                    content_array.push_back(video_chunk);
                                }
                            }
                            msg_json["content"] = content_array;
                        } else {
                            // Use content as-is (already array or not last user message)
                            // Ensure null values are converted to empty string
                            if (content_val.is_null()) {
                                SRV_INF("[CONTENT DEBUG] Predict: Message %d content_val was null, setting to empty string\n", i);
                                msg_json["content"] = "";
                            } else {
                                msg_json["content"] = content_val;
                                SRV_INF("[CONTENT DEBUG] Predict: Message %d content set, type=%s\n",
                                        i, content_val.is_string() ? "string" :
                                           content_val.is_array() ? "array" :
                                           content_val.is_object() ? "object" : "other");
                            }
                        }
                    } else if (is_last_user_msg && has_images_or_audio) {
                        // If no content but this is the last user message with images/audio, create content array
                        json content_array = json::array();
                        if (request->images_size() > 0) {
                            for (int j = 0; j < request->images_size(); j++) {
                                json image_chunk;
                                image_chunk["type"] = "image_url";
                                json image_url;
                                image_url["url"] = "data:image/jpeg;base64," + request->images(j);
                                image_chunk["image_url"] = image_url;
                                content_array.push_back(image_chunk);
                            }
                        }
                        if (request->audios_size() > 0) {
                            for (int j = 0; j < request->audios_size(); j++) {
                                json audio_chunk;
                                audio_chunk["type"] = "input_audio";
                                json input_audio;
                                input_audio["data"] = request->audios(j);
                                input_audio["format"] = "wav"; // default, could be made configurable
                                audio_chunk["input_audio"] = input_audio;
                                content_array.push_back(audio_chunk);
                            }
                        }
                        if (request->videos_size() > 0) {
                            for (int j = 0; j < request->videos_size(); j++) {
                                json video_chunk;
                                video_chunk["type"] = "input_video";
                                json input_video;
                                input_video["data"] = request->videos(j);
                                video_chunk["input_video"] = input_video;
                                content_array.push_back(video_chunk);
                            }
                        }
                        msg_json["content"] = content_array;
                        SRV_INF("[CONTENT DEBUG] Predict: Message %d created content array with media\n", i);
                    } else if (!msg.tool_calls().empty()) {
                        // Tool call messages may have null content, but templates expect string
                        // IMPORTANT: Set to space " " instead of empty string "", because llama.cpp's
                        // common_chat_msgs_to_json_oaicompat converts empty strings to null (line 312),
                        // which causes template errors when accessing message.content[:tool_start_length]
                        SRV_INF("[CONTENT DEBUG] Predict: Message %d has tool_calls, setting content to space (not empty string)\n", i);
                        msg_json["content"] = " ";
                    } else if (msg.role() == "tool") {
                        // Tool role messages must have content field set, even if empty
                        // Jinja templates expect content to be a string, not null or object
                        SRV_INF("[CONTENT DEBUG] Predict: Message %d is tool role, content_empty=%d\n", i, msg.content().empty() ? 1 : 0);
                        if (msg.content().empty()) {
                            msg_json["content"] = "";
                            SRV_INF("[CONTENT DEBUG] Predict: Message %d (tool): empty content, set to empty string\n", i);
                        } else {
                            SRV_INF("[CONTENT DEBUG] Predict: Message %d (tool): content exists: %s\n",
                                    i, msg.content().substr(0, std::min<size_t>(200, msg.content().size())).c_str());
                            // Content exists, parse and ensure it's a string
                            json content_val;
                            try {
                                content_val = json::parse(msg.content());
                                SRV_INF("[CONTENT DEBUG] Predict: Message %d (tool): parsed JSON, type=%s\n",
                                        i, content_val.is_null() ? "null" :
                                           content_val.is_object() ? "object" :
                                           content_val.is_string() ? "string" :
                                           content_val.is_array() ? "array" : "other");
                                // Handle null values - Jinja templates expect content to be a string, not null
                                if (content_val.is_null()) {
                                    msg_json["content"] = "";
                                    SRV_INF("[CONTENT DEBUG] Predict: Message %d (tool): null content, converted to empty string\n", i);
                                } else if (content_val.is_object()) {
                                    // If content is an object (e.g., from tool call failures/errors), convert to string
                                    msg_json["content"] = content_val.dump();
                                    SRV_INF("[CONTENT DEBUG] Predict: Message %d (tool): object content, converted to string: %s\n",
                                            i, content_val.dump().substr(0, std::min<size_t>(200, content_val.dump().size())).c_str());
                                } else if (content_val.is_string()) {
                                    msg_json["content"] = content_val.get<std::string>();
                                    SRV_INF("[CONTENT DEBUG] Predict: Message %d (tool): string content, using as-is\n", i);
                                } else {
                                    // For arrays or other types, convert to string
                                    msg_json["content"] = content_val.dump();
                                    SRV_INF("[CONTENT DEBUG] Predict: Message %d (tool): %s content, converted to string\n",
                                            i, content_val.is_array() ? "array" : "other type");
                                }
                            } catch (const json::parse_error&) {
                                // Not JSON, treat as plain string
                                msg_json["content"] = msg.content();
                                SRV_INF("[CONTENT DEBUG] Predict: Message %d (tool): not JSON, using as string\n", i);
                            }
                        }
                    } else {
                        // Ensure all messages have content set (fallback for any unhandled cases)
                        // Jinja templates expect content to be present, default to empty string if not set
                        if (!msg_json.contains("content")) {
                            SRV_INF("[CONTENT DEBUG] Predict: Message %d (role=%s): no content field, adding empty string\n",
                                    i, msg.role().c_str());
                            msg_json["content"] = "";
                        }
                    }
                    // Add optional fields for OpenAI-compatible message format
                    if (!msg.name().empty()) {
                        msg_json["name"] = msg.name();
                    }
                    if (!msg.tool_call_id().empty()) {
                        msg_json["tool_call_id"] = msg.tool_call_id();
                    }
                    if (!msg.reasoning_content().empty()) {
                        msg_json["reasoning_content"] = msg.reasoning_content();
                    }
                    if (!msg.tool_calls().empty()) {
                        // Parse tool_calls JSON string and add to message
                        try {
                            json tool_calls = json::parse(msg.tool_calls());
                            msg_json["tool_calls"] = tool_calls;
                            SRV_INF("[TOOL CALLS DEBUG] Predict: Message %d has tool_calls: %s\n", i, tool_calls.dump().c_str());
                            // IMPORTANT: If message has tool_calls but content is empty or not set,
                            // set content to space " " instead of empty string "", because llama.cpp's
                            // common_chat_msgs_to_json_oaicompat converts empty strings to null (line 312),
                            // which causes template errors when accessing message.content[:tool_start_length]
                            if (!msg_json.contains("content") || (msg_json.contains("content") && msg_json["content"].is_string() && msg_json["content"].get<std::string>().empty())) {
                                SRV_INF("[CONTENT DEBUG] Predict: Message %d has tool_calls but empty content, setting to space\n", i);
                                msg_json["content"] = " ";
                            }
                            // Log each tool call with name and arguments
                            if (tool_calls.is_array()) {
                                for (size_t tc_idx = 0; tc_idx < tool_calls.size(); tc_idx++) {
                                    const auto& tc = tool_calls[tc_idx];
                                    std::string tool_name = "unknown";
                                    std::string tool_args = "{}";
                                    if (tc.contains("function")) {
                                        const auto& func = tc["function"];
                                        if (func.contains("name")) {
                                            tool_name = func["name"].get<std::string>();
                                        }
                                        if (func.contains("arguments")) {
                                            tool_args = func["arguments"].is_string() ?
                                                func["arguments"].get<std::string>() :
                                                func["arguments"].dump();
                                        }
                                    } else if (tc.contains("name")) {
                                        tool_name = tc["name"].get<std::string>();
                                        if (tc.contains("arguments")) {
                                            tool_args = tc["arguments"].is_string() ?
                                                tc["arguments"].get<std::string>() :
                                                tc["arguments"].dump();
                                        }
                                    }
                                    SRV_INF("[TOOL CALLS DEBUG] Predict: Message %d, tool_call %zu: name=%s, arguments=%s\n",
                                            i, tc_idx, tool_name.c_str(), tool_args.c_str());
                                }
                            }
                        } catch (const json::parse_error& e) {
                            SRV_WRN("Failed to parse tool_calls JSON: %s\n", e.what());
                        }
                    }
                    // Debug: Log final content state before adding to array
                    if (msg_json.contains("content")) {
                        if (msg_json["content"].is_null()) {
                            SRV_INF("[CONTENT DEBUG] Predict: Message %d FINAL STATE: content is NULL - THIS WILL CAUSE ERROR!\n", i);
                        } else {
                            SRV_INF("[CONTENT DEBUG] Predict: Message %d FINAL STATE: content type=%s, has_value=%d\n",
                                    i, msg_json["content"].is_string() ? "string" :
                                       msg_json["content"].is_array() ? "array" :
                                       msg_json["content"].is_object() ? "object" : "other",
                                    msg_json["content"].is_null() ? 0 : 1);
                        }
                    } else {
                        SRV_INF("[CONTENT DEBUG] Predict: Message %d FINAL STATE: NO CONTENT FIELD - THIS WILL CAUSE ERROR!\n", i);
                    }
                    messages_json.push_back(msg_json);
                }
                // Final safety check: Ensure no message has null content (Jinja templates require strings)
@@ -2389,26 +2756,25 @@ public:
                    body_json["min_p"] = data["min_p"];
                }
-                // Forward the chat_template_kwargs the Go layer resolved (model config
-                // chat_template_kwargs + per-request metadata: enable_thinking,
-                // reasoning_effort, preserve_thinking, ...). One generic merge replaces
-                // the previous per-key handling - new template levers need no C++ change.
+                // Pass enable_thinking via chat_template_kwargs (where oaicompat_chat_params_parse reads it)
                const auto& predict_metadata = request->metadata();
-                auto predict_ctk_it = predict_metadata.find("chat_template_kwargs");
-                if (predict_ctk_it != predict_metadata.end() && !predict_ctk_it->second.empty()) {
-                    try {
-                        json ctk = json::parse(predict_ctk_it->second);
-                        if (ctk.is_object()) {
-                            if (!body_json.contains("chat_template_kwargs")) {
-                                body_json["chat_template_kwargs"] = json::object();
-                            }
-                            for (auto& el : ctk.items()) {
-                                body_json["chat_template_kwargs"][el.key()] = el.value();
-                            }
-                        }
-                    } catch (const std::exception & e) {
-                        SRV_WRN("failed to parse chat_template_kwargs metadata: %s\n", e.what());
+                auto predict_et_it = predict_metadata.find("enable_thinking");
+                if (predict_et_it != predict_metadata.end()) {
+                    if (!body_json.contains("chat_template_kwargs")) {
+                        body_json["chat_template_kwargs"] = json::object();
                    }
+                    body_json["chat_template_kwargs"]["enable_thinking"] = (predict_et_it->second == "true");
                }
                // Pass reasoning_effort via chat_template_kwargs too: the lever
                // jinja templates like gpt-oss (Harmony) / LFM2.5 read, distinct
                // from enable_thinking which those templates ignore.
                auto predict_re_it = predict_metadata.find("reasoning_effort");
                if (predict_re_it != predict_metadata.end() && !predict_re_it->second.empty()) {
                    if (!body_json.contains("chat_template_kwargs")) {
                        body_json["chat_template_kwargs"] = json::object();
                    }
                    body_json["chat_template_kwargs"]["reasoning_effort"] = predict_re_it->second;
                }
                // Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
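The hunk above swaps a generic merge of `chat_template_kwargs` for per-key forwarding. The difference can be sketched roughly as follows. This is a minimal illustration, not the backend's code: `Kwargs`, `merge_kwargs`, and `forward_known_keys` are hypothetical names, and a plain string map stands in for the JSON object (the real code stores `enable_thinking` as a boolean and forwards it even when the metadata value is empty).

```cpp
#include <cassert>
#include <initializer_list>
#include <map>
#include <string>

// Hypothetical stand-in for the JSON object in body_json["chat_template_kwargs"].
using Kwargs = std::map<std::string, std::string>;

// Generic merge: every key resolved upstream is copied over; later values win.
// A new template lever needs no change here.
inline void merge_kwargs(Kwargs& target, const Kwargs& resolved) {
    for (const auto& entry : resolved) {
        target[entry.first] = entry.second;
    }
}

// Per-key handling: only the keys named in this list are forwarded.
// Any other key present in the metadata is silently dropped.
inline void forward_known_keys(Kwargs& target, const Kwargs& metadata) {
    for (const char* key : {"enable_thinking", "reasoning_effort"}) {
        auto it = metadata.find(key);
        if (it != metadata.end() && !it->second.empty()) {
            target[key] = it->second;
        }
    }
}
```

Under these assumptions, a key such as `preserve_thinking` survives `merge_kwargs` but never reaches the template through `forward_known_keys`.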
@@ -2430,7 +2796,36 @@ public:
                if (body_json.contains("messages") && body_json["messages"].is_array()) {
                    SRV_INF("[CONTENT DEBUG] Predict: Before oaicompat_chat_params_parse - checking %zu messages\n", body_json["messages"].size());
                    for (size_t idx = 0; idx < body_json["messages"].size(); idx++) {
-                        llama_grpc::normalize_template_message(body_json["messages"][idx]);
+                        auto& msg = body_json["messages"][idx];
                        std::string role_str = msg.contains("role") ? msg["role"].get<std::string>() : "unknown";
                        if (msg.contains("content")) {
                            if (msg["content"].is_null()) {
                                SRV_INF("[CONTENT DEBUG] Predict: BEFORE TEMPLATE - Message %zu (role=%s) has NULL content - FIXING!\n", idx, role_str.c_str());
                                msg["content"] = ""; // Fix null content
                            } else if (role_str == "tool" && msg["content"].is_array()) {
                                // Tool messages must have string content, not array
                                // oaicompat_chat_params_parse expects tool messages to have string content
                                SRV_INF("[CONTENT DEBUG] Predict: BEFORE TEMPLATE - Message %zu (role=tool) has array content, converting to string\n", idx);
                                msg["content"] = msg["content"].dump();
                            } else if (!msg["content"].is_string() && !msg["content"].is_array()) {
                                // If content is object or other non-string type, convert to string for templates
                                SRV_INF("[CONTENT DEBUG] Predict: BEFORE TEMPLATE - Message %zu (role=%s) content is not string/array, converting\n", idx, role_str.c_str());
                                if (msg["content"].is_object()) {
                                    msg["content"] = msg["content"].dump();
                                } else {
                                    msg["content"] = "";
                                }
                            } else {
                                SRV_INF("[CONTENT DEBUG] Predict: BEFORE TEMPLATE - Message %zu (role=%s): content type=%s\n",
                                        idx, role_str.c_str(),
                                        msg["content"].is_string() ? "string" :
                                        msg["content"].is_array() ? "array" :
                                        msg["content"].is_object() ? "object" : "other");
                            }
                        } else {
                            SRV_INF("[CONTENT DEBUG] Predict: BEFORE TEMPLATE - Message %zu (role=%s) MISSING content field - ADDING!\n", idx, role_str.c_str());
                            msg["content"] = ""; // Add missing content
                        }
                    }
                }
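The decision order of the pre-template loop above can be sketched without a JSON library. This is only an illustration under stated assumptions: `Kind`, `Content`, and `normalize_for_template` are hypothetical names, and `text` holds either the string value or an already serialized form of an array or object (standing in for `dump()`).

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a JSON value: its kind plus a serialized form.
enum class Kind { Missing, Null, String, Array, Object, Scalar };

struct Content {
    Kind kind;
    std::string text; // string value, or serialized JSON for Array/Object
};

// Mirrors the branch order of the loop: missing/null -> "", tool arrays ->
// string, objects -> string, other scalars -> "", everything else untouched.
inline Content normalize_for_template(const std::string& role, const Content& c) {
    switch (c.kind) {
        case Kind::Missing:
        case Kind::Null:
            return {Kind::String, ""};      // templates expect a string field
        case Kind::Array:
            if (role == "tool") {
                return {Kind::String, c.text}; // tool content must be a string
            }
            return c;                        // multimodal typed-part array stays
        case Kind::Object:
            return {Kind::String, c.text};  // serialized object
        case Kind::Scalar:
            return {Kind::String, ""};      // number/bool collapses to empty
        case Kind::String:
            return c;
    }
    return c; // not reached; keeps compilers quiet
}
```

The only role-dependent branch is the array case: a `user` array passes through unchanged, while a `tool` array is flattened to a string.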
@@ -2542,11 +2937,7 @@ public:
                task.index = i;
                task.tokens    = std::move(inputs[i]);
-#ifdef LOCALAI_HAS_SERVER_SCHEMA
-                task.params           = server_schema::eval_llama_cmpl_schema(
-#else
                task.params           = server_task::params_from_json_cmpl(
-#endif
                        ctx_server.impl->vocab,
                        params_base,
                        ctx_server.get_meta().slot_n_ctx,
@@ -2558,7 +2949,7 @@ public:
                // reasoning, tool calls, and content are classified into ChatDeltas.
                task.params.res_type                 = TASK_RESPONSE_TYPE_OAI_CHAT;
                task.params.oaicompat_cmpl_id         = completion_id;
-                // oaicompat_model is already populated by eval_llama_cmpl_schema
+                // oaicompat_model is already populated by params_from_json_cmpl
                tasks.push_back(std::move(task));
            }
@@ -3095,7 +3486,7 @@ public:
        if (body.count("prompt") != 0) {
            const bool add_special = json_value(body, "add_special", false);
-            llama_tokens tokens = tokenize_mixed(ctx_server.impl->vocab, body.at("prompt"), add_special, true);
+            llama_tokens tokens = tokenize_mixed(ctx_server.impl->vocab, body.at("content"), add_special, true);
            for (const auto& token : tokens) {
--- a/backend/cpp/llama-cpp/message_content.h
+++ b/backend/cpp/llama-cpp/message_content.h
@@ -1,192 +0,0 @@
 #pragma once
 #include <string>
 #include <vector>
 #include <nlohmann/json.hpp>
 namespace llama_grpc {
 // Normalizes a proto message's content string into the JSON value used when
 // reconstructing OpenAI-format messages for the tokenizer (jinja) template.
 //
 // Shared by the streaming (PredictStream) and non-streaming (Predict) message
 // reconstruction paths so the two cannot drift.
 //
 // LocalAI's Go layer (schema.Messages.ToProto) always sends content as a plain
 // text string; multimodal media travels in separate proto fields, never inside
 // content. So user/system/developer content is *only ever* opaque text and must
 // NOT be JSON-sniffed: a prompt that merely looks like JSON (e.g. an ingredient
 // list ["1/4 cup sugar", ...]) would otherwise be reinterpreted as structured
 // content parts and rejected by oaicompat_chat_params_parse with
 // "unsupported content[].type" (https://github.com/mudler/LocalAI/issues/10524).
 // (developer is OpenAI's modern system alias - same "human-authored text" nature.)
 //
 // For assistant/tool messages we still collapse a literal JSON null/object
 // (tool-call bookkeeping) to a string, but we never turn a plain string into an
 // array/scalar. The array defense is therefore role-independent (arrays/scalars
 // fall through for every role); the role gate only governs the null/object case.
 inline nlohmann::ordered_json normalize_message_content(const std::string& role,
                                                        const std::string& content) {
    nlohmann::ordered_json content_val = content;
    if (role != "user" && role != "system" && role != "developer") {
        try {
            nlohmann::ordered_json parsed = nlohmann::ordered_json::parse(content);
            if (parsed.is_null()) {
                content_val = "";
            } else if (parsed.is_object()) {
                content_val = parsed.dump();
            }
            // arrays / scalars: keep the original plain-text string as-is
        } catch (const nlohmann::ordered_json::parse_error&) {
            // Not JSON, already the plain string
        }
    }
    return content_val;
 }
 // Final safety pass applied to each reconstructed OpenAI message right before it
 // is handed to oaicompat_chat_params_parse (jinja templating). Jinja templates
 // assume content is a string: a literal null breaks slicing such as
 // message.content[:N] (#7324), and a tool message with array content is rejected
 // (#7528). A multimodal user message legitimately carries a typed-part array
 // ({type:text}, {type:image_url}, ...), which must be left intact. Shared by the
 // streaming and non-streaming paths so this invariant cannot drift between them.
 inline void normalize_template_message(nlohmann::ordered_json& msg) {
    if (!msg.contains("content")) {
        msg["content"] = ""; // templates expect the field to exist
        return;
    }
    nlohmann::ordered_json& content = msg["content"];
    const std::string role = (msg.contains("role") && msg["role"].is_string())
                                 ? msg["role"].get<std::string>()
                                 : std::string();
    if (content.is_null()) {
        content = ""; // #7324: null would crash content[:N] slicing
    } else if (role == "tool" && content.is_array()) {
        content = content.dump(); // #7528: tool messages must have string content
    } else if (!content.is_string() && !content.is_array()) {
        if (content.is_object()) {
            content = content.dump(); // tool-call bookkeeping object -> string
        } else {
            content = ""; // other scalar (number/bool) -> empty
        }
    }
    // string, or a non-tool (multimodal) typed-part array: leave untouched
 }
 // One proto message's data, flattened to plain types so the reconstruction logic
 // can be shared and unit-tested without protobuf. The streaming and non-streaming
 // predict paths both populate this from proto::Message + the request's media.
 struct ReconstructedMessageInput {
    std::string role;
    std::string content;            // proto.Message.content (always a plain string)
    std::string name;
    std::string tool_call_id;
    std::string reasoning_content;
    std::string tool_calls;         // tool_calls as a JSON string, or empty
    bool is_last_user_msg = false;  // attach request media to this message
    std::vector<std::string> images; // base64 (jpeg)
    std::vector<std::string> audios; // base64 (wav)
    std::vector<std::string> videos; // base64
 };
 // Appends the request's media as OpenAI typed content parts. Imperative (not
 // brace-init) to avoid nlohmann's object-vs-array initializer-list ambiguity.
 inline void append_media_parts(nlohmann::ordered_json& content_array,
                               const std::vector<std::string>& images,
                               const std::vector<std::string>& audios,
                               const std::vector<std::string>& videos) {
    for (const auto& img : images) {
        nlohmann::ordered_json image_chunk;
        image_chunk["type"] = "image_url";
        nlohmann::ordered_json image_url;
        image_url["url"] = "data:image/jpeg;base64," + img;
        image_chunk["image_url"] = image_url;
        content_array.push_back(image_chunk);
    }
    for (const auto& aud : audios) {
        nlohmann::ordered_json audio_chunk;
        audio_chunk["type"] = "input_audio";
        nlohmann::ordered_json input_audio;
        input_audio["data"] = aud;
        input_audio["format"] = "wav"; // default; could be made configurable
        audio_chunk["input_audio"] = input_audio;
        content_array.push_back(audio_chunk);
    }
    for (const auto& vid : videos) {
        nlohmann::ordered_json video_chunk;
        video_chunk["type"] = "input_video";
        nlohmann::ordered_json input_video;
        input_video["data"] = vid;
        video_chunk["input_video"] = input_video;
        content_array.push_back(video_chunk);
    }
 }
 // Reconstructs a single OpenAI-format message (the object fed to
 // oaicompat_chat_params_parse) from a proto message. Shared by PredictStream and
 // Predict so the content/multimodal/tool_calls handling cannot drift between the
 // two stream modes (it previously lived as two ~150-line copies with a redundant
 // Predict-only tool_calls->" " branch). Guarantees content is always a string or
 // a typed-part array, never null/missing.
 inline nlohmann::ordered_json build_reconstructed_message(const ReconstructedMessageInput& in) {
    nlohmann::ordered_json msg_json;
    msg_json["role"] = in.role;
    const bool has_media = !in.images.empty() || !in.audios.empty() || !in.videos.empty();
    if (!in.content.empty()) {
        nlohmann::ordered_json content_val = normalize_message_content(in.role, in.content);
        if (content_val.is_string() && in.is_last_user_msg && has_media) {
            // Last user message + media: build a typed-part array (text first).
            nlohmann::ordered_json content_array = nlohmann::ordered_json::array();
            nlohmann::ordered_json text_part;
            text_part["type"] = "text";
            text_part["text"] = content_val.get<std::string>();
            content_array.push_back(text_part);
            append_media_parts(content_array, in.images, in.audios, in.videos);
            msg_json["content"] = content_array;
        } else if (content_val.is_null()) {
            msg_json["content"] = "";
        } else {
            msg_json["content"] = content_val;
        }
    } else if (in.is_last_user_msg && has_media) {
        // No text but media on the last user message: media-only typed array.
        nlohmann::ordered_json content_array = nlohmann::ordered_json::array();
        append_media_parts(content_array, in.images, in.audios, in.videos);
        msg_json["content"] = content_array;
    } else {
        // Empty content (any role, incl. tool/assistant): templates need a string.
        msg_json["content"] = "";
    }
    if (!in.name.empty()) {
        msg_json["name"] = in.name;
    }
    if (!in.tool_call_id.empty()) {
        msg_json["tool_call_id"] = in.tool_call_id;
    }
    if (!in.reasoning_content.empty()) {
        msg_json["reasoning_content"] = in.reasoning_content;
    }
    if (!in.tool_calls.empty()) {
        try {
            nlohmann::ordered_json tool_calls = nlohmann::ordered_json::parse(in.tool_calls);
            msg_json["tool_calls"] = tool_calls;
            // tool_calls + empty/blank content: use " " not "", because llama.cpp's
            // common_chat_msgs_to_json_oaicompat turns "" into null, which breaks
            // templates that slice message.content[:tool_start_length] (#7324).
            if (!msg_json.contains("content") ||
                (msg_json["content"].is_string() && msg_json["content"].get<std::string>().empty())) {
                msg_json["content"] = " ";
            }
        } catch (const nlohmann::ordered_json::parse_error&) {
            // Malformed tool_calls JSON: leave content as-is (prior behavior).
        }
    }
    return msg_json;
 }
 }  // namespace llama_grpc
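The part ordering and the tool-call placeholder rule described in the header above can be illustrated with plain standard-library types. This is a rough sketch, not the header's API: `Part`, `build_parts`, and `content_with_tool_calls` are hypothetical names, and each part is flattened to a type tag plus payload instead of a nested JSON object.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical flattened content part: a type tag plus its payload.
struct Part {
    std::string type;
    std::string payload;
};

// Text goes first (when present), then images, audios, videos, in that order.
inline std::vector<Part> build_parts(const std::string& text,
                                     const std::vector<std::string>& images,
                                     const std::vector<std::string>& audios,
                                     const std::vector<std::string>& videos) {
    std::vector<Part> parts;
    if (!text.empty()) {
        parts.push_back({"text", text});
    }
    for (const auto& img : images) {
        // Images are wrapped in a JPEG data URL, as in the header.
        parts.push_back({"image_url", "data:image/jpeg;base64," + img});
    }
    for (const auto& aud : audios) {
        parts.push_back({"input_audio", aud});
    }
    for (const auto& vid : videos) {
        parts.push_back({"input_video", vid});
    }
    return parts;
}

// Messages carrying tool_calls get " " instead of "" so that a later
// empty-string-to-null conversion cannot break content slicing in templates.
inline std::string content_with_tool_calls(const std::string& content) {
    return content.empty() ? " " : content;
}
```

Keeping the text part first matches the order used by both the inline code and the shared helper, so a template that reads parts sequentially sees the prompt before any media.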
--- a/backend/cpp/llama-cpp/message_content_test.cpp
+++ b/backend/cpp/llama-cpp/message_content_test.cpp
@@ -1,234 +0,0 @@
 // Unit tests for the shared message-reconstruction helpers (message_content.h).
 //
 // Build & run standalone (nlohmann/json single header on the include path):
 //   g++ -std=c++17 -I<dir-with-nlohmann> message_content_test.cpp -o t && ./t
 // or via CMake: -DLLAMA_GRPC_BUILD_TESTS=ON then ctest.
 //
 // Regression coverage for:
 //   #10524 - a user/system prompt that is itself a JSON-array string must stay
 //            plain text, never be reinterpreted as OpenAI structured parts.
 //   #7324  - assistant/tool null content -> "" (templates slice content[:N]);
 //            assistant+tool_calls+empty content -> " " (not "", which becomes null).
 //   #7528  - tool message array content must reach the template as a string.
 //   multimodal - last user message text + media -> typed-part array, media kept.
 #include <cassert>
 #include <iostream>
 #include <string>
 #include "message_content.h"
 using nlohmann::ordered_json;
 using llama_grpc::normalize_message_content;
 using llama_grpc::normalize_template_message;
 using llama_grpc::build_reconstructed_message;
 using llama_grpc::ReconstructedMessageInput;
 static int failures = 0;
 static void check(bool ok, const std::string& name, const std::string& detail = "") {
    if (!ok) {
        std::cerr << "FAIL " << name << (detail.empty() ? "" : ": " + detail) << "\n";
        failures++;
    }
 }
 // ---- normalize_message_content -------------------------------------------
 static void expect_norm_string(const char* name, const std::string& role,
                               const std::string& content, const std::string& want) {
    auto got = normalize_message_content(role, content);
    if (!got.is_string()) {
        check(false, name, "expected a JSON string, got " +
                               std::string(got.is_array() ? "array" : got.is_object() ? "object" : "other") +
                               " (" + got.dump() + ")");
        return;
    }
    check(got.get<std::string>() == want, name, "expected \"" + want + "\", got \"" + got.get<std::string>() + "\"");
 }
 static void test_normalize() {
    const std::string ingredients = R"(["1/4 cup brown sugar, packed","1 pound ground beef"])";
    // #10524 - JSON-array text must stay a string. Role-INDEPENDENT array defense.
    for (const char* role : {"user", "system", "developer", "function", "assistant", "tool"}) {
        expect_norm_string((std::string("json_array_stays_text:") + role).c_str(), role, ingredients, ingredients);
    }
    // #10524 - user/system/developer JSON-object text stays verbatim (NOT re-dumped).
    expect_norm_string("user_json_object_verbatim", "user", R"({"a":1})", R"({"a":1})");
    expect_norm_string("system_json_object_verbatim", "system", R"({"a":1})", R"({"a":1})");
    expect_norm_string("developer_json_object_verbatim", "developer", R"({"a":1})", R"({"a":1})");
    // Plain text unchanged for all roles.
    expect_norm_string("user_plain_text", "user", "hello world", "hello world");
    expect_norm_string("assistant_non_json_text_kept", "assistant", "hi [unclosed", "hi [unclosed");
    // #7324 boundary - user/system/developer literal "null" preserved (never parsed).
    expect_norm_string("user_literal_null_stays", "user", "null", "null");
    expect_norm_string("system_literal_null_stays", "system", "null", "null");
    expect_norm_string("developer_literal_null_stays", "developer", "null", "null");
    // #7324 - assistant/tool literal null collapses to empty string.
    expect_norm_string("assistant_null_to_empty", "assistant", "null", "");
    expect_norm_string("tool_null_to_empty", "tool", "null", "");
    // #7324/#7528 - assistant/tool object bookkeeping stringified (stays a string).
    check(normalize_message_content("assistant", R"({"tool":"x"})").is_string(), "assistant_object_stringified");
    check(normalize_message_content("tool", R"({"error":"boom"})").is_string(), "tool_object_stringified");
    // #10524-family - a bare scalar that parses as a JSON number stays the string.
    expect_norm_string("assistant_scalar_number_stays_string", "assistant", "42", "42");
    // baseline - empty content stays empty.
    expect_norm_string("user_empty_stays_empty", "user", "", "");
 }
 // ---- normalize_template_message (BEFORE TEMPLATE sanitizer) ---------------
 static void test_template_sanitizer() {
    // #7528 - a tool message with an ACTUAL array becomes a string.
    {
        ordered_json msg = {{"role", "tool"}, {"content", ordered_json::array({{{"type", "text"}, {"text", "r"}}})}};
        normalize_template_message(msg);
        check(msg["content"].is_string(), "before_template_tool_array_to_string", "got " + msg["content"].dump());
    }
    // #7324 - null content -> "" for any role.
    {
        ordered_json msg = {{"role", "assistant"}, {"content", nullptr}};
        normalize_template_message(msg);
        check(msg["content"].is_string() && msg["content"] == "", "before_template_null_to_empty");
    }
    // object content -> dumped string (would otherwise throw at the template).
    {
        ordered_json msg = {{"role", "assistant"}, {"content", {{"x", 1}}}};
        normalize_template_message(msg);
        check(msg["content"].is_string(), "before_template_object_to_string", "got " + msg["content"].dump());
    }
    // missing content field -> "".
    {
        ordered_json msg = {{"role", "user"}};
        normalize_template_message(msg);
        check(msg.contains("content") && msg["content"] == "", "before_template_missing_to_empty");
    }
    // multimodal: a well-typed user array must be left UNTOUCHED (role!=tool).
    {
        ordered_json parts = ordered_json::array();
        parts.push_back({{"type", "text"}, {"text", "x"}});
        ordered_json img; img["type"] = "image_url"; img["image_url"] = {{"url", "data:..."}};
        parts.push_back(img);
        ordered_json msg = {{"role", "user"}, {"content", parts}};
        normalize_template_message(msg);
        check(msg["content"].is_array() && msg["content"].size() == 2, "before_template_user_typed_array_preserved",
              "got " + msg["content"].dump());
    }
    // a plain string is left untouched.
    {
        ordered_json msg = {{"role", "user"}, {"content", "hello"}};
        normalize_template_message(msg);
        check(msg["content"] == "hello", "before_template_string_untouched");
    }
 }
 // ---- build_reconstructed_message ----------------------------------------
 static void test_reconstruction() {
    const std::string ingredients = R"(["1/4 cup brown sugar","1 pound ground beef"])";
    // #10524 end-state - user JSON-array text, no media -> string content.
    {
        ReconstructedMessageInput in;
        in.role = "user"; in.content = ingredients;
        auto m = build_reconstructed_message(in);
        check(m["content"].is_string() && m["content"] == ingredients, "recon_user_json_array_string",
              "got " + m["content"].dump());
    }
    // multimodal - user text + one image on last user msg -> typed array, image kept.
    {
        ReconstructedMessageInput in;
        in.role = "user"; in.content = ingredients; in.is_last_user_msg = true;
        in.images.push_back("BASE64IMG");
        auto m = build_reconstructed_message(in);
        check(m["content"].is_array() && m["content"].size() == 2, "recon_multimodal_text_plus_image",
              "got " + m["content"].dump());
        check(m["content"][0]["type"] == "text" && m["content"][0]["text"] == ingredients, "recon_multimodal_text_first");
        check(m["content"][1]["type"] == "image_url", "recon_multimodal_image_kept");
    }
    // multimodal media-only - empty text + image on last user msg.
    {
        ReconstructedMessageInput in;
        in.role = "user"; in.content = ""; in.is_last_user_msg = true;
        in.images.push_back("BASE64IMG");
        auto m = build_reconstructed_message(in);
        check(m["content"].is_array() && m["content"].size() == 1 && m["content"][0]["type"] == "image_url",
              "recon_media_only", "got " + m["content"].dump());
    }
    // #7528 - tool array-string content stays a string.
    {
        ReconstructedMessageInput in;
        in.role = "tool"; in.content = R"(["a","b"])"; in.tool_call_id = "call_1";
        auto m = build_reconstructed_message(in);
        check(m["content"].is_string() && m["content"] == R"(["a","b"])", "recon_tool_array_string",
              "got " + m["content"].dump());
        check(m["tool_call_id"] == "call_1", "recon_tool_call_id_set");
    }
    // tool empty content -> "".
    {
        ReconstructedMessageInput in;
        in.role = "tool"; in.content = "";
        auto m = build_reconstructed_message(in);
        check(m["content"].is_string() && m["content"] == "", "recon_tool_empty_to_string");
    }
    // #7324 - assistant + tool_calls + empty content -> " " (single space, not "").
    {
        ReconstructedMessageInput in;
        in.role = "assistant"; in.content = "";
        in.tool_calls = R"([{"id":"c1","type":"function","function":{"name":"f","arguments":"{}"}}])";
        auto m = build_reconstructed_message(in);
        check(m["content"].is_string() && m["content"] == " ", "recon_toolcalls_empty_content_space",
              "got " + m["content"].dump());
        check(m["tool_calls"].is_array() && m["tool_calls"].size() == 1, "recon_toolcalls_parsed");
    }
    // assistant + tool_calls + real content keeps the content.
    {
        ReconstructedMessageInput in;
        in.role = "assistant"; in.content = "I'll call f";
        in.tool_calls = R"([{"id":"c1","type":"function","function":{"name":"f","arguments":"{}"}}])";
        auto m = build_reconstructed_message(in);
        check(m["content"] == "I'll call f", "recon_toolcalls_with_content_kept");
    }
    // assistant null content -> "".
    {
        ReconstructedMessageInput in;
        in.role = "assistant"; in.content = "null";
        auto m = build_reconstructed_message(in);
        check(m["content"] == "", "recon_assistant_null_to_empty");
    }
    // malformed tool_calls JSON must not throw; content preserved.
    {
        ReconstructedMessageInput in;
        in.role = "assistant"; in.content = "hi"; in.tool_calls = "{not json";
        auto m = build_reconstructed_message(in);
        check(m["content"] == "hi" && !m.contains("tool_calls"), "recon_malformed_toolcalls_safe");
    }
    // optional fields: name + reasoning carried through.
    {
        ReconstructedMessageInput in;
        in.role = "tool"; in.content = "result"; in.name = "get_weather"; in.reasoning_content = "thinking";
        auto m = build_reconstructed_message(in);
        check(m["name"] == "get_weather" && m["reasoning_content"] == "thinking", "recon_optional_fields");
    }
 }
 int main() {
    test_normalize();
    test_template_sanitizer();
    test_reconstruction();
    if (failures == 0) {
        std::cout << "OK: all message_content tests passed\n";
        return 0;
    }
    std::cerr << failures << " test(s) failed\n";
    return 1;
 }
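Note on the removed test file: the role-dependent handling of a literal "null" that the `#7324` cases pin down can be sketched without nlohmann/json. This is a minimal illustration, not the code in message_content.h. `normalize_literal_null_sketch` is a hypothetical name, and it models only the one rule exercised by `user_literal_null_stays`, `assistant_null_to_empty` and `tool_null_to_empty`. Everything else (arrays, objects, scalars) is passed through unchanged here, which the real helper may not do.

```cpp
#include <string>

// Hypothetical, JSON-free stand-in for one rule of normalize_message_content:
// a literal "null" is bookkeeping for assistant/tool messages and collapses
// to "", while for user/system/developer it is ordinary prompt text.
std::string normalize_literal_null_sketch(const std::string& role,
                                          const std::string& content) {
    const bool bookkeeping_role = (role == "assistant" || role == "tool");
    if (bookkeeping_role && content == "null") {
        return "";  // templates that slice content[:N] need a string, not null
    }
    return content;  // assumption: all other inputs are out of scope for this sketch
}
```

With this sketch, `normalize_literal_null_sketch("user", "null")` keeps the text as typed, while the same content under the `assistant` or `tool` role becomes the empty string.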
--- a/backend/cpp/llama-cpp/package.sh
+++ b/backend/cpp/llama-cpp/package.sh
@@ -14,22 +14,6 @@ mkdir -p $CURDIR/package/lib
 cp -avrf $CURDIR/llama-cpp-* $CURDIR/package/
 cp -rfv $CURDIR/run.sh $CURDIR/package/
 # Bundle the ggml shared backends produced by the CPU_ALL_VARIANTS build (libggml-base.so,
 # libggml.so, libllama.so and the per-microarch libggml-cpu-*.so), all into package/lib.
 #
 # Two distinct resolution mechanisms both land here:
 #   - NEEDED deps (libggml-base/libggml/libllama): resolved by the dynamic linker via the
 #     LD_LIBRARY_PATH=$CURDIR/lib that run.sh exports.
 #   - The per-microarch libggml-cpu-*.so are NOT linked; ggml *discovers* them at runtime by
 #     scanning the executable's own directory (readlink /proc/self/exe). run.sh launches via
 #     the bundled $CURDIR/lib/ld.so, so /proc/self/exe -> .../lib/ld.so and ggml scans lib/.
 #     That is why the variants must sit in lib/ (next to ld.so), not just on the link path.
 # No-op on builds (arm64/darwin) that don't produce the all-variants set.
 if [ -d "$CURDIR/ggml-shared-libs" ]; then
    echo "Bundling ggml shared backends (CPU_ALL_VARIANTS)..."
    cp -avf $CURDIR/ggml-shared-libs/*.so* $CURDIR/package/lib/
 fi
 # Detect architecture and copy appropriate libraries
 if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    # x86_64 architecture
--- a/backend/cpp/llama-cpp/prepare.sh
+++ b/backend/cpp/llama-cpp/prepare.sh
@@ -18,10 +18,6 @@ done
 cp -r CMakeLists.txt llama.cpp/tools/grpc-server/
 cp -r grpc-server.cpp llama.cpp/tools/grpc-server/
 # Shared message-reconstruction helpers (included by grpc-server.cpp) and their
 # unit test (compiled only when -DLLAMA_GRPC_BUILD_TESTS=ON).
 cp -r message_content.h llama.cpp/tools/grpc-server/
 cp -r message_content_test.cpp llama.cpp/tools/grpc-server/
 cp -rfv llama.cpp/vendor/nlohmann/json.hpp llama.cpp/tools/grpc-server/
 cp -rfv llama.cpp/vendor/cpp-httplib/httplib.h llama.cpp/tools/grpc-server/
--- a/backend/cpp/llama-cpp/run.sh
+++ b/backend/cpp/llama-cpp/run.sh
@@ -2,7 +2,7 @@
 set -ex
 # Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath "$0")")
+CURDIR=$(dirname "$(realpath $0)")
 cd /
@@ -12,41 +12,55 @@ grep -e "flags" /proc/cpuinfo | head -1
 BINARY=llama-cpp-fallback
-# CPU images (x86, arm64, darwin) ship a single llama-cpp-cpu-all built with ggml
+if grep -q -e "\savx\s" /proc/cpuinfo ; then
-# CPU_ALL_VARIANTS: ggml's backend registry dlopens the best libggml-cpu-*.so for this
+	echo "CPU:    AVX    found OK"
-# host, so no shell-side AVX probing. GPU images (cublas/sycl/vulkan/hipblas) ship only
+	if [ -e $CURDIR/llama-cpp-avx ]; then
-# llama-cpp-fallback (the accelerator does the compute), so fall back to it when absent.
+		BINARY=llama-cpp-avx
-if [ -e "$CURDIR"/llama-cpp-cpu-all ]; then
+	fi
-	BINARY=llama-cpp-cpu-all
+fi
 if grep -q -e "\savx2\s" /proc/cpuinfo ; then
 	echo "CPU:    AVX2   found OK"
 	if [ -e $CURDIR/llama-cpp-avx2 ]; then
 		BINARY=llama-cpp-avx2
 	fi
 fi
 # Check avx 512
 if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
 	echo "CPU:    AVX512F found OK"
 	if [ -e $CURDIR/llama-cpp-avx512 ]; then
 		BINARY=llama-cpp-avx512
 	fi
 fi
 if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then
-	if [ -e "$CURDIR"/llama-cpp-grpc ]; then
+	if [ -e $CURDIR/llama-cpp-grpc ]; then
 		BINARY=llama-cpp-grpc
 	fi
 fi
 # Extend ld library path with the dir where this script is located/lib
 if [ "$(uname)" == "Darwin" ]; then
-	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
+	export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
-	#export DYLD_FALLBACK_LIBRARY_PATH="$CURDIR"/lib:$DYLD_FALLBACK_LIBRARY_PATH
+	#export DYLD_FALLBACK_LIBRARY_PATH=$CURDIR/lib:$DYLD_FALLBACK_LIBRARY_PATH
 else
-	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
+	export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 	# Tell rocBLAS where to find TensileLibrary data (GPU kernel tuning files)
 	if [ -d "$CURDIR/lib/rocblas/library" ]; then
-		export ROCBLAS_TENSILE_LIBPATH="$CURDIR"/lib/rocblas/library
+		export ROCBLAS_TENSILE_LIBPATH=$CURDIR/lib/rocblas/library
 	fi
 fi
 # If there is a lib/ld.so, use it
-if [ -f "$CURDIR"/lib/ld.so ]; then
+if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using binary: $BINARY"
-	exec "$CURDIR"/lib/ld.so "$CURDIR"/$BINARY "$@"
+	exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
 fi
 echo "Using binary: $BINARY"
-exec "$CURDIR"/$BINARY "$@"
+exec $CURDIR/$BINARY "$@"
 # We should never reach this point, however just in case we do, run fallback
-exec "$CURDIR"/llama-cpp-fallback "$@"
+exec $CURDIR/llama-cpp-fallback "$@"
--- a/backend/cpp/privacy-filter/.gitignore
+++ b/backend/cpp/privacy-filter/.gitignore
@@ -1,9 +0,0 @@
 /privacy-filter.cpp
 build/
 package/
 grpc-server
 *.o
 backend.pb.cc
 backend.pb.h
 backend.grpc.pb.cc
 backend.grpc.pb.h
--- a/backend/cpp/privacy-filter/CMakeLists.txt
+++ b/backend/cpp/privacy-filter/CMakeLists.txt
@@ -1,77 +0,0 @@
 cmake_minimum_required(VERSION 3.21)
 project(privacy-filter-grpc-server LANGUAGES CXX C)
 set(CMAKE_CXX_STANDARD 17)
 set(CMAKE_CXX_STANDARD_REQUIRED ON)
 set(TARGET grpc-server)
 # Path to the privacy-filter.cpp engine sources. The Makefile arranges for this
 # to exist (clone of a pinned commit, or a symlink to PRIVACY_FILTER_SRC).
 set(PRIVACY_FILTER_DIR "${CMAKE_CURRENT_SOURCE_DIR}/privacy-filter.cpp"
    CACHE PATH "Path to the privacy-filter.cpp engine source tree")
 find_package(Threads REQUIRED)
 find_package(Protobuf CONFIG QUIET)
 if(NOT Protobuf_FOUND)
    find_package(Protobuf REQUIRED)
 endif()
 find_package(gRPC CONFIG QUIET)
 if(NOT gRPC_FOUND)
    # Ubuntu's apt-installed grpc++ does not ship a CMake config - fall back.
    find_library(GRPCPP_LIB grpc++ REQUIRED)
    find_library(GRPCPP_REFLECTION_LIB grpc++_reflection REQUIRED)
    add_library(gRPC::grpc++ INTERFACE IMPORTED)
    set_target_properties(gRPC::grpc++ PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_LIB}")
    add_library(gRPC::grpc++_reflection INTERFACE IMPORTED)
    set_target_properties(gRPC::grpc++_reflection PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_REFLECTION_LIB}")
 endif()
 find_program(_PROTOC NAMES protoc REQUIRED)
 find_program(_GRPC_CPP_PLUGIN NAMES grpc_cpp_plugin REQUIRED)
 get_filename_component(HW_PROTO "${CMAKE_CURRENT_SOURCE_DIR}/../../backend.proto" ABSOLUTE)
 get_filename_component(HW_PROTO_PATH "${HW_PROTO}" PATH)
 set(HW_PROTO_SRCS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.cc")
 set(HW_PROTO_HDRS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.h")
 set(HW_GRPC_SRCS  "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.cc")
 set(HW_GRPC_HDRS  "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.h")
 add_custom_command(
    OUTPUT "${HW_PROTO_SRCS}" "${HW_PROTO_HDRS}" "${HW_GRPC_SRCS}" "${HW_GRPC_HDRS}"
    COMMAND ${_PROTOC}
    ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
         --cpp_out  "${CMAKE_CURRENT_BINARY_DIR}"
         -I "${HW_PROTO_PATH}"
         --plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN}"
         "${HW_PROTO}"
    DEPENDS "${HW_PROTO}")
 add_library(hw_grpc_proto STATIC
    ${HW_GRPC_SRCS} ${HW_GRPC_HDRS}
    ${HW_PROTO_SRCS} ${HW_PROTO_HDRS})
 target_include_directories(hw_grpc_proto PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
 # The generated proto/grpc sources include protobuf and grpc++ headers, so this
 # library must see their include dirs. Linking the imported targets propagates
 # them. On Linux the apt headers live in /usr/include (default search path) so
 # this was a no-op; on macOS the Homebrew headers are under /opt/homebrew and
 # would otherwise be missed (runtime_version.h not found).
 target_link_libraries(hw_grpc_proto PUBLIC
    protobuf::libprotobuf
    gRPC::grpc++)
 # Build only the pf static lib (+ ggml) from the engine tree — no CLI/bench/tests.
 # PF_VULKAN is honored when passed on the cmake command line (it lands in the
 # shared cache the engine reads).
 set(PF_BUILD_TOOLS OFF CACHE BOOL "" FORCE)
 set(PF_BUILD_TESTS OFF CACHE BOOL "" FORCE)
 add_subdirectory(${PRIVACY_FILTER_DIR} ${CMAKE_CURRENT_BINARY_DIR}/privacy-filter.cpp)
 add_executable(${TARGET} grpc-server.cpp)
 target_link_libraries(${TARGET} PRIVATE
    pf
    hw_grpc_proto
    gRPC::grpc++
    gRPC::grpc++_reflection
    protobuf::libprotobuf
    Threads::Threads)
--- a/backend/cpp/privacy-filter/Makefile
+++ b/backend/cpp/privacy-filter/Makefile
@@ -1,77 +0,0 @@
 # privacy-filter backend Makefile.
 #
 # Wraps the standalone privacy-filter.cpp GGML engine (the openai-privacy-filter
 # PII/NER token classifier) as a LocalAI gRPC backend. The engine source is
 # fetched at the pin below — .github/workflows/bump_deps.yaml finds and updates
 # PRIVACY_FILTER_VERSION, matching the llama-cpp / ds4 convention.
 #
 # Local development: point at a working checkout instead of cloning, e.g.
 #   make PRIVACY_FILTER_SRC=$HOME/c/privacy-filter.cpp grpc-server
 PRIVACY_FILTER_VERSION?=98f52c5ef2250f207cc6b9a6aef05393a120cb7c
 PRIVACY_FILTER_REPO?=https://github.com/localai-org/privacy-filter.cpp
 PRIVACY_FILTER_SRC?=
 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
 BUILD_DIR := build
 BUILD_TYPE ?=
 NATIVE ?= false
 JOBS ?= $(shell nproc 2>/dev/null || echo 4)
 CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release
 # GPU backends; the default (cpu) needs no extra flags. 'cublas' is LocalAI's
 # name for the CUDA build (matches llama-cpp / ds4), mapping to the engine's
 # GGML_CUDA path; 'vulkan' selects the ggml Vulkan backend.
 ifeq ($(BUILD_TYPE),cublas)
    CMAKE_ARGS += -DPF_CUDA=ON
 endif
 ifeq ($(BUILD_TYPE),vulkan)
    CMAKE_ARGS += -DPF_VULKAN=ON
 endif
 # Portable binaries for distribution: disable -march=native unless asked.
 ifneq ($(NATIVE),true)
    CMAKE_ARGS += -DGGML_NATIVE=OFF
 endif
 .PHONY: grpc-server package clean purge test all
 all: grpc-server
 # Provide the engine sources at ./privacy-filter.cpp. With PRIVACY_FILTER_SRC
 # set we symlink a local checkout (instant, no network); otherwise we clone the
 # pinned commit and its ggml submodule. The directory/symlink is the target, so
 # make only does this once — run 'make purge && make' to refetch after a bump.
 privacy-filter.cpp:
 ifneq ($(PRIVACY_FILTER_SRC),)
 	ln -sfn $(abspath $(PRIVACY_FILTER_SRC)) privacy-filter.cpp
 else
 	mkdir -p privacy-filter.cpp
 	cd privacy-filter.cpp && \
 	git init -q && \
 	git remote add origin $(PRIVACY_FILTER_REPO) && \
 	git fetch --depth 1 origin $(PRIVACY_FILTER_VERSION) && \
 	git checkout FETCH_HEAD && \
 	git submodule update --init --recursive --depth 1
 endif
 grpc-server: privacy-filter.cpp
 	@echo "Building privacy-filter grpc-server ($(BUILD_TYPE)) with $(CMAKE_ARGS)"
 	mkdir -p $(BUILD_DIR)
 	cd $(BUILD_DIR) && cmake $(CMAKE_ARGS) $(CURRENT_MAKEFILE_DIR) && cmake --build . --config Release -j $(JOBS)
 	cp $(BUILD_DIR)/grpc-server grpc-server
 package: grpc-server
 	bash package.sh
 test:
 	@echo "privacy-filter backend: parity/regression coverage lives in the engine repo"
 clean:
 	rm -rf $(BUILD_DIR) grpc-server package
 # 'privacy-filter.cpp' may be a symlink (PRIVACY_FILTER_SRC) — rm without a
 # trailing slash removes the link, never the linked-to checkout.
 purge: clean
 	rm -rf privacy-filter.cpp
--- a/backend/cpp/privacy-filter/grpc-server.cpp
+++ b/backend/cpp/privacy-filter/grpc-server.cpp
@@ -1,210 +0,0 @@
 // privacy-filter LocalAI gRPC backend.
 //
 // Thin shim over privacy-filter.cpp's flat C API (include/pf.h): a standalone
 // GGML engine for the openai-privacy-filter token-classification model family
 // (PII NER). It replaces the llama.cpp-patched TokenClassify path for this one
 // model family — same GGUF files, no llama.cpp carry-patches.
 //
 // Only the RPCs the PII tier needs are implemented: LoadModel, TokenClassify,
 // plus Health / Status / Free. Everything else inherits the generated base
 // class default (UNIMPLEMENTED).
 #include "backend.pb.h"
 #include "backend.grpc.pb.h"
 #include "pf.h"
 #include <grpcpp/grpcpp.h>
 #include <grpcpp/server.h>
 #include <grpcpp/server_builder.h>
 #include <grpcpp/ext/proto_server_reflection_plugin.h>
 #include <atomic>
 #include <chrono>
 #include <csignal>
 #include <iostream>
 #include <memory>
 #include <mutex>
 #include <string>
 using grpc::Server;
 using grpc::ServerBuilder;
 using grpc::ServerContext;
 // NOTE: do NOT alias grpc::Status as Status — the Status RPC method below would
 // shadow the type and break the other method signatures. Use GStatus instead.
 using GStatus = ::grpc::Status;
 using grpc::StatusCode;
 namespace {
 // The engine is single-model-per-process: LocalAI spawns one backend process
 // per loaded model. g_mu guards (re)load against in-flight classification.
 std::mutex          g_mu;
 pf_ctx *            g_ctx = nullptr;
 std::atomic<Server *> g_server{nullptr};
 // Resolve the device string the engine expects ("cpu" / "gpu" / "cuda" /
 // "vulkan", optionally ":N"). Priority: an explicit "device:..." in
 // ModelOptions.Options, then a non-zero NGPULayers as a coarse "use the GPU"
 // signal, else CPU. "gpu" lets the engine pick whichever GPU backend this
 // binary was compiled with (CUDA or Vulkan), so the same config works on
 // either build; pin "device:cuda"/"device:vulkan" to be explicit.
 std::string resolve_device(const backend::ModelOptions * opts) {
    for (const auto & o : opts->options()) {
        const std::string prefix = "device:";
        if (o.rfind(prefix, 0) == 0) {
            return o.substr(prefix.size());
        }
    }
    if (opts->ngpulayers() > 0) {
        return "gpu";
    }
    return "cpu";
 }
 class PrivacyFilterBackend final : public backend::Backend::Service {
 public:
    GStatus Health(ServerContext *, const backend::HealthMessage *,
                   backend::Reply * reply) override {
        reply->set_message("OK");
        return GStatus::OK;
    }
    GStatus Status(ServerContext *, const backend::HealthMessage *,
                   backend::StatusResponse * response) override {
        std::lock_guard<std::mutex> lock(g_mu);
        response->set_state(g_ctx ? backend::StatusResponse::READY
                                  : backend::StatusResponse::UNINITIALIZED);
        return GStatus::OK;
    }
    GStatus LoadModel(ServerContext *, const backend::ModelOptions * request,
                      backend::Result * result) override {
        std::lock_guard<std::mutex> lock(g_mu);
        // ModelFile is the absolute path LocalAI resolves; Model is the bare
        // name. Prefer the former, fall back to the latter.
        const std::string path =
            !request->modelfile().empty() ? request->modelfile() : request->model();
        if (path.empty()) {
            result->set_success(false);
            result->set_message("no model path supplied");
            return GStatus::OK;
        }
        const std::string device = resolve_device(request);
        if (g_ctx) { pf_free(g_ctx); g_ctx = nullptr; }
        pf_ctx * ctx = pf_load(path.c_str(), device.c_str(), request->threads());
        const char * err = pf_last_error(ctx);
        if (err) {
            result->set_success(false);
            result->set_message(std::string("privacy-filter load failed: ") + err);
            pf_free(ctx);
            return GStatus::OK;
        }
        // ContextSize, when set, becomes the per-forward window. The engine
        // ignores values that are too small to window (<= 2*halo) and just
        // runs a single forward, so passing it through is always safe.
        if (request->contextsize() > 0) {
            pf_set_window(ctx, request->contextsize());
        }
        g_ctx = ctx;
        result->set_success(true);
        result->set_message("privacy-filter loaded (" + device + ")");
        return GStatus::OK;
    }
    GStatus TokenClassify(ServerContext *, const backend::TokenClassifyRequest * request,
                          backend::TokenClassifyResponse * response) override {
        std::lock_guard<std::mutex> lock(g_mu);
        if (!g_ctx) {
            return GStatus(StatusCode::FAILED_PRECONDITION, "Model not loaded");
        }
        const std::string & text = request->text();
        if (text.empty()) {
            return GStatus::OK;  // no text -> no entities
        }
        pf_entity * ents = nullptr;
        size_t      n    = 0;
        if (pf_classify(g_ctx, text.data(), text.size(), request->threshold(), &ents, &n) != 0) {
            const char * err = pf_last_error(g_ctx);
            return GStatus(StatusCode::INTERNAL,
                           std::string("TokenClassify failed: ") + (err ? err : "unknown"));
        }
        // Byte offsets are into the original UTF-8 text; the engine already
        // applied the threshold and whitespace-trimmed span edges.
        for (size_t i = 0; i < n; i++) {
            backend::TokenClassifyEntity * ent = response->add_entities();
            ent->set_entity_group(ents[i].label ? ents[i].label : "");
            ent->set_start(ents[i].start);
            ent->set_end(ents[i].end);
            ent->set_score(ents[i].score);
            ent->set_text(text.substr((size_t) ents[i].start,
                                      (size_t) (ents[i].end - ents[i].start)));
        }
        pf_entities_free(ents, n);
        return GStatus::OK;
    }
    GStatus Free(ServerContext *, const backend::HealthMessage *,
                 backend::Result * result) override {
        std::lock_guard<std::mutex> lock(g_mu);
        if (g_ctx) { pf_free(g_ctx); g_ctx = nullptr; }
        result->set_success(true);
        return GStatus::OK;
    }
 };
 void RunServer(const std::string & addr) {
    PrivacyFilterBackend service;
    grpc::EnableDefaultHealthCheckService(true);
    grpc::reflection::InitProtoReflectionServerBuilderPlugin();
    ServerBuilder builder;
    builder.AddListeningPort(addr, grpc::InsecureServerCredentials());
    builder.RegisterService(&service);
    builder.SetMaxReceiveMessageSize(64 * 1024 * 1024);
    builder.SetMaxSendMessageSize(64 * 1024 * 1024);
    std::unique_ptr<Server> server(builder.BuildAndStart());
    if (!server) {
        std::cerr << "privacy-filter grpc-server: failed to bind " << addr << "\n";
        std::exit(1);
    }
    g_server = server.get();
    std::cerr << "privacy-filter grpc-server listening on " << addr << "\n";
    server->Wait();
 }
 void signal_handler(int) {
    if (auto * srv = g_server.load()) {
        srv->Shutdown(std::chrono::system_clock::now() + std::chrono::seconds(3));
    }
 }
 } // namespace
 int main(int argc, char * argv[]) {
    std::string addr = "127.0.0.1:50051";
    for (int i = 1; i < argc; ++i) {
        std::string a = argv[i];
        const std::string addr_flag = "--addr=";
        if (a.rfind(addr_flag, 0) == 0)      addr = a.substr(addr_flag.size());
        else if (a == "--addr" && i + 1 < argc) addr = argv[++i];
        else if (a == "--help" || a == "-h") {
            std::cout << "Usage: grpc-server --addr=HOST:PORT\n";
            return 0;
        }
    }
    std::signal(SIGINT,  signal_handler);
    std::signal(SIGTERM, signal_handler);
    RunServer(addr);
    return 0;
 }
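Note on the removed grpc-server.cpp: the device priority documented above `resolve_device` (explicit `device:...` option first, then a non-zero NGPULayers, else CPU) can be tried in isolation. The sketch below is an assumption-laden stand-in: `resolve_device_sketch` is a hypothetical name, and it takes the option strings and layer count as plain arguments instead of a `backend::ModelOptions` message, so it compiles without the generated protobuf headers.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for resolve_device(): same priority order, but the
// inputs are passed directly rather than read from backend::ModelOptions.
std::string resolve_device_sketch(const std::vector<std::string>& options,
                                  int ngpulayers) {
    const std::string prefix = "device:";
    for (const auto& o : options) {
        // rfind(prefix, 0) == 0 is a "starts with" check: it only matches
        // when the prefix begins at index 0.
        if (o.rfind(prefix, 0) == 0) {
            return o.substr(prefix.size());  // e.g. "vulkan:1" from "device:vulkan:1"
        }
    }
    // No explicit device: a non-zero layer count is the coarse "use the GPU" signal.
    return ngpulayers > 0 ? "gpu" : "cpu";
}
```

An explicit option wins even when a layer count is set, so `{"device:cuda"}` with 99 layers resolves to `cuda`, while an empty option list with 4 layers resolves to `gpu`. An option such as `mydevice:x` does not match, because the prefix must start the string.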
--- a/backend/cpp/privacy-filter/package.sh
+++ b/backend/cpp/privacy-filter/package.sh
@@ -1,39 +0,0 @@
 #!/bin/bash
 # Assemble package/ for the from-scratch backend image: the grpc-server binary,
 # run.sh, the dynamic loader, and every shared library the binary needs.
 set -e
 CURDIR=$(dirname "$(realpath "$0")")
 REPO_ROOT="${CURDIR}/../../.."
 mkdir -p "$CURDIR/package/lib"
 cp -avf "$CURDIR/grpc-server" "$CURDIR/package/"
 cp -rfv "$CURDIR/run.sh"      "$CURDIR/package/"
 # The dynamic loader, renamed to lib/ld.so so run.sh can invoke it explicitly
 # (makes the image independent of the host's glibc layout).
 if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
 elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
 else
    echo "package.sh: unknown architecture" >&2; exit 1
 fi
 # Bundle the binary's transitive shared deps (libstdc++, libgomp, and the apt
 # grpc++/protobuf/absl stack) by walking ldd — robust to whichever of those are
 # linked shared vs static. The loader line (no "=>") is skipped; ld.so above
 # already covers it.
 ldd "$CURDIR/grpc-server" | awk '$2 == "=>" && $3 ~ /^\// { print $3 }' | sort -u | \
 while read -r so; do
    [ -f "$so" ] && cp -arfLv "$so" "$CURDIR/package/lib/"
 done
 # Vulkan loader / GPU libs when building the GPU variant.
 GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
 if [ -f "$GPU_LIB_SCRIPT" ]; then
    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
    package_gpu_libs
 fi
 echo "privacy-filter package contents:"
 ls -lah "$CURDIR/package/" "$CURDIR/package/lib/"
--- a/backend/cpp/privacy-filter/run.sh
+++ b/backend/cpp/privacy-filter/run.sh
@@ -1,15 +0,0 @@
 #!/bin/bash
 # Entry point for the privacy-filter backend image / BACKEND_BINARY mode.
 set -e
 CURDIR=$(dirname "$(realpath "$0")")
 # macOS has no bundled ld.so; the darwin package ships only dylibs under lib/,
 # resolved via DYLD_LIBRARY_PATH (the ld.so branch below is skipped there).
 if [ "$(uname)" = "Darwin" ]; then
    export DYLD_LIBRARY_PATH="$CURDIR/lib:$DYLD_LIBRARY_PATH"
 else
    export LD_LIBRARY_PATH="$CURDIR/lib:$LD_LIBRARY_PATH"
 fi
 if [ -f "$CURDIR/lib/ld.so" ]; then
    exec "$CURDIR/lib/ld.so" "$CURDIR/grpc-server" "$@"
 fi
 exec "$CURDIR/grpc-server" "$@"
--- a/backend/cpp/run-unit-tests.sh
+++ b/backend/cpp/run-unit-tests.sh
@@ -1,71 +0,0 @@
 #!/bin/bash
 #
 # Discovers and runs every standalone C++ unit test under backend/cpp/.
 #
 # A "standalone" unit test is a *_test.cpp that depends only on the C++ standard
 # library and nlohmann/json (single header) - i.e. it exercises pure helpers and
 # does not need the full llama.cpp + gRPC backend build. Tests that DO need the
 # backend build use the CMake/ctest path (e.g. -DLLAMA_GRPC_BUILD_TESTS=ON)
 # instead and are skipped here.
 #
 # This keeps CI generic: adding a new pure-C++ unit test file named *_test.cpp in
 # an active backend source dir is picked up automatically, with no CI edits.
 #
 # Env:
 #   NLOHMANN_INCLUDE  include dir that contains nlohmann/json.hpp. If unset, the
 #                     nlohmann/json single header is fetched to a temp dir.
 #   CXX               compiler (default: g++).
 #   JSON_VERSION      nlohmann/json tag to fetch when NLOHMANN_INCLUDE is unset
 #                     (default: v3.11.3).
 set -uo pipefail
 ROOT="$(cd "$(dirname "$0")" && pwd)"
 CXX="${CXX:-g++}"
 JSON_VERSION="${JSON_VERSION:-v3.11.3}"
 JSON_INC="${NLOHMANN_INCLUDE:-}"
 if [ -z "$JSON_INC" ]; then
    JSON_INC="$(mktemp -d)"
    mkdir -p "$JSON_INC/nlohmann"
    echo "Fetching nlohmann/json ${JSON_VERSION} single header..."
    if ! curl -L -sf \
        "https://raw.githubusercontent.com/nlohmann/json/${JSON_VERSION}/single_include/nlohmann/json.hpp" \
        -o "$JSON_INC/nlohmann/json.hpp"; then
        echo "ERROR: failed to fetch nlohmann/json header" >&2
        exit 1
    fi
 fi
 # Active source dirs only - exclude per-variant build copies, dev snapshots and
 # the vendored upstream llama.cpp tree.
 mapfile -t tests < <(find "$ROOT" -name '*_test.cpp' \
    -not -path '*/llama.cpp/*' \
    -not -path '*-build/*' \
    -not -path '*-dev/*' \
    -not -path '*fallback*' | sort)
 if [ "${#tests[@]}" -eq 0 ]; then
    echo "No standalone C++ unit tests found under $ROOT"
    exit 0
 fi
 fail=0
 for test_src in "${tests[@]}"; do
    name="$(basename "$test_src" .cpp)"
    bin="$(mktemp -d)/$name"
    echo "==> $test_src"
    if ! "$CXX" -std=c++17 -Wall -Wextra \
        -I"$JSON_INC" -I"$(dirname "$test_src")" \
        "$test_src" -o "$bin"; then
        echo "COMPILE FAILED: $test_src" >&2
        fail=1
        continue
    fi
    if ! "$bin"; then
        echo "TEST FAILED: $test_src" >&2
        fail=1
    fi
 done
 echo "Ran ${#tests[@]} standalone C++ unit test file(s)"
 exit "$fail"
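The discovery rule in the script above (every `*_test.cpp`, minus vendored and per-variant build trees) can be restated as a single predicate. The following is a minimal Go sketch, not code from the repository: `isStandaloneTest` and the sample paths are hypothetical, and it assumes that substring checks are an adequate stand-in for `find`'s `-path` globs, where `*` may span `/`.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isStandaloneTest reports whether path would survive the script's find
// filter: a *_test.cpp file that is not inside an excluded tree.
// Hypothetical helper, for illustration only.
func isStandaloneTest(path string) bool {
	p := filepath.ToSlash(path)
	if !strings.HasSuffix(p, "_test.cpp") {
		return false
	}
	// Each -not -path '*X*' exclusion becomes a substring test on the path.
	for _, excluded := range []string{"/llama.cpp/", "-build/", "-dev/", "fallback"} {
		if strings.Contains(p, excluded) {
			return false
		}
	}
	return true
}

func main() {
	// Sample paths are made up to exercise each rule.
	for _, p := range []string{
		"backend/cpp/llama-cpp/json_helpers_test.cpp",
		"backend/cpp/llama-cpp/llama.cpp/tests/foo_test.cpp",
		"backend/cpp/llama-cpp-avx2-build/json_helpers_test.cpp",
		"backend/cpp/turboquant-fallback/bar_test.cpp",
	} {
		fmt.Printf("%-58s %v\n", p, isStandaloneTest(p))
	}
}
```

Note that the `fallback` exclusion matches anywhere in the path, so a test file whose own name contains `fallback` would also be skipped.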
--- a/backend/cpp/turboquant/Makefile
+++ b/backend/cpp/turboquant/Makefile
@@ -65,29 +65,6 @@ turboquant-avx:
 turboquant-fallback:
 	$(call turboquant-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
 # Single-build CPU backend via ggml CPU_ALL_VARIANTS (mirrors llama-cpp-cpu-all).
 # turboquant reuses backend/cpp/llama-cpp's CMakeLists.txt (hw_grpc_proto STATIC) and
 # Makefile (SHARED_LIBS make-var + EXTRA_CMAKE_ARGS), so this passes the same overrides
 # through to the copied build: SHARED_LIBS=ON, the DL flags, and --target ggml (which
 # pulls in the per-microarch libggml-cpu-*.so via ggml's add_dependencies). The .so set
 # is collected for package.sh to bundle into package/lib.
 turboquant-cpu-all:
 	rm -rf $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build
 	cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build
 	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build purge
 	bash $(CURRENT_MAKEFILE_DIR)/patch-grpc-server.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/grpc-server.cpp
 	$(info $(GREEN)I turboquant build info:cpu-all-variants$(RESET))
 	LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
 	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build llama.cpp
 	bash $(CURRENT_MAKEFILE_DIR)/apply-patches.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/llama.cpp $(PATCHES_DIR)
 	SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" \
 	LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
 	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build grpc-server
 	cp -rfv $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/grpc-server turboquant-cpu-all
 	rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs
 	find $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \;
 	@echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/
 turboquant-grpc:
 	$(call turboquant-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target rpc-server)
--- a/backend/cpp/turboquant/package.sh
+++ b/backend/cpp/turboquant/package.sh
@@ -14,15 +14,6 @@ mkdir -p $CURDIR/package/lib
 cp -avrf $CURDIR/turboquant-* $CURDIR/package/
 cp -rfv $CURDIR/run.sh $CURDIR/package/
 # Bundle the ggml shared backends from the CPU_ALL_VARIANTS build into package/lib. ggml
 # discovers the per-microarch libggml-cpu-*.so by scanning the executable directory, which
 # (via the bundled lib/ld.so that run.sh launches through) resolves to lib/. See the
 # matching comment in backend/cpp/llama-cpp/package.sh. No-op on the fallback/ROCm builds.
 if [ -d "$CURDIR/ggml-shared-libs" ]; then
    echo "Bundling ggml shared backends (CPU_ALL_VARIANTS)..."
    cp -avf $CURDIR/ggml-shared-libs/*.so* $CURDIR/package/lib/
 fi
 # Detect architecture and copy appropriate libraries
 if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    # x86_64 architecture
--- a/backend/cpp/turboquant/run.sh
+++ b/backend/cpp/turboquant/run.sh
@@ -2,7 +2,7 @@
 set -ex
 # Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath "$0")")
+CURDIR=$(dirname "$(realpath $0)")
 cd /
@@ -12,39 +12,54 @@ grep -e "flags" /proc/cpuinfo | head -1
 BINARY=turboquant-fallback
-# x86/arm64 ship a single turboquant-cpu-all built with ggml CPU_ALL_VARIANTS: ggml's
-# backend registry dlopens the best libggml-cpu-*.so for this host, so no shell-side
-# probing. ROCm ships only turboquant-fallback, so fall back to it when cpu-all is absent.
-if [ -e "$CURDIR"/turboquant-cpu-all ]; then
-	BINARY=turboquant-cpu-all
+if grep -q -e "\savx\s" /proc/cpuinfo ; then
+	echo "CPU:    AVX    found OK"
+	if [ -e $CURDIR/turboquant-avx ]; then
+		BINARY=turboquant-avx
+	fi
 fi
 if grep -q -e "\savx2\s" /proc/cpuinfo ; then
 	echo "CPU:    AVX2   found OK"
 	if [ -e $CURDIR/turboquant-avx2 ]; then
 		BINARY=turboquant-avx2
 	fi
 fi
 # Check avx 512
 if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
 	echo "CPU:    AVX512F found OK"
 	if [ -e $CURDIR/turboquant-avx512 ]; then
 		BINARY=turboquant-avx512
 	fi
 fi
 if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then
-	if [ -e "$CURDIR"/turboquant-grpc ]; then
+	if [ -e $CURDIR/turboquant-grpc ]; then
 		BINARY=turboquant-grpc
 	fi
 fi
 # Extend ld library path with the dir where this script is located/lib
 if [ "$(uname)" == "Darwin" ]; then
-	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
+	export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
 else
-	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
+	export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 	# Tell rocBLAS where to find TensileLibrary data (GPU kernel tuning files)
 	if [ -d "$CURDIR/lib/rocblas/library" ]; then
-		export ROCBLAS_TENSILE_LIBPATH="$CURDIR"/lib/rocblas/library
+		export ROCBLAS_TENSILE_LIBPATH=$CURDIR/lib/rocblas/library
 	fi
 fi
 # If there is a lib/ld.so, use it
-if [ -f "$CURDIR"/lib/ld.so ]; then
+if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using binary: $BINARY"
-	exec "$CURDIR"/lib/ld.so "$CURDIR"/$BINARY "$@"
+	exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
 fi
 echo "Using binary: $BINARY"
-exec "$CURDIR"/$BINARY "$@"
+exec $CURDIR/$BINARY "$@"
 # We should never reach this point, however just in case we do, run fallback
-exec "$CURDIR"/turboquant-fallback "$@"
+exec $CURDIR/turboquant-fallback "$@"
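The shell-side probing that this hunk restores follows a "last match wins" order: start from the fallback, let each wider instruction set override the previous choice when its binary is present, then let the gRPC variant override everything when `LLAMACPP_GRPC_SERVERS` is set. The Go sketch below models only that selection logic. `pickBinary` and its parameters are hypothetical, and splitting the flags line into whitespace-separated tokens is an approximation of the script's `grep "\savx\s"` checks.

```go
package main

import (
	"fmt"
	"strings"
)

// pickBinary mirrors the order of checks in run.sh. cpuFlags is the "flags"
// line from /proc/cpuinfo, shipped says which binaries exist next to run.sh,
// and grpcServers is the value of LLAMACPP_GRPC_SERVERS.
// Hypothetical helper, for illustration only.
func pickBinary(cpuFlags string, shipped map[string]bool, grpcServers string) string {
	flags := map[string]bool{}
	for _, f := range strings.Fields(cpuFlags) {
		flags[f] = true
	}

	binary := "turboquant-fallback"
	// Later entries override earlier ones, as in the script.
	for _, c := range []struct{ flag, bin string }{
		{"avx", "turboquant-avx"},
		{"avx2", "turboquant-avx2"},
		{"avx512f", "turboquant-avx512"},
	} {
		if flags[c.flag] && shipped[c.bin] {
			binary = c.bin
		}
	}
	if grpcServers != "" && shipped["turboquant-grpc"] {
		binary = "turboquant-grpc"
	}
	return binary
}

func main() {
	shipped := map[string]bool{"turboquant-avx": true, "turboquant-avx2": true}
	fmt.Println(pickBinary("flags : fpu sse avx avx2 avx512f", shipped, ""))
}
```

Because a binary is chosen only when it was actually shipped, a host that reports AVX-512 still runs the AVX2 build if no `turboquant-avx512` is packaged.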
--- a/backend/go/acestep-cpp/Makefile
+++ b/backend/go/acestep-cpp/Makefile
@@ -117,8 +117,7 @@ libgoacestepcpp-custom: CMakeLists.txt cpp/goacestepcpp.cpp cpp/goacestepcpp.h
 	cmake .. $(CMAKE_ARGS) && \
 	cmake --build . --config Release -j$(JOBS) --target goacestepcpp && \
 	cd .. && \
-	(mv build-$(SO_TARGET)/libgoacestepcpp.so ./$(SO_TARGET) 2>/dev/null || \
-	 mv build-$(SO_TARGET)/libgoacestepcpp.dylib ./$(SO_TARGET) 2>/dev/null)
+	mv build-$(SO_TARGET)/libgoacestepcpp.so ./$(SO_TARGET)
 test: acestep-cpp
 	@echo "Running acestep-cpp tests..."
--- a/backend/go/acestep-cpp/main.go
+++ b/backend/go/acestep-cpp/main.go
@@ -4,7 +4,6 @@ package main
 import (
 	"flag"
 	"os"
-	"runtime"
 	"github.com/ebitengine/purego"
 	grpc "github.com/mudler/LocalAI/pkg/grpc"
@@ -23,11 +22,7 @@ func main() {
 	// Get library name from environment variable, default to fallback
 	libName := os.Getenv("ACESTEP_LIBRARY")
 	if libName == "" {
-		if runtime.GOOS == "darwin" {
-			libName = "./libgoacestepcpp-fallback.dylib"
-		} else {
-			libName = "./libgoacestepcpp-fallback.so"
-		}
+		libName = "./libgoacestepcpp-fallback.so"
 	}
 	gosd, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
--- a/backend/go/acestep-cpp/package.sh
+++ b/backend/go/acestep-cpp/package.sh
@@ -13,7 +13,6 @@ mkdir -p $CURDIR/package/lib
 cp -avf $CURDIR/acestep-cpp $CURDIR/package/
 cp -fv $CURDIR/libgoacestepcpp-*.so $CURDIR/package/
 cp -fv $CURDIR/libgoacestepcpp-*.dylib $CURDIR/package/ 2>/dev/null || true
 cp -fv $CURDIR/run.sh $CURDIR/package/
 # Detect architecture and copy appropriate libraries
--- a/backend/go/acestep-cpp/run.sh
+++ b/backend/go/acestep-cpp/run.sh
@@ -2,7 +2,7 @@
 set -ex
 # Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath "$0")")
+CURDIR=$(dirname "$(realpath $0)")
 cd /
@@ -12,29 +12,19 @@ if [ "$(uname)" != "Darwin" ]; then
 	grep -e "flags" /proc/cpuinfo | head -1
 fi
-if [ "$(uname)" = "Darwin" ]; then
-	# macOS: single library variant (Metal or Accelerate). The goacestepcpp
-	# target is built as a CMake MODULE, which emits a .dylib for a SHARED
-	# build but a .so for a MODULE build on Apple, so prefer .dylib and fall
-	# back to .so.
-	LIBRARY="$CURDIR/libgoacestepcpp-fallback.dylib"
-	if [ ! -e "$LIBRARY" ]; then
-		LIBRARY="$CURDIR/libgoacestepcpp-fallback.so"
-	fi
-	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
-else
-	LIBRARY="$CURDIR/libgoacestepcpp-fallback.so"
+LIBRARY="$CURDIR/libgoacestepcpp-fallback.so"
+if [ "$(uname)" != "Darwin" ]; then
 	if grep -q -e "\savx\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX    found OK"
-		if [ -e "$CURDIR"/libgoacestepcpp-avx.so ]; then
+		if [ -e $CURDIR/libgoacestepcpp-avx.so ]; then
 			LIBRARY="$CURDIR/libgoacestepcpp-avx.so"
 		fi
 	fi
 	if grep -q -e "\savx2\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX2   found OK"
-		if [ -e "$CURDIR"/libgoacestepcpp-avx2.so ]; then
+		if [ -e $CURDIR/libgoacestepcpp-avx2.so ]; then
 			LIBRARY="$CURDIR/libgoacestepcpp-avx2.so"
 		fi
 	fi
@@ -42,22 +32,21 @@ else
 	# Check avx 512
 	if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX512F found OK"
-		if [ -e "$CURDIR"/libgoacestepcpp-avx512.so ]; then
+		if [ -e $CURDIR/libgoacestepcpp-avx512.so ]; then
 			LIBRARY="$CURDIR/libgoacestepcpp-avx512.so"
 		fi
 	fi
-	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
 fi
+export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 export ACESTEP_LIBRARY=$LIBRARY
 # If there is a lib/ld.so, use it
-if [ -f "$CURDIR"/lib/ld.so ]; then
+if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using library: $LIBRARY"
-	exec "$CURDIR"/lib/ld.so "$CURDIR"/acestep-cpp "$@"
+	exec $CURDIR/lib/ld.so $CURDIR/acestep-cpp "$@"
 fi
 echo "Using library: $LIBRARY"
-exec "$CURDIR"/acestep-cpp "$@"
+exec $CURDIR/acestep-cpp "$@"
--- a/backend/go/ced/Makefile
+++ b/backend/go/ced/Makefile
@@ -1,78 +0,0 @@
 # ced sound-classification backend Makefile.
 #
 # Upstream pin lives below as CED_VERSION?=<sha> so .github/bump_deps.sh can find
 # and update it (matches the parakeet-cpp / whisper.cpp convention).
 #
 # Local dev shortcut: symlink an out-of-tree ced.cpp shared build + header and
 # skip the clone/cmake steps entirely:
 #   ln -sf /path/to/ced.cpp/build-shared/libced.so .
 #   ln -sf /path/to/ced.cpp/include/ced_capi.h .
 #   go build -o ced-grpc .
 CED_VERSION?=c04ac14b7992d00584d9e812c9bb6268598a6ce7
 CED_REPO?=https://github.com/mudler/ced.cpp
 GOCMD?=go
 GO_TAGS?=
 JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
 BUILD_TYPE?=
 NATIVE?=false
 # Static-link ggml into libced.so (PIC) so the shared lib is self-contained:
 # dlopen needs no libggml*.so alongside it, only system libs the runtime image
 # already provides.
 CMAKE_ARGS?=-DCMAKE_BUILD_TYPE=Release -DCED_SHARED=ON -DCED_BUILD_CLI=OFF -DCED_BUILD_TESTS=OFF -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON
 ifeq ($(NATIVE),false)
 	CMAKE_ARGS+=-DGGML_NATIVE=OFF
 endif
 # ced.cpp gates its ggml backends behind CED_GGML_* options (set(... CACHE BOOL
 # "" FORCE)), so forward those instead of a bare -DGGML_CUDA=ON.
 ifeq ($(BUILD_TYPE),cublas)
 	CMAKE_ARGS+=-DCED_GGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON
 else ifeq ($(BUILD_TYPE),openblas)
 	CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
 else ifeq ($(BUILD_TYPE),hipblas)
 	CMAKE_ARGS+=-DCED_GGML_HIP=ON
 else ifeq ($(BUILD_TYPE),vulkan)
 	CMAKE_ARGS+=-DCED_GGML_VULKAN=ON
 endif
 .PHONY: ced-grpc package build clean purge test all
 all: ced-grpc
 sources/ced.cpp:
 	mkdir -p sources/ced.cpp
 	cd sources/ced.cpp && \
 	git init -q && \
 	git remote add origin $(CED_REPO) && \
 	git fetch --depth 1 origin $(CED_VERSION) && \
 	git checkout FETCH_HEAD && \
 	git submodule update --init --recursive --depth 1 --single-branch
 libced.so: sources/ced.cpp
 	cmake -B sources/ced.cpp/build-shared -S sources/ced.cpp $(CMAKE_ARGS)
 	cmake --build sources/ced.cpp/build-shared --config Release -j$(JOBS)
 	cp -fv sources/ced.cpp/build-shared/libced.so* ./ 2>/dev/null || true
 	cp -fv sources/ced.cpp/build-shared/libced.dylib ./ 2>/dev/null || true
 	cp -fv sources/ced.cpp/include/ced_capi.h ./
 ced-grpc: libced.so main.go goced.go
 	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o ced-grpc .
 package: ced-grpc
 	bash package.sh
 build: package
 test:
 	LD_LIBRARY_PATH=$(CURDIR):$$LD_LIBRARY_PATH $(GOCMD) test ./... -count=1
 clean: purge
 	rm -rf libced.so* ced_capi.h package ced-grpc
 purge:
 	rm -rf sources/ced.cpp
--- a/backend/go/ced/goced.go
+++ b/backend/go/ced/goced.go
@@ -1,130 +0,0 @@
 package main
 // Go side of the ced backend: purego bindings over ced_capi.h plus the gRPC
 // SoundDetection implementation.
 //
 // SKETCH: the pb.SoundDetection* types come from backend.proto (regenerate with
 // `make protogen-go`). The C side is single-threaded per ctx, so we guard the
 // engine with engineMu; LocalAI also serializes via base.SingleThread.
 import (
 	"context"
 	"encoding/json"
 	"errors"
 	"fmt"
 	"sort"
 	"sync"
 	"unsafe"
 	"github.com/mudler/LocalAI/pkg/grpc/base"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 )
 // purego-bound entry points from libced.so. Names match ced_capi.h exactly.
 var (
 	CppAbiVersion       func() int32
 	CppLoad             func(ggufPath string) uintptr
 	CppFree             func(ctx uintptr)
 	CppLastError        func(ctx uintptr) string
 	CppNumClasses       func(ctx uintptr) int32
 	CppSampleRate       func(ctx uintptr) int32
 	CppClassifyPathJSON func(ctx uintptr, wavPath string, topK int32) uintptr
 	CppClassifyPcmJSON  func(ctx uintptr, pcm []float32, nSamples int32, sampleRate int32, topK int32) uintptr
 	CppFreeString       func(s uintptr)
 )
 // cstr copies a malloc'd C string (returned as uintptr) into a Go string and
 // frees the original via ced_capi_free_string. Empty/0 -> "".
 func cstr(p uintptr) string {
 	if p == 0 {
 		return ""
 	}
 	defer CppFreeString(p)
 	var b []byte
 	for i := 0; ; i++ {
 		ch := *(*byte)(unsafe.Pointer(p + uintptr(i))) //nolint:govet // #nosec G103 -- C-owned NUL-terminated string from libced (not Go-GC memory)
 		if ch == 0 {
 			break
 		}
 		b = append(b, ch)
 	}
 	return string(b)
 }
 // Ced is the gRPC backend. One loaded CED model per instance.
 type Ced struct {
 	base.Base
 	ctxPtr   uintptr
 	engineMu sync.Mutex
 }
 // Load resolves the GGUF and opens the C-API context.
 func (c *Ced) Load(opts *pb.ModelOptions) error {
 	if opts.ModelFile == "" {
 		return errors.New("ced: ModelFile is required")
 	}
 	ctx := CppLoad(opts.ModelFile)
 	if ctx == 0 {
 		return fmt.Errorf("ced: ced_capi_load failed for %q: %s", opts.ModelFile, CppLastError(0))
 	}
 	c.ctxPtr = ctx
 	return nil
 }
 // jsonTag mirrors the ced_capi JSON tag objects.
 type jsonTag struct {
 	Index int     `json:"index"`
 	Score float32 `json:"score"`
 	Label string  `json:"label"`
 }
 // SoundDetection classifies the clip at req.Src and returns scored AudioSet tags.
 func (c *Ced) SoundDetection(ctx context.Context, req *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) {
 	if c.ctxPtr == 0 {
 		return nil, errors.New("ced: model not loaded")
 	}
 	if req.GetSrc() == "" {
 		return nil, errors.New("ced: SoundDetectionRequest.src (audio path) is required")
 	}
 	topK := req.GetTopK()
 	if topK <= 0 {
 		topK = 10 // sensible default for a tagging response
 	}
 	c.engineMu.Lock()
 	out := cstr(CppClassifyPathJSON(c.ctxPtr, req.GetSrc(), topK))
 	lastErr := CppLastError(c.ctxPtr)
 	c.engineMu.Unlock()
 	if out == "" {
 		return nil, fmt.Errorf("ced: classification failed: %s", lastErr)
 	}
 	var tags []jsonTag
 	if err := json.Unmarshal([]byte(out), &tags); err != nil {
 		return nil, fmt.Errorf("ced: bad classifier JSON: %w", err)
 	}
 	thr := req.GetThreshold()
 	resp := &pb.SoundDetectionResponse{}
 	for _, t := range tags {
 		if t.Score < thr {
 			continue
 		}
 		resp.Detections = append(resp.Detections, &pb.SoundClass{
 			Label: t.Label, Score: t.Score, Index: int32(t.Index),
 		})
 	}
 	sort.Slice(resp.Detections, func(i, j int) bool {
 		return resp.Detections[i].Score > resp.Detections[j].Score
 	})
 	return resp, nil
 }
 func (c *Ced) Free() error {
 	c.engineMu.Lock()
 	defer c.engineMu.Unlock()
 	if c.ctxPtr != 0 {
 		CppFree(c.ctxPtr)
 		c.ctxPtr = 0
 	}
 	return nil
 }
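The post-processing inside `SoundDetection` (decode the classifier's JSON array, drop tags below the request threshold, sort by descending score) can be exercised without libced or the generated protobuf types. The sketch below is a standalone approximation under those assumptions: `filterTags` is a hypothetical name, and the sample JSON, labels and scores are made up rather than real classifier output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// tag mirrors the JSON objects described in goced.go (index, score, label).
type tag struct {
	Index int     `json:"index"`
	Score float32 `json:"score"`
	Label string  `json:"label"`
}

// filterTags decodes raw, keeps tags scoring at least threshold, and returns
// them sorted from highest to lowest score. Hypothetical helper.
func filterTags(raw string, threshold float32) ([]tag, error) {
	var tags []tag
	if err := json.Unmarshal([]byte(raw), &tags); err != nil {
		return nil, fmt.Errorf("bad classifier JSON: %w", err)
	}
	kept := make([]tag, 0, len(tags))
	for _, t := range tags {
		if t.Score < threshold {
			continue
		}
		kept = append(kept, t)
	}
	sort.Slice(kept, func(i, j int) bool { return kept[i].Score > kept[j].Score })
	return kept, nil
}

func main() {
	// Made-up input in the shape the backend expects from the C API.
	raw := `[{"index":0,"score":0.12,"label":"Speech"},
	         {"index":7,"score":0.81,"label":"Dog"},
	         {"index":3,"score":0.40,"label":"Music"}]`
	tags, err := filterTags(raw, 0.3)
	if err != nil {
		panic(err)
	}
	for _, t := range tags {
		fmt.Printf("%s (index %d): %.2f\n", t.Label, t.Index, t.Score)
	}
}
```

A zero threshold keeps every tag, which matches the backend's behaviour when the request leaves the threshold unset.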
--- a/backend/go/ced/main.go
+++ b/backend/go/ced/main.go
@@ -1,64 +0,0 @@
 package main
 // ced sound-classification backend. Started internally by LocalAI: one gRPC
 // server per loaded model. Loads libced.so via purego and registers the flat
 // C-API declared in ced_capi.h. The library name can be overridden with
 // CED_LIBRARY (mirrors PARAKEET_LIBRARY / WHISPER_LIBRARY); the default looks
 // for the .so next to this binary.
 //
 // SKETCH: requires `make protogen-go` after the backend.proto SoundDetection
 // addition, and a built libced.so (see Makefile). See DESIGN.md.
 import (
 	"flag"
 	"fmt"
 	"os"
 	"runtime"
 	"github.com/ebitengine/purego"
 	grpc "github.com/mudler/LocalAI/pkg/grpc"
 )
 var addr = flag.String("addr", "localhost:50051", "the address to connect to")
 type libFunc struct {
 	ptr  any
 	name string
 }
 func main() {
 	libName := os.Getenv("CED_LIBRARY")
 	if libName == "" {
 		if runtime.GOOS == "darwin" {
 			libName = "libced.dylib"
 		} else {
 			libName = "libced.so"
 		}
 	}
 	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
 	if err != nil {
 		panic(fmt.Errorf("ced: dlopen %q: %w", libName, err))
 	}
 	// Bound 1:1 to ced_capi.h. char*-returning functions are declared uintptr
 	// so we can free the same pointer with ced_capi_free_string after copying
 	// (purego's string return would copy and leak the original).
 	for _, lf := range []libFunc{
 		{&CppAbiVersion, "ced_capi_abi_version"},
 		{&CppLoad, "ced_capi_load"},
 		{&CppFree, "ced_capi_free"},
 		{&CppLastError, "ced_capi_last_error"},
 		{&CppNumClasses, "ced_capi_num_classes"},
 		{&CppSampleRate, "ced_capi_sample_rate"},
 		{&CppClassifyPathJSON, "ced_capi_classify_path_json"},
 		{&CppClassifyPcmJSON, "ced_capi_classify_pcm_json"},
 		{&CppFreeString, "ced_capi_free_string"},
 	} {
 		purego.RegisterLibFunc(lf.ptr, lib, lf.name)
 	}
 	fmt.Fprintf(os.Stderr, "[ced] ABI=%d\n", CppAbiVersion())
 	flag.Parse()
 	if err := grpc.StartServer(*addr, &Ced{}); err != nil {
 		panic(err)
 	}
 }
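The library lookup in `main` has two layers: an explicit `CED_LIBRARY` override, then a per-OS default name. A minimal sketch of that decision as a pure function follows. `resolveLibrary` is a hypothetical name introduced here so the rule can be tested without touching the process environment or calling `purego.Dlopen`.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// resolveLibrary returns the shared-library name to dlopen: the override when
// one is set, otherwise libced.dylib on darwin and libced.so everywhere else.
// Hypothetical helper, for illustration only.
func resolveLibrary(override, goos string) string {
	if override != "" {
		return override
	}
	if goos == "darwin" {
		return "libced.dylib"
	}
	return "libced.so"
}

func main() {
	// Same inputs the backend reads: the CED_LIBRARY env var and runtime.GOOS.
	fmt.Println(resolveLibrary(os.Getenv("CED_LIBRARY"), runtime.GOOS))
}
```

The default is a bare file name, so the dynamic loader has to locate it through its search path, which is why `run.sh` extends `LD_LIBRARY_PATH` / `DYLD_LIBRARY_PATH` and, on Darwin, sets `CED_LIBRARY` to an absolute path.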
--- a/backend/go/ced/package.sh
+++ b/backend/go/ced/package.sh
@@ -1,62 +0,0 @@
 #!/bin/bash
 #
 # Bundle the ced-grpc binary, libced.so, the core runtime libs (libc/libstdc++/
 # libgomp + ld.so) and the GPU runtime for the active BUILD_TYPE so the package
 # is self-contained. Mirrors backend/go/parakeet-cpp/package.sh; run.sh routes
 # the (CGO_ENABLED=0) binary through lib/ld.so so the packaged libc is used.
 set -e
 CURDIR=$(dirname "$(realpath "$0")")
 REPO_ROOT="${CURDIR}/../../.."
 mkdir -p "$CURDIR/package/lib"
 cp -avf "$CURDIR/ced-grpc" "$CURDIR/package/"
 cp -avf "$CURDIR/run.sh" "$CURDIR/package/"
 cp -avf "$CURDIR"/libced.so* "$CURDIR/package/lib/" 2>/dev/null || true
 cp -avf "$CURDIR"/libced.dylib "$CURDIR/package/lib/" 2>/dev/null || true
 if ! ls "$CURDIR"/package/lib/libced.* >/dev/null 2>&1; then
 	echo "ERROR: libced shared library not found in $CURDIR, run 'make' first" >&2
 	exit 1
 fi
 if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    echo "Detected x86_64 architecture, copying x86_64 libraries..."
    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
 elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
    echo "Detected ARM64 architecture, copying ARM64 libraries..."
    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
 elif [ "$(uname -s)" = "Darwin" ]; then
    echo "Detected Darwin"
 else
    echo "Error: Could not detect architecture"
    exit 1
 fi
 GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
 if [ -f "$GPU_LIB_SCRIPT" ]; then
    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
    package_gpu_libs
 fi
 echo "Packaging completed successfully"
 ls -liah "$CURDIR/package/" "$CURDIR/package/lib/"
--- a/backend/go/ced/run.sh
+++ b/backend/go/ced/run.sh
@@ -1,20 +0,0 @@
 #!/bin/bash
 set -e
 CURDIR=$(dirname "$(realpath "$0")")
 if [ "$(uname)" = "Darwin" ]; then
 	export DYLD_LIBRARY_PATH="$CURDIR/lib:"$CURDIR":${DYLD_LIBRARY_PATH:-}"
 	export CED_LIBRARY="$CURDIR/lib/libced.dylib"
 else
 	export LD_LIBRARY_PATH="$CURDIR/lib:"$CURDIR":${LD_LIBRARY_PATH:-}"
 fi
 # If a self-contained ld.so was packaged, route through it so the packaged
 # libc / libstdc++ are used instead of the host's (matches the sibling backends).
 if [ -f "$CURDIR/lib/ld.so" ]; then
 	echo "Using lib/ld.so"
 	exec "$CURDIR/lib/ld.so" "$CURDIR/ced-grpc" "$@"
 fi
 exec "$CURDIR/ced-grpc" "$@"
--- a/backend/go/cloud-proxy/run.sh
+++ b/backend/go/cloud-proxy/run.sh
@@ -1,6 +1,6 @@
 #!/bin/bash
 set -ex
-CURDIR=$(dirname "$(realpath "$0")")
+CURDIR=$(dirname "$(realpath $0)")
-exec "$CURDIR"/cloud-proxy "$@"
+exec $CURDIR/cloud-proxy "$@"
--- a/backend/go/crispasr/Makefile
+++ b/backend/go/crispasr/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
 # CrispASR version (release tag)
 CRISPASR_REPO?=https://github.com/CrispStrobe/CrispASR
-CRISPASR_VERSION?=6b50f76e59700665358a1aabf5295597fa318e06
+CRISPASR_VERSION?=c29f6653a516a3001d923944dad8892072cc7334
 SO_TARGET?=libgocrispasr.so
 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
@@ -67,7 +67,7 @@ sources/CrispASR:
 	# it, so ${CMAKE_SOURCE_DIR} is THIS backend dir and the talk-llama sources
 	# aren't found. Rewrite to ${PROJECT_SOURCE_DIR} (the crispasr project root),
 	# which is correct both standalone and as a subproject. Idempotent.
-	sed -i.bak 's#\$${CMAKE_SOURCE_DIR}/examples/talk-llama#\$${PROJECT_SOURCE_DIR}/examples/talk-llama#' sources/CrispASR/src/CMakeLists.txt && rm -f sources/CrispASR/src/CMakeLists.txt.bak
+	sed -i 's#\$${CMAKE_SOURCE_DIR}/examples/talk-llama#\$${PROJECT_SOURCE_DIR}/examples/talk-llama#' sources/CrispASR/src/CMakeLists.txt
 # Detect OS
 UNAME_S := $(shell uname -s)
@@ -75,8 +75,7 @@ UNAME_S := $(shell uname -s)
 ifeq ($(UNAME_S),Linux)
 	VARIANT_TARGETS = libgocrispasr-avx.so libgocrispasr-avx2.so libgocrispasr-avx512.so libgocrispasr-fallback.so
 else
-	# On non-Linux (e.g., Darwin), build only fallback variant (as a dylib)
-	VARIANT_TARGETS = libgocrispasr-fallback.dylib
+	VARIANT_TARGETS = libgocrispasr-fallback.so
 endif
 crispasr: main.go gocrispasr.go $(VARIANT_TARGETS)
@@ -88,7 +87,7 @@ package: crispasr
 build: package
 clean: purge
-	rm -rf libgocrispasr*.so libgocrispasr*.dylib package sources/CrispASR crispasr
+	rm -rf libgocrispasr*.so package sources/CrispASR crispasr
 purge:
 	rm -rf build*
@@ -119,21 +118,13 @@ libgocrispasr-fallback.so: sources/CrispASR
 	SO_TARGET=libgocrispasr-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgocrispasr-custom
 	rm -rfv build*
 # Build fallback variant as a dylib (Darwin)
 libgocrispasr-fallback.dylib: sources/CrispASR
 	$(MAKE) purge
 	$(info ${GREEN}I crispasr build info:fallback (dylib)${RESET})
 	SO_TARGET=libgocrispasr-fallback.dylib CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgocrispasr-custom
 	rm -rfv build*
 libgocrispasr-custom: CMakeLists.txt cpp/crispasr_shim.cpp cpp/crispasr_shim.h
 	mkdir -p build-$(SO_TARGET) && \
 	cd build-$(SO_TARGET) && \
 	cmake .. $(CMAKE_ARGS) && \
 	cmake --build . --config Release -j$(JOBS) && \
 	cd .. && \
-	(mv build-$(SO_TARGET)/libgocrispasr.so ./$(SO_TARGET) 2>/dev/null || \
-	 mv build-$(SO_TARGET)/libgocrispasr.dylib ./$(SO_TARGET) 2>/dev/null)
+	mv build-$(SO_TARGET)/libgocrispasr.so ./$(SO_TARGET)
 test: crispasr
 	CGO_ENABLED=0 $(GOCMD) test -v ./...
--- a/backend/go/crispasr/cpp/crispasr_shim.cpp
+++ b/backend/go/crispasr/cpp/crispasr_shim.cpp
@@ -47,74 +47,6 @@ extern "C" void set_abort(int v) {
  g_abort.store(v, std::memory_order_relaxed);
 }
 // --- word-level timestamp accessors ---
 extern "C" {
 int crispasr_session_result_n_words(crispasr_session_result *r, int seg_i);
 const char *crispasr_session_result_word_text(crispasr_session_result *r,
                                               int seg_i, int word_i);
 int64_t crispasr_session_result_word_t0(crispasr_session_result *r, int seg_i,
                                         int word_i);
 int64_t crispasr_session_result_word_t1(crispasr_session_result *r, int seg_i,
                                         int word_i);
 // Parakeet-specific word accessors
 int crispasr_parakeet_result_n_words(void *r);
 const char *crispasr_parakeet_result_word_text(void *r, int word_i);
 int64_t crispasr_parakeet_result_word_t0(void *r, int word_i);
 int64_t crispasr_parakeet_result_word_t1(void *r, int word_i);
 }
 void *get_result(void) { return g_result; }
 int get_word_count(int seg_i) {
  if (!g_result)
    return 0;
  return crispasr_session_result_n_words(g_result, seg_i);
 }
 const char *get_word_text(int seg_i, int word_i) {
  if (!g_result)
    return "";
  return crispasr_session_result_word_text(g_result, seg_i, word_i);
 }
 int64_t get_word_t0(int seg_i, int word_i) {
  if (!g_result)
    return 0;
  return crispasr_session_result_word_t0(g_result, seg_i, word_i);
 }
 int64_t get_word_t1(int seg_i, int word_i) {
  if (!g_result)
    return 0;
  return crispasr_session_result_word_t1(g_result, seg_i, word_i);
 }
 // Parakeet-specific word accessors
 int get_parakeet_word_count(void) {
  if (!g_result)
    return 0;
  return crispasr_parakeet_result_n_words(g_result);
 }
 const char *get_parakeet_word_text(int word_i) {
  if (!g_result)
    return "";
  return crispasr_parakeet_result_word_text(g_result, word_i);
 }
 int64_t get_parakeet_word_t0(int word_i) {
  if (!g_result)
    return 0;
  return crispasr_parakeet_result_word_t0(g_result, word_i);
 }
 int64_t get_parakeet_word_t1(int word_i) {
  if (!g_result)
    return 0;
  return crispasr_parakeet_result_word_t1(g_result, word_i);
 }
 static void ggml_log_cb(enum ggml_log_level level, const char *log,
                        void *data) {
  const char *level_str;
--- a/backend/go/crispasr/cpp/crispasr_shim.h
+++ b/backend/go/crispasr/cpp/crispasr_shim.h
@@ -20,18 +20,4 @@ float *tts_synthesize(const char *text, int *out_n_samples); // 24kHz mono float
 void tts_free(float *pcm);
 int tts_set_voice(const char *name); // best-effort speaker selection; 0 ok
 int tts_set_voice_file(const char *path, const char *ref_text); // load voice pack (.gguf) or zero-shot clone (.wav + ref_text)
 // --- word-level timestamp accessors ---
 // Session-based (works for whisper-like backends)
 void *get_result(void);
 int get_word_count(int seg_i);
 const char *get_word_text(int seg_i, int word_i);
 int64_t get_word_t0(int seg_i, int word_i);
 int64_t get_word_t1(int seg_i, int word_i);
 // Parakeet-specific (global word list, no segment index)
 int get_parakeet_word_count(void);
 const char *get_parakeet_word_text(int word_i);
 int64_t get_parakeet_word_t0(int word_i);
 int64_t get_parakeet_word_t1(int word_i);
 }
--- a/backend/go/crispasr/gocrispasr.go
+++ b/backend/go/crispasr/gocrispasr.go
@@ -11,7 +11,6 @@ import (
 	"github.com/go-audio/audio"
 	"github.com/go-audio/wav"
 	gguf "github.com/gpustack/gguf-parser-go"
 	"github.com/mudler/LocalAI/pkg/grpc/base"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 	"github.com/mudler/LocalAI/pkg/utils"
@@ -34,55 +33,10 @@ var (
 	CppTTSFree         func(ptr uintptr)
 	CppTTSSetVoice     func(name string) int
 	CppTTSSetVoiceFile func(path string, refText string) int
 	// Word-level timestamp accessors (session-based, per-segment)
 	CppGetWordCount func(segI int) int
 	CppGetWordText  func(segI int, wordI int) string
 	CppGetWordT0    func(segI int, wordI int) int64
 	CppGetWordT1    func(segI int, wordI int) int64
 	// Parakeet-specific word accessors (global, no segment index)
 	CppGetParakeetWordCount func() int
 	CppGetParakeetWordText  func(wordI int) string
 	CppGetParakeetWordT0    func(wordI int) int64
 	CppGetParakeetWordT1    func(wordI int) int64
 )
 type CrispASR struct {
 	base.SingleThread
 	// sampleRate is the output rate (Hz) of the loaded TTS engine's PCM, used to
 	// write a correct WAV header. Most CrispASR TTS backends emit 24 kHz, but
 	// piper returns its model's native rate (16 kHz for x_low/low voices,
 	// 22.05 kHz for medium/high), so it is read from the GGUF metadata at Load.
 	sampleRate int
 }
 // defaultTTSSampleRate is the output rate assumed for CrispASR TTS engines that
 // don't advertise one in GGUF metadata (vibevoice/orpheus/chatterbox/qwen3-tts
 // all emit 24 kHz). piper is the exception and carries piper.sample_rate.
 const defaultTTSSampleRate = 24000
 // piperSampleRate reads the piper.sample_rate metadata key from a GGUF model.
 // CrispASR's piper backend returns PCM at the model's native rate without
 // resampling, so the WAV header must match it. Returns ok=false for non-piper
 // models (key absent) or an unreadable file, letting the caller fall back to
 // defaultTTSSampleRate.
 func piperSampleRate(modelPath string) (int, bool) {
 	// Only scalar architecture keys are read, so skip the large array metadata
 	// (phoneme map) and mmap the header - same rationale as pkg/vram's reader.
 	f, err := gguf.ParseGGUFFile(modelPath, gguf.UseMMap(), gguf.SkipLargeMetadata())
 	if err != nil {
 		return 0, false
 	}
 	kv, ok := f.Header.MetadataKV.Get("piper.sample_rate")
 	if !ok || kv.ValueType != gguf.GGUFMetadataValueTypeUint32 {
 		return 0, false
 	}
 	rate := int(kv.ValueUint32())
 	if rate <= 0 {
 		return 0, false
 	}
 	return rate, true
 }
 // splitOption splits a "prefix:value" model option into its key and value,
@@ -149,14 +103,6 @@ func (w *CrispASR) Load(opts *pb.ModelOptions) error {
 		return fmt.Errorf("Failed to load CrispASR transcription model")
 	}
 	// Determine the TTS output sample rate for the WAV header. piper voices
 	// carry their native rate in GGUF metadata and CrispASR does not resample;
 	// every other engine emits the 24 kHz default.
 	w.sampleRate = defaultTTSSampleRate
 	if rate, ok := piperSampleRate(opts.ModelFile); ok {
 		w.sampleRate = rate
 	}
 	// Load the companion file (codec/tokenizer/s3gen) after the session is open.
 	// rc==0 means success or "not applicable" for the active backend; only a
 	// negative code is fatal.
@@ -224,28 +170,6 @@ func (w *CrispASR) VAD(req *pb.VADRequest) (pb.VADResponse, error) {
 	}, nil
 }
 // isValidWord reports whether a TranscriptWord contains recognisable speech
 // content. The parakeet-specific word accessors can return stale initialisation
 // data (model name, binary blobs) when a segment has no real speech. A word is
 // considered valid only when:
 //   - the text is non-empty after trimming,
 //   - it contains no U+FFFD replacement characters (from binary data scrubbing),
 //   - both timestamps are non-negative,
 //   - the word has positive duration (end > start).
 func isValidWord(w *pb.TranscriptWord) bool {
 	txt := strings.TrimSpace(w.Text)
 	if txt == "" {
 		return false
 	}
 	if strings.ContainsRune(txt, '\uFFFD') {
 		return false
 	}
 	if w.Start < 0 || w.End < 0 || w.End <= w.Start {
 		return false
 	}
 	return true
 }
 func (w *CrispASR) AudioTranscription(ctx context.Context, opts *pb.TranscriptRequest) (pb.TranscriptResult, error) {
 	if err := ctx.Err(); err != nil {
 		return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
@@ -324,54 +248,15 @@ func (w *CrispASR) AudioTranscription(ctx context.Context, opts *pb.TranscriptRe
 		// IDs, so Tokens is left empty.
 		txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "\uFFFD")
 		// Populate word-level timestamps. Try session-based functions first
 		// (per-segment); fall back to parakeet-specific functions (global word
 		// list with no segment index — only populated on the first segment to
 		// avoid duplication).
 		words := []*pb.TranscriptWord{}
 		wordCount := CppGetWordCount(i)
 		if wordCount == 0 && i == 0 {
 			wordCount = CppGetParakeetWordCount()
 			for j := 0; j < wordCount; j++ {
 				w := &pb.TranscriptWord{
 					Start: CppGetParakeetWordT0(j) * (10000000),
 					End:   CppGetParakeetWordT1(j) * (10000000),
 					Text:  strings.ToValidUTF8(strings.Clone(CppGetParakeetWordText(j)), "\uFFFD"),
 				}
 				if isValidWord(w) {
 					words = append(words, w)
 				}
 			}
 		} else {
 			for j := 0; j < wordCount; j++ {
 				w := &pb.TranscriptWord{
 					Start: CppGetWordT0(i, j) * (10000000),
 					End:   CppGetWordT1(i, j) * (10000000),
 					Text:  strings.ToValidUTF8(strings.Clone(CppGetWordText(i, j)), "\uFFFD"),
 				}
 				if isValidWord(w) {
 					words = append(words, w)
 				}
 			}
 		}
 		// Skip empty segments with no recognisable content (e.g. trailing
 		// silence segments that parakeet emits with stale init data).
 		trimmed := strings.TrimSpace(txt)
 		if trimmed == "" && len(words) == 0 {
 			continue
 		}
 		segment := &pb.TranscriptSegment{
 			Id:    int32(i),
 			Text:  txt,
 			Start: s, End: t,
 			Words: words,
 		}
 		segments = append(segments, segment)
-		text += " " + trimmed
+		text += " " + strings.TrimSpace(txt)
 	}
 	return pb.TranscriptResult{
@@ -463,20 +348,13 @@ func (w *CrispASR) AudioTranscriptionStream(ctx context.Context, opts *pb.Transc
 		s := CppGetSegmentStart(i) * 10000000
 		t := CppGetSegmentEnd(i) * 10000000
 		txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "\uFFFD")
 		// Skip empty segments (e.g. trailing silence that parakeet emits
 		// with stale init data).
 		trimmed := strings.TrimSpace(txt)
 		if trimmed == "" && s == t {
 			continue
 		}
 		segments = append(segments, &pb.TranscriptSegment{
 			Id:    int32(i),
 			Text:  txt,
 			Start: s, End: t,
 		})
 		trimmed := strings.TrimSpace(txt)
 		if trimmed == "" {
 			continue
 		}
@@ -512,7 +390,7 @@ func (w *CrispASR) synthesize(text string) ([]float32, error) {
 	}
 	defer CppTTSFree(ptr)
 	src := unsafe.Slice((*float32)(unsafe.Pointer(ptr)), int(n)) //nolint:govet // ptr addresses C-allocated PCM returned across the purego boundary; copied out immediately below, before tts_free.
-	out := make([]float32, int(n))                               // copy out of C memory before free
+	out := make([]float32, int(n)) // copy out of C memory before free
 	copy(out, src)
 	return out, nil
 }
@@ -539,7 +417,7 @@ func (w *CrispASR) TTS(req *pb.TTSRequest) error {
 	if err != nil {
 		return err
 	}
-	return writeWAV(req.Dst, pcm, w.sampleRate)
+	return writeWAV24k(req.Dst, pcm)
 }
 // TTSStream is the streaming counterpart to TTS. CrispASR has no progressive
@@ -569,7 +447,7 @@ func (w *CrispASR) TTSStream(req *pb.TTSRequest, results chan []byte) error {
 	}
 	defer func() { _ = os.Remove(dst) }()
-	if err := writeWAV(dst, pcm, w.sampleRate); err != nil {
+	if err := writeWAV24k(dst, pcm); err != nil {
 		return err
 	}
@@ -581,14 +459,14 @@ func (w *CrispASR) TTSStream(req *pb.TTSRequest, results chan []byte) error {
 	return nil
 }
-// writeWAV writes pcm as a sampleRate Hz, mono, 16-bit PCM WAV at dst.
+// writeWAV24k writes pcm as a 24000 Hz, mono, 16-bit PCM WAV at dst.
-func writeWAV(dst string, pcm []float32, sampleRate int) error {
+func writeWAV24k(dst string, pcm []float32) error {
 	f, err := os.Create(dst)
 	if err != nil {
 		return fmt.Errorf("crispasr: create %q: %w", dst, err)
 	}
-	enc := wav.NewEncoder(f, sampleRate, 16, 1, 1)
+	enc := wav.NewEncoder(f, 24000, 16, 1, 1)
 	ints := make([]int, len(pcm))
 	for i, s := range pcm {
 		if s > 1 {
@@ -599,7 +477,7 @@ func writeWAV(dst string, pcm []float32, sampleRate int) error {
 		ints[i] = int(s * 32767)
 	}
 	buf := &audio.IntBuffer{
-		Format:         &audio.Format{NumChannels: 1, SampleRate: sampleRate},
+		Format:         &audio.Format{NumChannels: 1, SampleRate: 24000},
 		Data:           ints,
 		SourceBitDepth: 16,
 	}
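
Review note on the hunk above: the WAV writer clamps each float sample and scales it to 16 bits, and the header rate decides how fast those samples play back. The sketch below is only an illustration of that relationship. `floatToPCM16` and `durationMs` are hypothetical helpers that do not exist in the backend, and the lower clamp bound is an assumption, since the hunk only shows the `s > 1` branch.

```go
package main

import "fmt"

// floatToPCM16 is a hypothetical helper (not part of the diff). It clamps
// normalized float32 samples to [-1, 1] and scales them to the 16-bit range,
// in the same spirit as the loop inside the WAV writer. The lower clamp bound
// is assumed; the hunk only shows the upper one.
func floatToPCM16(pcm []float32) []int {
	out := make([]int, len(pcm))
	for i, s := range pcm {
		if s > 1 {
			s = 1
		} else if s < -1 {
			s = -1
		}
		out[i] = int(s * 32767)
	}
	return out
}

// durationMs reports how long n mono samples last when played at sampleRate Hz.
// It is a hypothetical helper used to show the effect of a mismatched header.
func durationMs(n, sampleRate int) float64 {
	return float64(n) * 1000 / float64(sampleRate)
}

func main() {
	fmt.Println(floatToPCM16([]float32{0, 0.5, 2, -3}))

	// One second of audio produced at 22.05 kHz, declared at two header rates.
	fmt.Println(durationMs(22050, 22050), durationMs(22050, 24000))
}
```

With a 24000 Hz header, PCM produced at 22050 Hz plays shorter and higher-pitched than intended, which is the behavior the removed `sampleRate` field was guarding against for piper voices.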
--- a/backend/go/crispasr/gocrispasr_samplerate_test.go
+++ b/backend/go/crispasr/gocrispasr_samplerate_test.go
@@ -1,164 +0,0 @@
 package main
 import (
 	"bytes"
 	"encoding/binary"
 	"os"
 	"path/filepath"
 	"github.com/go-audio/wav"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
 )
 // GGUF metadata value type tags (subset) from the GGUF spec.
 const (
 	ggufTypeUint32 uint32 = 4
 	ggufTypeString uint32 = 8
 )
 type ggufKV struct {
 	key   string
 	vtype uint32
 	val   any
 }
 // writeMinimalGGUF emits a valid, tensor-less GGUF file carrying only the given
 // metadata key-values. Enough for the header-only parse path piperSampleRate
 // uses; avoids pulling a real multi-MB voice into the test.
 func writeMinimalGGUF(path string, kvs []ggufKV) error {
 	var b bytes.Buffer
 	b.WriteString("GGUF")                                // magic
 	_ = binary.Write(&b, binary.LittleEndian, uint32(3)) // version
 	_ = binary.Write(&b, binary.LittleEndian, uint64(0)) // tensor count
 	_ = binary.Write(&b, binary.LittleEndian, uint64(len(kvs)))
 	for _, kv := range kvs {
 		_ = binary.Write(&b, binary.LittleEndian, uint64(len(kv.key)))
 		b.WriteString(kv.key)
 		_ = binary.Write(&b, binary.LittleEndian, kv.vtype)
 		switch v := kv.val.(type) {
 		case uint32:
 			_ = binary.Write(&b, binary.LittleEndian, v)
 		case string:
 			_ = binary.Write(&b, binary.LittleEndian, uint64(len(v)))
 			b.WriteString(v)
 		}
 	}
 	return os.WriteFile(path, b.Bytes(), 0o644)
 }
 // wavSampleRate decodes the WAV header at path and returns its sample rate.
 func wavSampleRate(path string) (int, error) {
 	f, err := os.Open(path)
 	if err != nil {
 		return 0, err
 	}
 	defer func() { _ = f.Close() }()
 	dec := wav.NewDecoder(f)
 	dec.ReadInfo()
 	return int(dec.SampleRate), nil
 }
 var _ = Describe("piper sample rate", func() {
 	Context("piperSampleRate", func() {
 		It("reads piper.sample_rate from a piper GGUF (medium = 22050)", func() {
 			p := filepath.Join(GinkgoT().TempDir(), "voice.gguf")
 			Expect(writeMinimalGGUF(p, []ggufKV{
 				{key: "general.architecture", vtype: ggufTypeString, val: "piper"},
 				{key: "piper.sample_rate", vtype: ggufTypeUint32, val: uint32(22050)},
 			})).To(Succeed())
 			rate, ok := piperSampleRate(p)
 			Expect(ok).To(BeTrue(), "piper.sample_rate should be found")
 			Expect(rate).To(Equal(22050))
 		})
 		It("reads the low-quality rate (16000)", func() {
 			p := filepath.Join(GinkgoT().TempDir(), "voice.gguf")
 			Expect(writeMinimalGGUF(p, []ggufKV{
 				{key: "piper.sample_rate", vtype: ggufTypeUint32, val: uint32(16000)},
 			})).To(Succeed())
 			rate, ok := piperSampleRate(p)
 			Expect(ok).To(BeTrue())
 			Expect(rate).To(Equal(16000))
 		})
 		It("returns ok=false for a non-piper GGUF (no piper.sample_rate key)", func() {
 			p := filepath.Join(GinkgoT().TempDir(), "other.gguf")
 			Expect(writeMinimalGGUF(p, []ggufKV{
 				{key: "general.architecture", vtype: ggufTypeString, val: "vibevoice"},
 			})).To(Succeed())
 			_, ok := piperSampleRate(p)
 			Expect(ok).To(BeFalse())
 		})
 		It("returns ok=false for an unreadable/non-GGUF file", func() {
 			p := filepath.Join(GinkgoT().TempDir(), "garbage.gguf")
 			Expect(os.WriteFile(p, []byte("not a gguf"), 0o644)).To(Succeed())
 			_, ok := piperSampleRate(p)
 			Expect(ok).To(BeFalse())
 		})
 	})
 	// End-to-end through the built .so. Gated on CRISPASR_PIPER_MODEL_PATH (a
 	// real piper voice GGUF) like the other model-backed specs; never runs in
 	// default CI. Proves CrispASR's piper backend output rate flows into the
 	// WAV header instead of the hardcoded 24 kHz default.
 	Context("piper TTS end-to-end", func() {
 		It("writes the WAV at the model's native piper.sample_rate", func() {
 			model := os.Getenv("CRISPASR_PIPER_MODEL_PATH")
 			if model == "" {
 				Skip("set CRISPASR_PIPER_MODEL_PATH to run the piper e2e spec")
 			}
 			ensureLibLoaded()
 			expected, ok := piperSampleRate(model)
 			Expect(ok).To(BeTrue(), "model should carry piper.sample_rate metadata")
 			w := &CrispASR{}
 			Expect(w.Load(&pb.ModelOptions{
 				ModelFile: model,
 				Options:   []string{"backend:piper"},
 				Threads:   4,
 			})).To(Succeed())
 			dst := filepath.Join(GinkgoT().TempDir(), "piper.wav")
 			Expect(w.TTS(&pb.TTSRequest{Text: "Hello from CrispASR piper.", Dst: dst})).To(Succeed())
 			info, err := os.Stat(dst)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(info.Size()).To(BeNumerically(">", 1024), "expected a non-trivial WAV")
 			rate, err := wavSampleRate(dst)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(rate).To(Equal(expected),
 				"WAV header rate must equal the model's native piper.sample_rate, not the 24k default")
 		})
 	})
 	Context("writeWAV", func() {
 		It("writes the WAV header at the given sample rate (22050 for piper, not the 24k default)", func() {
 			dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
 			pcm := make([]float32, 220) // 10 ms of silence is enough for a header
 			Expect(writeWAV(dst, pcm, 22050)).To(Succeed())
 			rate, err := wavSampleRate(dst)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(rate).To(Equal(22050))
 		})
 		It("writes a 16000 Hz header for low-quality piper voices", func() {
 			dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
 			pcm := make([]float32, 160)
 			Expect(writeWAV(dst, pcm, 16000)).To(Succeed())
 			rate, err := wavSampleRate(dst)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(rate).To(Equal(16000))
 		})
 	})
 })
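
Review note on the removed spec file: `writeMinimalGGUF` relies on the GGUF header being a flat little-endian sequence (magic, version, tensor count, KV count, then length-prefixed keys with a type tag and value). A minimal sketch of that layout follows, using only the standard library. `buildHeader` and `readUint32Key` are hypothetical names, the reader handles only the uint32 and string value types used in the spec, and it is not a substitute for `gguf-parser-go`.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// Value type tags used by the removed test (subset of the GGUF spec).
const (
	typeUint32 uint32 = 4
	typeString uint32 = 8
)

// buildHeader (hypothetical) emits a tensor-less GGUF v3 header carrying
// general.architecture and piper.sample_rate.
func buildHeader(arch string, rate uint32) []byte {
	var b bytes.Buffer
	le := binary.LittleEndian
	writeStr := func(s string) {
		_ = binary.Write(&b, le, uint64(len(s)))
		b.WriteString(s)
	}
	b.WriteString("GGUF")                // magic
	_ = binary.Write(&b, le, uint32(3)) // version
	_ = binary.Write(&b, le, uint64(0)) // tensor count
	_ = binary.Write(&b, le, uint64(2)) // metadata KV count
	writeStr("general.architecture")
	_ = binary.Write(&b, le, typeString)
	writeStr(arch)
	writeStr("piper.sample_rate")
	_ = binary.Write(&b, le, typeUint32)
	_ = binary.Write(&b, le, rate)
	return b.Bytes()
}

// readUint32Key (hypothetical) walks the metadata KVs and returns the value of
// a uint32 key. Only uint32 and string values are understood.
func readUint32Key(data []byte, key string) (uint32, error) {
	r := bytes.NewReader(data)
	le := binary.LittleEndian
	magic := make([]byte, 4)
	if _, err := io.ReadFull(r, magic); err != nil || string(magic) != "GGUF" {
		return 0, errors.New("not a GGUF header")
	}
	var version uint32
	var tensors, kvs uint64
	for _, p := range []any{&version, &tensors, &kvs} {
		if err := binary.Read(r, le, p); err != nil {
			return 0, err
		}
	}
	readStr := func() (string, error) {
		var n uint64
		if err := binary.Read(r, le, &n); err != nil {
			return "", err
		}
		if n > uint64(r.Len()) {
			return "", errors.New("string length exceeds input")
		}
		buf := make([]byte, n)
		_, err := io.ReadFull(r, buf)
		return string(buf), err
	}
	for i := uint64(0); i < kvs; i++ {
		k, err := readStr()
		if err != nil {
			return 0, err
		}
		var vtype uint32
		if err := binary.Read(r, le, &vtype); err != nil {
			return 0, err
		}
		switch vtype {
		case typeUint32:
			var v uint32
			if err := binary.Read(r, le, &v); err != nil {
				return 0, err
			}
			if k == key {
				return v, nil
			}
		case typeString:
			if _, err := readStr(); err != nil {
				return 0, err
			}
		default:
			return 0, fmt.Errorf("unsupported value type %d", vtype)
		}
	}
	return 0, errors.New("key not found")
}

func main() {
	rate, err := readUint32Key(buildHeader("piper", 22050), "piper.sample_rate")
	fmt.Println(rate, err)
}
```

A missing key and a non-GGUF file both surface as an error here, which corresponds to the `ok=false` cases the removed specs exercised.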
--- a/backend/go/crispasr/main.go
+++ b/backend/go/crispasr/main.go
@@ -4,7 +4,6 @@ package main
 import (
 	"flag"
 	"os"
 	"runtime"
 	"github.com/ebitengine/purego"
 	grpc "github.com/mudler/LocalAI/pkg/grpc"
@@ -22,11 +21,7 @@ type LibFuncs struct {
 func main() {
 	libName := os.Getenv("CRISPASR_LIBRARY")
 	if libName == "" {
-		if runtime.GOOS == "darwin" {
+		libName = "./libgocrispasr-fallback.so"
 			libName = "./libgocrispasr-fallback.dylib"
 		} else {
 			libName = "./libgocrispasr-fallback.so"
 		}
 	}
 	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
@@ -49,14 +44,6 @@ func main() {
 		{&CppTTSFree, "tts_free"},
 		{&CppTTSSetVoice, "tts_set_voice"},
 		{&CppTTSSetVoiceFile, "tts_set_voice_file"},
 		{&CppGetWordCount, "get_word_count"},
 		{&CppGetWordText, "get_word_text"},
 		{&CppGetWordT0, "get_word_t0"},
 		{&CppGetWordT1, "get_word_t1"},
 		{&CppGetParakeetWordCount, "get_parakeet_word_count"},
 		{&CppGetParakeetWordText, "get_parakeet_word_text"},
 		{&CppGetParakeetWordT0, "get_parakeet_word_t0"},
 		{&CppGetParakeetWordT1, "get_parakeet_word_t1"},
 	}
 	for _, lf := range libFuncs {
--- a/backend/go/crispasr/package.sh
+++ b/backend/go/crispasr/package.sh
@@ -12,8 +12,7 @@ REPO_ROOT="${CURDIR}/../../.."
 mkdir -p $CURDIR/package/lib
 cp -avf $CURDIR/crispasr $CURDIR/package/
-cp -fv $CURDIR/libgocrispasr-*.so $CURDIR/package/ 2>/dev/null || true
+cp -fv $CURDIR/libgocrispasr-*.so $CURDIR/package/
 cp -fv $CURDIR/libgocrispasr-*.dylib $CURDIR/package/ 2>/dev/null || true
 cp -fv $CURDIR/run.sh $CURDIR/package/
 # Detect architecture and copy appropriate libraries
@@ -52,32 +51,6 @@ else
    exit 1
 fi
 # Bundle espeak-ng (+ its libpcaudio/libsonic runtime deps) and its voice data so
 # the piper TTS backend can phonemize non-English text. CrispASR dlopens
 # libespeak-ng.so.1 at runtime (the MIT-clean path); the dlopen succeeds loading
 # libespeak-ng but FAILS if libpcaudio/libsonic are absent, so all three .so are
 # required. run.sh points CRISPASR_ESPEAK_DATA_PATH at the bundled data dir.
 # Best-effort: only copied when present, so a local dev build without espeak-ng
 # installed still packages the rest (English voices keep working).
 ESPEAK_LIBDIR=""
 for d in /usr/lib/x86_64-linux-gnu /usr/lib/aarch64-linux-gnu; do
    if [ -f "$d/libespeak-ng.so.1" ]; then
        ESPEAK_LIBDIR="$d"
        break
    fi
 done
 if [ -n "$ESPEAK_LIBDIR" ]; then
    echo "Bundling espeak-ng from $ESPEAK_LIBDIR ..."
    cp -arfLv "$ESPEAK_LIBDIR/libespeak-ng.so.1" $CURDIR/package/lib/
    cp -arfLv "$ESPEAK_LIBDIR/libpcaudio.so.0" $CURDIR/package/lib/
    cp -arfLv "$ESPEAK_LIBDIR/libsonic.so.0" $CURDIR/package/lib/
    if [ -d "$ESPEAK_LIBDIR/espeak-ng-data" ]; then
        cp -arfLv "$ESPEAK_LIBDIR/espeak-ng-data" $CURDIR/package/
    fi
 else
    echo "espeak-ng not found; non-English piper voices will not phonemize"
 fi
 # Package GPU libraries based on BUILD_TYPE
 # The GPU library packaging script will detect BUILD_TYPE and copy appropriate GPU libraries
 GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
--- a/backend/go/crispasr/run.sh
+++ b/backend/go/crispasr/run.sh
@@ -2,7 +2,7 @@
 set -ex
 # Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath "$0")")
+CURDIR=$(dirname "$(realpath $0)")
 cd /
@@ -12,23 +12,19 @@ if [ "$(uname)" != "Darwin" ]; then
 	grep -e "flags" /proc/cpuinfo | head -1
 fi
-if [ "$(uname)" = "Darwin" ]; then
+LIBRARY="$CURDIR/libgocrispasr-fallback.so"
 	# macOS: single dylib variant (Metal or Accelerate)
 	LIBRARY="$CURDIR/libgocrispasr-fallback.dylib"
 	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
 else
 	LIBRARY="$CURDIR/libgocrispasr-fallback.so"
 if [ "$(uname)" != "Darwin" ]; then
 	if grep -q -e "\savx\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX    found OK"
-		if [ -e "$CURDIR"/libgocrispasr-avx.so ]; then
+		if [ -e $CURDIR/libgocrispasr-avx.so ]; then
 			LIBRARY="$CURDIR/libgocrispasr-avx.so"
 		fi
 	fi
 	if grep -q -e "\savx2\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX2   found OK"
-		if [ -e "$CURDIR"/libgocrispasr-avx2.so ]; then
+		if [ -e $CURDIR/libgocrispasr-avx2.so ]; then
 			LIBRARY="$CURDIR/libgocrispasr-avx2.so"
 		fi
 	fi
@@ -36,27 +32,21 @@ else
 	# Check avx 512
 	if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX512F found OK"
-		if [ -e "$CURDIR"/libgocrispasr-avx512.so ]; then
+		if [ -e $CURDIR/libgocrispasr-avx512.so ]; then
 			LIBRARY="$CURDIR/libgocrispasr-avx512.so"
 		fi
 	fi
 	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
 fi
 export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 export CRISPASR_LIBRARY=$LIBRARY
 # Point piper's espeak-ng phonemizer at the bundled voice data. The variable
 # names the directory CONTAINING espeak-ng-data (package.sh drops it next to
 # this script). Harmless when espeak-ng wasn't bundled.
 export CRISPASR_ESPEAK_DATA_PATH="$CURDIR"
 # If there is a lib/ld.so, use it
-if [ -f "$CURDIR"/lib/ld.so ]; then
+if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using library: $LIBRARY"
-	exec "$CURDIR"/lib/ld.so "$CURDIR"/crispasr "$@"
+	exec $CURDIR/lib/ld.so $CURDIR/crispasr "$@"
 fi
 echo "Using library: $LIBRARY"
-exec "$CURDIR"/crispasr "$@"
+exec $CURDIR/crispasr "$@"
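
Review note on `run.sh`: the script starts from the fallback library and upgrades to the avx, avx2 and avx512 variants in that order, taking a variant only when the CPU flag is present and the file exists. The Go sketch below restates that selection order for readers who want to test it in isolation. `pickVariant` is a hypothetical function, it is not how the backend resolves `CRISPASR_LIBRARY`, and splitting the flags line on whitespace is an approximation of the `grep` patterns in the script.

```go
package main

import (
	"fmt"
	"strings"
)

// pickVariant (hypothetical) mirrors the selection order in run.sh: begin with
// the fallback library, then let each later CPU variant override the choice
// when its flag is present and the library file exists.
func pickVariant(flagsLine string, exists func(name string) bool) string {
	have := map[string]bool{}
	for _, f := range strings.Fields(flagsLine) {
		have[f] = true
	}

	lib := "libgocrispasr-fallback.so"
	candidates := []struct{ flag, variant string }{
		{"avx", "avx"},
		{"avx2", "avx2"},
		{"avx512f", "avx512"},
	}
	for _, c := range candidates {
		name := "libgocrispasr-" + c.variant + ".so"
		if have[c.flag] && exists(name) {
			lib = name
		}
	}
	return lib
}

func main() {
	// Example /proc/cpuinfo flags line and the set of libraries on disk.
	flags := "flags : fpu sse2 avx avx2"
	onDisk := map[string]bool{
		"libgocrispasr-avx.so":  true,
		"libgocrispasr-avx2.so": true,
	}
	fmt.Println(pickVariant(flags, func(n string) bool { return onDisk[n] }))
}
```

Because later matches override earlier ones, a CPU that reports `avx512f` still falls back to the avx2 build when no avx512 library was packaged.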
--- a/backend/go/depth-anything-cpp/.gitignore
+++ b/backend/go/depth-anything-cpp/.gitignore
@@ -1,7 +0,0 @@
 sources/
 build*/
 package/
 libdepthanythingcpp*.so
 depth-anything-cpp
 test-models/
 test-data/
--- a/backend/go/depth-anything-cpp/CMakeLists.txt
+++ b/backend/go/depth-anything-cpp/CMakeLists.txt
@@ -1,28 +0,0 @@
 cmake_minimum_required(VERSION 3.18)
 project(libdepthanythingcpp LANGUAGES C CXX)
 set(CMAKE_POSITION_INDEPENDENT_CODE ON)
 set(CMAKE_CXX_STANDARD 17)
 set(CMAKE_CXX_STANDARD_REQUIRED ON)
 # Static-link ggml into the depth-anything shared library so the resulting .so
 # has no runtime dependency on an external libggml — only on
 # libc/libstdc++/libgomp, which the LocalAI package step bundles into the
 # docker image.
 set(BUILD_SHARED_LIBS OFF CACHE BOOL "Build static libraries" FORCE)
 # depth-anything.cpp build switches: skip CLI/tests, but build libdepthanything
 # itself as a SHARED library (DA_SHARED) while ggml stays static
 # (BUILD_SHARED_LIBS OFF above). The da_capi_* C ABI is compiled into
 # src/da_capi.cpp and re-exported by that shared library, so no extra MODULE
 # wrapper is needed (unlike locate-anything.cpp).
 set(DA_BUILD_CLI OFF CACHE BOOL "Disable depth-anything CLI" FORCE)
 set(DA_BUILD_TESTS OFF CACHE BOOL "Disable depth-anything tests" FORCE)
 set(DA_SHARED ON CACHE BOOL "Build libdepthanything as a shared lib" FORCE)
 add_subdirectory(./sources/depth-anything.cpp)
 # Emit libdepthanything.so into the top-level build dir so the Makefile can
 # rename it to the per-variant libdepthanythingcpp-<variant>.so.
 set_target_properties(depthanything PROPERTIES
    LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
--- a/backend/go/depth-anything-cpp/Makefile
+++ b/backend/go/depth-anything-cpp/Makefile
@@ -1,152 +0,0 @@
 CMAKE_ARGS?=
 BUILD_TYPE?=
 NATIVE?=false
 GOCMD?=go
 GO_TAGS?=
 JOBS?=$(shell nproc --ignore=1)
 # depth-anything.cpp. Pin to a specific commit for a stable build; a squash
 # merge upstream can orphan a branch, so the native version is pinned by SHA.
 # This SHA adds the Depth Anything V2 engine + C-API routing (depth-only,
 # relative + metric) on top of the nested two-file metric C-API (abi_version 4,
 # da_capi_load_nested) required by the depth-anything-3-nested gallery model.
 # It is kept alive by the upstream tag da2-support (survives a squash-merge);
 # repoint to the master merge commit once mudler/depth-anything.cpp PR #1 lands.
 DEPTHANYTHING_REPO?=https://github.com/mudler/depth-anything.cpp.git
 DEPTHANYTHING_VERSION?=f4e17dea695dd12ae76bea98ba58030996b98118
 ifeq ($(NATIVE),false)
 	CMAKE_ARGS+=-DGGML_NATIVE=OFF
 endif
 # Forward LocalAI's BUILD_TYPE to the matching ggml backend switch. depth-anything.cpp
 # force-sets GGML_CUDA/GGML_VULKAN/GGML_METAL from its own DA_GGML_* options, so
 # those must be toggled via the DA_GGML_* names (a bare -DGGML_CUDA=ON would be
 # overridden); the remaining ggml switches pass straight through.
 ifeq ($(BUILD_TYPE),cublas)
 	CMAKE_ARGS+=-DGGML_CUDA=ON -DDA_GGML_CUDA=ON
 else ifeq ($(BUILD_TYPE),openblas)
 	CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
 else ifeq ($(BUILD_TYPE),clblas)
 	CMAKE_ARGS+=-DGGML_CLBLAST=ON
 else ifeq ($(BUILD_TYPE),hipblas)
 	ROCM_HOME ?= /opt/rocm
 	ROCM_PATH ?= /opt/rocm
 	export CXX=$(ROCM_HOME)/llvm/bin/clang++
 	export CC=$(ROCM_HOME)/llvm/bin/clang
 	AMDGPU_TARGETS?=gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1200,gfx1201
 	CMAKE_ARGS+=-DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
 else ifeq ($(BUILD_TYPE),vulkan)
 	CMAKE_ARGS+=-DGGML_VULKAN=ON -DDA_GGML_VULKAN=ON
 else ifeq ($(OS),Darwin)
 	# macOS/Metal: built + published as an OCI image by CI (includeDarwin in
 	# .github/backend-matrix.yml) so Apple Silicon users can install this backend.
 	ifneq ($(BUILD_TYPE),metal)
 		CMAKE_ARGS+=-DGGML_METAL=OFF
 	else
 		CMAKE_ARGS+=-DGGML_METAL=ON
 		CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
 		CMAKE_ARGS+=-DDA_GGML_METAL=ON
 	endif
 endif
 ifeq ($(BUILD_TYPE),sycl_f16)
 	CMAKE_ARGS+=-DGGML_SYCL=ON \
 		-DCMAKE_C_COMPILER=icx \
 		-DCMAKE_CXX_COMPILER=icpx \
 		-DGGML_SYCL_F16=ON
 endif
 ifeq ($(BUILD_TYPE),sycl_f32)
 	CMAKE_ARGS+=-DGGML_SYCL=ON \
 		-DCMAKE_C_COMPILER=icx \
 		-DCMAKE_CXX_COMPILER=icpx
 endif
 sources/depth-anything.cpp:
 	mkdir -p sources && \
 	git clone --recursive $(DEPTHANYTHING_REPO) sources/depth-anything.cpp && \
 	cd sources/depth-anything.cpp && \
 	git checkout $(DEPTHANYTHING_VERSION) && \
 	git submodule update --init --recursive --depth 1 --single-branch
 # Detect OS
 UNAME_S := $(shell uname -s)
 # Only build CPU variants on Linux
 ifeq ($(UNAME_S),Linux)
 	VARIANT_TARGETS = libdepthanythingcpp-avx.so libdepthanythingcpp-avx2.so libdepthanythingcpp-avx512.so libdepthanythingcpp-fallback.so
 else
 	# On non-Linux (e.g., Darwin), build only fallback variant
 	VARIANT_TARGETS = libdepthanythingcpp-fallback.dylib
 endif
 depth-anything-cpp: main.go godepthanythingcpp.go $(VARIANT_TARGETS)
 	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o depth-anything-cpp ./
 package: depth-anything-cpp
 	bash package.sh
 build: package
 clean: purge
 	rm -rf libdepthanythingcpp*.so libdepthanythingcpp*.dylib depth-anything-cpp package sources
 purge:
 	rm -rf build*
 # Build all variants (Linux only)
 ifeq ($(UNAME_S),Linux)
 libdepthanythingcpp-avx.so: sources/depth-anything.cpp
 	rm -rfv build-$@
 	$(info ${GREEN}I depth-anything-cpp build info:avx${RESET})
 	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libdepthanythingcpp-custom
 	rm -rfv build-$@
 libdepthanythingcpp-avx2.so: sources/depth-anything.cpp
 	rm -rfv build-$@
 	$(info ${GREEN}I depth-anything-cpp build info:avx2${RESET})
 	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libdepthanythingcpp-custom
 	rm -rfv build-$@
 libdepthanythingcpp-avx512.so: sources/depth-anything.cpp
 	rm -rfv build-$@
 	$(info ${GREEN}I depth-anything-cpp build info:avx512${RESET})
 	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libdepthanythingcpp-custom
 	rm -rfv build-$@
 endif
 # Build fallback variant (all platforms)
 ifeq ($(UNAME_S),Darwin)
 libdepthanythingcpp-fallback.dylib: sources/depth-anything.cpp
 	rm -rfv build-$@
 	$(info ${GREEN}I depth-anything-cpp build info:fallback${RESET})
 	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libdepthanythingcpp-custom
 	rm -rfv build-$@
 else
 libdepthanythingcpp-fallback.so: sources/depth-anything.cpp
 	rm -rfv build-$@
 	$(info ${GREEN}I depth-anything-cpp build info:fallback${RESET})
 	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libdepthanythingcpp-custom
 	rm -rfv build-$@
 endif
 libdepthanythingcpp-custom: CMakeLists.txt
 	mkdir -p build-$(SO_TARGET) && \
 	cd build-$(SO_TARGET) && \
 	cmake .. $(CMAKE_ARGS) && \
 	cmake --build . --config Release -j$(JOBS) && \
 	cd .. && \
 	(mv build-$(SO_TARGET)/libdepthanything.so ./$(SO_TARGET) 2>/dev/null || \
 	 mv build-$(SO_TARGET)/libdepthanything.dylib ./$(SO_TARGET) 2>/dev/null)
 all: depth-anything-cpp package
 # `test` is invoked by the top-level Makefile's `test-extra` target. It builds
 # the backend binary + the fallback shared library (needed for dlopen at
 # runtime), then runs test.sh which downloads a small GGUF + a test image and
 # exercises the gRPC Load/Predict wire path via the Go smoke test in
 # main_test.go.
 test: depth-anything-cpp libdepthanythingcpp-fallback.so
 	bash test.sh
--- a/backend/go/depth-anything-cpp/godepthanythingcpp.go
+++ b/backend/go/depth-anything-cpp/godepthanythingcpp.go
@@ -1,556 +0,0 @@
 package main
 // godepthanythingcpp.go - gRPC handlers (Load, Predict, GenerateImage) for the
 // depth-anything-cpp backend, wrapping the Depth Anything 3 ggml C-API
 // (libdepthanythingcpp-<variant>.so) via purego.
 //
 // Embeds base.SingleThread to default the unimplemented RPCs to "not supported"
 // and to serialize calls — the C side shares a ggml graph allocator and is NOT
 // reentrant, so all inference must run one-at-a-time.
 //
 // Depth has no native OpenAI endpoint, so the model is exposed two ways:
 //
 //   - GenerateImage(src, dst): run depth on the src image and write a
 //     min-max-normalised grayscale depth PNG to dst.
 //   - Predict(images[0]): run depth+pose and return a JSON blob with the depth
 //     dimensions, depth stats and the camera extrinsics (3x4) / intrinsics (3x3).
 import (
 	"encoding/base64"
 	"encoding/json"
 	"fmt"
 	"image"
 	"image/png"
 	"math"
 	"os"
 	"path/filepath"
 	"strings"
 	"unsafe"
 	"github.com/mudler/LocalAI/pkg/grpc/base"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 )
 // C-API function pointers, registered in main.go via purego. The da_capi_*
 // symbols live inside libdepthanything (src/da_capi.cpp) and are re-exported by
 // the DA_SHARED build.
 var (
 	// da_capi_load(const char* gguf_path, int n_threads) -> da_ctx* (0 = fail)
 	CapiLoad func(gguf string, nThreads int32) uintptr
 	// da_capi_load_nested(const char* anyview_gguf, const char* metric_gguf,
 	//   int n_threads) -> da_ctx* (0 = fail). The returned ctx serves the nested
 	//   metric model: depth/pose calls produce final metric-scale depth + scaled pose.
 	CapiLoadNested func(anyview string, metric string, nThreads int32) uintptr
 	// da_capi_free(da_ctx* ctx) — safe on a 0 handle.
 	CapiFree func(handle uintptr)
 	// da_capi_last_error(da_ctx* ctx) -> const char* (owned by ctx, "" if none).
 	// purego marshals the returned C string into a Go string (a copy), so we
 	// never free it.
 	CapiLastError func(handle uintptr) string
 	// da_capi_depth_path(ctx, image_path, out_h*, out_w*) -> float* depth map
 	// (row-major H*W); nil on error. Caller frees via da_capi_free_floats.
 	CapiDepthPath func(handle uintptr, imagePath string, outH *int32, outW *int32) *float32
 	// da_capi_free_floats(float* p)
 	CapiFreeFloats func(p *float32)
 	// da_capi_pose_path(ctx, image_path, out_ext[12], out_intr[9]) -> 0 ok, -1 err
 	CapiPosePath func(handle uintptr, imagePath string, outExt *float32, outIntr *float32) int32
 	// da_capi_depth_dense(ctx, image_path, out_h*, out_w*, out_depth**, out_conf**,
 	//   out_sky**, out_ext[12], out_intr[9], out_is_metric*) -> 0 ok, -1 err.
 	// Each non-NULL out_depth/out_conf/out_sky receives a malloc'd float[H*W] (free
 	// via da_capi_free_floats); buffers the model doesn't produce are set NULL.
 	CapiDepthDense func(handle uintptr, imagePath string,
 		outH, outW *int32,
 		outDepth, outConf, outSky **float32,
 		outExt, outIntr *float32,
 		outIsMetric *int32) int32
 	// da_capi_points(ctx, image_path, conf_thresh, out_n*, out_xyz**, out_rgb**) ->
 	//   0 ok, -1 err. *out_xyz = malloc'd float[3*N] (free via da_capi_free_floats),
 	//   *out_rgb = malloc'd uint8[3*N] (free via da_capi_free_bytes).
 	CapiPoints func(handle uintptr, imagePath string, confThresh float32,
 		outN *int32, outXyz **float32, outRgb **byte) int32
 	// da_capi_free_bytes(unsigned char* p)
 	CapiFreeBytes func(p *byte)
 	// da_capi_export_glb(ctx, image_path, out_glb) -> 0 ok, -1 err
 	CapiExportGlb func(handle uintptr, imagePath string, outGlb string) int32
 	// da_capi_export_colmap(ctx, image_path, out_dir, binary) -> 0 ok, -1 err
 	CapiExportColmap func(handle uintptr, imagePath string, outDir string, binary int32) int32
 )
 type DepthAnythingCpp struct {
 	base.SingleThread
 	handle uintptr
 }
 // Load loads the GGUF model at opts.ModelFile (joined with opts.ModelPath if
 // relative) and stores the da_ctx handle for later inference calls.
 func (r *DepthAnythingCpp) Load(opts *pb.ModelOptions) error {
 	modelFile := opts.ModelFile
 	if modelFile == "" {
 		modelFile = opts.Model
 	}
 	if modelFile == "" {
 		return fmt.Errorf("depth-anything-cpp: ModelFile is empty")
 	}
 	resolve := func(name string) string {
 		if filepath.IsAbs(name) {
 			return name
 		}
 		return filepath.Join(opts.ModelPath, name)
 	}
 	modelPath := resolve(modelFile)
 	if _, err := os.Stat(modelPath); err != nil {
 		return fmt.Errorf("depth-anything-cpp: model file not found: %s: %w", modelPath, err)
 	}
 	// Nested metric models are a two-file pair: the main model is the anyview
 	// (GIANT) branch and the metric (ViT-L + DPT/sky) branch is named via a
 	// "metric_model:<filename>" entry in opts.Options. When present we load both
 	// branches so the engine runs the nested metric alignment.
 	metricFile := optionValue(opts.Options, "metric_model")
 	threads := opts.Threads
 	if threads <= 0 {
 		threads = 4
 	}
 	// Release previous model if any (re-Load).
 	if r.handle != 0 {
 		CapiFree(r.handle)
 		r.handle = 0
 	}
 	var h uintptr
 	if metricFile != "" {
 		metricPath := resolve(metricFile)
 		if _, err := os.Stat(metricPath); err != nil {
 			return fmt.Errorf("depth-anything-cpp: metric_model file not found: %s: %w", metricPath, err)
 		}
 		h = CapiLoadNested(modelPath, metricPath, threads)
 		if h == 0 {
 			if msg := CapiLastError(0); msg != "" {
 				return fmt.Errorf("depth-anything-cpp: da_capi_load_nested failed for %s + %s: %s", modelPath, metricPath, msg)
 			}
 			return fmt.Errorf("depth-anything-cpp: da_capi_load_nested failed for %s + %s", modelPath, metricPath)
 		}
 	} else {
 		h = CapiLoad(modelPath, threads)
 		if h == 0 {
 			// da_capi_last_error needs a ctx; on a failed load we have none (it
 			// returns "" for a null ctx), so the text is best-effort.
 			if msg := CapiLastError(0); msg != "" {
 				return fmt.Errorf("depth-anything-cpp: da_capi_load failed for %s: %s", modelPath, msg)
 			}
 			return fmt.Errorf("depth-anything-cpp: da_capi_load failed for %s", modelPath)
 		}
 	}
 	r.handle = h
 	return nil
 }
 // optionValue returns the value of the first "key:value" entry in opts whose key
 // matches (case-sensitive), or "" if absent. Mirrors how other LocalAI backends
 // read ModelOptions.Options.
 func optionValue(opts []string, key string) string {
 	prefix := key + ":"
 	for _, o := range opts {
 		if strings.HasPrefix(o, prefix) {
 			return strings.TrimSpace(o[len(prefix):])
 		}
 	}
 	return ""
 }
 // depthResult is the JSON payload returned by Predict.
 type depthResult struct {
 	DepthW     int         `json:"depth_w"`
 	DepthH     int         `json:"depth_h"`
 	DepthMin   float32     `json:"depth_min"`
 	DepthMax   float32     `json:"depth_max"`
 	Extrinsics [12]float32 `json:"extrinsics"` // 3x4 row-major
 	Intrinsics [9]float32  `json:"intrinsics"` // 3x3 row-major
 }
 // Predict runs depth+pose on the first supplied image and returns depth
 // statistics + camera pose as a JSON string. LocalAI wraps the string into the
 // Reply.Message of the gRPC response. The image in Images[0] may be a
 // filesystem path or a base64-encoded payload.
 func (r *DepthAnythingCpp) Predict(opts *pb.PredictOptions) (string, error) {
 	imgs := opts.GetImages()
 	if len(imgs) == 0 {
 		return "", fmt.Errorf("depth-anything-cpp: Predict requires an image in Images[]")
 	}
 	imgPath, cleanup, err := materializeImage(imgs[0])
 	if err != nil {
 		return "", fmt.Errorf("depth-anything-cpp: %w", err)
 	}
 	defer cleanup()
 	depth, h, w, ext, intr, err := r.runDepthPose(imgPath)
 	if err != nil {
 		return "", err
 	}
 	dmin, dmax := minMax(depth)
 	payload, err := json.Marshal(depthResult{
 		DepthW: w, DepthH: h,
 		DepthMin: dmin, DepthMax: dmax,
 		Extrinsics: ext, Intrinsics: intr,
 	})
 	if err != nil {
 		return "", fmt.Errorf("depth-anything-cpp: marshal: %w", err)
 	}
 	return string(payload), nil
 }
 // GenerateImage runs depth on req.Src and writes a normalised grayscale depth
 // PNG to req.Dst.
 func (r *DepthAnythingCpp) GenerateImage(req *pb.GenerateImageRequest) error {
 	if req.GetSrc() == "" {
 		return fmt.Errorf("depth-anything-cpp: GenerateImage requires src")
 	}
 	if req.GetDst() == "" {
 		return fmt.Errorf("depth-anything-cpp: GenerateImage requires dst")
 	}
 	imgPath, cleanup, err := materializeImage(req.GetSrc())
 	if err != nil {
 		return fmt.Errorf("depth-anything-cpp: %w", err)
 	}
 	defer cleanup()
 	depth, h, w, _, _, err := r.runDepthPose(imgPath)
 	if err != nil {
 		return err
 	}
 	return writeDepthPNG(req.GetDst(), depth, h, w)
 }
 // Depth is the typed Depth RPC. It runs the Depth Anything 3 pipeline on the
 // request's src image and fills a DepthResponse honoring the include_* flags and
 // exports: per-pixel metric depth + confidence (DualDPT) or depth + sky (mono),
 // camera extrinsics/intrinsics, an optional back-projected 3D point cloud and
 // glb/COLMAP exports. The src may be a filesystem path or a base64 payload.
 func (r *DepthAnythingCpp) Depth(in *pb.DepthRequest) (pb.DepthResponse, error) {
 	// Accumulate into locals and return a single composite literal at the end:
 	// returning a named pb.DepthResponse value would copy its embedded mutex
 	// (go vet copylocks).
 	if r.handle == 0 {
 		return pb.DepthResponse{}, fmt.Errorf("depth-anything-cpp: model not loaded")
 	}
 	if in.GetSrc() == "" {
 		return pb.DepthResponse{}, fmt.Errorf("depth-anything-cpp: Depth requires src")
 	}
 	imgPath, cleanup, err := materializeImage(in.GetSrc())
 	if err != nil {
 		return pb.DepthResponse{}, fmt.Errorf("depth-anything-cpp: %w", err)
 	}
 	defer cleanup()
 	// Dense per-pixel output + pose. Pass buffer pointers only for the
 	// requested maps so the native side can skip unrequested work; ext/intr
 	// must always point at 12/9 floats per the C ABI.
 	var (
 		h, w, isMetric      int32
 		depthPtr, confPtr   *float32
 		skyPtr              *float32
 		ext                 [12]float32
 		intr                [9]float32
 		pDepth, pConf, pSky **float32
 	)
 	if in.GetIncludeDepth() {
 		pDepth = &depthPtr
 	}
 	if in.GetIncludeConfidence() {
 		pConf = &confPtr
 	}
 	if in.GetIncludeSky() {
 		pSky = &skyPtr
 	}
 	rc := CapiDepthDense(r.handle, imgPath, &h, &w, pDepth, pConf, pSky, &ext[0], &intr[0], &isMetric)
 	if rc != 0 {
 		return pb.DepthResponse{}, fmt.Errorf("depth-anything-cpp: da_capi_depth_dense failed (rc=%d): %s", rc, r.lastError())
 	}
 	n := int(h) * int(w)
 	var (
 		depth, conf, sky      []float32
 		extrinsics, intrinsic []float32
 		numPoints             int32
 		points                []float32
 		pointColors           []byte
 		exportPaths           []string
 	)
 	if depthPtr != nil {
 		depth = copyFloats(depthPtr, n)
 		CapiFreeFloats(depthPtr)
 	}
 	if confPtr != nil {
 		conf = copyFloats(confPtr, n)
 		CapiFreeFloats(confPtr)
 	}
 	if skyPtr != nil {
 		sky = copyFloats(skyPtr, n)
 		CapiFreeFloats(skyPtr)
 	}
 	if in.GetIncludePose() {
 		extrinsics = append([]float32(nil), ext[:]...)
 		intrinsic = append([]float32(nil), intr[:]...)
 	}
 	// 3D point cloud (DualDPT / pose-capable models only).
 	if in.GetIncludePoints() {
 		var (
 			np     int32
 			xyzPtr *float32
 			rgbPtr *byte
 		)
 		if rc := CapiPoints(r.handle, imgPath, in.GetPointsConfThresh(), &np, &xyzPtr, &rgbPtr); rc != 0 {
 			return pb.DepthResponse{}, fmt.Errorf("depth-anything-cpp: da_capi_points failed (rc=%d): %s", rc, r.lastError())
 		}
 		numPoints = np
 		if xyzPtr != nil {
 			points = copyFloats(xyzPtr, int(np)*3)
 			CapiFreeFloats(xyzPtr)
 		}
 		if rgbPtr != nil {
 			pointColors = copyBytes(rgbPtr, int(np)*3)
 			CapiFreeBytes(rgbPtr)
 		}
 	}
 	// Exports (glb / colmap). They are written under in.Dst (a directory); a
 	// temp dir is used when Dst is empty.
 	if len(in.GetExports()) > 0 {
 		exportPaths, err = r.runExports(imgPath, in.GetDst(), in.GetExports())
 		if err != nil {
 			return pb.DepthResponse{}, err
 		}
 	}
 	return pb.DepthResponse{
 		Width:       w,
 		Height:      h,
 		Depth:       depth,
 		Confidence:  conf,
 		Sky:         sky,
 		Extrinsics:  extrinsics,
 		Intrinsics:  intrinsic,
 		NumPoints:   numPoints,
 		Points:      points,
 		PointColors: pointColors,
 		ExportPaths: exportPaths,
 		IsMetric:    isMetric != 0,
 	}, nil
 }
 // runExports writes the requested exports for imgPath into dstDir and returns
 // the written paths. Supported exports: "glb", "colmap".
 func (r *DepthAnythingCpp) runExports(imgPath, dstDir string, exports []string) ([]string, error) {
 	if dstDir == "" {
 		tmp, err := os.MkdirTemp("", "depth-anything-export-*")
 		if err != nil {
 			return nil, fmt.Errorf("depth-anything-cpp: mkdir export dir: %w", err)
 		}
 		dstDir = tmp
 	} else if err := os.MkdirAll(dstDir, 0o750); err != nil {
 		return nil, fmt.Errorf("depth-anything-cpp: mkdir %s: %w", dstDir, err)
 	}
 	var paths []string
 	for _, exp := range exports {
 		switch exp {
 		case "glb":
 			out := filepath.Join(dstDir, "pointcloud.glb")
 			if rc := CapiExportGlb(r.handle, imgPath, out); rc != 0 {
 				return nil, fmt.Errorf("depth-anything-cpp: da_capi_export_glb failed (rc=%d): %s", rc, r.lastError())
 			}
 			paths = append(paths, out)
 		case "colmap":
 			out := filepath.Join(dstDir, "colmap")
 			if err := os.MkdirAll(out, 0o750); err != nil {
 				return nil, fmt.Errorf("depth-anything-cpp: mkdir %s: %w", out, err)
 			}
 			if rc := CapiExportColmap(r.handle, imgPath, out, 1); rc != 0 {
 				return nil, fmt.Errorf("depth-anything-cpp: da_capi_export_colmap failed (rc=%d): %s", rc, r.lastError())
 			}
 			paths = append(paths, out)
 		default:
 			return nil, fmt.Errorf("depth-anything-cpp: unknown export %q (want glb|colmap)", exp)
 		}
 	}
 	return paths, nil
 }
 // copyFloats copies n float32 values from a C heap pointer into a fresh Go
 // slice so the C buffer can be freed afterwards.
 func copyFloats(p *float32, n int) []float32 {
 	if p == nil || n <= 0 {
 		return nil
 	}
 	src := unsafe.Slice(p, n)
 	out := make([]float32, n)
 	copy(out, src)
 	return out
 }
 // copyBytes copies n bytes from a C heap pointer into a fresh Go slice.
 func copyBytes(p *byte, n int) []byte {
 	if p == nil || n <= 0 {
 		return nil
 	}
 	src := unsafe.Slice(p, n)
 	out := make([]byte, n)
 	copy(out, src)
 	return out
 }
 // runDepthPose runs depth estimation, then pose recovery, on an image file via
 // two C-API calls. It returns the row-major depth map (length h*w), its
 // dimensions, the 3x4 extrinsics (12 floats) and the 3x3 intrinsics (9 floats).
 // For a nested metric model both calls run the full two-branch pipeline, so this
 // path infers twice; the typed Depth RPC (single da_capi_depth_dense call) is the
 // efficient path for nested models.
 func (r *DepthAnythingCpp) runDepthPose(imagePath string) (depth []float32, h, w int, ext [12]float32, intr [9]float32, err error) {
 	if r.handle == 0 {
 		err = fmt.Errorf("depth-anything-cpp: model not loaded")
 		return
 	}
 	var ch, cw int32
 	ptr := CapiDepthPath(r.handle, imagePath, &ch, &cw)
 	if ptr == nil {
 		err = fmt.Errorf("depth-anything-cpp: da_capi_depth_path failed: %s", r.lastError())
 		return
 	}
 	h, w = int(ch), int(cw)
 	n := h * w
 	if n > 0 {
 		src := unsafe.Slice(ptr, n)
 		depth = make([]float32, n)
 		copy(depth, src)
 	}
 	CapiFreeFloats(ptr)
 	if rc := CapiPosePath(r.handle, imagePath, &ext[0], &intr[0]); rc != 0 {
 		err = fmt.Errorf("depth-anything-cpp: da_capi_pose_path failed (rc=%d): %s", rc, r.lastError())
 		return
 	}
 	return
 }
 // lastError returns the context's last error string, or "" if none.
 func (r *DepthAnythingCpp) lastError() string {
 	if CapiLastError == nil || r.handle == 0 {
 		return ""
 	}
 	return CapiLastError(r.handle)
 }
 // materializeImage returns a filesystem path for an image argument that may be
 // either an existing path or a base64-encoded payload. When the input is
 // base64 it is decoded into a temp file; cleanup removes it (no-op for a path).
 func materializeImage(arg string) (path string, cleanup func(), err error) {
 	cleanup = func() {}
 	if _, statErr := os.Stat(arg); statErr == nil {
 		return arg, cleanup, nil
 	}
 	// Strip an optional data URL prefix (data:image/...;base64,<payload>).
 	b64 := arg
 	if i := indexComma(b64); i >= 0 && hasDataPrefix(b64) {
 		b64 = b64[i+1:]
 	}
 	data, decErr := base64.StdEncoding.DecodeString(b64)
 	if decErr != nil {
 		return "", cleanup, fmt.Errorf("image is neither an existing path nor valid base64: %w", decErr)
 	}
 	f, tErr := os.CreateTemp("", "depth-anything-*.img")
 	if tErr != nil {
 		return "", cleanup, tErr
 	}
 	if _, wErr := f.Write(data); wErr != nil {
 		_ = f.Close()
 		_ = os.Remove(f.Name())
 		return "", cleanup, wErr
 	}
 	_ = f.Close()
 	name := f.Name()
 	return name, func() { _ = os.Remove(name) }, nil
 }
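 // hasDataPrefix reports whether s starts with the "data:" URL scheme.
 func hasDataPrefix(s string) bool {
 	return strings.HasPrefix(s, "data:")
 }
 // indexComma returns the index of the first ',' in s, or -1 if there is none.
 func indexComma(s string) int {
 	return strings.IndexByte(s, ',')
 }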
 // writeDepthPNG min-max normalises a depth map and writes it as an 8-bit
 // grayscale PNG. The largest value maps to 255 (bright) and the smallest to 0
 // (dark), so near = bright holds only for inverse-depth-like outputs; for
 // metric depth the far regions are the bright ones.
 func writeDepthPNG(dst string, depth []float32, h, w int) error {
 	if h <= 0 || w <= 0 || len(depth) < h*w {
 		return fmt.Errorf("depth-anything-cpp: writeDepthPNG: bad dims h=%d w=%d len=%d", h, w, len(depth))
 	}
 	dmin, dmax := minMax(depth)
 	span := dmax - dmin
 	if span <= 0 || math.IsNaN(float64(span)) {
 		span = 1
 	}
 	img := image.NewGray(image.Rect(0, 0, w, h))
 	for y := 0; y < h; y++ {
 		for x := 0; x < w; x++ {
 			v := depth[y*w+x]
 			n := (v - dmin) / span // 0..1
 			if math.IsNaN(float64(n)) {
 				n = 0
 			}
 			if n < 0 {
 				n = 0
 			} else if n > 1 {
 				n = 1
 			}
 			img.Pix[y*img.Stride+x] = uint8(n * 255)
 		}
 	}
 	// dst is the gRPC-provided output path chosen by the LocalAI core (the
 	// intended write destination for the rendered depth map), not
 	// attacker-controlled input, so the variable path is expected here.
 	f, err := os.Create(dst) // #nosec G304
 	if err != nil {
 		return err
 	}
 	defer func() { _ = f.Close() }()
 	return png.Encode(f, img)
 }
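 // minMax returns the smallest and largest finite values in v. NaN and ±Inf
 // entries are skipped (including a non-finite first element, which the
 // comparisons below would otherwise never replace). It returns 0, 0 when v
 // holds no finite value.
 func minMax(v []float32) (mn, mx float32) {
 	seen := false
 	for _, x := range v {
 		if math.IsNaN(float64(x)) || math.IsInf(float64(x), 0) {
 			continue
 		}
 		if !seen {
 			mn, mx = x, x
 			seen = true
 			continue
 		}
 		if x < mn {
 			mn = x
 		}
 		if x > mx {
 			mx = x
 		}
 	}
 	return mn, mx
 }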
--- a/backend/go/depth-anything-cpp/main.go
+++ b/backend/go/depth-anything-cpp/main.go
@@ -1,67 +0,0 @@
 package main
 // main.go - entry point for the depth-anything-cpp gRPC backend.
 //
 // Dlopens libdepthanythingcpp-<variant>.so via purego at the path in
 // DEPTHANYTHING_LIBRARY (set by run.sh based on /proc/cpuinfo), registers the
 // da_capi_* C ABI symbols, then starts the gRPC server.
 import (
 	"flag"
 	"os"
 	"runtime"
 	"github.com/ebitengine/purego"
 	grpc "github.com/mudler/LocalAI/pkg/grpc"
 )
 var (
 	addr = flag.String("addr", "localhost:50051", "the address to listen on")
 )
 type LibFuncs struct {
 	FuncPtr any
 	Name    string
 }
 func main() {
 	// Get library name from environment variable, default to fallback
 	libName := os.Getenv("DEPTHANYTHING_LIBRARY")
 	if libName == "" {
 		if runtime.GOOS == "darwin" {
 			libName = "./libdepthanythingcpp-fallback.dylib"
 		} else {
 			libName = "./libdepthanythingcpp-fallback.so"
 		}
 	}
 	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
 	if err != nil {
 		panic(err)
 	}
 	libFuncs := []LibFuncs{
 		{&CapiLoad, "da_capi_load"},
 		{&CapiLoadNested, "da_capi_load_nested"},
 		{&CapiFree, "da_capi_free"},
 		{&CapiLastError, "da_capi_last_error"},
 		{&CapiDepthPath, "da_capi_depth_path"},
 		{&CapiFreeFloats, "da_capi_free_floats"},
 		{&CapiPosePath, "da_capi_pose_path"},
 		{&CapiDepthDense, "da_capi_depth_dense"},
 		{&CapiPoints, "da_capi_points"},
 		{&CapiFreeBytes, "da_capi_free_bytes"},
 		{&CapiExportGlb, "da_capi_export_glb"},
 		{&CapiExportColmap, "da_capi_export_colmap"},
 	}
 	for _, lf := range libFuncs {
 		purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
 	}
 	flag.Parse()
 	if err := grpc.StartServer(*addr, &DepthAnythingCpp{}); err != nil {
 		panic(err)
 	}
 }
--- a/backend/go/depth-anything-cpp/main_test.go
+++ b/backend/go/depth-anything-cpp/main_test.go
@@ -1,167 +0,0 @@
 package main
 // main_test.go - end-to-end smoke test for the depth-anything-cpp gRPC backend.
 //
 // Spawns the compiled depth-anything-cpp binary on a free local port, dials it
 // via gRPC, and exercises LoadModel + Predict against the test fixtures
 // downloaded by test.sh: the small (vits) f32 GGUF of Depth Anything 3 and a
 // real photo. Asserts that Predict returns a JSON payload with a positive
 // depth-map width/height.
 //
 // The spec Skip()s cleanly if its fixtures (the model, the test image, the
 // built binary, or the fallback .so) are missing, so the test target stays
 // usable on a fresh checkout / on CI runners where the model hasn't been
 // downloaded.
 import (
 	"context"
 	"encoding/base64"
 	"encoding/json"
 	"fmt"
 	"net"
 	"os"
 	"os/exec"
 	"path/filepath"
 	"testing"
 	"time"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
 	"google.golang.org/grpc"
 	"google.golang.org/grpc/credentials/insecure"
 )
 func TestDepth(t *testing.T) {
 	RegisterFailHandler(Fail)
 	RunSpecs(t, "depth-anything-cpp backend smoke suite")
 }
 // freePort grabs an ephemeral TCP port and immediately releases it so the
 // spawned backend can bind to it. There is a tiny TOCTOU window here but in
 // practice it's adequate for a smoke test on a quiet runner.
 func freePort() int {
 	l, err := net.Listen("tcp", "127.0.0.1:0")
 	Expect(err).ToNot(HaveOccurred(), "freePort listen")
 	port := l.Addr().(*net.TCPAddr).Port
 	Expect(l.Close()).To(Succeed())
 	return port
 }
 // startBackend spawns the depth-anything-cpp binary on the given port and waits
 // until it accepts TCP connections (up to 10s). It mirrors how main.go resolves
 // the purego library: the DEPTHANYTHING_LIBRARY env var points the dlopen at the
 // freshly built fallback .so. The returned cleanup func kills the process.
 func startBackend(port int) func() {
 	binary, err := filepath.Abs("./depth-anything-cpp")
 	Expect(err).ToNot(HaveOccurred())
 	if _, err := os.Stat(binary); err != nil {
 		Skip(fmt.Sprintf("backend binary not built: %s (run `make depth-anything-cpp` first)", binary))
 	}
 	libPath, err := filepath.Abs("./libdepthanythingcpp-fallback.so")
 	Expect(err).ToNot(HaveOccurred())
 	if _, err := os.Stat(libPath); err != nil {
 		Skip(fmt.Sprintf("fallback library not built: %s (run `make libdepthanythingcpp-fallback.so` first)", libPath))
 	}
 	addr := fmt.Sprintf("127.0.0.1:%d", port)
 	cmd := exec.Command(binary, "--addr", addr)
 	cmd.Env = append(os.Environ(), "DEPTHANYTHING_LIBRARY="+libPath)
 	cmd.Stdout = os.Stderr
 	cmd.Stderr = os.Stderr
 	Expect(cmd.Start()).To(Succeed())
 	cleanup := func() {
 		if cmd.Process != nil {
 			_ = cmd.Process.Kill()
 			_, _ = cmd.Process.Wait()
 		}
 	}
 	deadline := time.Now().Add(10 * time.Second)
 	for time.Now().Before(deadline) {
 		c, err := net.DialTimeout("tcp", addr, 200*time.Millisecond)
 		if err == nil {
 			_ = c.Close()
 			return cleanup
 		}
 		time.Sleep(200 * time.Millisecond)
 	}
 	cleanup()
 	Fail(fmt.Sprintf("backend did not become ready on %s within 10s", addr))
 	return func() {}
 }
 // loadTestImage reads the test image downloaded by test.sh and returns its
 // base64-encoded content (one of the wire formats accepted by Predict).
 func loadTestImage() string {
 	imgPath, err := filepath.Abs("test-data/test.jpg")
 	Expect(err).ToNot(HaveOccurred())
 	imgBytes, err := os.ReadFile(imgPath)
 	if err != nil {
 		Skip(fmt.Sprintf("test image not present: %s (run test.sh first)", imgPath))
 	}
 	return base64.StdEncoding.EncodeToString(imgBytes)
 }
 // dialBackend opens a gRPC client connection to the spawned backend.
 func dialBackend(port int) (pb.BackendClient, func()) {
 	addr := fmt.Sprintf("127.0.0.1:%d", port)
 	conn, err := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
 	Expect(err).ToNot(HaveOccurred())
 	return pb.NewBackendClient(conn), func() { _ = conn.Close() }
 }
 // modelPathOrSkip resolves the model file under ./test-models/ and Skip()s the
 // current spec if it's missing (not present on a fresh checkout / on CI runners
 // without the download).
 func modelPathOrSkip(name string) string {
 	modelDir, err := filepath.Abs("test-models")
 	Expect(err).ToNot(HaveOccurred())
 	modelPath := filepath.Join(modelDir, name)
 	if _, err := os.Stat(modelPath); err != nil {
 		Skip(fmt.Sprintf("model not present: %s (run test.sh first)", modelPath))
 	}
 	return modelPath
 }
 var _ = Describe("depth-anything-cpp backend", func() {
 	It("runs depth+pose against a known-good image", func() {
 		modelPath := modelPathOrSkip("depth-anything-small-f32.gguf")
 		imgB64 := loadTestImage()
 		port := freePort()
 		cleanup := startBackend(port)
 		defer cleanup()
 		client, closeConn := dialBackend(port)
 		defer closeConn()
 		ctx, cancel := context.WithTimeout(context.Background(), 20*time.Minute)
 		defer cancel()
 		loadResp, err := client.LoadModel(ctx, &pb.ModelOptions{
 			Model:     "depth-anything-small-f32.gguf",
 			ModelFile: modelPath,
 			Threads:   4,
 		})
 		Expect(err).ToNot(HaveOccurred(), "LoadModel")
 		Expect(loadResp.GetSuccess()).To(BeTrue(), "LoadModel reported failure: %s", loadResp.GetMessage())
 		// Predict runs depth+pose and returns the JSON depthResult in Reply.Message.
 		reply, err := client.Predict(ctx, &pb.PredictOptions{
 			Images: []string{imgB64},
 		})
 		Expect(err).ToNot(HaveOccurred(), "Predict")
 		var res depthResult
 		Expect(json.Unmarshal(reply.GetMessage(), &res)).To(Succeed(), "Predict returned non-JSON: %q", string(reply.GetMessage()))
 		Expect(res.DepthW).To(BeNumerically(">", 0), "depth width should be positive")
 		Expect(res.DepthH).To(BeNumerically(">", 0), "depth height should be positive")
 		_, _ = fmt.Fprintf(GinkgoWriter, "depth OK: %dx%d min=%.3f max=%.3f\n",
 			res.DepthW, res.DepthH, res.DepthMin, res.DepthMax)
 	})
 })
--- a/backend/go/depth-anything-cpp/nested_e2e_test.go
+++ b/backend/go/depth-anything-cpp/nested_e2e_test.go
@@ -1,64 +0,0 @@
 package main
 // nested_e2e_test.go - e2e smoke for the nested two-file metric model. Loads the
 // anyview branch as the main model and points the metric branch via the
 // "metric_model:<file>" option (exactly as the depth-anything-3-nested gallery
 // entry does), then exercises the typed Depth RPC and asserts a metric depth map.
 //
 // Skips cleanly unless both nested GGUFs are present under ./test-models/ and the
 // backend binary + fallback .so are built.
 import (
 	"context"
 	"fmt"
 	"path/filepath"
 	"time"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
 )
 var _ = Describe("depth-anything-cpp nested metric model", func() {
 	It("loads the two-file pair via the metric_model option and returns metric depth", func() {
 		anyviewPath := modelPathOrSkip("depth-anything-nested-anyview.gguf")
 		_ = modelPathOrSkip("depth-anything-nested-metric.gguf")
 		imgB64 := loadTestImage()
 		port := freePort()
 		cleanup := startBackend(port)
 		defer cleanup()
 		client, closeConn := dialBackend(port)
 		defer closeConn()
 		ctx, cancel := context.WithTimeout(context.Background(), 25*time.Minute)
 		defer cancel()
 		loadResp, err := client.LoadModel(ctx, &pb.ModelOptions{
 			Model:     "depth-anything-nested-anyview.gguf",
 			ModelFile: anyviewPath,
 			ModelPath: filepath.Dir(anyviewPath),
 			Options:   []string{"metric_model:depth-anything-nested-metric.gguf"},
 			Threads:   8,
 		})
 		Expect(err).ToNot(HaveOccurred(), "LoadModel(nested)")
 		Expect(loadResp.GetSuccess()).To(BeTrue(), "LoadModel reported failure: %s", loadResp.GetMessage())
 		resp, err := client.Depth(ctx, &pb.DepthRequest{
 			Src:          imgB64,
 			IncludeDepth: true,
 			IncludePose:  true,
 		})
 		Expect(err).ToNot(HaveOccurred(), "Depth(nested)")
 		Expect(resp.GetWidth()).To(BeNumerically(">", 0), "depth width")
 		Expect(resp.GetHeight()).To(BeNumerically(">", 0), "depth height")
 		Expect(resp.GetIsMetric()).To(BeTrue(), "nested output must be metric")
 		Expect(len(resp.GetDepth())).To(Equal(int(resp.GetWidth())*int(resp.GetHeight())), "dense depth length")
 		Expect(len(resp.GetExtrinsics())).To(Equal(12), "extrinsics 3x4")
 		Expect(resp.GetIntrinsics()[0]).To(BeNumerically(">", 0), "fx > 0")
 		_, _ = fmt.Fprintf(GinkgoWriter, "nested depth OK: %dx%d is_metric=%v fx=%.2f\n",
 			resp.GetWidth(), resp.GetHeight(), resp.GetIsMetric(), resp.GetIntrinsics()[0])
 	})
 })
Author	SHA1	Message	Date
Ettore Di Giacinto	b75ab7c3bb	chore(dllm): bump dllm.cpp pin to P5 head Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-13 00:00:42 +00:00
Ettore Di Giacinto	b40843cf62	feat(dllm): image input through the backend (multimodal C-ABI) Routes PredictOptions.Images (raw base64, the core convention) through dllm.cpp's probed multimodal entry points as data: URIs; the gemma4 renderer appends one engine-side <image> marker per image after the last user message (llama.cpp attachment convention; the template's content-parts branch is unreachable through the flattened pb shape). The engine expands markers to boi + soft*n + eoi and splices the vision-tower embeddings. Older libdllm.so without the mm symbols fails with an actionable error (Dlsym probe). DLLM_VERSION pin bumped to the engine's vision-capable commit. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-12 00:41:04 +00:00
Ettore Di Giacinto	c9c6040fe8	feat(dllm): default gallery entry on Q4_K_M; add Q8_0 variant Q4_K_M (~17 GB, GB10-validated: cosine 0.9862, coherent generation) is the friendlier default download than the 50 GB BF16; Q8_0 (~27 GB) is the higher-fidelity middle ground. Both descriptions carry the measured caveat that BF16 is ~5x faster per denoise step on BF16-native hardware, with a pointer to fetch it manually when it fits. sha256 values are the HF LFS oids. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 20:24:26 +00:00
Ettore Di Giacinto	8134d6db37	docs(dllm): record Q4_K_M validation and quantization guidance Q4_K_M validated on GB10: quality holds (cosine 0.9862, coherent generation, 19/48 stopper exit) but a forward step is ~5x slower than BF16 (27.5s vs 5.6s: native BF16 tensor cores vs K-quant MoE dequant). Guidance: prefer BF16 when it fits; Q4_K_M is the memory-bound option. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 19:22:02 +00:00
Ettore Di Giacinto	ad6d1dbc8b	feat(grpc): request cancellation for Go backends via the Cancellable capability The llama.cpp C++ backend aborts generation when its gRPC context is cancelled (grpc-server.cpp polls context->IsCancelled() in the result loops), but Go backends served by pkg/grpc never observed context cancellation: a disconnected client left the generation running to completion. Add an optional Cancellable capability; the server registers context.AfterFunc on the request/stream context (after the Locking block so queued requests cannot abort the current owner) covering both rich and legacy paths. dllm implements it: measured cancel latency ~10ms vs ~10s of orphaned generation, and follow-up requests no longer queue behind cancelled ones (~220ms vs ~9s in the e2e proof). Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 17:50:04 +00:00
Ettore Di Giacinto	eb61e1d770	chore(dllm): review fixes - file modes and build-matrix doc accuracy Drop the stray executable bit from the Go sources and Makefile (the sibling Go backends commit them 644; only run.sh/package.sh are executable), and correct two documentation claims found in the final branch review: cuda13-dllm is built for amd64 only (arm64 CUDA ships as the l4t flavor), and package.sh is the parakeet-cpp-style stub layout with no ldd walk. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 17:17:54 +00:00
Ettore Di Giacinto	aba9c4794a	docs(dllm): backend documentation and agents topic guide User docs: dllm section in text-generation (setup, eb_* options table, n_predict canvas rounding, enable_thinking metadata, honest GB10 throughput numbers). Agents guide: .agents/dllm-backend.md covering the purego C-ABI contract, serialization rules, template provenance, test layers, and known limitations. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 17:05:18 +00:00
Ettore Di Giacinto	04d6f66a9a	feat(dllm): diffusiongemma gallery entry and e2e coverage Gallery model diffusiongemma-26b-a4b-it (unsloth BF16 GGUF, sha256 verified against the HF LFS oid) with use_tokenizer_template and an honest experimental/throughput description. e2e: BACKEND_BINARY-mode specs boot the real gRPC backend with the tiny fixture model (templated chat + streaming); real-26B specs are separately env-gated. Adds an opt-in BACKEND_TEST_SEED knob so random-weight fixture models run the generic specs deterministically. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 17:05:18 +00:00
Ettore Di Giacinto	52b3b68cea	feat(dllm): backend packaging, gallery index, CI matrix Registers the dllm backend across every surface: backend gallery index (cpu amd64+arm64 with manifest merge, cuda13, l4t-cuda13 for GB10-class hardware; no darwin per engine scope), top-level Makefile targets, bump_deps pin tracking for DLLM_VERSION, and the curated known-backends list for /backends/known (pref-only: auto-detecting on .gguf would shadow llama-cpp). Note: image builds and the nightly bump leg stay red until github.com/mudler/dllm.cpp is published (planned at merge time). Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 17:05:18 +00:00
Ettore Di Giacinto	99184809fa	feat(dllm): rich gRPC backend with ChatDelta streaming Implements PredictRich/PredictStreamRich (legacy methods delegate), TokenizeString, and Load over the purego binding. A single worker goroutine serializes all C calls per the dllm.cpp one-generate-per-ctx contract (cancel is the documented exception); an RWMutex guards Free against in-flight requests. Under use_tokenizer_template the gemma4 renderer and streaming parser own templating and ChatDelta extraction; raw-prompt mode passes through verbatim. enable_thinking is opt-in via request metadata (the gemma4 template treats thinking as opt-in). Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 16:14:37 +00:00
Ettore Di Giacinto	294c04ae2f	feat(dllm): gemma4 streaming parser emitting ChatDeltas Fragment-safe state machine (content / channel header / thought / tool-call / done) classifying model output into content, reasoning_content and tool_calls deltas. Tool-call payload decoder is a non-partial port of vLLM's gemma4 parser grammar; ~25 of its test cases are ported with citations, plus a 2-split invariance property over every byte position. Recursion depth-capped against model-generated deep nesting; marker constants shared with the renderer. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 15:55:27 +00:00
Ettore Di Giacinto	778f85c2a0	feat(dllm): purego backend scaffold over the dllm.cpp C-ABI Binds the 9-symbol flat C-ABI of dllm.cpp (DiffusionGemma engine) via purego: typed wrappers with correct string ownership (malloc'd returns freed via dllm_capi_free_string, borrowed last_error never freed), once-allocated stream-callback trampolines, and a gated Ginkgo binding smoke against the tiny fixture model. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 14:50:39 +00:00
Ettore Di Giacinto	af0db1419c	test(http): make the suite listen port configurable The core/http specs hardcoded 127.0.0.1:9090 in ~70 call sites, so the pre-commit coverage gate fails on any machine where an unrelated service holds 9090. Centralize the address in the suite file behind LOCALAI_TEST_HTTP_PORT (default unchanged: 9090). Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 14:28:39 +00:00