fix: remove deprecated cosign bundle flag from backend merge workflow

Agent-Logs-Url: https://github.com/mudler/LocalAI/sessions/4207dabc-14ec-4655-9594-487338977fcf
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Initial plan
2026-06-27 09:57:14 -04:00 · 2026-05-22 22:16:44 +00:00 · 2026-05-22 22:13:44 +00:00
1193 changed files with 8194 additions and 111465 deletions
--- a/.agents/adding-backends.md
+++ b/.agents/adding-backends.md
@@ -102,24 +102,6 @@ Multi-arch backends are NOT a single matrix entry with `platforms: 'linux/amd64,

 Entries whose `dockerfile` is `./backend/Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}` must also set a `builder-base-image` field pointing at a prebuilt base from `quay.io/go-skynet/ci-cache:base-grpc-*` (CI builds these via `.github/workflows/base-images.yml`). The mapping is by `(build-type, platforms)` — see existing entries for the pattern. CI uses these prebuilt bases to skip the gRPC compile (~25–35 min cold). Local `make backends/<name>` ignores `builder-base-image` and uses the from-source path inside the Dockerfile, so you don't need quay access for local builds.

-### Cover every OS the project supports (Linux **and** Darwin)
-
-`.github/backend-matrix.yml` has two matrices, and they are the source of truth for which OS a backend ships on:
-
-- `include:` — the **Linux** matrix (x86_64 + arm64; CPU and CUDA / ROCm / SYCL / Vulkan).
-- `includeDarwin:` — the **macOS / Apple Silicon** matrix (arm64; Metal where the engine supports it, otherwise a native arm64 CPU build).
-
-**A new backend must target every OS it can build for — do not ship Linux-only by default.** A backend that appears only under `include:` is silently unavailable on macOS even when its code would run there. Most C/C++/GGML engines build on Darwin out of the box (ggml defaults `GGML_METAL=ON` on Apple, so a plain build is Metal-enabled), and many Python backends do too (CPU / MPS wheels). If a backend genuinely cannot support an OS (e.g. CUDA-only, no CPU variant), state that in the PR description instead of omitting it silently.
-
-Wiring a backend into `includeDarwin:` is more than the matrix entry:
-
-1. **`includeDarwin:` entry** — `tag-suffix: "-metal-darwin-arm64-<backend>"`, `build-type: "metal"`, `lang: "go"` for go+ggml backends; omit `build-type` for the bespoke C++ ones (llama-cpp / ds4 / privacy-filter). Match an existing entry of the same shape.
-2. **`backend/index.yaml`** — add `metal:` to the backend's `capabilities` map (main and `-development`) and concrete `metal-<backend>` / `metal-<backend>-development` image entries pointing at the `-metal-darwin-arm64-<backend>` images.
-3. **C/C++ backends only** — add an `inferBackendPathDarwin` case in `scripts/changed-backends.js` returning `backend/cpp/<backend>/` (the generic fallthrough assumes `backend/<lang>/`, which is wrong for a C++ source tree driven with `lang: go`), and give `run.sh` a Darwin branch that exports `DYLD_LIBRARY_PATH` instead of `LD_LIBRARY_PATH`. If the build is bespoke (single `grpc-server` + dylib bundling), model it on `scripts/build/ds4-darwin.sh` and add a `backends/<backend>-darwin` make target plus a gated step in `.github/workflows/backend_build_darwin.yml`.
-4. **C++ proto gotcha** — if the backend compiles the generated gRPC/protobuf in a separate CMake target (e.g. `hw_grpc_proto`), that target must link `protobuf::libprotobuf` + `gRPC::grpc++` so the Homebrew include dirs propagate; otherwise macOS fails with `google/protobuf/runtime_version.h not found` (Linux hides this because apt headers sit in `/usr/include`).
-
-The CI path filter only builds a backend on a PR when a file under its directory changes, so a darwin-only YAML edit builds nothing — touch a file under `backend/<lang>/<backend>/` (a one-line comment is enough) in the same PR.
-
 ## 3. Add Backend Metadata to `backend/index.yaml`

 **Step 3a: Add Meta Definition**
@@ -216,34 +198,12 @@ docker-build-backends: ... docker-build-<backend-name>
 - If the backend is in `backend/python/<backend-name>/` but uses `.` as context in the workflow file, use `.` context
 - Check similar backends to determine the correct context

-## Documenting the backend (README + docs)
-
-A backend is not "added" until it is discoverable. Update the user-facing docs:
-
-- **`docs/content/features/backends.md`** - add the backend to the right
-  category in the "LocalAI supports various types of backends" list (and add a
-  new category if it introduces a new modality, e.g. sound classification).
-- If the backend introduces a **new API surface** (a new endpoint or a realtime
-  capability), document it under `docs/content/` where its area lives (audio,
-  vision, etc.) and follow the api-endpoints checklist in
-  [api-endpoints-and-auth.md](api-endpoints-and-auth.md).
-
-**If the backend is a native C/C++/GGML engine created and maintained by the
-LocalAI team** (a from-scratch port like `parakeet.cpp`, `ced.cpp`,
-`vibevoice.cpp`, `rf-detr.cpp`, not a wrapper around a third-party runtime), it
-ALSO belongs in the top-level **`README.md`** table under "native C/C++/GGML
-engines ... developed and maintained by the LocalAI project itself". Add a row
-linking the upstream engine repo with a one-line description. This is the
-project's showcase of its own engines; a new in-house backend that is missing
-from it is a documentation bug.
-
 ## 5. Verification Checklist

 After adding a new backend, verify:

 - [ ] Backend directory structure is complete with all necessary files
 - [ ] Build configurations added to `.github/backend-matrix.yml` for all desired platforms (per-arch entries with `platform-tag` for multi-arch; `builder-base-image` for llama-cpp / ik-llama-cpp / turboquant)
-- [ ] **OS coverage considered**: added to `includeDarwin:` (macOS/Apple Silicon) if the backend can build there — with the `backend/index.yaml` `metal:` capability + `metal-<backend>` image entries, a `run.sh` Darwin/DYLD branch and `inferBackendPathDarwin` case for C++ backends — or the PR explains why an OS is unsupported. Do not ship Linux-only by default.
 - [ ] Meta definition added to `backend/index.yaml` in the `## metas` section
 - [ ] Image entries added to `backend/index.yaml` for all build variants (latest + development)
 - [ ] Tag suffixes match between workflow file and index.yaml
@@ -251,8 +211,6 @@ After adding a new backend, verify:
 - [ ] No YAML syntax errors (check with linter)
 - [ ] No Makefile syntax errors (check with linter)
 - [ ] Follows the same pattern as similar backends (e.g., if it's a transcription backend, follow `faster-whisper` pattern)
-- [ ] Documented: added to the category list in `docs/content/features/backends.md` (and any new endpoint/realtime capability documented under `docs/content/`)
-- [ ] If it is an in-house native C/C++/GGML engine, added to the maintained-engines table in the top-level `README.md`

 ## Bundling runtime shared libraries (`package.sh`)

--- a/.agents/backend-signing.md
+++ b/.agents/backend-signing.md
@@ -49,12 +49,6 @@ cosign sign --yes --recursive \
 Sign by digest, never by tag — signing by tag binds the signature to
 whatever the tag points at *now*, and a subsequent tag push orphans it.

-`--registry-referrers-mode=oci-1-1` is still gated behind
-`COSIGN_EXPERIMENTAL=1` in cosign v2.4.x (set at the job env level in
-`backend_merge.yml`). Re-evaluate when bumping the pinned cosign release
-— newer versions are expected to graduate this flag and the env var can
-then be dropped.
-
 `backend_build_darwin.yml` builds and pushes single-arch darwin images
 that bypass the manifest-list merge. If/when those entries get a gallery
 `verification:` policy, the equivalent cosign step has to land there
--- a/.agents/building-and-testing.md
+++ b/.agents/building-and-testing.md
@@ -15,35 +15,3 @@ Let's say the user wants to build a particular backend for a given platform. For
 - Unless the user specifies that they want you to run the command, then just print it because not all agent frontends handle long running jobs well and the output may overflow your context
 - The user may say they want to build AMD or ROCM instead of hipblas, or Intel instead of SYCL or NVIDIA insted of l4t or cublas. Ask for confirmation if there is ambiguity.
 - Sometimes the user may need extra parameters to be added to `docker build` (e.g. `--platform` for cross-platform builds or `--progress` to view the full logs), in which case you can generate the `docker build` command directly.
-
-## Test coverage gate
-
-The core Go suites (`./pkg`, `./core`, plus the in-process integration suite `./tests/e2e`) are covered by a **strict, monotonic coverage ratchet**:
-
-- `make test-coverage` — runs the suites with `covermode=atomic` instrumentation and writes a merged profile to `coverage/coverage.out`. Uses the same prerequisites as `make test`.
-  - **`--coverpkg` (`COVERAGE_COVERPKG = core/...,pkg/...`):** coverage is attributed to the core+pkg packages, not just the package under test. This is what lets the in-process `tests/e2e` suite (which drives the real HTTP server over loopback via `application.New`) credit the `core/http/endpoints/...` handlers it exercises — folding it in roughly doubled endpoint coverage (e.g. `endpoints/openai` 13.6% → 52%). The denominator is therefore *all* of `core`+`pkg` (minus generated proto, dropped via `COVERAGE_EXCLUDE_RE`), so the number isn't comparable to a plain per-package figure.
-  - **Integration suites (`COVERAGE_E2E_ROOTS = ./tests/e2e`)** run non-recursively (excludes `tests/e2e/distributed`, which needs containers) with `--label-filter=!real-models` (those need a downloaded model) against the mock backend built by `prepare-test`. `tests/integration` is deliberately excluded — it needs `make backends/local-store`, which the coverage CI job doesn't build.
-  - **Flake note:** folding integration tests into a *strict* gate means a hard e2e failure (or a spec that silently stops running) can fail the coverage gate, not just the test. `--flake-attempts` absorbs transient retryable failures; covermode=atomic keeps line coverage deterministic otherwise.
-  - **Why one ginkgo run per root (`scripts/run-coverage.sh`):** passing several recursive roots to a *single* ginkgo invocation (e.g. `ginkgo -r ./pkg ./core`) only merges **one** root's coverprofile into `--output-dir`/`--coverprofile` — the others are silently dropped. Verified with ginkgo 2.29.0: `-r ./pkg ./core` yields only `./pkg` coverage, while `-r ./core` alone yields all 34 core packages. So the script runs each root separately and concatenates the (disjoint) profiles. Don't "simplify" it back to a single multi-root invocation — that's how `core/` (including all of `core/http`, ~7.4k statements) silently vanished from the number before.
-  - **Build tags (`COVERAGE_TAGS`, passed via `GINKGO_TAGS`):** defaults to `debug auth`. The `auth` tag is required to compile the real (sqlite-backed) auth implementation and its ~150 `//go:build auth` tests — without it those files aren't built, the tests don't run, and the gate scores auth against a stub (~3.7% instead of ~38%). If you add new tag-gated tests, extend `COVERAGE_TAGS` or they won't count (and likely won't run in CI at all).
-- `make test-coverage-check` — runs `test-coverage`, then `scripts/coverage-check.sh` fails the build if total coverage is **below** the committed baseline in `coverage-baseline.txt`. The Linux job in `.github/workflows/test.yml` runs this instead of `make test`.
-- `make test-coverage-baseline` — regenerates and overwrites `coverage-baseline.txt` from the current run.
-- `make install-hooks` — sets `core.hooksPath` to the versioned `.githooks/`, whose `pre-commit` runs checks scoped to what's staged: Go changes → `make lint` + `make test-coverage-check`; `core/http/react-ui/` changes → `make test-ui-coverage-check` (Playwright e2e + UI coverage gate). A commit touching neither is skipped; bypass with `git commit --no-verify`. The hook resolves golangci-lint's new-from base to `upstream/master` → `origin/master` → `master`, so it works from a fork clone where `origin/master` is stale (passed to `make lint` via `LINT_NEW_FROM`).
-
-### React UI coverage
-
-The React UI (`core/http/react-ui/`) has **no component/unit tests** — its only tests are the Playwright e2e specs in `e2e/`, which run against the real app served by `tests/e2e-ui/ui-test-server` (the dist is `//go:embed`ed, so the server is rebuilt per coverage run). Those specs do genuinely exercise the UI (clicks, `fill`, `setInputFiles`, `getByRole`/`getByText`, visibility/value assertions).
-
-- `make test-ui-coverage` — builds an istanbul-instrumented bundle (`COVERAGE=true`, via `vite-plugin-istanbul` with `forceBuildInstrument: true` — the plugin skips production builds otherwise), re-embeds it into `ui-test-server` (the dist is `//go:embed`ed), runs the Playwright specs, and writes an `nyc` report to `core/http/react-ui/coverage/`. The specs import `{ test, expect }` from `e2e/coverage-fixtures.js` (re-exports Playwright's, plus harvests `window.__coverage__` into `.nyc_output/` after each test). Instrumentation is off unless `COVERAGE=true`, so dev/prod builds and plain `make test-ui-e2e` are unaffected (the fixture no-ops when `window.__coverage__` is absent).
-- **Browser:** the flake dev shell ships `chromium` and exports `PLAYWRIGHT_CHROMIUM_PATH`; `playwright.config.js` uses it via `launchOptions.executablePath`, and the Makefile skips `playwright install` when it's set. This avoids Playwright's downloaded browser, which can't resolve system libs (`libglib-2.0`, …) on NixOS. In CI (no `PLAYWRIGHT_CHROMIUM_PATH`) the Makefile falls back to `playwright install --with-deps chromium`.
-- The app is a React SPA, so coverage accumulates across in-app navigation within a test; a full `page.goto`/reload resets it.
-- `.nycrc.json` uses `all: true`, so **every `src/**` file is in the report**, including 0%-coverage ones — that's how you spot features with no test at all (sort the HTML report or `coverage-summary.json` by line% ascending).
-- **UI coverage gate:** `make test-ui-coverage-check` runs the suite then `scripts/ui-coverage-check.sh`, failing if total line coverage drops more than `UI_COVERAGE_TOLERANCE` below `core/http/react-ui/coverage-baseline.txt`. `make test-ui-coverage-baseline` regenerates the baseline. Runs in CI (`tests-ui-e2e.yml`) and pre-commit on `core/http/react-ui/` changes.
-- **Why it has a tolerance (unlike the strict Go gate):** UI e2e coverage is *non-deterministic*. Specs that assert on state and end while async/lazy render work is still in flight collect those lines only when the render beats the coverage teardown — so the total drifts with machine speed/load (a fast local box reads higher than a slow CI runner), diffusely across many specs. The tolerance absorbs that drift, so set the baseline *below* the slow-CI floor, never to a fast-local `make test-ui-coverage-baseline` number, or CI flaps.
-- **Raising coverage is cheap:** a *render-smoke* spec (navigate to a route, assert its header renders) mounts a lazy page and runs its full render + initial effects, capturing most of its lines in a few lines of test — see `e2e/page-render-smoke.spec.js`. Auth is disabled in the test server (`isAdmin=true`), so `RequireAdmin`/`RequireFeature` routes render without a mock. The most *deterministic* win is removing a race: make a spec `await` a rendered element before ending (see `e2e/agents.spec.js` → AgentCreate) so its lines count every run.
-
-Rules (both gates):
-- **Install the hooks:** `make install-hooks` once per clone so lint + coverage run pre-commit. Don't lean on CI for what the hook catches.
-- **Don't work around the gate:** never `git commit --no-verify`, and never hand-lower a baseline or widen a tolerance to turn a red gate green. The ratchet only moves up.
-- If a change drops coverage, **add tests** (sort `coverage-summary.json` by line% ascending to find untested code) rather than editing the baseline. When coverage legitimately rises, commit the regenerated baseline (`make test-coverage-baseline` / `test-ui-coverage-baseline`).
-- The Go gate is **strict — no tolerance**; `covermode=atomic` keeps it deterministic. The UI gate keeps a small tolerance only because its e2e coverage isn't.
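Reviewer note: the two gate rules described in the removed section (strict for Go, tolerance-based for the UI) reduce to one comparison. The sketch below is only an illustration of that rule, not the contents of `scripts/coverage-check.sh` or `scripts/ui-coverage-check.sh`; the function name `gatePasses` and the sample percentages are hypothetical.

```go
package main

import "fmt"

// gatePasses reports whether a coverage run satisfies the committed baseline.
// A tolerance of 0 models the strict Go gate (fail when below the baseline);
// a positive tolerance models the UI gate (fail only when coverage drops more
// than the tolerance below the baseline). Hypothetical helper for illustration.
func gatePasses(current, baseline, tolerance float64) bool {
	return current >= baseline-tolerance
}

func main() {
	// Strict gate: equal to the baseline passes, anything lower fails.
	fmt.Println(gatePasses(52.0, 52.0, 0))
	fmt.Println(gatePasses(51.5, 52.0, 0))

	// Tolerant gate: a drop inside the tolerance passes, a larger one fails.
	fmt.Println(gatePasses(51.5, 52.0, 1.0))
	fmt.Println(gatePasses(50.5, 52.0, 1.0))
}
```

With the ratchet policy quoted above, the baseline argument only ever increases between commits; the tolerance stays fixed.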
--- a/.agents/coding-style.md
+++ b/.agents/coding-style.md
@@ -50,17 +50,6 @@ Do not mix styles within a package. If you are extending tests in a package that

 This is enforced by `golangci-lint` via the `forbidigo` linter (see `.golangci.yml`); calls like `t.Errorf` / `t.Fatalf` / `t.Run` / `t.Skip` / `t.Logf` are flagged. Run `make lint` locally before submitting; the same check runs in CI (`.github/workflows/lint.yml`).

-## Outbound HTTP
-
-All outbound HTTP must go through `github.com/mudler/LocalAI/pkg/httpclient` rather than the standard library's default client. Use `httpclient.New(...)` (no body deadline — safe for streaming/SSE) or `httpclient.NewWithTimeout(d, ...)` (simple request/response). Both **refuse redirects by default** and set a TLS 1.2 floor.
-
-The reason is GHSA-3mj3-57v2-4636: the std default client follows redirects, and on a *cross-host* redirect Go forwards custom credential headers (e.g. Anthropic's `x-api-key`) to the redirect target, leaking the secret. `httpclient` fails closed instead.
-
-- Need to follow redirects (download CDNs, registry blobs, GitHub asset URLs)? Pass `httpclient.WithFollowRedirects()` — it still strips credential headers on any cross-host hop.
-- Have a custom transport (IP-pinned dialer, HTTP/2 tuning, a credential-injecting `RoundTripper`)? Pass `httpclient.WithTransport(rt)`, basing the transport on `httpclient.HardenedTransport()` to keep the TLS floor. Handed a `*http.Client` by a library? `httpclient.Harden(c)` applies the policy in place.
-
-This is enforced by `forbidigo` (see `.golangci.yml`): `http.DefaultClient` and `http.Get`/`Post`/`PostForm`/`Head` are flagged. The `&http.Client{}` composite literal can't be matched precisely by forbidigo without also flagging legitimate `*http.Client` type references, so that form is caught by review — don't construct raw clients.
-
 ## Documentation

 The project documentation is located in `docs/content`. When adding new features or changing existing functionality, it is crucial to update the documentation to reflect these changes. This helps users understand how to use the new capabilities and ensures the documentation stays relevant.
--- a/.agents/ds4-backend.md
+++ b/.agents/ds4-backend.md
@@ -44,39 +44,6 @@ maps to `DS4_THINK_HIGH`. We pass the chosen mode to `ds4_chat_append_assistant_
 via `ModelOptions.Options[] = "kv_cache_dir:/some/path"`. Format is **our own** -
 NOT bit-compatible with ds4-server's KVC files (interop is a follow-up plan).

-## Engine options (LoadModel)
-
-`LoadModel` maps `ModelOptions.Options[]` (`"key:value"`, from model-YAML
-`options:`) onto `ds4_engine_options` through a **declarative table**
-(`kEngineOptSpecs` + `apply_engine_option` in `grpc-server.cpp`). The struct is
-plain C with no reflection, so the field set is enumerated once in the table;
-adding a future engine knob is a one-line table row, not a new branch. Unknown
-keys are ignored (back-compat). A bare flag (`ssd_streaming` with no value)
-means `true`. Path-type values (`mtp_path`, `expert_profile_path`,
-`directional_steering_file`) resolve **relative to the model directory**, so a
-gallery entry can reference a companion file it downloaded by bare filename;
-absolute values pass through. `ds4_role` / `ds4_layers` / `ds4_listen` /
-`ds4_route_timeout` / `kv_cache_dir` keep their dedicated handling (validation
-+ coordinator wiring) and are not in the table.
-
-Wired keys: `mtp_path`, `mtp_draft`, `mtp_margin`, `prefill_chunk`,
-`power_percent`, `warm_weights`, `quality`, `ssd_streaming`,
-`ssd_streaming_cold`, `ssd_streaming_preload_experts`,
-`ssd_streaming_cache_experts` (count or `NGB`, sets both experts+bytes via
-`ds4_parse_streaming_cache_experts_arg`), `simulate_used_memory` (`NGB` via
-`ds4_parse_gib_arg`), `expert_profile_path`, `directional_steering_file`,
-`directional_steering_attn`, `directional_steering_ffn`.
-
-## SSD streaming (running models larger than RAM)
-
-ds4's **SSD streaming** keeps non-routed weights resident and streams routed MoE
-experts from the GGUF on cache misses, turning "does it fit in RAM" into a speed
-spectrum. **Metal (Darwin) only** - it is a no-op on CUDA/CPU. Enable with
-`options: ["ssd_streaming"]`; size the routed-expert cache with
-`ssd_streaming_cache_experts:NGB` (omit for ds4's automatic 80%-of-working-set
-budget). Gallery entries built on this: `deepseek-v4-flash-q4-ssd` (153 GB Flash
-on a 128 GB Mac) and `deepseek-v4-pro-q2-ssd` (433 GB Pro, experimental).
-
 ## Build matrix

 | Build | Where | Notes |
@@ -101,34 +68,6 @@ go test -count=1 -timeout=30m -v ./tests/e2e-backends/...

 CI does not load the model; the suite is opt-in via env vars.

-## Distributed mode
-
-ds4 supports **layer-split** distributed inference (a model too big for one host,
-split by transformer layer; the GGUF must be present on every machine, each loads
-only its slice). Topology is **inverted** vs llama.cpp: the coordinator listens,
-workers dial in.
-
-- **`ds4-worker` binary**: built and packaged next to `grpc-server` (`package.sh`
-  copies it into `package/`). Links the same engine objects plus `ds4_distributed.o`;
-  **no gRPC/protobuf dependency** (speaks ds4's own TCP transport), so it builds
-  even where `grpc-server` can't. Runs the worker serving loop (`ds4_dist_run`).
-- **Coordinator wiring**: the ds4 `grpc-server` acts as coordinator when `LoadModel`
-  `ModelOptions.Options` (from model-YAML `options:`) carry:
-  - `ds4_role:coordinator` (enables distributed mode; absent → single-node, back-compat)
-  - `ds4_layers:0:19` (coordinator's own slice, inclusive; `N:output` includes the head)
-  - `ds4_listen:0.0.0.0:1234` (address workers dial into)
-  - `ds4_route_timeout:60` (optional; seconds Predict/PredictStream wait for the route
-    to form before returning gRPC `UNAVAILABLE`; default 60)
-- **Worker CLI**: `local-ai worker ds4-distributed -- <ds4-worker args>` resolves the
-  ds4 backend and execs the packaged `ds4-worker` (raw passthrough), e.g.
-  `--role worker --model /models/ds4flash.gguf --layers 20:output --coordinator <host> 1234`.
-
-Opt-in e2e in `tests/e2e-backends/backend_test.go`, gated by
-`BACKEND_TEST_DS4_DISTRIBUTED=1` (plus `BACKEND_TEST_DS4_WORKER_BINARY`,
-`BACKEND_TEST_DS4_WORKER_LAYERS`, `BACKEND_TEST_DS4_COORDINATOR_LAYERS`,
-`BACKEND_TEST_DS4_LISTEN`). Design spec:
-`docs/superpowers/specs/2026-05-30-ds4-distributed-inference-design.md`.
-
 ## Importer

 `core/gallery/importers/ds4.go` (`DS4Importer`) auto-detects ds4 weights by
--- a/.docker/install-base-deps.sh
+++ b/.docker/install-base-deps.sh
@@ -70,12 +70,6 @@ if [ "${BUILD_TYPE:-}" = "vulkan" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; t
        git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
        ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
        clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
-    # Mesa Vulkan ICD drivers (ANV/RADV/lavapipe + Arm SoC) and their ICD
-    # manifests. The LunarG SDK below only provides the loader and shader
-    # tooling, not hardware drivers — without Mesa the packaged Vulkan backend
-    # would ship a loader that finds no GPU. package-gpu-libs.sh bundles these
-    # .so files plus their deps into the backend so it stays self-contained.
-    apt-get install -y mesa-vulkan-drivers libdrm2
    if [ "amd64" = "${TARGETARCH:-}" ]; then
        wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz"
        tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz
--- a/.docker/llama-cpp-compile.sh
+++ b/.docker/llama-cpp-compile.sh
@@ -17,29 +17,19 @@ if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
  rm -rf /LocalAI/backend/cpp/llama-cpp-*-build
 fi

-cd /LocalAI/backend/cpp/llama-cpp
-if [ -z "${BUILD_TYPE:-}" ]; then
-  # Pure CPU image (BUILD_TYPE empty): one build with ggml CPU_ALL_VARIANTS replaces the
-  # per-microarch binaries (x86: avx/avx2/avx512/fallback; arm64: armv8.x/armv9.x). ggml
-  # dlopens the best libggml-cpu-*.so at runtime by probing host CPU features.
-  #
-  # arm64: the CPU_ALL_VARIANTS table includes armv9.2 SME variants whose -march=...+sme is
-  # rejected by the Ubuntu 24.04 default gcc-13. gcc-14 accepts it, so build the arm64
-  # variants with it (the host never *selects* SME unless it has it, but every variant must
-  # still compile).
-  if [ "${TARGETARCH}" = "arm64" ]; then
-    apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
-    export CC=gcc-14 CXX=g++-14
-  fi
-  make llama-cpp-cpu-all
-else
-  # GPU build (cublas/hipblas/sycl/vulkan/...): the accelerator does the compute, so a
-  # single fallback CPU build is enough - no per-microarch CPU variants needed. (This also
-  # keeps the heavy GPU backend compile from also building the whole CPU variant matrix,
-  # and avoids the gcc-14 apt step on GPU base images such as nvidia l4t.)
+if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
+  cd /LocalAI/backend/cpp/llama-cpp
  make llama-cpp-fallback
+  make llama-cpp-grpc
+  make llama-cpp-rpc-server
+else
+  cd /LocalAI/backend/cpp/llama-cpp
+  make llama-cpp-avx
+  make llama-cpp-avx2
+  make llama-cpp-avx512
+  make llama-cpp-fallback
+  make llama-cpp-grpc
+  make llama-cpp-rpc-server
 fi
-make llama-cpp-grpc
-make llama-cpp-rpc-server

 ccache -s || true
--- a/.docker/turboquant-compile.sh
+++ b/.docker/turboquant-compile.sh
@@ -19,21 +19,17 @@ fi

 cd /LocalAI/backend/cpp/turboquant

-if [ -z "${BUILD_TYPE:-}" ]; then
-  # Pure CPU image: one ggml CPU_ALL_VARIANTS build replaces the per-microarch binaries.
-  # arm64: the armv9.2 SME variants need gcc-14 (gcc-13 rejects +sme).
-  if [ "${TARGETARCH}" = "arm64" ]; then
-    apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
-    export CC=gcc-14 CXX=g++-14
-  fi
-  make turboquant-cpu-all
-else
-  # GPU build (cublas/hipblas/sycl/vulkan/...): single fallback CPU build, the accelerator
-  # does the compute. Keeps the GPU compile from also building the CPU variant matrix and
-  # avoids the gcc-14 apt step on GPU base images such as nvidia l4t.
+if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
  make turboquant-fallback
+  make turboquant-grpc
+  make turboquant-rpc-server
+else
+  make turboquant-avx
+  make turboquant-avx2
+  make turboquant-avx512
+  make turboquant-fallback
+  make turboquant-grpc
+  make turboquant-rpc-server
 fi
-make turboquant-grpc
-make turboquant-rpc-server

 ccache -s || true
--- a/.dockerignore
+++ b/.dockerignore
@@ -4,7 +4,6 @@
 .devcontainer
 models
 backends
-volumes
 examples/chatbot-ui/models
 backend/go/image/stablediffusion-ggml/build/
 backend/go/*/build
@@ -22,36 +21,3 @@ __pycache__
 # backend virtual environments
 **/venv
 backend/python/**/source
-
-# In-place llama.cpp clone + per-variant build copies. The Makefile
-# clones llama.cpp itself at the pinned LLAMA_VERSION; if a stale
-# local checkout is COPY'd into the image, the `llama.cpp:` target
-# sees the directory and skips re-cloning, so grpc-server.cpp ends
-# up compiled against whatever (likely older) commit the host had.
-backend/cpp/llama-cpp/llama.cpp
-backend/cpp/llama-cpp-*-build
-
-# privacy-filter: same in-place pattern. The Makefile fetches privacy-filter.cpp
-# at the pinned commit (or symlinks a PRIVACY_FILTER_SRC checkout for local dev).
-# A stale dir/symlink COPY'd into the image makes the clone step fail (dangling
-# symlink) or compile against the wrong commit, so keep host build state out.
-backend/cpp/privacy-filter/privacy-filter.cpp
-backend/cpp/privacy-filter/build
-backend/cpp/privacy-filter/grpc-server
-backend/cpp/privacy-filter/package
-
-# Rust backend build output (sources are tracked; target/ is generated)
-backend/rust/*/target
-
-# Local-only artifacts that bloat the build context but the image never needs.
-# Saved image tarballs, locally-installed backends, the host-built binary, and
-# assorted tool/scratch dirs. None of these are git-tracked.
-backend-images
-local-backends
-local-ai
-.crush
-protoc
-tests
-
-# Installed via npm inside the build stage; no need to ship the host copy.
-**/node_modules
--- a/.githooks/pre-commit
+++ b/.githooks/pre-commit
@@ -1,60 +0,0 @@
-#!/usr/bin/env sh
-#
-# LocalAI pre-commit hook. Install it (once per clone) with:
-#
-#     make install-hooks
-#
-# Runs only the checks relevant to what's staged:
-#   - Go files          -> make lint + make test-coverage-check
-#   - core/http/react-ui -> make test-ui-coverage-check (Playwright e2e + gate)
-# A commit touching neither is skipped entirely (docs/YAML/etc. can't change
-# lint findings, Go coverage, or the UI).
-#
-# To bypass for a single commit (e.g. a WIP checkpoint): git commit --no-verify
-set -eu
-
-repo_root="$(git rev-parse --show-toplevel)"
-cd "$repo_root"
-
-staged="$(git diff --cached --name-only --diff-filter=ACMRD)"
-
-go_changed=0
-ui_changed=0
-if echo "$staged" | grep -qE '\.go$'; then go_changed=1; fi
-if echo "$staged" | grep -qE '^core/http/react-ui/'; then ui_changed=1; fi
-
-if [ "$go_changed" -eq 0 ] && [ "$ui_changed" -eq 0 ]; then
-	echo "pre-commit: no Go or React UI changes staged — skipping."
-	exit 0
-fi
-
-if [ "$go_changed" -eq 1 ]; then
-	# Resolve the ref golangci-lint's new-from-merge-base should compare
-	# against. .golangci.yml pins origin/master, which is correct in CI
-	# (origin == the canonical repo) but wrong from a fork clone, where
-	# origin/master lags behind and lint would report the whole upstream
-	# backlog. Prefer upstream/master, then origin/master, then master.
-	lint_base=""
-	for ref in upstream/master origin/master master; do
-		if git rev-parse --verify --quiet "${ref}^{commit}" >/dev/null 2>&1; then
-			lint_base="$ref"
-			break
-		fi
-	done
-
-	echo "pre-commit ▶ golangci-lint (make lint${lint_base:+, new-from $lint_base})"
-	make lint LINT_NEW_FROM="$lint_base"
-
-	echo "pre-commit ▶ coverage gate (make test-coverage-check) — builds and runs the"
-	echo "             pkg/core suites plus tests/e2e; can take a few minutes."
-	make test-coverage-check
-fi
-
-if [ "$ui_changed" -eq 1 ]; then
-	echo "pre-commit ▶ React UI e2e + coverage gate (make test-ui-coverage-check) —"
-	echo "             rebuilds the UI + ui-test-server, runs the Playwright specs, and"
-	echo "             fails if line coverage regressed; can take a couple of minutes."
-	make test-ui-coverage-check
-fi
-
-echo "pre-commit ✓ all relevant checks passed"
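Reviewer note: the removed hook decides which checks to run from the staged path list alone. The Go sketch below mirrors its two `grep -qE` tests (`\.go$` and `^core/http/react-ui/`) so the selection rule can be read in isolation; `classifyStaged` is a hypothetical name and the sample paths are invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyStaged mirrors the hook's two pattern checks: any path ending in
// ".go" selects the Go checks, and any path under core/http/react-ui/ selects
// the UI checks. Hypothetical helper for illustration only.
func classifyStaged(paths []string) (goChanged, uiChanged bool) {
	for _, p := range paths {
		if strings.HasSuffix(p, ".go") {
			goChanged = true
		}
		if strings.HasPrefix(p, "core/http/react-ui/") {
			uiChanged = true
		}
	}
	return goChanged, uiChanged
}

func main() {
	// Invented sample of staged paths.
	staged := []string{"docs/content/intro.md", "pkg/example/example.go"}
	goChanged, uiChanged := classifyStaged(staged)
	fmt.Println(goChanged, uiChanged)
}
```

When both results are false the hook exits early, which matches its "docs/YAML/etc." skip path.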
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
--- a/.github/bump_vllm_metal.sh
+++ b/.github/bump_vllm_metal.sh
@@ -1,55 +0,0 @@
-#!/bin/bash
-# Bump the single vllm-metal pin (VLLM_METAL_VERSION) in the vLLM backend's
-# darwin (Apple Silicon) install path. The macOS/Metal build
-# (backend/python/vllm/install.sh, Darwin branch) installs vllm-metal, which is
-# version-locked to a specific vLLM source release. install.sh derives that vLLM
-# version at build time from vllm-metal's own installer (`vllm_v=`) at the pinned
-# tag, so there is only ONE value to bump here -- mirroring bump_vllm_wheel.sh,
-# which bumps the Linux cu130 wheel pin.
-#
-# This deliberately tracks vllm-project/vllm-metal, NOT vllm-project/vllm: the
-# darwin build can only use the exact vLLM version vllm-metal supports, so it may
-# lag the Linux pin (requirements-cublas13-after.txt) until vllm-metal catches up.
-set -xe
-REPO=$1   # vllm-project/vllm-metal
-FILE=$2   # backend/python/vllm/install.sh
-VAR=$3    # VLLM_METAL_VERSION (used for the workflow's output file names)
-
-if [ -z "$FILE" ] || [ -z "$REPO" ] || [ -z "$VAR" ]; then
-    echo "usage: $0 <repo> <install-file> <var-name>" >&2
-    exit 1
-fi
-
-# vllm-metal ships frequent dev releases, all flagged as non-prerelease, so
-# /releases/latest returns the newest one (with its cp312 wheel asset).
-LATEST_TAG=$(curl -sS -H "Accept: application/vnd.github+json" \
-    "https://api.github.com/repos/$REPO/releases/latest" \
-    | python3 -c "import json,sys; print(json.load(sys.stdin)['tag_name'])")
-
-# The coupled vLLM source version lives in vllm-metal's installer at that tag.
-NEW_VLLM_VERSION=$(curl -fsSL \
-    "https://raw.githubusercontent.com/$REPO/$LATEST_TAG/install.sh" \
-    | grep -oE 'vllm_v="[0-9]+\.[0-9]+\.[0-9]+"' | head -1 | cut -d'"' -f2)
-
-if [ -z "$LATEST_TAG" ] || [ -z "$NEW_VLLM_VERSION" ]; then
-    echo "Could not resolve vllm-metal tag ($LATEST_TAG) or its vllm_v ($NEW_VLLM_VERSION)." >&2
-    exit 1
-fi
-
-set +e
-CURRENT_TAG=$(grep -oE 'VLLM_METAL_VERSION="[^"]*"' "$FILE" | head -1 | cut -d'"' -f2)
-set -e
-
-# Rewrite the single pin. install.sh derives VLLM_VERSION from this tag at build
-# time, so there is nothing else to touch. peter-evans/create-pull-request opens
-# no PR on a clean tree, so a no-op rewrite (already current) is safe.
-sed -i "$FILE" \
-    -e "s|VLLM_METAL_VERSION=\"[^\"]*\"|VLLM_METAL_VERSION=\"$LATEST_TAG\"|"
-
-if [ -z "$CURRENT_TAG" ]; then
-    echo "Could not find VLLM_METAL_VERSION=\"...\" in $FILE." >&2
-    exit 0
-fi
-
-echo "vllm-metal ${CURRENT_TAG} -> ${LATEST_TAG} (builds vLLM ${NEW_VLLM_VERSION}): https://github.com/$REPO/releases/tag/${LATEST_TAG}" >> "${VAR}_message.txt"
-echo "${LATEST_TAG}" >> "${VAR}_commit.txt"
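Note on the removed script: it runs `sed` before checking `CURRENT_TAG`, and exits 0 when the pin is not found. If a pin-rewriting step like this is reintroduced, one possible ordering is to read the pin first and fail without touching the file when it is absent. The sketch below assumes a `NAME="value"` pin format as in the removed script; `rewrite_pin` is a hypothetical helper, not an existing function in the repository.

```shell
# Hypothetical helper, for illustration only.
# Rewrites a single NAME="value" pin in a file. Assumes the new value contains
# no '|' or '&' characters, since it is substituted into a sed expression.
rewrite_pin() {
    local file="$1" var="$2" new="$3" current tmp
    # Read the current pin first; an empty result means the pin is missing
    # (an empty-string pin is treated the same way).
    current=$(grep -oE "${var}=\"[^\"]*\"" "$file" | head -1 | cut -d'"' -f2)
    if [ -z "$current" ]; then
        echo "Could not find ${var}=\"...\" in $file." >&2
        return 1
    fi
    # Write through a temp file instead of relying on sed -i, whose syntax
    # differs between GNU and BSD sed.
    tmp=$(mktemp)
    if ! sed -e "s|${var}=\"[^\"]*\"|${var}=\"${new}\"|" "$file" > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    # cat into the original keeps its permissions (mktemp files are 0600).
    cat "$tmp" > "$file"
    rm -f "$tmp"
    echo "${current} -> ${new}"
}
```

Returning non-zero on a missing pin makes the workflow step fail visibly rather than reporting success with nothing changed.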
--- a/.github/gallery-agent/main.go
+++ b/.github/gallery-agent/main.go
@@ -3,7 +3,6 @@ package main
 import (
 	"context"
 	"encoding/json"
-	"errors"
 	"fmt"
 	"os"
 	"strconv"
@@ -114,17 +113,6 @@ func main() {
 	fmt.Println("Searching for trending models on HuggingFace...")
 	rawModels, err := client.GetTrending(searchTerm, limit)
 	if err != nil {
-		if errors.Is(err, hfapi.ErrRateLimited) {
-			fmt.Printf("HuggingFace API is rate limited after retries, skipping this run: %v\n", err)
-			writeSummary(AddedModelSummary{
-				SearchTerm:     searchTerm,
-				TotalFound:     0,
-				ModelsAdded:    0,
-				Quantization:   quantization,
-				ProcessingTime: time.Since(startTime).String(),
-			})
-			return
-		}
 		fmt.Fprintf(os.Stderr, "Error fetching models: %v\n", err)
 		os.Exit(1)
 	}
@@ -289,3 +277,4 @@ func truncateString(s string, maxLen int) string {
 	}
 	return s[:maxLen] + "..."
 }
+
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
@@ -44,7 +44,7 @@ jobs:
      has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6

      - name: Setup Bun
        uses: oven-sh/setup-bun@v2
--- a/.github/workflows/backend_build.yml
+++ b/.github/workflows/backend_build.yml
@@ -101,7 +101,7 @@ jobs:
    steps:

      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true

--- a/.github/workflows/backend_build_darwin.yml
+++ b/.github/workflows/backend_build_darwin.yml
@@ -57,7 +57,7 @@ jobs:
      HOMEBREW_NO_ANALYTICS: '1'
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true

@@ -98,8 +98,6 @@ jobs:
            /opt/homebrew/Cellar/hiredis
            /opt/homebrew/Cellar/xxhash
            /opt/homebrew/Cellar/zstd
-            /opt/homebrew/Cellar/nlohmann-json
-            /opt/homebrew/Cellar/opus
          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}

      - name: Dependencies
@@ -111,15 +109,7 @@ jobs:
          # Without explicitly installing them, a brew cache-hit run restores
          # ccache's Cellar dir but skips installing those transitive deps,
          # and ccache fails at runtime with `dyld: Library not loaded`.
-          # nlohmann-json is header-only and required by the ds4 backend
-          # (dsml_renderer.cpp includes <nlohmann/json.hpp>); on Linux it comes
-          # from the apt-installed nlohmann-json3-dev in the build image.
-          # opus + pkg-config are required by the opus go backend: its
-          # Makefile/package.sh call `pkg-config --cflags/--libs opus` to build
-          # libopusshim.dylib and to locate libopus.dylib for bundling. brew's
-          # pkg-config defaults its search path to the Homebrew prefix so the
-          # opus.pc is found.
-          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd nlohmann-json opus pkg-config
+          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd
          # Force-reinstall ccache so brew re-validates its full runtime-dep
          # closure on every run. This is the durable fix: when the upstream
          # ccache formula gains a new transitive dep (as it has multiple times
@@ -138,7 +128,7 @@ jobs:
          # and decides "already installed" without re-linking, so on a cache-
          # hit run the formulas aren't on PATH. Force-link them; --overwrite
          # tolerates pre-existing symlinks from earlier installs.
-          brew link --overwrite protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd nlohmann-json opus pkg-config 2>/dev/null || true
+          brew link --overwrite protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd 2>/dev/null || true

      - name: Save Homebrew cache
        if: github.event_name != 'pull_request' && steps.brew-cache.outputs.cache-hit != 'true'
@@ -158,8 +148,6 @@ jobs:
            /opt/homebrew/Cellar/hiredis
            /opt/homebrew/Cellar/xxhash
            /opt/homebrew/Cellar/zstd
-            /opt/homebrew/Cellar/nlohmann-json
-            /opt/homebrew/Cellar/opus
          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}

      # ---- ccache for llama.cpp CMake builds ----
@@ -235,17 +223,8 @@ jobs:
        run: |
          make backends/ds4-darwin

-      # privacy-filter is a C++/ggml backend like ds4 - a single grpc-server with
-      # otool dylib bundling - so it gets its own bespoke darwin script rather than
-      # the generic build-darwin-go-backend path.
-      - name: Build privacy-filter backend (Darwin Metal)
-        if: inputs.backend == 'privacy-filter'
-        run: |
-          make protogen-go
-          make backends/privacy-filter-darwin
-
      - name: Build ${{ inputs.backend }}-darwin
-        if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4' && inputs.backend != 'privacy-filter'
+        if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4'
        run: |
          make protogen-go
          BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend
--- a/.github/workflows/backend_merge.yml
+++ b/.github/workflows/backend_merge.yml
@@ -40,16 +40,11 @@ jobs:
      id-token: write
    env:
      quay_username: ${{ secrets.quayUsername }}
-      # cosign v2.4.x still gates --registry-referrers-mode=oci-1-1 behind
-      # this flag. Without it, signing fails with:
-      #   invalid argument "oci-1-1" for "--registry-referrers-mode" flag:
-      #   in order to use mode "oci-1-1", you must set COSIGN_EXPERIMENTAL=1
-      COSIGN_EXPERIMENTAL: '1'
    steps:
      # Sparse checkout: the merge job needs `.github/scripts/` (for the
      # keepalive cleanup script) but none of the source tree.
      - name: Checkout (.github/scripts only)
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          sparse-checkout: |
            .github/scripts
--- a/.github/workflows/backend_pr.yml
+++ b/.github/workflows/backend_pr.yml
@@ -23,7 +23,7 @@ jobs:
      has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6

      - name: Setup Bun
        uses: oven-sh/setup-bun@v2
--- a/.github/workflows/base-images.yml
+++ b/.github/workflows/base-images.yml
@@ -127,7 +127,7 @@ jobs:
            # the original l4t matrix entry which set skip-drivers: 'true'.
            skip-drivers: 'true'
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
        with:
          submodules: false
      - name: Free disk space
--- a/.github/workflows/build-test.yaml
+++ b/.github/workflows/build-test.yaml
@@ -11,7 +11,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Set up Go
@@ -25,7 +25,7 @@ jobs:
    runs-on: macos-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Set up Go
@@ -47,7 +47,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Configure apt mirror on runner
--- a/.github/workflows/bump-inference-defaults.yml
+++ b/.github/workflows/bump-inference-defaults.yml
@@ -14,7 +14,7 @@ jobs:
  bump:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6

      - uses: actions/setup-go@v5
        with:
--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -26,30 +26,10 @@ jobs:
            variable: "DS4_VERSION"
            branch: "main"
            file: "backend/cpp/ds4/Makefile"
-          - repository: "localai-org/privacy-filter.cpp"
-            variable: "PRIVACY_FILTER_VERSION"
-            branch: "master"
-            file: "backend/cpp/privacy-filter/Makefile"
          - repository: "ggml-org/whisper.cpp"
            variable: "WHISPER_CPP_VERSION"
            branch: "master"
            file: "backend/go/whisper/Makefile"
-          - repository: "CrispStrobe/CrispASR"
-            variable: "CRISPASR_VERSION"
-            branch: "main"
-            file: "backend/go/crispasr/Makefile"
-          - repository: "mudler/parakeet.cpp"
-            variable: "PARAKEET_VERSION"
-            branch: "master"
-            file: "backend/go/parakeet-cpp/Makefile"
-          - repository: "mudler/ced.cpp"
-            variable: "CED_VERSION"
-            branch: "master"
-            file: "backend/go/ced/Makefile"
-          - repository: "mudler/depth-anything.cpp"
-            variable: "DEPTHANYTHING_VERSION"
-            branch: "master"
-            file: "backend/go/depth-anything-cpp/Makefile"
          - repository: "leejet/stable-diffusion.cpp"
            variable: "STABLEDIFFUSION_GGML_VERSION"
            branch: "master"
@@ -70,29 +50,17 @@ jobs:
            variable: "SAM3_VERSION"
            branch: "main"
            file: "backend/go/sam3-cpp/Makefile"
-          - repository: "mudler/rf-detr.cpp"
-            variable: "RFDETR_VERSION"
-            branch: "main"
-            file: "backend/go/rfdetr-cpp/Makefile"
-          - repository: "mudler/locate-anything.cpp"
-            variable: "LOCATEANYTHING_VERSION"
-            branch: "master"
-            file: "backend/go/locate-anything-cpp/Makefile"
-          - repository: "ServeurpersoCom/qwentts.cpp"
+          - repository: "predict-woo/qwen3-tts.cpp"
            variable: "QWEN3TTS_CPP_VERSION"
-            branch: "master"
+            branch: "main"
            file: "backend/go/qwen3-tts-cpp/Makefile"
-          - repository: "ServeurpersoCom/omnivoice.cpp"
-            variable: "OMNIVOICE_VERSION"
-            branch: "master"
-            file: "backend/go/omnivoice-cpp/Makefile"
          - repository: "localai-org/vibevoice.cpp"
            variable: "VIBEVOICE_CPP_VERSION"
            branch: "master"
            file: "backend/go/vibevoice-cpp/Makefile"
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
      - name: Bump dependencies 🔧
        id: bump
        run: |
@@ -128,7 +96,7 @@ jobs:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
      - name: Bump vLLM cu130 wheel pin 🔧
        id: bump
        run: |
@@ -154,39 +122,3 @@ jobs:
          branch: "update/VLLM_VERSION"
          body: ${{ steps.bump.outputs.message }}
          signoff: true
-
-  bump-vllm-metal:
-    # The darwin (Apple Silicon) vLLM build installs vllm-metal, which is locked
-    # to a specific vLLM source release. install.sh pins both VLLM_METAL_VERSION
-    # (the wheel release) and VLLM_VERSION (the vLLM it builds against); this job
-    # tracks vllm-project/vllm-metal and rewrites both atomically. Separate from
-    # bump-vllm-wheel because darwin follows vllm-metal, not vllm/vllm latest.
-    if: github.repository == 'mudler/LocalAI'
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v7
-      - name: Bump vllm-metal pin 🔧
-        id: bump
-        run: |
-          bash .github/bump_vllm_metal.sh vllm-project/vllm-metal backend/python/vllm/install.sh VLLM_METAL_VERSION
-          {
-            echo 'message<<EOF'
-            cat "VLLM_METAL_VERSION_message.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          {
-            echo 'commit<<EOF'
-            cat "VLLM_METAL_VERSION_commit.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          rm -rfv VLLM_METAL_VERSION_message.txt VLLM_METAL_VERSION_commit.txt
-      - name: Create Pull Request
-        uses: peter-evans/create-pull-request@v8
-        with:
-          token: ${{ secrets.UPDATE_BOT_TOKEN }}
-          push-to-fork: ci-forks/LocalAI
-          commit-message: ':arrow_up: Update vllm-project/vllm-metal (darwin)'
-          title: 'chore: :arrow_up: Update vllm-metal (darwin) to `${{ steps.bump.outputs.commit }}`'
-          branch: "update/VLLM_METAL_VERSION"
-          body: ${{ steps.bump.outputs.message }}
-          signoff: true
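Note on the removed job: the `message<<EOF ... EOF` blocks use the delimiter form that GitHub Actions accepts for multi-line step outputs written to `$GITHUB_OUTPUT`. A minimal sketch of the same pattern as a reusable function follows; `emit_output` is a hypothetical name, and the per-process delimiter is an assumption added here to reduce the chance of the content itself containing the delimiter.

```shell
# Hypothetical helper, for illustration only.
# Appends a (possibly multi-line) step output to the file named by
# $GITHUB_OUTPUT using the name<<DELIMITER ... DELIMITER form.
emit_output() {
    local name="$1" file="$2" delim="EOF_$$"
    {
        echo "${name}<<${delim}"
        # $(cat ...) strips trailing newlines and printf adds exactly one,
        # so the closing delimiter always starts on its own line.
        printf '%s\n' "$(cat "$file")"
        echo "${delim}"
    } >> "$GITHUB_OUTPUT"
}
```

In a workflow step this would be called as, for example, `emit_output message VLLM_METAL_VERSION_message.txt`, mirroring the two blocks in the removed job.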
--- a/.github/workflows/bump_docs.yaml
+++ b/.github/workflows/bump_docs.yaml
@@ -13,7 +13,7 @@ jobs:
          - repository: "mudler/LocalAI"
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
      - name: Bump dependencies 🔧
        run: |
          bash .github/bump_docs.sh ${{ matrix.repository }}
--- a/.github/workflows/checksum_checker.yaml
+++ b/.github/workflows/checksum_checker.yaml
@@ -8,7 +8,7 @@ jobs:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
      - name: Configure apt mirror on runner
        uses: ./.github/actions/configure-apt-mirror
      - name: Install dependencies
--- a/.github/workflows/deploy-explorer.yaml
+++ b/.github/workflows/deploy-explorer.yaml
@@ -16,7 +16,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - uses: actions/setup-go@v5
--- a/.github/workflows/gallery-agent.yaml
+++ b/.github/workflows/gallery-agent.yaml
@@ -31,7 +31,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          token: ${{ secrets.GITHUB_TOKEN }}

--- a/.github/workflows/generate_intel_image.yaml
+++ b/.github/workflows/generate_intel_image.yaml
@@ -44,7 +44,7 @@ jobs:
        uses: docker/setup-buildx-action@master

      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6

      - name: Cache Intel images
        uses: docker/build-push-action@v7
--- a/.github/workflows/gh-pages.yml
+++ b/.github/workflows/gh-pages.yml
@@ -28,7 +28,7 @@ jobs:
      HUGO_VERSION: "0.146.3"
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0  # needed for enableGitInfo
          submodules: true
--- a/.github/workflows/image_build.yml
+++ b/.github/workflows/image_build.yml
@@ -80,7 +80,7 @@ jobs:
    steps:

      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6

      - name: Configure apt mirror on runner
        id: apt_mirror
--- a/.github/workflows/image_merge.yml
+++ b/.github/workflows/image_merge.yml
@@ -36,7 +36,7 @@ jobs:
      # Sparse checkout: needed for .github/scripts/ (the keepalive cleanup
      # script). Skips the rest of the source tree.
      - name: Checkout (.github/scripts only)
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          sparse-checkout: |
            .github/scripts
--- a/.github/workflows/lint.yml
+++ b/.github/workflows/lint.yml
@@ -20,7 +20,7 @@ jobs:
  golangci-lint:
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
        with:
          # Full history so golangci-lint's new-from-merge-base can reach
          # origin/master and compute the diff against it.
--- a/.github/workflows/release.yaml
+++ b/.github/workflows/release.yaml
@@ -10,7 +10,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Set up Go
@@ -24,35 +24,20 @@ jobs:
          args: release --clean
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          MACOS_SIGN_P12: ${{ secrets.MACOS_CERTIFICATE }}
-          MACOS_SIGN_PASSWORD: ${{ secrets.MACOS_CERTIFICATE_PWD }}
-          MACOS_NOTARY_KEY: ${{ secrets.MACOS_NOTARY_KEY }}
-          MACOS_NOTARY_KEY_ID: ${{ secrets.MACOS_NOTARY_KEY_ID }}
-          MACOS_NOTARY_ISSUER_ID: ${{ secrets.MACOS_NOTARY_ISSUER_ID }}
  launcher-build-darwin:
    runs-on: macos-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version: 1.23
-      - name: Import signing certificate
-        env:
-          MACOS_CERTIFICATE: ${{ secrets.MACOS_CERTIFICATE }}
-          MACOS_CERTIFICATE_PWD: ${{ secrets.MACOS_CERTIFICATE_PWD }}
-          MACOS_CI_KEYCHAIN_PWD: ${{ secrets.MACOS_CI_KEYCHAIN_PWD }}
-        run: bash contrib/macos/sign-and-notarize.sh import-cert
-      - name: Build, sign and notarize the DMG
-        env:
-          MACOS_SIGN_IDENTITY: ${{ secrets.MACOS_SIGN_IDENTITY }}
-          MACOS_NOTARY_KEY: ${{ secrets.MACOS_NOTARY_KEY }}
-          MACOS_NOTARY_KEY_ID: ${{ secrets.MACOS_NOTARY_KEY_ID }}
-          MACOS_NOTARY_ISSUER_ID: ${{ secrets.MACOS_NOTARY_ISSUER_ID }}
-        run: make release-launcher-darwin
+      - name: Build launcher for macOS ARM64
+        run: |
+          make build-launcher-darwin
      - name: Upload DMG to Release
        uses: softprops/action-gh-release@v3
        with:
@@ -61,7 +46,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Configure apt mirror on runner
--- a/.github/workflows/secscan.yaml
+++ b/.github/workflows/secscan.yaml
@@ -14,17 +14,14 @@ jobs:
      GO111MODULE: on
    steps:
      - name: Checkout Source
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        if: ${{ github.actor != 'dependabot[bot]' }}
      - name: Run Gosec Security Scanner
        if: ${{ github.actor != 'dependabot[bot]' }}
-        uses: securego/gosec@v2.27.1
+        uses: securego/gosec@v2.22.9
        with:
          # we let the report trigger content trigger a failure using the GitHub Security features.
-          # backend/go/supertonic is excluded: it vendors upstream supertone-inc/supertonic
-          # (helper.go), whose findings (G304 model-file loads, G404 math/rand for flow-matching
-          # noise, G104 unhandled errors) are inherent to that upstream code, not ours to rewrite.
-          args: '-no-fail -exclude-dir=backend/go/supertonic -fmt sarif -out results.sarif ./...'
+          args: '-no-fail -fmt sarif -out results.sarif ./...'
      - name: Upload SARIF file
        if: ${{ github.actor != 'dependabot[bot]' }}
        uses: github/codeql-action/upload-sarif@v4
--- a/.github/workflows/stalebot.yml
+++ b/.github/workflows/stalebot.yml
@@ -11,7 +11,7 @@ jobs:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/stale@eb5cf3af3ac0a1aa4c9c45633dd1ae542a27a899 # v9
+      - uses: actions/stale@b5d41d4e1d5dceea10e7104786b73624c18a190f # v9
        with:
          stale-issue-message: 'This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.'
          stale-pr-message: 'This PR is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 10 days.'
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
@@ -37,8 +37,6 @@ jobs:
      sglang: ${{ steps.detect.outputs.sglang }}
      acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
      qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
-      rfdetr-cpp: ${{ steps.detect.outputs.rfdetr-cpp }}
-      locate-anything-cpp: ${{ steps.detect.outputs.locate-anything-cpp }}
      vibevoice-cpp: ${{ steps.detect.outputs.vibevoice-cpp }}
      localvqe: ${{ steps.detect.outputs.localvqe }}
      voxtral: ${{ steps.detect.outputs.voxtral }}
@@ -47,10 +45,9 @@ jobs:
      speaker-recognition: ${{ steps.detect.outputs.speaker-recognition }}
      sherpa-onnx: ${{ steps.detect.outputs.sherpa-onnx }}
      whisper: ${{ steps.detect.outputs.whisper }}
-      parakeet-cpp: ${{ steps.detect.outputs.parakeet-cpp }}
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
      - name: Setup Bun
        uses: oven-sh/setup-bun@v2
      - name: Install dependencies
@@ -67,7 +64,7 @@ jobs:
  #   runs-on: ubuntu-latest
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -90,7 +87,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -113,7 +110,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -137,7 +134,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -158,7 +155,7 @@ jobs:
  #  runs-on: ubuntu-latest
  #  steps:
  #    - name: Clone
-  #      uses: actions/checkout@v7
+  #      uses: actions/checkout@v6
  #      with:
  #        submodules: true
  #    - name: Dependencies
@@ -178,7 +175,7 @@ jobs:
  #   runs-on: ubuntu-latest
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -240,7 +237,7 @@ jobs:
  #           sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
  #           df -h
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -265,7 +262,7 @@ jobs:
  #   runs-on: ubuntu-latest
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -288,7 +285,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -309,7 +306,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -330,7 +327,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -351,7 +348,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -373,7 +370,7 @@ jobs:
  #   timeout-minutes: 45
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -394,7 +391,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -415,7 +412,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -436,7 +433,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -462,7 +459,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -484,7 +481,7 @@ jobs:
    timeout-minutes: 30
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -513,7 +510,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -530,7 +527,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -552,7 +549,7 @@ jobs:
    timeout-minutes: 20
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -564,7 +561,7 @@ jobs:
      - name: Run e2e-backends smoke
        env:
          BACKEND_IMAGE: quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp
-          BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias,tokenize
+          BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias
        run: |
          make test-extra-backend
  # Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked LLM.
@@ -579,7 +576,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -604,7 +601,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -625,7 +622,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -635,26 +632,6 @@ jobs:
      - name: Build whisper backend image and run transcription gRPC e2e tests
        run: |
          make test-extra-backend-whisper-transcription
-  # Parakeet ASR via the parakeet-cpp backend (C++/ggml port of NeMo
-  # Parakeet). Drives AudioTranscription (offline, with word timestamps) on
-  # tdt_ctc-110m + the JFK 11s clip.
-  tests-parakeet-cpp-grpc-transcription:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.parakeet-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    timeout-minutes: 90
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-        with:
-          submodules: true
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: '1.25.4'
-      - name: Build parakeet-cpp backend image and run transcription gRPC e2e tests
-        run: |
-          make test-extra-backend-parakeet-cpp-transcription
  # VITS TTS via the sherpa-onnx backend. Drives both TTS (file write) and
  # TTSStream (PCM chunks) on the e2e-backends harness.
  tests-sherpa-onnx-grpc-tts:
@@ -664,7 +641,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -681,7 +658,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -698,7 +675,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -741,7 +718,7 @@ jobs:
  #   timeout-minutes: 90
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -783,7 +760,7 @@ jobs:
  #   timeout-minutes: 90
  #   steps:
  #     - name: Clone
-  #       uses: actions/checkout@v7
+  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
@@ -808,7 +785,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -840,7 +817,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -866,81 +843,6 @@ jobs:
      - name: Test qwen3-tts-cpp
        run: |
          make --jobs=5 --output-sync=target -C backend/go/qwen3-tts-cpp test
-  # Per-backend smoke for rfdetr-cpp: builds the .so + Go binary and runs
-  # `make -C backend/go/rfdetr-cpp test`. test.sh fetches the small (~20 MB)
-  # rfdetr-nano-q8_0 GGUF from the published mudler/rfdetr-cpp-nano HF repo
-  # via curl and synthesises a tiny PNG to exercise the wire protocol.
-  tests-rfdetr-cpp:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.rfdetr-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-        with:
-          submodules: true
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y build-essential cmake curl libopenblas-dev
-      - name: Setup Go
-        uses: actions/setup-go@v5
-      - name: Display Go version
-        run: go version
-      - name: Proto Dependencies
-        run: |
-          # Install protoc
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Build rfdetr-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/rfdetr-cpp
-      - name: Test rfdetr-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/rfdetr-cpp test
-  # Per-backend e2e for locate-anything-cpp: builds the .so + Go binary and
-  # runs `make -C backend/go/locate-anything-cpp test`. test.sh fetches the
-  # locate-anything-q8_0 GGUF (~6.3 GB, NVIDIA LocateAnything-3B) from the
-  # published mudler/locate-anything.cpp-gguf HF repo + a COCO image, then the
-  # Go wire test loads the model and runs an open-vocabulary Detect, asserting
-  # at least one labeled box. Heavier than the other Go backends (it is a 3B),
-  # so it is gated to changes under backend/go/locate-anything-cpp/.
-  tests-locate-anything-cpp:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.locate-anything-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-        with:
-          submodules: true
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y build-essential cmake curl libopenblas-dev
-      - name: Setup Go
-        uses: actions/setup-go@v5
-      - name: Display Go version
-        run: go version
-      - name: Proto Dependencies
-        run: |
-          # Install protoc
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Build locate-anything-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/locate-anything-cpp
-      - name: Test locate-anything-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/locate-anything-cpp test
  # Per-backend smoke for vibevoice-cpp: builds the .so + Go binary and
  # runs `make -C backend/go/vibevoice-cpp test`. test.sh auto-downloads
  # the published mudler/vibevoice.cpp-models bundle (TTS Q8_0 + ASR Q4_K
@@ -952,7 +854,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -987,7 +889,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -1013,7 +915,7 @@ jobs:
    timeout-minutes: 150
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -1042,7 +944,7 @@ jobs:
    timeout-minutes: 60
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
@@ -1058,7 +960,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -1091,7 +993,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -1114,7 +1016,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
@@ -1140,7 +1042,7 @@ jobs:
    timeout-minutes: 90
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -21,7 +21,7 @@ jobs:
        go-version: ['1.26.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Free disk space
@@ -53,22 +53,9 @@ jobs:
          node-version: '22'
      - name: Build React UI
        run: make react-ui
-      # Runs the core suite with coverage and fails if total coverage dropped
-      # below the committed baseline (coverage-baseline.txt). The gate is
-      # strict — any decrease fails. Raise the baseline with
-      # `make test-coverage-baseline` and commit it when coverage rises.
-      - name: Test (with coverage gate)
+      - name: Test
        run: |
-          PATH="$PATH:/root/go/bin" make --jobs 5 --output-sync=target test-coverage-check
-      - name: Upload coverage report
-        if: ${{ always() }}
-        uses: actions/upload-artifact@v4
-        with:
-          name: coverage-linux
-          path: |
-            coverage/coverage.out
-            coverage/coverage.html
-          if-no-files-found: ignore
+          PATH="$PATH:/root/go/bin" make --jobs 5 --output-sync=target test
      - name: Setup tmate session if tests fail
        if: ${{ failure() }}
        uses: mxschmitt/action-tmate@v3.23
@@ -84,7 +71,7 @@ jobs:
        go-version: ['1.26.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go ${{ matrix.go-version }}
@@ -121,19 +108,3 @@ jobs:
          detached: true
          connect-timeout-seconds: 180
          limit-access-to-actor: true
-
-  # Fast standalone unit tests for the backends' pure C++ helpers - currently the
-  # llama-cpp message reconstruction (backend/cpp/llama-cpp/message_content.h),
-  # which guards the OpenAI chat content normalization (mudler/LocalAI#10524,
-  # #7324, #7528). The runner discovers every *_test.cpp under backend/cpp/, so
-  # new pure-C++ unit tests are picked up with no CI changes. These need only the
-  # C++ stdlib + nlohmann/json, so they run on every PR without the full
-  # llama.cpp + gRPC backend build. (The same suite is also wired as an opt-in
-  # CMake/ctest target, -DLLAMA_GRPC_BUILD_TESTS=ON, for in-backend-build runs.)
-  tests-backend-cpp:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-      - name: Run backend C++ unit tests
-        run: make test-backend-cpp
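Review note on the coverage gate removed from test.yml above: the deleted step ran `make test-coverage-check`, which the removed comment describes as failing when total coverage drops below the committed baseline. The real logic lives in scripts/coverage-check.sh, which this diff does not show. The Go sketch below only illustrates the idea, and assumes the total is read from the `total:` line of `go tool cover -func` output. `parseTotal` and `checkRatchet` are hypothetical names, not part of the repository.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseTotal extracts the percentage from the "total:" line of
// `go tool cover -func` output, e.g. "total:\t(statements)\t45.3%".
// Hypothetical helper for illustration only.
func parseTotal(coverOutput string) (float64, error) {
	sc := bufio.NewScanner(strings.NewReader(coverOutput))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "total:" {
			pct := strings.TrimSuffix(fields[len(fields)-1], "%")
			return strconv.ParseFloat(pct, 64)
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("no total: line found")
}

// checkRatchet returns an error when current coverage falls more than
// tolerance percentage points below the baseline. A tolerance of 0 gives
// the strict "never decrease" behaviour.
func checkRatchet(current, baseline, tolerance float64) error {
	if current+tolerance < baseline {
		return fmt.Errorf("coverage %.1f%% is below baseline %.1f%% (tolerance %.1fpp)",
			current, baseline, tolerance)
	}
	return nil
}

func main() {
	// Sample input shaped like `go tool cover -func` output (made-up numbers).
	out := "example.com/pkg/a.go:10:\tDo\t80.0%\ntotal:\t(statements)\t45.3%\n"
	cur, err := parseTotal(out)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	if err := checkRatchet(cur, 46.0, 0); err != nil {
		fmt.Println("gate failed:", err)
		return
	}
	fmt.Println("gate passed")
}
```

With a baseline of 46.0 and no tolerance, the sample total of 45.3 fails the gate. Passing a non-zero tolerance models the looser variant.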
--- a/.github/workflows/tests-aio.yml
+++ b/.github/workflows/tests-aio.yml
@@ -62,7 +62,7 @@ jobs:
          sudo rm -rfv build || true
          df -h
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
--- a/.github/workflows/tests-e2e.yml
+++ b/.github/workflows/tests-e2e.yml
@@ -21,7 +21,7 @@ jobs:
        go-version: ['1.25.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Configure apt mirror on runner
--- a/.github/workflows/tests-pii-ner-e2e.yml
+++ b/.github/workflows/tests-pii-ner-e2e.yml
@@ -1,97 +0,0 @@
----
-name: 'PII NER tier E2E (live GGUF, CPU)'
-
-# Runs the real privacy-filter GGUF NER tier end-to-end on CPU — the gap the
-# hermetic tests/e2e suite cannot cover (it only exercises the in-process
-# pattern tier). Heavy (builds the C++ backend image + downloads a ~2.7 GB
-# GGUF), so it is path-filtered on PRs and otherwise runs nightly / on demand.
-#
-# This drives the container-level harness (tests/e2e-backends) via
-# `make test-extra-backend-privacy-filter`: it builds the privacy-filter image,
-# downloads the model, loads it on CPU, and asserts byte-correct, UTF-8-aligned
-# TokenClassify spans. The complementary HTTP-path specs in tests/e2e
-# (e2e_pii_ner_test.go) Skip unless PII_NER_MODEL_GGUF is wired.
-
-on:
-  workflow_dispatch:
-  schedule:
-    - cron: '0 3 * * *'
-  push:
-    branches:
-      - master
-    paths:
-      - 'backend/cpp/privacy-filter/**'
-      - 'backend/Dockerfile.privacy-filter'
-      - 'core/services/routing/pii/**'
-      - 'core/services/routing/piidetector/**'
-      - 'core/backend/token_classify.go'
-      - 'core/http/endpoints/localai/pii.go'
-      - 'core/schema/pii.go'
-      - 'tests/e2e-backends/**'
-      - 'tests/e2e/e2e_pii_ner_test.go'
-      - 'tests/e2e/e2e_suite_test.go'
-      - '.github/workflows/tests-pii-ner-e2e.yml'
-  pull_request:
-    paths:
-      - 'backend/cpp/privacy-filter/**'
-      - 'backend/Dockerfile.privacy-filter'
-      - 'core/services/routing/pii/**'
-      - 'core/services/routing/piidetector/**'
-      - 'core/backend/token_classify.go'
-      - 'core/http/endpoints/localai/pii.go'
-      - 'core/schema/pii.go'
-      - 'tests/e2e-backends/**'
-      - 'tests/e2e/e2e_pii_ner_test.go'
-      - 'tests/e2e/e2e_suite_test.go'
-      - '.github/workflows/tests-pii-ner-e2e.yml'
-
-concurrency:
-  group: ci-tests-pii-ner-e2e-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-
-jobs:
-  tests-pii-ner-e2e:
-    runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        go-version: ['1.25.x']
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-        with:
-          submodules: true
-      - name: Free disk space
-        run: |
-          sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc /opt/hostedtoolcache/CodeQL || true
-          sudo docker image prune --all --force || true
-          df -h
-      - name: Configure apt mirror on runner
-        uses: ./.github/actions/configure-apt-mirror
-      - name: Setup Go ${{ matrix.go-version }}
-        uses: actions/setup-go@v5
-        with:
-          go-version: ${{ matrix.go-version }}
-          cache: false
-      - name: Proto Dependencies
-        run: |
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y build-essential
-      # Builds local-ai-backend:privacy-filter, downloads the GGUF, loads it on
-      # CPU and runs the token_classify capability spec (byte-offset contract).
-      - name: Run live PII NER backend E2E
-        run: PATH="$PATH:$HOME/go/bin" make test-extra-backend-privacy-filter
-      - name: Setup tmate session if tests fail
-        if: ${{ failure() }}
-        uses: mxschmitt/action-tmate@v3.23
-        with:
-          detached: true
-          connect-timeout-seconds: 180
-          limit-access-to-actor: true
--- a/.github/workflows/tests-ui-e2e.yml
+++ b/.github/workflows/tests-ui-e2e.yml
@@ -23,7 +23,7 @@ jobs:
        go-version: ['1.26.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v7
+        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Configure apt mirror on runner
@@ -37,10 +37,6 @@ jobs:
        uses: actions/setup-node@v6
        with:
          node-version: '22'
-      - name: Setup Bun
-        uses: oven-sh/setup-bun@v2
-        with:
-          bun-version: '1.3.11'
      - name: Proto Dependencies
        run: |
          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
@@ -52,12 +48,16 @@ jobs:
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential libopus-dev
-      # Builds an instrumented UI bundle, runs the Playwright specs, and fails
-      # if line coverage regressed beyond the jitter tolerance (the gate is
-      # in `make test-ui-coverage-check`). PLAYWRIGHT_CHROMIUM_PATH is unset
-      # here, so scripts/ensure-playwright-browser.sh installs Chromium via apt.
-      - name: Run UI e2e + coverage gate
-        run: PATH="$PATH:$HOME/go/bin" make test-ui-coverage-check
+      - name: Build UI test server
+        run: PATH="$PATH:$HOME/go/bin" make build-ui-test-server
+      - name: Install Playwright
+        working-directory: core/http/react-ui
+        run: |
+          npm install
+          npx playwright install --with-deps chromium
+      - name: Run Playwright tests
+        working-directory: core/http/react-ui
+        run: npx playwright test
      - name: Upload Playwright report
        if: ${{ failure() }}
        uses: actions/upload-artifact@v7
@@ -65,14 +65,6 @@ jobs:
          name: playwright-report
          path: core/http/react-ui/playwright-report/
          retention-days: 7
-      - name: Upload UI coverage report
-        if: ${{ always() }}
-        uses: actions/upload-artifact@v7
-        with:
-          name: ui-coverage
-          path: core/http/react-ui/coverage/
-          if-no-files-found: ignore
-          retention-days: 7
      - name: Setup tmate session if tests fail
        if: ${{ failure() }}
        uses: mxschmitt/action-tmate@v3.23
--- a/.github/workflows/update_swagger.yaml
+++ b/.github/workflows/update_swagger.yaml
@@ -10,7 +10,7 @@ jobs:
      fail-fast: false
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v6
      - name: Configure apt mirror on runner
        uses: ./.github/actions/configure-apt-mirror
      - uses: actions/setup-go@v5
--- a/.gitignore
+++ b/.gitignore
@@ -26,10 +26,6 @@ go-bert
 LocalAI
 /local-ai
 /local-ai-launcher
-# Root-level build artifacts when running `go build ./...` against
-# Go backend packages whose main lives under backend/go/.
-/cloud-proxy
-/local-store
 # prevent above rules from omitting the helm chart
 !charts/*
 # prevent above rules from omitting the api/localai folder
@@ -70,17 +66,10 @@ docs/static/gallery.html
 # per-developer customization files for the development container
 .devcontainer/customization/*

-# Coverage profiles (the committed baseline is coverage-baseline.txt)
-/coverage/
-
 # React UI build artifacts (keep placeholder dist/index.html)
 core/http/react-ui/node_modules/
 core/http/react-ui/dist

-# React UI coverage (vite-plugin-istanbul + nyc, via `make test-ui-coverage`)
-core/http/react-ui/.nyc_output/
-core/http/react-ui/coverage/
-
 # Extracted backend binaries for container-based testing
 local-backends/

@@ -91,9 +80,3 @@ core/http/react-ui/test-results/

 # Local worktrees
 .worktrees/
-
-# SDD / brainstorm scratch (agent-driven development)
-.superpowers/
-
-# Local Apple signing material (never commit)
-.certs/
--- a/.golangci.yml
+++ b/.golangci.yml
@@ -56,26 +56,10 @@ linters:
        # are exempt — see linters.exclusions.rules below.
        - pattern: '^os\.(Getenv|LookupEnv|Environ)$'
          msg: 'Plumb config through ApplicationConfig (or the relevant CLI struct) instead of reading env directly. CLI entry points (core/cli/) bind env vars via kong''s `env:` tag — that is the only sanctioned env→struct boundary. See .agents/coding-style.md.'
-        # Outbound HTTP must go through pkg/httpclient, which refuses redirects
-        # by default and sets a TLS floor. The std-library default client and
-        # the http.Get/Post/... convenience helpers follow redirects (up to 10)
-        # and, on a cross-host redirect, forward custom credential headers such
-        # as Anthropic's x-api-key to the redirect target — leaking the secret
-        # (GHSA-3mj3-57v2-4636). forbidigo can't precisely match the
-        # `&http.Client{}` composite literal without also flagging legitimate
-        # `*http.Client` type references, so that form is enforced by
-        # convention + review; these two patterns catch the implicit-default
-        # client, which is the common footgun.
-        - pattern: '^http\.DefaultClient$'
-          msg: 'Use pkg/httpclient (httpclient.New / NewWithTimeout) instead of http.DefaultClient — the std client follows redirects and leaks credential headers cross-host (GHSA-3mj3-57v2-4636). See .agents/coding-style.md.'
-        - pattern: '^http\.(Get|Post|PostForm|Head)$'
-          msg: 'Use pkg/httpclient (httpclient.New / NewWithTimeout) instead of http.Get/Post/PostForm/Head — these use http.DefaultClient, which follows redirects and leaks credential headers cross-host (GHSA-3mj3-57v2-4636). See .agents/coding-style.md.'
  exclusions:
    paths:
      # Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
      - 'backend/go/whisper/sources'
-      # Vendored upstream supertonic pipeline (supertone-inc/supertonic go/helper.go).
-      - 'backend/go/supertonic/helper.go'
      - 'docs/'
    rules:
      # CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
@@ -111,18 +95,3 @@ linters:
      - path: _test\.go$
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
-      # pkg/httpclient is the sanctioned home for outbound HTTP clients; it
-      # necessarily references net/http directly.
-      - path: ^pkg/httpclient/
-        text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
-        linters: [forbidigo]
-      # Tests drive local httptest servers where redirect/TLS hardening is
-      # irrelevant; the std client is fine there.
-      - path: _test\.go$
-        text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
-        linters: [forbidigo]
-      # Vendored upstream whisper.cpp Go bindings are a separate module and
-      # cannot import pkg/httpclient.
-      - path: ^backend/go/whisper/sources/
-        text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
-        linters: [forbidigo]
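Review note on the forbidigo rules removed from .golangci.yml above: the deleted comment says outbound HTTP should go through pkg/httpclient, which refuses redirects by default and sets a TLS floor, because the default client follows redirects and can forward custom credential headers cross-host. pkg/httpclient itself is not shown in this diff, so the Go sketch below is only an assumption about what such a client could look like, built from standard `net/http` calls. `newNoRedirectClient` and `demoRedirectRefused` are hypothetical names, and the TLS 1.2 floor is an illustrative choice.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// newNoRedirectClient returns a client that never follows redirects and
// requires at least TLS 1.2. Hypothetical stand-in, not the real
// pkg/httpclient API.
func newNoRedirectClient(timeout time.Duration) *http.Client {
	return &http.Client{
		Timeout: timeout,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{MinVersion: tls.VersionTLS12},
		},
		// Returning ErrUseLastResponse hands the 3xx response back to the
		// caller instead of issuing a request to the redirect target.
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
}

// demoRedirectRefused starts two local test servers: origin redirects to
// target. It reports the status the client saw and whether target was hit.
func demoRedirectRefused() (int, bool, error) {
	var hit atomic.Bool
	target := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		hit.Store(true)
	}))
	defer target.Close()
	origin := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, target.URL, http.StatusFound)
	}))
	defer origin.Close()

	req, err := http.NewRequest(http.MethodGet, origin.URL, nil)
	if err != nil {
		return 0, false, err
	}
	req.Header.Set("x-api-key", "dummy-secret") // custom credential header
	resp, err := newNoRedirectClient(5 * time.Second).Do(req)
	if err != nil {
		return 0, hit.Load(), err
	}
	defer resp.Body.Close()
	return resp.StatusCode, hit.Load(), nil
}

func main() {
	status, hit, err := demoRedirectRefused()
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Printf("status=%d targetHit=%v\n", status, hit)
}
```

The caller receives the 302 itself and the redirect target is never contacted, so the `x-api-key` header stays with the original host.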
--- a/.goreleaser.yaml
+++ b/.goreleaser.yaml
@@ -9,8 +9,7 @@ source:
  enabled: true
  name_template: '{{ .ProjectName }}-{{ .Tag }}-source'
 builds:
-  - id: local-ai
-    main: ./cmd/local-ai
+  - main: ./cmd/local-ai
    env:
      - CGO_ENABLED=0
    ldflags:
@@ -36,19 +35,3 @@ snapshot:
  version_template: "{{ .Tag }}-next"
 changelog:
  use: github-native
-# Sign + notarize the macOS server binary via the quill backend (runs on Linux,
-# no macOS runner needed). Disabled automatically when MACOS_SIGN_P12 is unset
-# (forks / PRs), so those builds stay unsigned and green.
-notarize:
-  macos:
-    - enabled: '{{ isEnvSet "MACOS_SIGN_P12" }}'
-      ids:
-        - local-ai
-      sign:
-        certificate: "{{.Env.MACOS_SIGN_P12}}"
-        password: "{{.Env.MACOS_SIGN_PASSWORD}}"
-      notarize:
-        issuer_id: "{{.Env.MACOS_NOTARY_ISSUER_ID}}"
-        key_id: "{{.Env.MACOS_NOTARY_KEY_ID}}"
-        key: "{{.Env.MACOS_NOTARY_KEY}}"
-        wait: true
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -35,7 +35,6 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]

 ## Quick Reference

-- **Git hooks & coverage gates**: Run `make install-hooks` once per clone so the pre-commit lint + coverage gates run. **Never bypass them with `git commit --no-verify`, and never lower a coverage baseline or widen a gate's tolerance to turn a red gate green** — the coverage ratchet only moves up. If a change drops coverage, add tests to raise it (e.g. render-smoke specs). See [.agents/building-and-testing.md](.agents/building-and-testing.md).
 - **Logging**: Use `github.com/mudler/xlog` (same API as slog)
 - **Go style**: Prefer `any` over `interface{}`
 - **Comments**: Explain *why*, not *what*
@@ -43,5 +42,4 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 - **New API endpoints**: LocalAI advertises its capability surface in several independent places — swagger `@Tags`, `/api/instructions` registry, auth `RouteFeatureRegistry`, React UI `capabilities.js`, docs. Read [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) and follow its checklist — missing any surface means clients, admins, and the UI won't know the endpoint exists.
 - **Admin endpoints → MCP tool**: every admin endpoint that an admin would manage conversationally (install/list/edit/toggle/upgrade) MUST also be exposed as an MCP tool in `pkg/mcp/localaitools/`. The LocalAI Assistant chat modality and the standalone `local-ai mcp-server` consume that package; drift between REST and MCP is a real risk. Read [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) — the `TestToolHTTPRouteMappingComplete` test fails until you wire the new tool and update the route map.
 - **Build**: Inspect `Makefile` and `.github/workflows/` — ask the user before running long builds
-- **Backend OS coverage**: a new backend must target every OS it can build for, not just Linux. `.github/backend-matrix.yml` has two matrices — `include:` (Linux) and `includeDarwin:` (macOS / Apple Silicon). Most C/C++/GGML and many Python backends build on Darwin too — wire the `includeDarwin` entry + `backend/index.yaml` `metal:` entries, or say in the PR why an OS is unsupported. See the darwin checklist in [.agents/adding-backends.md](.agents/adding-backends.md).
 - **UI**: The active UI is the React app in `core/http/react-ui/`. The older Alpine.js/HTML UI in `core/http/static/` is pending deprecation — all new UI work goes in the React UI
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -198,7 +198,6 @@ For AI-assisted development, see [`AGENTS.md`](AGENTS.md) (or the equivalent [`C

 - Prefer modern Go idioms — for example, use `any` instead of `interface{}`.
 - Use [`golangci-lint`](https://golangci-lint.run) to catch common issues before submitting a PR.
-- Run `make install-hooks` once per clone to enable the pre-commit hook: Go changes run `make lint` + the coverage gate (`make test-coverage-check`); `core/http/react-ui/` changes run the Playwright e2e suite (`make test-ui`). Bypass a single commit with `git commit --no-verify`.
 - Use [`github.com/mudler/xlog`](https://github.com/mudler/xlog) for logging (same API as `slog`). Do not use `fmt.Println` or the standard `log` package for operational logging.
 - Use tab indentation for Go files (as defined in `.editorconfig`).

@@ -266,12 +265,6 @@ The e2e tests run LocalAI in a Docker container and exercise the API:
 make test-e2e
 ```

-### React UI tests and coverage
-
-The React UI (`core/http/react-ui/`) is covered by Playwright e2e specs, gated by a **monotonic line-coverage ratchet** (`make test-ui-coverage-check`, run in CI and pre-commit). The metric is non-deterministic — a fast local box reads higher than a slow CI runner for the same code — so a small tolerance is unavoidable.
-
-**If your change lowers UI coverage, raise it back by adding specs — do not widen the tolerance or hand-lower the baseline.** A *render-smoke* spec (navigate to a page, assert its header is visible) cheaply covers an entire lazy page. See `core/http/react-ui/e2e/page-render-smoke.spec.js` and the full policy in [.agents/building-and-testing.md](.agents/building-and-testing.md#react-ui-coverage).
-
 ### Running E2E container tests

 These tests build a standard LocalAI Docker image and run it with pre-configured model configs to verify that most endpoints work correctly:
--- a/1
+++ b/1
@@ -108,7 +108,6 @@ RUN <<EOT bash
        apt-get update && \
        apt-get install -y --no-install-recommends \
            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
-            cuda-nvrtc-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
--- a/245
+++ b/245
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter backends/privacy-filter-darwin
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio

 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -69,41 +69,10 @@ else
 	GORELEASER=$(shell which goreleaser)
 endif

-TEST_PATHS?=./api/... ./pkg/... ./core/... ./backend/go/cloud-proxy/... ./backend/go/local-store/...
-
-## Coverage output and the committed baseline that CI compares against.
-## The gate is strict: total coverage must never decrease (no tolerance).
-## covermode=atomic makes line coverage deterministic regardless of test
-## ordering or flake retries, so there is no run-to-run jitter to absorb.
-COVERAGE_DIR?=$(abspath ./coverage)
-COVERAGE_PROFILE?=$(COVERAGE_DIR)/coverage.out
-COVERAGE_BASELINE?=coverage-baseline.txt
-## Coverage is collected one recursive root at a time and merged (see
-## scripts/run-coverage.sh): passing several recursive roots to a single
-## ginkgo invocation only keeps one root's coverprofile. Mirrors TEST_PATHS
-## minus ./api (which doesn't exist).
-COVERAGE_ROOTS?=./pkg ./core
-## Build tags for the coverage build. `auth` is required to compile the real
-## auth implementation and its ~150 `//go:build auth` tests (otherwise they're
-## invisible and the gate scores auth against a stub). `debug` matches `test`.
-COVERAGE_TAGS?=debug auth
-## Coverage is attributed to these packages via --coverpkg, so the in-process
-## integration suites (COVERAGE_E2E_ROOTS) credit the core/http handlers they
-## drive over HTTP — not just their own test package.
-COVERAGE_COVERPKG?=github.com/mudler/LocalAI/core/...,github.com/mudler/LocalAI/pkg/...
-## In-process integration suites folded into coverage. Run non-recursively
-## (excludes tests/e2e/distributed, which needs containers) with the mock
-## backend built by prepare-test. real-models specs need a downloaded model,
-## so they're filtered out. NOTE: tests/integration is intentionally NOT here —
-## it needs the local-store backend built (`make backends/local-store`), which
-## the coverage CI job doesn't do.
-COVERAGE_E2E_ROOTS?=./tests/e2e
-COVERAGE_E2E_LABELS?=!real-models
-## Drop generated protobuf from the denominator (it has no tests by design).
-COVERAGE_EXCLUDE_RE?=grpc/proto/.*[.]pb[.]go
+TEST_PATHS?=./api/... ./pkg/... ./core/...


-.PHONY: all test test-coverage test-coverage-baseline test-coverage-check test-backend-cpp test-ui test-ui-coverage-baseline test-ui-coverage-check install-hooks build vendor lint lint-all
+.PHONY: all test build vendor lint lint-all

 all: help

@@ -180,7 +149,7 @@ osx-signed: build

 ## Run
 run: ## run local-ai
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./cmd/local-ai
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./

 prepare-test: protogen-go build-mock-backend

@@ -201,43 +170,6 @@ test: prepare-test
 	OPUS_SHIM_LIBRARY=$(abspath ./pkg/opus/shim/libopusshim.so) \
 	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) --fail-fast -v -r $(TEST_PATHS)

-## Compiles and runs the standalone C++ unit tests for the backends (pure
-## helpers that depend only on the stdlib + nlohmann/json, no full backend
-## build). Discovers every *_test.cpp under backend/cpp/ - see
-## backend/cpp/run-unit-tests.sh. Set NLOHMANN_INCLUDE to skip the header fetch.
-test-backend-cpp:
-	bash backend/cpp/run-unit-tests.sh
-
-## Runs the core suite ($(TEST_PATHS)) with statement-coverage instrumentation
-## and writes a merged profile to $(COVERAGE_PROFILE). Deliberately omits
-## --fail-fast so a single failure doesn't truncate the coverage number, and
-## uses covermode=atomic so the result is deterministic. Prints the total.
-test-coverage: prepare-test
-	@echo 'Running tests with coverage'
-	GINKGO_TAGS="$(COVERAGE_TAGS)" \
-	COVERAGE_COVERPKG="$(COVERAGE_COVERPKG)" \
-	COVERAGE_E2E_ROOTS="$(COVERAGE_E2E_ROOTS)" \
-	COVERAGE_E2E_LABELS="$(COVERAGE_E2E_LABELS)" \
-	COVERAGE_EXCLUDE_RE='$(COVERAGE_EXCLUDE_RE)' \
-	OPUS_SHIM_LIBRARY=$(abspath ./pkg/opus/shim/libopusshim.so) \
-	scripts/run-coverage.sh $(COVERAGE_DIR) $(COVERAGE_PROFILE) $(TEST_FLAKES) $(COVERAGE_ROOTS)
-	@$(GOCMD) tool cover -html=$(COVERAGE_PROFILE) -o $(COVERAGE_DIR)/coverage.html
-	@$(GOCMD) tool cover -func=$(COVERAGE_PROFILE) | tail -n1
-
-## Writes the current total coverage to $(COVERAGE_BASELINE). Run this (and
-## commit the result) whenever a change legitimately raises coverage so the
-## ratchet moves up. Never lower it by hand.
-test-coverage-baseline: test-coverage
-	@$(GOCMD) tool cover -func=$(COVERAGE_PROFILE) | awk '/^total:/{gsub(/%/,"",$$NF); print $$NF}' > $(COVERAGE_BASELINE)
-	@echo "Saved coverage baseline: $$(cat $(COVERAGE_BASELINE))%"
-
-## CI gate: fails if total coverage dropped more than COVERAGE_TOLERANCE
-## (default 0.5pp) below the committed baseline. A small tolerance absorbs the
-## run-to-run jitter from the in-process tests/e2e suite folded in via
-## --coverpkg (timing-dependent which handler lines execute).
-test-coverage-check: test-coverage
-	@scripts/coverage-check.sh $(COVERAGE_PROFILE) $(COVERAGE_BASELINE)
-
 ########################################################
 ## Lint
 ########################################################
@@ -253,17 +185,12 @@ test-coverage-check: test-coverage
 ## everything else automatically, so new packages are scanned by default.
 LINT_EXCLUDE_DIRS_RE=/(backend/go/(piper|silero-vad|llm)|cmd/launcher)(/|$$)

-## Set LINT_NEW_FROM to a git ref to override .golangci.yml's
-## new-from-merge-base (origin/master). Useful from a fork clone where
-## origin/master is stale relative to the canonical repo — the pre-commit
-## hook passes the resolved upstream ref here so local lint matches CI.
-LINT_NEW_FROM?=
 lint:
 	@command -v golangci-lint >/dev/null 2>&1 || { \
 		echo 'golangci-lint not installed. Install: go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@latest'; \
 		exit 1; \
 	}
-	golangci-lint run $(if $(LINT_NEW_FROM),--new-from-merge-base=$(LINT_NEW_FROM),) $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')
+	golangci-lint run $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')

 ## Like `lint` but reports every issue, including the pre-existing baseline
 ## that `lint` ignores via .golangci.yml's new-from-merge-base. Use this to
@@ -275,17 +202,6 @@ lint-all:
 	}
 	golangci-lint run --new=false --new-from-merge-base= --new-from-rev= $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')

-########################################################
-## Git hooks
-########################################################
-## Points git at the versioned .githooks/ directory so the pre-commit hook
-## (lint + coverage gate) runs locally. Run once per clone. Undo with:
-## `git config --unset core.hooksPath`. Skip a single commit with
-## `git commit --no-verify`.
-install-hooks:
-	git config core.hooksPath .githooks
-	@echo 'Installed git hooks: core.hooksPath -> .githooks (pre-commit runs lint + test-coverage-check on Go changes)'
-
 ########################################################
 ## E2E AIO tests (uses standard image with pre-configured models)
 ########################################################
@@ -316,20 +232,13 @@ run-e2e-aio: protogen-go
 	@echo 'Running e2e AIO tests'
 	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e-aio

-# Distributed architecture e2e (PostgreSQL + NATS via testcontainers).
-# Includes NatsJWT specs (JWT-enabled NATS). Requires Docker.
-# VLLMMultinode is excluded here; use test-e2e-vllm-multinode for that.
-test-e2e-distributed: protogen-go
-	@echo 'Running distributed e2e tests (label Distributed, incl. NatsJWT)'
-	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter='Distributed && !VLLMMultinode' --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e/distributed
-
 # vLLM multi-node DP smoke (CPU). Builds local-ai:tests and the
 # cpu-vllm backend from the current working tree, then drives a
 # head + headless follower via testcontainers-go and asserts a chat
 # completion. BuildKit caches both images, so re-runs only rebuild
 # what changed. The test lives under tests/e2e/distributed and is
 # selected by the VLLMMultinode label so it doesn't run alongside
-# test-e2e-distributed.
+# the other distributed-suite tests by default.
 test-e2e-vllm-multinode: docker-build-e2e extract-backend-vllm protogen-go
 	@echo 'Running e2e vLLM multi-node DP test'
 	LOCALAI_IMAGE=local-ai \
@@ -359,13 +268,12 @@ prepare-e2e:
 run-e2e-image:
 	docker run -p 5390:8080 -e MODELS_PATH=/models -e THREADS=1 -e DEBUG=true -d --rm -v $(TEST_DIR):/models --name e2e-tests-$(RANDOM) localai-tests

-test-e2e: build-mock-backend build-cloud-proxy-backend prepare-e2e run-e2e-image
+test-e2e: build-mock-backend prepare-e2e run-e2e-image
 	@echo 'Running e2e tests'
 	BUILD_TYPE=$(BUILD_TYPE) \
 	LOCALAI_API=http://$(E2E_BRIDGE_IP):5390 \
 	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e
 	$(MAKE) clean-mock-backend
-	$(MAKE) clean-cloud-proxy-backend
 	$(MAKE) teardown-e2e
 	docker rmi localai-tests

@@ -572,8 +480,6 @@ prepare-test-extra: protogen-python
 	$(MAKE) -C backend/python/insightface
 	$(MAKE) -C backend/python/speaker-recognition
 	$(MAKE) -C backend/rust/kokoros kokoros-grpc
-	$(MAKE) -C backend/go/rfdetr-cpp
-	$(MAKE) -C backend/go/locate-anything-cpp

 test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/transformers test
@@ -600,10 +506,6 @@ test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/insightface test
 	$(MAKE) -C backend/python/speaker-recognition test
 	$(MAKE) -C backend/rust/kokoros test
-	$(MAKE) -C backend/go/rfdetr-cpp test
-	$(MAKE) -C backend/go/locate-anything-cpp test
-	$(MAKE) -C backend/go/depth-anything-cpp test
-	$(MAKE) -C backend/go/supertonic test

 ##
 ## End-to-end gRPC tests that exercise a built backend container image.
@@ -697,16 +599,6 @@ test-extra-backend-llama-cpp-transcription: docker-build-llama-cpp
 	BACKEND_TEST_CTX_SIZE=2048 \
 	$(MAKE) test-extra-backend

-## privacy-filter: the PII/NER token-classification backend. Exercises the
-## TokenClassify RPC and asserts byte-correct, UTF-8-aligned span offsets
-## against the openai-privacy-filter multilingual GGUF (CPU-runnable, ~50M
-## active params). This is the live-backend coverage for the PII NER tier.
-test-extra-backend-privacy-filter: docker-build-privacy-filter
-	BACKEND_IMAGE=local-ai-backend:privacy-filter \
-	BACKEND_TEST_MODEL_URL=https://huggingface.co/LocalAI-io/privacy-filter-multilingual-GGUF/resolve/main/privacy-filter-multilingual-f16.gguf \
-	BACKEND_TEST_CAPS=health,load,token_classify \
-	$(MAKE) test-extra-backend
-
 ## vllm is resolved from a HuggingFace model id (no file download) and
 ## exercises Predict + streaming + tool-call extraction via the hermes parser.
 ## Requires a host CPU with the SIMD instructions the prebuilt vllm CPU
@@ -1019,19 +911,6 @@ test-extra-backend-whisper-transcription: docker-build-whisper
 	BACKEND_TEST_CAPS=health,load,transcription \
 	$(MAKE) test-extra-backend

-## Audio transcription wrapper for the parakeet-cpp (parakeet.cpp ggml port)
-## backend. Mirrors test-extra-backend-whisper-transcription: drives the
-## AudioTranscription / AudioTranscriptionStream RPCs against a published
-## Parakeet GGUF using the JFK 11s clip from whisper.cpp's CI samples. Not
-## part of the default test suite - run explicitly once the pinned model URL
-## is reachable.
-test-extra-backend-parakeet-cpp-transcription: docker-build-parakeet-cpp
-	BACKEND_IMAGE=local-ai-backend:parakeet-cpp \
-	BACKEND_TEST_MODEL_URL=https://huggingface.co/mudler/parakeet-cpp-gguf/resolve/main/tdt_ctc-110m-f16.gguf \
-	BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
-	BACKEND_TEST_CAPS=health,load,transcription \
-	$(MAKE) test-extra-backend
-
 ## LocalVQE audio transform (joint AEC + noise suppression + dereverb).
 ## Exercises the audio_transform capability end-to-end: batch transform
 ## of a real WAV fixture and bidi streaming of synthetic silent frames.
@@ -1136,10 +1015,6 @@ backends/ds4-darwin: build
 	bash ./scripts/build/ds4-darwin.sh
 	./local-ai backends install "ocifile://$(abspath ./backend-images/ds4.tar)"

-backends/privacy-filter-darwin: build
-	bash ./scripts/build/privacy-filter-darwin.sh
-	./local-ai backends install "ocifile://$(abspath ./backend-images/privacy-filter.tar)"
-
 build-darwin-python-backend: build
 	bash ./scripts/build/python-darwin.sh

@@ -1185,31 +1060,21 @@ BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
 # Single-model; hardware-only validation lives at tests/e2e-backends/
 # (BACKEND_BINARY mode); see docs/superpowers/plans/2026-05-11-ds4-backend.md.
 BACKEND_DS4 = ds4|ds4|.|false|false
-# privacy-filter wraps the standalone privacy-filter.cpp GGML engine (the
-# openai-privacy-filter PII/NER token classifier) — the TokenClassify RPC for
-# the PII redactor tier, on stock ggml with no llama.cpp carry-patches.
-BACKEND_PRIVACY_FILTER = privacy-filter|privacy-filter|.|false|false

 # Golang backends
 BACKEND_PIPER = piper|golang|.|false|true
 BACKEND_LOCAL_STORE = local-store|golang|.|false|true
-BACKEND_CLOUD_PROXY = cloud-proxy|golang|.|false|true
 BACKEND_HUGGINGFACE = huggingface|golang|.|false|true
 BACKEND_SILERO_VAD = silero-vad|golang|.|false|true
 BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|true
 BACKEND_WHISPER = whisper|golang|.|false|true
-BACKEND_CRISPASR = crispasr|golang|.|false|true
-BACKEND_PARAKEET_CPP = parakeet-cpp|golang|.|false|true
-BACKEND_DEPTH_ANYTHING_CPP = depth-anything-cpp|golang|.|false|true
 BACKEND_VOXTRAL = voxtral|golang|.|false|true
 BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
 BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
-BACKEND_OMNIVOICE_CPP = omnivoice-cpp|golang|.|false|true
 BACKEND_VIBEVOICE_CPP = vibevoice-cpp|golang|.|false|true
 BACKEND_LOCALVQE = localvqe|golang|.|false|true
 BACKEND_OPUS = opus|golang|.|false|true
 BACKEND_SHERPA_ONNX = sherpa-onnx|golang|.|false|true
-BACKEND_SUPERTONIC = supertonic|golang|.|false|true

 # Python backends with root context
 BACKEND_RERANKERS = rerankers|python|.|false|true
@@ -1252,7 +1117,6 @@ BACKEND_KOKOROS = kokoros|rust|.|false|true

 # C++ backends (Go wrapper with purego)
 BACKEND_SAM3_CPP = sam3-cpp|golang|.|false|true
-BACKEND_RFDETR_CPP = rfdetr-cpp|golang|.|false|true

 # Helper function to build docker image for a backend
 # Usage: $(call docker-build-backend,BACKEND_NAME,DOCKERFILE_TYPE,BUILD_CONTEXT,PROGRESS_FLAG,NEEDS_BACKEND_ARG)
@@ -1283,17 +1147,12 @@ $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
 $(eval $(call generate-docker-build-target,$(BACKEND_DS4)))
-$(eval $(call generate-docker-build-target,$(BACKEND_PRIVACY_FILTER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
-$(eval $(call generate-docker-build-target,$(BACKEND_CLOUD_PROXY)))
 $(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SILERO_VAD)))
 $(eval $(call generate-docker-build-target,$(BACKEND_STABLEDIFFUSION_GGML)))
 $(eval $(call generate-docker-build-target,$(BACKEND_WHISPER)))
-$(eval $(call generate-docker-build-target,$(BACKEND_CRISPASR)))
-$(eval $(call generate-docker-build-target,$(BACKEND_PARAKEET_CPP)))
-$(eval $(call generate-docker-build-target,$(BACKEND_DEPTH_ANYTHING_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VOXTRAL)))
 $(eval $(call generate-docker-build-target,$(BACKEND_OPUS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_RERANKERS)))
@@ -1326,7 +1185,6 @@ $(eval $(call generate-docker-build-target,$(BACKEND_WHISPERX)))
 $(eval $(call generate-docker-build-target,$(BACKEND_ACE_STEP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_ACESTEP_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_QWEN3_TTS_CPP)))
-$(eval $(call generate-docker-build-target,$(BACKEND_OMNIVOICE_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCALVQE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_MLX)))
@@ -1337,15 +1195,13 @@ $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_QUANTIZATION)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TINYGRAD)))
 $(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
-$(eval $(call generate-docker-build-target,$(BACKEND_RFDETR_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
-$(eval $(call generate-docker-build-target,$(BACKEND_SUPERTONIC)))

 # Pattern rule for docker-save targets
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar

-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-omnivoice-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy docker-build-supertonic docker-build-depth-anything-cpp docker-build-privacy-filter
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx

 ########################################################
 ### Mock Backend for E2E Tests
@@ -1357,12 +1213,6 @@ build-mock-backend: protogen-go
 clean-mock-backend:
 	rm -f tests/e2e/mock-backend/mock-backend

-build-cloud-proxy-backend: protogen-go
-	$(GOCMD) build -o tests/e2e/mock-backend/cloud-proxy ./backend/go/cloud-proxy
-
-clean-cloud-proxy-backend:
-	rm -f tests/e2e/mock-backend/cloud-proxy
-
 ########################################################
 ### UI E2E Test Server
 ########################################################
@@ -1373,50 +1223,6 @@ build-ui-test-server: build-mock-backend react-ui protogen-go
 test-ui-e2e: build-ui-test-server
 	cd core/http/react-ui && npm install && npx playwright install --with-deps chromium && npx playwright test

-## Optional Playwright worker count for the UI e2e targets below. Pass
-## UI_TEST_WORKERS=N (e.g. `make test-ui-coverage UI_TEST_WORKERS=20`) to
-## override Playwright's default (cores/2). Empty by default so Playwright
-## picks its own worker count.
-UI_TEST_WORKERS ?=
-PLAYWRIGHT_WORKERS_FLAG = $(if $(UI_TEST_WORKERS),--workers=$(UI_TEST_WORKERS),)
-
-## Fast Playwright e2e run used by the pre-commit hook on React UI changes.
-## Force-rebuilds the (non-instrumented) dist so the suite tests the working
-## tree — not a stale dist the `react-ui` skip-guard would leave — re-embeds
-## it into ui-test-server, and runs the specs. Uses the nix-provided browser
-## when PLAYWRIGHT_CHROMIUM_PATH is set (flake dev shell), else falls back to
-## downloading it as `test-ui-e2e` does.
-test-ui: build-mock-backend protogen-go
-	cd core/http/react-ui && bun install && bun run build
-	$(GOCMD) build -o tests/e2e-ui/ui-test-server ./tests/e2e-ui
-	cd core/http/react-ui && sh $(CURDIR)/scripts/ensure-playwright-browser.sh && bunx playwright test $(PLAYWRIGHT_WORKERS_FLAG)
-
-## React UI code coverage from the Playwright e2e suite. Builds a
-## NON-instrumented bundle with source maps (COVERAGE_V8=true), re-embeds it
-## into the ui-test-server (the dist is //go:embed'ed at compile time), runs the
-## Playwright specs which collect native Chromium V8 coverage (PW_V8_COVERAGE=1)
-## — far cheaper than istanbul's build-time counters (~40% faster end-to-end) —
-## convert it to istanbul via v8-to-istanbul in the coverage fixture, and write
-## an nyc report to core/http/react-ui/coverage/. Removes the dist afterwards so
-## normal builds aren't served source-mapped assets. (The legacy istanbul path
-## still exists: `bun run build:coverage` + unset PW_V8_COVERAGE.)
-test-ui-coverage: build-mock-backend protogen-go
-	trap 'rm -rf "$(CURDIR)/core/http/react-ui/dist"' EXIT; \
-	( cd core/http/react-ui && bun install && bun run build:coverage-v8 ) && \
-	$(GOCMD) build -o tests/e2e-ui/ui-test-server ./tests/e2e-ui && \
-	( cd core/http/react-ui && rm -rf .nyc_output coverage && \
-	    sh $(CURDIR)/scripts/ensure-playwright-browser.sh && \
-	    PW_V8_COVERAGE=1 bunx playwright test $(PLAYWRIGHT_WORKERS_FLAG) && bun run coverage:report )
-
-## UI coverage baseline (committed) and the strict gate that compares against
-## it — the React mirror of test-coverage-baseline / test-coverage-check.
-test-ui-coverage-baseline: test-ui-coverage
-	@node -e 'const fs=require("fs");process.stdout.write(String(JSON.parse(fs.readFileSync("core/http/react-ui/coverage/coverage-summary.json")).total.lines.pct))' > core/http/react-ui/coverage-baseline.txt
-	@echo "Saved UI coverage baseline: $$(cat core/http/react-ui/coverage-baseline.txt)% lines"
-
-test-ui-coverage-check: test-ui-coverage
-	sh $(CURDIR)/scripts/ui-coverage-check.sh core/http/react-ui/coverage/coverage-summary.json core/http/react-ui/coverage-baseline.txt
-
 test-ui-e2e-docker:
 	docker build -t localai-ui-e2e -f tests/e2e-ui/Dockerfile .
 	docker run --rm localai-ui-e2e
@@ -1460,32 +1266,13 @@ docs: docs/static/gallery.html
 ########################################################

 ## fyne cross-platform build
-# Build LocalAI.app from the launcher via fyne (metadata read from cmd/launcher/FyneApp.toml).
-# Signing happens via contrib/macos/sign-and-notarize.sh, which is a no-op when the signing
-# secrets are unset, so unsigned local/fork builds keep working.
-build-launcher-darwin:
-	rm -rf dist/LocalAI.app cmd/launcher/LocalAI.app
-	mkdir -p dist
-	cd cmd/launcher && go run fyne.io/tools/cmd/fyne@latest package -os darwin -icon ../../core/http/static/logo.png --executable $(LAUNCHER_BINARY_NAME)
-	mv cmd/launcher/LocalAI.app dist/LocalAI.app
-	bash contrib/macos/sign-and-notarize.sh sign dist/LocalAI.app
-
-# Wrap the (signed) app into a drag-to-Applications DMG via hdiutil, then sign the DMG.
-dmg-launcher-darwin: build-launcher-darwin
-	rm -rf dist/dmg dist/LocalAI.dmg
-	mkdir -p dist/dmg
-	cp -R dist/LocalAI.app dist/dmg/LocalAI.app
-	ln -s /Applications dist/dmg/Applications
-	hdiutil create -volname "LocalAI" -srcfolder dist/dmg -ov -format UDZO dist/LocalAI.dmg
-	bash contrib/macos/sign-and-notarize.sh sign dist/LocalAI.dmg
-
-# Submit the DMG to Apple notarization and staple the ticket (no-op without notary secrets).
-notarize-launcher-darwin: dmg-launcher-darwin
-	bash contrib/macos/sign-and-notarize.sh notarize dist/LocalAI.dmg
-
-# Single entrypoint for CI: build -> sign app -> dmg -> sign dmg -> notarize -> staple.
-release-launcher-darwin: notarize-launcher-darwin
-	@echo "dist/LocalAI.dmg is ready"
+build-launcher-darwin: build-launcher
+	go run github.com/tiagomelo/macos-dmg-creator/cmd/createdmg@latest \
+	--appName "LocalAI" \
+	--appBinaryPath "$(LAUNCHER_BINARY_NAME)" \
+	--bundleIdentifier "com.localai.launcher" \
+	--iconPath "core/http/static/logo.png" \
+	--outputDir "dist/"

 build-launcher-linux:
-	cd cmd/launcher && go run fyne.io/tools/cmd/fyne@latest package -os linux -icon ../../core/http/static/logo.png --executable $(LAUNCHER_BINARY_NAME)-linux && mv LocalAI.tar.xz ../../$(LAUNCHER_BINARY_NAME)-linux.tar.xz
+	cd cmd/launcher && go run fyne.io/tools/cmd/fyne@latest package -os linux -icon ../../core/http/static/logo.png --executable $(LAUNCHER_BINARY_NAME)-linux && mv launcher.tar.xz ../../$(LAUNCHER_BINARY_NAME)-linux.tar.xz
--- a/README.md
+++ b/README.md
@@ -29,32 +29,14 @@
 <a href="https://trendshift.io/repositories/5539" target="_blank"><img src="https://trendshift.io/api/badge/repositories/5539" alt="mudler%2FLocalAI | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 </p>

-<!-- Keep these links, translations synced daily. -->
-<p align="center">
-<a href="https://zdoc.app/de/mudler/LocalAI">Deutsch</a> |
-<a href="https://zdoc.app/es/mudler/LocalAI">Español</a> |
-<a href="https://zdoc.app/fr/mudler/LocalAI">français</a> |
-<a href="https://zdoc.app/ja/mudler/LocalAI">日本語</a> |
-<a href="https://zdoc.app/ko/mudler/LocalAI">한국어</a> |
-<a href="https://zdoc.app/pt/mudler/LocalAI">Português</a> |
-<a href="https://zdoc.app/ru/mudler/LocalAI">Русский</a> |
-<a href="https://zdoc.app/zh/mudler/LocalAI">中文</a>
-</p>
-
 **LocalAI** is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.

-**A small core, not a bundle.** Each backend wraps a best-in-class engine (llama.cpp, vLLM, whisper.cpp, stable-diffusion, MLX...) in its own image, pulled only when a model needs it. You install nothing you don't use.
-
-- **Composable by design**: backends are separate and pulled on demand, so you install only what your model needs
-- **Open and extensible**: load any model, or build your own backend in any language against an open interface
-- **Drop-in API compatibility**: OpenAI, Anthropic, and ElevenLabs APIs across every backend
-- **Any model, any modality**: LLMs, vision, voice, image, and video behind one API
-- **Any hardware**: NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
-- **Multi-user ready**: API key auth, user quotas, role-based access
-- **Built-in AI agents**: autonomous agents with tool use, RAG, MCP, and skills
-- **Privacy-first**: your data never leaves your infrastructure
-
-![A small LocalAI core with backends (llama.cpp, vLLM, MLX, whisper.cpp, stable-diffusion, kokoro, parakeet.cpp...) plugged in as separate on-demand images](docs/static/images/diagrams/composable-core.png)
+- **Drop-in API compatibility** — OpenAI, Anthropic, ElevenLabs APIs
+- **36+ backends** — llama.cpp, vLLM, transformers, whisper, diffusers, MLX...
+- **Any hardware** — NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
+- **Multi-user ready** — API key auth, user quotas, role-based access
+- **Built-in AI agents** — autonomous agents with tool use, RAG, MCP, and skills
+- **Privacy-first** — your data never leaves your infrastructure

 Created by [Ettore Di Giacinto](https://github.com/mudler) and maintained by the [LocalAI team](#team).

@@ -161,30 +143,14 @@ local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
 local-ai run oci://localai/phi-2:latest
 ```

-To test a running LocalAI server from the terminal, open an interactive chat session from another shell. Inside the prompt, `/models` lists installed models and `/model <name>` switches between them.
-
-```bash
-# Terminal 1
-local-ai run llama-3.2-1b-instruct:q4_k_m
-
-# Terminal 2
-local-ai chat --model llama-3.2-1b-instruct:q4_k_m
-```
-
 > **Automatic Backend Detection**: LocalAI automatically detects your GPU capabilities and downloads the appropriate backend. For advanced options, see [GPU Acceleration](https://localai.io/features/gpu-acceleration/).

 For more details, see the [Getting Started guide](https://localai.io/basics/getting_started/).

 ## Latest News

-- **June 2026**: New [realtime voice assistant demo](https://github.com/localai-org/localai-realtime-demo) (a tiny Go client for the Realtime API with a full talk-back voice loop and tool calling), plus [streaming of the realtime LLM / TTS / transcription pipeline stages](https://github.com/mudler/LocalAI/pull/10176) and [configurable WebRTC ICE candidates](https://github.com/mudler/LocalAI/pull/10231).
-- **June 2026**: Big speech push: the [parakeet.cpp](https://github.com/mudler/parakeet.cpp) ASR engine gains [NeMo-faithful segment timestamps](https://github.com/mudler/LocalAI/pull/10207), a [multilingual streaming Nemotron-3.5 model](https://github.com/mudler/LocalAI/pull/10199), [dynamic batching for concurrent transcription](https://github.com/mudler/LocalAI/pull/10112) and [CUDA graphs](https://github.com/mudler/LocalAI/pull/10273); the new [CrispASR backend](https://github.com/mudler/LocalAI/pull/10099) adds multi-architecture ASR + TTS, and [60 Piper TTS voices across 42 languages](https://github.com/mudler/LocalAI/pull/10296) land in the gallery (plus [per-request TTS instructions and params](https://github.com/mudler/LocalAI/pull/10172)).
-- **June 2026**: New backends and models: [locate-anything.cpp](https://github.com/mudler/LocalAI/pull/10264) for open-vocabulary object detection via ggml, [Ideogram4 image generation](https://github.com/mudler/LocalAI/pull/10201) in stablediffusion-ggml, [llama.cpp video input](https://github.com/mudler/LocalAI/pull/10216), and the [Gemma 4 QAT family with MTP speculative-decoding pairs](https://github.com/mudler/LocalAI/pull/10215). Plus an [interactive CLI chat mode](https://github.com/mudler/LocalAI/pull/10226) and [RAG source citations in agent responses](https://github.com/mudler/LocalAI/pull/10228).
-- **June 2026**: Distributed mode hardening: [prefix-cache-aware routing](https://github.com/mudler/LocalAI/pull/10071), a [production-ready request router with auto-sized embedding/rerank batches](https://github.com/mudler/LocalAI/pull/10104), [ds4 layer-split distributed inference](https://github.com/mudler/LocalAI/pull/10098), [NATS JWT auth + TLS/mTLS](https://github.com/mudler/LocalAI/pull/10159), and [resumable file uploads](https://github.com/mudler/LocalAI/pull/10109).
-- **May 2026**: **LocalAI 4.3.0** - `llama.cpp` [prompt cache on by default](https://github.com/mudler/LocalAI/pull/9925) (repeated system prompts collapse from minutes to seconds), [keyless cosign signing of backend OCI images](https://github.com/mudler/LocalAI/pull/9823), [per-API-key + per-user usage attribution](https://github.com/mudler/LocalAI/pull/9920), Distributed v3 with [per-request replica routing](https://github.com/mudler/LocalAI/pull/9968). [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.3.0)
-- **May 2026**: **LocalAI 4.2.0** - LocalAI sees and hears: [voice recognition](https://github.com/mudler/LocalAI/pull/9500), [face recognition + antispoofing liveness](https://github.com/mudler/LocalAI/pull/9480), speaker diarization. Plus [drop-in Ollama API](https://github.com/mudler/LocalAI/pull/9284), [video generation](https://github.com/mudler/LocalAI/pull/9420), redesigned UI with i18n + admin-configurable branding, vLLM at feature parity with llama.cpp, and 11 new backends. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.2.0)
-- **April 2026**: **LocalAI 4.1.0** - LocalAI becomes a control tower: distributed cluster mode with VRAM-aware smart routing + autoscaling, multi-user platform with OIDC and API keys, per-user quotas with predictive analytics, in-UI fine-tuning with TRL (auto-export to GGUF), on-the-fly quantization backend, visual pipeline editor. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.1.0)
-- **March 2026**: **LocalAI 4.0.0** - native agentic orchestration with the new [Agenthub](https://agenthub.localai.io) community hub, full React UI rewrite with Canvas mode, [MCP Apps + client-side](https://github.com/mudler/LocalAI/pull/8947) with tool streaming, [WebRTC realtime audio](https://github.com/mudler/LocalAI/pull/8790), [MLX-distributed](https://github.com/mudler/LocalAI/pull/8801). [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.0.0)
+- **April 2026**: [Voice recognition](https://github.com/mudler/LocalAI/pull/9500), [Face recognition, identification & liveness detection](https://github.com/mudler/LocalAI/pull/9480), [Ollama API compatibility](https://github.com/mudler/LocalAI/pull/9284), [Video generation in stable-diffusion.ggml](https://github.com/mudler/LocalAI/pull/9420), [Backend versioning with auto-upgrade](https://github.com/mudler/LocalAI/pull/9315), [Pin models & load-on-demand toggle](https://github.com/mudler/LocalAI/pull/9309), [Universal model importer](https://github.com/mudler/LocalAI/pull/9466), new backends: [sglang](https://github.com/mudler/LocalAI/pull/9359), [ik-llama-cpp](https://github.com/mudler/LocalAI/pull/9326), [TurboQuant](https://github.com/mudler/LocalAI/pull/9355), [sam.cpp](https://github.com/mudler/LocalAI/pull/9288), [Kokoros](https://github.com/mudler/LocalAI/pull/9212), [qwen3tts.cpp](https://github.com/mudler/LocalAI/pull/9316), [tinygrad multimodal](https://github.com/mudler/LocalAI/pull/9364)
+- **March 2026**: [Agent management](https://github.com/mudler/LocalAI/pull/8820), [New React UI](https://github.com/mudler/LocalAI/pull/8772), [WebRTC](https://github.com/mudler/LocalAI/pull/8790), [MLX-distributed via P2P and RDMA](https://github.com/mudler/LocalAI/pull/8801), [MCP Apps, MCP Client-side](https://github.com/mudler/LocalAI/pull/8947)
 - **February 2026**: [Realtime API for audio-to-audio with tool calling](https://github.com/mudler/LocalAI/pull/6245), [ACE-Step 1.5 support](https://github.com/mudler/LocalAI/pull/8396)
 - **January 2026**: **LocalAI 3.10.0** — Anthropic API support, Open Responses API, video & image generation (LTX-2), unified GPU backends, tool streaming, Moonshine, Pocket-TTS. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v3.10.0)
 - **December 2025**: [Dynamic Memory Resource reclaimer](https://github.com/mudler/LocalAI/pull/7583), [Automatic multi-GPU model fitting (llama.cpp)](https://github.com/mudler/LocalAI/pull/7584), [Vibevoice backend](https://github.com/mudler/LocalAI/pull/7494)
@@ -220,29 +186,10 @@ For older news and full release notes, see [GitHub Releases](https://github.com/

 ## Supported Backends & Acceleration

-LocalAI supports **60+ backends** including llama.cpp, vLLM, SGLang, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).
+LocalAI supports **36+ backends** including llama.cpp, vLLM, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).

 See the full [Backend & Model Compatibility Table](https://localai.io/model-compatibility/) and [GPU Acceleration guide](https://localai.io/features/gpu-acceleration/).

-### Backends built by us
-
-Most backends wrap a best-in-class upstream engine. A handful of them are native C/C++/GGML engines (no Python at inference) developed and maintained by the LocalAI project itself:
-
-| Backend | What it does |
-|---------|-------------|
-| [parakeet.cpp](https://github.com/mudler/parakeet.cpp) | C++/GGML port of NVIDIA NeMo Parakeet ASR (tdt/ctc/rnnt/hybrid), with cache-aware streaming transcription |
-| [ced.cpp](https://github.com/mudler/ced.cpp) | C++/GGML port of the CED audio-tagging models: sound-event classification (527-class AudioSet) over REST and the realtime API for live recognition |
-| [voxtral.c](https://github.com/mudler/voxtral.c) | Voxtral Realtime 4B speech-to-text in pure C |
-| [vibevoice.cpp](https://github.com/mudler/vibevoice.cpp) | Native port of Microsoft VibeVoice for TTS (voice cloning) and long-form ASR with speaker diarization |
-| [rf-detr.cpp](https://github.com/mudler/rf-detr.cpp) | Native RF-DETR object detection and instance segmentation |
-| [locate-anything.cpp](https://github.com/mudler/locate-anything.cpp) | Open-vocabulary object detection and visual grounding (LocateAnything-3B) |
-| [depth-anything.cpp](https://github.com/mudler/depth-anything.cpp) | Depth Anything 3 monocular metric depth + camera pose estimation |
-| [privacy-filter.cpp](https://github.com/localai-org/privacy-filter.cpp) | Standalone GGML PII/NER token-classification engine powering LocalAI's PII redaction tier |
-| [LocalVQE](https://github.com/localai-org/LocalVQE) | Joint acoustic echo cancellation, noise suppression, and dereverberation |
-| [local-store](https://github.com/mudler/LocalAI) | Local-first vector database for embeddings (shipped in-tree) |
-
-We also maintain [apex-quant](https://github.com/localai-org/apex-quant), a per-tensor, per-layer quantization recipe for Mixture-of-Experts models that exploits their structural sparsity to produce GGUFs matching or beating Q8_0 quality - and they run out of the box on stock llama.cpp.
-
 ## Resources

 - [Documentation](https://localai.io/)
@@ -252,7 +199,7 @@ We also maintain [apex-quant](https://github.com/localai-org/apex-quant), a per-
 - [Integrations & community projects](https://localai.io/docs/integrations/)
 - [Installation video walkthrough](https://www.youtube.com/watch?v=cMVNnlqwfw4)
 - [Media & blog posts](https://localai.io/basics/news/#media-blogs-social)
-- [Examples](https://github.com/mudler/LocalAI-examples) — including the [realtime voice assistant demo](https://github.com/localai-org/localai-realtime-demo) (Go client for the Realtime API with tool calling)
+- [Examples](https://github.com/mudler/LocalAI-examples)

 ## Team

@@ -289,22 +236,11 @@ A huge thank you to our generous sponsors who support this project covering CI e
  <a href="https://www.spectrocloud.com/" target="blank">
    <img height="200" src="https://github.com/user-attachments/assets/72eab1dd-8b93-4fc0-9ade-84db49f24962">
  </a>
-</p>
-
-<details>
-
-<summary>
-Past sponsors
-</summary>
-
-<p align="center">
  <a href="https://www.premai.io/" target="blank">
    <img height="200" src="https://github.com/mudler/LocalAI/assets/2420543/42e4ca83-661e-4f79-8e46-ae43689683d6"> <br>
  </a>
 </p>

-</details>
-
 ### Individual sponsors

 A special thanks to individual sponsors, a full list is on [GitHub](https://github.com/sponsors/mudler) and [buymeacoffee](https://buymeacoffee.com/mudler). Special shout out to [drikster80](https://github.com/drikster80) for being generous. Thank you everyone!
--- a/backend/Dockerfile.golang
+++ b/backend/Dockerfile.golang
@@ -65,12 +65,7 @@ RUN <<EOT bash
            libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
            git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
            ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
-            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils && \
-        apt-get install -y mesa-vulkan-drivers libdrm2
-        # Mesa Vulkan ICD drivers (ANV/RADV/lavapipe) + their manifests. The
-        # LunarG SDK below only provides the loader and shader tooling, not
-        # hardware drivers — without Mesa, package-gpu-libs.sh has no ICD to
-        # bundle and the packaged backend finds no GPU at runtime.
+            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
        if [ "amd64" = "$TARGETARCH" ]; then
            wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
            tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
@@ -211,16 +206,6 @@ RUN if [ "${BACKEND}" = "opus" ]; then \
    apt-get clean && rm -rf /var/lib/apt/lists/*; \
 fi

-# CrispASR's piper TTS backend dlopens libespeak-ng at runtime to phonemize
-# non-English text (the MIT-clean path; English uses a built-in G2P). Install
-# the espeak-ng runtime + its libpcaudio/libsonic deps + voice data so
-# package.sh can bundle them into the FROM scratch image.
-RUN if [ "${BACKEND}" = "crispasr" ]; then \
-    apt-get update && apt-get install -y --no-install-recommends \
-        espeak-ng-data libespeak-ng1 libpcaudio0 libsonic0 && \
-    apt-get clean && rm -rf /var/lib/apt/lists/*; \
-fi
-
 COPY . /LocalAI

 RUN git config --global --add safe.directory /LocalAI
--- a/backend/Dockerfile.privacy-filter
+++ b/backend/Dockerfile.privacy-filter
@@ -1,109 +0,0 @@
-ARG BASE_IMAGE=ubuntu:24.04
-# BUILDER_BASE_IMAGE defaults to BASE_IMAGE so the Dockerfile parses when no
-# prebuilt base is supplied; the builder-prebuilt stage is only entered when
-# BUILDER_TARGET=builder-prebuilt, so the fallback content is harmless
-# (BuildKit prunes the unreferenced builder).
-ARG BUILDER_BASE_IMAGE=${BASE_IMAGE}
-# BUILDER_TARGET selects which builder stage the scratch image copies from.
-# Declared before any FROM so it is usable in `FROM ${BUILDER_TARGET}`. The
-# backend_build workflow sets it to builder-prebuilt when the matrix entry
-# provides builder-base-image, else builder-fromsource (the local default).
-ARG BUILDER_TARGET=builder-fromsource
-ARG APT_MIRROR=""
-ARG APT_PORTS_MIRROR=""
-
-# privacy-filter: standalone GGML engine for the openai-privacy-filter PII/NER
-# token classifier, wrapped as a LocalAI gRPC backend.
-#
-# Mirrors backend/Dockerfile.llama-cpp: the build toolchain (gRPC + cmake +
-# protoc + conditional CUDA/Vulkan) comes from the shared
-# .docker/install-base-deps.sh (from-source path) or a prebuilt
-# quay.io/go-skynet/ci-cache:base-grpc-* image (CI path) — nothing GPU-specific
-# is hand-rolled here. BUILD_TYPE selects the engine backend in the Makefile:
-# "" = cpu, "cublas" -> -DPF_CUDA=ON, "vulkan" -> -DPF_VULKAN=ON.
-
-# ============================================================================
-# Stage: builder-fromsource — self-contained build. Runs the same install
-# script backend/Dockerfile.base-grpc-builder runs, so this path is
-# bit-equivalent to the prebuilt base. Used when BUILDER_TARGET=builder-fromsource
-# (the default; local `make backends/privacy-filter`).
-# ============================================================================
-FROM ${BASE_IMAGE} AS builder-fromsource
-ARG BUILD_TYPE
-ARG CUDA_MAJOR_VERSION
-ARG CUDA_MINOR_VERSION
-ARG CMAKE_FROM_SOURCE=false
-# CUDA Toolkit 13.x needs CMake 3.31.9+ for correct toolchain/arch detection.
-ARG CMAKE_VERSION=3.31.10
-ARG GRPC_VERSION=v1.65.0
-ARG GRPC_MAKEFLAGS="-j4 -Otarget"
-ARG SKIP_DRIVERS=false
-ARG TARGETARCH
-ARG UBUNTU_VERSION=2404
-ARG APT_MIRROR
-ARG APT_PORTS_MIRROR
-
-ENV BUILD_TYPE=${BUILD_TYPE} \
-    CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
-    CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
-    CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
-    CMAKE_VERSION=${CMAKE_VERSION} \
-    GRPC_VERSION=${GRPC_VERSION} \
-    GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
-    SKIP_DRIVERS=${SKIP_DRIVERS} \
-    TARGETARCH=${TARGETARCH} \
-    UBUNTU_VERSION=${UBUNTU_VERSION} \
-    APT_MIRROR=${APT_MIRROR} \
-    APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
-    DEBIAN_FRONTEND=noninteractive
-# CUDA on PATH (a no-op when CUDA is not installed, e.g. cpu/vulkan builds).
-ENV PATH=/usr/local/cuda/bin:${PATH}
-
-WORKDIR /build
-
-# apt deps + cmake + protoc + gRPC + conditional CUDA/Vulkan, all from the
-# shared script (the source of truth that base-grpc-builder also runs).
-RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
-    --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
-    bash /usr/local/sbin/install-base-deps
-
-# install-base-deps installs gRPC under /opt/grpc; copy it to /usr/local so the
-# backend's find_package(gRPC CONFIG) resolves it at the canonical prefix.
-RUN cp -a /opt/grpc/. /usr/local/
-
-COPY . /LocalAI
-
-RUN --mount=type=cache,target=/root/.ccache,id=privacy-filter-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
-    make -C /LocalAI/backend/cpp/privacy-filter BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
-
-# ============================================================================
-# Stage: builder-prebuilt — FROM a prebuilt
-# quay.io/go-skynet/ci-cache:base-grpc-* image (gRPC at /opt/grpc + apt deps +
-# CUDA/Vulkan already installed). Used in CI when the matrix entry sets
-# builder-base-image.
-# ============================================================================
-FROM ${BUILDER_BASE_IMAGE} AS builder-prebuilt
-ARG BUILD_TYPE
-ARG TARGETARCH
-ENV BUILD_TYPE=${BUILD_TYPE}
-# CUDA on PATH (a no-op for the cpu/vulkan base images).
-ENV PATH=/usr/local/cuda/bin:${PATH}
-
-# Mirror builder-fromsource: the base-grpc image installs gRPC to /opt/grpc but
-# does not copy it to /usr/local.
-RUN cp -a /opt/grpc/. /usr/local/
-
-COPY . /LocalAI
-
-RUN --mount=type=cache,target=/root/.ccache,id=privacy-filter-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
-    make -C /LocalAI/backend/cpp/privacy-filter BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
-
-# ============================================================================
-# Final stage — copy the package output from the selected builder. BuildKit
-# does not expand variables in `COPY --from=`, so alias the chosen builder to a
-# fixed stage name first.
-# ============================================================================
-FROM ${BUILDER_TARGET} AS builder
-
-FROM scratch
-COPY --from=builder /LocalAI/backend/cpp/privacy-filter/package/. ./
--- a/backend/Dockerfile.python
+++ b/backend/Dockerfile.python
@@ -66,12 +66,7 @@ RUN <<EOT bash
            libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
            git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
            ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
-            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils && \
-        apt-get install -y mesa-vulkan-drivers libdrm2
-        # Mesa Vulkan ICD drivers (ANV/RADV/lavapipe) + their manifests. The
-        # LunarG SDK below only provides the loader and shader tooling, not
-        # hardware drivers — without Mesa, package-gpu-libs.sh has no ICD to
-        # bundle and the packaged backend finds no GPU at runtime.
+            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
        if [ "amd64" = "$TARGETARCH" ]; then
            wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
            tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
@@ -131,7 +126,6 @@ RUN <<EOT bash
        apt-get update && \
        apt-get install -y --no-install-recommends \
            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
-            cuda-nvrtc-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
--- a/backend/backend.proto
+++ b/backend/backend.proto
@@ -24,10 +24,6 @@ service Backend {
  rpc TokenizeString(PredictOptions) returns (TokenizationResponse) {}
  rpc Status(HealthMessage) returns (StatusResponse) {}
  rpc Detect(DetectOptions) returns (DetectResponse) {}
-  // SoundDetection runs an audio-tagging / sound-event-classification model
-  // (e.g. CED over the AudioSet ontology) on a clip and returns scored labels.
-  rpc SoundDetection(SoundDetectionRequest) returns (SoundDetectionResponse) {}
-  rpc Depth(DepthRequest) returns (DepthResponse) {}
  rpc FaceVerify(FaceVerifyRequest) returns (FaceVerifyResponse) {}
  rpc FaceAnalyze(FaceAnalyzeRequest) returns (FaceAnalyzeResponse) {}
  rpc VoiceVerify(VoiceVerifyRequest) returns (VoiceVerifyResponse) {}
@@ -41,22 +37,6 @@ service Backend {

  rpc Rerank(RerankRequest) returns (RerankResult) {}

-  // TokenClassify runs a token-classification (NER) model on the
-  // supplied text and returns each detected entity span. Used by the
-  // PII redactor's optional NER tier — the regex tier still handles
-  // formatted hits cheaply, while this catches names, locations, and
-  // other unformatted PII that regex misses.
-  rpc TokenClassify(TokenClassifyRequest) returns (TokenClassifyResponse) {}
-
-  // Score evaluates the model's joint log-probability of each
-  // supplied candidate continuation given a shared prompt. The
-  // prompt's KV cache is computed once and reused across candidates.
-  // Used for routing-policy multi-label classification, reranking,
-  // calibrated confidence, and reward-model scoring — any task where
-  // the consumer wants the model's confidence in a pre-specified
-  // continuation rather than a generated one.
-  rpc Score(ScoreRequest) returns (ScoreResponse) {}
-
  rpc GetMetrics(MetricsRequest) returns (MetricsResponse);

  rpc VAD(VADRequest) returns (VADResponse) {}
@@ -88,23 +68,6 @@ service Backend {
  rpc QuantizationProgress(QuantizationProgressRequest) returns (stream QuantizationProgressUpdate) {}
  rpc StopQuantization(QuantizationStopRequest) returns (Result) {}

-  // Forward proxies a raw HTTP request to an upstream provider. The
-  // cloud-proxy backend implements this for passthrough-mode model
-  // configs: the client wire format is preserved end-to-end (no
-  // translation through internal proto), which means new provider
-  // fields work the day they ship. Translation-mode proxies use the
-  // standard Predict/PredictStream RPCs instead. Backends that don't
-  // support this return UNIMPLEMENTED.
-  //
-  // The request is bidirectionally streamed so large bodies can flow
-  // without buffering. In practice the first ForwardRequest carries
-  // path, method, headers, and the initial body chunk; subsequent
-  // messages append body chunks. The first ForwardReply carries the
-  // upstream status and response headers; subsequent messages stream
-  // body chunks (SSE frames or chunked transfer). Cancellation of the
-  // gRPC context closes the upstream connection.
-  rpc Forward(stream ForwardRequest) returns (stream ForwardReply) {}
-
 }

 // Define the empty request
@@ -118,76 +81,6 @@ message MetricsResponse {
  int32 prompt_tokens_processed = 5;
 }

-// TokenClassifyRequest carries the text to classify plus an optional
-// score threshold. The transformers backend interprets threshold as
-// the minimum confidence to include in the response; 0 = include all.
-message TokenClassifyRequest {
-  string text = 1;
-  float threshold = 2;
-}
-
-// TokenClassifyEntity is one detected entity span. Byte offsets are
-// into the original UTF-8 text — start..end is a half-open range that
-// addresses the substring corresponding to entity_group.
-//
-// entity_group follows HuggingFace's aggregated-tag convention (e.g.
-// "PER", "LOC", "ORG", or a PII-specific label like "EMAIL" /
-// "SSN" depending on the model). The redactor's per-pattern action
-// map keys off this string.
-message TokenClassifyEntity {
-  string entity_group = 1;
-  int32 start = 2;
-  int32 end = 3;
-  float score = 4;
-  string text = 5;
-}
-
-message TokenClassifyResponse {
-  repeated TokenClassifyEntity entities = 1;
-}
-
-// ScoreRequest carries one shared prompt and one or more continuations
-// to score against it. The backend tokenises the prompt once and reuses
-// the resulting KV cache across all candidates in this request.
-message ScoreRequest {
-  string prompt = 1;
-  repeated string candidates = 2;
-  // Return per-token logprobs for each candidate when true. Default
-  // false to keep the wire response small; the joint log_prob field
-  // covers the common ranking case.
-  bool include_token_logprobs = 3;
-  // When true, the response also populates length_normalized_log_prob
-  // (joint log-prob divided by candidate token count). Useful when
-  // candidates differ in length and the consumer wants a per-token
-  // measure comparable across them (PMI-style scoring).
-  bool length_normalize = 4;
-}
-
-// CandidateScore is one row in the ScoreResponse, matching by index
-// the candidate in ScoreRequest.candidates.
-message CandidateScore {
-  // Sum of log P(token_i | prompt, candidate_token_<i) across the
-  // candidate's tokens. The primary ranking signal.
-  double log_prob = 1;
-  // log_prob / num_tokens — populated when length_normalize=true on
-  // the request.
-  double length_normalized_log_prob = 2;
-  // Per-token detail — populated when include_token_logprobs=true.
-  repeated TokenLogProb tokens = 3;
-  // Number of tokens the backend tokenised this candidate into, after
-  // any backend-specific normalisation (e.g. leading-space handling).
-  int32 num_tokens = 4;
-}
-
-message TokenLogProb {
-  string token = 1;
-  double log_prob = 2;
-}
-
-message ScoreResponse {
-  repeated CandidateScore candidates = 1;
-}
-
 message RerankRequest {
  string query = 1;
  repeated string documents = 2;
@@ -432,25 +325,6 @@ message ModelOptions {
  // applied verbatim to the backend's engine constructor (e.g. vLLM AsyncEngineArgs).
  // Unknown keys produce an error at LoadModel time.
  string EngineArgs = 73;
-
-  // Proxy carries the cloud-proxy backend's per-model configuration.
-  // Empty for non-proxy backends.
-  ProxyOptions Proxy = 74;
-}
-
-// ProxyOptions configures the cloud-proxy backend. UpstreamURL and
-// Mode are always meaningful; Provider only matters in translate mode.
-// The two api_key_* fields are mutually exclusive and resolved by the
-// backend at LoadModel — core forwards the references rather than the
-// plaintext key.
-message ProxyOptions {
-  string upstream_url = 1;
-  string mode = 2;
-  string provider = 3;
-  string api_key_env = 4;
-  string api_key_file = 5;
-  string upstream_model = 6;
-  int32 request_timeout_seconds = 7;
 }

 message Result {
@@ -541,15 +415,6 @@ message TTSRequest {
  string dst = 3;
  string voice = 4;
  optional string language = 5;
-  // instructions is a free-form, per-request style/voice description (maps to
-  // the OpenAI `instructions` field). Backends that support expressive synthesis
-  // (e.g. Qwen3-TTS CustomVoice/VoiceDesign) prefer this over the static YAML
-  // option when set; backends that don't simply ignore it.
-  optional string instructions = 6;
-  // params carries optional, backend-specific per-request generation parameters
-  // (e.g. Chatterbox exaggeration/cfg_weight/temperature). Values are strings and
-  // coerced by the backend; unset leaves the backend's configured defaults.
-  map<string, string> params = 7;
 }

 message VADRequest {
@@ -674,53 +539,6 @@ message DetectResponse {
  repeated Detection Detections = 1;
 }

-// --- Sound-event classification / audio tagging messages (CED) ---
-
-message SoundDetectionRequest {
-  string src = 1;       // audio file path (LocalAI writes the upload to disk)
-  int32 top_k = 2;      // number of top tags to return (0 = all classes)
-  float threshold = 3;  // optional: drop tags scoring below this
-}
-
-message SoundClass {
-  string label = 1;     // AudioSet class name, e.g. "Baby cry, infant cry"
-  float score = 2;      // per-class probability (multi-label, independent)
-  int32 index = 3;      // class index in the model ontology
-}
-
-message SoundDetectionResponse {
-  repeated SoundClass detections = 1;  // score-descending
-}
-
-// --- Depth estimation messages (Depth Anything 3) ---
-
-message DepthRequest {
-  string src = 1;                  // input image (filesystem path or base64-encoded payload)
-  string dst = 2;                  // optional output directory for exports (glb/colmap)
-  bool include_depth = 3;          // return the per-pixel metric depth map
-  bool include_confidence = 4;     // return the per-pixel confidence map (DualDPT)
-  bool include_pose = 5;           // return camera extrinsics/intrinsics (DualDPT)
-  bool include_sky = 6;            // return the per-pixel sky map (mono models)
-  bool include_points = 7;         // back-project to a 3D point cloud (DualDPT)
-  float points_conf_thresh = 8;    // keep points with confidence >= this threshold
-  repeated string exports = 9;     // requested exports: "glb", "colmap"
-}
-
-message DepthResponse {
-  int32 width = 1;                 // processed depth-map width
-  int32 height = 2;                // processed depth-map height
-  repeated float depth = 3;        // width*height row-major metric depth
-  repeated float confidence = 4;   // width*height row-major confidence (DualDPT)
-  repeated float sky = 5;          // width*height row-major sky map (mono)
-  repeated float extrinsics = 6;   // 12 floats, 3x4 row-major (world-to-camera)
-  repeated float intrinsics = 7;   // 9 floats, 3x3 row-major
-  int32 num_points = 8;            // number of 3D points
-  repeated float points = 9;       // num_points*3 xyz, world space
-  bytes point_colors = 10;         // num_points*3 uint8 rgb
-  repeated string export_paths = 11; // paths written for the requested exports
-  bool is_metric = 12;             // depth is in metric units
-}
-
 // --- Face recognition messages ---

 message FacialArea {
@@ -1184,32 +1002,3 @@ message QuantizationStopRequest {
  string job_id = 1;
 }

-// ForwardHeader is one HTTP header on the request or response. Headers
-// like Authorization are typically injected by the backend (from the
-// resolved API key) rather than passed through from the client.
-message ForwardHeader {
-  string name = 1;
-  string value = 2;
-}
-
-// ForwardRequest is a streamed HTTP request to the upstream. First
-// message carries path/method/headers; subsequent messages carry
-// body_chunk only. All fields except body_chunk are honoured on the
-// first message and ignored thereafter.
-message ForwardRequest {
-  string path = 1;                          // e.g. "/v1/chat/completions" — appended to the model's upstream_url
-  string method = 2;                        // usually "POST"
-  repeated ForwardHeader headers = 3;
-  bytes body_chunk = 4;
-}
-
-// ForwardReply is a streamed HTTP response from the upstream. First
-// message carries status/headers; subsequent messages carry body_chunk
-// only. SSE responses arrive as a sequence of body_chunk frames; the
-// caller is responsible for any parsing.
-message ForwardReply {
-  int32 status = 1;
-  repeated ForwardHeader headers = 2;
-  bytes body_chunk = 3;
-}
-
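The comments on the removed `Score` messages describe two numbers per candidate: a joint log-probability (the sum of per-token log-probabilities) and an optional length-normalized value (that sum divided by the candidate's token count). The arithmetic can be sketched as follows. This is a minimal illustration, not LocalAI code: `score_candidate`, the Python `CandidateScore` dataclass, and the sample log-probabilities are all hypothetical, and a real backend would obtain the per-token values from the model rather than from the caller.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidateScore:
    """Illustrative stand-in for the fields of the removed CandidateScore message."""
    log_prob: float
    length_normalized_log_prob: Optional[float]  # None here when normalization is off
    num_tokens: int


def score_candidate(token_logprobs: List[float], length_normalize: bool = False) -> CandidateScore:
    """Combine per-token log-probabilities into a joint score.

    token_logprobs[i] stands for log P(token_i | prompt, earlier candidate tokens).
    The values are assumed inputs; nothing here talks to a model.
    """
    if not token_logprobs:
        raise ValueError("candidate must contain at least one token")
    log_prob = math.fsum(token_logprobs)  # joint log-probability
    normalized = log_prob / len(token_logprobs) if length_normalize else None
    return CandidateScore(log_prob, normalized, len(token_logprobs))


# Hypothetical numbers: a two-token and a five-token candidate for one prompt.
shorter = score_candidate([-0.5, -1.5], length_normalize=True)
longer = score_candidate([-0.5, -0.5, -0.5, -0.5, -0.5], length_normalize=True)
```

With these made-up inputs the shorter candidate has the higher joint log-probability while the longer one has the higher per-token value, which is the situation the `length_normalize` comment is addressing when candidates differ in length.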
--- a/backend/cpp/ds4/.gitignore
+++ b/backend/cpp/ds4/.gitignore
@@ -2,7 +2,6 @@ ds4/
 build/
 package/
 grpc-server
-ds4-worker
 *.o
 backend.pb.cc
 backend.pb.h
--- a/backend/cpp/ds4/CMakeLists.txt
+++ b/backend/cpp/ds4/CMakeLists.txt
@@ -9,22 +9,6 @@ option(DS4_NATIVE "Compile with -march=native / -mcpu=native" ON)
 set(DS4_GPU "cpu" CACHE STRING "GPU backend: cpu, cuda, or metal")
 set(DS4_DIR "${CMAKE_CURRENT_SOURCE_DIR}/ds4" CACHE PATH "Path to cloned ds4 source")

-if(${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
-    # Homebrew installs protobuf/grpc under a non-default prefix. The generated
-    # backend.pb.cc / backend.grpc.pb.cc pull in google/protobuf and grpcpp
-    # headers, but the hw_grpc_proto library links neither target, so on macOS
-    # the headers (e.g. google/protobuf/runtime_version.h) are never on the
-    # compiler's include path. Add the Homebrew prefix globally, matching the
-    # llama-cpp backend which builds on Darwin CI.
-    if(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "arm64")
-        set(HOMEBREW_DEFAULT_PREFIX "/opt/homebrew")
-    else()
-        set(HOMEBREW_DEFAULT_PREFIX "/usr/local")
-    endif()
-    link_directories("${HOMEBREW_DEFAULT_PREFIX}/lib")
-    include_directories("${HOMEBREW_DEFAULT_PREFIX}/include")
-endif()
-
 find_package(Threads REQUIRED)
 find_package(Protobuf CONFIG QUIET)
 if(NOT Protobuf_FOUND)
@@ -76,13 +60,6 @@ elseif(DS4_GPU STREQUAL "cpu")
    set(DS4_OBJS "${DS4_DIR}/ds4_cpu.o")
 endif()

-# ds4.c now references ds4_distributed.c (distributed inference) and ds4_ssd.c
-# (SSD expert-cache), each split into its own translation unit upstream. Both
-# are GPU-agnostic objects shared by every GPU mode, so link them in regardless
-# of DS4_GPU.
-list(APPEND DS4_OBJS "${DS4_DIR}/ds4_distributed.o")
-list(APPEND DS4_OBJS "${DS4_DIR}/ds4_ssd.o")
-
 add_executable(${TARGET}
    grpc-server.cpp
    dsml_parser.cpp
@@ -122,36 +99,3 @@ if(DS4_NATIVE)
        target_compile_options(${TARGET} PRIVATE -march=native)
    endif()
 endif()
-
-# ds4-worker: standalone distributed worker. Links the same ds4 engine objects
-# (including ds4_distributed.o) but has NO gRPC/protobuf dependency - it speaks
-# ds4's own TCP transport via ds4_dist_run(). Buildable wherever the engine
-# objects build, even on hosts without protobuf/grpc dev headers.
-add_executable(ds4-worker worker_main.c)
-target_include_directories(ds4-worker PRIVATE ${DS4_DIR})
-foreach(obj ${DS4_OBJS})
-    target_sources(ds4-worker PRIVATE ${obj})
-    set_source_files_properties(${obj} PROPERTIES EXTERNAL_OBJECT TRUE GENERATED TRUE)
-endforeach()
-# worker_main.c is C, but the engine objects built by nvcc (ds4_cuda.o) and the
-# Metal path (ds4_metal.o, Obj-C++) reference the C++ runtime (libstdc++). Force
-# the C++ linker driver so those symbols resolve; the C driver would not link
-# libstdc++ and the CUDA/Metal builds fail with undefined std:: references.
-set_target_properties(ds4-worker PROPERTIES LINKER_LANGUAGE CXX)
-target_link_libraries(ds4-worker PRIVATE Threads::Threads m)
-
-if(DS4_GPU STREQUAL "cuda")
-    target_link_libraries(ds4-worker PRIVATE CUDA::cudart CUDA::cublas)
-elseif(DS4_GPU STREQUAL "metal")
-    target_link_libraries(ds4-worker PRIVATE ${FOUNDATION_LIB} ${METAL_LIB})
-elseif(DS4_GPU STREQUAL "cpu")
-    target_compile_definitions(ds4-worker PRIVATE DS4_NO_GPU)
-endif()
-
-if(DS4_NATIVE)
-    if(APPLE)
-        target_compile_options(ds4-worker PRIVATE -mcpu=native)
-    else()
-        target_compile_options(ds4-worker PRIVATE -march=native)
-    endif()
-endif()
--- a/backend/cpp/ds4/Makefile
+++ b/backend/cpp/ds4/Makefile
@@ -1,10 +1,10 @@
 # ds4 backend Makefile.
 #
-# Upstream pin lives below as DS4_VERSION?=80ebbc396aee40eedc1d829222f3362d10fa4c6c
+# Upstream pin lives below as DS4_VERSION?=8d576642c39b9a2d782a80159ba84ef5a81c0b81
 # (.github/bump_deps.sh) can find and update it - matches the
 # llama-cpp / ik-llama-cpp / turboquant convention.

-DS4_VERSION?=80ebbc396aee40eedc1d829222f3362d10fa4c6c
+DS4_VERSION?=8d576642c39b9a2d782a80159ba84ef5a81c0b81
 DS4_REPO?=https://github.com/antirez/ds4

 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
@@ -18,20 +18,16 @@ UNAME_S := $(shell uname -s)

 CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release

-# ds4_distributed.o and ds4_ssd.o are GPU-agnostic translation units that
-# ds4.c/ds4_cpu.o now reference (upstream split distributed inference and the
-# SSD expert-cache into their own .c files). Both objects are shared by every
-# GPU mode, so they are appended unconditionally below.
 ifeq ($(BUILD_TYPE),cublas)
    CMAKE_ARGS += -DDS4_GPU=cuda
-    DS4_OBJ_TARGET := ds4.o ds4_cuda.o ds4_distributed.o ds4_ssd.o
+    DS4_OBJ_TARGET := ds4.o ds4_cuda.o
 else ifeq ($(UNAME_S),Darwin)
    CMAKE_ARGS += -DDS4_GPU=metal
-    DS4_OBJ_TARGET := ds4.o ds4_metal.o ds4_distributed.o ds4_ssd.o
+    DS4_OBJ_TARGET := ds4.o ds4_metal.o
 else
    # CPU reference path (Linux only - macOS CPU path is broken by VM bug per ds4 README).
    CMAKE_ARGS += -DDS4_GPU=cpu
-    DS4_OBJ_TARGET := ds4_cpu.o ds4_distributed.o ds4_ssd.o
+    DS4_OBJ_TARGET := ds4_cpu.o
 endif

 ifneq ($(NATIVE),true)
@@ -56,18 +52,17 @@ ds4:
 # the right per-platform compile flags (Objective-C/Metal on Darwin, nvcc on Linux+CUDA).
 ds4/ds4.o: ds4
 ifeq ($(BUILD_TYPE),cublas)
-	+$(MAKE) -C ds4 ds4.o ds4_cuda.o ds4_distributed.o ds4_ssd.o
+	+$(MAKE) -C ds4 ds4.o ds4_cuda.o
 else ifeq ($(UNAME_S),Darwin)
-	+$(MAKE) -C ds4 ds4.o ds4_metal.o ds4_distributed.o ds4_ssd.o
+	+$(MAKE) -C ds4 ds4.o ds4_metal.o
 else
-	+$(MAKE) -C ds4 ds4_cpu.o ds4_distributed.o ds4_ssd.o
+	+$(MAKE) -C ds4 ds4_cpu.o
 endif

 grpc-server: ds4/ds4.o
 	mkdir -p $(BUILD_DIR)
 	cd $(BUILD_DIR) && cmake $(CMAKE_ARGS) $(CURRENT_MAKEFILE_DIR) && cmake --build . --config Release -j $(JOBS)
 	cp $(BUILD_DIR)/grpc-server grpc-server
-	cp $(BUILD_DIR)/ds4-worker ds4-worker

 package: grpc-server
 	bash package.sh
@@ -76,7 +71,7 @@ test:
 	@echo "ds4 backend: e2e coverage at tests/e2e-backends/ (BACKEND_BINARY mode)"

 clean:
-	rm -rf $(BUILD_DIR) grpc-server ds4-worker package
+	rm -rf $(BUILD_DIR) grpc-server package
 	if [ -d ds4 ]; then $(MAKE) -C ds4 clean; fi

 purge: clean
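After this change, the ds4 Makefile picks the engine objects from `BUILD_TYPE` and `uname -s` alone. The branch order can be sketched as below. `select_ds4_objects` is a hypothetical helper written only to illustrate the conditionals in the diff; it does not exist in the repository.

```python
from typing import List, Tuple


def select_ds4_objects(build_type: str, uname_s: str) -> Tuple[str, List[str]]:
    """Return (DS4_GPU value, object targets) in the same order the Makefile tests them."""
    if build_type == "cublas":
        # CUDA build: checked first, before the platform test.
        return "cuda", ["ds4.o", "ds4_cuda.o"]
    if uname_s == "Darwin":
        # macOS build uses the Metal objects.
        return "metal", ["ds4.o", "ds4_metal.o"]
    # Everything else falls back to the CPU reference path.
    return "cpu", ["ds4_cpu.o"]
```

Because the `cublas` test comes first, it takes precedence over the Darwin branch, mirroring the `ifeq` / `else ifeq` / `else` chain in the Makefile.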
--- a/backend/cpp/ds4/grpc-server.cpp
+++ b/backend/cpp/ds4/grpc-server.cpp
@@ -23,13 +23,8 @@ extern "C" {

 #include <atomic>
 #include <chrono>
-#include <climits>
 #include <csignal>
-#include <cstddef>
-#include <cstdint>
-#include <cstdlib>
 #include <cstring>
-#include <ctime>
 #include <iostream>
 #include <memory>
 #include <mutex>
@@ -56,12 +51,6 @@ ds4_session *g_session = nullptr;
 int g_ctx_size = 32768;
 std::string g_kv_cache_dir; // empty disables disk cache

-// Distributed coordinator state. g_distributed is set true when LoadModel is
-// given 'ds4_role:coordinator'; generation then waits for the worker route to
-// form before running. Single-node behavior is unchanged when unset.
-bool g_distributed = false;
-int g_route_timeout_sec = 60;
-
 std::atomic<Server *> g_server{nullptr};

 // Parse a "key:value" option string. Returns empty when no colon.
@@ -71,201 +60,6 @@ static std::pair<std::string, std::string> split_option(const std::string &opt)
    return {opt.substr(0, colon), opt.substr(colon + 1)};
 }

-// Parse a positive base-10 integer. Returns false (without throwing) on empty,
-// trailing garbage, non-positive, or overflow - unlike std::stoi.
-static bool parse_positive_int(const std::string &s, int *out) {
-    if (s.empty()) return false;
-    char *end = nullptr;
-    long v = std::strtol(s.c_str(), &end, 10);
-    if (!end || *end != '\0' || v <= 0 || v > INT_MAX) return false;
-    *out = static_cast<int>(v);
-    return true;
-}
-
-// Parse a ds4 layer spec "START:END" or "START:output" into the engine's
-// distributed layer fields. Returns false on malformed input.
-static bool parse_layers_spec(const std::string &spec, ds4_distributed_layers *out) {
-    auto colon = spec.find(':');
-    if (colon == std::string::npos) return false;
-    std::string lhs = spec.substr(0, colon);
-    std::string rhs = spec.substr(colon + 1);
-    if (lhs.empty() || rhs.empty()) return false;
-    char *end = nullptr;
-    long start = std::strtol(lhs.c_str(), &end, 10);
-    if (!end || *end != '\0' || start < 0) return false;
-    out->start = static_cast<uint32_t>(start);
-    out->has_output = false;
-    if (rhs == "output") {
-        out->has_output = true;
-        out->end = out->start; // engine treats has_output as "through final layer"
-    } else {
-        long e = std::strtol(rhs.c_str(), &end, 10);
-        if (!end || *end != '\0' || e < start) return false;
-        out->end = static_cast<uint32_t>(e);
-    }
-    out->set = true;
-    return true;
-}
-
-// Parse a boolean LoadModel option. An empty value (a bare flag-style option
-// like "ssd_streaming" with no colon) means true so model YAMLs can write
-// options: ["ssd_streaming"] to enable a switch.
-static bool parse_bool_option(const std::string &s, bool *out) {
-    if (s.empty() || s == "true" || s == "1" || s == "yes" || s == "on") { *out = true; return true; }
-    if (s == "false" || s == "0" || s == "no" || s == "off") { *out = false; return true; }
-    return false;
-}
-
-// Table-driven mapping from LoadModel option keys to ds4_engine_options fields.
-// ds4_engine_options is a fixed C struct with no reflection, so the field set
-// is enumerated once here; adding a future engine knob is a one-line table
-// entry rather than a new branch in LoadModel. Two fields need ds4's own typed
-// parsers (Gib, CacheExperts) so a plain string passthrough can't cover them.
-enum class DsOptType { Bool, Int, Uint, Float, Str, Gib, CacheExperts };
-
-struct DsOptSpec {
-    const char *key;
-    DsOptType   type;
-    size_t      off;      // byte offset into ds4_engine_options
-    size_t      off2;     // second offset (CacheExperts writes experts + bytes)
-    bool        is_path;  // Str values: resolve a relative value against the model dir
-};
-
-static const DsOptSpec kEngineOptSpecs[] = {
-    {"mtp_path",                      DsOptType::Str,          offsetof(ds4_engine_options, mtp_path),                      0, true},
-    {"mtp_draft",                     DsOptType::Int,          offsetof(ds4_engine_options, mtp_draft_tokens),              0},
-    {"mtp_margin",                    DsOptType::Float,        offsetof(ds4_engine_options, mtp_margin),                    0},
-    {"prefill_chunk",                 DsOptType::Uint,         offsetof(ds4_engine_options, prefill_chunk),                 0},
-    {"power_percent",                 DsOptType::Int,          offsetof(ds4_engine_options, power_percent),                 0},
-    {"warm_weights",                  DsOptType::Bool,         offsetof(ds4_engine_options, warm_weights),                  0},
-    {"quality",                       DsOptType::Bool,         offsetof(ds4_engine_options, quality),                       0},
-    {"ssd_streaming",                 DsOptType::Bool,         offsetof(ds4_engine_options, ssd_streaming),                 0},
-    {"ssd_streaming_cold",            DsOptType::Bool,         offsetof(ds4_engine_options, ssd_streaming_cold),            0},
-    {"ssd_streaming_preload_experts", DsOptType::Uint,         offsetof(ds4_engine_options, ssd_streaming_preload_experts), 0},
-    {"ssd_streaming_cache_experts",   DsOptType::CacheExperts, offsetof(ds4_engine_options, ssd_streaming_cache_experts),
-                                                               offsetof(ds4_engine_options, ssd_streaming_cache_bytes)},
-    {"simulate_used_memory",          DsOptType::Gib,          offsetof(ds4_engine_options, simulate_used_memory_bytes),    0},
-    {"expert_profile_path",           DsOptType::Str,          offsetof(ds4_engine_options, expert_profile_path),           0, true},
-    {"directional_steering_file",     DsOptType::Str,          offsetof(ds4_engine_options, directional_steering_file),     0, true},
-    {"directional_steering_attn",     DsOptType::Float,        offsetof(ds4_engine_options, directional_steering_attn),     0},
-    {"directional_steering_ffn",      DsOptType::Float,        offsetof(ds4_engine_options, directional_steering_ffn),      0},
-};
-
-// Apply a single key:value LoadModel option to the engine options struct.
-// Unknown keys are ignored (back-compat: callers pass mixed option sets).
-// String values are copied into `storage`, whose elements the engine reads by
-// pointer during ds4_engine_open; `storage` MUST have reserved capacity so
-// push_back never reallocates and dangles an earlier c_str(). Returns false
-// with `err` set when a recognized key has an invalid value.
-static bool apply_engine_option(ds4_engine_options *opt, const std::string &key,
-                                const std::string &val, const std::string &model_dir,
-                                std::vector<std::string> &storage, std::string &err) {
-    const DsOptSpec *spec = nullptr;
-    for (const auto &s : kEngineOptSpecs) {
-        if (key == s.key) { spec = &s; break; }
-    }
-    if (!spec) return true; // unknown key: ignore
-
-    char *base = reinterpret_cast<char *>(opt);
-    switch (spec->type) {
-    case DsOptType::Bool: {
-        bool b = false;
-        if (!parse_bool_option(val, &b)) { err = key + " must be true/false"; return false; }
-        *reinterpret_cast<bool *>(base + spec->off) = b;
-        return true;
-    }
-    case DsOptType::Int: {
-        char *end = nullptr;
-        long v = std::strtol(val.c_str(), &end, 10);
-        if (val.empty() || !end || *end != '\0') { err = key + " must be an integer"; return false; }
-        *reinterpret_cast<int *>(base + spec->off) = static_cast<int>(v);
-        return true;
-    }
-    case DsOptType::Uint: {
-        char *end = nullptr;
-        long v = std::strtol(val.c_str(), &end, 10);
-        if (val.empty() || !end || *end != '\0' || v < 0 || v > static_cast<long>(UINT32_MAX)) {
-            err = key + " must be a non-negative integer"; return false;
-        }
-        *reinterpret_cast<uint32_t *>(base + spec->off) = static_cast<uint32_t>(v);
-        return true;
-    }
-    case DsOptType::Float: {
-        char *end = nullptr;
-        float f = std::strtof(val.c_str(), &end);
-        if (val.empty() || !end || *end != '\0') { err = key + " must be a number"; return false; }
-        *reinterpret_cast<float *>(base + spec->off) = f;
-        return true;
-    }
-    case DsOptType::Str: {
-        // Resolve a relative path option (e.g. mtp_path: a sibling GGUF the
-        // gallery downloaded next to the model) against the model directory, so
-        // YAMLs reference companion files by name. Absolute values pass through.
-        if (spec->is_path && !model_dir.empty() && !val.empty() && val.front() != '/') {
-            storage.push_back(model_dir + "/" + val);
-        } else {
-            storage.push_back(val);
-        }
-        *reinterpret_cast<const char **>(base + spec->off) = storage.back().c_str();
-        return true;
-    }
-    case DsOptType::Gib: {
-        uint64_t bytes = 0;
-        if (!ds4_parse_gib_arg(val.c_str(), &bytes)) {
-            err = key + " must be a GiB value, e.g. 64GB"; return false;
-        }
-        *reinterpret_cast<uint64_t *>(base + spec->off) = bytes;
-        return true;
-    }
-    case DsOptType::CacheExperts: {
-        uint32_t experts = 0;
-        uint64_t bytes = 0;
-        if (!ds4_parse_streaming_cache_experts_arg(val.c_str(), &experts, &bytes)) {
-            err = key + " must be a positive expert count or a <number>GB budget"; return false;
-        }
-        *reinterpret_cast<uint32_t *>(base + spec->off)  = experts;
-        *reinterpret_cast<uint64_t *>(base + spec->off2) = bytes;
-        return true;
-    }
-    }
-    return true;
-}
-
-// When acting as a distributed coordinator, block until the worker route
-// covers all layers (ds4_session_distributed_route_ready == 1) or the timeout
-// elapses. Returns an empty string on success, or an error message to return
-// to the client. No-op when not distributed.
-//
-// Takes the g_engine_mu lock by reference and RELEASES it during each poll
-// sleep. The wait can span up to g_route_timeout_sec seconds while workers
-// connect; holding g_engine_mu the whole time would block the Status/Health
-// readiness probes (they also lock g_engine_mu), making LocalAI's loader treat
-// a still-starting worker as hung.
-static std::string wait_route_ready(std::unique_lock<std::mutex> &lock) {
-    if (!g_distributed) return "";
-    char err[256] = {0};
-    const int deadline_polls = g_route_timeout_sec * 10; // 100ms per poll
-    for (int i = 0; i <= deadline_polls; ++i) {
-        int ready = ds4_session_distributed_route_ready(g_session, err, sizeof(err));
-        if (ready == 1) return "";
-        if (ready < 0) {
-            return std::string("ds4 distributed route error: ") +
-                   (err[0] ? err : "unknown");
-        }
-        // Release the lock while sleeping so Status/Health and other RPCs can
-        // interleave during worker startup.
-        lock.unlock();
-        struct timespec ts = {0, 100L * 1000L * 1000L}; // 100ms
-        nanosleep(&ts, nullptr);
-        lock.lock();
-        // A concurrent Free() may have torn down the engine while we slept.
-        if (!g_engine || !g_session) {
-            return "ds4: model unloaded while waiting for distributed route";
-        }
-    }
-    return "ds4 distributed route incomplete: workers not connected (layers uncovered)";
-}
-
 static void append_token_text(ds4_engine *engine, int token, std::string &out) {
    size_t len = 0;
    const char *text = ds4_token_text(engine, token, &len);
@@ -583,11 +377,6 @@ public:
                     backend::Result *result) override {
        std::lock_guard<std::mutex> lock(g_engine_mu);

-        // Reset distributed state so a model swap (a second LoadModel without
-        // ds4_role) doesn't inherit a stale coordinator configuration.
-        g_distributed = false;
-        g_route_timeout_sec = 60;
-
        if (g_engine) {
            if (g_session) { ds4_session_free(g_session); g_session = nullptr; }
            ds4_engine_close(g_engine);
@@ -602,10 +391,28 @@ public:
            return GStatus::OK;
        }

+        std::string mtp_path;
+        int mtp_draft = 0;
+        float mtp_margin = 3.0f;
+        for (const auto &opt : request->options()) {
+            auto [k, v] = split_option(opt);
+            if (k == "mtp_path") mtp_path = v;
+            else if (k == "mtp_draft") mtp_draft = std::stoi(v);
+            else if (k == "mtp_margin") mtp_margin = std::stof(v);
+            else if (k == "kv_cache_dir") g_kv_cache_dir = v;
+        }
+
+        g_kv_cache.SetDir(g_kv_cache_dir);
+
        ds4_engine_options opt = {};
        opt.model_path = model_path.c_str();
+        opt.mtp_path = mtp_path.empty() ? nullptr : mtp_path.c_str();
        opt.n_threads = request->threads() > 0 ? request->threads() : 0;
-        opt.mtp_margin = 3.0f; // ds4 default; overridable via the mtp_margin option
+        opt.mtp_draft_tokens = mtp_draft;
+        opt.mtp_margin = mtp_margin;
+        opt.directional_steering_file = nullptr;
+        opt.warm_weights = false;
+        opt.quality = false;

 #if defined(DS4_NO_GPU)
        opt.backend = DS4_BACKEND_CPU;
@@ -615,89 +422,6 @@ public:
        opt.backend = DS4_BACKEND_CUDA;
 #endif

-        // Stable storage for string-valued engine options. The engine reads
-        // these by pointer during ds4_engine_open, so the std::string backing
-        // store must outlive the call and not reallocate; reserve up front so
-        // push_back keeps every prior c_str() valid. Static + clear() reuses
-        // the buffer across LoadModel calls (the old engine is closed above).
-        static std::vector<std::string> s_opt_strings;
-        s_opt_strings.clear();
-        s_opt_strings.reserve(sizeof(kEngineOptSpecs) / sizeof(kEngineOptSpecs[0]));
-
-        // Directory of the main model, used to resolve relative path options.
-        std::string model_dir;
-        if (auto slash = model_path.find_last_of('/'); slash != std::string::npos) {
-            model_dir = model_path.substr(0, slash);
-        }
-
-        std::string ds4_role, ds4_layers, ds4_listen;
-        for (const auto &o : request->options()) {
-            auto [k, v] = split_option(o);
-            if (k == "kv_cache_dir") { g_kv_cache_dir = v; continue; }
-            else if (k == "ds4_role") { ds4_role = v; continue; }
-            else if (k == "ds4_layers") { ds4_layers = v; continue; }
-            else if (k == "ds4_listen") { ds4_listen = v; continue; }
-            else if (k == "ds4_route_timeout") {
-                if (!parse_positive_int(v, &g_route_timeout_sec)) {
-                    result->set_success(false);
-                    result->set_message("ds4: ds4_route_timeout must be a positive integer");
-                    return GStatus::OK;
-                }
-                continue;
-            }
-            std::string err;
-            if (!apply_engine_option(&opt, k, v, model_dir, s_opt_strings, err)) {
-                result->set_success(false);
-                result->set_message("ds4: " + err);
-                return GStatus::OK;
-            }
-        }
-
-        g_kv_cache.SetDir(g_kv_cache_dir);
-
-        // Coordinator wiring. 'ds4_role:coordinator' enables layer-split
-        // distributed inference: this process listens on ds4_listen and owns
-        // the ds4_layers slice; workers dial in (see `local-ai worker
-        // ds4-distributed`). Absent ds4_role => unchanged single-node path.
-        // Must be static: opt.distributed.listen_host is a const char* the
-        // engine retains past this call, so it cannot point at a local that
-        // goes out of scope (otherwise a future "simplify to local" refactor
-        // reintroduces a dangling pointer).
-        static std::string s_listen_host;
-        if (ds4_role == "coordinator") {
-            if (ds4_layers.empty() || ds4_listen.empty()) {
-                result->set_success(false);
-                result->set_message("ds4: ds4_role:coordinator requires ds4_layers and ds4_listen");
-                return GStatus::OK;
-            }
-            // host:port for IPv4/hostname; IPv6 literals are unsupported (the
-            // first colon would split inside the address).
-            auto host_port = split_option(ds4_listen); // "host:port" -> {host, port}
-            if (host_port.second.empty()) {
-                result->set_success(false);
-                result->set_message("ds4: ds4_listen must be host:port");
-                return GStatus::OK;
-            }
-            int listen_port = 0;
-            if (!parse_positive_int(host_port.second, &listen_port)) {
-                result->set_success(false);
-                result->set_message("ds4: ds4_listen port must be a positive integer");
-                return GStatus::OK;
-            }
-            ds4_distributed_layers layers = {};
-            if (!parse_layers_spec(ds4_layers, &layers)) {
-                result->set_success(false);
-                result->set_message("ds4: invalid ds4_layers (want START:END or START:output)");
-                return GStatus::OK;
-            }
-            s_listen_host = host_port.first;
-            opt.distributed.role = DS4_DISTRIBUTED_COORDINATOR;
-            opt.distributed.layers = layers;
-            opt.distributed.listen_host = s_listen_host.c_str();
-            opt.distributed.listen_port = listen_port;
-            g_distributed = true;
-        }
-
        int rc = ds4_engine_open(&g_engine, &opt);
        if (rc != 0 || !g_engine) {
            result->set_success(false);
@@ -734,13 +458,10 @@ public:

    GStatus Predict(ServerContext *, const backend::PredictOptions *request,
                   backend::Reply *reply) override {
-        std::unique_lock<std::mutex> lock(g_engine_mu);
+        std::lock_guard<std::mutex> lock(g_engine_mu);
        if (!g_engine || !g_session) {
            return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
        }
-        if (std::string route_err = wait_route_ready(lock); !route_err.empty()) {
-            return GStatus(StatusCode::UNAVAILABLE, route_err);
-        }
        ds4_tokens prompt = {};
        build_prompt(g_engine, request, &prompt);
        int n_predict = request->tokens() > 0 ? request->tokens() : 256;
@@ -833,13 +554,10 @@ public:

    GStatus PredictStream(ServerContext *, const backend::PredictOptions *request,
                         ServerWriter<backend::Reply> *writer) override {
-        std::unique_lock<std::mutex> lock(g_engine_mu);
+        std::lock_guard<std::mutex> lock(g_engine_mu);
        if (!g_engine || !g_session) {
            return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
        }
-        if (std::string route_err = wait_route_ready(lock); !route_err.empty()) {
-            return GStatus(StatusCode::UNAVAILABLE, route_err);
-        }
        ds4_tokens prompt = {};
        build_prompt(g_engine, request, &prompt);
        int n_predict = request->tokens() > 0 ? request->tokens() : 256;
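
Note on the option parsing restored by this patch: the `LoadModel` loop reads `mtp_draft` and `mtp_margin` with `std::stoi` / `std::stof`, which throw `std::invalid_argument` or `std::out_of_range` on malformed input, whereas the removed `parse_positive_int` helper returned `false` instead. A minimal standalone sketch of that non-throwing approach follows. The name `parse_positive_int_sketch` is hypothetical and the explicit `errno` check is an addition for illustration; this is not the backend's actual code.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper (illustration only): parse a strictly positive base-10
// integer without throwing. Returns false and leaves *out untouched on empty
// input, trailing garbage, non-positive values, or overflow.
static bool parse_positive_int_sketch(const std::string &s, int *out) {
    if (s.empty()) return false;
    errno = 0;
    char *end = nullptr;
    long v = std::strtol(s.c_str(), &end, 10);
    if (errno == ERANGE) return false;                     // outside the range of long
    if (end == s.c_str() || *end != '\0') return false;    // no digits, or trailing garbage
    if (v <= 0 || v > INT_MAX) return false;               // non-positive, or does not fit in int
    *out = static_cast<int>(v);
    return true;
}
```

A caller could reject a bad value by reporting an error through `result->set_message(...)` rather than letting an exception escape the gRPC handler, which is what the removed `ds4_route_timeout` branch did.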
--- a/backend/cpp/ds4/package.sh
+++ b/backend/cpp/ds4/package.sh
@@ -5,8 +5,7 @@ REPO_ROOT="${CURDIR}/../../.."

 mkdir -p "$CURDIR/package/lib"
 cp -avf "$CURDIR/grpc-server" "$CURDIR/package/"
-cp -avf "$CURDIR/ds4-worker"  "$CURDIR/package/"
-cp -rfv "$CURDIR/run.sh"      "$CURDIR/package/"
+cp -rfv "$CURDIR/run.sh"     "$CURDIR/package/"

 UNAME_S=$(uname -s)
 if [ "$UNAME_S" = "Darwin" ]; then
--- a/backend/cpp/ds4/worker_main.c
+++ b/backend/cpp/ds4/worker_main.c
@@ -1,126 +0,0 @@
-// ds4-worker: standalone distributed worker for the LocalAI ds4 backend.
-//
-// A ds4 distributed worker owns a slice of the model's transformer layers,
-// dials the coordinator, and serves activations for its slice. It does NOT
-// speak backend.proto - it speaks ds4's own TCP transport via ds4_dist_run().
-// This binary is intentionally minimal (no HTTP/web/kvstore/linenoise): it
-// only needs the engine objects + ds4_distributed.o, which the backend already
-// builds. It is launched by `local-ai worker ds4-distributed`.
-//
-// Usage:
-//   ds4-worker --role worker --model <gguf> --layers 20:output \
-//              --coordinator <host> <port> [--cpu|--cuda|--metal] [-c CTX] [-t N]
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-#include <signal.h>
-#include <limits.h>
-
-#include "ds4.h"
-#include "ds4_distributed.h"
-
-static const char *need_arg(int *i, int argc, char **argv, const char *flag) {
-    if (*i + 1 >= argc) {
-        fprintf(stderr, "ds4-worker: missing value for %s\n", flag);
-        exit(2);
-    }
-    return argv[++(*i)];
-}
-
-static int parse_int_arg(const char *s, const char *flag) {
-    char *end = NULL;
-    long v = strtol(s, &end, 10);
-    if (!s[0] || *end || v <= 0 || v > INT_MAX) {
-        fprintf(stderr, "ds4-worker: invalid value for %s: %s\n", flag, s);
-        exit(2);
-    }
-    return (int)v;
-}
-
-static ds4_backend default_backend(void) {
-#if defined(DS4_NO_GPU)
-    return DS4_BACKEND_CPU;
-#elif defined(__APPLE__)
-    return DS4_BACKEND_METAL;
-#else
-    return DS4_BACKEND_CUDA;
-#endif
-}
-
-int main(int argc, char **argv) {
-    signal(SIGPIPE, SIG_IGN);
-
-    ds4_engine_options opt = {0};
-    opt.backend = default_backend();
-    int ctx_size = 32768;
-
-    for (int i = 1; i < argc; i++) {
-        const char *arg = argv[i];
-        if (!strcmp(arg, "-h") || !strcmp(arg, "--help")) {
-            fprintf(stdout, "ds4-worker: standalone ds4 distributed worker\n");
-            ds4_dist_usage(stdout);
-            fprintf(stdout, "  -m, --model PATH   model GGUF (the worker loads only its --layers slice)\n");
-            fprintf(stdout, "  -c, --ctx N        context size (default 32768)\n");
-            fprintf(stdout, "  -t, --threads N    CPU threads\n");
-            fprintf(stdout, "  --cpu|--cuda|--metal  backend override\n");
-            return 0;
-        }
-
-        char dist_err[256] = {0};
-        ds4_dist_cli_parse_result dist_parse =
-            ds4_dist_parse_cli_arg(arg, &i, argc, argv, &opt.distributed,
-                                   dist_err, sizeof(dist_err));
-        if (dist_parse == DS4_DIST_CLI_ERROR) {
-            fprintf(stderr, "ds4-worker: %s\n",
-                    dist_err[0] ? dist_err : "invalid distributed option");
-            return 2;
-        }
-        if (dist_parse == DS4_DIST_CLI_MATCHED) continue;
-
-        if (!strcmp(arg, "-m") || !strcmp(arg, "--model")) {
-            opt.model_path = need_arg(&i, argc, argv, arg);
-        } else if (!strcmp(arg, "-c") || !strcmp(arg, "--ctx")) {
-            ctx_size = parse_int_arg(need_arg(&i, argc, argv, arg), arg);
-        } else if (!strcmp(arg, "-t") || !strcmp(arg, "--threads")) {
-            opt.n_threads = parse_int_arg(need_arg(&i, argc, argv, arg), arg);
-        } else if (!strcmp(arg, "--cpu")) {
-            opt.backend = DS4_BACKEND_CPU;
-        } else if (!strcmp(arg, "--cuda")) {
-            opt.backend = DS4_BACKEND_CUDA;
-        } else if (!strcmp(arg, "--metal")) {
-            opt.backend = DS4_BACKEND_METAL;
-        } else {
-            fprintf(stderr, "ds4-worker: unknown option: %s\n", arg);
-            return 2;
-        }
-    }
-
-    if (opt.distributed.role != DS4_DISTRIBUTED_WORKER) {
-        fprintf(stderr, "ds4-worker: --role worker is required\n");
-        return 2;
-    }
-    if (!opt.model_path) {
-        fprintf(stderr, "ds4-worker: --model is required\n");
-        return 2;
-    }
-
-    char prep_err[256] = {0};
-    if (ds4_dist_prepare_engine_options(&opt.distributed, &opt,
-                                        prep_err, sizeof(prep_err)) != 0) {
-        fprintf(stderr, "ds4-worker: %s\n", prep_err);
-        return 2;
-    }
-
-    ds4_engine *engine = NULL;
-    if (ds4_engine_open(&engine, &opt) != 0 || !engine) {
-        fprintf(stderr, "ds4-worker: failed to open engine\n");
-        return 1;
-    }
-
-    ds4_dist_generation_options gen = {0};
-    gen.ctx_size = ctx_size;
-    int rc = ds4_dist_run(engine, &opt.distributed, &gen);
-    ds4_engine_close(engine);
-    return rc;
-}
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=b84902d2ad27c34f989f23947200c4b91b1568fd
+IK_LLAMA_VERSION?=b3d39cff8bffbd67296d6badd4076a1486a0715c
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/ik-llama-cpp/run.sh
+++ b/backend/cpp/ik-llama-cpp/run.sh
@@ -2,7 +2,7 @@
 set -ex

 # Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath "$0")")
+CURDIR=$(dirname "$(realpath $0)")

 cd /

@@ -13,28 +13,28 @@ grep -e "flags" /proc/cpuinfo | head -1
 # ik_llama.cpp requires AVX2 — default to avx2 binary
 BINARY=ik-llama-cpp-avx2

-if [ -e "$CURDIR"/ik-llama-cpp-fallback ] && ! grep -q -e "\savx2\s" /proc/cpuinfo ; then
+if [ -e $CURDIR/ik-llama-cpp-fallback ] && ! grep -q -e "\savx2\s" /proc/cpuinfo ; then
 	echo "CPU:    AVX2   NOT found, using fallback"
 	BINARY=ik-llama-cpp-fallback
 fi

 # Extend ld library path with the dir where this script is located/lib
 if [ "$(uname)" == "Darwin" ]; then
-	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
-	#export DYLD_FALLBACK_LIBRARY_PATH="$CURDIR"/lib:$DYLD_FALLBACK_LIBRARY_PATH
+	export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
+	#export DYLD_FALLBACK_LIBRARY_PATH=$CURDIR/lib:$DYLD_FALLBACK_LIBRARY_PATH
 else
-	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
+	export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 fi

 # If there is a lib/ld.so, use it
-if [ -f "$CURDIR"/lib/ld.so ]; then
+if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using binary: $BINARY"
-	exec "$CURDIR"/lib/ld.so "$CURDIR"/$BINARY "$@"
+	exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
 fi

 echo "Using binary: $BINARY"
-exec "$CURDIR"/$BINARY "$@"
+exec $CURDIR/$BINARY "$@"

 # We should never reach this point, however just in case we do, run fallback
-exec "$CURDIR"/ik-llama-cpp-fallback "$@"
+exec $CURDIR/ik-llama-cpp-fallback "$@"
--- a/backend/cpp/llama-cpp/CMakeLists.txt
+++ b/backend/cpp/llama-cpp/CMakeLists.txt
@@ -50,13 +50,8 @@ add_custom_command(
        "${hw_proto}"
      DEPENDS "${hw_proto}")

-# hw_grpc_proto: force STATIC. Under the CPU_ALL_VARIANTS build BUILD_SHARED_LIBS=ON
-# (ggml/llama become shared), which would otherwise make this glue library a DSO. As a
-# DSO it references the hidden-visibility symbols in the static libprotobuf.a, which the
-# linker cannot satisfy ("hidden symbol ... in libprotobuf.a is referenced by DSO").
-# Keeping it STATIC links protobuf/gRPC directly into the grpc-server executable while
-# only ggml/llama stay shared. No effect on the static variants (already BUILD_SHARED_LIBS=OFF).
-add_library(hw_grpc_proto STATIC
+# hw_grpc_proto
+add_library(hw_grpc_proto
  ${hw_grpc_srcs}
  ${hw_grpc_hdrs}
  ${hw_proto_srcs}
@@ -87,18 +82,3 @@ target_compile_features(${TARGET} PRIVATE cxx_std_11)
 if(TARGET BUILD_INFO)
  add_dependencies(${TARGET} BUILD_INFO)
 endif()
-
-# Unit test for the message-content normalization helper (message_content.h).
-# Off by default so the normal backend build is untouched; enable with
-# -DLLAMA_GRPC_BUILD_TESTS=ON and run via ctest. It reuses llama.cpp's vendored
-# <nlohmann/json.hpp> (propagated by the common helpers library) so it has no
-# extra dependency beyond what the backend already builds against.
-option(LLAMA_GRPC_BUILD_TESTS "Build grpc-server unit tests" OFF)
-if(LLAMA_GRPC_BUILD_TESTS)
-    enable_testing()
-    add_executable(message_content_test message_content_test.cpp message_content.h)
-    target_include_directories(message_content_test PRIVATE ${CMAKE_CURRENT_SOURCE_DIR})
-    target_link_libraries(message_content_test PRIVATE ${_LLAMA_COMMON_TARGET})
-    target_compile_features(message_content_test PRIVATE cxx_std_17)
-    add_test(NAME message_content_test COMMAND message_content_test)
-endif()
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1
+LLAMA_VERSION?=bb28c1fe246b72276ee1d00ce89306be7b865766
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=
@@ -10,16 +10,8 @@ TARGET?=--target grpc-server
 JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
 ARCH?=$(shell uname -m)

-# Shared libs default to OFF: we link static gRPC and the avx/avx2/avx512/fallback
-# variants are fully static. The CPU_ALL_VARIANTS build flips SHARED_LIBS=ON (ggml/llama
-# become shared so the dynamic CPU backends work; gRPC stays static via its imported
-# targets). SHARED_LIBS is a make variable, not an appended -D, so it survives the
-# recursive sub-make into the VARIANT build dir (which re-parses this Makefile) instead
-# of being re-clobbered by a second -DBUILD_SHARED_LIBS=OFF. EXTRA_CMAKE_ARGS is the hook
-# the CPU_ALL_VARIANTS target uses to inject -DGGML_BACKEND_DL/-DGGML_CPU_ALL_VARIANTS.
-SHARED_LIBS?=OFF
-EXTRA_CMAKE_ARGS?=
-CMAKE_ARGS+=-DBUILD_SHARED_LIBS=$(SHARED_LIBS) -DLLAMA_CURL=OFF $(EXTRA_CMAKE_ARGS)
+# Disable shared libs: we link against static gRPC and cannot mix shared and static libraries
+CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=OFF

 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
 ifeq ($(NATIVE),false)
@@ -128,30 +120,6 @@ llama-cpp-fallback: llama.cpp
 	CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) VARIANT="llama-cpp-fallback-build" build-llama-cpp-grpc-server
 	cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-fallback-build/grpc-server llama-cpp-fallback

-# Single-build CPU backend using ggml's CPU_ALL_VARIANTS. Produces ONE grpc-server
-# plus a set of dlopen-able libggml-cpu-*.so (sandybridge/haswell/skylakex/...) that
-# ggml's backend registry selects from at runtime by probing host CPU features.
-# Replaces the avx/avx2/avx512/fallback multi-binary build on x86.
-#
-# CPU_ALL_VARIANTS requires GGML_BACKEND_DL, which requires BUILD_SHARED_LIBS=ON, so we
-# pass SHARED_LIBS=ON and the DL flags as make variables (NOT pre-expanded into the
-# CMAKE_ARGS env string): command-line make variables propagate through every recursive
-# sub-make, so the deepest VARIANT-dir build computes BUILD_SHARED_LIBS=ON consistently.
-# Only ggml/llama go shared - gRPC is found via its static imported targets, so the
-# grpc-server binary keeps static gRPC and only dynamically links ggml.
-#
-# TARGET adds "ggml": the per-microarch backends are runtime-dlopened, not link deps of
-# grpc-server, so they only build because each is an add_dependencies() of the ggml target.
-llama-cpp-cpu-all: llama.cpp
-	cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build
-	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build purge
-	$(info ${GREEN}I llama-cpp build info:cpu-all-variants${RESET})
-	$(MAKE) SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" VARIANT="llama-cpp-cpu-all-build" build-llama-cpp-grpc-server
-	cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build/grpc-server llama-cpp-cpu-all
-	rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs
-	find $(CURRENT_MAKEFILE_DIR)/../llama-cpp-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \;
-	@echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/
-
 llama-cpp-grpc: llama.cpp
 	cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build
 	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-grpc-build purge
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
--- a/backend/cpp/llama-cpp/message_content.h
+++ b/backend/cpp/llama-cpp/message_content.h
@@ -1,192 +0,0 @@
-#pragma once
-
-#include <string>
-#include <vector>
-
-#include <nlohmann/json.hpp>
-
-namespace llama_grpc {
-
-// Normalizes a proto message's content string into the JSON value used when
-// reconstructing OpenAI-format messages for the tokenizer (jinja) template.
-//
-// Shared by the streaming (PredictStream) and non-streaming (Predict) message
-// reconstruction paths so the two cannot drift.
-//
-// LocalAI's Go layer (schema.Messages.ToProto) always sends content as a plain
-// text string; multimodal media travels in separate proto fields, never inside
-// content. So user/system/developer content is *only ever* opaque text and must
-// NOT be JSON-sniffed: a prompt that merely looks like JSON (e.g. an ingredient
-// list ["1/4 cup sugar", ...]) would otherwise be reinterpreted as structured
-// content parts and rejected by oaicompat_chat_params_parse with
-// "unsupported content[].type" (https://github.com/mudler/LocalAI/issues/10524).
-// (developer is OpenAI's modern system alias - same "human-authored text" nature.)
-//
-// For assistant/tool messages we still collapse a literal JSON null/object
-// (tool-call bookkeeping) to a string, but we never turn a plain string into an
-// array/scalar. The array defense is therefore role-independent (arrays/scalars
-// fall through for every role); the role gate only governs the null/object case.
-inline nlohmann::ordered_json normalize_message_content(const std::string& role,
-                                                        const std::string& content) {
-    nlohmann::ordered_json content_val = content;
-    if (role != "user" && role != "system" && role != "developer") {
-        try {
-            nlohmann::ordered_json parsed = nlohmann::ordered_json::parse(content);
-            if (parsed.is_null()) {
-                content_val = "";
-            } else if (parsed.is_object()) {
-                content_val = parsed.dump();
-            }
-            // arrays / scalars: keep the original plain-text string as-is
-        } catch (const nlohmann::ordered_json::parse_error&) {
-            // Not JSON, already the plain string
-        }
-    }
-    return content_val;
-}
-
-// Final safety pass applied to each reconstructed OpenAI message right before it
-// is handed to oaicompat_chat_params_parse (jinja templating). Jinja templates
-// assume content is a string: a literal null breaks slicing such as
-// message.content[:N] (#7324), and a tool message with array content is rejected
-// (#7528). A multimodal user message legitimately carries a typed-part array
-// ({type:text}, {type:image_url}, ...), which must be left intact. Shared by the
-// streaming and non-streaming paths so this invariant cannot drift between them.
-inline void normalize_template_message(nlohmann::ordered_json& msg) {
-    if (!msg.contains("content")) {
-        msg["content"] = ""; // templates expect the field to exist
-        return;
-    }
-    nlohmann::ordered_json& content = msg["content"];
-    const std::string role = (msg.contains("role") && msg["role"].is_string())
-                                 ? msg["role"].get<std::string>()
-                                 : std::string();
-    if (content.is_null()) {
-        content = ""; // #7324: null would crash content[:N] slicing
-    } else if (role == "tool" && content.is_array()) {
-        content = content.dump(); // #7528: tool messages must have string content
-    } else if (!content.is_string() && !content.is_array()) {
-        if (content.is_object()) {
-            content = content.dump(); // tool-call bookkeeping object -> string
-        } else {
-            content = ""; // other scalar (number/bool) -> empty
-        }
-    }
-    // string, or a non-tool (multimodal) typed-part array: leave untouched
-}
-
-// One proto message's data, flattened to plain types so the reconstruction logic
-// can be shared and unit-tested without protobuf. The streaming and non-streaming
-// predict paths both populate this from proto::Message + the request's media.
-struct ReconstructedMessageInput {
-    std::string role;
-    std::string content;            // proto.Message.content (always a plain string)
-    std::string name;
-    std::string tool_call_id;
-    std::string reasoning_content;
-    std::string tool_calls;         // tool_calls as a JSON string, or empty
-    bool is_last_user_msg = false;  // attach request media to this message
-    std::vector<std::string> images; // base64 (jpeg)
-    std::vector<std::string> audios; // base64 (wav)
-    std::vector<std::string> videos; // base64
-};
-
-// Appends the request's media as OpenAI typed content parts. Imperative (not
-// brace-init) to avoid nlohmann's object-vs-array initializer-list ambiguity.
-inline void append_media_parts(nlohmann::ordered_json& content_array,
-                               const std::vector<std::string>& images,
-                               const std::vector<std::string>& audios,
-                               const std::vector<std::string>& videos) {
-    for (const auto& img : images) {
-        nlohmann::ordered_json image_chunk;
-        image_chunk["type"] = "image_url";
-        nlohmann::ordered_json image_url;
-        image_url["url"] = "data:image/jpeg;base64," + img;
-        image_chunk["image_url"] = image_url;
-        content_array.push_back(image_chunk);
-    }
-    for (const auto& aud : audios) {
-        nlohmann::ordered_json audio_chunk;
-        audio_chunk["type"] = "input_audio";
-        nlohmann::ordered_json input_audio;
-        input_audio["data"] = aud;
-        input_audio["format"] = "wav"; // default; could be made configurable
-        audio_chunk["input_audio"] = input_audio;
-        content_array.push_back(audio_chunk);
-    }
-    for (const auto& vid : videos) {
-        nlohmann::ordered_json video_chunk;
-        video_chunk["type"] = "input_video";
-        nlohmann::ordered_json input_video;
-        input_video["data"] = vid;
-        video_chunk["input_video"] = input_video;
-        content_array.push_back(video_chunk);
-    }
-}
-
-// Reconstructs a single OpenAI-format message (the object fed to
-// oaicompat_chat_params_parse) from a proto message. Shared by PredictStream and
-// Predict so the content/multimodal/tool_calls handling cannot drift between the
-// two stream modes (it previously lived as two ~150-line copies with a redundant
-// Predict-only tool_calls->" " branch). Guarantees content is always a string or
-// a typed-part array, never null/missing.
-inline nlohmann::ordered_json build_reconstructed_message(const ReconstructedMessageInput& in) {
-    nlohmann::ordered_json msg_json;
-    msg_json["role"] = in.role;
-    const bool has_media = !in.images.empty() || !in.audios.empty() || !in.videos.empty();
-
-    if (!in.content.empty()) {
-        nlohmann::ordered_json content_val = normalize_message_content(in.role, in.content);
-        if (content_val.is_string() && in.is_last_user_msg && has_media) {
-            // Last user message + media: build a typed-part array (text first).
-            nlohmann::ordered_json content_array = nlohmann::ordered_json::array();
-            nlohmann::ordered_json text_part;
-            text_part["type"] = "text";
-            text_part["text"] = content_val.get<std::string>();
-            content_array.push_back(text_part);
-            append_media_parts(content_array, in.images, in.audios, in.videos);
-            msg_json["content"] = content_array;
-        } else if (content_val.is_null()) {
-            msg_json["content"] = "";
-        } else {
-            msg_json["content"] = content_val;
-        }
-    } else if (in.is_last_user_msg && has_media) {
-        // No text but media on the last user message: media-only typed array.
-        nlohmann::ordered_json content_array = nlohmann::ordered_json::array();
-        append_media_parts(content_array, in.images, in.audios, in.videos);
-        msg_json["content"] = content_array;
-    } else {
-        // Empty content (any role, incl. tool/assistant): templates need a string.
-        msg_json["content"] = "";
-    }
-
-    if (!in.name.empty()) {
-        msg_json["name"] = in.name;
-    }
-    if (!in.tool_call_id.empty()) {
-        msg_json["tool_call_id"] = in.tool_call_id;
-    }
-    if (!in.reasoning_content.empty()) {
-        msg_json["reasoning_content"] = in.reasoning_content;
-    }
-    if (!in.tool_calls.empty()) {
-        try {
-            nlohmann::ordered_json tool_calls = nlohmann::ordered_json::parse(in.tool_calls);
-            msg_json["tool_calls"] = tool_calls;
-            // tool_calls + empty/blank content: use " " not "", because llama.cpp's
-            // common_chat_msgs_to_json_oaicompat turns "" into null, which breaks
-            // templates that slice message.content[:tool_start_length] (#7324).
-            if (!msg_json.contains("content") ||
-                (msg_json["content"].is_string() && msg_json["content"].get<std::string>().empty())) {
-                msg_json["content"] = " ";
-            }
-        } catch (const nlohmann::ordered_json::parse_error&) {
-            // Malformed tool_calls JSON: leave content as-is (prior behavior).
-        }
-    }
-
-    return msg_json;
-}
-
-}  // namespace llama_grpc
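Review note: the two normalizers removed above reduce to small decision tables. The sketch below restates them with the standard library only, so the branching can be read without nlohmann/json. The `Kind` and `Action` enums and both function names are hypothetical stand-ins introduced for illustration and are not part of the removed header. `content_action` follows `normalize_message_content`, where the content is always a plain string and `Kind` is what that string would parse to. `template_action` follows `normalize_template_message`, where `Kind` is the actual JSON type stored in `msg["content"]`. The missing-field case (which the removed code also maps to `""`) is not modelled.

```cpp
#include <string>

// Hypothetical stand-ins for the JSON value kinds the removed helpers branch on.
// NotJson is only meaningful for content_action (the string failed to parse).
enum class Kind { Null, String, Array, Object, Scalar, NotJson };
enum class Action { KeepOriginal, ToEmpty, Dump };

// Decision table for normalize_message_content (assumed simplification).
inline Action content_action(const std::string& role, Kind parsed) {
    const bool human_authored =
        role == "user" || role == "system" || role == "developer";
    if (human_authored) return Action::KeepOriginal;  // content is never parsed
    if (parsed == Kind::Null) return Action::ToEmpty; // literal null -> ""
    if (parsed == Kind::Object) return Action::Dump;  // bookkeeping object -> string
    return Action::KeepOriginal;                      // arrays, scalars, non-JSON text
}

// Decision table for normalize_template_message (assumed simplification).
inline Action template_action(const std::string& role, Kind kind) {
    if (kind == Kind::Null) return Action::ToEmpty;                 // #7324
    if (role == "tool" && kind == Kind::Array) return Action::Dump; // #7528
    if (kind == Kind::Object) return Action::Dump;
    if (kind == Kind::Scalar) return Action::ToEmpty;               // number/bool
    return Action::KeepOriginal;  // string, or a non-tool typed-part array
}
```

Read this way, the array defense is visibly role-independent in `content_action`, while `template_action` stringifies an array only for the `tool` role.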
--- a/backend/cpp/llama-cpp/message_content_test.cpp
+++ b/backend/cpp/llama-cpp/message_content_test.cpp
@@ -1,234 +0,0 @@
-// Unit tests for the shared message-reconstruction helpers (message_content.h).
-//
-// Build & run standalone (nlohmann/json single header on the include path):
-//   g++ -std=c++17 -I<dir-with-nlohmann> message_content_test.cpp -o t && ./t
-// or via CMake: -DLLAMA_GRPC_BUILD_TESTS=ON then ctest.
-//
-// Regression coverage for:
-//   #10524 - a user/system prompt that is itself a JSON-array string must stay
-//            plain text, never be reinterpreted as OpenAI structured parts.
-//   #7324  - assistant/tool null content -> "" (templates slice content[:N]);
-//            assistant+tool_calls+empty content -> " " (not "", which becomes null).
-//   #7528  - tool message array content must reach the template as a string.
-//   multimodal - last user message text + media -> typed-part array, media kept.
-
-#include <cassert>
-#include <iostream>
-#include <string>
-
-#include "message_content.h"
-
-using nlohmann::ordered_json;
-using llama_grpc::normalize_message_content;
-using llama_grpc::normalize_template_message;
-using llama_grpc::build_reconstructed_message;
-using llama_grpc::ReconstructedMessageInput;
-
-static int failures = 0;
-
-static void check(bool ok, const std::string& name, const std::string& detail = "") {
-    if (!ok) {
-        std::cerr << "FAIL " << name << (detail.empty() ? "" : ": " + detail) << "\n";
-        failures++;
-    }
-}
-
-// ---- normalize_message_content -------------------------------------------
-
-static void expect_norm_string(const char* name, const std::string& role,
-                               const std::string& content, const std::string& want) {
-    auto got = normalize_message_content(role, content);
-    if (!got.is_string()) {
-        check(false, name, "expected a JSON string, got " +
-                               std::string(got.is_array() ? "array" : got.is_object() ? "object" : "other") +
-                               " (" + got.dump() + ")");
-        return;
-    }
-    check(got.get<std::string>() == want, name, "expected \"" + want + "\", got \"" + got.get<std::string>() + "\"");
-}
-
-static void test_normalize() {
-    const std::string ingredients = R"(["1/4 cup brown sugar, packed","1 pound ground beef"])";
-
-    // #10524 - JSON-array text must stay a string. Role-INDEPENDENT array defense.
-    for (const char* role : {"user", "system", "developer", "function", "assistant", "tool"}) {
-        expect_norm_string((std::string("json_array_stays_text:") + role).c_str(), role, ingredients, ingredients);
-    }
-
-    // #10524 - user/system/developer JSON-object text stays verbatim (NOT re-dumped).
-    expect_norm_string("user_json_object_verbatim", "user", R"({"a":1})", R"({"a":1})");
-    expect_norm_string("system_json_object_verbatim", "system", R"({"a":1})", R"({"a":1})");
-    expect_norm_string("developer_json_object_verbatim", "developer", R"({"a":1})", R"({"a":1})");
-
-    // Plain text unchanged for all roles.
-    expect_norm_string("user_plain_text", "user", "hello world", "hello world");
-    expect_norm_string("assistant_non_json_text_kept", "assistant", "hi [unclosed", "hi [unclosed");
-
-    // #7324 boundary - user/system/developer literal "null" preserved (never parsed).
-    expect_norm_string("user_literal_null_stays", "user", "null", "null");
-    expect_norm_string("system_literal_null_stays", "system", "null", "null");
-    expect_norm_string("developer_literal_null_stays", "developer", "null", "null");
-
-    // #7324 - assistant/tool literal null collapses to empty string.
-    expect_norm_string("assistant_null_to_empty", "assistant", "null", "");
-    expect_norm_string("tool_null_to_empty", "tool", "null", "");
-
-    // #7324/#7528 - assistant/tool object bookkeeping stringified (stays a string).
-    check(normalize_message_content("assistant", R"({"tool":"x"})").is_string(), "assistant_object_stringified");
-    check(normalize_message_content("tool", R"({"error":"boom"})").is_string(), "tool_object_stringified");
-
-    // #10524-family - a bare scalar that parses as a JSON number stays the string.
-    expect_norm_string("assistant_scalar_number_stays_string", "assistant", "42", "42");
-
-    // baseline - empty content stays empty.
-    expect_norm_string("user_empty_stays_empty", "user", "", "");
-}
-
-// ---- normalize_template_message (BEFORE TEMPLATE sanitizer) ---------------
-
-static void test_template_sanitizer() {
-    // #7528 - a tool message with an ACTUAL array becomes a string.
-    {
-        ordered_json msg = {{"role", "tool"}, {"content", ordered_json::array({{{"type", "text"}, {"text", "r"}}})}};
-        normalize_template_message(msg);
-        check(msg["content"].is_string(), "before_template_tool_array_to_string", "got " + msg["content"].dump());
-    }
-    // #7324 - null content -> "" for any role.
-    {
-        ordered_json msg = {{"role", "assistant"}, {"content", nullptr}};
-        normalize_template_message(msg);
-        check(msg["content"].is_string() && msg["content"] == "", "before_template_null_to_empty");
-    }
-    // object content -> dumped string (would otherwise throw at the template).
-    {
-        ordered_json msg = {{"role", "assistant"}, {"content", {{"x", 1}}}};
-        normalize_template_message(msg);
-        check(msg["content"].is_string(), "before_template_object_to_string", "got " + msg["content"].dump());
-    }
-    // missing content field -> "".
-    {
-        ordered_json msg = {{"role", "user"}};
-        normalize_template_message(msg);
-        check(msg.contains("content") && msg["content"] == "", "before_template_missing_to_empty");
-    }
-    // multimodal: a well-typed user array must be left UNTOUCHED (role!=tool).
-    {
-        ordered_json parts = ordered_json::array();
-        parts.push_back({{"type", "text"}, {"text", "x"}});
-        ordered_json img; img["type"] = "image_url"; img["image_url"] = {{"url", "data:..."}};
-        parts.push_back(img);
-        ordered_json msg = {{"role", "user"}, {"content", parts}};
-        normalize_template_message(msg);
-        check(msg["content"].is_array() && msg["content"].size() == 2, "before_template_user_typed_array_preserved",
-              "got " + msg["content"].dump());
-    }
-    // a plain string is left untouched.
-    {
-        ordered_json msg = {{"role", "user"}, {"content", "hello"}};
-        normalize_template_message(msg);
-        check(msg["content"] == "hello", "before_template_string_untouched");
-    }
-}
-
-// ---- build_reconstructed_message ----------------------------------------
-
-static void test_reconstruction() {
-    const std::string ingredients = R"(["1/4 cup brown sugar","1 pound ground beef"])";
-
-    // #10524 end-state - user JSON-array text, no media -> string content.
-    {
-        ReconstructedMessageInput in;
-        in.role = "user"; in.content = ingredients;
-        auto m = build_reconstructed_message(in);
-        check(m["content"].is_string() && m["content"] == ingredients, "recon_user_json_array_string",
-              "got " + m["content"].dump());
-    }
-    // multimodal - user text + one image on last user msg -> typed array, image kept.
-    {
-        ReconstructedMessageInput in;
-        in.role = "user"; in.content = ingredients; in.is_last_user_msg = true;
-        in.images.push_back("BASE64IMG");
-        auto m = build_reconstructed_message(in);
-        check(m["content"].is_array() && m["content"].size() == 2, "recon_multimodal_text_plus_image",
-              "got " + m["content"].dump());
-        check(m["content"][0]["type"] == "text" && m["content"][0]["text"] == ingredients, "recon_multimodal_text_first");
-        check(m["content"][1]["type"] == "image_url", "recon_multimodal_image_kept");
-    }
-    // multimodal media-only - empty text + image on last user msg.
-    {
-        ReconstructedMessageInput in;
-        in.role = "user"; in.content = ""; in.is_last_user_msg = true;
-        in.images.push_back("BASE64IMG");
-        auto m = build_reconstructed_message(in);
-        check(m["content"].is_array() && m["content"].size() == 1 && m["content"][0]["type"] == "image_url",
-              "recon_media_only", "got " + m["content"].dump());
-    }
-    // #7528 - tool array-string content stays a string.
-    {
-        ReconstructedMessageInput in;
-        in.role = "tool"; in.content = R"(["a","b"])"; in.tool_call_id = "call_1";
-        auto m = build_reconstructed_message(in);
-        check(m["content"].is_string() && m["content"] == R"(["a","b"])", "recon_tool_array_string",
-              "got " + m["content"].dump());
-        check(m["tool_call_id"] == "call_1", "recon_tool_call_id_set");
-    }
-    // tool empty content -> "".
-    {
-        ReconstructedMessageInput in;
-        in.role = "tool"; in.content = "";
-        auto m = build_reconstructed_message(in);
-        check(m["content"].is_string() && m["content"] == "", "recon_tool_empty_to_string");
-    }
-    // #7324 - assistant + tool_calls + empty content -> " " (single space, not "").
-    {
-        ReconstructedMessageInput in;
-        in.role = "assistant"; in.content = "";
-        in.tool_calls = R"([{"id":"c1","type":"function","function":{"name":"f","arguments":"{}"}}])";
-        auto m = build_reconstructed_message(in);
-        check(m["content"].is_string() && m["content"] == " ", "recon_toolcalls_empty_content_space",
-              "got " + m["content"].dump());
-        check(m["tool_calls"].is_array() && m["tool_calls"].size() == 1, "recon_toolcalls_parsed");
-    }
-    // assistant + tool_calls + real content keeps the content.
-    {
-        ReconstructedMessageInput in;
-        in.role = "assistant"; in.content = "I'll call f";
-        in.tool_calls = R"([{"id":"c1","type":"function","function":{"name":"f","arguments":"{}"}}])";
-        auto m = build_reconstructed_message(in);
-        check(m["content"] == "I'll call f", "recon_toolcalls_with_content_kept");
-    }
-    // assistant null content -> "".
-    {
-        ReconstructedMessageInput in;
-        in.role = "assistant"; in.content = "null";
-        auto m = build_reconstructed_message(in);
-        check(m["content"] == "", "recon_assistant_null_to_empty");
-    }
-    // malformed tool_calls JSON must not throw; content preserved.
-    {
-        ReconstructedMessageInput in;
-        in.role = "assistant"; in.content = "hi"; in.tool_calls = "{not json";
-        auto m = build_reconstructed_message(in);
-        check(m["content"] == "hi" && !m.contains("tool_calls"), "recon_malformed_toolcalls_safe");
-    }
-    // optional fields: name + reasoning carried through.
-    {
-        ReconstructedMessageInput in;
-        in.role = "tool"; in.content = "result"; in.name = "get_weather"; in.reasoning_content = "thinking";
-        auto m = build_reconstructed_message(in);
-        check(m["name"] == "get_weather" && m["reasoning_content"] == "thinking", "recon_optional_fields");
-    }
-}
-
-int main() {
-    test_normalize();
-    test_template_sanitizer();
-    test_reconstruction();
-
-    if (failures == 0) {
-        std::cout << "OK: all message_content tests passed\n";
-        return 0;
-    }
-    std::cerr << failures << " test(s) failed\n";
-    return 1;
-}
--- a/backend/cpp/llama-cpp/package.sh
+++ b/backend/cpp/llama-cpp/package.sh
@@ -14,22 +14,6 @@ mkdir -p $CURDIR/package/lib
 cp -avrf $CURDIR/llama-cpp-* $CURDIR/package/
 cp -rfv $CURDIR/run.sh $CURDIR/package/

-# Bundle the ggml shared backends produced by the CPU_ALL_VARIANTS build (libggml-base.so,
-# libggml.so, libllama.so and the per-microarch libggml-cpu-*.so), all into package/lib.
-#
-# Two distinct resolution mechanisms both land here:
-#   - NEEDED deps (libggml-base/libggml/libllama): resolved by the dynamic linker via the
-#     LD_LIBRARY_PATH=$CURDIR/lib that run.sh exports.
-#   - The per-microarch libggml-cpu-*.so are NOT linked; ggml *discovers* them at runtime by
-#     scanning the executable's own directory (readlink /proc/self/exe). run.sh launches via
-#     the bundled $CURDIR/lib/ld.so, so /proc/self/exe -> .../lib/ld.so and ggml scans lib/.
-#     That is why the variants must sit in lib/ (next to ld.so), not just on the link path.
-# No-op on builds (arm64/darwin) that don't produce the all-variants set.
-if [ -d "$CURDIR/ggml-shared-libs" ]; then
-    echo "Bundling ggml shared backends (CPU_ALL_VARIANTS)..."
-    cp -avf $CURDIR/ggml-shared-libs/*.so* $CURDIR/package/lib/
-fi
-
 # Detect architecture and copy appropriate libraries
 if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    # x86_64 architecture
--- a/backend/cpp/llama-cpp/prepare.sh
+++ b/backend/cpp/llama-cpp/prepare.sh
@@ -18,10 +18,6 @@ done

 cp -r CMakeLists.txt llama.cpp/tools/grpc-server/
 cp -r grpc-server.cpp llama.cpp/tools/grpc-server/
-# Shared message-reconstruction helpers (included by grpc-server.cpp) and their
-# unit test (compiled only when -DLLAMA_GRPC_BUILD_TESTS=ON).
-cp -r message_content.h llama.cpp/tools/grpc-server/
-cp -r message_content_test.cpp llama.cpp/tools/grpc-server/
 cp -rfv llama.cpp/vendor/nlohmann/json.hpp llama.cpp/tools/grpc-server/
 cp -rfv llama.cpp/vendor/cpp-httplib/httplib.h llama.cpp/tools/grpc-server/

--- a/backend/cpp/llama-cpp/run.sh
+++ b/backend/cpp/llama-cpp/run.sh
@@ -2,7 +2,7 @@
 set -ex

 # Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath "$0")")
+CURDIR=$(dirname "$(realpath $0)")

 cd /

@@ -12,41 +12,55 @@ grep -e "flags" /proc/cpuinfo | head -1

 BINARY=llama-cpp-fallback

-# CPU images (x86, arm64, darwin) ship a single llama-cpp-cpu-all built with ggml
-# CPU_ALL_VARIANTS: ggml's backend registry dlopens the best libggml-cpu-*.so for this
-# host, so no shell-side AVX probing. GPU images (cublas/sycl/vulkan/hipblas) ship only
-# llama-cpp-fallback (the accelerator does the compute), so fall back to it when absent.
-if [ -e "$CURDIR"/llama-cpp-cpu-all ]; then
-	BINARY=llama-cpp-cpu-all
+if grep -q -e "\savx\s" /proc/cpuinfo ; then
+	echo "CPU:    AVX    found OK"
+	if [ -e $CURDIR/llama-cpp-avx ]; then
+		BINARY=llama-cpp-avx
+	fi
+fi
+
+if grep -q -e "\savx2\s" /proc/cpuinfo ; then
+	echo "CPU:    AVX2   found OK"
+	if [ -e $CURDIR/llama-cpp-avx2 ]; then
+		BINARY=llama-cpp-avx2
+	fi
+fi
+
+# Check avx 512
+if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
+	echo "CPU:    AVX512F found OK"
+	if [ -e $CURDIR/llama-cpp-avx512 ]; then
+		BINARY=llama-cpp-avx512
+	fi
 fi

 if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then
-	if [ -e "$CURDIR"/llama-cpp-grpc ]; then
+	if [ -e $CURDIR/llama-cpp-grpc ]; then
 		BINARY=llama-cpp-grpc
 	fi
 fi
 
 # Extend ld library path with the dir where this script is located/lib
 if [ "$(uname)" == "Darwin" ]; then
-	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
-	#export DYLD_FALLBACK_LIBRARY_PATH="$CURDIR"/lib:$DYLD_FALLBACK_LIBRARY_PATH
+	export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
+	#export DYLD_FALLBACK_LIBRARY_PATH=$CURDIR/lib:$DYLD_FALLBACK_LIBRARY_PATH
 else
-	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
+	export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 	# Tell rocBLAS where to find TensileLibrary data (GPU kernel tuning files)
 	if [ -d "$CURDIR/lib/rocblas/library" ]; then
-		export ROCBLAS_TENSILE_LIBPATH="$CURDIR"/lib/rocblas/library
+		export ROCBLAS_TENSILE_LIBPATH=$CURDIR/lib/rocblas/library
 	fi
 fi

 # If there is a lib/ld.so, use it
-if [ -f "$CURDIR"/lib/ld.so ]; then
+if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using binary: $BINARY"
-	exec "$CURDIR"/lib/ld.so "$CURDIR"/$BINARY "$@"
+	exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
 fi

 echo "Using binary: $BINARY"
-exec "$CURDIR"/$BINARY "$@"
+exec $CURDIR/$BINARY "$@"

 # We should never reach this point, however just in case we do, run fallback
-exec "$CURDIR"/llama-cpp-fallback "$@"
+exec $CURDIR/llama-cpp-fallback "$@"
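Review note: the restored shell probing in `run.sh` picks the last matching variant in the order AVX, AVX2, AVX512F, and only when the corresponding binary exists. `LLAMACPP_GRPC_SERVERS` then overrides that choice if `llama-cpp-grpc` is present. The sketch below models that selection order in C++ so it can be checked in isolation. `pick_binary` and its parameters are hypothetical names. Treating the CPU flags as a set of tokens is an assumption that approximates the `grep "\savx\s"` whitespace-delimited match and is not an exact reimplementation of it.

```cpp
#include <set>
#include <string>
#include <utility>

// cpu_flags: tokens from the "flags" line of /proc/cpuinfo (assumed pre-split).
// present:   binary names that exist next to run.sh.
// grpc_servers_set: whether LLAMACPP_GRPC_SERVERS is non-empty.
inline std::string pick_binary(const std::set<std::string>& cpu_flags,
                               const std::set<std::string>& present,
                               bool grpc_servers_set) {
    std::string binary = "llama-cpp-fallback";
    // Probed in this order; a later match overwrites an earlier one.
    const std::pair<const char*, const char*> probes[] = {
        {"avx", "llama-cpp-avx"},
        {"avx2", "llama-cpp-avx2"},
        {"avx512f", "llama-cpp-avx512"}};
    for (const auto& [flag, candidate] : probes) {
        if (cpu_flags.count(flag) && present.count(candidate)) {
            binary = candidate;
        }
    }
    // The gRPC variant takes precedence over any CPU variant when requested.
    if (grpc_servers_set && present.count("llama-cpp-grpc")) {
        binary = "llama-cpp-grpc";
    }
    return binary;
}
```

One consequence of this order is that a host reporting AVX512F still runs `llama-cpp-avx` if that is the only variant shipped, and it runs `llama-cpp-fallback` if none are.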
--- a/backend/cpp/privacy-filter/.gitignore
+++ b/backend/cpp/privacy-filter/.gitignore
@@ -1,9 +0,0 @@
-/privacy-filter.cpp
-build/
-package/
-grpc-server
-*.o
-backend.pb.cc
-backend.pb.h
-backend.grpc.pb.cc
-backend.grpc.pb.h
--- a/backend/cpp/privacy-filter/CMakeLists.txt
+++ b/backend/cpp/privacy-filter/CMakeLists.txt
@@ -1,77 +0,0 @@
-cmake_minimum_required(VERSION 3.21)
-project(privacy-filter-grpc-server LANGUAGES CXX C)
-
-set(CMAKE_CXX_STANDARD 17)
-set(CMAKE_CXX_STANDARD_REQUIRED ON)
-set(TARGET grpc-server)
-
-# Path to the privacy-filter.cpp engine sources. The Makefile arranges for this
-# to exist (clone of a pinned commit, or a symlink to PRIVACY_FILTER_SRC).
-set(PRIVACY_FILTER_DIR "${CMAKE_CURRENT_SOURCE_DIR}/privacy-filter.cpp"
-    CACHE PATH "Path to the privacy-filter.cpp engine source tree")
-
-find_package(Threads REQUIRED)
-find_package(Protobuf CONFIG QUIET)
-if(NOT Protobuf_FOUND)
-    find_package(Protobuf REQUIRED)
-endif()
-find_package(gRPC CONFIG QUIET)
-if(NOT gRPC_FOUND)
-    # Ubuntu's apt-installed grpc++ does not ship a CMake config - fall back.
-    find_library(GRPCPP_LIB grpc++ REQUIRED)
-    find_library(GRPCPP_REFLECTION_LIB grpc++_reflection REQUIRED)
-    add_library(gRPC::grpc++ INTERFACE IMPORTED)
-    set_target_properties(gRPC::grpc++ PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_LIB}")
-    add_library(gRPC::grpc++_reflection INTERFACE IMPORTED)
-    set_target_properties(gRPC::grpc++_reflection PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_REFLECTION_LIB}")
-endif()
-
-find_program(_PROTOC NAMES protoc REQUIRED)
-find_program(_GRPC_CPP_PLUGIN NAMES grpc_cpp_plugin REQUIRED)
-
-get_filename_component(HW_PROTO "${CMAKE_CURRENT_SOURCE_DIR}/../../backend.proto" ABSOLUTE)
-get_filename_component(HW_PROTO_PATH "${HW_PROTO}" PATH)
-
-set(HW_PROTO_SRCS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.cc")
-set(HW_PROTO_HDRS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.h")
-set(HW_GRPC_SRCS  "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.cc")
-set(HW_GRPC_HDRS  "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.h")
-
-add_custom_command(
-    OUTPUT "${HW_PROTO_SRCS}" "${HW_PROTO_HDRS}" "${HW_GRPC_SRCS}" "${HW_GRPC_HDRS}"
-    COMMAND ${_PROTOC}
-    ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
-         --cpp_out  "${CMAKE_CURRENT_BINARY_DIR}"
-         -I "${HW_PROTO_PATH}"
-         --plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN}"
-         "${HW_PROTO}"
-    DEPENDS "${HW_PROTO}")
-
-add_library(hw_grpc_proto STATIC
-    ${HW_GRPC_SRCS} ${HW_GRPC_HDRS}
-    ${HW_PROTO_SRCS} ${HW_PROTO_HDRS})
-target_include_directories(hw_grpc_proto PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
-# The generated proto/grpc sources include protobuf and grpc++ headers, so this
-# library must see their include dirs. Linking the imported targets propagates
-# them. On Linux the apt headers live in /usr/include (default search path) so
-# this was a no-op; on macOS the Homebrew headers are under /opt/homebrew and
-# would otherwise be missed (runtime_version.h not found).
-target_link_libraries(hw_grpc_proto PUBLIC
-    protobuf::libprotobuf
-    gRPC::grpc++)
-
-# Build only the pf static lib (+ ggml) from the engine tree — no CLI/bench/tests.
-# PF_VULKAN is honored when passed on the cmake command line (it lands in the
-# shared cache the engine reads).
-set(PF_BUILD_TOOLS OFF CACHE BOOL "" FORCE)
-set(PF_BUILD_TESTS OFF CACHE BOOL "" FORCE)
-add_subdirectory(${PRIVACY_FILTER_DIR} ${CMAKE_CURRENT_BINARY_DIR}/privacy-filter.cpp)
-
-add_executable(${TARGET} grpc-server.cpp)
-target_link_libraries(${TARGET} PRIVATE
-    pf
-    hw_grpc_proto
-    gRPC::grpc++
-    gRPC::grpc++_reflection
-    protobuf::libprotobuf
-    Threads::Threads)
--- a/backend/cpp/privacy-filter/Makefile
+++ b/backend/cpp/privacy-filter/Makefile
@@ -1,77 +0,0 @@
-# privacy-filter backend Makefile.
-#
-# Wraps the standalone privacy-filter.cpp GGML engine (the openai-privacy-filter
-# PII/NER token classifier) as a LocalAI gRPC backend. The engine source is
-# fetched at the pin below — .github/workflows/bump_deps.yaml finds and updates
-# PRIVACY_FILTER_VERSION, matching the llama-cpp / ds4 convention.
-#
-# Local development: point at a working checkout instead of cloning, e.g.
-#   make PRIVACY_FILTER_SRC=$HOME/c/privacy-filter.cpp grpc-server
-
-PRIVACY_FILTER_VERSION?=98f52c5ef2250f207cc6b9a6aef05393a120cb7c
-PRIVACY_FILTER_REPO?=https://github.com/localai-org/privacy-filter.cpp
-PRIVACY_FILTER_SRC?=
-
-CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
-BUILD_DIR := build
-
-BUILD_TYPE ?=
-NATIVE ?= false
-JOBS ?= $(shell nproc 2>/dev/null || echo 4)
-
-CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release
-
-# GPU backends; the default (cpu) needs no extra flags. 'cublas' is LocalAI's
-# name for the CUDA build (matches llama-cpp / ds4), mapping to the engine's
-# GGML_CUDA path; 'vulkan' selects the ggml Vulkan backend.
-ifeq ($(BUILD_TYPE),cublas)
-    CMAKE_ARGS += -DPF_CUDA=ON
-endif
-ifeq ($(BUILD_TYPE),vulkan)
-    CMAKE_ARGS += -DPF_VULKAN=ON
-endif
-
-# Portable binaries for distribution: disable -march=native unless asked.
-ifneq ($(NATIVE),true)
-    CMAKE_ARGS += -DGGML_NATIVE=OFF
-endif
-
-.PHONY: grpc-server package clean purge test all
-all: grpc-server
-
-# Provide the engine sources at ./privacy-filter.cpp. With PRIVACY_FILTER_SRC
-# set we symlink a local checkout (instant, no network); otherwise we clone the
-# pinned commit and its ggml submodule. The directory/symlink is the target, so
-# make only does this once — run 'make purge && make' to refetch after a bump.
-privacy-filter.cpp:
-ifneq ($(PRIVACY_FILTER_SRC),)
-	ln -sfn $(abspath $(PRIVACY_FILTER_SRC)) privacy-filter.cpp
-else
-	mkdir -p privacy-filter.cpp
-	cd privacy-filter.cpp && \
-	git init -q && \
-	git remote add origin $(PRIVACY_FILTER_REPO) && \
-	git fetch --depth 1 origin $(PRIVACY_FILTER_VERSION) && \
-	git checkout FETCH_HEAD && \
-	git submodule update --init --recursive --depth 1
-endif
-
-grpc-server: privacy-filter.cpp
-	@echo "Building privacy-filter grpc-server ($(BUILD_TYPE)) with $(CMAKE_ARGS)"
-	mkdir -p $(BUILD_DIR)
-	cd $(BUILD_DIR) && cmake $(CMAKE_ARGS) $(CURRENT_MAKEFILE_DIR) && cmake --build . --config Release -j $(JOBS)
-	cp $(BUILD_DIR)/grpc-server grpc-server
-
-package: grpc-server
-	bash package.sh
-
-test:
-	@echo "privacy-filter backend: parity/regression coverage lives in the engine repo"
-
-clean:
-	rm -rf $(BUILD_DIR) grpc-server package
-
-# 'privacy-filter.cpp' may be a symlink (PRIVACY_FILTER_SRC) — rm without a
-# trailing slash removes the link, never the linked-to checkout.
-purge: clean
-	rm -rf privacy-filter.cpp
--- a/backend/cpp/privacy-filter/grpc-server.cpp
+++ b/backend/cpp/privacy-filter/grpc-server.cpp
@@ -1,210 +0,0 @@
-// privacy-filter LocalAI gRPC backend.
-//
-// Thin shim over privacy-filter.cpp's flat C API (include/pf.h): a standalone
-// GGML engine for the openai-privacy-filter token-classification model family
-// (PII NER). It replaces the llama.cpp-patched TokenClassify path for this one
-// model family — same GGUF files, no llama.cpp carry-patches.
-//
-// Only the RPCs the PII tier needs are implemented: LoadModel, TokenClassify,
-// plus Health / Status / Free. Everything else inherits the generated base
-// class default (UNIMPLEMENTED).
-
-#include "backend.pb.h"
-#include "backend.grpc.pb.h"
-
-#include "pf.h"
-
-#include <grpcpp/grpcpp.h>
-#include <grpcpp/server.h>
-#include <grpcpp/server_builder.h>
-#include <grpcpp/ext/proto_server_reflection_plugin.h>
-
-#include <atomic>
-#include <chrono>
-#include <csignal>
-#include <iostream>
-#include <memory>
-#include <mutex>
-#include <string>
-
-using grpc::Server;
-using grpc::ServerBuilder;
-using grpc::ServerContext;
-// NOTE: do NOT alias grpc::Status as Status — the Status RPC method below would
-// shadow the type and break the other method signatures. Use GStatus instead.
-using GStatus = ::grpc::Status;
-using grpc::StatusCode;
-
-namespace {
-
-// The engine is single-model-per-process: LocalAI spawns one backend process
-// per loaded model. g_mu guards (re)load against in-flight classification.
-std::mutex          g_mu;
-pf_ctx *            g_ctx = nullptr;
-std::atomic<Server *> g_server{nullptr};
-
-// Resolve the device string the engine expects ("cpu" / "gpu" / "cuda" /
-// "vulkan", optionally ":N"). Priority: an explicit "device:..." in
-// ModelOptions.Options, then a non-zero NGPULayers as a coarse "use the GPU"
-// signal, else CPU. "gpu" lets the engine pick whichever GPU backend this
-// binary was compiled with (CUDA or Vulkan), so the same config works on
-// either build; pin "device:cuda"/"device:vulkan" to be explicit.
-std::string resolve_device(const backend::ModelOptions * opts) {
-    for (const auto & o : opts->options()) {
-        const std::string prefix = "device:";
-        if (o.rfind(prefix, 0) == 0) {
-            return o.substr(prefix.size());
-        }
-    }
-    if (opts->ngpulayers() > 0) {
-        return "gpu";
-    }
-    return "cpu";
-}
-
-class PrivacyFilterBackend final : public backend::Backend::Service {
-public:
-    GStatus Health(ServerContext *, const backend::HealthMessage *,
-                   backend::Reply * reply) override {
-        reply->set_message("OK");
-        return GStatus::OK;
-    }
-
-    GStatus Status(ServerContext *, const backend::HealthMessage *,
-                   backend::StatusResponse * response) override {
-        std::lock_guard<std::mutex> lock(g_mu);
-        response->set_state(g_ctx ? backend::StatusResponse::READY
-                                  : backend::StatusResponse::UNINITIALIZED);
-        return GStatus::OK;
-    }
-
-    GStatus LoadModel(ServerContext *, const backend::ModelOptions * request,
-                      backend::Result * result) override {
-        std::lock_guard<std::mutex> lock(g_mu);
-
-        // ModelFile is the absolute path LocalAI resolves; Model is the bare
-        // name. Prefer the former, fall back to the latter.
-        const std::string path =
-            !request->modelfile().empty() ? request->modelfile() : request->model();
-        if (path.empty()) {
-            result->set_success(false);
-            result->set_message("no model path supplied");
-            return GStatus::OK;
-        }
-
-        const std::string device = resolve_device(request);
-
-        if (g_ctx) { pf_free(g_ctx); g_ctx = nullptr; }
-
-        pf_ctx * ctx = pf_load(path.c_str(), device.c_str(), request->threads());
-        const char * err = pf_last_error(ctx);
-        if (err) {
-            result->set_success(false);
-            result->set_message(std::string("privacy-filter load failed: ") + err);
-            pf_free(ctx);
-            return GStatus::OK;
-        }
-
-        // ContextSize, when set, becomes the per-forward window. The engine
-        // ignores values that are too small to window (<= 2*halo) and just
-        // runs a single forward, so passing it through is always safe.
-        if (request->contextsize() > 0) {
-            pf_set_window(ctx, request->contextsize());
-        }
-
-        g_ctx = ctx;
-        result->set_success(true);
-        result->set_message("privacy-filter loaded (" + device + ")");
-        return GStatus::OK;
-    }
-
-    GStatus TokenClassify(ServerContext *, const backend::TokenClassifyRequest * request,
-                          backend::TokenClassifyResponse * response) override {
-        std::lock_guard<std::mutex> lock(g_mu);
-        if (!g_ctx) {
-            return GStatus(StatusCode::FAILED_PRECONDITION, "Model not loaded");
-        }
-
-        const std::string & text = request->text();
-        if (text.empty()) {
-            return GStatus::OK;  // no text -> no entities
-        }
-
-        pf_entity * ents = nullptr;
-        size_t      n    = 0;
-        if (pf_classify(g_ctx, text.data(), text.size(), request->threshold(), &ents, &n) != 0) {
-            const char * err = pf_last_error(g_ctx);
-            return GStatus(StatusCode::INTERNAL,
-                           std::string("TokenClassify failed: ") + (err ? err : "unknown"));
-        }
-
-        // Byte offsets are into the original UTF-8 text; the engine already
-        // applied the threshold and whitespace-trimmed span edges.
-        for (size_t i = 0; i < n; i++) {
-            backend::TokenClassifyEntity * ent = response->add_entities();
-            ent->set_entity_group(ents[i].label ? ents[i].label : "");
-            ent->set_start(ents[i].start);
-            ent->set_end(ents[i].end);
-            ent->set_score(ents[i].score);
-            ent->set_text(text.substr((size_t) ents[i].start,
-                                      (size_t) (ents[i].end - ents[i].start)));
-        }
-        pf_entities_free(ents, n);
-        return GStatus::OK;
-    }
-
-    GStatus Free(ServerContext *, const backend::HealthMessage *,
-                 backend::Result * result) override {
-        std::lock_guard<std::mutex> lock(g_mu);
-        if (g_ctx) { pf_free(g_ctx); g_ctx = nullptr; }
-        result->set_success(true);
-        return GStatus::OK;
-    }
-};
-
-void RunServer(const std::string & addr) {
-    PrivacyFilterBackend service;
-    grpc::EnableDefaultHealthCheckService(true);
-    grpc::reflection::InitProtoReflectionServerBuilderPlugin();
-
-    ServerBuilder builder;
-    builder.AddListeningPort(addr, grpc::InsecureServerCredentials());
-    builder.RegisterService(&service);
-    builder.SetMaxReceiveMessageSize(64 * 1024 * 1024);
-    builder.SetMaxSendMessageSize(64 * 1024 * 1024);
-
-    std::unique_ptr<Server> server(builder.BuildAndStart());
-    if (!server) {
-        std::cerr << "privacy-filter grpc-server: failed to bind " << addr << "\n";
-        std::exit(1);
-    }
-    g_server = server.get();
-    std::cerr << "privacy-filter grpc-server listening on " << addr << "\n";
-    server->Wait();
-}
-
-void signal_handler(int) {
-    if (auto * srv = g_server.load()) {
-        srv->Shutdown(std::chrono::system_clock::now() + std::chrono::seconds(3));
-    }
-}
-
-} // namespace
-
-int main(int argc, char * argv[]) {
-    std::string addr = "127.0.0.1:50051";
-    for (int i = 1; i < argc; ++i) {
-        std::string a = argv[i];
-        const std::string addr_flag = "--addr=";
-        if (a.rfind(addr_flag, 0) == 0)      addr = a.substr(addr_flag.size());
-        else if (a == "--addr" && i + 1 < argc) addr = argv[++i];
-        else if (a == "--help" || a == "-h") {
-            std::cout << "Usage: grpc-server --addr=HOST:PORT\n";
-            return 0;
-        }
-    }
-    std::signal(SIGINT,  signal_handler);
-    std::signal(SIGTERM, signal_handler);
-    RunServer(addr);
-    return 0;
-}
--- a/backend/cpp/privacy-filter/package.sh
+++ b/backend/cpp/privacy-filter/package.sh
@@ -1,39 +0,0 @@
-#!/bin/bash
-# Assemble package/ for the from-scratch backend image: the grpc-server binary,
-# run.sh, the dynamic loader, and every shared library the binary needs.
-set -e
-CURDIR=$(dirname "$(realpath "$0")")
-REPO_ROOT="${CURDIR}/../../.."
-
-mkdir -p "$CURDIR/package/lib"
-cp -avf "$CURDIR/grpc-server" "$CURDIR/package/"
-cp -rfv "$CURDIR/run.sh"      "$CURDIR/package/"
-
-# The dynamic loader, renamed to lib/ld.so so run.sh can invoke it explicitly
-# (makes the image independent of the host's glibc layout).
-if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
-    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
-elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
-    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
-else
-    echo "package.sh: unknown architecture" >&2; exit 1
-fi
-
-# Bundle the binary's transitive shared deps (libstdc++, libgomp, and the apt
-# grpc++/protobuf/absl stack) by walking ldd — robust to whichever of those are
-# linked shared vs static. The loader line (no "=>") is skipped; ld.so above
-# already covers it.
-ldd "$CURDIR/grpc-server" | awk '$2 == "=>" && $3 ~ /^\// { print $3 }' | sort -u | \
-while read -r so; do
-    [ -f "$so" ] && cp -arfLv "$so" "$CURDIR/package/lib/"
-done
-
-# Vulkan loader / GPU libs when building the GPU variant.
-GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
-if [ -f "$GPU_LIB_SCRIPT" ]; then
-    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
-    package_gpu_libs
-fi
-
-echo "privacy-filter package contents:"
-ls -lah "$CURDIR/package/" "$CURDIR/package/lib/"
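Reviewer note on the removed package.sh: the ldd walk keeps only lines of the form `name => /abs/path`, so the loader line (which has no `=>`) and virtual entries such as linux-vdso are dropped. Below is a minimal sketch of that filter run against canned input. The helper names `filter_ldd_paths` and `sample_ldd_output` and the sample library paths are assumptions made for illustration; they are not part of the script.

```shell
#!/bin/bash
# Hypothetical helper: the same awk filter package.sh uses, reading
# ldd-style text from stdin instead of calling ldd on a binary.
filter_ldd_paths() {
    awk '$2 == "=>" && $3 ~ /^\// { print $3 }' | sort -u
}

# Canned ldd-style output (illustrative paths, not from a real build).
sample_ldd_output() {
    cat <<'EOF'
	linux-vdso.so.1 (0x00007ffd1a5f2000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f1e2a000000)
	libgomp.so.1 => /lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f1e29f00000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f1e2a000000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f1e2a400000)
EOF
}

# Prints the two resolved library paths once each; the vdso and loader
# lines are skipped because their second field is not "=>".
sample_ldd_output | filter_ldd_paths
```

Skipping the loader line is deliberate in the original script: the loader is copied separately as lib/ld.so.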
--- a/backend/cpp/privacy-filter/run.sh
+++ b/backend/cpp/privacy-filter/run.sh
@@ -1,15 +0,0 @@
-#!/bin/bash
-# Entry point for the privacy-filter backend image / BACKEND_BINARY mode.
-set -e
-CURDIR=$(dirname "$(realpath "$0")")
-# macOS has no bundled ld.so; the darwin package ships only dylibs under lib/,
-# resolved via DYLD_LIBRARY_PATH (the ld.so branch below is skipped there).
-if [ "$(uname)" = "Darwin" ]; then
-    export DYLD_LIBRARY_PATH="$CURDIR/lib:$DYLD_LIBRARY_PATH"
-else
-    export LD_LIBRARY_PATH="$CURDIR/lib:$LD_LIBRARY_PATH"
-fi
-if [ -f "$CURDIR/lib/ld.so" ]; then
-    exec "$CURDIR/lib/ld.so" "$CURDIR/grpc-server" "$@"
-fi
-exec "$CURDIR/grpc-server" "$@"
--- a/backend/cpp/run-unit-tests.sh
+++ b/backend/cpp/run-unit-tests.sh
@@ -1,71 +0,0 @@
-#!/bin/bash
-#
-# Discovers and runs every standalone C++ unit test under backend/cpp/.
-#
-# A "standalone" unit test is a *_test.cpp that depends only on the C++ standard
-# library and nlohmann/json (single header) - i.e. it exercises pure helpers and
-# does not need the full llama.cpp + gRPC backend build. Tests that DO need the
-# backend build use the CMake/ctest path (e.g. -DLLAMA_GRPC_BUILD_TESTS=ON)
-# instead and are skipped here.
-#
-# This keeps CI generic: adding a new pure-C++ unit test file named *_test.cpp in
-# an active backend source dir is picked up automatically, with no CI edits.
-#
-# Env:
-#   NLOHMANN_INCLUDE  include dir that contains nlohmann/json.hpp. If unset, the
-#                     nlohmann/json single header is fetched to a temp dir.
-#   CXX               compiler (default: g++).
-#   JSON_VERSION      nlohmann/json tag to fetch when NLOHMANN_INCLUDE is unset
-#                     (default: v3.11.3).
-set -uo pipefail
-
-ROOT="$(cd "$(dirname "$0")" && pwd)"
-CXX="${CXX:-g++}"
-JSON_VERSION="${JSON_VERSION:-v3.11.3}"
-
-JSON_INC="${NLOHMANN_INCLUDE:-}"
-if [ -z "$JSON_INC" ]; then
-    JSON_INC="$(mktemp -d)"
-    mkdir -p "$JSON_INC/nlohmann"
-    echo "Fetching nlohmann/json ${JSON_VERSION} single header..."
-    if ! curl -L -sf \
-        "https://raw.githubusercontent.com/nlohmann/json/${JSON_VERSION}/single_include/nlohmann/json.hpp" \
-        -o "$JSON_INC/nlohmann/json.hpp"; then
-        echo "ERROR: failed to fetch nlohmann/json header" >&2
-        exit 1
-    fi
-fi
-
-# Active source dirs only - exclude per-variant build copies, dev snapshots and
-# the vendored upstream llama.cpp tree.
-mapfile -t tests < <(find "$ROOT" -name '*_test.cpp' \
-    -not -path '*/llama.cpp/*' \
-    -not -path '*-build/*' \
-    -not -path '*-dev/*' \
-    -not -path '*fallback*' | sort)
-
-if [ "${#tests[@]}" -eq 0 ]; then
-    echo "No standalone C++ unit tests found under $ROOT"
-    exit 0
-fi
-
-fail=0
-for test_src in "${tests[@]}"; do
-    name="$(basename "$test_src" .cpp)"
-    bin="$(mktemp -d)/$name"
-    echo "==> $test_src"
-    if ! "$CXX" -std=c++17 -Wall -Wextra \
-        -I"$JSON_INC" -I"$(dirname "$test_src")" \
-        "$test_src" -o "$bin"; then
-        echo "COMPILE FAILED: $test_src" >&2
-        fail=1
-        continue
-    fi
-    if ! "$bin"; then
-        echo "TEST FAILED: $test_src" >&2
-        fail=1
-    fi
-done
-
-echo "Ran ${#tests[@]} standalone C++ unit test file(s)"
-exit "$fail"
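Reviewer note on the removed run-unit-tests.sh: discovery is a single `find` with path exclusions, which is why a new `*_test.cpp` in an active source dir needed no CI changes. The sketch below reproduces that discovery on a throwaway tree. `discover_tests` is a hypothetical helper name and the directory layout is invented for the demo.

```shell
#!/bin/bash
# Hypothetical helper: list *_test.cpp files under a root, using the same
# exclusions as run-unit-tests.sh.
discover_tests() {
    find "$1" -name '*_test.cpp' \
        -not -path '*/llama.cpp/*' \
        -not -path '*-build/*' \
        -not -path '*-dev/*' \
        -not -path '*fallback*' | sort
}

# Build a throwaway tree (directory names are illustrative).
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/llama-cpp" \
         "$demo_root/llama-cpp/llama.cpp/tests" \
         "$demo_root/turboquant-avx-build"
touch "$demo_root/llama-cpp/parse_test.cpp" \
      "$demo_root/llama-cpp/llama.cpp/tests/upstream_test.cpp" \
      "$demo_root/turboquant-avx-build/parse_test.cpp"

# Only the file in the active source dir is listed; the vendored tree and
# the per-variant build copy are filtered out.
discover_tests "$demo_root"
```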
--- a/backend/cpp/turboquant/Makefile
+++ b/backend/cpp/turboquant/Makefile
@@ -1,7 +1,7 @@

 # Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
 # Auto-bumped nightly by .github/workflows/bump_deps.yaml.
-TURBOQUANT_VERSION?=7d9715f1f071fa07c7b2ad3dbfd320b314139e65
+TURBOQUANT_VERSION?=5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403
 LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant

 CMAKE_ARGS?=
@@ -65,29 +65,6 @@ turboquant-avx:
 turboquant-fallback:
 	$(call turboquant-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)

-# Single-build CPU backend via ggml CPU_ALL_VARIANTS (mirrors llama-cpp-cpu-all).
-# turboquant reuses backend/cpp/llama-cpp's CMakeLists.txt (hw_grpc_proto STATIC) and
-# Makefile (SHARED_LIBS make-var + EXTRA_CMAKE_ARGS), so this passes the same overrides
-# through to the copied build: SHARED_LIBS=ON, the DL flags, and --target ggml (which
-# pulls in the per-microarch libggml-cpu-*.so via ggml's add_dependencies). The .so set
-# is collected for package.sh to bundle into package/lib.
-turboquant-cpu-all:
-	rm -rf $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build
-	cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build
-	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build purge
-	bash $(CURRENT_MAKEFILE_DIR)/patch-grpc-server.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/grpc-server.cpp
-	$(info $(GREEN)I turboquant build info:cpu-all-variants$(RESET))
-	LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
-	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build llama.cpp
-	bash $(CURRENT_MAKEFILE_DIR)/apply-patches.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/llama.cpp $(PATCHES_DIR)
-	SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" \
-	LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
-	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build grpc-server
-	cp -rfv $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/grpc-server turboquant-cpu-all
-	rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs
-	find $(CURRENT_MAKEFILE_DIR)/../turboquant-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \;
-	@echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/
-
 turboquant-grpc:
 	$(call turboquant-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target rpc-server)

--- a/backend/cpp/turboquant/package.sh
+++ b/backend/cpp/turboquant/package.sh
@@ -14,15 +14,6 @@ mkdir -p $CURDIR/package/lib
 cp -avrf $CURDIR/turboquant-* $CURDIR/package/
 cp -rfv $CURDIR/run.sh $CURDIR/package/

-# Bundle the ggml shared backends from the CPU_ALL_VARIANTS build into package/lib. ggml
-# discovers the per-microarch libggml-cpu-*.so by scanning the executable directory, which
-# (via the bundled lib/ld.so that run.sh launches through) resolves to lib/. See the
-# matching comment in backend/cpp/llama-cpp/package.sh. No-op on the fallback/ROCm builds.
-if [ -d "$CURDIR/ggml-shared-libs" ]; then
-    echo "Bundling ggml shared backends (CPU_ALL_VARIANTS)..."
-    cp -avf $CURDIR/ggml-shared-libs/*.so* $CURDIR/package/lib/
-fi
-
 # Detect architecture and copy appropriate libraries
 if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    # x86_64 architecture
--- a/backend/cpp/turboquant/patch-grpc-server.sh
+++ b/backend/cpp/turboquant/patch-grpc-server.sh
@@ -4,19 +4,21 @@
 #
 #   1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
 #      fork-specific `turbo2` / `turbo3` / `turbo4` cache types.
-#   2. Define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top of the file
-#      so the grpc-server option parser skips the two references to
-#      common_params::checkpoint_min_step (the default and the option handler).
-#      That field does not exist in the fork yet; drop this once it does.
-#
-# The fork used to lag upstream on the whole common_params_speculative refactor
-# (ggml-org/llama.cpp#22397/#22838/#22964), the model_tgt rename (#22838) and
-# get_media_marker (#21962), which required a much larger compat shim here
-# (flat-field sed renames + a coarse LOCALAI_LEGACY_LLAMA_CPP_SPEC define). The
-# fork has since rebased past all of those, so the only remaining gap is
-# checkpoint_min_step. If a future bump reintroduces a divergence, add a narrow
-# guard in grpc-server.cpp keyed on a fork-specific macro and inject it here
-# rather than resurrecting the coarse one.
+#   2. Replace `get_media_marker()` (added upstream in ggml-org/llama.cpp#21962,
+#      server-side random per-instance marker) with the legacy "<__media__>"
+#      literal. The fork branched before that PR, so server-common.cpp has no
+#      get_media_marker symbol. The fork's mtmd_default_marker() still returns
+#      "<__media__>", and Go-side tooling falls back to that sentinel when the
+#      backend does not expose media_marker, so substituting the literal keeps
+#      behavior identical on the turboquant path.
+#   3. Revert the `common_params_speculative` field references to the
+#      pre-refactor flat layout. Upstream ggml-org/llama.cpp#22397 split the
+#      struct into nested `draft` / `ngram_simple` / `ngram_mod` / etc. members;
+#      the turboquant fork branched before that PR and still exposes the flat
+#      `n_max`, `mparams_dft`, `ngram_size_n`, ... fields. The substitutions
+#      below map the new nested paths back to the legacy flat names so the
+#      shared grpc-server.cpp keeps compiling against the fork's common.h.
+#      Drop this block once the fork rebases past #22397.
 #
 # We patch the *copy* sitting in turboquant-<flavor>-build/, never the original
 # under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps compiling
@@ -70,20 +72,69 @@ else
    echo "==> KV allow-list patch OK"
 fi

-# 2. Define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top of the file so
-#    the grpc-server option parser skips the two references to
-#    common_params::checkpoint_min_step (the default assignment and the option
-#    handler). That field does not exist in the fork yet. Drop this block once
-#    the fork rebases past the bump that added checkpoint_min_step.
-if grep -q '^#define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP' "$SRC"; then
-    echo "==> $SRC already defines LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP, skipping"
+if grep -q 'get_media_marker()' "$SRC"; then
+    echo "==> patching $SRC to replace get_media_marker() with legacy \"<__media__>\" literal"
+    # Only one call site today (ModelMetadata), but replace all occurrences to
+    # stay robust if upstream adds more. Use a temp file to avoid relying on
+    # sed -i portability (the builder image uses GNU sed, but keeping this
+    # consistent with the awk block above).
+    sed 's/get_media_marker()/"<__media__>"/g' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+    echo "==> get_media_marker() substitution OK"
 else
-    echo "==> patching $SRC to define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top"
-    # Insert the define before the very first `#include` so it precedes the
-    # checkpoint_min_step references.
+    echo "==> $SRC has no get_media_marker() call, skipping media-marker patch"
+fi
+
+if grep -q 'params\.speculative\.draft\.\|params\.speculative\.ngram_simple\.' "$SRC"; then
+    echo "==> patching $SRC to revert common_params_speculative refs to pre-#22397 flat layout"
+    # Each substitution is the exact post-refactor path → legacy flat field.
+    # Order doesn't matter because the source paths are disjoint, but we keep
+    # the most-specific (mparams.path) first for readability.
+    sed -E \
+        -e 's/params\.speculative\.draft\.mparams\.path/params.speculative.mparams_dft.path/g' \
+        -e 's/params\.speculative\.draft\.n_max/params.speculative.n_max/g' \
+        -e 's/params\.speculative\.draft\.n_min/params.speculative.n_min/g' \
+        -e 's/params\.speculative\.draft\.p_min/params.speculative.p_min/g' \
+        -e 's/params\.speculative\.draft\.p_split/params.speculative.p_split/g' \
+        -e 's/params\.speculative\.draft\.n_gpu_layers/params.speculative.n_gpu_layers/g' \
+        -e 's/params\.speculative\.draft\.n_ctx/params.speculative.n_ctx/g' \
+        -e 's/params\.speculative\.ngram_simple\.size_n/params.speculative.ngram_size_n/g' \
+        -e 's/params\.speculative\.ngram_simple\.size_m/params.speculative.ngram_size_m/g' \
+        -e 's/params\.speculative\.ngram_simple\.min_hits/params.speculative.ngram_min_hits/g' \
+        "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+    echo "==> speculative field rename OK"
+else
+    echo "==> $SRC has no post-#22397 speculative field refs, skipping spec rename patch"
+fi
+
+# 4. Revert the `ctx_server.impl->model_tgt` rename introduced by upstream
+#    ggml-org/llama.cpp#22838 (parallel drafting). The turboquant fork still
+#    exposes the field as `model` on `server_context_impl`. The two call sites
+#    are in the Rerank and ModelMetadata RPC handlers.
+if grep -q 'ctx_server\.impl->model_tgt' "$SRC"; then
+    echo "==> patching $SRC to revert ctx_server.impl->model_tgt -> ctx_server.impl->model"
+    sed -E 's/ctx_server\.impl->model_tgt/ctx_server.impl->model/g' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+    echo "==> model_tgt rename OK"
+else
+    echo "==> $SRC has no ctx_server.impl->model_tgt refs, skipping model_tgt rename patch"
+fi
+
+# 5. Define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top of the file so the
+#    grpc-server option parser skips the new option-handler blocks (ngram_mod,
+#    ngram_map_k, ngram_map_k4v, ngram_cache, draft.cache_type_*, draft.cpuparams*,
+#    draft.tensor_buft_overrides) introduced for the post-#22838 layout. Those
+#    blocks reference struct fields that simply do not exist in the fork.
+if grep -q '^#define LOCALAI_LEGACY_LLAMA_CPP_SPEC' "$SRC"; then
+    echo "==> $SRC already defines LOCALAI_LEGACY_LLAMA_CPP_SPEC, skipping"
+else
+    echo "==> patching $SRC to define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top"
+    # Insert the define before the very first `#include` so it precedes all the
+    # speculative-decoding code paths.
    awk '
        !done && /^#include/ {
-            print "#define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP 1"
+            print "#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1"
            print "// ^ injected by backend/cpp/turboquant/patch-grpc-server.sh"
            print ""
            done = 1
@@ -91,13 +142,13 @@ else
        { print }
        END {
            if (!done) {
-                print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP" > "/dev/stderr"
+                print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_LEGACY_LLAMA_CPP_SPEC" > "/dev/stderr"
                exit 1
            }
        }
    ' "$SRC" > "$SRC.tmp"
    mv "$SRC.tmp" "$SRC"
-    echo "==> LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP define OK"
+    echo "==> LOCALAI_LEGACY_LLAMA_CPP_SPEC define OK"
 fi

 echo "==> all patches applied"
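Reviewer note on patch-grpc-server.sh: each step follows the same shape, a `grep` guard that makes the patch idempotent, then a rewrite into `$SRC.tmp` followed by `mv`. The sketch below isolates that shape for the define-injection step. `inject_define` is a hypothetical helper, not a function in the script, and unlike the script it checks awk's exit status before moving the temp file into place.

```shell
#!/bin/bash
# Hypothetical helper: insert "#define <macro> 1" before the first #include
# of a file, skipping the edit when the define is already present.
inject_define() {
    local src="$1" macro="$2"
    if grep -q "^#define ${macro}" "$src"; then
        return 0   # already patched: nothing to do
    fi
    awk -v macro="$macro" '
        !done && /^#include/ {
            print "#define " macro " 1"
            print ""
            done = 1
        }
        { print }
        END { if (!done) exit 1 }   # no #include anchor found
    ' "$src" > "$src.tmp" || { rm -f "$src.tmp"; return 1; }
    mv "$src.tmp" "$src"
}

# Demo on a scratch file (contents are illustrative).
demo_src="$(mktemp)"
printf '%s\n' '// demo file' '#include <string>' '#include <vector>' > "$demo_src"
inject_define "$demo_src" LOCALAI_LEGACY_LLAMA_CPP_SPEC
inject_define "$demo_src" LOCALAI_LEGACY_LLAMA_CPP_SPEC   # second call is a no-op
cat "$demo_src"
```

Writing to a temp file and renaming avoids depending on `sed -i` or in-place awk behaviour, which is the portability concern the script's own comment mentions.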
--- a/backend/cpp/turboquant/patches/0001-hip-guard-copy2d-peer-fastpath.patch
+++ b/backend/cpp/turboquant/patches/0001-hip-guard-copy2d-peer-fastpath.patch
@@ -1,55 +0,0 @@
-hip: port the turboquant CUDA additions that ggml's HIP shim doesn't cover
-
-The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs
-that ggml's HIP (and MUSA) compatibility layer does not provide, breaking
-the -gpu-rocm-hipblas-turboquant build:
-
-  1. ggml_cuda_copy2d_across_devices() (host-staged cross-device copy for
-     split mul_mat output) uses the CUDA 3D-peer copy APIs
-     cudaMemcpy3DPeerParms / make_cudaPitchedPtr / make_cudaExtent /
-     cudaMemcpy3DPeerAsync. HIP genuinely does not support these (see the
-     fork's own comment "HIP does not support cudaMemcpy3DPeerAsync"), so
-     guard the peer fast path with #if !defined(GGML_USE_HIP) &&
-     !defined(GGML_USE_MUSA) -- matching how the fork already guards the
-     same API for the sibling 2D copy -- and fall through to the existing
-     cudaMemcpyAsync staging fallback below (functionally identical,
-     slightly slower on multi-GPU ROCm).
-
-  2. ggml_backend_cuda_device_event_new() creates its event with plain
-     cudaEventCreate, which ggml's HIP shim does not alias (it only aliases
-     cudaEventCreateWithFlags). Use cudaEventCreateWithFlags(..., 
-     cudaEventDisableTiming) -- exactly what the rest of this file already
-     does (cf. lines ~1034, ~3461) and HIP-safe.
-
-CUDA builds are unaffected. Drop the relevant hunk once the fork HIP-ports
-these; apply-patches.sh fails fast if an anchor goes stale.
-
-diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
-index 0427e6b..6352e6a 100644
---- a/ggml/src/ggml-cuda/ggml-cuda.cu
-+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
-@@ -1933,6 +1933,7 @@ static cudaError_t ggml_cuda_copy2d_across_devices(
-     size_t width, size_t height, cudaStream_t dst_stream, cudaStream_t src_stream) {
- 
-     const auto & info = ggml_cuda_info();
-+#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)  // 3D-peer copy types unmapped by ggml's HIP/MUSA shim; use staging fallback below
-     if (info.peer_access[src_device][dst_device]) {
-         cudaMemcpy3DPeerParms p = {};
-         p.dstDevice = dst_device;
-@@ -1942,6 +1943,7 @@ static cudaError_t ggml_cuda_copy2d_across_devices(
-         p.extent = make_cudaExtent(width, height, 1);
-         return cudaMemcpy3DPeerAsync(&p, dst_stream);
-     }
-+#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
- 
-     // Fallback: stage all rows through a single contiguous pinned buffer
-     int prev_device = ggml_cuda_get_device();
-@@ -5714,7 +5716,7 @@ static ggml_backend_event_t ggml_backend_cuda_device_event_new(ggml_backend_dev_
-     ggml_cuda_set_device(dev_ctx->device);
- 
-     cudaEvent_t event;
--    CUDA_CHECK(cudaEventCreate(&event));
-+    CUDA_CHECK(cudaEventCreateWithFlags(&event, cudaEventDisableTiming));
- 
-     return new ggml_backend_event {
-         /* .device  = */ dev,
--- a/backend/cpp/turboquant/run.sh
+++ b/backend/cpp/turboquant/run.sh
@@ -2,7 +2,7 @@
 set -ex

 # Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath "$0")")
+CURDIR=$(dirname "$(realpath $0)")

 cd /

@@ -12,39 +12,54 @@ grep -e "flags" /proc/cpuinfo | head -1

 BINARY=turboquant-fallback

-# x86/arm64 ship a single turboquant-cpu-all built with ggml CPU_ALL_VARIANTS: ggml's
-# backend registry dlopens the best libggml-cpu-*.so for this host, so no shell-side
-# probing. ROCm ships only turboquant-fallback, so fall back to it when cpu-all is absent.
-if [ -e "$CURDIR"/turboquant-cpu-all ]; then
-	BINARY=turboquant-cpu-all
+if grep -q -e "\savx\s" /proc/cpuinfo ; then
+	echo "CPU:    AVX    found OK"
+	if [ -e $CURDIR/turboquant-avx ]; then
+		BINARY=turboquant-avx
+	fi
+fi
+
+if grep -q -e "\savx2\s" /proc/cpuinfo ; then
+	echo "CPU:    AVX2   found OK"
+	if [ -e $CURDIR/turboquant-avx2 ]; then
+		BINARY=turboquant-avx2
+	fi
+fi
+
+# Check avx 512
+if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
+	echo "CPU:    AVX512F found OK"
+	if [ -e $CURDIR/turboquant-avx512 ]; then
+		BINARY=turboquant-avx512
+	fi
 fi

 if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then
-	if [ -e "$CURDIR"/turboquant-grpc ]; then
+	if [ -e $CURDIR/turboquant-grpc ]; then
 		BINARY=turboquant-grpc
 	fi
 fi

 # Extend ld library path with the dir where this script is located/lib
 if [ "$(uname)" == "Darwin" ]; then
-	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
+	export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
 else
-	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
+	export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 	# Tell rocBLAS where to find TensileLibrary data (GPU kernel tuning files)
 	if [ -d "$CURDIR/lib/rocblas/library" ]; then
-		export ROCBLAS_TENSILE_LIBPATH="$CURDIR"/lib/rocblas/library
+		export ROCBLAS_TENSILE_LIBPATH=$CURDIR/lib/rocblas/library
 	fi
 fi

 # If there is a lib/ld.so, use it
-if [ -f "$CURDIR"/lib/ld.so ]; then
+if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using binary: $BINARY"
-	exec "$CURDIR"/lib/ld.so "$CURDIR"/$BINARY "$@"
+	exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
 fi

 echo "Using binary: $BINARY"
-exec "$CURDIR"/$BINARY "$@"
+exec $CURDIR/$BINARY "$@"

 # We should never reach this point, however just in case we do, run fallback
-exec "$CURDIR"/turboquant-fallback "$@"
+exec $CURDIR/turboquant-fallback "$@"
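Reviewer note on the restored run.sh probing: the checks run in order (AVX, AVX2, AVX512F) and each overrides `BINARY` only when the matching variant file exists, so the most capable variant that is actually shipped wins. The sketch below models that selection as a pure function. `pick_binary` is a hypothetical helper. It takes the flag list as an argument rather than grepping /proc/cpuinfo, and matches flags with a space-padded `case` pattern instead of the script's `grep "\savx\s"`.

```shell
#!/bin/bash
# Hypothetical helper: choose a binary name from a space-separated CPU flag
# list and the variants present in a directory. Later checks override
# earlier ones, mirroring the order of the if-blocks in run.sh.
pick_binary() {
    local flags=" $1 " dir="$2" binary=turboquant-fallback
    local isa flag
    for isa in avx avx2 avx512; do
        flag="$isa"
        if [ "$isa" = avx512 ]; then
            flag=avx512f   # run.sh probes the avx512f flag for the avx512 variant
        fi
        case "$flags" in
            *" $flag "*)
                if [ -e "$dir/turboquant-$isa" ]; then
                    binary="turboquant-$isa"
                fi
                ;;
        esac
    done
    echo "$binary"
}

# Demo: a package that ships only the avx and avx2 variants.
demo_dir="$(mktemp -d)"
touch "$demo_dir/turboquant-avx" "$demo_dir/turboquant-avx2"
pick_binary "fpu sse2 avx avx2 avx512f" "$demo_dir"   # avx512 variant not shipped
pick_binary "fpu sse2" "$demo_dir"                    # no AVX flags at all
```

With these inputs the first call selects turboquant-avx2 and the second stays on turboquant-fallback.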
--- a/backend/go/acestep-cpp/Makefile
+++ b/backend/go/acestep-cpp/Makefile
@@ -117,8 +117,7 @@ libgoacestepcpp-custom: CMakeLists.txt cpp/goacestepcpp.cpp cpp/goacestepcpp.h
 	cmake .. $(CMAKE_ARGS) && \
 	cmake --build . --config Release -j$(JOBS) --target goacestepcpp && \
 	cd .. && \
-	(mv build-$(SO_TARGET)/libgoacestepcpp.so ./$(SO_TARGET) 2>/dev/null || \
-	 mv build-$(SO_TARGET)/libgoacestepcpp.dylib ./$(SO_TARGET) 2>/dev/null)
+	mv build-$(SO_TARGET)/libgoacestepcpp.so ./$(SO_TARGET)

 test: acestep-cpp
 	@echo "Running acestep-cpp tests..."
--- a/backend/go/acestep-cpp/main.go
+++ b/backend/go/acestep-cpp/main.go
@@ -4,7 +4,6 @@ package main
 import (
 	"flag"
 	"os"
-	"runtime"

 	"github.com/ebitengine/purego"
 	grpc "github.com/mudler/LocalAI/pkg/grpc"
@@ -23,11 +22,7 @@ func main() {
 	// Get library name from environment variable, default to fallback
 	libName := os.Getenv("ACESTEP_LIBRARY")
 	if libName == "" {
-		if runtime.GOOS == "darwin" {
-			libName = "./libgoacestepcpp-fallback.dylib"
-		} else {
-			libName = "./libgoacestepcpp-fallback.so"
-		}
+		libName = "./libgoacestepcpp-fallback.so"
 	}

 	gosd, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
--- a/backend/go/acestep-cpp/package.sh
+++ b/backend/go/acestep-cpp/package.sh
@@ -13,7 +13,6 @@ mkdir -p $CURDIR/package/lib

 cp -avf $CURDIR/acestep-cpp $CURDIR/package/
 cp -fv $CURDIR/libgoacestepcpp-*.so $CURDIR/package/
-cp -fv $CURDIR/libgoacestepcpp-*.dylib $CURDIR/package/ 2>/dev/null || true
 cp -fv $CURDIR/run.sh $CURDIR/package/

 # Detect architecture and copy appropriate libraries
--- a/backend/go/acestep-cpp/run.sh
+++ b/backend/go/acestep-cpp/run.sh
@@ -2,7 +2,7 @@
 set -ex

 # Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath "$0")")
+CURDIR=$(dirname "$(realpath $0)")

 cd /

@@ -12,29 +12,19 @@ if [ "$(uname)" != "Darwin" ]; then
 	grep -e "flags" /proc/cpuinfo | head -1
 fi

-if [ "$(uname)" = "Darwin" ]; then
-	# macOS: single library variant (Metal or Accelerate). The goacestepcpp
-	# target is built as a CMake MODULE, which emits a .dylib for a SHARED
-	# build but a .so for a MODULE build on Apple, so prefer .dylib and fall
-	# back to .so.
-	LIBRARY="$CURDIR/libgoacestepcpp-fallback.dylib"
-	if [ ! -e "$LIBRARY" ]; then
-		LIBRARY="$CURDIR/libgoacestepcpp-fallback.so"
-	fi
-	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
-else
-	LIBRARY="$CURDIR/libgoacestepcpp-fallback.so"
+LIBRARY="$CURDIR/libgoacestepcpp-fallback.so"

+if [ "$(uname)" != "Darwin" ]; then
 	if grep -q -e "\savx\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX    found OK"
-		if [ -e "$CURDIR"/libgoacestepcpp-avx.so ]; then
+		if [ -e $CURDIR/libgoacestepcpp-avx.so ]; then
 			LIBRARY="$CURDIR/libgoacestepcpp-avx.so"
 		fi
 	fi

 	if grep -q -e "\savx2\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX2   found OK"
-		if [ -e "$CURDIR"/libgoacestepcpp-avx2.so ]; then
+		if [ -e $CURDIR/libgoacestepcpp-avx2.so ]; then
 			LIBRARY="$CURDIR/libgoacestepcpp-avx2.so"
 		fi
 	fi
@@ -42,22 +32,21 @@ else
 	# Check avx 512
 	if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
 		echo "CPU:    AVX512F found OK"
-		if [ -e "$CURDIR"/libgoacestepcpp-avx512.so ]; then
+		if [ -e $CURDIR/libgoacestepcpp-avx512.so ]; then
 			LIBRARY="$CURDIR/libgoacestepcpp-avx512.so"
 		fi
 	fi
-
-	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
 fi

+export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 export ACESTEP_LIBRARY=$LIBRARY

 # If there is a lib/ld.so, use it
-if [ -f "$CURDIR"/lib/ld.so ]; then
+if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using library: $LIBRARY"
-	exec "$CURDIR"/lib/ld.so "$CURDIR"/acestep-cpp "$@"
+	exec $CURDIR/lib/ld.so $CURDIR/acestep-cpp "$@"
 fi

 echo "Using library: $LIBRARY"
-exec "$CURDIR"/acestep-cpp "$@"
+exec $CURDIR/acestep-cpp "$@"
--- a/backend/go/ced/.gitignore
+++ b/backend/go/ced/.gitignore
@@ -1,11 +0,0 @@
-.cache/
-sources/
-build/
-package/
-ced-grpc
-# build artifacts staged in-tree by the Makefile (cp from sources/) or
-# symlinked for local dev; the real sources live in ced.cpp upstream.
-*.so
-*.so.*
-ced_capi.h
-compile_commands.json
--- a/backend/go/ced/Makefile
+++ b/backend/go/ced/Makefile
@@ -1,78 +0,0 @@
-# ced sound-classification backend Makefile.
-#
-# Upstream pin lives below as CED_VERSION?=<sha> so .github/bump_deps.sh can find
-# and update it (matches the parakeet-cpp / whisper.cpp convention).
-#
-# Local dev shortcut: symlink an out-of-tree ced.cpp shared build + header and
-# skip the clone/cmake steps entirely:
-#   ln -sf /path/to/ced.cpp/build-shared/libced.so .
-#   ln -sf /path/to/ced.cpp/include/ced_capi.h .
-#   go build -o ced-grpc .
-
-CED_VERSION?=c04ac14b7992d00584d9e812c9bb6268598a6ce7
-CED_REPO?=https://github.com/mudler/ced.cpp
-
-GOCMD?=go
-GO_TAGS?=
-JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
-
-BUILD_TYPE?=
-NATIVE?=false
-
-# Static-link ggml into libced.so (PIC) so the shared lib is self-contained:
-# dlopen needs no libggml*.so alongside it, only system libs the runtime image
-# already provides.
-CMAKE_ARGS?=-DCMAKE_BUILD_TYPE=Release -DCED_SHARED=ON -DCED_BUILD_CLI=OFF -DCED_BUILD_TESTS=OFF -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON
-
-ifeq ($(NATIVE),false)
-	CMAKE_ARGS+=-DGGML_NATIVE=OFF
-endif
-
-# ced.cpp gates its ggml backends behind CED_GGML_* options (set(... CACHE BOOL
-# "" FORCE)), so forward those instead of a bare -DGGML_CUDA=ON.
-ifeq ($(BUILD_TYPE),cublas)
-	CMAKE_ARGS+=-DCED_GGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON
-else ifeq ($(BUILD_TYPE),openblas)
-	CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
-else ifeq ($(BUILD_TYPE),hipblas)
-	CMAKE_ARGS+=-DCED_GGML_HIP=ON
-else ifeq ($(BUILD_TYPE),vulkan)
-	CMAKE_ARGS+=-DCED_GGML_VULKAN=ON
-endif
-
-.PHONY: ced-grpc package build clean purge test all
-
-all: ced-grpc
-
-sources/ced.cpp:
-	mkdir -p sources/ced.cpp
-	cd sources/ced.cpp && \
-	git init -q && \
-	git remote add origin $(CED_REPO) && \
-	git fetch --depth 1 origin $(CED_VERSION) && \
-	git checkout FETCH_HEAD && \
-	git submodule update --init --recursive --depth 1 --single-branch
-
-libced.so: sources/ced.cpp
-	cmake -B sources/ced.cpp/build-shared -S sources/ced.cpp $(CMAKE_ARGS)
-	cmake --build sources/ced.cpp/build-shared --config Release -j$(JOBS)
-	cp -fv sources/ced.cpp/build-shared/libced.so* ./ 2>/dev/null || true
-	cp -fv sources/ced.cpp/build-shared/libced.dylib ./ 2>/dev/null || true
-	cp -fv sources/ced.cpp/include/ced_capi.h ./
-
-ced-grpc: libced.so main.go goced.go
-	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o ced-grpc .
-
-package: ced-grpc
-	bash package.sh
-
-build: package
-
-test:
-	LD_LIBRARY_PATH=$(CURDIR):$$LD_LIBRARY_PATH $(GOCMD) test ./... -count=1
-
-clean: purge
-	rm -rf libced.so* ced_capi.h package ced-grpc
-
-purge:
-	rm -rf sources/ced.cpp
--- a/backend/go/ced/goced.go
+++ b/backend/go/ced/goced.go
@@ -1,130 +0,0 @@
-package main
-
-// Go side of the ced backend: purego bindings over ced_capi.h plus the gRPC
-// SoundDetection implementation.
-//
-// SKETCH: the pb.SoundDetection* types come from backend.proto (regenerate with
-// `make protogen-go`). The C side is single-threaded per ctx, so we guard the
-// engine with engineMu; LocalAI also serializes via base.SingleThread.
-import (
-	"context"
-	"encoding/json"
-	"errors"
-	"fmt"
-	"sort"
-	"sync"
-	"unsafe"
-
-	"github.com/mudler/LocalAI/pkg/grpc/base"
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-)
-
-// purego-bound entry points from libced.so. Names match ced_capi.h exactly.
-var (
-	CppAbiVersion       func() int32
-	CppLoad             func(ggufPath string) uintptr
-	CppFree             func(ctx uintptr)
-	CppLastError        func(ctx uintptr) string
-	CppNumClasses       func(ctx uintptr) int32
-	CppSampleRate       func(ctx uintptr) int32
-	CppClassifyPathJSON func(ctx uintptr, wavPath string, topK int32) uintptr
-	CppClassifyPcmJSON  func(ctx uintptr, pcm []float32, nSamples int32, sampleRate int32, topK int32) uintptr
-	CppFreeString       func(s uintptr)
-)
-
-// cstr copies a malloc'd C string (returned as uintptr) into a Go string and
-// frees the original via ced_capi_free_string. Empty/0 -> "".
-func cstr(p uintptr) string {
-	if p == 0 {
-		return ""
-	}
-	defer CppFreeString(p)
-	var b []byte
-	for i := 0; ; i++ {
-		ch := *(*byte)(unsafe.Pointer(p + uintptr(i))) //nolint:govet // #nosec G103 -- C-owned NUL-terminated string from libced (not Go-GC memory)
-		if ch == 0 {
-			break
-		}
-		b = append(b, ch)
-	}
-	return string(b)
-}
-
-// Ced is the gRPC backend. One loaded CED model per instance.
-type Ced struct {
-	base.Base
-	ctxPtr   uintptr
-	engineMu sync.Mutex
-}
-
-// Load resolves the GGUF and opens the C-API context.
-func (c *Ced) Load(opts *pb.ModelOptions) error {
-	if opts.ModelFile == "" {
-		return errors.New("ced: ModelFile is required")
-	}
-	ctx := CppLoad(opts.ModelFile)
-	if ctx == 0 {
-		return fmt.Errorf("ced: ced_capi_load failed for %q: %s", opts.ModelFile, CppLastError(0))
-	}
-	c.ctxPtr = ctx
-	return nil
-}
-
-// jsonTag mirrors the ced_capi JSON tag objects.
-type jsonTag struct {
-	Index int     `json:"index"`
-	Score float32 `json:"score"`
-	Label string  `json:"label"`
-}
-
-// SoundDetection classifies the clip at req.Src and returns scored AudioSet tags.
-func (c *Ced) SoundDetection(ctx context.Context, req *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) {
-	if c.ctxPtr == 0 {
-		return nil, errors.New("ced: model not loaded")
-	}
-	if req.GetSrc() == "" {
-		return nil, errors.New("ced: SoundDetectionRequest.src (audio path) is required")
-	}
-	topK := req.GetTopK()
-	if topK <= 0 {
-		topK = 10 // sensible default for a tagging response
-	}
-
-	c.engineMu.Lock()
-	out := cstr(CppClassifyPathJSON(c.ctxPtr, req.GetSrc(), topK))
-	lastErr := CppLastError(c.ctxPtr)
-	c.engineMu.Unlock()
-
-	if out == "" {
-		return nil, fmt.Errorf("ced: classification failed: %s", lastErr)
-	}
-	var tags []jsonTag
-	if err := json.Unmarshal([]byte(out), &tags); err != nil {
-		return nil, fmt.Errorf("ced: bad classifier JSON: %w", err)
-	}
-
-	thr := req.GetThreshold()
-	resp := &pb.SoundDetectionResponse{}
-	for _, t := range tags {
-		if t.Score < thr {
-			continue
-		}
-		resp.Detections = append(resp.Detections, &pb.SoundClass{
-			Label: t.Label, Score: t.Score, Index: int32(t.Index),
-		})
-	}
-	sort.Slice(resp.Detections, func(i, j int) bool {
-		return resp.Detections[i].Score > resp.Detections[j].Score
-	})
-	return resp, nil
-}
-
-func (c *Ced) Free() error {
-	c.engineMu.Lock()
-	defer c.engineMu.Unlock()
-	if c.ctxPtr != 0 {
-		CppFree(c.ctxPtr)
-		c.ctxPtr = 0
-	}
-	return nil
-}
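The post-processing in the removed `SoundDetection` method (decode the classifier JSON, drop tags below the request threshold, sort by descending score) can be sketched in isolation. This is a minimal illustration, not the backend's code: the `tag` type, the `filterAndRank` helper and the sample labels are assumptions made up for the example, and no C library is involved.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// tag mirrors the index/score/label JSON objects described in the removed
// goced.go. The type name is local to this sketch.
type tag struct {
	Index int     `json:"index"`
	Score float32 `json:"score"`
	Label string  `json:"label"`
}

// filterAndRank is a hypothetical helper: it decodes the classifier JSON,
// drops tags scoring below threshold, and sorts the rest by descending score.
func filterAndRank(raw string, threshold float32) ([]tag, error) {
	var tags []tag
	if err := json.Unmarshal([]byte(raw), &tags); err != nil {
		return nil, fmt.Errorf("bad classifier JSON: %w", err)
	}
	kept := make([]tag, 0, len(tags))
	for _, t := range tags {
		if t.Score < threshold {
			continue // below the requested threshold
		}
		kept = append(kept, t)
	}
	sort.Slice(kept, func(i, j int) bool { return kept[i].Score > kept[j].Score })
	return kept, nil
}

func main() {
	// Sample payload invented for illustration only.
	raw := `[{"index":0,"score":0.25,"label":"a"},{"index":1,"score":0.75,"label":"b"},{"index":2,"score":0.5,"label":"c"}]`
	kept, err := filterAndRank(raw, 0.5)
	if err != nil {
		panic(err)
	}
	for _, t := range kept {
		fmt.Printf("%s %.2f\n", t.Label, t.Score)
	}
}
```

A tag whose score equals the threshold is kept, matching the `t.Score < thr` comparison in the removed code.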
--- a/backend/go/ced/main.go
+++ b/backend/go/ced/main.go
@@ -1,64 +0,0 @@
-package main
-
-// ced sound-classification backend. Started internally by LocalAI: one gRPC
-// server per loaded model. Loads libced.so via purego and registers the flat
-// C-API declared in ced_capi.h. The library name can be overridden with
-// CED_LIBRARY (mirrors PARAKEET_LIBRARY / WHISPER_LIBRARY); the default looks
-// for the .so next to this binary.
-//
-// SKETCH: requires `make protogen-go` after the backend.proto SoundDetection
-// addition, and a built libced.so (see Makefile). See DESIGN.md.
-import (
-	"flag"
-	"fmt"
-	"os"
-	"runtime"
-
-	"github.com/ebitengine/purego"
-	grpc "github.com/mudler/LocalAI/pkg/grpc"
-)
-
-var addr = flag.String("addr", "localhost:50051", "the address to connect to")
-
-type libFunc struct {
-	ptr  any
-	name string
-}
-
-func main() {
-	libName := os.Getenv("CED_LIBRARY")
-	if libName == "" {
-		if runtime.GOOS == "darwin" {
-			libName = "libced.dylib"
-		} else {
-			libName = "libced.so"
-		}
-	}
-	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
-	if err != nil {
-		panic(fmt.Errorf("ced: dlopen %q: %w", libName, err))
-	}
-
-	// Bound 1:1 to ced_capi.h. char*-returning functions are declared uintptr
-	// so we can free the same pointer with ced_capi_free_string after copying
-	// (purego's string return would copy and leak the original).
-	for _, lf := range []libFunc{
-		{&CppAbiVersion, "ced_capi_abi_version"},
-		{&CppLoad, "ced_capi_load"},
-		{&CppFree, "ced_capi_free"},
-		{&CppLastError, "ced_capi_last_error"},
-		{&CppNumClasses, "ced_capi_num_classes"},
-		{&CppSampleRate, "ced_capi_sample_rate"},
-		{&CppClassifyPathJSON, "ced_capi_classify_path_json"},
-		{&CppClassifyPcmJSON, "ced_capi_classify_pcm_json"},
-		{&CppFreeString, "ced_capi_free_string"},
-	} {
-		purego.RegisterLibFunc(lf.ptr, lib, lf.name)
-	}
-
-	fmt.Fprintf(os.Stderr, "[ced] ABI=%d\n", CppAbiVersion())
-	flag.Parse()
-	if err := grpc.StartServer(*addr, &Ced{}); err != nil {
-		panic(err)
-	}
-}
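The library lookup described in the removed `main.go` (an explicit `CED_LIBRARY` override wins, otherwise a per-OS default file name is used) might be factored into a small pure function so it can be tested without calling `dlopen`. The helper name `resolveLibName` is hypothetical and does not appear in the diff.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// resolveLibName is a hypothetical helper: a non-empty override is returned
// as-is, otherwise the default shared-library name for the given GOOS.
func resolveLibName(override, goos string) string {
	if override != "" {
		return override
	}
	if goos == "darwin" {
		return "libced.dylib"
	}
	return "libced.so"
}

func main() {
	// Same inputs the removed main.go consulted: the env var and runtime.GOOS.
	fmt.Println(resolveLibName(os.Getenv("CED_LIBRARY"), runtime.GOOS))
}
```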
--- a/backend/go/ced/package.sh
+++ b/backend/go/ced/package.sh
@@ -1,62 +0,0 @@
-#!/bin/bash
-#
-# Bundle the ced-grpc binary, libced.so, the core runtime libs (libc/libstdc++/
-# libgomp + ld.so) and the GPU runtime for the active BUILD_TYPE so the package
-# is self-contained. Mirrors backend/go/parakeet-cpp/package.sh; run.sh routes
-# the (CGO_ENABLED=0) binary through lib/ld.so so the packaged libc is used.
-
-set -e
-
-CURDIR=$(dirname "$(realpath "$0")")
-REPO_ROOT="${CURDIR}/../../.."
-
-mkdir -p "$CURDIR/package/lib"
-
-cp -avf "$CURDIR/ced-grpc" "$CURDIR/package/"
-cp -avf "$CURDIR/run.sh" "$CURDIR/package/"
-
-cp -avf "$CURDIR"/libced.so* "$CURDIR/package/lib/" 2>/dev/null || true
-cp -avf "$CURDIR"/libced.dylib "$CURDIR/package/lib/" 2>/dev/null || true
-if ! ls "$CURDIR"/package/lib/libced.* >/dev/null 2>&1; then
-	echo "ERROR: libced shared library not found in $CURDIR, run 'make' first" >&2
-	exit 1
-fi
-
-if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
-    echo "Detected x86_64 architecture, copying x86_64 libraries..."
-    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
-    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
-    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
-    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
-    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
-    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
-    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
-    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
-    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
-elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
-    echo "Detected ARM64 architecture, copying ARM64 libraries..."
-    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
-    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
-    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
-    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
-    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
-    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
-    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
-    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
-    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
-elif [ "$(uname -s)" = "Darwin" ]; then
-    echo "Detected Darwin"
-else
-    echo "Error: Could not detect architecture"
-    exit 1
-fi
-
-GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
-if [ -f "$GPU_LIB_SCRIPT" ]; then
-    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
-    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
-    package_gpu_libs
-fi
-
-echo "Packaging completed successfully"
-ls -liah "$CURDIR/package/" "$CURDIR/package/lib/"
--- a/backend/go/ced/run.sh
+++ b/backend/go/ced/run.sh
@@ -1,20 +0,0 @@
-#!/bin/bash
-set -e
-
-CURDIR=$(dirname "$(realpath "$0")")
-
-if [ "$(uname)" = "Darwin" ]; then
-	export DYLD_LIBRARY_PATH="$CURDIR/lib:"$CURDIR":${DYLD_LIBRARY_PATH:-}"
-	export CED_LIBRARY="$CURDIR/lib/libced.dylib"
-else
-	export LD_LIBRARY_PATH="$CURDIR/lib:"$CURDIR":${LD_LIBRARY_PATH:-}"
-fi
-
-# If a self-contained ld.so was packaged, route through it so the packaged
-# libc / libstdc++ are used instead of the host's (matches the sibling backends).
-if [ -f "$CURDIR/lib/ld.so" ]; then
-	echo "Using lib/ld.so"
-	exec "$CURDIR/lib/ld.so" "$CURDIR/ced-grpc" "$@"
-fi
-
-exec "$CURDIR/ced-grpc" "$@"
--- a/backend/go/cloud-proxy/Makefile
+++ b/backend/go/cloud-proxy/Makefile
@@ -1,12 +0,0 @@
-GOCMD=go
-
-cloud-proxy:
-	CGO_ENABLED=0 $(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o cloud-proxy ./
-
-package:
-	bash package.sh
-
-build: cloud-proxy package
-
-clean:
-	rm -f cloud-proxy
--- a/backend/go/cloud-proxy/cloud_proxy_suite_test.go
+++ b/backend/go/cloud-proxy/cloud_proxy_suite_test.go
@@ -1,16 +0,0 @@
-package main
-
-import (
-	"testing"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-// Ginkgo bootstrap. The other Test* functions in this package use
-// raw testing.T and run independently; they coexist with Ginkgo
-// specs registered via Describe / Context.
-func TestCloudProxySpecs(t *testing.T) {
-	RegisterFailHandler(Fail)
-	RunSpecs(t, "cloud-proxy specs")
-}
--- a/backend/go/cloud-proxy/main.go
+++ b/backend/go/cloud-proxy/main.go
@@ -1,39 +0,0 @@
-package main
-
-// cloud-proxy is a LocalAI backend that forwards request traffic to an
-// external HTTP provider (OpenAI, Anthropic, etc.). Two modes:
-//
-//   - passthrough: serves the Forward RPC; the client wire format is
-//     preserved end-to-end, no translation.
-//   - translate: serves Predict/PredictStream; the backend converts
-//     internal proto to the provider's wire format. (Phases 5–6.)
-//
-// LoadModel reads UpstreamURL/Mode/Provider/key references from
-// ProxyOptions and resolves the API key once at load time.
-
-import (
-	"flag"
-	"os"
-
-	grpc "github.com/mudler/LocalAI/pkg/grpc"
-	"github.com/mudler/xlog"
-	"golang.org/x/term"
-)
-
-var addr = flag.String("addr", "localhost:50051", "the address to listen on")
-
-func main() {
-	// xlog's default handler emits ANSI color codes; that's fine for an
-	// interactive shell but unreadable when the backend's stdout is
-	// captured by LocalAI and tee'd to a log file. Force plain text when
-	// LOCALAI_LOG_FORMAT is unset and stdout isn't a terminal.
-	format := os.Getenv("LOCALAI_LOG_FORMAT")
-	if format == "" && !term.IsTerminal(int(os.Stdout.Fd())) {
-		format = xlog.TextFormat
-	}
-	xlog.SetLogger(xlog.NewLogger(xlog.LogLevel(os.Getenv("LOCALAI_LOG_LEVEL")), format))
-	flag.Parse()
-	if err := grpc.StartServer(*addr, NewCloudProxy()); err != nil {
-		panic(err)
-	}
-}
--- a/backend/go/cloud-proxy/package.sh
+++ b/backend/go/cloud-proxy/package.sh
@@ -1,13 +0,0 @@
-#!/bin/bash
-
-# Script to copy the cloud-proxy binary into the package dir for the
-# final Dockerfile stage. Mirrors backend/go/local-store/package.sh —
-# no extra runtime libs needed since the backend is pure Go.
-
-set -e
-
-CURDIR=$(dirname "$(realpath $0)")
-
-mkdir -p $CURDIR/package
-cp -avf $CURDIR/cloud-proxy $CURDIR/package/
-cp -rfv $CURDIR/run.sh $CURDIR/package/
--- a/backend/go/cloud-proxy/passthrough_edge_test.go
+++ b/backend/go/cloud-proxy/passthrough_edge_test.go
@@ -1,325 +0,0 @@
-package main
-
-import (
-	"context"
-	"errors"
-	"io"
-	"net/http"
-	"net/http/httptest"
-	"strconv"
-	"sync"
-
-	grpc "github.com/mudler/LocalAI/pkg/grpc"
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-var _ = Describe("composeURL", func() {
-	// Upstream URL convention: gallery configs put the canonical path
-	// in upstream_url, so per-request Path is ignored. A bare-host
-	// upstream_url accepts the per-request path.
-	DescribeTable("path resolution",
-		func(upstream, reqPath, want string) {
-			got, err := composeURL(upstream, reqPath)
-			Expect(err).NotTo(HaveOccurred())
-			Expect(got).To(Equal(want))
-		},
-		Entry("full path wins", "https://api.openai.com/v1/chat/completions", "/v1/something-else", "https://api.openai.com/v1/chat/completions"),
-		Entry("bare host accepts path", "https://api.openai.com", "/v1/chat/completions", "https://api.openai.com/v1/chat/completions"),
-		Entry("root slash treated as bare", "https://api.openai.com/", "/v1/chat/completions", "https://api.openai.com/v1/chat/completions"),
-		Entry("bare host + empty path", "https://api.openai.com", "", "https://api.openai.com"),
-	)
-
-	It("returns an error on invalid upstream URL", func() {
-		_, err := composeURL("://garbage", "")
-		Expect(err).To(HaveOccurred())
-	})
-})
-
-var _ = Describe("applyAuthHeader", func() {
-	It("sets x-api-key and anthropic-version for Anthropic, no Authorization", func() {
-		req, _ := http.NewRequest("POST", "https://example.com", nil)
-		applyAuthHeader(req, providerAnthropic, "ant-key")
-		Expect(req.Header.Get("x-api-key")).To(Equal("ant-key"))
-		Expect(req.Header.Get("anthropic-version")).NotTo(BeEmpty())
-		Expect(req.Header.Get("Authorization")).To(BeEmpty(), "Authorization must not leak on Anthropic backend")
-	})
-
-	It("sets Bearer Authorization for OpenAI, no x-api-key", func() {
-		req, _ := http.NewRequest("POST", "https://example.com", nil)
-		applyAuthHeader(req, providerOpenAI, "sk-key")
-		Expect(req.Header.Get("Authorization")).To(Equal("Bearer sk-key"))
-		Expect(req.Header.Get("x-api-key")).To(BeEmpty(), "x-api-key must not leak on OpenAI backend")
-	})
-
-	It("defaults to Bearer when provider is empty", func() {
-		// Passthrough mode often has provider == "" because the operator
-		// doesn't claim a specific upstream wire format. Most providers
-		// (including OpenAI-compatible ones) accept Bearer, so default to it.
-		req, _ := http.NewRequest("POST", "https://example.com", nil)
-		applyAuthHeader(req, "", "some-key")
-		Expect(req.Header.Get("Authorization")).To(Equal("Bearer some-key"))
-	})
-
-	It("preserves an existing anthropic-version header", func() {
-		// If the client supplied anthropic-version (rare but legitimate
-		// for an upstream pinned to a specific date), the proxy must not
-		// clobber it.
-		req, _ := http.NewRequest("POST", "https://example.com", nil)
-		req.Header.Set("anthropic-version", "2024-10-01")
-		applyAuthHeader(req, providerAnthropic, "k")
-		Expect(req.Header.Get("anthropic-version")).To(Equal("2024-10-01"))
-	})
-})
-
-var _ = Describe("isHopByHopHeader", func() {
-	DescribeTable("hop-by-hop classification",
-		func(header string, want bool) {
-			Expect(isHopByHopHeader(header)).To(Equal(want))
-		},
-		Entry("Connection is hop-by-hop", "Connection", true),
-		Entry("Keep-Alive is hop-by-hop", "Keep-Alive", true),
-		Entry("Proxy-Connection is hop-by-hop", "Proxy-Connection", true),
-		Entry("Transfer-Encoding is hop-by-hop", "Transfer-Encoding", true),
-		Entry("TE is hop-by-hop", "TE", true),
-		Entry("Trailer is hop-by-hop", "Trailer", true),
-		Entry("Upgrade is hop-by-hop", "Upgrade", true),
-		Entry("Host is hop-by-hop", "Host", true),
-		Entry("Content-Length is hop-by-hop", "Content-Length", true),
-		// Case-insensitive — RFC 7230 doesn't constrain header case.
-		Entry("lowercase connection is hop-by-hop", "connection", true),
-		Entry("uppercase HOST is hop-by-hop", "HOST", true),
-		// Non hop-by-hop — must NOT be stripped.
-		Entry("Authorization is end-to-end", "Authorization", false),
-		Entry("Content-Type is end-to-end", "Content-Type", false),
-		Entry("Accept is end-to-end", "Accept", false),
-		Entry("X-Custom is end-to-end", "X-Custom", false),
-	)
-})
-
-var _ = Describe("Forward", func() {
-	It("strips hop-by-hop and Connection headers before upstream, preserves custom headers", func() {
-		gotConnection := make(chan string, 1)
-		gotXCustom := make(chan string, 1)
-		gotHost := make(chan string, 1)
-		upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-			gotConnection <- r.Header.Get("Connection")
-			gotXCustom <- r.Header.Get("X-Custom")
-			gotHost <- r.Header.Get("Host")
-			w.WriteHeader(http.StatusOK)
-		}))
-		defer upstream.Close()
-
-		cp := NewCloudProxy()
-		Expect(cp.Load(&pb.ModelOptions{
-			Proxy: &pb.ProxyOptions{
-				UpstreamUrl: upstream.URL,
-				Mode:        modePassthrough,
-			},
-		})).To(Succeed())
-
-		addr := "test://forward-hopbyhop"
-		grpc.Provide(addr, cp)
-		c := grpc.NewClient(addr, true, nil, false)
-		stream, err := c.Forward(context.Background())
-		Expect(err).NotTo(HaveOccurred())
-		Expect(stream.Send(&pb.ForwardRequest{
-			Path:   "/v1/chat/completions",
-			Method: "POST",
-			Headers: []*pb.ForwardHeader{
-				{Name: "Connection", Value: "keep-alive"},
-				{Name: "Host", Value: "spoofed.example.com"},
-				{Name: "X-Custom", Value: "preserved"},
-			},
-		})).To(Succeed())
-		Expect(stream.CloseSend()).To(Succeed())
-		_, _ = stream.Recv()
-		for {
-			if _, err := stream.Recv(); errors.Is(err, io.EOF) || err != nil {
-				break
-			}
-		}
-
-		Expect(<-gotConnection).To(BeEmpty(), "Connection must not leak to upstream")
-		Expect(<-gotHost).NotTo(Equal("spoofed.example.com"), "Host header must not be spoofed through")
-		Expect(<-gotXCustom).To(Equal("preserved"), "X-Custom header must survive")
-	})
-
-	It("replaces caller-supplied Authorization with the configured key", func() {
-		// The proxy must overwrite a client-supplied Authorization header
-		// so a downstream caller can't smuggle stale or wrong credentials.
-		gotAuth := make(chan string, 1)
-		upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-			gotAuth <- r.Header.Get("Authorization")
-			w.WriteHeader(http.StatusOK)
-		}))
-		defer upstream.Close()
-
-		GinkgoT().Setenv("CLOUD_PROXY_AUTH_REPLACE_KEY", "sk-real")
-
-		cp := NewCloudProxy()
-		Expect(cp.Load(&pb.ModelOptions{
-			Proxy: &pb.ProxyOptions{
-				UpstreamUrl: upstream.URL,
-				Mode:        modePassthrough,
-				ApiKeyEnv:   "CLOUD_PROXY_AUTH_REPLACE_KEY",
-			},
-		})).To(Succeed())
-
-		addr := "test://forward-replaces-auth"
-		grpc.Provide(addr, cp)
-		c := grpc.NewClient(addr, true, nil, false)
-		stream, err := c.Forward(context.Background())
-		Expect(err).NotTo(HaveOccurred())
-		Expect(stream.Send(&pb.ForwardRequest{
-			Path:   "/v1/chat/completions",
-			Method: "POST",
-			Headers: []*pb.ForwardHeader{
-				// Client-supplied Authorization with the wrong scheme / key.
-				{Name: "Authorization", Value: "Basic Zm9vOmJhcg=="},
-			},
-		})).To(Succeed())
-		Expect(stream.CloseSend()).To(Succeed())
-		_, _ = stream.Recv()
-		for {
-			if _, err := stream.Recv(); errors.Is(err, io.EOF) || err != nil {
-				break
-			}
-		}
-
-		Expect(<-gotAuth).To(Equal("Bearer sk-real"), "caller-supplied Basic header must be replaced")
-	})
-
-	It("refuses to follow upstream redirects and never leaks the key to the redirect target", func() {
-		// A 3xx from the configured upstream means misconfiguration or a
-		// hijacked/spoofed host. Following it would replay the request —
-		// and the injected API key — to the Location host. Anthropic's
-		// x-api-key is NOT stripped by Go on cross-host redirects, so this
-		// would be a credential leak. The proxy must refuse the redirect.
-		sinkHit := make(chan string, 1)
-		sink := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-			sinkHit <- r.Header.Get("x-api-key")
-			w.WriteHeader(http.StatusOK)
-		}))
-		defer sink.Close()
-
-		redirector := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-			http.Redirect(w, r, sink.URL, http.StatusFound)
-		}))
-		defer redirector.Close()
-
-		GinkgoT().Setenv("CLOUD_PROXY_REDIRECT_KEY", "ant-secret")
-
-		cp := NewCloudProxy()
-		Expect(cp.Load(&pb.ModelOptions{
-			Proxy: &pb.ProxyOptions{
-				UpstreamUrl: redirector.URL,
-				Mode:        modePassthrough,
-				Provider:    providerAnthropic,
-				ApiKeyEnv:   "CLOUD_PROXY_REDIRECT_KEY",
-			},
-		})).To(Succeed())
-
-		addr := "test://forward-no-redirect"
-		grpc.Provide(addr, cp)
-		c := grpc.NewClient(addr, true, nil, false)
-		stream, err := c.Forward(context.Background())
-		Expect(err).NotTo(HaveOccurred())
-		Expect(stream.Send(&pb.ForwardRequest{
-			Path:   "/v1/messages",
-			Method: "POST",
-		})).To(Succeed())
-		Expect(stream.CloseSend()).To(Succeed())
-
-		// Drain the stream; a refused redirect surfaces as a non-EOF error.
-		var streamErr error
-		for {
-			if _, err := stream.Recv(); err != nil {
-				if !errors.Is(err, io.EOF) {
-					streamErr = err
-				}
-				break
-			}
-		}
-		Expect(streamErr).To(HaveOccurred(), "refused redirect must surface as an error")
-		Expect(sinkHit).NotTo(Receive(), "the redirect target must never be contacted")
-	})
-
-	It("handles concurrent calls without interference", func() {
-		// CloudProxy explicitly omits base.SingleThread — independent
-		// Forward streams must not block each other or leak state.
-		upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-			body, _ := io.ReadAll(r.Body)
-			w.WriteHeader(http.StatusOK)
-			_, _ = w.Write(body)
-		}))
-		defer upstream.Close()
-
-		cp := NewCloudProxy()
-		Expect(cp.Load(&pb.ModelOptions{
-			Proxy: &pb.ProxyOptions{
-				UpstreamUrl: upstream.URL,
-				Mode:        modePassthrough,
-			},
-		})).To(Succeed())
-		addr := "test://forward-concurrent"
-		grpc.Provide(addr, cp)
-		c := grpc.NewClient(addr, true, nil, false)
-
-		const N = 8
-		var wg sync.WaitGroup
-		errs := make(chan error, N)
-		for i := 0; i < N; i++ {
-			wg.Add(1)
-			go func(idx int) {
-				defer wg.Done()
-				stream, err := c.Forward(context.Background())
-				if err != nil {
-					errs <- err
-					return
-				}
-				payload := "request-" + string(rune('A'+idx))
-				if err := stream.Send(&pb.ForwardRequest{
-					Path:      "/v1/chat/completions",
-					Method:    "POST",
-					BodyChunk: []byte(payload),
-				}); err != nil {
-					errs <- err
-					return
-				}
-				_ = stream.CloseSend()
-				_, _ = stream.Recv()
-				var body []byte
-				for {
-					r, err := stream.Recv()
-					if errors.Is(err, io.EOF) {
-						break
-					}
-					if err != nil {
-						errs <- err
-						return
-					}
-					body = append(body, r.GetBodyChunk()...)
-				}
-				if string(body) != payload {
-					errs <- &echoMismatch{want: payload, got: string(body)}
-				}
-			}(i)
-		}
-		wg.Wait()
-		close(errs)
-		var collected []error
-		for err := range errs {
-			collected = append(collected, err)
-		}
-		Expect(collected).To(BeEmpty(), "no concurrent Forward call should fail")
-	})
-})
-
-type echoMismatch struct{ want, got string }
-
-func (e *echoMismatch) Error() string {
-	return "echo mismatch: want " + strconv.Quote(e.want) + " got " + strconv.Quote(e.got)
-}
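The `composeURL` table in the removed test file pins down a path-resolution rule: an upstream URL that already carries a path ignores the per-request path, while a bare host (or a lone `/`) accepts it. The implementation itself is not part of this excerpt, so the following is only one possible sketch that satisfies those table entries. The name `composeURLSketch` is an assumption, and the real function may differ.

```go
package main

import (
	"fmt"
	"net/url"
)

// composeURLSketch is a hypothetical stand-in for composeURL. It keeps the
// upstream path when one is configured and falls back to the per-request
// path only when the upstream is a bare host or ends in a lone "/".
func composeURLSketch(upstream, reqPath string) (string, error) {
	u, err := url.Parse(upstream)
	if err != nil {
		return "", fmt.Errorf("invalid upstream URL: %w", err)
	}
	if u.Path == "" || u.Path == "/" {
		u.Path = reqPath
	}
	return u.String(), nil
}

func main() {
	for _, c := range [][2]string{
		{"https://api.openai.com/v1/chat/completions", "/v1/something-else"},
		{"https://api.openai.com", "/v1/chat/completions"},
		{"https://api.openai.com/", "/v1/chat/completions"},
		{"https://api.openai.com", ""},
	} {
		got, err := composeURLSketch(c[0], c[1])
		fmt.Println(got, err)
	}
}
```

Parsing with `net/url` rather than concatenating strings means a malformed upstream such as `://garbage` is rejected up front, which is the behaviour the removed error-case spec expects.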
--- a/backend/go/cloud-proxy/provider_anthropic.go
+++ b/backend/go/cloud-proxy/provider_anthropic.go
@@ -1,508 +0,0 @@
-package main
-
-import (
-	"bufio"
-	"bytes"
-	"context"
-	"encoding/json"
-	"fmt"
-	"io"
-	"net/http"
-	"strings"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	"github.com/mudler/xlog"
-)
-
-// Anthropic Messages API wire-format types. Narrowed to what translate
-// mode preserves through the Reply proto: text + tool_use blocks +
-// usage tokens. Image blocks, prompt caching, metadata, and stop
-// sequence metadata are not modelled — passthrough mode covers those.
-//
-// Notable differences from OpenAI:
-//   - max_tokens is REQUIRED. Anthropic 400s without it.
-//   - Roles are user/assistant only — system messages move to a
-//     top-level `system` string field.
-//   - Streaming SSE uses event: lines alongside data: lines. The
-//     events we care about: content_block_start (carries tool_use
-//     init: id + name), content_block_delta (text_delta with text;
-//     input_json_delta with partial_json for tool arguments), and
-//     message_stop (terminates the stream). Others are ignored.
-
-type anthropicRequest struct {
-	Model         string               `json:"model"`
-	MaxTokens     int32                `json:"max_tokens"`
-	System        string               `json:"system,omitempty"`
-	Messages      []anthropicMessage   `json:"messages"`
-	Stream        bool                 `json:"stream,omitempty"`
-	Temperature   *float64             `json:"temperature,omitempty"`
-	TopP          *float64             `json:"top_p,omitempty"`
-	StopSequences []string             `json:"stop_sequences,omitempty"`
-	Tools         []anthropicTool      `json:"tools,omitempty"`
-	ToolChoice    *anthropicToolChoice `json:"tool_choice,omitempty"`
-}
-
-// Content is `any` because Anthropic accepts a bare string OR a
-// list of content blocks. Use the string form for plain user/
-// assistant turns; switch to []anthropicContentBlock when the
-// turn needs tool_use (assistant) or tool_result (user) blocks.
-type anthropicMessage struct {
-	Role    string `json:"role"`
-	Content any    `json:"content"`
-}
-
-type anthropicTool struct {
-	Name        string          `json:"name"`
-	Description string          `json:"description,omitempty"`
-	InputSchema json.RawMessage `json:"input_schema"`
-}
-
-// anthropicToolChoice mirrors the four shapes Anthropic accepts:
-// {"type":"auto"} | {"type":"any"} | {"type":"tool","name":"X"} |
-// {"type":"none"} (newer models). OpenAI's "auto"/"none"/
-// "required"/{"function":{"name":"X"}} all map here.
-type anthropicToolChoice struct {
-	Type string `json:"type"`
-	Name string `json:"name,omitempty"`
-}
-
-// anthropicContentBlock is the union shape used both for response
-// blocks (text/tool_use we read off the wire) and outbound request
-// blocks (tool_use/tool_result we emit in the conversation history).
-// Anthropic encodes tool calls inline rather than as a separate field,
-// so we walk Content[] looking for type=="tool_use" on responses and
-// produce equivalent blocks when serialising prior-turn tool calls.
-type anthropicContentBlock struct {
-	Type  string          `json:"type"`
-	Text  string          `json:"text,omitempty"`
-	ID    string          `json:"id,omitempty"`
-	Name  string          `json:"name,omitempty"`
-	Input json.RawMessage `json:"input,omitempty"`
-	// Tool-result block fields. tool_result uses `content` (not
-	// `text`) and pairs with `tool_use_id`; modelling them as
-	// distinct fields avoids ambiguity at marshal time.
-	ToolUseID     string `json:"tool_use_id,omitempty"`
-	ResultContent string `json:"content,omitempty"`
-}
-
-type anthropicResponse struct {
-	ID      string                  `json:"id"`
-	Type    string                  `json:"type"`
-	Role    string                  `json:"role"`
-	Content []anthropicContentBlock `json:"content"`
-	Model   string                  `json:"model"`
-	Usage   *anthropicUsage         `json:"usage,omitempty"`
-}
-
-type anthropicUsage struct {
-	InputTokens  int `json:"input_tokens"`
-	OutputTokens int `json:"output_tokens"`
-}
-
-// anthropicStreamEvent is the union shape used for every event type we
-// process. Type discriminates; only the matching fields are populated.
-// content_block_start carries ContentBlock (with id/name for tool_use);
-// content_block_delta carries Delta (text or partial_json).
-type anthropicStreamEvent struct {
-	Type         string                 `json:"type"`
-	Index        int                    `json:"index,omitempty"`
-	ContentBlock *anthropicContentBlock `json:"content_block,omitempty"`
-	Delta        *anthropicStreamDelta  `json:"delta,omitempty"`
-	Message      *anthropicResponse     `json:"message,omitempty"`
-	Usage        *anthropicUsage        `json:"usage,omitempty"`
-}
-
-type anthropicStreamDelta struct {
-	Type        string `json:"type,omitempty"`
-	Text        string `json:"text,omitempty"`
-	PartialJSON string `json:"partial_json,omitempty"`
-}
-
-// Anthropic requires max_tokens. If the caller didn't set it, use a
-// generous-but-bounded default so the request doesn't 400.
-const anthropicDefaultMaxTokens int32 = 4096
-
-const anthropicToolChoiceNone = "none"
-
-// Reused JSON-Schema defaults for malformed inputs. Anthropic requires
-// input_schema to be a JSON object and tool_use.input to be a JSON
-// object; clients that omit them must not 400 the entire request.
-var (
-	emptyJSONObject   = json.RawMessage(`{}`)
-	emptyObjectSchema = json.RawMessage(`{"type":"object","properties":{}}`)
-)
-
-func buildAnthropicRequest(opts *pb.PredictOptions, cfg *proxyConfig, stream bool) ([]byte, error) {
-	req := anthropicRequest{
-		Model:         modelName(cfg, opts),
-		MaxTokens:     opts.GetTokens(),
-		Stream:        stream,
-		StopSequences: opts.GetStopPrompts(),
-	}
-	if req.MaxTokens <= 0 {
-		req.MaxTokens = anthropicDefaultMaxTokens
-	}
-	// Newer Anthropic models 400 when both temperature and top_p are
-	// set ("`temperature` and `top_p` cannot both be specified for
-	// this model. Please use only one.") even though their docs only
-	// "recommend" picking one. The OpenAI-compatible chat UI almost
-	// always sends both with default values, so prefer temperature
-	// and drop top_p when both are present.
-	if t := opts.GetTemperature(); t != 0 {
-		v := float64(t)
-		req.Temperature = &v
-	} else if t := opts.GetTopP(); t != 0 {
-		v := float64(t)
-		req.TopP = &v
-	}
-
-	req.Tools = convertOpenAITools(opts.GetTools())
-	req.ToolChoice = convertOpenAIToolChoice(opts.GetToolChoice())
-	// Anthropic rejects tool_choice without tools and older models
-	// don't accept {"type":"none"} — collapse to a no-tools request.
-	if req.ToolChoice != nil && req.ToolChoice.Type == anthropicToolChoiceNone {
-		req.Tools, req.ToolChoice = nil, nil
-	}
-
-	var systemParts []string
-	for _, m := range opts.GetMessages() {
-		role := m.GetRole()
-		if role == "system" {
-			if c := m.GetContent(); c != "" {
-				systemParts = append(systemParts, c)
-			}
-			continue
-		}
-		switch role {
-		case "user":
-			req.Messages = append(req.Messages, anthropicMessage{
-				Role:    "user",
-				Content: m.GetContent(),
-			})
-		case "assistant":
-			if blocks := assistantBlocks(m); blocks != nil {
-				req.Messages = append(req.Messages, anthropicMessage{Role: "assistant", Content: blocks})
-				continue
-			}
-			req.Messages = append(req.Messages, anthropicMessage{
-				Role:    "assistant",
-				Content: m.GetContent(),
-			})
-		case "tool", "function":
-			req.Messages = appendToolResult(req.Messages, anthropicContentBlock{
-				Type:          "tool_result",
-				ToolUseID:     m.GetToolCallId(),
-				ResultContent: m.GetContent(),
-			})
-		}
-	}
-	req.System = strings.Join(systemParts, "\n\n")
-
-	if len(req.Messages) == 0 && opts.GetPrompt() != "" {
-		req.Messages = []anthropicMessage{{Role: "user", Content: opts.GetPrompt()}}
-	}
-
-	return json.Marshal(req)
-}
-
-// appendToolResult appends a tool_result block as a user message,
-// merging into a preceding user message that already carries blocks.
-// Anthropic concatenates consecutive same-role messages on its end,
-// but explicit merging keeps the body smaller and the conversation
-// strictly alternating — which some upstream filters require.
-func appendToolResult(msgs []anthropicMessage, block anthropicContentBlock) []anthropicMessage {
-	if n := len(msgs); n > 0 && msgs[n-1].Role == "user" {
-		if existing, ok := msgs[n-1].Content.([]anthropicContentBlock); ok {
-			msgs[n-1].Content = append(existing, block)
-			return msgs
-		}
-	}
-	return append(msgs, anthropicMessage{
-		Role:    "user",
-		Content: []anthropicContentBlock{block},
-	})
-}
-
-func convertOpenAITools(toolsJSON string) []anthropicTool {
-	if toolsJSON == "" {
-		return nil
-	}
-	var raw []openAITool
-	if err := json.Unmarshal([]byte(toolsJSON), &raw); err != nil {
-		xlog.Warn("cloud-proxy: anthropic translate: unparseable tools JSON, dropping", "error", err)
-		return nil
-	}
-	tools := make([]anthropicTool, 0, len(raw))
-	for _, t := range raw {
-		if t.Function.Name == "" {
-			continue
-		}
-		schema := t.Function.Parameters
-		if len(schema) == 0 {
-			schema = emptyObjectSchema
-		}
-		tools = append(tools, anthropicTool{
-			Name:        t.Function.Name,
-			Description: t.Function.Description,
-			InputSchema: schema,
-		})
-	}
-	return tools
-}
-
-// convertOpenAIToolChoice accepts the string forms (auto, none,
-// required), the spec object form ({type:function, function:{name:X}})
-// and the flat legacy form ({type:function, name:X}) some clients send.
-// Unknown object shapes are logged and dropped, not treated as auto.
-func convertOpenAIToolChoice(toolChoiceJSON string) *anthropicToolChoice {
-	if toolChoiceJSON == "" {
-		return nil
-	}
-	var asString string
-	if err := json.Unmarshal([]byte(toolChoiceJSON), &asString); err == nil {
-		switch asString {
-		case "auto":
-			return &anthropicToolChoice{Type: "auto"}
-		case "none":
-			return &anthropicToolChoice{Type: anthropicToolChoiceNone}
-		case "required":
-			return &anthropicToolChoice{Type: "any"}
-		}
-		return nil
-	}
-	var asObj struct {
-		Type     string `json:"type"`
-		Name     string `json:"name"`
-		Function struct {
-			Name string `json:"name"`
-		} `json:"function"`
-	}
-	if err := json.Unmarshal([]byte(toolChoiceJSON), &asObj); err != nil {
-		xlog.Warn("cloud-proxy: anthropic translate: unparseable tool_choice, dropping", "error", err)
-		return nil
-	}
-	if name := asObj.Function.Name; name != "" {
-		return &anthropicToolChoice{Type: "tool", Name: name}
-	}
-	if asObj.Name != "" {
-		return &anthropicToolChoice{Type: "tool", Name: asObj.Name}
-	}
-	xlog.Warn("cloud-proxy: anthropic translate: unrecognised tool_choice shape, dropping", "shape", toolChoiceJSON)
-	return nil
-}
-
-// openAITool mirrors pkg/functions.Tool but keeps Parameters as
-// json.RawMessage so the input_schema passes through verbatim — no
-// re-marshal cost, no fidelity loss on exotic schemas.
-type openAITool struct {
-	Type     string `json:"type"`
-	Function struct {
-		Name        string          `json:"name"`
-		Description string          `json:"description"`
-		Parameters  json.RawMessage `json:"parameters"`
-	} `json:"function"`
-}
-
-func assistantBlocks(m *pb.Message) []anthropicContentBlock {
-	toolCallsJSON := m.GetToolCalls()
-	if toolCallsJSON == "" {
-		return nil
-	}
-	var toolCalls []openAIToolCall
-	if err := json.Unmarshal([]byte(toolCallsJSON), &toolCalls); err != nil || len(toolCalls) == 0 {
-		return nil
-	}
-	blocks := make([]anthropicContentBlock, 0, len(toolCalls)+1)
-	if text := m.GetContent(); text != "" {
-		blocks = append(blocks, anthropicContentBlock{Type: "text", Text: text})
-	}
-	for _, tc := range toolCalls {
-		// OpenAI's arguments are a JSON-encoded string; pass it through
-		// as RawMessage. Caveat: json.Marshal validates RawMessage, so a
-		// non-JSON string from a local model fails the marshal downstream.
-		args := json.RawMessage(tc.Function.Arguments)
-		if len(args) == 0 {
-			args = emptyJSONObject
-		}
-		blocks = append(blocks, anthropicContentBlock{
-			Type:  "tool_use",
-			ID:    tc.ID,
-			Name:  tc.Function.Name,
-			Input: args,
-		})
-	}
-	return blocks
-}
-
-// doAnthropicRequest is the Anthropic counterpart of doOpenAIRequest.
-// applyAuthHeader sets x-api-key and anthropic-version when provider
-// is anthropic, so this method doesn't need to duplicate that.
-func (c *CloudProxy) doAnthropicRequest(ctx context.Context, cfg *proxyConfig, body []byte) (*http.Response, error) {
-	req, err := http.NewRequestWithContext(ctx, http.MethodPost, cfg.upstreamURL, bytes.NewReader(body))
-	if err != nil {
-		return nil, fmt.Errorf("cloud-proxy: build request: %w", err)
-	}
-	req.Header.Set("Content-Type", "application/json")
-	req.Header.Set("Accept", "*/*")
-	if cfg.apiKey != "" {
-		applyAuthHeader(req, cfg.provider, cfg.apiKey)
-	}
-	resp, err := c.client.Do(req)
-	if err != nil {
-		return nil, fmt.Errorf("cloud-proxy: upstream request: %w", err)
-	}
-	return resp, nil
-}
-
-// predictAnthropicRich returns the full Reply: joined text from all
-// text blocks, tool_use blocks mapped to ToolCallDelta, and usage
-// tokens.
-func (c *CloudProxy) predictAnthropicRich(ctx context.Context, cfg *proxyConfig, opts *pb.PredictOptions) (*pb.Reply, error) {
-	body, err := buildAnthropicRequest(opts, cfg, false)
-	if err != nil {
-		return nil, fmt.Errorf("cloud-proxy: marshal request: %w", err)
-	}
-	resp, err := c.doAnthropicRequest(ctx, cfg, body)
-	if err != nil {
-		return nil, err
-	}
-	defer func() { _ = resp.Body.Close() }()
-
-	if resp.StatusCode >= 400 {
-		errBody, _ := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
-		return nil, fmt.Errorf("cloud-proxy: upstream %d: %s", resp.StatusCode, string(errBody))
-	}
-
-	var parsed anthropicResponse
-	if err := json.NewDecoder(resp.Body).Decode(&parsed); err != nil {
-		return nil, fmt.Errorf("cloud-proxy: decode response: %w", err)
-	}
-
-	reply := &pb.Reply{}
-	if parsed.Usage != nil {
-		reply.PromptTokens = int32(parsed.Usage.InputTokens)
-		reply.Tokens = int32(parsed.Usage.OutputTokens)
-	}
-
-	var content strings.Builder
-	var toolCalls []*pb.ToolCallDelta
-	toolIdx := 0
-	for _, b := range parsed.Content {
-		switch b.Type {
-		case "text":
-			content.WriteString(b.Text)
-		case "tool_use":
-			// Input is a structured JSON object; we serialise to a
-			// string so it fits the OpenAI-shaped arguments field
-			// downstream consumers expect.
-			args := ""
-			if len(b.Input) > 0 {
-				args = string(b.Input)
-			}
-			toolCalls = append(toolCalls, newToolCallDelta(toolIdx, b.ID, b.Name, args))
-			toolIdx++
-		}
-	}
-	reply.Message = []byte(content.String())
-	if len(toolCalls) > 0 {
-		reply.ChatDeltas = []*pb.ChatDelta{{ToolCalls: toolCalls}}
-	}
-	return reply, nil
-}
-
-// predictAnthropicStreamRich streams Reply chunks from Anthropic's SSE.
-// Four event types matter: content_block_start (initialises tool_use
-// id+name), content_block_delta (carries text_delta or
-// input_json_delta), message_delta (final usage), and message_stop
-// (terminates). The wire block index feeds straight into
-// ToolCallDelta.Index so consumers can reassemble parallel tool calls.
-func (c *CloudProxy) predictAnthropicStreamRich(ctx context.Context, cfg *proxyConfig, opts *pb.PredictOptions, results chan<- *pb.Reply) error {
-	body, err := buildAnthropicRequest(opts, cfg, true)
-	if err != nil {
-		return fmt.Errorf("cloud-proxy: marshal request: %w", err)
-	}
-	resp, err := c.doAnthropicRequest(ctx, cfg, body)
-	if err != nil {
-		return err
-	}
-	defer func() { _ = resp.Body.Close() }()
-
-	if resp.StatusCode >= 400 {
-		errBody, _ := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
-		return fmt.Errorf("cloud-proxy: upstream %d: %s", resp.StatusCode, string(errBody))
-	}
-
-	scanner := bufio.NewScanner(resp.Body)
-	scanner.Buffer(make([]byte, 0, 64*1024), 1<<20)
-	for scanner.Scan() {
-		line := scanner.Text()
-		if !strings.HasPrefix(line, "data:") {
-			continue
-		}
-		payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
-		if payload == "" {
-			continue
-		}
-		var ev anthropicStreamEvent
-		if err := json.Unmarshal([]byte(payload), &ev); err != nil {
-			xlog.Debug("cloud-proxy: skip malformed SSE chunk", "error", err)
-			continue
-		}
-		switch ev.Type {
-		case "content_block_start":
-			// tool_use blocks announce id + name here; arguments arrive
-			// in subsequent input_json_delta events. Emit a Reply with
-			// just the tool_call init fields so consumers can allocate
-			// a slot at this index.
-			if ev.ContentBlock != nil && ev.ContentBlock.Type == "tool_use" {
-				if !sendReply(ctx, results, &pb.Reply{
-					ChatDeltas: []*pb.ChatDelta{{ToolCalls: []*pb.ToolCallDelta{
-						newToolCallDelta(ev.Index, ev.ContentBlock.ID, ev.ContentBlock.Name, ""),
-					}}},
-				}) {
-					return ctx.Err()
-				}
-			}
-		case "content_block_delta":
-			if ev.Delta == nil {
-				continue
-			}
-			switch ev.Delta.Type {
-			case "text_delta":
-				if ev.Delta.Text == "" {
-					continue
-				}
-				if !sendReply(ctx, results, &pb.Reply{
-					Message:    []byte(ev.Delta.Text),
-					ChatDeltas: []*pb.ChatDelta{{Content: ev.Delta.Text}},
-				}) {
-					return ctx.Err()
-				}
-			case "input_json_delta":
-				if ev.Delta.PartialJSON == "" {
-					continue
-				}
-				if !sendReply(ctx, results, &pb.Reply{
-					ChatDeltas: []*pb.ChatDelta{{ToolCalls: []*pb.ToolCallDelta{
-						newToolCallDelta(ev.Index, "", "", ev.Delta.PartialJSON),
-					}}},
-				}) {
-					return ctx.Err()
-				}
-			}
-		case "message_delta":
-			// Anthropic sends final usage in message_delta.usage. Emit
-			// a usage-only Reply so the consumer can record totals.
-			if ev.Usage != nil {
-				if !sendReply(ctx, results, &pb.Reply{
-					Tokens: int32(ev.Usage.OutputTokens),
-				}) {
-					return ctx.Err()
-				}
-			}
-		case "message_stop":
-			return nil
-		}
-	}
-	return scanner.Err()
-}
--- a/backend/go/cloud-proxy/provider_anthropic_test.go
+++ b/backend/go/cloud-proxy/provider_anthropic_test.go
@@ -1,334 +0,0 @@
-package main
-
-import (
-	"encoding/json"
-	"io"
-	"math"
-	"net/http"
-	"net/http/httptest"
-	"strings"
-	"testing"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	. "github.com/onsi/gomega"
-)
-
-// fakeAnthropicUpstream mirrors fakeOpenAIUpstream but decodes the
-// request body as an anthropicRequest so tests can assert on the
-// translated wire shape (system field, max_tokens, etc.).
-func fakeAnthropicUpstream(t *testing.T, handler func(req anthropicRequest) (status int, body string, contentType string)) (*httptest.Server, *anthropicRequest) {
-	t.Helper()
-	var captured anthropicRequest
-	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-		raw, _ := io.ReadAll(r.Body)
-		_ = json.Unmarshal(raw, &captured)
-		status, body, ct := handler(captured)
-		w.Header().Set("Content-Type", ct)
-		w.WriteHeader(status)
-		_, _ = io.WriteString(w, body)
-	}))
-	return srv, &captured
-}
-
-func newAnthropicTranslateCloudProxy(t *testing.T, upstreamURL string) *CloudProxy {
-	t.Helper()
-	g := NewWithT(t)
-	t.Setenv("CLOUD_PROXY_ANTHROPIC_FAKE", "sk-ant-fake")
-	cp := NewCloudProxy()
-	err := cp.Load(&pb.ModelOptions{
-		Model: "claude-local",
-		Proxy: &pb.ProxyOptions{
-			UpstreamUrl:   upstreamURL,
-			Mode:          modeTranslate,
-			Provider:      providerAnthropic,
-			ApiKeyEnv:     "CLOUD_PROXY_ANTHROPIC_FAKE",
-			UpstreamModel: "claude-3-5-sonnet-20241022",
-		},
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	return cp
-}
-
-func TestPredict_Anthropic_BasicMessages(t *testing.T) {
-	g := NewWithT(t)
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"id":"msg_1","type":"message","role":"assistant","content":[{"type":"text","text":"hi there"}],"model":"claude-3-5-sonnet-20241022","usage":{"input_tokens":5,"output_tokens":2}}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	got, err := cp.Predict(&pb.PredictOptions{
-		Messages: []*pb.Message{
-			{Role: "system", Content: "be brief"},
-			{Role: "user", Content: "hello"},
-		},
-		Temperature: 0.5,
-		TopP:        0.9,
-		Tokens:      32,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(got).To(Equal("hi there"))
-
-	g.Expect(captured.Model).To(Equal("claude-3-5-sonnet-20241022"))
-	// System message must be hoisted out of Messages into top-level field.
-	g.Expect(captured.System).To(Equal("be brief"))
-	g.Expect(captured.Messages).To(HaveLen(1))
-	g.Expect(captured.Messages[0].Role).To(Equal("user"))
-	g.Expect(captured.MaxTokens).To(Equal(int32(32)))
-	g.Expect(captured.Temperature).NotTo(BeNil())
-	g.Expect(*captured.Temperature).To(Equal(0.5))
-	// Anthropic 400s when both temperature and top_p are set; the
-	// translator must prefer temperature and drop top_p.
-	g.Expect(captured.TopP).To(BeNil())
-	g.Expect(captured.Stream).To(BeFalse())
-}
-
-// When only top_p is set, it should be forwarded.
-func TestPredict_Anthropic_TopPOnly(t *testing.T) {
-	g := NewWithT(t)
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[{"type":"text","text":"ok"}]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	_, err := cp.Predict(&pb.PredictOptions{
-		Messages: []*pb.Message{{Role: "user", Content: "hello"}},
-		TopP:     0.9,
-		Tokens:   16,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(captured.Temperature).To(BeNil())
-	// PredictOptions.TopP is float32 on the wire; the translator widens
-	// to float64 so 0.9 round-trips as 0.8999999761581421… — compare
-	// with a small tolerance rather than exact equality.
-	g.Expect(captured.TopP).NotTo(BeNil())
-	g.Expect(math.Abs(*captured.TopP - 0.9)).To(BeNumerically("<=", 1e-6))
-}
-
-func TestPredict_Anthropic_DefaultsMaxTokens(t *testing.T) {
-	g := NewWithT(t)
-	// Anthropic 400s without max_tokens. The translator must default
-	// it when the caller doesn't supply Tokens.
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[{"type":"text","text":"ok"}]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	_, err := cp.Predict(&pb.PredictOptions{Messages: []*pb.Message{{Role: "user", Content: "x"}}})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(captured.MaxTokens).To(Equal(anthropicDefaultMaxTokens))
-}
-
-func TestPredict_Anthropic_PromptFallback(t *testing.T) {
-	g := NewWithT(t)
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[{"type":"text","text":"ok"}]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	_, err := cp.Predict(&pb.PredictOptions{Prompt: "what time is it?", Tokens: 16})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(captured.Messages).To(HaveLen(1))
-	g.Expect(captured.Messages[0].Role).To(Equal("user"))
-	g.Expect(captured.Messages[0].Content).To(Equal("what time is it?"))
-}
-
-func TestPredict_Anthropic_ConcatenatesContentBlocks(t *testing.T) {
-	g := NewWithT(t)
-	// Anthropic may return multiple text blocks; the translator joins
-	// them so the Predict() string return is the full assistant message.
-	srv, _ := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[{"type":"text","text":"hello "},{"type":"text","text":"world"}]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	got, err := cp.Predict(&pb.PredictOptions{Messages: []*pb.Message{{Role: "user", Content: "x"}}, Tokens: 16})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(got).To(Equal("hello world"))
-}
-
-func TestPredict_Anthropic_UpstreamError(t *testing.T) {
-	g := NewWithT(t)
-	srv, _ := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 401, `{"error":{"type":"authentication_error","message":"bad key"}}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	_, err := cp.Predict(&pb.PredictOptions{Messages: []*pb.Message{{Role: "user", Content: "x"}}, Tokens: 16})
-	g.Expect(err).To(HaveOccurred())
-	g.Expect(err.Error()).To(ContainSubstring("401"))
-}
-
-func TestPredictStream_Anthropic_StreamsTextDeltas(t *testing.T) {
-	g := NewWithT(t)
-	// Real Anthropic SSE has event: lines + data: lines. The translator
-	// only needs the data: payload; only content_block_delta with
-	// delta.type=text_delta carries text. message_stop ends the stream.
-	frames := []string{
-		"event: message_start\ndata: {\"type\":\"message_start\"}\n\n",
-		"event: content_block_start\ndata: {\"type\":\"content_block_start\",\"index\":0}\n\n",
-		"event: content_block_delta\ndata: {\"type\":\"content_block_delta\",\"index\":0,\"delta\":{\"type\":\"text_delta\",\"text\":\"hello\"}}\n\n",
-		"event: content_block_delta\ndata: {\"type\":\"content_block_delta\",\"index\":0,\"delta\":{\"type\":\"text_delta\",\"text\":\" \"}}\n\n",
-		"event: content_block_delta\ndata: {\"type\":\"content_block_delta\",\"index\":0,\"delta\":{\"type\":\"text_delta\",\"text\":\"world\"}}\n\n",
-		"event: content_block_stop\ndata: {\"type\":\"content_block_stop\",\"index\":0}\n\n",
-		"event: message_stop\ndata: {\"type\":\"message_stop\"}\n\n",
-	}
-	body := strings.Join(frames, "")
-
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, body, "text/event-stream"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	results := make(chan string, 8)
-	done := make(chan error, 1)
-	go func() {
-		done <- cp.PredictStream(&pb.PredictOptions{
-			Messages: []*pb.Message{{Role: "user", Content: "hi"}},
-			Tokens:   16,
-		}, results)
-	}()
-
-	var got []string
-	for s := range results {
-		got = append(got, s)
-	}
-	err := <-done
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(strings.Join(got, "")).To(Equal("hello world"))
-	g.Expect(captured.Stream).To(BeTrue())
-}
-
-func TestBuildAnthropic_TranslatesOpenAITools(t *testing.T) {
-	g := NewWithT(t)
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[{"type":"text","text":"ok"}]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	tools := `[{"type":"function","function":{"name":"get_weather","description":"Get weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}]`
-	_, err := cp.Predict(&pb.PredictOptions{
-		Messages:   []*pb.Message{{Role: "user", Content: "weather in Paris?"}},
-		Tools:      tools,
-		ToolChoice: `"auto"`,
-		Tokens:     32,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(captured.Tools).To(HaveLen(1))
-	g.Expect(captured.Tools[0].Name).To(Equal("get_weather"))
-	g.Expect(captured.Tools[0].Description).To(Equal("Get weather"))
-	// input_schema must be the parameters object verbatim.
-	g.Expect(string(captured.Tools[0].InputSchema)).To(ContainSubstring(`"city"`))
-	g.Expect(captured.ToolChoice).NotTo(BeNil())
-	g.Expect(captured.ToolChoice.Type).To(Equal("auto"))
-}
-
-func TestBuildAnthropic_ToolChoice_RequiredMapsToAny(t *testing.T) {
-	g := NewWithT(t)
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-	_, err := cp.Predict(&pb.PredictOptions{
-		Messages:   []*pb.Message{{Role: "user", Content: "x"}},
-		Tools:      `[{"type":"function","function":{"name":"t","parameters":{"type":"object"}}}]`,
-		ToolChoice: `"required"`,
-		Tokens:     16,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(captured.ToolChoice).NotTo(BeNil())
-	g.Expect(captured.ToolChoice.Type).To(Equal("any"))
-}
-
-func TestBuildAnthropic_ToolChoice_NoneDropsTools(t *testing.T) {
-	g := NewWithT(t)
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-	_, err := cp.Predict(&pb.PredictOptions{
-		Messages:   []*pb.Message{{Role: "user", Content: "x"}},
-		Tools:      `[{"type":"function","function":{"name":"t","parameters":{"type":"object"}}}]`,
-		ToolChoice: `"none"`,
-		Tokens:     16,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(captured.Tools).To(BeNil())
-	g.Expect(captured.ToolChoice).To(BeNil())
-}
-
-func TestBuildAnthropic_ToolChoice_NamedFunction(t *testing.T) {
-	g := NewWithT(t)
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-	_, err := cp.Predict(&pb.PredictOptions{
-		Messages:   []*pb.Message{{Role: "user", Content: "x"}},
-		Tools:      `[{"type":"function","function":{"name":"weather","parameters":{"type":"object"}}}]`,
-		ToolChoice: `{"type":"function","function":{"name":"weather"}}`,
-		Tokens:     16,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(captured.ToolChoice).NotTo(BeNil())
-	g.Expect(captured.ToolChoice.Type).To(Equal("tool"))
-	g.Expect(captured.ToolChoice.Name).To(Equal("weather"))
-}
-
-func TestBuildAnthropic_RoundTripsAssistantToolCalls(t *testing.T) {
-	g := NewWithT(t)
-	// LocalAI Assistant's second turn: the LLM previously emitted a
-	// tool_use, the server executed it, and the conversation now
-	// includes the assistant turn (with tool_calls) plus a tool-role
-	// result message. Both must convert to Anthropic block form.
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[{"type":"text","text":"ok"}]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	tools := `[{"type":"function","function":{"name":"list_models","parameters":{"type":"object"}}}]`
-	toolCallsJSON := `[{"id":"call_abc","type":"function","function":{"name":"list_models","arguments":"{}"}}]`
-	_, err := cp.Predict(&pb.PredictOptions{
-		Tools: tools,
-		Messages: []*pb.Message{
-			{Role: "user", Content: "what models are installed?"},
-			{Role: "assistant", Content: "", ToolCalls: toolCallsJSON},
-			{Role: "tool", Content: `{"models":["a","b"]}`, ToolCallId: "call_abc"},
-		},
-		Tokens: 64,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-
-	g.Expect(captured.Messages).To(HaveLen(3))
-	// 1. user text — bare string
-	s, ok := captured.Messages[0].Content.(string)
-	g.Expect(ok).To(BeTrue())
-	g.Expect(s).To(Equal("what models are installed?"))
-	// 2. assistant: must be a content-block list with one tool_use.
-	// json.Unmarshal into an `any` field yields []any, not []anthropicContentBlock.
-	blocks, ok := captured.Messages[1].Content.([]any)
-	g.Expect(ok).To(BeTrue())
-	g.Expect(blocks).To(HaveLen(1))
-	b0, _ := blocks[0].(map[string]any)
-	g.Expect(b0["type"]).To(Equal("tool_use"))
-	g.Expect(b0["id"]).To(Equal("call_abc"))
-	g.Expect(b0["name"]).To(Equal("list_models"))
-	// 3. tool → user with tool_result block
-	g.Expect(captured.Messages[2].Role).To(Equal("user"))
-	resBlocks, _ := captured.Messages[2].Content.([]any)
-	r0, _ := resBlocks[0].(map[string]any)
-	g.Expect(r0["type"]).To(Equal("tool_result"))
-	g.Expect(r0["tool_use_id"]).To(Equal("call_abc"))
-	g.Expect(r0["content"]).To(Equal(`{"models":["a","b"]}`))
-}
--- a/backend/go/cloud-proxy/provider_edge_test.go
+++ b/backend/go/cloud-proxy/provider_edge_test.go
@@ -1,119 +0,0 @@
-package main
-
-import (
-	"encoding/json"
-	"strings"
-	"testing"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	. "github.com/onsi/gomega"
-)
-
-// Verify buildOpenAIRequest preserves caller-supplied tools and
-// tool_choice as opaque JSON. PredictOptions carries them as strings;
-// they must land in the outbound request body unchanged so the
-// upstream sees the caller's intent verbatim. A regression here would
-// silently disable function calling for translate-mode clients.
-func TestBuildOpenAIRequest_ToolsAndToolChoicePassthrough(t *testing.T) {
-	g := NewWithT(t)
-	cfg := &proxyConfig{upstreamModel: "gpt-4o"}
-	toolsJSON := `[{"type":"function","function":{"name":"search","parameters":{"type":"object"}}}]`
-	choiceJSON := `{"type":"function","function":{"name":"search"}}`
-
-	body, err := buildOpenAIRequest(&pb.PredictOptions{
-		Messages:   []*pb.Message{{Role: "user", Content: "find x"}},
-		Tools:      toolsJSON,
-		ToolChoice: choiceJSON,
-	}, cfg, false)
-	g.Expect(err).NotTo(HaveOccurred())
-
-	var decoded openAIRequest
-	err = json.Unmarshal(body, &decoded)
-	g.Expect(err).NotTo(HaveOccurred())
-	// Compare the JSON-canonical form so whitespace differences are ignored.
-	gotTools, _ := json.Marshal(json.RawMessage(decoded.Tools))
-	wantTools, _ := json.Marshal(json.RawMessage(toolsJSON))
-	g.Expect(string(gotTools)).To(Equal(string(wantTools)))
-	gotChoice, _ := json.Marshal(json.RawMessage(decoded.ToolChoice))
-	wantChoice, _ := json.Marshal(json.RawMessage(choiceJSON))
-	g.Expect(string(gotChoice)).To(Equal(string(wantChoice)))
-}
-
-// Garbage JSON in tools / tool_choice is silently dropped (omitted)
-// rather than failing the request. This documents the parseRawJSON
-// behaviour: operators shouldn't see hard failures caused by a
-// client's malformed tools field.
-func TestBuildOpenAIRequest_InvalidToolsJSONDropped(t *testing.T) {
-	g := NewWithT(t)
-	cfg := &proxyConfig{upstreamModel: "gpt-4o"}
-	body, err := buildOpenAIRequest(&pb.PredictOptions{
-		Messages:   []*pb.Message{{Role: "user", Content: "x"}},
-		Tools:      "this is not json",
-		ToolChoice: "{also bad",
-	}, cfg, false)
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(string(body)).NotTo(ContainSubstring("this is not json"))
-	g.Expect(string(body)).NotTo(ContainSubstring("{also bad"))
-}
-
-// An empty Anthropic content array yields an empty Reply, not an
-// error: no message text and no chat deltas, with usage still
-// reported. The array can legitimately be empty in some edge cases.
-func TestPredictRich_Anthropic_EmptyContent(t *testing.T) {
-	g := NewWithT(t)
-	srv, _ := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"id":"m1","type":"message","role":"assistant","content":[],"usage":{"input_tokens":3,"output_tokens":0}}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	reply, err := cp.PredictRich(&pb.PredictOptions{
-		Messages: []*pb.Message{{Role: "user", Content: "x"}},
-		Tokens:   16,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(string(reply.GetMessage())).To(Equal(""))
-	g.Expect(reply.GetChatDeltas()).To(HaveLen(0))
-	g.Expect(reply.GetPromptTokens()).To(Equal(int32(3)))
-}
-
-// A truncated / malformed SSE payload mid-stream should be tolerated:
-// the malformed chunk gets skipped (xlog.Debug logged), valid chunks
-// before AND after it still reach the channel.
-func TestPredictStreamRich_OpenAI_TolerantOfBadChunks(t *testing.T) {
-	g := NewWithT(t)
-	body := strings.Join([]string{
-		`data: {"choices":[{"index":0,"delta":{"content":"hello"}}]}`,
-		``,
-		`data: this-is-not-json{{`,
-		``,
-		`data: {"choices":[{"index":0,"delta":{"content":" world"}}]}`,
-		``,
-		`data: [DONE]`,
-		``,
-	}, "\n")
-
-	srv, _ := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
-		return 200, body, "text/event-stream"
-	})
-	defer srv.Close()
-	cp := newTranslateCloudProxy(t, srv.URL)
-
-	results := make(chan *pb.Reply, 8)
-	done := make(chan error, 1)
-	go func() {
-		done <- cp.PredictStreamRich(&pb.PredictOptions{
-			Messages: []*pb.Message{{Role: "user", Content: "hi"}},
-		}, results)
-		close(results)
-	}()
-
-	var assembled strings.Builder
-	for reply := range results {
-		assembled.Write(reply.GetMessage())
-	}
-	err := <-done
-	g.Expect(err).NotTo(HaveOccurred())
-	// The good chunks before and after the malformed one both made it through.
-	g.Expect(assembled.String()).To(Equal("hello world"))
-}
--- a/backend/go/cloud-proxy/provider_openai.go
+++ b/backend/go/cloud-proxy/provider_openai.go
@@ -1,320 +0,0 @@
-package main
-
-import (
-	"bufio"
-	"bytes"
-	"context"
-	"encoding/json"
-	"errors"
-	"fmt"
-	"io"
-	"net/http"
-	"strings"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	"github.com/mudler/xlog"
-)
-
-// OpenAI Chat Completions wire-format types. Narrowed to the fields
-// translate mode needs to preserve through the Reply proto: content,
-// role, tool_calls (typed so we can map them to pb.ToolCallDelta),
-// and sampling params copied verbatim from PredictOptions.
-//
-// Provider-specific extensions (logit_bias, function calling beyond
-// tool_calls, etc.) are not modelled — passthrough mode covers callers
-// that need full upstream fidelity.
-
-type openAIRequest struct {
-	Model            string          `json:"model"`
-	Messages         []openAIMessage `json:"messages"`
-	Stream           bool            `json:"stream,omitempty"`
-	Temperature      *float64        `json:"temperature,omitempty"`
-	TopP             *float64        `json:"top_p,omitempty"`
-	MaxTokens        *int32          `json:"max_tokens,omitempty"`
-	Stop             []string        `json:"stop,omitempty"`
-	FrequencyPenalty *float64        `json:"frequency_penalty,omitempty"`
-	PresencePenalty  *float64        `json:"presence_penalty,omitempty"`
-	Tools            json.RawMessage `json:"tools,omitempty"`
-	ToolChoice       json.RawMessage `json:"tool_choice,omitempty"`
-}
-
-type openAIMessage struct {
-	Role       string           `json:"role"`
-	Content    string           `json:"content,omitempty"`
-	Name       string           `json:"name,omitempty"`
-	ToolCallID string           `json:"tool_call_id,omitempty"`
-	ToolCalls  []openAIToolCall `json:"tool_calls,omitempty"`
-}
-
-// openAIToolCall covers both the non-streaming response shape (full
-// id+function+arguments) and the streaming-delta shape (sparse fields,
-// index assignment). The proto's ToolCallDelta absorbs both — name is
-// set on first appearance, arguments arrive incrementally in streaming.
-type openAIToolCall struct {
-	Index    int                `json:"index,omitempty"`
-	ID       string             `json:"id,omitempty"`
-	Type     string             `json:"type,omitempty"`
-	Function openAIFunctionCall `json:"function,omitempty"`
-}
-
-type openAIFunctionCall struct {
-	Name      string `json:"name,omitempty"`
-	Arguments string `json:"arguments,omitempty"`
-}
-
-type openAIChoice struct {
-	Index        int           `json:"index"`
-	Message      openAIMessage `json:"message"`
-	FinishReason string        `json:"finish_reason"`
-}
-
-type openAIResponse struct {
-	ID      string         `json:"id"`
-	Choices []openAIChoice `json:"choices"`
-	Usage   *openAIUsage   `json:"usage,omitempty"`
-}
-
-type openAIStreamChoice struct {
-	Index int `json:"index"`
-	Delta struct {
-		Content   string           `json:"content,omitempty"`
-		Role      string           `json:"role,omitempty"`
-		ToolCalls []openAIToolCall `json:"tool_calls,omitempty"`
-	} `json:"delta"`
-	FinishReason string `json:"finish_reason,omitempty"`
-}
-
-type openAIStreamChunk struct {
-	Choices []openAIStreamChoice `json:"choices"`
-	Usage   *openAIUsage         `json:"usage,omitempty"`
-}
-
-type openAIUsage struct {
-	PromptTokens     int `json:"prompt_tokens"`
-	CompletionTokens int `json:"completion_tokens"`
-	TotalTokens      int `json:"total_tokens"`
-}
-
-// buildOpenAIRequest converts pb.PredictOptions into the OpenAI Chat
-// Completions request body. Prefers Messages when non-empty; falls
-// back to wrapping Prompt as a single user message so plain
-// /completions-style calls still work in translate mode.
-func buildOpenAIRequest(opts *pb.PredictOptions, cfg *proxyConfig, stream bool) ([]byte, error) {
-	req := openAIRequest{
-		Model:      modelName(cfg, opts),
-		Stream:     stream,
-		Stop:       opts.GetStopPrompts(),
-		Tools:      parseRawJSON(opts.GetTools()),
-		ToolChoice: parseRawJSON(opts.GetToolChoice()),
-	}
-	if t := opts.GetTemperature(); t != 0 {
-		v := float64(t)
-		req.Temperature = &v
-	}
-	if t := opts.GetTopP(); t != 0 {
-		v := float64(t)
-		req.TopP = &v
-	}
-	if n := opts.GetTokens(); n > 0 {
-		req.MaxTokens = &n
-	}
-	if p := opts.GetFrequencyPenalty(); p != 0 {
-		v := float64(p)
-		req.FrequencyPenalty = &v
-	}
-	if p := opts.GetPresencePenalty(); p != 0 {
-		v := float64(p)
-		req.PresencePenalty = &v
-	}
-
-	for _, m := range opts.GetMessages() {
-		msg := openAIMessage{
-			Role:       m.GetRole(),
-			Content:    m.GetContent(),
-			Name:       m.GetName(),
-			ToolCallID: m.GetToolCallId(),
-		}
-		// Pre-existing tool_calls arrive as a JSON string from the
-		// upstream caller's previous assistant turn; pass-through as-is.
-		if tc := m.GetToolCalls(); tc != "" {
-			_ = json.Unmarshal([]byte(tc), &msg.ToolCalls)
-		}
-		req.Messages = append(req.Messages, msg)
-	}
-	// Fallback for plain Prompt requests (no Messages array). LocalAI
-	// templating may have produced a flat prompt; rewrap as a single
-	// user message so the upstream chat endpoint accepts it.
-	if len(req.Messages) == 0 && opts.GetPrompt() != "" {
-		req.Messages = []openAIMessage{{Role: "user", Content: opts.GetPrompt()}}
-	}
-
-	return json.Marshal(req)
-}
-
-// modelName picks the upstream model: upstream_model from the proxy
-// config wins (operator override), else the local model name captured
-// at LoadModel time. Operator sets upstream_model to map LocalAI's
-// alias (e.g. "claude-strict") to the upstream's canonical name
-// (e.g. "claude-3-5-sonnet-20241022").
-func modelName(cfg *proxyConfig, _ *pb.PredictOptions) string {
-	if cfg.upstreamModel != "" {
-		return cfg.upstreamModel
-	}
-	return cfg.localModel
-}
-
-// parseRawJSON parses a JSON string into a RawMessage so it round-trips
-// into the upstream body. Returns nil for empty/invalid input so the
-// field is omitted (omitempty).
-func parseRawJSON(s string) json.RawMessage {
-	if s == "" {
-		return nil
-	}
-	var probe json.RawMessage
-	if err := json.Unmarshal([]byte(s), &probe); err != nil {
-		return nil
-	}
-	return probe
-}
-
-// doOpenAIRequest builds + sends the upstream request. Returns the
-// raw response on success; caller handles status / body.
-func (c *CloudProxy) doOpenAIRequest(ctx context.Context, cfg *proxyConfig, body []byte) (*http.Response, error) {
-	req, err := http.NewRequestWithContext(ctx, http.MethodPost, cfg.upstreamURL, bytes.NewReader(body))
-	if err != nil {
-		return nil, fmt.Errorf("cloud-proxy: build request: %w", err)
-	}
-	req.Header.Set("Content-Type", "application/json")
-	req.Header.Set("Accept", "*/*")
-	if cfg.apiKey != "" {
-		applyAuthHeader(req, cfg.provider, cfg.apiKey)
-	}
-	resp, err := c.client.Do(req)
-	if err != nil {
-		return nil, fmt.Errorf("cloud-proxy: upstream request: %w", err)
-	}
-	return resp, nil
-}
-
-// predictOpenAIRich is the non-streaming translate path. Returns a
-// fully-populated *pb.Reply with assistant content, tool calls, and
-// token usage. The gRPC server forwards the Reply verbatim.
-func (c *CloudProxy) predictOpenAIRich(ctx context.Context, cfg *proxyConfig, opts *pb.PredictOptions) (*pb.Reply, error) {
-	body, err := buildOpenAIRequest(opts, cfg, false)
-	if err != nil {
-		return nil, fmt.Errorf("cloud-proxy: marshal request: %w", err)
-	}
-	resp, err := c.doOpenAIRequest(ctx, cfg, body)
-	if err != nil {
-		return nil, err
-	}
-	defer func() { _ = resp.Body.Close() }()
-
-	if resp.StatusCode >= 400 {
-		errBody, _ := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
-		return nil, fmt.Errorf("cloud-proxy: upstream %d: %s", resp.StatusCode, string(errBody))
-	}
-
-	var parsed openAIResponse
-	if err := json.NewDecoder(resp.Body).Decode(&parsed); err != nil {
-		return nil, fmt.Errorf("cloud-proxy: decode response: %w", err)
-	}
-	if len(parsed.Choices) == 0 {
-		return nil, errors.New("cloud-proxy: upstream returned no choices")
-	}
-
-	choice := parsed.Choices[0]
-	reply := &pb.Reply{
-		Message: []byte(choice.Message.Content),
-	}
-	if parsed.Usage != nil {
-		reply.PromptTokens = int32(parsed.Usage.PromptTokens)
-		reply.Tokens = int32(parsed.Usage.CompletionTokens)
-	}
-	if len(choice.Message.ToolCalls) > 0 {
-		// Non-streaming: a single ChatDelta carries the full tool-call
-		// set. Index/Name/Arguments are populated together; downstream
-		// consumers don't need to assemble streaming deltas.
-		delta := &pb.ChatDelta{}
-		for _, tc := range choice.Message.ToolCalls {
-			delta.ToolCalls = append(delta.ToolCalls,
-				newToolCallDelta(tc.Index, tc.ID, tc.Function.Name, tc.Function.Arguments))
-		}
-		reply.ChatDeltas = []*pb.ChatDelta{delta}
-	}
-	return reply, nil
-}
-
-// predictOpenAIStreamRich streams *pb.Reply chunks. Each chunk carries
-// either a content delta (Message + ChatDeltas[].Content) or tool-call
-// deltas (ChatDeltas[].ToolCalls). The final Reply carries usage tokens
-// when the upstream sends them (stream_options.include_usage).
-func (c *CloudProxy) predictOpenAIStreamRich(ctx context.Context, cfg *proxyConfig, opts *pb.PredictOptions, results chan<- *pb.Reply) error {
-	body, err := buildOpenAIRequest(opts, cfg, true)
-	if err != nil {
-		return fmt.Errorf("cloud-proxy: marshal request: %w", err)
-	}
-	resp, err := c.doOpenAIRequest(ctx, cfg, body)
-	if err != nil {
-		return err
-	}
-	defer func() { _ = resp.Body.Close() }()
-
-	if resp.StatusCode >= 400 {
-		errBody, _ := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
-		return fmt.Errorf("cloud-proxy: upstream %d: %s", resp.StatusCode, string(errBody))
-	}
-
-	scanner := bufio.NewScanner(resp.Body)
-	scanner.Buffer(make([]byte, 0, 64*1024), 1<<20)
-	for scanner.Scan() {
-		line := scanner.Text()
-		if !strings.HasPrefix(line, "data:") {
-			continue
-		}
-		payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
-		if payload == "" || payload == "[DONE]" {
-			return nil
-		}
-		var chunk openAIStreamChunk
-		if err := json.Unmarshal([]byte(payload), &chunk); err != nil {
-			xlog.Debug("cloud-proxy: skip malformed SSE chunk", "error", err)
-			continue
-		}
-		// Usage frames may arrive separately from content frames when
-		// stream_options.include_usage is set; emit a usage-only Reply
-		// in that case so the consumer sees the totals.
-		if chunk.Usage != nil && len(chunk.Choices) == 0 {
-			if !sendReply(ctx, results, &pb.Reply{
-				PromptTokens: int32(chunk.Usage.PromptTokens),
-				Tokens:       int32(chunk.Usage.CompletionTokens),
-			}) {
-				return ctx.Err()
-			}
-			continue
-		}
-		for _, ch := range chunk.Choices {
-			reply := &pb.Reply{}
-			if ch.Delta.Content != "" {
-				reply.Message = []byte(ch.Delta.Content)
-				reply.ChatDeltas = []*pb.ChatDelta{{Content: ch.Delta.Content}}
-			}
-			if len(ch.Delta.ToolCalls) > 0 {
-				if len(reply.ChatDeltas) == 0 {
-					reply.ChatDeltas = []*pb.ChatDelta{{}}
-				}
-				for _, tc := range ch.Delta.ToolCalls {
-					reply.ChatDeltas[0].ToolCalls = append(reply.ChatDeltas[0].ToolCalls,
-						newToolCallDelta(tc.Index, tc.ID, tc.Function.Name, tc.Function.Arguments))
-				}
-			}
-			if reply.Message == nil && len(reply.ChatDeltas) == 0 {
-				continue
-			}
-			if !sendReply(ctx, results, reply) {
-				return ctx.Err()
-			}
-		}
-	}
-	return scanner.Err()
-}
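
Reviewer note: the SSE scan loop in the removed `predictOpenAIStreamRich` can be illustrated in isolation. The following is a minimal sketch, not the removed implementation. `streamChunk` and `collectSSEContent` are hypothetical names introduced here, and the sketch concatenates content deltas into a string instead of sending `*pb.Reply` values on a channel. It keeps the same line handling as the removed loop: lines without a `data:` prefix are ignored, an empty payload or `[DONE]` ends the stream, and malformed JSON chunks are skipped.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// streamChunk is a hypothetical, trimmed-down stand-in for the removed
// openAIStreamChunk type; only the content delta is modelled.
type streamChunk struct {
	Choices []struct {
		Delta struct {
			Content string `json:"content"`
		} `json:"delta"`
	} `json:"choices"`
}

// collectSSEContent (hypothetical helper) walks an SSE body line by line
// and concatenates every content delta it can decode.
func collectSSEContent(stream string) (string, error) {
	var out strings.Builder
	scanner := bufio.NewScanner(strings.NewReader(stream))
	// Same buffer sizing as the removed code: 64 KiB initial, 1 MiB max line.
	scanner.Buffer(make([]byte, 0, 64*1024), 1<<20)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "data:") {
			continue // blank separators, comments, other SSE fields
		}
		payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
		if payload == "" || payload == "[DONE]" {
			return out.String(), nil // end-of-stream sentinel
		}
		var chunk streamChunk
		if err := json.Unmarshal([]byte(payload), &chunk); err != nil {
			continue // skip the malformed chunk, keep reading
		}
		for _, ch := range chunk.Choices {
			out.WriteString(ch.Delta.Content)
		}
	}
	return out.String(), scanner.Err()
}

func main() {
	stream := "data: {\"choices\":[{\"delta\":{\"content\":\"hello \"}}]}\n\n" +
		"data: {not json}\n\n" +
		"data: {\"choices\":[{\"delta\":{\"content\":\"world\"}}]}\n\n" +
		"data: [DONE]\n"
	got, err := collectSSEContent(stream)
	fmt.Printf("%q %v\n", got, err) // prints "hello world" <nil>
}
```

This matches the behaviour asserted by the removed test above (`assembled.String()` equals `"hello world"` despite one malformed chunk). Note that, as in the removed loop, a bare `data:` line with an empty payload also terminates the stream.
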
Author	SHA1	Message	Date
copilot-swe-agent[bot]	8fbf18490e	fix: remove deprecated cosign bundle flag from backend merge workflow Agent-Logs-Url: https://github.com/mudler/LocalAI/sessions/4207dabc-14ec-4655-9594-487338977fcf Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-22 22:16:44 +00:00
copilot-swe-agent[bot]	b334a77405	Initial plan	2026-05-22 22:13:44 +00:00