chore(turboquant): retreat pin to 4c1c3ac0 to skip fork GPU regression

CI on the prior 2cbfdc62 pin confirmed our grpc-server.cpp/patch fix works (tests-turboquant-grpc + all multiarch turboquant builds passed), but every GPU singlearch turboquant build now hits a static-assertion error in the fork's own ggml/src/ggml-cuda/fattn-mma-f16.cuh, a regression introduced by the May 14 #22880 `HIP: RDNA3 mma FA` refactor (file went from 1855 to 2049 lines).

4c1c3ac0 (2026-05-13 22:12 UTC) is the last commit before that refactor and still has every API piece grpc-server.cpp depends on (DRAFT_SIMPLE enum, nested common_params_speculative, model_tgt, get_media_marker(), common_speculative_types_from_names). MTP support landed later (May 16) and is not exercised by grpc-server.cpp.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
chore(turboquant): bump to 2cbfdc62 and retire obsolete grpc-server patches
2026-06-11 18:27:32 -04:00 · 2026-05-21 15:54:38 +00:00 · 2026-05-21 11:03:46 +00:00
775 changed files with 3513 additions and 74815 deletions
--- a/.agents/backend-signing.md
+++ b/.agents/backend-signing.md
@@ -16,8 +16,7 @@ side (`pkg/oci/cosignverify` plus the gallery YAML).
  per-arch manifest before checking signatures.
 - **Storage:** Signatures are written as OCI 1.1 referrers
  (`--registry-referrers-mode=oci-1-1`) in the new Sigstore bundle format
-  (current cosign releases do this by default; no `--new-bundle-format`
-  flag). No `:sha256-<hex>.sig` tag clutter.
+  (`--new-bundle-format`). No `:sha256-<hex>.sig` tag clutter.
 - **Consumer:** `pkg/oci/cosignverify` discovers the bundle via the
  referrers API, hands it to `sigstore-go`, and verifies it against the
  policy declared in the gallery YAML (`Gallery.Verification`).
@@ -34,14 +33,15 @@ to sign. The job needs:

 - `permissions: { id-token: write, contents: read }` at the job level so
  the runner can exchange its GitHub OIDC token for a Fulcio cert.
- `sigstore/cosign-installer@v3` step (current cosign releases already
-  default to the new bundle format).
+- `sigstore/cosign-installer@v3` step (cosign ≥ 2.2 for
+  `--new-bundle-format`).
 - After each `docker buildx imagetools create`, resolve the resulting
  list digest with `docker buildx imagetools inspect <tag> --format
  '{{.Manifest.Digest}}'` and sign:

 ```sh
 cosign sign --yes --recursive \
+  --new-bundle-format \
  --registry-referrers-mode=oci-1-1 \
  "${REGISTRY_REPO}@${DIGEST}"
 ```
@@ -49,12 +49,6 @@ cosign sign --yes --recursive \
 Sign by digest, never by tag — signing by tag binds the signature to
 whatever the tag points at *now*, and a subsequent tag push orphans it.

-`--registry-referrers-mode=oci-1-1` is still gated behind
-`COSIGN_EXPERIMENTAL=1` in cosign v2.4.x (set at the job env level in
-`backend_merge.yml`). Re-evaluate when bumping the pinned cosign release
-— newer versions are expected to graduate this flag and the env var can
-then be dropped.
-
 `backend_build_darwin.yml` builds and pushes single-arch darwin images
 that bypass the manifest-list merge. If/when those entries get a gallery
 `verification:` policy, the equivalent cosign step has to land there
--- a/.agents/building-and-testing.md
+++ b/.agents/building-and-testing.md
@@ -15,35 +15,3 @@ Let's say the user wants to build a particular backend for a given platform. For
 - Unless the user specifies that they want you to run the command, then just print it because not all agent frontends handle long running jobs well and the output may overflow your context
 - The user may say they want to build AMD or ROCM instead of hipblas, or Intel instead of SYCL or NVIDIA instead of l4t or cublas. Ask for confirmation if there is ambiguity.
 - Sometimes the user may need extra parameters to be added to `docker build` (e.g. `--platform` for cross-platform builds or `--progress` to view the full logs), in which case you can generate the `docker build` command directly.
-
-## Test coverage gate
-
-The core Go suites (`./pkg`, `./core`, plus the in-process integration suite `./tests/e2e`) are covered by a **strict, monotonic coverage ratchet**:
-
- `make test-coverage` — runs the suites with `covermode=atomic` instrumentation and writes a merged profile to `coverage/coverage.out`. Uses the same prerequisites as `make test`.
-  - **`--coverpkg` (`COVERAGE_COVERPKG = core/...,pkg/...`):** coverage is attributed to the core+pkg packages, not just the package under test. This is what lets the in-process `tests/e2e` suite (which drives the real HTTP server over loopback via `application.New`) credit the `core/http/endpoints/...` handlers it exercises — folding it in roughly doubled endpoint coverage (e.g. `endpoints/openai` 13.6% → 52%). The denominator is therefore *all* of `core`+`pkg` (minus generated proto, dropped via `COVERAGE_EXCLUDE_RE`), so the number isn't comparable to a plain per-package figure.
-  - **Integration suites (`COVERAGE_E2E_ROOTS = ./tests/e2e`)** run non-recursively (excludes `tests/e2e/distributed`, which needs containers) with `--label-filter=!real-models` (those need a downloaded model) against the mock backend built by `prepare-test`. `tests/integration` is deliberately excluded — it needs `make backends/local-store`, which the coverage CI job doesn't build.
-  - **Flake note:** folding integration tests into a *strict* gate means a hard e2e failure (or a spec that silently stops running) can fail the coverage gate, not just the test. `--flake-attempts` absorbs transient retryable failures; covermode=atomic keeps line coverage deterministic otherwise.
-  - **Why one ginkgo run per root (`scripts/run-coverage.sh`):** passing several recursive roots to a *single* ginkgo invocation (e.g. `ginkgo -r ./pkg ./core`) only merges **one** root's coverprofile into `--output-dir`/`--coverprofile` — the others are silently dropped. Verified with ginkgo 2.29.0: `-r ./pkg ./core` yields only `./pkg` coverage, while `-r ./core` alone yields all 34 core packages. So the script runs each root separately and concatenates the (disjoint) profiles. Don't "simplify" it back to a single multi-root invocation — that's how `core/` (including all of `core/http`, ~7.4k statements) silently vanished from the number before.
-  - **Build tags (`COVERAGE_TAGS`, passed via `GINKGO_TAGS`):** defaults to `debug auth`. The `auth` tag is required to compile the real (sqlite-backed) auth implementation and its ~150 `//go:build auth` tests — without it those files aren't built, the tests don't run, and the gate scores auth against a stub (~3.7% instead of ~38%). If you add new tag-gated tests, extend `COVERAGE_TAGS` or they won't count (and likely won't run in CI at all).
- `make test-coverage-check` — runs `test-coverage`, then `scripts/coverage-check.sh` fails the build if total coverage is **below** the committed baseline in `coverage-baseline.txt`. The Linux job in `.github/workflows/test.yml` runs this instead of `make test`.
- `make test-coverage-baseline` — regenerates and overwrites `coverage-baseline.txt` from the current run.
- `make install-hooks` — sets `core.hooksPath` to the versioned `.githooks/`, whose `pre-commit` runs checks scoped to what's staged: Go changes → `make lint` + `make test-coverage-check`; `core/http/react-ui/` changes → `make test-ui-coverage-check` (Playwright e2e + UI coverage gate). A commit touching neither is skipped; bypass with `git commit --no-verify`. The hook resolves golangci-lint's new-from base to `upstream/master` → `origin/master` → `master`, so it works from a fork clone where `origin/master` is stale (passed to `make lint` via `LINT_NEW_FROM`).
-
-### React UI coverage
-
-The React UI (`core/http/react-ui/`) has **no component/unit tests** — its only tests are the Playwright e2e specs in `e2e/`, which run against the real app served by `tests/e2e-ui/ui-test-server` (the dist is `//go:embed`ed, so the server is rebuilt per coverage run). Those specs do genuinely exercise the UI (clicks, `fill`, `setInputFiles`, `getByRole`/`getByText`, visibility/value assertions).
-
- `make test-ui-coverage` — builds an istanbul-instrumented bundle (`COVERAGE=true`, via `vite-plugin-istanbul` with `forceBuildInstrument: true` — the plugin skips production builds otherwise), re-embeds it into `ui-test-server` (the dist is `//go:embed`ed), runs the Playwright specs, and writes an `nyc` report to `core/http/react-ui/coverage/`. The specs import `{ test, expect }` from `e2e/coverage-fixtures.js` (re-exports Playwright's, plus harvests `window.__coverage__` into `.nyc_output/` after each test). Instrumentation is off unless `COVERAGE=true`, so dev/prod builds and plain `make test-ui-e2e` are unaffected (the fixture no-ops when `window.__coverage__` is absent).
- **Browser:** the flake dev shell ships `chromium` and exports `PLAYWRIGHT_CHROMIUM_PATH`; `playwright.config.js` uses it via `launchOptions.executablePath`, and the Makefile skips `playwright install` when it's set. This avoids Playwright's downloaded browser, which can't resolve system libs (`libglib-2.0`, …) on NixOS. In CI (no `PLAYWRIGHT_CHROMIUM_PATH`) the Makefile falls back to `playwright install --with-deps chromium`.
- The app is a React SPA, so coverage accumulates across in-app navigation within a test; a full `page.goto`/reload resets it.
- `.nycrc.json` uses `all: true`, so **every `src/**` file is in the report**, including 0%-coverage ones — that's how you spot features with no test at all (sort the HTML report or `coverage-summary.json` by line% ascending). 
- **UI coverage gate:** `make test-ui-coverage-check` runs the suite then `scripts/ui-coverage-check.sh`, failing if total line coverage drops more than `UI_COVERAGE_TOLERANCE` below `core/http/react-ui/coverage-baseline.txt`. `make test-ui-coverage-baseline` regenerates the baseline. Runs in CI (`tests-ui-e2e.yml`) and pre-commit on `core/http/react-ui/` changes.
- **Why it has a tolerance (unlike the strict Go gate):** UI e2e coverage is *non-deterministic*. Specs that assert on state and end while async/lazy render work is still in flight collect those lines only when the render beats the coverage teardown — so the total drifts with machine speed/load (a fast local box reads higher than a slow CI runner), diffusely across many specs. The tolerance absorbs that drift, so set the baseline *below* the slow-CI floor, never to a fast-local `make test-ui-coverage-baseline` number, or CI flaps.
- **Raising coverage is cheap:** a *render-smoke* spec (navigate to a route, assert its header renders) mounts a lazy page and runs its full render + initial effects, capturing most of its lines in a few lines of test — see `e2e/page-render-smoke.spec.js`. Auth is disabled in the test server (`isAdmin=true`), so `RequireAdmin`/`RequireFeature` routes render without a mock. The most *deterministic* win is removing a race: make a spec `await` a rendered element before ending (see `e2e/agents.spec.js` → AgentCreate) so its lines count every run.
-
-Rules (both gates):
- **Install the hooks:** `make install-hooks` once per clone so lint + coverage run pre-commit. Don't lean on CI for what the hook catches.
- **Don't work around the gate:** never `git commit --no-verify`, and never hand-lower a baseline or widen a tolerance to turn a red gate green. The ratchet only moves up.
- If a change drops coverage, **add tests** (sort `coverage-summary.json` by line% ascending to find untested code) rather than editing the baseline. When coverage legitimately rises, commit the regenerated baseline (`make test-coverage-baseline` / `test-ui-coverage-baseline`).
- The Go gate is **strict — no tolerance**; `covermode=atomic` keeps it deterministic. The UI gate keeps a small tolerance only because its e2e coverage isn't.
--- a/.agents/coding-style.md
+++ b/.agents/coding-style.md
@@ -50,17 +50,6 @@ Do not mix styles within a package. If you are extending tests in a package that

 This is enforced by `golangci-lint` via the `forbidigo` linter (see `.golangci.yml`); calls like `t.Errorf` / `t.Fatalf` / `t.Run` / `t.Skip` / `t.Logf` are flagged. Run `make lint` locally before submitting; the same check runs in CI (`.github/workflows/lint.yml`).

-## Outbound HTTP
-
-All outbound HTTP must go through `github.com/mudler/LocalAI/pkg/httpclient` rather than the standard library's default client. Use `httpclient.New(...)` (no body deadline — safe for streaming/SSE) or `httpclient.NewWithTimeout(d, ...)` (simple request/response). Both **refuse redirects by default** and set a TLS 1.2 floor.
-
-The reason is GHSA-3mj3-57v2-4636: the std default client follows redirects, and on a *cross-host* redirect Go forwards custom credential headers (e.g. Anthropic's `x-api-key`) to the redirect target, leaking the secret. `httpclient` fails closed instead.
-
- Need to follow redirects (download CDNs, registry blobs, GitHub asset URLs)? Pass `httpclient.WithFollowRedirects()` — it still strips credential headers on any cross-host hop.
- Have a custom transport (IP-pinned dialer, HTTP/2 tuning, a credential-injecting `RoundTripper`)? Pass `httpclient.WithTransport(rt)`, basing the transport on `httpclient.HardenedTransport()` to keep the TLS floor. Handed a `*http.Client` by a library? `httpclient.Harden(c)` applies the policy in place.
-
-This is enforced by `forbidigo` (see `.golangci.yml`): `http.DefaultClient` and `http.Get`/`Post`/`PostForm`/`Head` are flagged. The `&http.Client{}` composite literal can't be matched precisely by forbidigo without also flagging legitimate `*http.Client` type references, so that form is caught by review — don't construct raw clients.
-
 ## Documentation

 The project documentation is located in `docs/content`. When adding new features or changing existing functionality, it is crucial to update the documentation to reflect these changes. This helps users understand how to use the new capabilities and ensures the documentation stays relevant.
--- a/.agents/dllm-backend.md
+++ b/.agents/dllm-backend.md
@@ -1,138 +0,0 @@
-# Working on the dllm Backend
-
-`mudler/dllm.cpp` is a standalone C++/ggml engine for DiffusionGemma
-block-diffusion models. LocalAI wraps it with a **pure-Go** backend at
-`backend/go/dllm/` that dlopens `libdllm.so` via purego (ebitengine/purego) -
-NOT cgo, and NOT a C++ grpc-server fork. The Go side owns chat templating
-(gemma4 renderer) and output parsing (gemma4 streaming parser) and implements
-the rich gRPC interface (`PredictRich`/`PredictStreamRich`, ChatDelta replies).
-
-> NOTE: github.com/mudler/dllm.cpp is still **private** (publishing is
-> planned). Until then the Makefile's anonymous clone fails; use the local-dev
-> symlink shortcut documented at the top of `backend/go/dllm/Makefile`
-> (symlink an out-of-tree `build/libdllm.so` into the backend dir and skip the
-> clone), or a git credential helper with repo access.
-
-## Pin
-
-`backend/go/dllm/Makefile` pins `DLLM_VERSION?=<sha>` at the top
-(whisper / parakeet-cpp / ds4 convention). The bump-deps bot
-(`.github/workflows/bump_deps.yaml`) tracks `mudler/dllm.cpp` `main` and
-rewrites that variable. After a manual bump: `make -C backend/go/dllm purge &&
-make -C backend/go/dllm` (the clone is keyed on the directory existing, not
-the sha).
-
-## C-ABI and the serialization contract
-
-The binding covers the 9-symbol flat C-ABI from dllm.cpp's
-`include/dllm_capi.h` (ABI v1; `main.go` hard-fails on a version mismatch):
-`abi_version, load, free, last_error, free_string, tokenize_json, generate,
-generate_stream, cancel`. Contract points the Go wiring encodes (`capi.go`
-header comment has the full list):
-
- **One ctx = one concurrent generate/tokenize.** A per-model worker
-  goroutine (`Dllm.jobs` in `dllm.go`) owns ALL C calls, making the
-  serialization structural instead of lock discipline.
- **`dllm_capi_cancel` is the ONE exception**: it only flips an atomic and may
-  be called from any goroutine mid-generate, so `Dllm.Cancel` bypasses the
-  worker queue. The flag resets at the start of each generate, so a watchdog
-  racing a new generate must re-issue cancel.
- **`last_error` is a borrowed pointer** and must only be read AFTER the
-  failing call returned (never while a generate is in flight on the same ctx).
- **Free vs in-flight requests**: requests hold `genMu.RLock` for their full
-  duration; `Free` takes the write lock, so it only runs when nothing is in
-  flight, then drains and closes the worker. Post-Free requests get a clean
-  "model not loaded" error.
- `tokenize_json`/`generate` return malloc'd `char*` (bound as `uintptr`,
-  copied, then `dllm_capi_free_string`d); opts/params JSON must be a FLAT
-  object of scalars (`buildOptsJSON` rejects anything else).
-
-## Wire shape
-
-| RPC | Implementation |
-|---|---|
-| LoadModel | `dllm_capi_load` (params: `n_gpu_layers`, `n_threads`, `ctx_len`); `Options[]` parsed into per-request gen opts (`eb_*`, `blocks`, `kv_cache`) by `parseModelGenOpts` |
-| PredictRich | render (if templated) → `dllm_capi_generate` → parse → ONE Reply with aggregated ChatDeltas + legacy `Message` bytes |
-| PredictStreamRich | `dllm_capi_generate_stream`; per committed diffusion block → UTF-8 holdback → parser.Feed → one Reply per non-empty delta batch (channel closed by the CALLER, per `pkg/grpc/interface.go`) |
-| Predict / PredictStream | Legacy paths, delegate to the rich pair (legacy stream INVERTS channel ownership: the impl closes) |
-| TokenizeString | `dllm_capi_tokenize_json` (C side prepends BOS per `vocab.add_bos`) |
-| Cancel | `dllm_capi_cancel`, exposed as the `grpc.Cancellable` capability (`pkg/grpc/interface.go`): the gRPC server arms it via `context.AfterFunc` on the Predict/PredictStream context, so client disconnects/timeouts abort the in-flight generate - llama.cpp `IsCancelled()` parity for Go backends |
-
-`n_threads` and `ctx_len` are accepted-but-ignored by the engine at the
-current pin (the context bound comes from GGUF `n_ctx_train`); they are sent
-for forward compatibility.
-
-## Renderer / parser (the templated chat path)
-
-With `use_tokenizer_template` + raw Messages, the backend owns templating and
-parsing (the ds4 precedent, but in Go):
-
- `gemma4_renderer.go` - `RenderGemma4(msgs, toolsJSON, enableThinking,
-  addGenerationPrompt)`. The file embeds the FULL `tokenizer.chat_template`
-  jinja (17466 bytes, md5 `8c34cf93c7a7815b3fdb300a009c4c17`) extracted
-  verbatim from `diffusiongemma-26B-A4B-it-BF16.gguf` via gguf-py - e.g.
-  `python scripts/dump_gguf.py model.gguf | grep -A400 chat_template` in the
-  dllm.cpp checkout - as a numbered comment block; every Go rule cites its
-  "tpl L<n>" line. Re-verify the md5 before blaming the renderer for a
-  mismatch with a new GGUF. **BOS exception**: the template emits
-  `{{- bos_token -}}` but the renderer deliberately does NOT - dllm.cpp's
-  `run_generate` tokenizes with `prepend_bos = vocab.add_bos` (true for
-  gemma4), so a literal `<bos>` would double it.
- `gemma4_parser.go` - streaming state machine turning raw model text
-  (fragments can split anywhere, including mid-marker) into ChatDeltas:
-  thought channels → `reasoning_content`, `<|tool_call>call:name{...}` →
-  ToolCallDelta, `<turn|>` → done. Marker grammar cross-checked against vLLM
-  PR #45163's gemma4 tool/reasoning parsers. Malformed payloads are re-emitted
-  raw as content, never dropped.
- Thinking is **opt-in** for this family (`Metadata["enable_thinking"]`,
-  default OFF - the inverse of ds4): the template gates every thinking branch
-  on `enable_thinking`, and the no-thinking render pre-closes an empty thought
-  channel, so the parser always starts in content state.
- **UTF-8 boundary holdback** (`splitValidUTF8` in `dllm.go`): per-block
-  detokenization can split a multi-byte character across block boundaries, and
-  grpc-go refuses to marshal invalid UTF-8 in proto3 strings. An incomplete
-  trailing sequence (at most 3 bytes) is carried into the next block; genuinely
-  undecodable bytes become U+FFFD.
-
-Without `use_tokenizer_template`, the prompt passes through verbatim and the
-output is NOT gemma4-parsed (plain content, like any non-autoparsing backend).
-
-## Tests
-
-| Layer | Gate | What |
-|---|---|---|
-| `backend/go/dllm/*_test.go` (renderer/parser/wiring) | none - run in plain `go test ./backend/go/dllm/...` | Ginkgo specs over a fake `generator` seam; canonical renderer fixtures from transformers' `test_modeling_diffusion_gemma.py`, parser tables from the vLLM gemma4 parsers |
-| `backend/go/dllm/dllm_test.go` C-ABI smoke | `DLLM_TEST_LIBRARY` + `DLLM_TEST_TINY_MODEL` (dllm.cpp's `tests/fixtures/tiny_with_vocab.gguf`); Skips when unset | Drives the real `libdllm.so`: ABI check, load, tokenize `[2,18]`, deterministic generate, cancel (incl. mid-stream `Dllm.Cancel` aborting a deliberately slow `eb_max_steps:256` run in ~10ms) |
-| `tests/e2e-backends/dllm_test.go` | `BACKEND_TEST_DLLM=1` + `BACKEND_BINARY` (packaged run.sh) + `BACKEND_TEST_MODEL_FILE` (tiny fixture) | Templated chat round trip (Messages + UseTokenizerTemplate) over the real gRPC binary, non-streaming + streaming; plus client-context cancellation mid-stream (proves the `Cancellable` server plumbing end to end) |
-| Real-model e2e | `BACKEND_TEST_DLLM_REAL_MODEL_FILE` (26B BF16, ~50 GB) + `BACKEND_TEST_DLLM_REAL_GPU_LAYERS` | CUDA-13-class hardware only |
-
-Tool-call e2e is deliberately absent from the tiny-model spec: the fixture has
-random weights and cannot be coaxed into emitting tool markup; the unit tables
-carry that coverage.
-
-## Build matrix
-
-`cpu-dllm` (amd64 + arm64), `cuda13-dllm` (amd64), and
-`cuda13-nvidia-l4t-arm64-dllm` (arm64 CUDA: Jetson / DGX Spark GB10), via
-`.github/backend-matrix.yml`. No darwin/Metal. CUDA builds forward
-`-DDLLM_CUDA=ON` (dllm.cpp gates ggml's CUDA behind its own flag - a bare
-`-DGGML_CUDA=ON` is overridden by the cache FORCE). `libdllm.so` is
-self-contained (ggml statically absorbed, PIC), so `package.sh` only ships
-the binary, `run.sh` and that one .so (the parakeet-cpp-style stub layout;
-no ldd walk yet).
-
-## Known limitations
-
- **Cancel granularity**: the C-ABI cancel flag is per-ctx and resets on
-  every generate entry, so a Cancel racing a NEW generate can be lost, and
-  with requests queued on the worker it aborts whichever generate is
-  currently running (acceptable: the server de-registers the hook on normal
-  completion, one process serves one model).
- **Throughput**: ~0.15 tok/s on the 26B at default settings (GB10) - every
-  denoise step recomputes the full prompt+canvas. The upstream prefix-KV
-  cache (dllm.cpp P3) is the fix; `kv_cache:on` errors until it lands
-  (`auto`/`off` are accepted no-ops).
- **Repo privacy**: see the note at the top - CI clone of dllm.cpp needs the
-  repo published (or credentials) before the backend images can build.
- Engine spec/validation references: dllm.cpp `docs/validation.md` and
-  LocalAI `docs/superpowers/specs/2026-06-10-dllm-cpp-design.md`.
--- a/.agents/ds4-backend.md
+++ b/.agents/ds4-backend.md
@@ -68,34 +68,6 @@ go test -count=1 -timeout=30m -v ./tests/e2e-backends/...

 CI does not load the model; the suite is opt-in via env vars.

-## Distributed mode
-
-ds4 supports **layer-split** distributed inference (a model too big for one host,
-split by transformer layer; the GGUF must be present on every machine, each loads
-only its slice). Topology is **inverted** vs llama.cpp: the coordinator listens,
-workers dial in.
-
- **`ds4-worker` binary**: built and packaged next to `grpc-server` (`package.sh`
-  copies it into `package/`). Links the same engine objects plus `ds4_distributed.o`;
-  **no gRPC/protobuf dependency** (speaks ds4's own TCP transport), so it builds
-  even where `grpc-server` can't. Runs the worker serving loop (`ds4_dist_run`).
- **Coordinator wiring**: the ds4 `grpc-server` acts as coordinator when `LoadModel`
-  `ModelOptions.Options` (from model-YAML `options:`) carry:
-  - `ds4_role:coordinator` (enables distributed mode; absent → single-node, back-compat)
-  - `ds4_layers:0:19` (coordinator's own slice, inclusive; `N:output` includes the head)
-  - `ds4_listen:0.0.0.0:1234` (address workers dial into)
-  - `ds4_route_timeout:60` (optional; seconds Predict/PredictStream wait for the route
-    to form before returning gRPC `UNAVAILABLE`; default 60)
- **Worker CLI**: `local-ai worker ds4-distributed -- <ds4-worker args>` resolves the
-  ds4 backend and execs the packaged `ds4-worker` (raw passthrough), e.g.
-  `--role worker --model /models/ds4flash.gguf --layers 20:output --coordinator <host> 1234`.
-
-Opt-in e2e in `tests/e2e-backends/backend_test.go`, gated by
-`BACKEND_TEST_DS4_DISTRIBUTED=1` (plus `BACKEND_TEST_DS4_WORKER_BINARY`,
-`BACKEND_TEST_DS4_WORKER_LAYERS`, `BACKEND_TEST_DS4_COORDINATOR_LAYERS`,
-`BACKEND_TEST_DS4_LISTEN`). Design spec:
-`docs/superpowers/specs/2026-05-30-ds4-distributed-inference-design.md`.
-
 ## Importer

 `core/gallery/importers/ds4.go` (`DS4Importer`) auto-detects ds4 weights by
--- a/.dockerignore
+++ b/.dockerignore
@@ -4,7 +4,6 @@
 .devcontainer
 models
 backends
-volumes
 examples/chatbot-ui/models
 backend/go/image/stablediffusion-ggml/build/
 backend/go/*/build
@@ -22,27 +21,3 @@ __pycache__
 # backend virtual environments
 **/venv
 backend/python/**/source
-
-# In-place llama.cpp clone + per-variant build copies. The Makefile
-# clones llama.cpp itself at the pinned LLAMA_VERSION; if a stale
-# local checkout is COPY'd into the image, the `llama.cpp:` target
-# sees the directory and skips re-cloning, so grpc-server.cpp ends
-# up compiled against whatever (likely older) commit the host had.
-backend/cpp/llama-cpp/llama.cpp
-backend/cpp/llama-cpp-*-build
-
-# Rust backend build output (sources are tracked; target/ is generated)
-backend/rust/*/target
-
-# Local-only artifacts that bloat the build context but the image never needs.
-# Saved image tarballs, locally-installed backends, the host-built binary, and
-# assorted tool/scratch dirs. None of these are git-tracked.
-backend-images
-local-backends
-local-ai
-.crush
-protoc
-tests
-
-# Installed via npm inside the build stage; no need to ship the host copy.
-**/node_modules
--- a/.githooks/pre-commit
+++ b/.githooks/pre-commit
@@ -1,60 +0,0 @@
-#!/usr/bin/env sh
-#
-# LocalAI pre-commit hook. Install it (once per clone) with:
-#
-#     make install-hooks
-#
-# Runs only the checks relevant to what's staged:
-#   - Go files          -> make lint + make test-coverage-check
-#   - core/http/react-ui -> make test-ui-coverage-check (Playwright e2e + gate)
-# A commit touching neither is skipped entirely (docs/YAML/etc. can't change
-# lint findings, Go coverage, or the UI).
-#
-# To bypass for a single commit (e.g. a WIP checkpoint): git commit --no-verify
-set -eu
-
-repo_root="$(git rev-parse --show-toplevel)"
-cd "$repo_root"
-
-staged="$(git diff --cached --name-only --diff-filter=ACMRD)"
-
-go_changed=0
-ui_changed=0
-if echo "$staged" | grep -qE '\.go$'; then go_changed=1; fi
-if echo "$staged" | grep -qE '^core/http/react-ui/'; then ui_changed=1; fi
-
-if [ "$go_changed" -eq 0 ] && [ "$ui_changed" -eq 0 ]; then
-	echo "pre-commit: no Go or React UI changes staged — skipping."
-	exit 0
-fi
-
-if [ "$go_changed" -eq 1 ]; then
-	# Resolve the ref golangci-lint's new-from-merge-base should compare
-	# against. .golangci.yml pins origin/master, which is correct in CI
-	# (origin == the canonical repo) but wrong from a fork clone, where
-	# origin/master lags behind and lint would report the whole upstream
-	# backlog. Prefer upstream/master, then origin/master, then master.
-	lint_base=""
-	for ref in upstream/master origin/master master; do
-		if git rev-parse --verify --quiet "${ref}^{commit}" >/dev/null 2>&1; then
-			lint_base="$ref"
-			break
-		fi
-	done
-
-	echo "pre-commit ▶ golangci-lint (make lint${lint_base:+, new-from $lint_base})"
-	make lint LINT_NEW_FROM="$lint_base"
-
-	echo "pre-commit ▶ coverage gate (make test-coverage-check) — builds and runs the"
-	echo "             pkg/core suites plus tests/e2e; can take a few minutes."
-	make test-coverage-check
-fi
-
-if [ "$ui_changed" -eq 1 ]; then
-	echo "pre-commit ▶ React UI e2e + coverage gate (make test-ui-coverage-check) —"
-	echo "             rebuilds the UI + ui-test-server, runs the Playwright specs, and"
-	echo "             fails if line coverage regressed; can take a couple of minutes."
-	make test-ui-coverage-check
-fi
-
-echo "pre-commit ✓ all relevant checks passed"
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -690,19 +690,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
-  - build-type: 'cublas'
-    cuda-major-version: "12"
-    cuda-minor-version: "8"
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-nvidia-cuda-12-rfdetr-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "rfdetr-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "8"
@@ -716,32 +703,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
-  - build-type: 'cublas'
-    cuda-major-version: "12"
-    cuda-minor-version: "8"
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-nvidia-cuda-12-crispasr'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "crispasr"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: 'cublas'
-    cuda-major-version: "12"
-    cuda-minor-version: "8"
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-nvidia-cuda-12-parakeet-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "parakeet-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "8"
@@ -1530,19 +1491,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
-  - build-type: 'cublas'
-    cuda-major-version: "13"
-    cuda-minor-version: "0"
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-nvidia-cuda-13-rfdetr-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "rfdetr-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -1556,19 +1504,6 @@ include:
    backend: "sam3-cpp"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
-  - build-type: 'cublas'
-    cuda-major-version: "13"
-    cuda-minor-version: "0"
-    platforms: 'linux/arm64'
-    skip-drivers: 'false'
-    tag-latest: 'auto'
-    tag-suffix: '-nvidia-l4t-cuda-13-arm64-rfdetr-cpp'
-    base-image: "ubuntu:24.04"
-    ubuntu-version: '2404'
-    runs-on: 'ubuntu-24.04-arm'
-    backend: "rfdetr-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -1582,45 +1517,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
-  - build-type: 'cublas'
-    cuda-major-version: "13"
-    cuda-minor-version: "0"
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-nvidia-cuda-13-crispasr'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "crispasr"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: 'cublas'
-    cuda-major-version: "13"
-    cuda-minor-version: "0"
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-nvidia-cuda-13-parakeet-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "parakeet-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: 'cublas'
-    cuda-major-version: "13"
-    cuda-minor-version: "0"
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-nvidia-cuda-13-dllm'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "dllm"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -1634,45 +1530,6 @@ include:
    backend: "whisper"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
-  - build-type: 'cublas'
-    cuda-major-version: "13"
-    cuda-minor-version: "0"
-    platforms: 'linux/arm64'
-    skip-drivers: 'false'
-    tag-latest: 'auto'
-    tag-suffix: '-nvidia-l4t-cuda-13-arm64-crispasr'
-    base-image: "ubuntu:24.04"
-    ubuntu-version: '2404'
-    runs-on: 'ubuntu-24.04-arm'
-    backend: "crispasr"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-  - build-type: 'cublas'
-    cuda-major-version: "13"
-    cuda-minor-version: "0"
-    platforms: 'linux/arm64'
-    skip-drivers: 'false'
-    tag-latest: 'auto'
-    tag-suffix: '-nvidia-l4t-cuda-13-arm64-parakeet-cpp'
-    base-image: "ubuntu:24.04"
-    ubuntu-version: '2404'
-    runs-on: 'ubuntu-24.04-arm'
-    backend: "parakeet-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-  - build-type: 'cublas'
-    cuda-major-version: "13"
-    cuda-minor-version: "0"
-    platforms: 'linux/arm64'
-    skip-drivers: 'false'
-    tag-latest: 'auto'
-    tag-suffix: '-nvidia-l4t-cuda-13-arm64-dllm'
-    base-image: "ubuntu:24.04"
-    ubuntu-version: '2404'
-    runs-on: 'ubuntu-24.04-arm'
-    backend: "dllm"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -1792,6 +1649,20 @@ include:
    dockerfile: "./backend/Dockerfile.llama-cpp"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'hipblas'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-rocm-hipblas-turboquant'
+    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-rocm-amd64'
+    runs-on: 'ubuntu-latest'
+    base-image: "rocm/dev-ubuntu-24.04:7.2.1"
+    skip-drivers: 'false'
+    backend: "turboquant"
+    dockerfile: "./backend/Dockerfile.turboquant"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'hipblas'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2764,74 +2635,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
-  # rfdetr-cpp
-  - build-type: ''
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-cpu-rfdetr-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "rfdetr-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: 'sycl_f32'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-intel-sycl-f32-rfdetr-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
-    skip-drivers: 'false'
-    backend: "rfdetr-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: 'sycl_f16'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-intel-sycl-f16-rfdetr-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
-    skip-drivers: 'false'
-    backend: "rfdetr-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: 'vulkan'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    platform-tag: 'amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-vulkan-rfdetr-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "rfdetr-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: 'vulkan'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/arm64'
-    platform-tag: 'arm64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-vulkan-rfdetr-cpp'
-    runs-on: 'ubuntu-24.04-arm'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "rfdetr-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
  - build-type: 'sycl_f32'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2912,19 +2715,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2204'
-  - build-type: 'cublas'
-    cuda-major-version: "12"
-    cuda-minor-version: "0"
-    platforms: 'linux/arm64'
-    skip-drivers: 'false'
-    tag-latest: 'auto'
-    tag-suffix: '-nvidia-l4t-arm64-rfdetr-cpp'
-    base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
-    runs-on: 'ubuntu-24.04-arm'
-    backend: "rfdetr-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2204'
  # whisper
  - build-type: ''
    cuda-major-version: ""
@@ -2940,20 +2730,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
-  - build-type: ''
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    platform-tag: 'amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-cpu-crispasr'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "crispasr"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2968,20 +2744,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
-  - build-type: ''
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/arm64'
-    platform-tag: 'arm64'
-    tag-latest: 'auto'
-    tag-suffix: '-cpu-crispasr'
-    runs-on: 'ubuntu-24.04-arm'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "crispasr"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
  - build-type: 'sycl_f32'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2995,19 +2757,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
-  - build-type: 'sycl_f32'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-intel-sycl-f32-crispasr'
-    runs-on: 'ubuntu-latest'
-    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
-    skip-drivers: 'false'
-    backend: "crispasr"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
  - build-type: 'sycl_f16'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -3021,19 +2770,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
-  - build-type: 'sycl_f16'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-intel-sycl-f16-crispasr'
-    runs-on: 'ubuntu-latest'
-    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
-    skip-drivers: 'false'
-    backend: "crispasr"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -3048,20 +2784,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
-  - build-type: 'vulkan'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    platform-tag: 'amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-vulkan-crispasr'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "crispasr"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -3076,20 +2798,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
-  - build-type: 'vulkan'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/arm64'
-    platform-tag: 'arm64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-vulkan-crispasr'
-    runs-on: 'ubuntu-24.04-arm'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "crispasr"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "0"
@@ -3103,19 +2811,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2204'
-  - build-type: 'cublas'
-    cuda-major-version: "12"
-    cuda-minor-version: "0"
-    platforms: 'linux/arm64'
-    skip-drivers: 'false'
-    tag-latest: 'auto'
-    tag-suffix: '-nvidia-l4t-arm64-crispasr'
-    base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
-    runs-on: 'ubuntu-24.04-arm'
-    backend: "crispasr"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2204'
  - build-type: 'hipblas'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -3129,157 +2824,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
-  - build-type: 'hipblas'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-rocm-hipblas-crispasr'
-    base-image: "rocm/dev-ubuntu-24.04:7.2.1"
-    runs-on: 'ubuntu-latest'
-    skip-drivers: 'false'
-    backend: "crispasr"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  # parakeet-cpp
-  - build-type: ''
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    platform-tag: 'amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-cpu-parakeet-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "parakeet-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: ''
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/arm64'
-    platform-tag: 'arm64'
-    tag-latest: 'auto'
-    tag-suffix: '-cpu-parakeet-cpp'
-    runs-on: 'ubuntu-24.04-arm'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "parakeet-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  # dllm
-  - build-type: ''
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    platform-tag: 'amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-cpu-dllm'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "dllm"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: ''
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/arm64'
-    platform-tag: 'arm64'
-    tag-latest: 'auto'
-    tag-suffix: '-cpu-dllm'
-    runs-on: 'ubuntu-24.04-arm'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "dllm"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: 'sycl_f32'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-intel-sycl-f32-parakeet-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
-    skip-drivers: 'false'
-    backend: "parakeet-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: 'sycl_f16'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-intel-sycl-f16-parakeet-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
-    skip-drivers: 'false'
-    backend: "parakeet-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: 'vulkan'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    platform-tag: 'amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-vulkan-parakeet-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "parakeet-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: 'vulkan'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/arm64'
-    platform-tag: 'arm64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-vulkan-parakeet-cpp'
-    runs-on: 'ubuntu-24.04-arm'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "parakeet-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: 'cublas'
-    cuda-major-version: "12"
-    cuda-minor-version: "0"
-    platforms: 'linux/arm64'
-    skip-drivers: 'false'
-    tag-latest: 'auto'
-    tag-suffix: '-nvidia-l4t-arm64-parakeet-cpp'
-    base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
-    runs-on: 'ubuntu-24.04-arm'
-    backend: "parakeet-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2204'
-  - build-type: 'hipblas'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-rocm-hipblas-parakeet-cpp'
-    base-image: "rocm/dev-ubuntu-24.04:7.2.1"
-    runs-on: 'ubuntu-latest'
-    skip-drivers: 'false'
-    backend: "parakeet-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
  # acestep-cpp
  - build-type: ''
    cuda-major-version: ""
@@ -4312,14 +3856,6 @@ includeDarwin:
    tag-suffix: "-metal-darwin-arm64-whisper"
    build-type: "metal"
    lang: "go"
-  - backend: "crispasr"
-    tag-suffix: "-metal-darwin-arm64-crispasr"
-    build-type: "metal"
-    lang: "go"
-  - backend: "parakeet-cpp"
-    tag-suffix: "-metal-darwin-arm64-parakeet-cpp"
-    build-type: "metal"
-    lang: "go"
  - backend: "acestep-cpp"
    tag-suffix: "-metal-darwin-arm64-acestep-cpp"
    build-type: "metal"
--- a/.github/gallery-agent/main.go
+++ b/.github/gallery-agent/main.go
@@ -3,7 +3,6 @@ package main
 import (
 	"context"
 	"encoding/json"
-	"errors"
 	"fmt"
 	"os"
 	"strconv"
@@ -114,17 +113,6 @@ func main() {
 	fmt.Println("Searching for trending models on HuggingFace...")
 	rawModels, err := client.GetTrending(searchTerm, limit)
 	if err != nil {
-		if errors.Is(err, hfapi.ErrRateLimited) {
-			fmt.Printf("HuggingFace API is rate limited after retries, skipping this run: %v\n", err)
-			writeSummary(AddedModelSummary{
-				SearchTerm:     searchTerm,
-				TotalFound:     0,
-				ModelsAdded:    0,
-				Quantization:   quantization,
-				ProcessingTime: time.Since(startTime).String(),
-			})
-			return
-		}
 		fmt.Fprintf(os.Stderr, "Error fetching models: %v\n", err)
 		os.Exit(1)
 	}
@@ -289,3 +277,4 @@ func truncateString(s string, maxLen int) string {
 	}
 	return s[:maxLen] + "..."
 }
+
--- a/.github/workflows/backend_merge.yml
+++ b/.github/workflows/backend_merge.yml
@@ -40,11 +40,6 @@ jobs:
      id-token: write
    env:
      quay_username: ${{ secrets.quayUsername }}
-      # cosign v2.4.x still gates --registry-referrers-mode=oci-1-1 behind
-      # this flag. Without it, signing fails with:
-      #   invalid argument "oci-1-1" for "--registry-referrers-mode" flag:
-      #   in order to use mode "oci-1-1", you must set COSIGN_EXPERIMENTAL=1
-      COSIGN_EXPERIMENTAL: '1'
    steps:
      # Sparse checkout: the merge job needs `.github/scripts/` (for the
      # keepalive cleanup script) but none of the source tree.
@@ -71,8 +66,7 @@ jobs:

      # cosign signs each pushed manifest list with --recursive so the
      # index and every per-arch entry get an attached Sigstore bundle.
-      # Recent cosign releases always emit the new bundle format, so
-      # there's no extra CLI flag to opt into it.
+      # 2.2+ is required for --new-bundle-format.
      - name: Install cosign
        if: github.event_name != 'pull_request'
        uses: sigstore/cosign-installer@v3
@@ -159,6 +153,7 @@ jobs:
          # manifest before checking signatures need the per-arch
          # signatures, not just the list-level one.
          cosign sign --yes --recursive \
+            --new-bundle-format \
            --registry-referrers-mode=oci-1-1 \
            "quay.io/go-skynet/local-ai-backends@${digest}"

@@ -185,6 +180,7 @@ jobs:
          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
          cosign sign --yes --recursive \
+            --new-bundle-format \
            --registry-referrers-mode=oci-1-1 \
            "localai/localai-backends@${digest}"

--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -30,18 +30,6 @@ jobs:
            variable: "WHISPER_CPP_VERSION"
            branch: "master"
            file: "backend/go/whisper/Makefile"
-          - repository: "CrispStrobe/CrispASR"
-            variable: "CRISPASR_VERSION"
-            branch: "main"
-            file: "backend/go/crispasr/Makefile"
-          - repository: "mudler/parakeet.cpp"
-            variable: "PARAKEET_VERSION"
-            branch: "master"
-            file: "backend/go/parakeet-cpp/Makefile"
-          - repository: "mudler/dllm.cpp"
-            variable: "DLLM_VERSION"
-            branch: "main"
-            file: "backend/go/dllm/Makefile"
          - repository: "leejet/stable-diffusion.cpp"
            variable: "STABLEDIFFUSION_GGML_VERSION"
            branch: "master"
@@ -62,10 +50,6 @@ jobs:
            variable: "SAM3_VERSION"
            branch: "main"
            file: "backend/go/sam3-cpp/Makefile"
-          - repository: "mudler/rf-detr.cpp"
-            variable: "RFDETR_VERSION"
-            branch: "main"
-            file: "backend/go/rfdetr-cpp/Makefile"
          - repository: "predict-woo/qwen3-tts.cpp"
            variable: "QWEN3TTS_CPP_VERSION"
            branch: "main"
--- a/.github/workflows/image_build.yml
+++ b/.github/workflows/image_build.yml
@@ -106,7 +106,6 @@ jobs:
            type=ref,event=branch
            type=semver,pattern={{raw}}
            type=sha
-            type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
          flavor: |
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.tag-suffix }},onlatest=true
--- a/.github/workflows/image_merge.yml
+++ b/.github/workflows/image_merge.yml
@@ -80,7 +80,6 @@ jobs:
            type=ref,event=branch
            type=semver,pattern={{raw}}
            type=sha
-            type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
          flavor: |
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.tag-suffix }},onlatest=true
--- a/.github/workflows/secscan.yaml
+++ b/.github/workflows/secscan.yaml
@@ -18,7 +18,7 @@ jobs:
        if: ${{ github.actor != 'dependabot[bot]' }}
      - name: Run Gosec Security Scanner
        if: ${{ github.actor != 'dependabot[bot]' }}
-        uses: securego/gosec@v2.27.1
+        uses: securego/gosec@v2.22.9
        with:
          # we let the report trigger content trigger a failure using the GitHub Security features.
          args: '-no-fail -fmt sarif -out results.sarif ./...'
--- a/.github/workflows/stalebot.yml
+++ b/.github/workflows/stalebot.yml
@@ -11,7 +11,7 @@ jobs:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/stale@eb5cf3af3ac0a1aa4c9c45633dd1ae542a27a899 # v9
+      - uses: actions/stale@b5d41d4e1d5dceea10e7104786b73624c18a190f # v9
        with:
          stale-issue-message: 'This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.'
          stale-pr-message: 'This PR is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 10 days.'
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
@@ -37,7 +37,6 @@ jobs:
      sglang: ${{ steps.detect.outputs.sglang }}
      acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
      qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
-      rfdetr-cpp: ${{ steps.detect.outputs.rfdetr-cpp }}
      vibevoice-cpp: ${{ steps.detect.outputs.vibevoice-cpp }}
      localvqe: ${{ steps.detect.outputs.localvqe }}
      voxtral: ${{ steps.detect.outputs.voxtral }}
@@ -46,7 +45,6 @@ jobs:
      speaker-recognition: ${{ steps.detect.outputs.speaker-recognition }}
      sherpa-onnx: ${{ steps.detect.outputs.sherpa-onnx }}
      whisper: ${{ steps.detect.outputs.whisper }}
-      parakeet-cpp: ${{ steps.detect.outputs.parakeet-cpp }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6
@@ -634,26 +632,6 @@ jobs:
      - name: Build whisper backend image and run transcription gRPC e2e tests
        run: |
          make test-extra-backend-whisper-transcription
-  # Parakeet ASR via the parakeet-cpp backend (C++/ggml port of NeMo
-  # Parakeet). Drives AudioTranscription (offline, with word timestamps) on
-  # tdt_ctc-110m + the JFK 11s clip.
-  tests-parakeet-cpp-grpc-transcription:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.parakeet-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    timeout-minutes: 90
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: '1.25.4'
-      - name: Build parakeet-cpp backend image and run transcription gRPC e2e tests
-        run: |
-          make test-extra-backend-parakeet-cpp-transcription
  # VITS TTS via the sherpa-onnx backend. Drives both TTS (file write) and
  # TTSStream (PCM chunks) on the e2e-backends harness.
  tests-sherpa-onnx-grpc-tts:
@@ -865,42 +843,6 @@ jobs:
      - name: Test qwen3-tts-cpp
        run: |
          make --jobs=5 --output-sync=target -C backend/go/qwen3-tts-cpp test
-  # Per-backend smoke for rfdetr-cpp: builds the .so + Go binary and runs
-  # `make -C backend/go/rfdetr-cpp test`. test.sh fetches the small (~20 MB)
-  # rfdetr-nano-q8_0 GGUF from the published mudler/rfdetr-cpp-nano HF repo
-  # via curl and synthesises a tiny PNG to exercise the wire protocol.
-  tests-rfdetr-cpp:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.rfdetr-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y build-essential cmake curl libopenblas-dev
-      - name: Setup Go
-        uses: actions/setup-go@v5
-      - name: Display Go version
-        run: go version
-      - name: Proto Dependencies
-        run: |
-          # Install protoc
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Build rfdetr-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/rfdetr-cpp
-      - name: Test rfdetr-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/rfdetr-cpp test
  # Per-backend smoke for vibevoice-cpp: builds the .so + Go binary and
  # runs `make -C backend/go/vibevoice-cpp test`. test.sh auto-downloads
  # the published mudler/vibevoice.cpp-models bundle (TTS Q8_0 + ASR Q4_K
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -53,22 +53,9 @@ jobs:
          node-version: '22'
      - name: Build React UI
        run: make react-ui
-      # Runs the core suite with coverage and fails if total coverage dropped
-      # below the committed baseline (coverage-baseline.txt). The gate is
-      # strict — any decrease fails. Raise the baseline with
-      # `make test-coverage-baseline` and commit it when coverage rises.
-      - name: Test (with coverage gate)
+      - name: Test
        run: |
-          PATH="$PATH:/root/go/bin" make --jobs 5 --output-sync=target test-coverage-check
-      - name: Upload coverage report
-        if: ${{ always() }}
-        uses: actions/upload-artifact@v4
-        with:
-          name: coverage-linux
-          path: |
-            coverage/coverage.out
-            coverage/coverage.html
-          if-no-files-found: ignore
+          PATH="$PATH:/root/go/bin" make --jobs 5 --output-sync=target test
      - name: Setup tmate session if tests fail
        if: ${{ failure() }}
        uses: mxschmitt/action-tmate@v3.23
--- a/.github/workflows/tests-ui-e2e.yml
+++ b/.github/workflows/tests-ui-e2e.yml
@@ -37,10 +37,6 @@ jobs:
        uses: actions/setup-node@v6
        with:
          node-version: '22'
-      - name: Setup Bun
-        uses: oven-sh/setup-bun@v2
-        with:
-          bun-version: '1.3.11'
      - name: Proto Dependencies
        run: |
          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
@@ -52,12 +48,16 @@ jobs:
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential libopus-dev
-      # Builds an instrumented UI bundle, runs the Playwright specs, and fails
-      # if line coverage regressed beyond the jitter tolerance (the gate is
-      # in `make test-ui-coverage-check`). PLAYWRIGHT_CHROMIUM_PATH is unset
-      # here, so scripts/ensure-playwright-browser.sh installs Chromium via apt.
-      - name: Run UI e2e + coverage gate
-        run: PATH="$PATH:$HOME/go/bin" make test-ui-coverage-check
+      - name: Build UI test server
+        run: PATH="$PATH:$HOME/go/bin" make build-ui-test-server
+      - name: Install Playwright
+        working-directory: core/http/react-ui
+        run: |
+          npm install
+          npx playwright install --with-deps chromium
+      - name: Run Playwright tests
+        working-directory: core/http/react-ui
+        run: npx playwright test
      - name: Upload Playwright report
        if: ${{ failure() }}
        uses: actions/upload-artifact@v7
@@ -65,14 +65,6 @@ jobs:
          name: playwright-report
          path: core/http/react-ui/playwright-report/
          retention-days: 7
-      - name: Upload UI coverage report
-        if: ${{ always() }}
-        uses: actions/upload-artifact@v7
-        with:
-          name: ui-coverage
-          path: core/http/react-ui/coverage/
-          if-no-files-found: ignore
-          retention-days: 7
      - name: Setup tmate session if tests fail
        if: ${{ failure() }}
        uses: mxschmitt/action-tmate@v3.23
--- a/.gitignore
+++ b/.gitignore
@@ -26,10 +26,6 @@ go-bert
 LocalAI
 /local-ai
 /local-ai-launcher
-# Root-level build artifacts when running `go build ./...` against
-# Go backend packages whose main lives under backend/go/.
-/cloud-proxy
-/local-store
 # prevent above rules from omitting the helm chart
 !charts/*
 # prevent above rules from omitting the api/localai folder
@@ -70,17 +66,10 @@ docs/static/gallery.html
 # per-developer customization files for the development container
 .devcontainer/customization/*

-# Coverage profiles (the committed baseline is coverage-baseline.txt)
-/coverage/
-
 # React UI build artifacts (keep placeholder dist/index.html)
 core/http/react-ui/node_modules/
 core/http/react-ui/dist

-# React UI coverage (vite-plugin-istanbul + nyc, via `make test-ui-coverage`)
-core/http/react-ui/.nyc_output/
-core/http/react-ui/coverage/
-
 # Extracted backend binaries for container-based testing
 local-backends/

@@ -88,6 +77,3 @@ local-backends/
 tests/e2e-ui/ui-test-server
 core/http/react-ui/playwright-report/
 core/http/react-ui/test-results/
-
-# Local worktrees
-.worktrees/
--- a/.golangci.yml
+++ b/.golangci.yml
@@ -56,20 +56,6 @@ linters:
        # are exempt — see linters.exclusions.rules below.
        - pattern: '^os\.(Getenv|LookupEnv|Environ)$'
          msg: 'Plumb config through ApplicationConfig (or the relevant CLI struct) instead of reading env directly. CLI entry points (core/cli/) bind env vars via kong''s `env:` tag — that is the only sanctioned env→struct boundary. See .agents/coding-style.md.'
-        # Outbound HTTP must go through pkg/httpclient, which refuses redirects
-        # by default and sets a TLS floor. The std-library default client and
-        # the http.Get/Post/... convenience helpers follow redirects (up to 10)
-        # and, on a cross-host redirect, forward custom credential headers such
-        # as Anthropic's x-api-key to the redirect target — leaking the secret
-        # (GHSA-3mj3-57v2-4636). forbidigo can't precisely match the
-        # `&http.Client{}` composite literal without also flagging legitimate
-        # `*http.Client` type references, so that form is enforced by
-        # convention + review; these two patterns catch the implicit-default
-        # client, which is the common footgun.
-        - pattern: '^http\.DefaultClient$'
-          msg: 'Use pkg/httpclient (httpclient.New / NewWithTimeout) instead of http.DefaultClient — the std client follows redirects and leaks credential headers cross-host (GHSA-3mj3-57v2-4636). See .agents/coding-style.md.'
-        - pattern: '^http\.(Get|Post|PostForm|Head)$'
-          msg: 'Use pkg/httpclient (httpclient.New / NewWithTimeout) instead of http.Get/Post/PostForm/Head — these use http.DefaultClient, which follows redirects and leaks credential headers cross-host (GHSA-3mj3-57v2-4636). See .agents/coding-style.md.'
  exclusions:
    paths:
      # Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
@@ -109,18 +95,3 @@ linters:
      - path: _test\.go$
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
-      # pkg/httpclient is the sanctioned home for outbound HTTP clients; it
-      # necessarily references net/http directly.
-      - path: ^pkg/httpclient/
-        text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
-        linters: [forbidigo]
-      # Tests drive local httptest servers where redirect/TLS hardening is
-      # irrelevant; the std client is fine there.
-      - path: _test\.go$
-        text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
-        linters: [forbidigo]
-      # Vendored upstream whisper.cpp Go bindings are a separate module and
-      # cannot import pkg/httpclient.
-      - path: ^backend/go/whisper/sources/
-        text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
-        linters: [forbidigo]
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -26,7 +26,6 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 | [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
 | [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
 | [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
-| [.agents/dllm-backend.md](.agents/dllm-backend.md) | Working on the dllm backend (DiffusionGemma block-diffusion) - purego C-ABI binding, per-ctx serialization contract, gemma4 renderer/parser, gated test layers |
 | [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
 | [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
 | [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
@@ -36,7 +35,6 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]

 ## Quick Reference

-- **Git hooks & coverage gates**: Run `make install-hooks` once per clone so the pre-commit lint + coverage gates run. **Never bypass them with `git commit --no-verify`, and never lower a coverage baseline or widen a gate's tolerance to turn a red gate green** — the coverage ratchet only moves up. If a change drops coverage, add tests to raise it (e.g. render-smoke specs). See [.agents/building-and-testing.md](.agents/building-and-testing.md).
 - **Logging**: Use `github.com/mudler/xlog` (same API as slog)
 - **Go style**: Prefer `any` over `interface{}`
 - **Comments**: Explain *why*, not *what*
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -198,7 +198,6 @@ For AI-assisted development, see [`AGENTS.md`](AGENTS.md) (or the equivalent [`C

 - Prefer modern Go idioms — for example, use `any` instead of `interface{}`.
 - Use [`golangci-lint`](https://golangci-lint.run) to catch common issues before submitting a PR.
-- Run `make install-hooks` once per clone to enable the pre-commit hook: Go changes run `make lint` + the coverage gate (`make test-coverage-check`); `core/http/react-ui/` changes run the Playwright e2e suite (`make test-ui`). Bypass a single commit with `git commit --no-verify`.
 - Use [`github.com/mudler/xlog`](https://github.com/mudler/xlog) for logging (same API as `slog`). Do not use `fmt.Println` or the standard `log` package for operational logging.
 - Use tab indentation for Go files (as defined in `.editorconfig`).

@@ -266,12 +265,6 @@ The e2e tests run LocalAI in a Docker container and exercise the API:
 make test-e2e
 ```

-### React UI tests and coverage
-
-The React UI (`core/http/react-ui/`) is covered by Playwright e2e specs, gated by a **monotonic line-coverage ratchet** (`make test-ui-coverage-check`, run in CI and pre-commit). The metric is non-deterministic — a fast local box reads higher than a slow CI runner for the same code — so a small tolerance is unavoidable.
-
-**If your change lowers UI coverage, raise it back by adding specs — do not widen the tolerance or hand-lower the baseline.** A *render-smoke* spec (navigate to a page, assert its header is visible) cheaply covers an entire lazy page. See `core/http/react-ui/e2e/page-render-smoke.spec.js` and the full policy in [.agents/building-and-testing.md](.agents/building-and-testing.md#react-ui-coverage).
-
 ### Running E2E container tests

 These tests build a standard LocalAI Docker image and run it with pre-configured model configs to verify that most endpoints work correctly:
--- a/Makefile
+++ b/Makefile
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/dllm backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio

 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -69,41 +69,10 @@ else
 	GORELEASER=$(shell which goreleaser)
 endif

-TEST_PATHS?=./api/... ./pkg/... ./core/... ./backend/go/cloud-proxy/... ./backend/go/local-store/...
-
-## Coverage output and the committed baseline that CI compares against.
-## The gate is strict: total coverage must never decrease (no tolerance).
-## covermode=atomic makes line coverage deterministic regardless of test
-## ordering or flake retries, so there is no run-to-run jitter to absorb.
-COVERAGE_DIR?=$(abspath ./coverage)
-COVERAGE_PROFILE?=$(COVERAGE_DIR)/coverage.out
-COVERAGE_BASELINE?=coverage-baseline.txt
-## Coverage is collected one recursive root at a time and merged (see
-## scripts/run-coverage.sh): passing several recursive roots to a single
-## ginkgo invocation only keeps one root's coverprofile. Mirrors TEST_PATHS
-## minus ./api (which doesn't exist).
-COVERAGE_ROOTS?=./pkg ./core
-## Build tags for the coverage build. `auth` is required to compile the real
-## auth implementation and its ~150 `//go:build auth` tests (otherwise they're
-## invisible and the gate scores auth against a stub). `debug` matches `test`.
-COVERAGE_TAGS?=debug auth
-## Coverage is attributed to these packages via --coverpkg, so the in-process
-## integration suites (COVERAGE_E2E_ROOTS) credit the core/http handlers they
-## drive over HTTP — not just their own test package.
-COVERAGE_COVERPKG?=github.com/mudler/LocalAI/core/...,github.com/mudler/LocalAI/pkg/...
-## In-process integration suites folded into coverage. Run non-recursively
-## (excludes tests/e2e/distributed, which needs containers) with the mock
-## backend built by prepare-test. real-models specs need a downloaded model,
-## so they're filtered out. NOTE: tests/integration is intentionally NOT here —
-## it needs the local-store backend built (`make backends/local-store`), which
-## the coverage CI job doesn't do.
-COVERAGE_E2E_ROOTS?=./tests/e2e
-COVERAGE_E2E_LABELS?=!real-models
-## Drop generated protobuf from the denominator (it has no tests by design).
-COVERAGE_EXCLUDE_RE?=grpc/proto/.*[.]pb[.]go
+TEST_PATHS?=./api/... ./pkg/... ./core/...


-.PHONY: all test test-coverage test-coverage-baseline test-coverage-check test-ui test-ui-coverage-baseline test-ui-coverage-check install-hooks build vendor lint lint-all
+.PHONY: all test build vendor lint lint-all

 all: help

@@ -180,7 +149,7 @@ osx-signed: build

 ## Run
 run: ## run local-ai
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./cmd/local-ai
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./

 prepare-test: protogen-go build-mock-backend

@@ -201,36 +170,6 @@ test: prepare-test
 	OPUS_SHIM_LIBRARY=$(abspath ./pkg/opus/shim/libopusshim.so) \
 	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) --fail-fast -v -r $(TEST_PATHS)

-## Runs the core suite ($(TEST_PATHS)) with statement-coverage instrumentation
-## and writes a merged profile to $(COVERAGE_PROFILE). Deliberately omits
-## --fail-fast so a single failure doesn't truncate the coverage number, and
-## uses covermode=atomic so the result is deterministic. Prints the total.
-test-coverage: prepare-test
-	@echo 'Running tests with coverage'
-	GINKGO_TAGS="$(COVERAGE_TAGS)" \
-	COVERAGE_COVERPKG="$(COVERAGE_COVERPKG)" \
-	COVERAGE_E2E_ROOTS="$(COVERAGE_E2E_ROOTS)" \
-	COVERAGE_E2E_LABELS="$(COVERAGE_E2E_LABELS)" \
-	COVERAGE_EXCLUDE_RE='$(COVERAGE_EXCLUDE_RE)' \
-	OPUS_SHIM_LIBRARY=$(abspath ./pkg/opus/shim/libopusshim.so) \
-	scripts/run-coverage.sh $(COVERAGE_DIR) $(COVERAGE_PROFILE) $(TEST_FLAKES) $(COVERAGE_ROOTS)
-	@$(GOCMD) tool cover -html=$(COVERAGE_PROFILE) -o $(COVERAGE_DIR)/coverage.html
-	@$(GOCMD) tool cover -func=$(COVERAGE_PROFILE) | tail -n1
-
-## Writes the current total coverage to $(COVERAGE_BASELINE). Run this (and
-## commit the result) whenever a change legitimately raises coverage so the
-## ratchet moves up. Never lower it by hand.
-test-coverage-baseline: test-coverage
-	@$(GOCMD) tool cover -func=$(COVERAGE_PROFILE) | awk '/^total:/{gsub(/%/,"",$$NF); print $$NF}' > $(COVERAGE_BASELINE)
-	@echo "Saved coverage baseline: $$(cat $(COVERAGE_BASELINE))%"
-
-## CI gate: fails if total coverage dropped more than COVERAGE_TOLERANCE
-## (default 0.5pp) below the committed baseline. A small tolerance absorbs the
-## run-to-run jitter from the in-process tests/e2e suite folded in via
-## --coverpkg (timing-dependent which handler lines execute).
-test-coverage-check: test-coverage
-	@scripts/coverage-check.sh $(COVERAGE_PROFILE) $(COVERAGE_BASELINE)
-
 ########################################################
 ## Lint
 ########################################################
@@ -246,17 +185,12 @@ test-coverage-check: test-coverage
 ## everything else automatically, so new packages are scanned by default.
 LINT_EXCLUDE_DIRS_RE=/(backend/go/(piper|silero-vad|llm)|cmd/launcher)(/|$$)

-## Set LINT_NEW_FROM to a git ref to override .golangci.yml's
-## new-from-merge-base (origin/master). Useful from a fork clone where
-## origin/master is stale relative to the canonical repo — the pre-commit
-## hook passes the resolved upstream ref here so local lint matches CI.
-LINT_NEW_FROM?=
 lint:
 	@command -v golangci-lint >/dev/null 2>&1 || { \
 		echo 'golangci-lint not installed. Install: go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@latest'; \
 		exit 1; \
 	}
-	golangci-lint run $(if $(LINT_NEW_FROM),--new-from-merge-base=$(LINT_NEW_FROM),) $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')
+	golangci-lint run $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')

 ## Like `lint` but reports every issue, including the pre-existing baseline
 ## that `lint` ignores via .golangci.yml's new-from-merge-base. Use this to
@@ -268,17 +202,6 @@ lint-all:
 	}
 	golangci-lint run --new=false --new-from-merge-base= --new-from-rev= $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')

-########################################################
-## Git hooks
-########################################################
-## Points git at the versioned .githooks/ directory so the pre-commit hook
-## (lint + coverage gate) runs locally. Run once per clone. Undo with:
-## `git config --unset core.hooksPath`. Skip a single commit with
-## `git commit --no-verify`.
-install-hooks:
-	git config core.hooksPath .githooks
-	@echo 'Installed git hooks: core.hooksPath -> .githooks (pre-commit runs lint + test-coverage-check on Go changes)'
-
 ########################################################
 ## E2E AIO tests (uses standard image with pre-configured models)
 ########################################################
@@ -309,20 +232,13 @@ run-e2e-aio: protogen-go
 	@echo 'Running e2e AIO tests'
 	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e-aio

-# Distributed architecture e2e (PostgreSQL + NATS via testcontainers).
-# Includes NatsJWT specs (JWT-enabled NATS). Requires Docker.
-# VLLMMultinode is excluded here; use test-e2e-vllm-multinode for that.
-test-e2e-distributed: protogen-go
-	@echo 'Running distributed e2e tests (label Distributed, incl. NatsJWT)'
-	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter='Distributed && !VLLMMultinode' --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e/distributed
-
 # vLLM multi-node DP smoke (CPU). Builds local-ai:tests and the
 # cpu-vllm backend from the current working tree, then drives a
 # head + headless follower via testcontainers-go and asserts a chat
 # completion. BuildKit caches both images, so re-runs only rebuild
 # what changed. The test lives under tests/e2e/distributed and is
 # selected by the VLLMMultinode label so it doesn't run alongside
-# test-e2e-distributed.
+# the other distributed-suite tests by default.
 test-e2e-vllm-multinode: docker-build-e2e extract-backend-vllm protogen-go
 	@echo 'Running e2e vLLM multi-node DP test'
 	LOCALAI_IMAGE=local-ai \
@@ -352,13 +268,12 @@ prepare-e2e:
 run-e2e-image:
 	docker run -p 5390:8080 -e MODELS_PATH=/models -e THREADS=1 -e DEBUG=true -d --rm -v $(TEST_DIR):/models --name e2e-tests-$(RANDOM) localai-tests

-test-e2e: build-mock-backend build-cloud-proxy-backend prepare-e2e run-e2e-image
+test-e2e: build-mock-backend prepare-e2e run-e2e-image
 	@echo 'Running e2e tests'
 	BUILD_TYPE=$(BUILD_TYPE) \
 	LOCALAI_API=http://$(E2E_BRIDGE_IP):5390 \
 	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e
 	$(MAKE) clean-mock-backend
-	$(MAKE) clean-cloud-proxy-backend
 	$(MAKE) teardown-e2e
 	docker rmi localai-tests

@@ -565,7 +480,6 @@ prepare-test-extra: protogen-python
 	$(MAKE) -C backend/python/insightface
 	$(MAKE) -C backend/python/speaker-recognition
 	$(MAKE) -C backend/rust/kokoros kokoros-grpc
-	$(MAKE) -C backend/go/rfdetr-cpp

 test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/transformers test
@@ -592,7 +506,6 @@ test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/insightface test
 	$(MAKE) -C backend/python/speaker-recognition test
 	$(MAKE) -C backend/rust/kokoros test
-	$(MAKE) -C backend/go/rfdetr-cpp test

 ##
 ## End-to-end gRPC tests that exercise a built backend container image.
@@ -998,19 +911,6 @@ test-extra-backend-whisper-transcription: docker-build-whisper
 	BACKEND_TEST_CAPS=health,load,transcription \
 	$(MAKE) test-extra-backend

-## Audio transcription wrapper for the parakeet-cpp (parakeet.cpp ggml port)
-## backend. Mirrors test-extra-backend-whisper-transcription: drives the
-## AudioTranscription / AudioTranscriptionStream RPCs against a published
-## Parakeet GGUF using the JFK 11s clip from whisper.cpp's CI samples. Not
-## part of the default test suite - run explicitly once the pinned model URL
-## is reachable.
-test-extra-backend-parakeet-cpp-transcription: docker-build-parakeet-cpp
-	BACKEND_IMAGE=local-ai-backend:parakeet-cpp \
-	BACKEND_TEST_MODEL_URL=https://huggingface.co/mudler/parakeet-cpp-gguf/resolve/main/tdt_ctc-110m-f16.gguf \
-	BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
-	BACKEND_TEST_CAPS=health,load,transcription \
-	$(MAKE) test-extra-backend
-
 ## LocalVQE audio transform (joint AEC + noise suppression + dereverb).
 ## Exercises the audio_transform capability end-to-end: batch transform
 ## of a real WAV fixture and bidi streaming of synthetic silent frames.
@@ -1164,16 +1064,10 @@ BACKEND_DS4 = ds4|ds4|.|false|false
 # Golang backends
 BACKEND_PIPER = piper|golang|.|false|true
 BACKEND_LOCAL_STORE = local-store|golang|.|false|true
-BACKEND_CLOUD_PROXY = cloud-proxy|golang|.|false|true
 BACKEND_HUGGINGFACE = huggingface|golang|.|false|true
 BACKEND_SILERO_VAD = silero-vad|golang|.|false|true
 BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|true
 BACKEND_WHISPER = whisper|golang|.|false|true
-BACKEND_CRISPASR = crispasr|golang|.|false|true
-BACKEND_PARAKEET_CPP = parakeet-cpp|golang|.|false|true
-# dllm is mudler/dllm.cpp, the DiffusionGemma block-diffusion engine,
-# wrapped by the purego backend at backend/go/dllm.
-BACKEND_DLLM = dllm|golang|.|false|true
 BACKEND_VOXTRAL = voxtral|golang|.|false|true
 BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
 BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
@@ -1223,7 +1117,6 @@ BACKEND_KOKOROS = kokoros|rust|.|false|true

 # C++ backends (Go wrapper with purego)
 BACKEND_SAM3_CPP = sam3-cpp|golang|.|false|true
-BACKEND_RFDETR_CPP = rfdetr-cpp|golang|.|false|true

 # Helper function to build docker image for a backend
 # Usage: $(call docker-build-backend,BACKEND_NAME,DOCKERFILE_TYPE,BUILD_CONTEXT,PROGRESS_FLAG,NEEDS_BACKEND_ARG)
@@ -1256,14 +1149,10 @@ $(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
 $(eval $(call generate-docker-build-target,$(BACKEND_DS4)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
-$(eval $(call generate-docker-build-target,$(BACKEND_CLOUD_PROXY)))
 $(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SILERO_VAD)))
 $(eval $(call generate-docker-build-target,$(BACKEND_STABLEDIFFUSION_GGML)))
 $(eval $(call generate-docker-build-target,$(BACKEND_WHISPER)))
-$(eval $(call generate-docker-build-target,$(BACKEND_CRISPASR)))
-$(eval $(call generate-docker-build-target,$(BACKEND_PARAKEET_CPP)))
-$(eval $(call generate-docker-build-target,$(BACKEND_DLLM)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VOXTRAL)))
 $(eval $(call generate-docker-build-target,$(BACKEND_OPUS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_RERANKERS)))
@@ -1306,14 +1195,13 @@ $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_QUANTIZATION)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TINYGRAD)))
 $(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
-$(eval $(call generate-docker-build-target,$(BACKEND_RFDETR_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))

 # Pattern rule for docker-save targets
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar

-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx

 ########################################################
 ### Mock Backend for E2E Tests
@@ -1325,12 +1213,6 @@ build-mock-backend: protogen-go
 clean-mock-backend:
 	rm -f tests/e2e/mock-backend/mock-backend

-build-cloud-proxy-backend: protogen-go
-	$(GOCMD) build -o tests/e2e/mock-backend/cloud-proxy ./backend/go/cloud-proxy
-
-clean-cloud-proxy-backend:
-	rm -f tests/e2e/mock-backend/cloud-proxy
-
 ########################################################
 ### UI E2E Test Server
 ########################################################
@@ -1341,50 +1223,6 @@ build-ui-test-server: build-mock-backend react-ui protogen-go
 test-ui-e2e: build-ui-test-server
 	cd core/http/react-ui && npm install && npx playwright install --with-deps chromium && npx playwright test

-## Optional Playwright worker count for the UI e2e targets below. Pass
-## UI_TEST_WORKERS=N (e.g. `make test-ui-coverage UI_TEST_WORKERS=20`) to
-## override Playwright's default (cores/2). Empty by default so Playwright
-## picks its own worker count.
-UI_TEST_WORKERS ?=
-PLAYWRIGHT_WORKERS_FLAG = $(if $(UI_TEST_WORKERS),--workers=$(UI_TEST_WORKERS),)
-
-## Fast Playwright e2e run used by the pre-commit hook on React UI changes.
-## Force-rebuilds the (non-instrumented) dist so the suite tests the working
-## tree — not a stale dist the `react-ui` skip-guard would leave — re-embeds
-## it into ui-test-server, and runs the specs. Uses the nix-provided browser
-## when PLAYWRIGHT_CHROMIUM_PATH is set (flake dev shell), else falls back to
-## downloading it as `test-ui-e2e` does.
-test-ui: build-mock-backend protogen-go
-	cd core/http/react-ui && bun install && bun run build
-	$(GOCMD) build -o tests/e2e-ui/ui-test-server ./tests/e2e-ui
-	cd core/http/react-ui && sh $(CURDIR)/scripts/ensure-playwright-browser.sh && bunx playwright test $(PLAYWRIGHT_WORKERS_FLAG)
-
-## React UI code coverage from the Playwright e2e suite. Builds a
-## NON-instrumented bundle with source maps (COVERAGE_V8=true), re-embeds it
-## into the ui-test-server (the dist is //go:embed'ed at compile time), runs the
-## Playwright specs which collect native Chromium V8 coverage (PW_V8_COVERAGE=1)
-## — far cheaper than istanbul's build-time counters (~40% faster end-to-end) —
-## convert it to istanbul via v8-to-istanbul in the coverage fixture, and write
-## an nyc report to core/http/react-ui/coverage/. Removes the dist afterwards so
-## normal builds aren't served source-mapped assets. (The legacy istanbul path
-## still exists: `bun run build:coverage` + unset PW_V8_COVERAGE.)
-test-ui-coverage: build-mock-backend protogen-go
-	trap 'rm -rf "$(CURDIR)/core/http/react-ui/dist"' EXIT; \
-	( cd core/http/react-ui && bun install && bun run build:coverage-v8 ) && \
-	$(GOCMD) build -o tests/e2e-ui/ui-test-server ./tests/e2e-ui && \
-	( cd core/http/react-ui && rm -rf .nyc_output coverage && \
-	    sh $(CURDIR)/scripts/ensure-playwright-browser.sh && \
-	    PW_V8_COVERAGE=1 bunx playwright test $(PLAYWRIGHT_WORKERS_FLAG) && bun run coverage:report )
-
-## UI coverage baseline (committed) and the strict gate that compares against
-## it — the React mirror of test-coverage-baseline / test-coverage-check.
-test-ui-coverage-baseline: test-ui-coverage
-	@node -e 'const fs=require("fs");process.stdout.write(String(JSON.parse(fs.readFileSync("core/http/react-ui/coverage/coverage-summary.json")).total.lines.pct))' > core/http/react-ui/coverage-baseline.txt
-	@echo "Saved UI coverage baseline: $$(cat core/http/react-ui/coverage-baseline.txt)% lines"
-
-test-ui-coverage-check: test-ui-coverage
-	sh $(CURDIR)/scripts/ui-coverage-check.sh core/http/react-ui/coverage/coverage-summary.json core/http/react-ui/coverage-baseline.txt
-
 test-ui-e2e-docker:
 	docker build -t localai-ui-e2e -f tests/e2e-ui/Dockerfile .
 	docker run --rm localai-ui-e2e
--- a/README.md
+++ b/README.md
@@ -31,18 +31,12 @@

 **LocalAI** is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.

-**A small core, not a bundle.** Each backend wraps a best-in-class engine (llama.cpp, vLLM, whisper.cpp, stable-diffusion, MLX...) in its own image, pulled only when a model needs it. You install nothing you don't use.
-
-- **Composable by design**: backends are separate and pulled on demand, so you install only what your model needs
-- **Open and extensible**: load any model, or build your own backend in any language against an open interface
-- **Drop-in API compatibility**: OpenAI, Anthropic, and ElevenLabs APIs across every backend
-- **Any model, any modality**: LLMs, vision, voice, image, and video behind one API
-- **Any hardware**: NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
-- **Multi-user ready**: API key auth, user quotas, role-based access
-- **Built-in AI agents**: autonomous agents with tool use, RAG, MCP, and skills
-- **Privacy-first**: your data never leaves your infrastructure
-
-![A small LocalAI core with backends (llama.cpp, vLLM, MLX, whisper.cpp, stable-diffusion, kokoro, parakeet.cpp...) plugged in as separate on-demand images](docs/static/images/diagrams/composable-core.png)
+- **Drop-in API compatibility** — OpenAI, Anthropic, ElevenLabs APIs
+- **36+ backends** — llama.cpp, vLLM, transformers, whisper, diffusers, MLX...
+- **Any hardware** — NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
+- **Multi-user ready** — API key auth, user quotas, role-based access
+- **Built-in AI agents** — autonomous agents with tool use, RAG, MCP, and skills
+- **Privacy-first** — your data never leaves your infrastructure

 Created by [Ettore Di Giacinto](https://github.com/mudler) and maintained by the [LocalAI team](#team).

@@ -149,26 +143,14 @@ local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
 local-ai run oci://localai/phi-2:latest
 ```

-To test a running LocalAI server from the terminal, open an interactive chat session from another shell. Inside the prompt, `/models` lists installed models and `/model <name>` switches between them.
-
-```bash
-# Terminal 1
-local-ai run llama-3.2-1b-instruct:q4_k_m
-
-# Terminal 2
-local-ai chat --model llama-3.2-1b-instruct:q4_k_m
-```
-
 > **Automatic Backend Detection**: LocalAI automatically detects your GPU capabilities and downloads the appropriate backend. For advanced options, see [GPU Acceleration](https://localai.io/features/gpu-acceleration/).

 For more details, see the [Getting Started guide](https://localai.io/basics/getting_started/).

 ## Latest News

-- **May 2026**: **LocalAI 4.3.0** - `llama.cpp` [prompt cache on by default](https://github.com/mudler/LocalAI/pull/9925) (repeated system prompts collapse from minutes to seconds), [keyless cosign signing of backend OCI images](https://github.com/mudler/LocalAI/pull/9823), [per-API-key + per-user usage attribution](https://github.com/mudler/LocalAI/pull/9920), Distributed v3 with [per-request replica routing](https://github.com/mudler/LocalAI/pull/9968). [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.3.0)
-- **May 2026**: **LocalAI 4.2.0** - LocalAI sees and hears: [voice recognition](https://github.com/mudler/LocalAI/pull/9500), [face recognition + antispoofing liveness](https://github.com/mudler/LocalAI/pull/9480), speaker diarization. Plus [drop-in Ollama API](https://github.com/mudler/LocalAI/pull/9284), [video generation](https://github.com/mudler/LocalAI/pull/9420), redesigned UI with i18n + admin-configurable branding, vLLM at feature parity with llama.cpp, and 11 new backends. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.2.0)
-- **April 2026**: **LocalAI 4.1.0** - LocalAI becomes a control tower: distributed cluster mode with VRAM-aware smart routing + autoscaling, multi-user platform with OIDC and API keys, per-user quotas with predictive analytics, in-UI fine-tuning with TRL (auto-export to GGUF), on-the-fly quantization backend, visual pipeline editor. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.1.0)
-- **March 2026**: **LocalAI 4.0.0** - native agentic orchestration with the new [Agenthub](https://agenthub.localai.io) community hub, full React UI rewrite with Canvas mode, [MCP Apps + client-side](https://github.com/mudler/LocalAI/pull/8947) with tool streaming, [WebRTC realtime audio](https://github.com/mudler/LocalAI/pull/8790), [MLX-distributed](https://github.com/mudler/LocalAI/pull/8801). [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.0.0)
+- **April 2026**: [Voice recognition](https://github.com/mudler/LocalAI/pull/9500), [Face recognition, identification & liveness detection](https://github.com/mudler/LocalAI/pull/9480), [Ollama API compatibility](https://github.com/mudler/LocalAI/pull/9284), [Video generation in stable-diffusion.ggml](https://github.com/mudler/LocalAI/pull/9420), [Backend versioning with auto-upgrade](https://github.com/mudler/LocalAI/pull/9315), [Pin models & load-on-demand toggle](https://github.com/mudler/LocalAI/pull/9309), [Universal model importer](https://github.com/mudler/LocalAI/pull/9466), new backends: [sglang](https://github.com/mudler/LocalAI/pull/9359), [ik-llama-cpp](https://github.com/mudler/LocalAI/pull/9326), [TurboQuant](https://github.com/mudler/LocalAI/pull/9355), [sam.cpp](https://github.com/mudler/LocalAI/pull/9288), [Kokoros](https://github.com/mudler/LocalAI/pull/9212), [qwen3tts.cpp](https://github.com/mudler/LocalAI/pull/9316), [tinygrad multimodal](https://github.com/mudler/LocalAI/pull/9364)
+- **March 2026**: [Agent management](https://github.com/mudler/LocalAI/pull/8820), [New React UI](https://github.com/mudler/LocalAI/pull/8772), [WebRTC](https://github.com/mudler/LocalAI/pull/8790), [MLX-distributed via P2P and RDMA](https://github.com/mudler/LocalAI/pull/8801), [MCP Apps, MCP Client-side](https://github.com/mudler/LocalAI/pull/8947)
 - **February 2026**: [Realtime API for audio-to-audio with tool calling](https://github.com/mudler/LocalAI/pull/6245), [ACE-Step 1.5 support](https://github.com/mudler/LocalAI/pull/8396)
 - **January 2026**: **LocalAI 3.10.0** — Anthropic API support, Open Responses API, video & image generation (LTX-2), unified GPU backends, tool streaming, Moonshine, Pocket-TTS. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v3.10.0)
 - **December 2025**: [Dynamic Memory Resource reclaimer](https://github.com/mudler/LocalAI/pull/7583), [Automatic multi-GPU model fitting (llama.cpp)](https://github.com/mudler/LocalAI/pull/7584), [Vibevoice backend](https://github.com/mudler/LocalAI/pull/7494)
@@ -254,22 +236,11 @@ A huge thank you to our generous sponsors who support this project covering CI e
  <a href="https://www.spectrocloud.com/" target="blank">
    <img height="200" src="https://github.com/user-attachments/assets/72eab1dd-8b93-4fc0-9ade-84db49f24962">
  </a>
-</p>
-
-<details>
-
-<summary>
-Past sponsors
-</summary>
-
-<p align="center">
  <a href="https://www.premai.io/" target="blank">
    <img height="200" src="https://github.com/mudler/LocalAI/assets/2420543/42e4ca83-661e-4f79-8e46-ae43689683d6"> <br>
  </a>
 </p>

-</details>
-
 ### Individual sponsors

 A special thanks to individual sponsors, a full list is on [GitHub](https://github.com/sponsors/mudler) and [buymeacoffee](https://buymeacoffee.com/mudler). Special shout out to [drikster80](https://github.com/drikster80) for being generous. Thank you everyone!
--- a/backend/backend.proto
+++ b/backend/backend.proto
@@ -37,22 +37,6 @@ service Backend {

  rpc Rerank(RerankRequest) returns (RerankResult) {}

-  // TokenClassify runs a token-classification (NER) model on the
-  // supplied text and returns each detected entity span. Used by the
-  // PII redactor's optional NER tier — the regex tier still handles
-  // formatted hits cheaply, while this catches names, locations, and
-  // other unformatted PII that regex misses.
-  rpc TokenClassify(TokenClassifyRequest) returns (TokenClassifyResponse) {}
-
-  // Score evaluates the model's joint log-probability of each
-  // supplied candidate continuation given a shared prompt. The
-  // prompt's KV cache is computed once and reused across candidates.
-  // Used for routing-policy multi-label classification, reranking,
-  // calibrated confidence, and reward-model scoring — any task where
-  // the consumer wants the model's confidence in a pre-specified
-  // continuation rather than a generated one.
-  rpc Score(ScoreRequest) returns (ScoreResponse) {}
-
  rpc GetMetrics(MetricsRequest) returns (MetricsResponse);

  rpc VAD(VADRequest) returns (VADResponse) {}
@@ -84,23 +68,6 @@ service Backend {
  rpc QuantizationProgress(QuantizationProgressRequest) returns (stream QuantizationProgressUpdate) {}
  rpc StopQuantization(QuantizationStopRequest) returns (Result) {}

-  // Forward proxies a raw HTTP request to an upstream provider. The
-  // cloud-proxy backend implements this for passthrough-mode model
-  // configs: the client wire format is preserved end-to-end (no
-  // translation through internal proto), which means new provider
-  // fields work the day they ship. Translation-mode proxies use the
-  // standard Predict/PredictStream RPCs instead. Backends that don't
-  // support this return UNIMPLEMENTED.
-  //
-  // The request is bidirectionally streamed so large bodies can flow
-  // without buffering. In practice the first ForwardRequest carries
-  // path, method, headers, and the initial body chunk; subsequent
-  // messages append body chunks. The first ForwardReply carries the
-  // upstream status and response headers; subsequent messages stream
-  // body chunks (SSE frames or chunked transfer). Cancellation of the
-  // gRPC context closes the upstream connection.
-  rpc Forward(stream ForwardRequest) returns (stream ForwardReply) {}
-
 }

 // Define the empty request
@@ -114,76 +81,6 @@ message MetricsResponse {
  int32 prompt_tokens_processed = 5;
 }

-// TokenClassifyRequest carries the text to classify plus an optional
-// score threshold. The transformers backend interprets threshold as
-// the minimum confidence to include in the response; 0 = include all.
-message TokenClassifyRequest {
-  string text = 1;
-  float threshold = 2;
-}
-
-// TokenClassifyEntity is one detected entity span. Byte offsets are
-// into the original UTF-8 text — start..end is a half-open range that
-// addresses the substring corresponding to entity_group.
-//
-// entity_group follows HuggingFace's aggregated-tag convention (e.g.
-// "PER", "LOC", "ORG", or a PII-specific label like "EMAIL" /
-// "SSN" depending on the model). The redactor's per-pattern action
-// map keys off this string.
-message TokenClassifyEntity {
-  string entity_group = 1;
-  int32 start = 2;
-  int32 end = 3;
-  float score = 4;
-  string text = 5;
-}
-
-message TokenClassifyResponse {
-  repeated TokenClassifyEntity entities = 1;
-}
-
-// ScoreRequest carries one shared prompt and one or more continuations
-// to score against it. The backend tokenises the prompt once and reuses
-// the resulting KV cache across all candidates in this request.
-message ScoreRequest {
-  string prompt = 1;
-  repeated string candidates = 2;
-  // Return per-token logprobs for each candidate when true. Default
-  // false to keep the wire response small; the joint log_prob field
-  // covers the common ranking case.
-  bool include_token_logprobs = 3;
-  // When true, the response also populates length_normalized_log_prob
-  // (joint log-prob divided by candidate token count). Useful when
-  // candidates differ in length and the consumer wants a per-token
-  // measure comparable across them (PMI-style scoring).
-  bool length_normalize = 4;
-}
-
-// CandidateScore is one row in the ScoreResponse, matching by index
-// the candidate in ScoreRequest.candidates.
-message CandidateScore {
-  // Sum of log P(token_i | prompt, candidate_token_<i) across the
-  // candidate's tokens. The primary ranking signal.
-  double log_prob = 1;
-  // log_prob / num_tokens — populated when length_normalize=true on
-  // the request.
-  double length_normalized_log_prob = 2;
-  // Per-token detail — populated when include_token_logprobs=true.
-  repeated TokenLogProb tokens = 3;
-  // Number of tokens the backend tokenised this candidate into, after
-  // any backend-specific normalisation (e.g. leading-space handling).
-  int32 num_tokens = 4;
-}
-
-message TokenLogProb {
-  string token = 1;
-  double log_prob = 2;
-}
-
-message ScoreResponse {
-  repeated CandidateScore candidates = 1;
-}
-
 message RerankRequest {
  string query = 1;
  repeated string documents = 2;
@@ -428,25 +325,6 @@ message ModelOptions {
  // applied verbatim to the backend's engine constructor (e.g. vLLM AsyncEngineArgs).
  // Unknown keys produce an error at LoadModel time.
  string EngineArgs = 73;
-
-  // Proxy carries the cloud-proxy backend's per-model configuration.
-  // Empty for non-proxy backends.
-  ProxyOptions Proxy = 74;
-}
-
-// ProxyOptions configures the cloud-proxy backend. UpstreamURL and
-// Mode are always meaningful; Provider only matters in translate mode.
-// The two api_key_* fields are mutually exclusive and resolved by the
-// backend at LoadModel — core forwards the references rather than the
-// plaintext key.
-message ProxyOptions {
-  string upstream_url = 1;
-  string mode = 2;
-  string provider = 3;
-  string api_key_env = 4;
-  string api_key_file = 5;
-  string upstream_model = 6;
-  int32 request_timeout_seconds = 7;
 }

 message Result {
@@ -537,15 +415,6 @@ message TTSRequest {
  string dst = 3;
  string voice = 4;
  optional string language = 5;
-  // instructions is a free-form, per-request style/voice description (maps to
-  // the OpenAI `instructions` field). Backends that support expressive synthesis
-  // (e.g. Qwen3-TTS CustomVoice/VoiceDesign) prefer this over the static YAML
-  // option when set; backends that don't simply ignore it.
-  optional string instructions = 6;
-  // params carries optional, backend-specific per-request generation parameters
-  // (e.g. Chatterbox exaggeration/cfg_weight/temperature). Values are strings and
-  // coerced by the backend; unset leaves the backend's configured defaults.
-  map<string, string> params = 7;
 }

 message VADRequest {
@@ -1133,32 +1002,3 @@ message QuantizationStopRequest {
  string job_id = 1;
 }

-// ForwardHeader is one HTTP header on the request or response. Headers
-// like Authorization are typically injected by the backend (from the
-// resolved API key) rather than passed through from the client.
-message ForwardHeader {
-  string name = 1;
-  string value = 2;
-}
-
-// ForwardRequest is a streamed HTTP request to the upstream. First
-// message carries path/method/headers; subsequent messages carry
-// body_chunk only. All fields except body_chunk are honoured on the
-// first message and ignored thereafter.
-message ForwardRequest {
-  string path = 1;                          // e.g. "/v1/chat/completions" — appended to the model's upstream_url
-  string method = 2;                        // usually "POST"
-  repeated ForwardHeader headers = 3;
-  bytes body_chunk = 4;
-}
-
-// ForwardReply is a streamed HTTP response from the upstream. First
-// message carries status/headers; subsequent messages carry body_chunk
-// only. SSE responses arrive as a sequence of body_chunk frames; the
-// caller is responsible for any parsing.
-message ForwardReply {
-  int32 status = 1;
-  repeated ForwardHeader headers = 2;
-  bytes body_chunk = 3;
-}
-
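Note on the removed Score messages: the CandidateScore comments define `log_prob` as the sum of per-token log-probabilities and `length_normalized_log_prob` as `log_prob / num_tokens`, populated only when `length_normalize` is set. A minimal sketch of that arithmetic follows. `CandidateScoreSketch` and `score_candidate` are hypothetical names, not generated protobuf code or the backend's implementation, and leaving the normalized field at 0 for an empty candidate is an assumption.

```cpp
#include <vector>

// Hypothetical mirror of the removed CandidateScore message fields.
struct CandidateScoreSketch {
    double log_prob = 0.0;                    // joint log-probability of the candidate
    double length_normalized_log_prob = 0.0;  // log_prob / num_tokens, when requested
    int num_tokens = 0;
};

// token_log_probs[i] is assumed to hold
// log P(token_i | prompt, earlier candidate tokens).
inline CandidateScoreSketch score_candidate(const std::vector<double> &token_log_probs,
                                            bool length_normalize) {
    CandidateScoreSketch out;
    out.num_tokens = static_cast<int>(token_log_probs.size());
    for (double lp : token_log_probs) {
        out.log_prob += lp;  // summing logs == multiplying probabilities
    }
    // Assumption: an empty candidate keeps the normalized field at 0
    // rather than dividing by zero.
    if (length_normalize && out.num_tokens > 0) {
        out.length_normalized_log_prob = out.log_prob / out.num_tokens;
    }
    return out;
}
```

The normalized value matters when candidates differ in length: a longer candidate accumulates more negative terms, so the raw sum alone tends to favor shorter candidates.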
--- a/backend/cpp/ds4/.gitignore
+++ b/backend/cpp/ds4/.gitignore
@@ -2,7 +2,6 @@ ds4/
 build/
 package/
 grpc-server
-ds4-worker
 *.o
 backend.pb.cc
 backend.pb.h
--- a/backend/cpp/ds4/CMakeLists.txt
+++ b/backend/cpp/ds4/CMakeLists.txt
@@ -60,13 +60,6 @@ elseif(DS4_GPU STREQUAL "cpu")
    set(DS4_OBJS "${DS4_DIR}/ds4_cpu.o")
 endif()

-# ds4.c now references ds4_distributed.c (distributed inference) and ds4_ssd.c
-# (SSD expert-cache), each split into its own translation unit upstream. Both
-# are GPU-agnostic objects shared by every GPU mode, so link them in regardless
-# of DS4_GPU.
-list(APPEND DS4_OBJS "${DS4_DIR}/ds4_distributed.o")
-list(APPEND DS4_OBJS "${DS4_DIR}/ds4_ssd.o")
-
 add_executable(${TARGET}
    grpc-server.cpp
    dsml_parser.cpp
@@ -106,36 +99,3 @@ if(DS4_NATIVE)
        target_compile_options(${TARGET} PRIVATE -march=native)
    endif()
 endif()
-
-# ds4-worker: standalone distributed worker. Links the same ds4 engine objects
-# (including ds4_distributed.o) but has NO gRPC/protobuf dependency - it speaks
-# ds4's own TCP transport via ds4_dist_run(). Buildable wherever the engine
-# objects build, even on hosts without protobuf/grpc dev headers.
-add_executable(ds4-worker worker_main.c)
-target_include_directories(ds4-worker PRIVATE ${DS4_DIR})
-foreach(obj ${DS4_OBJS})
-    target_sources(ds4-worker PRIVATE ${obj})
-    set_source_files_properties(${obj} PROPERTIES EXTERNAL_OBJECT TRUE GENERATED TRUE)
-endforeach()
-# worker_main.c is C, but the engine objects built by nvcc (ds4_cuda.o) and the
-# Metal path (ds4_metal.o, Obj-C++) reference the C++ runtime (libstdc++). Force
-# the C++ linker driver so those symbols resolve; the C driver would not link
-# libstdc++ and the CUDA/Metal builds fail with undefined std:: references.
-set_target_properties(ds4-worker PROPERTIES LINKER_LANGUAGE CXX)
-target_link_libraries(ds4-worker PRIVATE Threads::Threads m)
-
-if(DS4_GPU STREQUAL "cuda")
-    target_link_libraries(ds4-worker PRIVATE CUDA::cudart CUDA::cublas)
-elseif(DS4_GPU STREQUAL "metal")
-    target_link_libraries(ds4-worker PRIVATE ${FOUNDATION_LIB} ${METAL_LIB})
-elseif(DS4_GPU STREQUAL "cpu")
-    target_compile_definitions(ds4-worker PRIVATE DS4_NO_GPU)
-endif()
-
-if(DS4_NATIVE)
-    if(APPLE)
-        target_compile_options(ds4-worker PRIVATE -mcpu=native)
-    else()
-        target_compile_options(ds4-worker PRIVATE -march=native)
-    endif()
-endif()
--- a/backend/cpp/ds4/Makefile
+++ b/backend/cpp/ds4/Makefile
@@ -1,10 +1,10 @@
 # ds4 backend Makefile.
 #
-# Upstream pin lives below as DS4_VERSION?=8384adf0f9fa0f3bb342dd925372de778b95b263
+# Upstream pin lives below as DS4_VERSION?=2606543be7a8c125a32cee37f5d1d85dc78f2fcf
 # (.github/bump_deps.sh) can find and update it - matches the
 # llama-cpp / ik-llama-cpp / turboquant convention.

-DS4_VERSION?=8384adf0f9fa0f3bb342dd925372de778b95b263
+DS4_VERSION?=2606543be7a8c125a32cee37f5d1d85dc78f2fcf
 DS4_REPO?=https://github.com/antirez/ds4

 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
@@ -18,20 +18,16 @@ UNAME_S := $(shell uname -s)

 CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release

-# ds4_distributed.o and ds4_ssd.o are GPU-agnostic translation units that
-# ds4.c/ds4_cpu.o now reference (upstream split distributed inference and the
-# SSD expert-cache into their own .c files). Both objects are shared by every
-# GPU mode, so they are appended unconditionally below.
 ifeq ($(BUILD_TYPE),cublas)
    CMAKE_ARGS += -DDS4_GPU=cuda
-    DS4_OBJ_TARGET := ds4.o ds4_cuda.o ds4_distributed.o ds4_ssd.o
+    DS4_OBJ_TARGET := ds4.o ds4_cuda.o
 else ifeq ($(UNAME_S),Darwin)
    CMAKE_ARGS += -DDS4_GPU=metal
-    DS4_OBJ_TARGET := ds4.o ds4_metal.o ds4_distributed.o ds4_ssd.o
+    DS4_OBJ_TARGET := ds4.o ds4_metal.o
 else
    # CPU reference path (Linux only - macOS CPU path is broken by VM bug per ds4 README).
    CMAKE_ARGS += -DDS4_GPU=cpu
-    DS4_OBJ_TARGET := ds4_cpu.o ds4_distributed.o ds4_ssd.o
+    DS4_OBJ_TARGET := ds4_cpu.o
 endif

 ifneq ($(NATIVE),true)
@@ -56,18 +52,17 @@ ds4:
 # the right per-platform compile flags (Objective-C/Metal on Darwin, nvcc on Linux+CUDA).
 ds4/ds4.o: ds4
 ifeq ($(BUILD_TYPE),cublas)
-	+$(MAKE) -C ds4 ds4.o ds4_cuda.o ds4_distributed.o ds4_ssd.o
+	+$(MAKE) -C ds4 ds4.o ds4_cuda.o
 else ifeq ($(UNAME_S),Darwin)
-	+$(MAKE) -C ds4 ds4.o ds4_metal.o ds4_distributed.o ds4_ssd.o
+	+$(MAKE) -C ds4 ds4.o ds4_metal.o
 else
-	+$(MAKE) -C ds4 ds4_cpu.o ds4_distributed.o ds4_ssd.o
+	+$(MAKE) -C ds4 ds4_cpu.o
 endif

 grpc-server: ds4/ds4.o
 	mkdir -p $(BUILD_DIR)
 	cd $(BUILD_DIR) && cmake $(CMAKE_ARGS) $(CURRENT_MAKEFILE_DIR) && cmake --build . --config Release -j $(JOBS)
 	cp $(BUILD_DIR)/grpc-server grpc-server
-	cp $(BUILD_DIR)/ds4-worker ds4-worker

 package: grpc-server
 	bash package.sh
@@ -76,7 +71,7 @@ test:
 	@echo "ds4 backend: e2e coverage at tests/e2e-backends/ (BACKEND_BINARY mode)"

 clean:
-	rm -rf $(BUILD_DIR) grpc-server ds4-worker package
+	rm -rf $(BUILD_DIR) grpc-server package
 	if [ -d ds4 ]; then $(MAKE) -C ds4 clean; fi

 purge: clean
--- a/backend/cpp/ds4/grpc-server.cpp
+++ b/backend/cpp/ds4/grpc-server.cpp
@@ -23,11 +23,8 @@ extern "C" {

 #include <atomic>
 #include <chrono>
-#include <climits>
 #include <csignal>
-#include <cstdlib>
 #include <cstring>
-#include <ctime>
 #include <iostream>
 #include <memory>
 #include <mutex>
@@ -54,12 +51,6 @@ ds4_session *g_session = nullptr;
 int g_ctx_size = 32768;
 std::string g_kv_cache_dir; // empty disables disk cache

-// Distributed coordinator state. g_distributed is set true when LoadModel is
-// given 'ds4_role:coordinator'; generation then waits for the worker route to
-// form before running. Single-node behavior is unchanged when unset.
-bool g_distributed = false;
-int g_route_timeout_sec = 60;
-
 std::atomic<Server *> g_server{nullptr};

 // Parse a "key:value" option string. Returns empty when no colon.
@@ -69,77 +60,6 @@ static std::pair<std::string, std::string> split_option(const std::string &opt)
    return {opt.substr(0, colon), opt.substr(colon + 1)};
 }

-// Parse a positive base-10 integer. Returns false (without throwing) on empty,
-// trailing garbage, non-positive, or overflow - unlike std::stoi.
-static bool parse_positive_int(const std::string &s, int *out) {
-    if (s.empty()) return false;
-    char *end = nullptr;
-    long v = std::strtol(s.c_str(), &end, 10);
-    if (!end || *end != '\0' || v <= 0 || v > INT_MAX) return false;
-    *out = static_cast<int>(v);
-    return true;
-}
-
-// Parse a ds4 layer spec "START:END" or "START:output" into the engine's
-// distributed layer fields. Returns false on malformed input.
-static bool parse_layers_spec(const std::string &spec, ds4_distributed_layers *out) {
-    auto colon = spec.find(':');
-    if (colon == std::string::npos) return false;
-    std::string lhs = spec.substr(0, colon);
-    std::string rhs = spec.substr(colon + 1);
-    if (lhs.empty() || rhs.empty()) return false;
-    char *end = nullptr;
-    long start = std::strtol(lhs.c_str(), &end, 10);
-    if (!end || *end != '\0' || start < 0) return false;
-    out->start = static_cast<uint32_t>(start);
-    out->has_output = false;
-    if (rhs == "output") {
-        out->has_output = true;
-        out->end = out->start; // engine treats has_output as "through final layer"
-    } else {
-        long e = std::strtol(rhs.c_str(), &end, 10);
-        if (!end || *end != '\0' || e < start) return false;
-        out->end = static_cast<uint32_t>(e);
-    }
-    out->set = true;
-    return true;
-}
-
-// When acting as a distributed coordinator, block until the worker route
-// covers all layers (ds4_session_distributed_route_ready == 1) or the timeout
-// elapses. Returns an empty string on success, or an error message to return
-// to the client. No-op when not distributed.
-//
-// Takes the g_engine_mu lock by reference and RELEASES it during each poll
-// sleep. The wait can span up to g_route_timeout_sec seconds while workers
-// connect; holding g_engine_mu the whole time would block the Status/Health
-// readiness probes (they also lock g_engine_mu), making LocalAI's loader treat
-// a still-starting worker as hung.
-static std::string wait_route_ready(std::unique_lock<std::mutex> &lock) {
-    if (!g_distributed) return "";
-    char err[256] = {0};
-    const int deadline_polls = g_route_timeout_sec * 10; // 100ms per poll
-    for (int i = 0; i <= deadline_polls; ++i) {
-        int ready = ds4_session_distributed_route_ready(g_session, err, sizeof(err));
-        if (ready == 1) return "";
-        if (ready < 0) {
-            return std::string("ds4 distributed route error: ") +
-                   (err[0] ? err : "unknown");
-        }
-        // Release the lock while sleeping so Status/Health and other RPCs can
-        // interleave during worker startup.
-        lock.unlock();
-        struct timespec ts = {0, 100L * 1000L * 1000L}; // 100ms
-        nanosleep(&ts, nullptr);
-        lock.lock();
-        // A concurrent Free() may have torn down the engine while we slept.
-        if (!g_engine || !g_session) {
-            return "ds4: model unloaded while waiting for distributed route";
-        }
-    }
-    return "ds4 distributed route incomplete: workers not connected (layers uncovered)";
-}
-
 static void append_token_text(ds4_engine *engine, int token, std::string &out) {
    size_t len = 0;
    const char *text = ds4_token_text(engine, token, &len);
@@ -457,11 +377,6 @@ public:
                     backend::Result *result) override {
        std::lock_guard<std::mutex> lock(g_engine_mu);

-        // Reset distributed state so a model swap (a second LoadModel without
-        // ds4_role) doesn't inherit a stale coordinator configuration.
-        g_distributed = false;
-        g_route_timeout_sec = 60;
-
        if (g_engine) {
            if (g_session) { ds4_session_free(g_session); g_session = nullptr; }
            ds4_engine_close(g_engine);
@@ -479,23 +394,12 @@ public:
        std::string mtp_path;
        int mtp_draft = 0;
        float mtp_margin = 3.0f;
-        std::string ds4_role, ds4_layers, ds4_listen;
        for (const auto &opt : request->options()) {
            auto [k, v] = split_option(opt);
            if (k == "mtp_path") mtp_path = v;
            else if (k == "mtp_draft") mtp_draft = std::stoi(v);
            else if (k == "mtp_margin") mtp_margin = std::stof(v);
            else if (k == "kv_cache_dir") g_kv_cache_dir = v;
-            else if (k == "ds4_role") ds4_role = v;
-            else if (k == "ds4_layers") ds4_layers = v;
-            else if (k == "ds4_listen") ds4_listen = v;
-            else if (k == "ds4_route_timeout") {
-                if (!parse_positive_int(v, &g_route_timeout_sec)) {
-                    result->set_success(false);
-                    result->set_message("ds4: ds4_route_timeout must be a positive integer");
-                    return GStatus::OK;
-                }
-            }
        }

        g_kv_cache.SetDir(g_kv_cache_dir);
@@ -518,49 +422,6 @@ public:
        opt.backend = DS4_BACKEND_CUDA;
 #endif

-        // Coordinator wiring. 'ds4_role:coordinator' enables layer-split
-        // distributed inference: this process listens on ds4_listen and owns
-        // the ds4_layers slice; workers dial in (see `local-ai worker
-        // ds4-distributed`). Absent ds4_role => unchanged single-node path.
-        // Must be static: opt.distributed.listen_host is a const char* the
-        // engine retains past this call, so it cannot point at a local that
-        // goes out of scope (otherwise a future "simplify to local" refactor
-        // reintroduces a dangling pointer).
-        static std::string s_listen_host;
-        if (ds4_role == "coordinator") {
-            if (ds4_layers.empty() || ds4_listen.empty()) {
-                result->set_success(false);
-                result->set_message("ds4: ds4_role:coordinator requires ds4_layers and ds4_listen");
-                return GStatus::OK;
-            }
-            // host:port for IPv4/hostname; IPv6 literals are unsupported (the
-            // first colon would split inside the address).
-            auto host_port = split_option(ds4_listen); // "host:port" -> {host, port}
-            if (host_port.second.empty()) {
-                result->set_success(false);
-                result->set_message("ds4: ds4_listen must be host:port");
-                return GStatus::OK;
-            }
-            int listen_port = 0;
-            if (!parse_positive_int(host_port.second, &listen_port)) {
-                result->set_success(false);
-                result->set_message("ds4: ds4_listen port must be a positive integer");
-                return GStatus::OK;
-            }
-            ds4_distributed_layers layers = {};
-            if (!parse_layers_spec(ds4_layers, &layers)) {
-                result->set_success(false);
-                result->set_message("ds4: invalid ds4_layers (want START:END or START:output)");
-                return GStatus::OK;
-            }
-            s_listen_host = host_port.first;
-            opt.distributed.role = DS4_DISTRIBUTED_COORDINATOR;
-            opt.distributed.layers = layers;
-            opt.distributed.listen_host = s_listen_host.c_str();
-            opt.distributed.listen_port = listen_port;
-            g_distributed = true;
-        }
-
        int rc = ds4_engine_open(&g_engine, &opt);
        if (rc != 0 || !g_engine) {
            result->set_success(false);
@@ -597,13 +458,10 @@ public:

    GStatus Predict(ServerContext *, const backend::PredictOptions *request,
                   backend::Reply *reply) override {
-        std::unique_lock<std::mutex> lock(g_engine_mu);
+        std::lock_guard<std::mutex> lock(g_engine_mu);
        if (!g_engine || !g_session) {
            return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
        }
-        if (std::string route_err = wait_route_ready(lock); !route_err.empty()) {
-            return GStatus(StatusCode::UNAVAILABLE, route_err);
-        }
        ds4_tokens prompt = {};
        build_prompt(g_engine, request, &prompt);
        int n_predict = request->tokens() > 0 ? request->tokens() : 256;
@@ -696,13 +554,10 @@ public:

    GStatus PredictStream(ServerContext *, const backend::PredictOptions *request,
                         ServerWriter<backend::Reply> *writer) override {
-        std::unique_lock<std::mutex> lock(g_engine_mu);
+        std::lock_guard<std::mutex> lock(g_engine_mu);
        if (!g_engine || !g_session) {
            return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
        }
-        if (std::string route_err = wait_route_ready(lock); !route_err.empty()) {
-            return GStatus(StatusCode::UNAVAILABLE, route_err);
-        }
        ds4_tokens prompt = {};
        build_prompt(g_engine, request, &prompt);
        int n_predict = request->tokens() > 0 ? request->tokens() : 256;
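Note on the `std::unique_lock` to `std::lock_guard` change in Predict and PredictStream: the removed `wait_route_ready` needed to release `g_engine_mu` during each poll sleep so that Status/Health could interleave, and `std::lock_guard` cannot be unlocked early. With the wait gone, the simpler guard is enough. The pattern can be sketched as below. `wait_until_ready` is a hypothetical helper, the `ready` callable stands in for `ds4_session_distributed_route_ready`, and `std::this_thread::sleep_for` replaces the original `nanosleep`.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>

// Polls `ready` while holding `lock`, but drops the lock for each sleep so
// other threads that need the same mutex can make progress.
// Precondition: `lock` owns its mutex on entry; it owns it again on return.
inline bool wait_until_ready(std::unique_lock<std::mutex> &lock,
                             const std::function<bool()> &ready,
                             int max_polls,
                             std::chrono::milliseconds interval) {
    assert(lock.owns_lock());
    for (int i = 0; i <= max_polls; ++i) {
        if (ready()) return true;      // checked under the lock
        lock.unlock();                 // let other RPCs take the mutex
        std::this_thread::sleep_for(interval);
        lock.lock();                   // re-acquire before the next check
    }
    return false;                      // timed out
}
```

Any state protected by the mutex may have changed while the lock was released, which is why the removed code re-checked `g_engine` and `g_session` after each sleep. A caller of this sketch would put that re-check inside `ready`.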
--- a/backend/cpp/ds4/package.sh
+++ b/backend/cpp/ds4/package.sh
@@ -5,8 +5,7 @@ REPO_ROOT="${CURDIR}/../../.."

 mkdir -p "$CURDIR/package/lib"
 cp -avf "$CURDIR/grpc-server" "$CURDIR/package/"
-cp -avf "$CURDIR/ds4-worker"  "$CURDIR/package/"
-cp -rfv "$CURDIR/run.sh"      "$CURDIR/package/"
+cp -rfv "$CURDIR/run.sh"     "$CURDIR/package/"

 UNAME_S=$(uname -s)
 if [ "$UNAME_S" = "Darwin" ]; then
--- a/backend/cpp/ds4/worker_main.c
+++ b/backend/cpp/ds4/worker_main.c
@@ -1,126 +0,0 @@
-// ds4-worker: standalone distributed worker for the LocalAI ds4 backend.
-//
-// A ds4 distributed worker owns a slice of the model's transformer layers,
-// dials the coordinator, and serves activations for its slice. It does NOT
-// speak backend.proto - it speaks ds4's own TCP transport via ds4_dist_run().
-// This binary is intentionally minimal (no HTTP/web/kvstore/linenoise): it
-// only needs the engine objects + ds4_distributed.o, which the backend already
-// builds. It is launched by `local-ai worker ds4-distributed`.
-//
-// Usage:
-//   ds4-worker --role worker --model <gguf> --layers 20:output \
-//              --coordinator <host> <port> [--cpu|--cuda|--metal] [-c CTX] [-t N]
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-#include <signal.h>
-#include <limits.h>
-
-#include "ds4.h"
-#include "ds4_distributed.h"
-
-static const char *need_arg(int *i, int argc, char **argv, const char *flag) {
-    if (*i + 1 >= argc) {
-        fprintf(stderr, "ds4-worker: missing value for %s\n", flag);
-        exit(2);
-    }
-    return argv[++(*i)];
-}
-
-static int parse_int_arg(const char *s, const char *flag) {
-    char *end = NULL;
-    long v = strtol(s, &end, 10);
-    if (!s[0] || *end || v <= 0 || v > INT_MAX) {
-        fprintf(stderr, "ds4-worker: invalid value for %s: %s\n", flag, s);
-        exit(2);
-    }
-    return (int)v;
-}
-
-static ds4_backend default_backend(void) {
-#if defined(DS4_NO_GPU)
-    return DS4_BACKEND_CPU;
-#elif defined(__APPLE__)
-    return DS4_BACKEND_METAL;
-#else
-    return DS4_BACKEND_CUDA;
-#endif
-}
-
-int main(int argc, char **argv) {
-    signal(SIGPIPE, SIG_IGN);
-
-    ds4_engine_options opt = {0};
-    opt.backend = default_backend();
-    int ctx_size = 32768;
-
-    for (int i = 1; i < argc; i++) {
-        const char *arg = argv[i];
-        if (!strcmp(arg, "-h") || !strcmp(arg, "--help")) {
-            fprintf(stdout, "ds4-worker: standalone ds4 distributed worker\n");
-            ds4_dist_usage(stdout);
-            fprintf(stdout, "  -m, --model PATH   model GGUF (the worker loads only its --layers slice)\n");
-            fprintf(stdout, "  -c, --ctx N        context size (default 32768)\n");
-            fprintf(stdout, "  -t, --threads N    CPU threads\n");
-            fprintf(stdout, "  --cpu|--cuda|--metal  backend override\n");
-            return 0;
-        }
-
-        char dist_err[256] = {0};
-        ds4_dist_cli_parse_result dist_parse =
-            ds4_dist_parse_cli_arg(arg, &i, argc, argv, &opt.distributed,
-                                   dist_err, sizeof(dist_err));
-        if (dist_parse == DS4_DIST_CLI_ERROR) {
-            fprintf(stderr, "ds4-worker: %s\n",
-                    dist_err[0] ? dist_err : "invalid distributed option");
-            return 2;
-        }
-        if (dist_parse == DS4_DIST_CLI_MATCHED) continue;
-
-        if (!strcmp(arg, "-m") || !strcmp(arg, "--model")) {
-            opt.model_path = need_arg(&i, argc, argv, arg);
-        } else if (!strcmp(arg, "-c") || !strcmp(arg, "--ctx")) {
-            ctx_size = parse_int_arg(need_arg(&i, argc, argv, arg), arg);
-        } else if (!strcmp(arg, "-t") || !strcmp(arg, "--threads")) {
-            opt.n_threads = parse_int_arg(need_arg(&i, argc, argv, arg), arg);
-        } else if (!strcmp(arg, "--cpu")) {
-            opt.backend = DS4_BACKEND_CPU;
-        } else if (!strcmp(arg, "--cuda")) {
-            opt.backend = DS4_BACKEND_CUDA;
-        } else if (!strcmp(arg, "--metal")) {
-            opt.backend = DS4_BACKEND_METAL;
-        } else {
-            fprintf(stderr, "ds4-worker: unknown option: %s\n", arg);
-            return 2;
-        }
-    }
-
-    if (opt.distributed.role != DS4_DISTRIBUTED_WORKER) {
-        fprintf(stderr, "ds4-worker: --role worker is required\n");
-        return 2;
-    }
-    if (!opt.model_path) {
-        fprintf(stderr, "ds4-worker: --model is required\n");
-        return 2;
-    }
-
-    char prep_err[256] = {0};
-    if (ds4_dist_prepare_engine_options(&opt.distributed, &opt,
-                                        prep_err, sizeof(prep_err)) != 0) {
-        fprintf(stderr, "ds4-worker: %s\n", prep_err);
-        return 2;
-    }
-
-    ds4_engine *engine = NULL;
-    if (ds4_engine_open(&engine, &opt) != 0 || !engine) {
-        fprintf(stderr, "ds4-worker: failed to open engine\n");
-        return 1;
-    }
-
-    ds4_dist_generation_options gen = {0};
-    gen.ctx_size = ctx_size;
-    int rc = ds4_dist_run(engine, &opt.distributed, &gen);
-    ds4_engine_close(engine);
-    return rc;
-}
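Note on integer parsing: `parse_int_arg` here and the removed `parse_positive_int` in grpc-server.cpp use the same `strtol` validation, while LoadModel still reads `mtp_draft` with `std::stoi`. `std::stoi("60s")` returns 60 and silently drops the suffix, and `std::stoi("")` throws `std::invalid_argument`. The strict variant can be sketched as below. `parse_positive_int_strict` is a hypothetical name, and the `errno == ERANGE` check is an addition that neither original function has.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Strict positive-int parse: rejects empty input, trailing garbage,
// non-positive values, and out-of-range values, and never throws.
// `*out` is written only on success.
inline bool parse_positive_int_strict(const std::string &s, int *out) {
    if (s.empty()) return false;
    errno = 0;
    char *end = nullptr;
    long v = std::strtol(s.c_str(), &end, 10);
    // No digits consumed, or characters left over after the number.
    if (end == s.c_str() || *end != '\0') return false;
    // Addition over the originals: catch overflow reported via errno,
    // which matters on platforms where long is as narrow as int.
    if (errno == ERANGE || v <= 0 || v > INT_MAX) return false;
    *out = static_cast<int>(v);
    return true;
}
```

Like the originals, this still accepts what `strtol` itself accepts at the front of the string, such as leading whitespace or a `+` sign.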
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=e6f8112f3ba126eed3ff5b30cdd08085414a7516
+IK_LLAMA_VERSION?=11a1fea9e291f12ce2c803a9d7812c30ca806bcf
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=039e20a2db9e87b2477c76cc04905f3e1acad77f
+LLAMA_VERSION?=ad277572619fcfb6ddd38f4c6437283a4b2b8636
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -34,7 +34,6 @@
 #include <regex>
 #include <algorithm>
 #include <atomic>
-#include <cmath>
 #include <cstdlib>
 #include <fstream>
 #include <iterator>
@@ -122,40 +121,6 @@ static std::string base64_encode_bytes(const unsigned char* data, size_t len) {

 bool loaded_model; // TODO: add a mutex for this, but happens only once loading the model

-// Score bypasses the slot loop (see the comment on Score below) so it
-// must not run concurrently with any slot-loop RPC. These counters
-// are a defence-in-depth tripwire — ModelConfig.Validate already
-// rejects llama-cpp configs that mix score with chat/completion/
-// embeddings, so a healthy deployment never trips them. seq_cst is
-// load-bearing for the increment-then-check pattern below.
-static std::atomic<int> slot_loop_inflight{0};
-static std::atomic<int> score_inflight{0};
-
-// Increment-then-check, not check-then-increment: two simultaneous
-// racers both observe the other's increment and both abort cleanly.
-// Reversed, both could see zero and proceed.
-struct conflict_guard {
-    std::atomic<int>& self;
-    conflict_guard(const char* rpc, std::atomic<int>& self_, std::atomic<int>& other, const char* other_name)
-        : self(self_) {
-        self.fetch_add(1, std::memory_order_seq_cst);
-        int o = other.load(std::memory_order_seq_cst);
-        if (o > 0) {
-            fprintf(stderr,
-                "FATAL: %s called with %s=%d. The llama-cpp backend cannot "
-                "service Score and slot-loop RPCs concurrently — Score "
-                "bypasses the slot loop and races the llama_context. Bind "
-                "Score-using features to a model dedicated to scoring "
-                "(known_usecases: [score] with no chat/completion/embeddings).\n",
-                rpc, other_name, o);
-            std::abort();
-        }
-    }
-    ~conflict_guard() {
-        self.fetch_sub(1, std::memory_order_seq_cst);
-    }
-};
-
 static std::function<void(int)> shutdown_handler;
 static std::atomic_flag is_terminating = ATOMIC_FLAG_INIT;

@@ -381,15 +346,6 @@ json parse_options(bool streaming, const backend::PredictOptions* predict, const
            });
    }

-    // for each video in the request, add the video data
-    for (int i = 0; i < predict->videos_size(); i++) {
-        data["video_data"].push_back(json
-            {
-                {"id", i},
-                {"data",    predict->videos(i)},
-            });
-    }
-
    data["stop"] = predict->stopprompts();
    // data["n_probs"] = predict->nprobs();
    //TODO: images,
@@ -491,13 +447,23 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
    if (!request->draftmodel().empty()) {
        params.speculative.draft.mparams.path = request->draftmodel();
        // Default to draft type if a draft model is set but no explicit type.
-        // Upstream made the speculative type a vector (ggml-org/llama.cpp#22838)
-        // and renamed COMMON_SPECULATIVE_TYPE_DRAFT -> ..._DRAFT_SIMPLE (#22964).
+        // Two upstream changes matter here: ggml-org/llama.cpp#22838 made the
+        // speculative type a vector, and ggml-org/llama.cpp#22964 renamed
+        // COMMON_SPECULATIVE_TYPE_DRAFT -> ..._DRAFT_SIMPLE. The turboquant
+        // fork still uses the legacy scalar and the old enum name, so fork
+        // builds compile the branch guarded by LOCALAI_LEGACY_LLAMA_CPP_SPEC,
+        // which backend/cpp/turboquant/patch-grpc-server.sh injects for them only.
+#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
+        if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
+            params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
+        }
+#else
        const bool no_spec_type = params.speculative.types.empty() ||
            (params.speculative.types.size() == 1 && params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE);
        if (no_spec_type) {
            params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE };
        }
+#endif
    }

    //  params.model_alias ??
@@ -551,34 +517,10 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
    params.warmup = true;
    // no_op_offload: disable host tensor op offload (default: false)
    params.no_op_offload = false;
-    // kv_unified: enable unified KV cache. Upstream's server auto-enables this
-    // when the slot count is auto (-np <0), bumping n_parallel to 4 alongside.
-    // LocalAI keeps n_parallel=1 by default, which would skip that auto path
-    // and leave kv_unified=false. We flip the default to true here so the
-    // server-side prompt cache (cache_idle_slots) is actually usable on the
-    // single-slot path that LocalAI ships with: without it, idle slots are
-    // never persisted across requests and the prompt cache is dead weight.
-    // Users can opt out with `options: [ "kv_unified:false" ]`.
-    params.kv_unified = true;
-    // n_ctx_checkpoints: max context checkpoints per slot. Match upstream's
-    // default (32); the previous LocalAI-specific 8 was unnecessarily tight
-    // and limits partial-prefix recovery without a clear memory rationale.
-    params.n_ctx_checkpoints = 32;
-    // cache_idle_slots: save and clear idle slot KV to the prompt cache on
-    // task switch. Upstream default is true; the server auto-disables it if
-    // kv_unified=false or cache_ram_mib=0, so flipping kv_unified above is
-    // what actually unlocks it.
-    params.cache_idle_slots = true;
-    // checkpoint_min_step: minimum spacing between context checkpoints in
-    // tokens (0 disables the minimum). Match upstream's default (256). This
-    // field was renamed from `checkpoint_every_nt` in llama.cpp; the semantics
-    // also shifted from a fixed cadence to a minimum spacing. The turboquant
-    // fork still lacks common_params::checkpoint_min_step, so skip it there
-    // (LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP is injected by
-    // backend/cpp/turboquant/patch-grpc-server.sh).
-#ifndef LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP
-    params.checkpoint_min_step = 256;
-#endif
+    // kv_unified: enable unified KV cache (default: false)
+    params.kv_unified = false;
+    // n_ctx_checkpoints: max context checkpoints per slot (default: 8)
+    params.n_ctx_checkpoints = 8;

     // decode options. Options are in form optname:optvale, or if booleans only optname.
    for (int i = 0; i < request->options_size(); i++) {
@@ -737,44 +679,10 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                try {
                    params.n_ctx_checkpoints = std::stoi(optval_str);
                } catch (const std::exception& e) {
-                    // If conversion fails, keep default value (32)
+                    // If conversion fails, keep default value (8)
                }
            }

-        // --- server-side idle-slot prompt cache toggle (upstream --cache-idle-slots) ---
-        // Saves the slot's KV state into the host-side prompt cache on task
-        // switch so a later request with the same prefix can warm-load it.
-        // Auto-disabled by the server if kv_unified=false or cache_ram=0.
-        } else if (!strcmp(optname, "cache_idle_slots") || !strcmp(optname, "idle_slots_cache")) {
-            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
-                params.cache_idle_slots = true;
-            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
-                params.cache_idle_slots = false;
-            }
-
-#ifndef LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP
-        // --- minimum context-checkpoint spacing (upstream -cms / --checkpoint-min-step) ---
-        // 0 disables the minimum-spacing gate. Old option names (`checkpoint_every_nt`,
-        // `checkpoint_every_n_tokens`) are kept as aliases for backward compatibility
-        // with existing user configs: upstream renamed the field and shifted its
-        // semantics from a fixed cadence to a minimum spacing.
-        //
-        // Gated out for the turboquant fork, which lacks common_params::
-        // checkpoint_min_step. The leading `}` closing the cache_idle_slots
-        // branch is removed with this block; the next `} else if` (n_ubatch)
-        // then closes cache_idle_slots, so braces stay balanced under both
-        // preprocessor branches.
-        } else if (!strcmp(optname, "checkpoint_min_step") || !strcmp(optname, "checkpoint_min_spacing") ||
-                   !strcmp(optname, "checkpoint_every_nt") || !strcmp(optname, "checkpoint_every_n_tokens")) {
-            if (optval != NULL) {
-                try {
-                    params.checkpoint_min_step = std::stoi(optval_str);
-                } catch (const std::exception& e) {
-                    // If conversion fails, keep default value (256)
-                }
-            }
-#endif
-
        // --- physical batch size (upstream -ub / --ubatch-size) ---
        // Note: line ~482 already aliases n_ubatch to n_batch as a default; this
        // option lets users decouple the two (useful for embeddings/rerank).
@@ -906,6 +814,17 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt

        // Speculative decoding options
        } else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
+#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
+            // Fork only knows a single scalar `type`. Take the first comma-
+            // separated value and assign it via the singular helper.
+            std::string first = optval_str;
+            const auto comma = first.find(',');
+            if (comma != std::string::npos) first = first.substr(0, comma);
+            auto type = common_speculative_type_from_name(first);
+            if (type != COMMON_SPECULATIVE_TYPE_COUNT) {
+                params.speculative.type = type;
+            }
+#else
            // Upstream switched to a vector of types (comma-separated for multi-type
            // chaining via common_speculative_types_from_names). We keep accepting a
            // single value here, but also tolerate comma-separated lists.
@@ -934,6 +853,7 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            if (!parsed.empty()) {
                params.speculative.types = parsed;
            }
+#endif
        } else if (!strcmp(optname, "spec_n_max") || !strcmp(optname, "draft_max")) {
            if (optval != NULL) {
                try { params.speculative.draft.n_max = std::stoi(optval_str); } catch (...) {}
@@ -971,6 +891,21 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            // shares the target context size. Accept the option for backward
            // compatibility but silently ignore it.

+// Everything below relies on the struct shape introduced in ggml-org/llama.cpp#22838
+// (parallel drafting): `ngram_mod`, `ngram_map_k`, `ngram_map_k4v`,
+// `ngram_cache`, and the `draft.{cache_type_*, cpuparams*, tensor_buft_overrides}`
+// fields. The turboquant fork branched before that, so its build defines
+// LOCALAI_LEGACY_LLAMA_CPP_SPEC via patch-grpc-server.sh and these option
+// keys become unrecognized (silently dropped, like any unknown opt) for it.
+//
+// The `#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC` / `#else` split below sits at the
+// closing-brace position of the `draft_ctx_size` branch on purpose: in the
+// legacy build the chain ends here (the brace closes draft_ctx_size), and in
+// the modern build the chain continues with `} else if (...)` instead, so the
+// brace count stays balanced under both branches of the preprocessor.
+#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
+        }
+#else
        // --- ngram_mod family (upstream --spec-ngram-mod-*) ---
        } else if (!strcmp(optname, "spec_ngram_mod_n_min")) {
            if (optval != NULL) {
@@ -1100,6 +1035,7 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            }
            if (!cur.empty()) flush(cur);
        }
+#endif // LOCALAI_LEGACY_LLAMA_CPP_SPEC: closes the `#ifdef`/`#else` opened at draft_ctx_size
    }

    // Set params.n_parallel from environment variable if not set via options (fallback)
@@ -1149,8 +1085,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            params.tensor_buft_overrides.push_back({nullptr, nullptr});
        }
    }
-    // Terminate the draft tensor_buft_overrides list with a sentinel, mirroring
-    // the main-model handling above.
    if (!params.speculative.draft.tensor_buft_overrides.empty()) {
        params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
    }
@@ -1473,7 +1407,6 @@ public:
        if (params_base.model.path.empty()) {
            return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
        }
-        conflict_guard guard("PredictStream", slot_loop_inflight, score_inflight, "score_inflight");
        json data = parse_options(true, request, params_base, ctx_server.get_llama_context());


@@ -1512,7 +1445,7 @@ public:
                    msg_json["role"] = msg.role();

                    bool is_last_user_msg = (i == last_user_msg_idx);
-                    bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0 || request->videos_size() > 0);
+                    bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0);

                    // Handle content - can be string, null, or array
                    // For multimodal content, we'll embed images/audio from separate fields
@@ -1563,16 +1496,6 @@ public:
                                    content_array.push_back(audio_chunk);
                                }
                            }
-                            if (request->videos_size() > 0) {
-                                for (int j = 0; j < request->videos_size(); j++) {
-                                    json video_chunk;
-                                    video_chunk["type"] = "input_video";
-                                    json input_video;
-                                    input_video["data"] = request->videos(j);
-                                    video_chunk["input_video"] = input_video;
-                                    content_array.push_back(video_chunk);
-                                }
-                            }
                            msg_json["content"] = content_array;
                        } else {
                            // Use content as-is (already array or not last user message)
@@ -1607,16 +1530,6 @@ public:
                                content_array.push_back(audio_chunk);
                            }
                        }
-                        if (request->videos_size() > 0) {
-                            for (int j = 0; j < request->videos_size(); j++) {
-                                json video_chunk;
-                                video_chunk["type"] = "input_video";
-                                json input_video;
-                                input_video["data"] = request->videos(j);
-                                video_chunk["input_video"] = input_video;
-                                content_array.push_back(video_chunk);
-                            }
-                        }
                        msg_json["content"] = content_array;
                    } else if (msg.role() == "tool") {
                        // Tool role messages must have content field set, even if empty
@@ -1932,17 +1845,6 @@ public:
                    body_json["chat_template_kwargs"]["enable_thinking"] = (et_it->second == "true");
                }

-                // Pass reasoning_effort via chat_template_kwargs too: the lever
-                // jinja templates like gpt-oss (Harmony) / LFM2.5 read, distinct
-                // from enable_thinking which those templates ignore.
-                auto re_it = metadata.find("reasoning_effort");
-                if (re_it != metadata.end() && !re_it->second.empty()) {
-                    if (!body_json.contains("chat_template_kwargs")) {
-                        body_json["chat_template_kwargs"] = json::object();
-                    }
-                    body_json["chat_template_kwargs"]["reasoning_effort"] = re_it->second;
-                }
-
                // Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
                SRV_DBG("[CONVERSATION DEBUG] PredictStream: Full body_json before oaicompat_chat_params_parse:\n%s\n", body_json.dump(2).c_str());

@@ -2068,16 +1970,6 @@ public:
                        files.push_back(decoded_data);
                    }
                }
-
-                const auto &video_data = data.find("video_data");
-                if (video_data != data.end() && video_data->is_array())
-                {
-                    for (const auto &video : *video_data)
-                    {
-                        auto decoded_data = base64_decode(video["data"].get<std::string>());
-                        files.push_back(decoded_data);
-                    }
-                }
            }

            const bool has_mtmd = ctx_server.impl->mctx != nullptr;
@@ -2213,15 +2105,7 @@ public:
        // content element — attaching to both would duplicate the first
        // token since oaicompat_msg_diffs is the same for both.
        json first_res_json = first_result->to_json();
-        // Upstream llama.cpp (ggml-org/llama.cpp#23884) now emits an initial
-        // "begin" partial whose to_json() returns null, used only to signal the
-        // HTTP layer to flush 200 status headers before any token. gRPC has no
-        // such concept, so there is nothing to emit — the real tokens arrive in
-        // the loop below. Feeding this null into build_reply_from_json would
-        // throw (uncaught) and surface as a generic RPC error.
-        if (first_res_json.is_null()) {
-            // skip the begin-of-stream marker
-        } else if (first_res_json.is_array()) {
+        if (first_res_json.is_array()) {
            for (const auto & res : first_res_json) {
                auto reply = build_reply_from_json(res, first_result.get());
                // Skip chat deltas for role-init elements (have "role" in
@@ -2251,10 +2135,7 @@ public:
            }

            json res_json = result->to_json();
-            if (res_json.is_null()) {
-                // begin-of-stream marker (see note above) — nothing to emit
-                continue;
-            } else if (res_json.is_array()) {
+            if (res_json.is_array()) {
                for (const auto & res : res_json) {
                    auto reply = build_reply_from_json(res, result.get());
                    bool is_role_init = res.contains("choices") && !res["choices"].empty() &&
@@ -2285,7 +2166,6 @@ public:
         if (params_base.model.path.empty()) {
             return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
         }
-         conflict_guard guard("Predict", slot_loop_inflight, score_inflight, "score_inflight");
         json data = parse_options(true, request, params_base, ctx_server.get_llama_context());

        data["stream"] = false;
@@ -2330,7 +2210,7 @@ public:
                    }

                    bool is_last_user_msg = (i == last_user_msg_idx);
-                    bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0 || request->videos_size() > 0);
+                    bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0);

                    // Handle content - can be string, null, or array
                    // For multimodal content, we'll embed images/audio from separate fields
@@ -2383,16 +2263,6 @@ public:
                                    content_array.push_back(audio_chunk);
                                }
                            }
-                            if (request->videos_size() > 0) {
-                                for (int j = 0; j < request->videos_size(); j++) {
-                                    json video_chunk;
-                                    video_chunk["type"] = "input_video";
-                                    json input_video;
-                                    input_video["data"] = request->videos(j);
-                                    video_chunk["input_video"] = input_video;
-                                    content_array.push_back(video_chunk);
-                                }
-                            }
                            msg_json["content"] = content_array;
                        } else {
                            // Use content as-is (already array or not last user message)
@@ -2432,16 +2302,6 @@ public:
                                content_array.push_back(audio_chunk);
                            }
                        }
-                        if (request->videos_size() > 0) {
-                            for (int j = 0; j < request->videos_size(); j++) {
-                                json video_chunk;
-                                video_chunk["type"] = "input_video";
-                                json input_video;
-                                input_video["data"] = request->videos(j);
-                                video_chunk["input_video"] = input_video;
-                                content_array.push_back(video_chunk);
-                            }
-                        }
                        msg_json["content"] = content_array;
                        SRV_INF("[CONTENT DEBUG] Predict: Message %d created content array with media\n", i);
                    } else if (!msg.tool_calls().empty()) {
@@ -2766,17 +2626,6 @@ public:
                    body_json["chat_template_kwargs"]["enable_thinking"] = (predict_et_it->second == "true");
                }

-                // Pass reasoning_effort via chat_template_kwargs too: the lever
-                // jinja templates like gpt-oss (Harmony) / LFM2.5 read, distinct
-                // from enable_thinking which those templates ignore.
-                auto predict_re_it = predict_metadata.find("reasoning_effort");
-                if (predict_re_it != predict_metadata.end() && !predict_re_it->second.empty()) {
-                    if (!body_json.contains("chat_template_kwargs")) {
-                        body_json["chat_template_kwargs"] = json::object();
-                    }
-                    body_json["chat_template_kwargs"]["reasoning_effort"] = predict_re_it->second;
-                }
-
                // Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
                SRV_DBG("[CONVERSATION DEBUG] Predict: Full body_json before oaicompat_chat_params_parse:\n%s\n", body_json.dump(2).c_str());

@@ -2904,16 +2753,6 @@ public:
                        files.push_back(decoded_data);
                    }
                }
-
-                const auto &video_data = data.find("video_data");
-                if (video_data != data.end() && video_data->is_array())
-                {
-                    for (const auto &video : *video_data)
-                    {
-                        auto decoded_data = base64_decode(video["data"].get<std::string>());
-                        files.push_back(decoded_data);
-                    }
-                }
            }

            // process files
@@ -3085,7 +2924,6 @@ public:
        if (params_base.model.path.empty()) {
            return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
        }
-        conflict_guard guard("Embedding", slot_loop_inflight, score_inflight, "score_inflight");
        json body = parse_options(false, request, params_base, ctx_server.get_llama_context());

        body["stream"] = false;
@@ -3193,8 +3031,6 @@ public:
            return grpc::Status(grpc::StatusCode::INVALID_ARGUMENT, "\"documents\" must be a non-empty string array");
        }

-        conflict_guard guard("Rerank", slot_loop_inflight, score_inflight, "score_inflight");
-
        // Create and queue the task
        auto rd = ctx_server.get_response_reader();
        {
@@ -3267,218 +3103,12 @@ public:
        return grpc::Status::OK;
    }

-    // Score returns the model's joint log-probability of each candidate
-    // continuation given a shared prompt.
-    //
-    // WHY bypass the slot/task queue: upstream server_context exposes
-    // get_llama_context as "main thread only" and the slot loop's
-    // update_slots() owns the context whenever a task is in flight.
-    // No public synchronization primitive is available — so Score is
-    // unsafe to call concurrently with active generation through this
-    // backend. In practice routing-classifier calls happen before the
-    // request is routed to a generation backend, so the model used
-    // for Score is typically idle. Concurrent Score calls are
-    // serialised by a local mutex; KV-cache state is isolated behind
-    // a dedicated sequence ID cleared between candidates.
-    //
-    // A patch to server-context.cpp that adds SERVER_TASK_TYPE_SCORE
-    // and routes scoring through the slot loop would be the correct
-    // long-term fix; tracked as a follow-up.
-    //
-    // Perf TODO (measured: ~450 ms warm for 3 candidates on Arch-
-    // Router-1.5B Q4_K_M + Intel SYCL): the current loop re-decodes
-    // `prompt + candidate` from scratch for every candidate, throwing
-    // away the prompt's KV cache between iterations. A smarter
-    // version would:
-    //   1. Decode just the prompt once into score_seq_id.
-    //   2. Snapshot/cp that sequence (llama_memory_seq_cp) into a
-    //      per-candidate sequence id.
-    //   3. For each candidate, decode only its tokens onto the copy
-    //      (continuing from the saved prompt state), read logits.
-    //   4. llama_memory_seq_rm the copy.
-    // Estimated speedup: 3-candidate calls 450 ms -> ~150-200 ms,
-    // 6-candidate calls 630 ms -> ~220 ms. Single source-file change,
-    // no proto / Go-side changes needed. Worth doing once routing is
-    // wired into the middleware and Score is on the hot path of every
-    // chat request.
-    grpc::Status Score(ServerContext* context, const backend::ScoreRequest* request, backend::ScoreResponse* response) override {
-        auto auth = checkAuth(context);
-        if (!auth.ok()) return auth;
-        if (params_base.model.path.empty()) {
-            return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
-        }
-        if (request->candidates_size() == 0) {
-            return grpc::Status(grpc::StatusCode::INVALID_ARGUMENT, "candidates must be non-empty");
-        }
-
-        // Tripwire against the slot loop. Acquired before score_mutex
-        // so it fires even when this Score is queued behind another.
-        conflict_guard guard("Score", score_inflight, slot_loop_inflight, "slot_loop_inflight");
-
-        // Serialise concurrent Score calls. The slot loop is still
-        // free to race with us — see the class comment above.
-        static std::mutex score_mutex;
-        std::lock_guard<std::mutex> score_lock(score_mutex);
-
-        llama_context * lctx = ctx_server.get_llama_context();
-        if (lctx == nullptr) {
-            return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "llama context unavailable (sleeping?)");
-        }
-        const llama_vocab * vocab = ctx_server.impl->vocab;
-        const int32_t n_vocab = llama_vocab_n_tokens(vocab);
-        const int32_t n_ctx = llama_n_ctx(lctx);
-        llama_memory_t mem = llama_get_memory(lctx);
-
-        // The KV-cache is sized to seq_to_stream.size() at load
-        // (typically equal to n_slots, often 1). Sequence IDs must
-        // be in [0, n_seq_max), so we can't pick a high-value
-        // "private" ID — we have to share with the slot. We clear
-        // the cache before AND after each candidate to keep
-        // scoring isolated from whatever state the slot held, and
-        // the static mutex above guarantees no other Score call is
-        // racing in the meantime. The slot loop is still free to
-        // race (see comment on this method) — Score must not run
-        // concurrently with generation through this backend.
-        const llama_seq_id score_seq_id = 0;
-        llama_memory_seq_rm(mem, score_seq_id, -1, -1);
-
-        // Tokenize the shared prompt once with add_special=true so
-        // BOS is prepended when the model requires it. parse_special
-        // keeps chat-template markers in the prompt intact.
-        const std::string prompt = request->prompt();
-        std::vector<llama_token> prompt_tokens = common_tokenize(vocab, prompt, /*add_special=*/true, /*parse_special=*/true);
-        const int32_t prompt_len = (int32_t) prompt_tokens.size();
-
-        for (int ci = 0; ci < request->candidates_size(); ci++) {
-            const std::string & candidate_text = request->candidates(ci);
-
-            // Re-tokenize prompt + candidate as a single string. BPE
-            // merges across the boundary can shift the tokenization
-            // versus tokenize(prompt) ++ tokenize(candidate), so we
-            // find the divergence point against prompt_tokens.
-            std::vector<llama_token> full_tokens = common_tokenize(vocab, prompt + candidate_text, /*add_special=*/true, /*parse_special=*/true);
-            int32_t divergence = prompt_len;
-            const int32_t min_len = std::min<int32_t>(prompt_len, (int32_t) full_tokens.size());
-            for (int32_t i = 0; i < min_len; i++) {
-                if (prompt_tokens[i] != full_tokens[i]) {
-                    divergence = i;
-                    break;
-                }
-            }
-            const int32_t cand_len = (int32_t) full_tokens.size() - divergence;
-            backend::CandidateScore * cs = response->add_candidates();
-            cs->set_num_tokens(cand_len);
-            if (cand_len <= 0) {
-                cs->set_log_prob(0.0);
-                if (request->length_normalize()) {
-                    cs->set_length_normalized_log_prob(0.0);
-                }
-                continue;
-            }
-            if (divergence < 1) {
-                // Need at least one prior token (typically BOS) to
-                // predict the first candidate token's logit. Tokeniser
-                // models without BOS + an empty prompt fall in here.
-                return grpc::Status(grpc::StatusCode::INVALID_ARGUMENT,
-                    "Score: prompt produced no leading tokens; need at least one (e.g. BOS) to predict candidate");
-            }
-            if ((int32_t) full_tokens.size() > n_ctx) {
-                return grpc::Status(grpc::StatusCode::OUT_OF_RANGE,
-                    "Score: prompt+candidate exceeds context size (got " +
-                    std::to_string(full_tokens.size()) + ", n_ctx=" + std::to_string(n_ctx) + ")");
-            }
-
-            // Build a batch covering the entire prompt+candidate. We
-            // need logits at (divergence-1) onward — those are the
-            // predictions for each candidate token.
-            llama_batch batch = llama_batch_init((int32_t) full_tokens.size(), 0, 1);
-            for (int32_t i = 0; i < (int32_t) full_tokens.size(); i++) {
-                batch.token[i]    = full_tokens[i];
-                batch.pos[i]      = i;
-                batch.n_seq_id[i] = 1;
-                batch.seq_id[i][0] = score_seq_id;
-                // logits[i] is "do we want the prediction *for the
-                // next token*, computed from this position?"
-                // We want predictions for candidate tokens at
-                // positions divergence .. full_tokens.size()-1, which
-                // come from logits at positions (divergence-1) ..
-                // (full_tokens.size()-2).
-                bool need_logit = (i >= divergence - 1) && (i < (int32_t) full_tokens.size() - 1);
-                batch.logits[i] = need_logit ? 1 : 0;
-            }
-            batch.n_tokens = (int32_t) full_tokens.size();
-
-            // Decode the batch. If decode fails (e.g. KV slot
-            // exhaustion), surface as INTERNAL — the caller will
-            // typically fall back to a sampling-based classifier.
-            int decode_err = llama_decode(lctx, batch);
-            if (decode_err != 0) {
-                llama_batch_free(batch);
-                llama_memory_seq_rm(mem, score_seq_id, -1, -1);
-                return grpc::Status(grpc::StatusCode::INTERNAL,
-                    "llama_decode failed during Score: " + std::to_string(decode_err));
-            }
-
-            // Sum log-probabilities of the actual candidate tokens.
-            double total_log_prob = 0.0;
-            for (int32_t k = 0; k < cand_len; k++) {
-                // The k-th candidate token sits at full_tokens index
-                // (divergence + k). Its predicting logit is at batch
-                // position (divergence + k - 1).
-                int32_t logit_pos = divergence + k - 1;
-                const float * logits = llama_get_logits_ith(lctx, logit_pos);
-                if (logits == nullptr) {
-                    llama_batch_free(batch);
-                    llama_memory_seq_rm(mem, score_seq_id, -1, -1);
-                    return grpc::Status(grpc::StatusCode::INTERNAL,
-                        "llama_get_logits_ith returned null at position " + std::to_string(logit_pos));
-                }
-                llama_token target_token = full_tokens[divergence + k];
-
-                // Compute log_softmax(logits)[target_token] with the
-                // max-subtraction stability trick.
-                float max_logit = logits[0];
-                for (int32_t v = 1; v < n_vocab; v++) {
-                    if (logits[v] > max_logit) max_logit = logits[v];
-                }
-                double sum_exp = 0.0;
-                for (int32_t v = 0; v < n_vocab; v++) {
-                    sum_exp += std::exp((double)(logits[v] - max_logit));
-                }
-                double token_log_prob = (double)(logits[target_token] - max_logit) - std::log(sum_exp);
-                total_log_prob += token_log_prob;
-
-                if (request->include_token_logprobs()) {
-                    backend::TokenLogProb * tlp = cs->add_tokens();
-                    std::string piece = common_token_to_piece(lctx, target_token);
-                    tlp->set_token(piece);
-                    tlp->set_log_prob(token_log_prob);
-                }
-            }
-
-            cs->set_log_prob(total_log_prob);
-            if (request->length_normalize() && cand_len > 0) {
-                cs->set_length_normalized_log_prob(total_log_prob / (double) cand_len);
-            }
-
-            llama_batch_free(batch);
-            // Drop this candidate's KV-cache contribution so the next
-            // candidate starts from a clean state. Without this, the
-            // next decode would conflict at positions 0..N-1 for our
-            // sequence ID.
-            llama_memory_seq_rm(mem, score_seq_id, -1, -1);
-        }
-
-        return grpc::Status::OK;
-    }
-
    grpc::Status TokenizeString(ServerContext* context, const backend::PredictOptions* request, backend::TokenizationResponse* response) override {
        auto auth = checkAuth(context);
        if (!auth.ok()) return auth;
        if (params_base.model.path.empty()) {
            return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
        }
-        conflict_guard guard("TokenizeString", slot_loop_inflight, score_inflight, "score_inflight");
        json body = parse_options(false, request, params_base, ctx_server.get_llama_context());
        body["stream"] = false;

@@ -3500,8 +3130,6 @@ public:

    grpc::Status GetMetrics(ServerContext* /*context*/, const backend::MetricsRequest* /*request*/, backend::MetricsResponse* response) override {

-        conflict_guard guard("GetMetrics", slot_loop_inflight, score_inflight, "score_inflight");
-
        // request slots data using task queue
        auto rd = ctx_server.get_response_reader();
        int task_id = rd.queue_tasks.get_new_id();
--- a/backend/cpp/turboquant/Makefile
+++ b/backend/cpp/turboquant/Makefile
@@ -1,7 +1,7 @@

 # Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
 # Auto-bumped nightly by .github/workflows/bump_deps.yaml.
-TURBOQUANT_VERSION?=7d9715f1f071fa07c7b2ad3dbfd320b314139e65
+TURBOQUANT_VERSION?=4c1c3ac09d2dba0aa9a55b94f6c50c41a92f9c8c
 LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant

 CMAKE_ARGS?=
--- a/backend/cpp/turboquant/patch-grpc-server.sh
+++ b/backend/cpp/turboquant/patch-grpc-server.sh
@@ -1,28 +1,23 @@
 #!/bin/bash
 # Patch the shared backend/cpp/llama-cpp/grpc-server.cpp *copy* used by the
-# turboquant build to account for the gaps between upstream and the fork:
+# turboquant build:
 #
 #   1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
 #      fork-specific `turbo2` / `turbo3` / `turbo4` cache types.
-#   2. Define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top of the file
-#      so the grpc-server option parser skips the two references to
-#      common_params::checkpoint_min_step (the default and the option handler).
-#      That field does not exist in the fork yet; drop this once it does.
 #
-# The fork used to lag upstream on the whole common_params_speculative refactor
-# (ggml-org/llama.cpp#22397/#22838/#22964), the model_tgt rename (#22838) and
-# get_media_marker (#21962), which required a much larger compat shim here
-# (flat-field sed renames + a coarse LOCALAI_LEGACY_LLAMA_CPP_SPEC define). The
-# fork has since rebased past all of those, so the only remaining gap is
-# checkpoint_min_step. If a future bump reintroduces a divergence, add a narrow
-# guard in grpc-server.cpp keyed on a fork-specific macro and inject it here
-# rather than resurrecting the coarse one.
+# Historical context: this script used to also paper over API gaps between the
+# fork and upstream (flat vs nested `common_params_speculative`, missing
+# `get_media_marker()`, `ctx_server.impl->model` vs `model_tgt`, and a
+# LOCALAI_LEGACY_LLAMA_CPP_SPEC compile gate). As of TURBOQUANT_VERSION
+# 4c1c3ac0 the fork has rebased past ggml-org/llama.cpp#21962, #22397 and
+# #22838, so the shared grpc-server.cpp compiles unmodified against the fork.
+# Only the fork-specific KV-cache enum entries remain.
 #
 # We patch the *copy* sitting in turboquant-<flavor>-build/, never the original
-# under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps compiling
+# under backend/cpp/llama-cpp/, so the stock llama-cpp build still compiles
 # against vanilla upstream.
 #
-# Idempotent: skips each insertion if its marker is already present (so re-runs
+# Idempotent: skips the insertion if its marker is already present (so re-runs
 # of the same build dir don't double-insert).

 set -euo pipefail
@@ -50,7 +45,7 @@ else
    awk '
        /^    GGML_TYPE_Q5_1,$/ && !done {
            print
-            print "    // turboquant fork extras — added by patch-grpc-server.sh"
+            print "    // turboquant fork extras - added by patch-grpc-server.sh"
            print "    GGML_TYPE_TURBO2_0,"
            print "    GGML_TYPE_TURBO3_0,"
            print "    GGML_TYPE_TURBO4_0,"
@@ -70,34 +65,4 @@ else
    echo "==> KV allow-list patch OK"
 fi

-# 2. Define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top of the file so
-#    the grpc-server option parser skips the two references to
-#    common_params::checkpoint_min_step (the default assignment and the option
-#    handler). That field does not exist in the fork yet. Drop this block once
-#    the fork rebases past the bump that added checkpoint_min_step.
-if grep -q '^#define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP' "$SRC"; then
-    echo "==> $SRC already defines LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP, skipping"
-else
-    echo "==> patching $SRC to define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP at the top"
-    # Insert the define before the very first `#include` so it precedes the
-    # checkpoint_min_step references.
-    awk '
-        !done && /^#include/ {
-            print "#define LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP 1"
-            print "// ^ injected by backend/cpp/turboquant/patch-grpc-server.sh"
-            print ""
-            done = 1
-        }
-        { print }
-        END {
-            if (!done) {
-                print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP" > "/dev/stderr"
-                exit 1
-            }
-        }
-    ' "$SRC" > "$SRC.tmp"
-    mv "$SRC.tmp" "$SRC"
-    echo "==> LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP define OK"
-fi
-
 echo "==> all patches applied"
--- a/backend/cpp/turboquant/patches/0001-hip-guard-copy2d-peer-fastpath.patch
+++ b/backend/cpp/turboquant/patches/0001-hip-guard-copy2d-peer-fastpath.patch
@@ -1,55 +0,0 @@
-hip: port the turboquant CUDA additions that ggml's HIP shim doesn't cover
-
-The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs
-that ggml's HIP (and MUSA) compatibility layer does not provide, breaking
-the -gpu-rocm-hipblas-turboquant build:
-
-  1. ggml_cuda_copy2d_across_devices() (host-staged cross-device copy for
-     split mul_mat output) uses the CUDA 3D-peer copy APIs
-     cudaMemcpy3DPeerParms / make_cudaPitchedPtr / make_cudaExtent /
-     cudaMemcpy3DPeerAsync. HIP genuinely does not support these (see the
-     fork's own comment "HIP does not support cudaMemcpy3DPeerAsync"), so
-     guard the peer fast path with #if !defined(GGML_USE_HIP) &&
-     !defined(GGML_USE_MUSA) -- matching how the fork already guards the
-     same API for the sibling 2D copy -- and fall through to the existing
-     cudaMemcpyAsync staging fallback below (functionally identical,
-     slightly slower on multi-GPU ROCm).
-
-  2. ggml_backend_cuda_device_event_new() creates its event with plain
-     cudaEventCreate, which ggml's HIP shim does not alias (it only aliases
-     cudaEventCreateWithFlags). Use cudaEventCreateWithFlags(..., 
-     cudaEventDisableTiming) -- exactly what the rest of this file already
-     does (cf. lines ~1034, ~3461) and HIP-safe.
-
-CUDA builds are unaffected. Drop the relevant hunk once the fork HIP-ports
-these; apply-patches.sh fails fast if an anchor goes stale.
-
-diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
-index 0427e6b..6352e6a 100644
---- a/ggml/src/ggml-cuda/ggml-cuda.cu
-+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
-@@ -1933,6 +1933,7 @@ static cudaError_t ggml_cuda_copy2d_across_devices(
-     size_t width, size_t height, cudaStream_t dst_stream, cudaStream_t src_stream) {
- 
-     const auto & info = ggml_cuda_info();
-+#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)  // 3D-peer copy types unmapped by ggml's HIP/MUSA shim; use staging fallback below
-     if (info.peer_access[src_device][dst_device]) {
-         cudaMemcpy3DPeerParms p = {};
-         p.dstDevice = dst_device;
-@@ -1942,6 +1943,7 @@ static cudaError_t ggml_cuda_copy2d_across_devices(
-         p.extent = make_cudaExtent(width, height, 1);
-         return cudaMemcpy3DPeerAsync(&p, dst_stream);
-     }
-+#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
- 
-     // Fallback: stage all rows through a single contiguous pinned buffer
-     int prev_device = ggml_cuda_get_device();
-@@ -5714,7 +5716,7 @@ static ggml_backend_event_t ggml_backend_cuda_device_event_new(ggml_backend_dev_
-     ggml_cuda_set_device(dev_ctx->device);
- 
-     cudaEvent_t event;
--    CUDA_CHECK(cudaEventCreate(&event));
-+    CUDA_CHECK(cudaEventCreateWithFlags(&event, cudaEventDisableTiming));
- 
-     return new ggml_backend_event {
-         /* .device  = */ dev,
--- a/backend/go/cloud-proxy/Makefile
+++ b/backend/go/cloud-proxy/Makefile
@@ -1,12 +0,0 @@
-GOCMD=go
-
-cloud-proxy:
-	CGO_ENABLED=0 $(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o cloud-proxy ./
-
-package:
-	bash package.sh
-
-build: cloud-proxy package
-
-clean:
-	rm -f cloud-proxy
--- a/backend/go/cloud-proxy/cloud_proxy_suite_test.go
+++ b/backend/go/cloud-proxy/cloud_proxy_suite_test.go
@@ -1,16 +0,0 @@
-package main
-
-import (
-	"testing"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-// Ginkgo bootstrap. The other Test* functions in this package use
-// raw testing.T and run independently; they coexist with Ginkgo
-// specs registered via Describe / Context.
-func TestCloudProxySpecs(t *testing.T) {
-	RegisterFailHandler(Fail)
-	RunSpecs(t, "cloud-proxy specs")
-}
--- a/backend/go/cloud-proxy/main.go
+++ b/backend/go/cloud-proxy/main.go
@@ -1,39 +0,0 @@
-package main
-
-// cloud-proxy is a LocalAI backend that forwards request traffic to an
-// external HTTP provider (OpenAI, Anthropic, etc.). Two modes:
-//
-//   - passthrough: serves the Forward RPC; the client wire format is
-//     preserved end-to-end, no translation.
-//   - translate: serves Predict/PredictStream; the backend converts
-//     internal proto to the provider's wire format. (Phases 5–6.)
-//
-// LoadModel reads UpstreamURL/Mode/Provider/key references from
-// ProxyOptions and resolves the API key once at load time.
-
-import (
-	"flag"
-	"os"
-
-	grpc "github.com/mudler/LocalAI/pkg/grpc"
-	"github.com/mudler/xlog"
-	"golang.org/x/term"
-)
-
-var addr = flag.String("addr", "localhost:50051", "the address to listen on")
-
-func main() {
-	// xlog's default handler emits ANSI color codes; that's fine for an
-	// interactive shell but unreadable when the backend's stdout is
-	// captured by LocalAI and tee'd to a log file. Force plain text when
-	// LOCALAI_LOG_FORMAT is unset and stdout isn't a terminal.
-	format := os.Getenv("LOCALAI_LOG_FORMAT")
-	if format == "" && !term.IsTerminal(int(os.Stdout.Fd())) {
-		format = xlog.TextFormat
-	}
-	xlog.SetLogger(xlog.NewLogger(xlog.LogLevel(os.Getenv("LOCALAI_LOG_LEVEL")), format))
-	flag.Parse()
-	if err := grpc.StartServer(*addr, NewCloudProxy()); err != nil {
-		panic(err)
-	}
-}
--- a/backend/go/cloud-proxy/package.sh
+++ b/backend/go/cloud-proxy/package.sh
@@ -1,13 +0,0 @@
-#!/bin/bash
-
-# Script to copy the cloud-proxy binary into the package dir for the
-# final Dockerfile stage. Mirrors backend/go/local-store/package.sh —
-# no extra runtime libs needed since the backend is pure Go.
-
-set -e
-
-CURDIR=$(dirname "$(realpath $0)")
-
-mkdir -p $CURDIR/package
-cp -avf $CURDIR/cloud-proxy $CURDIR/package/
-cp -rfv $CURDIR/run.sh $CURDIR/package/
--- a/backend/go/cloud-proxy/passthrough_edge_test.go
+++ b/backend/go/cloud-proxy/passthrough_edge_test.go
@@ -1,325 +0,0 @@
-package main
-
-import (
-	"context"
-	"errors"
-	"io"
-	"net/http"
-	"net/http/httptest"
-	"strconv"
-	"sync"
-
-	grpc "github.com/mudler/LocalAI/pkg/grpc"
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-var _ = Describe("composeURL", func() {
-	// Upstream URL convention: gallery configs put the canonical path
-	// in upstream_url, so per-request Path is ignored. A bare-host
-	// upstream_url accepts the per-request path.
-	DescribeTable("path resolution",
-		func(upstream, reqPath, want string) {
-			got, err := composeURL(upstream, reqPath)
-			Expect(err).NotTo(HaveOccurred())
-			Expect(got).To(Equal(want))
-		},
-		Entry("full path wins", "https://api.openai.com/v1/chat/completions", "/v1/something-else", "https://api.openai.com/v1/chat/completions"),
-		Entry("bare host accepts path", "https://api.openai.com", "/v1/chat/completions", "https://api.openai.com/v1/chat/completions"),
-		Entry("root slash treated as bare", "https://api.openai.com/", "/v1/chat/completions", "https://api.openai.com/v1/chat/completions"),
-		Entry("bare host + empty path", "https://api.openai.com", "", "https://api.openai.com"),
-	)
-
-	It("returns an error on invalid upstream URL", func() {
-		_, err := composeURL("://garbage", "")
-		Expect(err).To(HaveOccurred())
-	})
-})
-
-var _ = Describe("applyAuthHeader", func() {
-	It("sets x-api-key and anthropic-version for Anthropic, no Authorization", func() {
-		req, _ := http.NewRequest("POST", "https://example.com", nil)
-		applyAuthHeader(req, providerAnthropic, "ant-key")
-		Expect(req.Header.Get("x-api-key")).To(Equal("ant-key"))
-		Expect(req.Header.Get("anthropic-version")).NotTo(BeEmpty())
-		Expect(req.Header.Get("Authorization")).To(BeEmpty(), "Authorization must not leak on Anthropic backend")
-	})
-
-	It("sets Bearer Authorization for OpenAI, no x-api-key", func() {
-		req, _ := http.NewRequest("POST", "https://example.com", nil)
-		applyAuthHeader(req, providerOpenAI, "sk-key")
-		Expect(req.Header.Get("Authorization")).To(Equal("Bearer sk-key"))
-		Expect(req.Header.Get("x-api-key")).To(BeEmpty(), "x-api-key must not leak on OpenAI backend")
-	})
-
-	It("defaults to Bearer when provider is empty", func() {
-		// Passthrough mode often has provider == "" because the operator
-		// doesn't claim a specific upstream wire format. Most providers
-		// (including OpenAI-compatible ones) accept Bearer, so default to it.
-		req, _ := http.NewRequest("POST", "https://example.com", nil)
-		applyAuthHeader(req, "", "some-key")
-		Expect(req.Header.Get("Authorization")).To(Equal("Bearer some-key"))
-	})
-
-	It("preserves an existing anthropic-version header", func() {
-		// If the client supplied anthropic-version (rare but legitimate
-		// for an upstream pinned to a specific date), the proxy must not
-		// clobber it.
-		req, _ := http.NewRequest("POST", "https://example.com", nil)
-		req.Header.Set("anthropic-version", "2024-10-01")
-		applyAuthHeader(req, providerAnthropic, "k")
-		Expect(req.Header.Get("anthropic-version")).To(Equal("2024-10-01"))
-	})
-})
-
-var _ = Describe("isHopByHopHeader", func() {
-	DescribeTable("hop-by-hop classification",
-		func(header string, want bool) {
-			Expect(isHopByHopHeader(header)).To(Equal(want))
-		},
-		Entry("Connection is hop-by-hop", "Connection", true),
-		Entry("Keep-Alive is hop-by-hop", "Keep-Alive", true),
-		Entry("Proxy-Connection is hop-by-hop", "Proxy-Connection", true),
-		Entry("Transfer-Encoding is hop-by-hop", "Transfer-Encoding", true),
-		Entry("TE is hop-by-hop", "TE", true),
-		Entry("Trailer is hop-by-hop", "Trailer", true),
-		Entry("Upgrade is hop-by-hop", "Upgrade", true),
-		Entry("Host is hop-by-hop", "Host", true),
-		Entry("Content-Length is hop-by-hop", "Content-Length", true),
-		// Case-insensitive — RFC 7230 doesn't constrain header case.
-		Entry("lowercase connection is hop-by-hop", "connection", true),
-		Entry("uppercase HOST is hop-by-hop", "HOST", true),
-		// Non hop-by-hop — must NOT be stripped.
-		Entry("Authorization is end-to-end", "Authorization", false),
-		Entry("Content-Type is end-to-end", "Content-Type", false),
-		Entry("Accept is end-to-end", "Accept", false),
-		Entry("X-Custom is end-to-end", "X-Custom", false),
-	)
-})
-
-var _ = Describe("Forward", func() {
-	It("strips hop-by-hop and Connection headers before upstream, preserves custom headers", func() {
-		gotConnection := make(chan string, 1)
-		gotXCustom := make(chan string, 1)
-		gotHost := make(chan string, 1)
-		upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-			gotConnection <- r.Header.Get("Connection")
-			gotXCustom <- r.Header.Get("X-Custom")
-			gotHost <- r.Header.Get("Host")
-			w.WriteHeader(http.StatusOK)
-		}))
-		defer upstream.Close()
-
-		cp := NewCloudProxy()
-		Expect(cp.Load(&pb.ModelOptions{
-			Proxy: &pb.ProxyOptions{
-				UpstreamUrl: upstream.URL,
-				Mode:        modePassthrough,
-			},
-		})).To(Succeed())
-
-		addr := "test://forward-hopbyhop"
-		grpc.Provide(addr, cp)
-		c := grpc.NewClient(addr, true, nil, false)
-		stream, err := c.Forward(context.Background())
-		Expect(err).NotTo(HaveOccurred())
-		Expect(stream.Send(&pb.ForwardRequest{
-			Path:   "/v1/chat/completions",
-			Method: "POST",
-			Headers: []*pb.ForwardHeader{
-				{Name: "Connection", Value: "keep-alive"},
-				{Name: "Host", Value: "spoofed.example.com"},
-				{Name: "X-Custom", Value: "preserved"},
-			},
-		})).To(Succeed())
-		Expect(stream.CloseSend()).To(Succeed())
-		_, _ = stream.Recv()
-		for {
-			if _, err := stream.Recv(); errors.Is(err, io.EOF) || err != nil {
-				break
-			}
-		}
-
-		Expect(<-gotConnection).To(BeEmpty(), "Connection must not leak to upstream")
-		Expect(<-gotHost).NotTo(Equal("spoofed.example.com"), "Host header must not be spoofed through")
-		Expect(<-gotXCustom).To(Equal("preserved"), "X-Custom header must survive")
-	})
-
-	It("replaces caller-supplied Authorization with the configured key", func() {
-		// The proxy must overwrite a client-supplied Authorization header
-		// so a downstream caller can't smuggle stale or wrong credentials.
-		gotAuth := make(chan string, 1)
-		upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-			gotAuth <- r.Header.Get("Authorization")
-			w.WriteHeader(http.StatusOK)
-		}))
-		defer upstream.Close()
-
-		GinkgoT().Setenv("CLOUD_PROXY_AUTH_REPLACE_KEY", "sk-real")
-
-		cp := NewCloudProxy()
-		Expect(cp.Load(&pb.ModelOptions{
-			Proxy: &pb.ProxyOptions{
-				UpstreamUrl: upstream.URL,
-				Mode:        modePassthrough,
-				ApiKeyEnv:   "CLOUD_PROXY_AUTH_REPLACE_KEY",
-			},
-		})).To(Succeed())
-
-		addr := "test://forward-replaces-auth"
-		grpc.Provide(addr, cp)
-		c := grpc.NewClient(addr, true, nil, false)
-		stream, err := c.Forward(context.Background())
-		Expect(err).NotTo(HaveOccurred())
-		Expect(stream.Send(&pb.ForwardRequest{
-			Path:   "/v1/chat/completions",
-			Method: "POST",
-			Headers: []*pb.ForwardHeader{
-				// Client-supplied Authorization with the wrong scheme / key.
-				{Name: "Authorization", Value: "Basic Zm9vOmJhcg=="},
-			},
-		})).To(Succeed())
-		Expect(stream.CloseSend()).To(Succeed())
-		_, _ = stream.Recv()
-		for {
-			if _, err := stream.Recv(); errors.Is(err, io.EOF) || err != nil {
-				break
-			}
-		}
-
-		Expect(<-gotAuth).To(Equal("Bearer sk-real"), "caller-supplied Basic header must be replaced")
-	})
-
-	It("refuses to follow upstream redirects and never leaks the key to the redirect target", func() {
-		// A 3xx from the configured upstream means misconfiguration or a
-		// hijacked/spoofed host. Following it would replay the request —
-		// and the injected API key — to the Location host. Anthropic's
-		// x-api-key is NOT stripped by Go on cross-host redirects, so this
-		// would be a credential leak. The proxy must refuse the redirect.
-		sinkHit := make(chan string, 1)
-		sink := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-			sinkHit <- r.Header.Get("x-api-key")
-			w.WriteHeader(http.StatusOK)
-		}))
-		defer sink.Close()
-
-		redirector := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-			http.Redirect(w, r, sink.URL, http.StatusFound)
-		}))
-		defer redirector.Close()
-
-		GinkgoT().Setenv("CLOUD_PROXY_REDIRECT_KEY", "ant-secret")
-
-		cp := NewCloudProxy()
-		Expect(cp.Load(&pb.ModelOptions{
-			Proxy: &pb.ProxyOptions{
-				UpstreamUrl: redirector.URL,
-				Mode:        modePassthrough,
-				Provider:    providerAnthropic,
-				ApiKeyEnv:   "CLOUD_PROXY_REDIRECT_KEY",
-			},
-		})).To(Succeed())
-
-		addr := "test://forward-no-redirect"
-		grpc.Provide(addr, cp)
-		c := grpc.NewClient(addr, true, nil, false)
-		stream, err := c.Forward(context.Background())
-		Expect(err).NotTo(HaveOccurred())
-		Expect(stream.Send(&pb.ForwardRequest{
-			Path:   "/v1/messages",
-			Method: "POST",
-		})).To(Succeed())
-		Expect(stream.CloseSend()).To(Succeed())
-
-		// Drain the stream; a refused redirect surfaces as a non-EOF error.
-		var streamErr error
-		for {
-			if _, err := stream.Recv(); err != nil {
-				if !errors.Is(err, io.EOF) {
-					streamErr = err
-				}
-				break
-			}
-		}
-		Expect(streamErr).To(HaveOccurred(), "refused redirect must surface as an error")
-		Expect(sinkHit).NotTo(Receive(), "the redirect target must never be contacted")
-	})
-
-	It("handles concurrent calls without interference", func() {
-		// CloudProxy explicitly omits base.SingleThread — independent
-		// Forward streams must not block each other or leak state.
-		upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-			body, _ := io.ReadAll(r.Body)
-			w.WriteHeader(http.StatusOK)
-			_, _ = w.Write(body)
-		}))
-		defer upstream.Close()
-
-		cp := NewCloudProxy()
-		Expect(cp.Load(&pb.ModelOptions{
-			Proxy: &pb.ProxyOptions{
-				UpstreamUrl: upstream.URL,
-				Mode:        modePassthrough,
-			},
-		})).To(Succeed())
-		addr := "test://forward-concurrent"
-		grpc.Provide(addr, cp)
-		c := grpc.NewClient(addr, true, nil, false)
-
-		const N = 8
-		var wg sync.WaitGroup
-		errs := make(chan error, N)
-		for i := 0; i < N; i++ {
-			wg.Add(1)
-			go func(idx int) {
-				defer wg.Done()
-				stream, err := c.Forward(context.Background())
-				if err != nil {
-					errs <- err
-					return
-				}
-				payload := "request-" + string(rune('A'+idx))
-				if err := stream.Send(&pb.ForwardRequest{
-					Path:      "/v1/chat/completions",
-					Method:    "POST",
-					BodyChunk: []byte(payload),
-				}); err != nil {
-					errs <- err
-					return
-				}
-				_ = stream.CloseSend()
-				_, _ = stream.Recv()
-				var body []byte
-				for {
-					r, err := stream.Recv()
-					if errors.Is(err, io.EOF) {
-						break
-					}
-					if err != nil {
-						errs <- err
-						return
-					}
-					body = append(body, r.GetBodyChunk()...)
-				}
-				if string(body) != payload {
-					errs <- &echoMismatch{want: payload, got: string(body)}
-				}
-			}(i)
-		}
-		wg.Wait()
-		close(errs)
-		var collected []error
-		for err := range errs {
-			collected = append(collected, err)
-		}
-		Expect(collected).To(BeEmpty(), "no concurrent Forward call should fail")
-	})
-})
-
-type echoMismatch struct{ want, got string }
-
-func (e *echoMismatch) Error() string {
-	return "echo mismatch: want " + strconv.Quote(e.want) + " got " + strconv.Quote(e.got)
-}
--- a/backend/go/cloud-proxy/provider_anthropic.go
+++ b/backend/go/cloud-proxy/provider_anthropic.go
@@ -1,508 +0,0 @@
-package main
-
-import (
-	"bufio"
-	"bytes"
-	"context"
-	"encoding/json"
-	"fmt"
-	"io"
-	"net/http"
-	"strings"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	"github.com/mudler/xlog"
-)
-
-// Anthropic Messages API wire-format types. Narrowed to what translate
-// mode preserves through the Reply proto: text + tool_use blocks +
-// usage tokens. Image blocks, prompt caching, metadata, and stop
-// sequence metadata are not modelled — passthrough mode covers those.
-//
-// Notable differences from OpenAI:
-//   - max_tokens is REQUIRED. Anthropic 400s without it.
-//   - Roles are user/assistant only — system messages move to a
-//     top-level `system` string field.
-//   - Streaming SSE uses event: lines alongside data: lines. The
-//     events we care about: content_block_start (carries tool_use
-//     init: id + name), content_block_delta (text_delta with text;
-//     input_json_delta with partial_json for tool arguments), and
-//     message_stop (terminates the stream). Others are ignored.
-
-type anthropicRequest struct {
-	Model         string               `json:"model"`
-	MaxTokens     int32                `json:"max_tokens"`
-	System        string               `json:"system,omitempty"`
-	Messages      []anthropicMessage   `json:"messages"`
-	Stream        bool                 `json:"stream,omitempty"`
-	Temperature   *float64             `json:"temperature,omitempty"`
-	TopP          *float64             `json:"top_p,omitempty"`
-	StopSequences []string             `json:"stop_sequences,omitempty"`
-	Tools         []anthropicTool      `json:"tools,omitempty"`
-	ToolChoice    *anthropicToolChoice `json:"tool_choice,omitempty"`
-}
-
-// Content is `any` because Anthropic accepts a bare string OR a
-// list of content blocks. Use the string form for plain user/
-// assistant turns; switch to []anthropicContentBlock when the
-// turn needs tool_use (assistant) or tool_result (user) blocks.
-type anthropicMessage struct {
-	Role    string `json:"role"`
-	Content any    `json:"content"`
-}
-
-type anthropicTool struct {
-	Name        string          `json:"name"`
-	Description string          `json:"description,omitempty"`
-	InputSchema json.RawMessage `json:"input_schema"`
-}
-
-// anthropicToolChoice mirrors the four shapes Anthropic accepts:
-// {"type":"auto"} | {"type":"any"} | {"type":"tool","name":"X"} |
-// {"type":"none"} (newer models). OpenAI's "auto"/"none"/
-// "required"/{"function":{"name":"X"}} all map here.
-type anthropicToolChoice struct {
-	Type string `json:"type"`
-	Name string `json:"name,omitempty"`
-}
-
-// anthropicContentBlock is the union shape used both for response
-// blocks (text/tool_use we read off the wire) and outbound request
-// blocks (tool_use/tool_result we emit in the conversation history).
-// Anthropic encodes tool calls inline rather than as a separate field,
-// so we walk Content[] looking for type=="tool_use" on responses and
-// produce equivalent blocks when serialising prior-turn tool calls.
-type anthropicContentBlock struct {
-	Type  string          `json:"type"`
-	Text  string          `json:"text,omitempty"`
-	ID    string          `json:"id,omitempty"`
-	Name  string          `json:"name,omitempty"`
-	Input json.RawMessage `json:"input,omitempty"`
-	// Tool-result block fields. tool_result uses `content` (not
-	// `text`) and pairs with `tool_use_id`; modelling them as
-	// distinct fields avoids ambiguity at marshal time.
-	ToolUseID     string `json:"tool_use_id,omitempty"`
-	ResultContent string `json:"content,omitempty"`
-}
-
-type anthropicResponse struct {
-	ID      string                  `json:"id"`
-	Type    string                  `json:"type"`
-	Role    string                  `json:"role"`
-	Content []anthropicContentBlock `json:"content"`
-	Model   string                  `json:"model"`
-	Usage   *anthropicUsage         `json:"usage,omitempty"`
-}
-
-type anthropicUsage struct {
-	InputTokens  int `json:"input_tokens"`
-	OutputTokens int `json:"output_tokens"`
-}
-
-// anthropicStreamEvent is the union shape used for every event type we
-// process. Type discriminates; only the matching fields are populated.
-// content_block_start carries ContentBlock (with id/name for tool_use);
-// content_block_delta carries Delta (text or partial_json).
-type anthropicStreamEvent struct {
-	Type         string                 `json:"type"`
-	Index        int                    `json:"index,omitempty"`
-	ContentBlock *anthropicContentBlock `json:"content_block,omitempty"`
-	Delta        *anthropicStreamDelta  `json:"delta,omitempty"`
-	Message      *anthropicResponse     `json:"message,omitempty"`
-	Usage        *anthropicUsage        `json:"usage,omitempty"`
-}
-
-type anthropicStreamDelta struct {
-	Type        string `json:"type,omitempty"`
-	Text        string `json:"text,omitempty"`
-	PartialJSON string `json:"partial_json,omitempty"`
-}
-
-// Anthropic requires max_tokens. If the caller didn't set it, use a
-// generous-but-bounded default so the request doesn't 400.
-const anthropicDefaultMaxTokens int32 = 4096
-
-const anthropicToolChoiceNone = "none"
-
-// Reused JSON-Schema defaults for malformed inputs. Anthropic requires
-// input_schema to be a JSON object and tool_use.input to be a JSON
-// object; clients that omit them must not 400 the entire request.
-var (
-	emptyJSONObject   = json.RawMessage(`{}`)
-	emptyObjectSchema = json.RawMessage(`{"type":"object","properties":{}}`)
-)
-
-func buildAnthropicRequest(opts *pb.PredictOptions, cfg *proxyConfig, stream bool) ([]byte, error) {
-	req := anthropicRequest{
-		Model:         modelName(cfg, opts),
-		MaxTokens:     opts.GetTokens(),
-		Stream:        stream,
-		StopSequences: opts.GetStopPrompts(),
-	}
-	if req.MaxTokens <= 0 {
-		req.MaxTokens = anthropicDefaultMaxTokens
-	}
-	// Newer Anthropic models 400 when both temperature and top_p are
-	// set ("`temperature` and `top_p` cannot both be specified for
-	// this model. Please use only one.") even though their docs only
-	// "recommend" picking one. The OpenAI-compatible chat UI almost
-	// always sends both with default values, so prefer temperature
-	// and drop top_p when both are present.
-	if t := opts.GetTemperature(); t != 0 {
-		v := float64(t)
-		req.Temperature = &v
-	} else if t := opts.GetTopP(); t != 0 {
-		v := float64(t)
-		req.TopP = &v
-	}
-
-	req.Tools = convertOpenAITools(opts.GetTools())
-	req.ToolChoice = convertOpenAIToolChoice(opts.GetToolChoice())
-	// Anthropic rejects tool_choice without tools and older models
-	// don't accept {"type":"none"} — collapse to a no-tools request.
-	if req.ToolChoice != nil && req.ToolChoice.Type == anthropicToolChoiceNone {
-		req.Tools, req.ToolChoice = nil, nil
-	}
-
-	var systemParts []string
-	for _, m := range opts.GetMessages() {
-		role := m.GetRole()
-		if role == "system" {
-			if c := m.GetContent(); c != "" {
-				systemParts = append(systemParts, c)
-			}
-			continue
-		}
-		switch role {
-		case "user":
-			req.Messages = append(req.Messages, anthropicMessage{
-				Role:    "user",
-				Content: m.GetContent(),
-			})
-		case "assistant":
-			if blocks := assistantBlocks(m); blocks != nil {
-				req.Messages = append(req.Messages, anthropicMessage{Role: "assistant", Content: blocks})
-				continue
-			}
-			req.Messages = append(req.Messages, anthropicMessage{
-				Role:    "assistant",
-				Content: m.GetContent(),
-			})
-		case "tool", "function":
-			req.Messages = appendToolResult(req.Messages, anthropicContentBlock{
-				Type:          "tool_result",
-				ToolUseID:     m.GetToolCallId(),
-				ResultContent: m.GetContent(),
-			})
-		}
-	}
-	req.System = strings.Join(systemParts, "\n\n")
-
-	if len(req.Messages) == 0 && opts.GetPrompt() != "" {
-		req.Messages = []anthropicMessage{{Role: "user", Content: opts.GetPrompt()}}
-	}
-
-	return json.Marshal(req)
-}
-
-// appendToolResult appends a tool_result block as a user message,
-// merging into a preceding user message that already carries blocks.
-// Anthropic concatenates consecutive same-role messages on its end,
-// but explicit merging keeps the body smaller and the conversation
-// strictly alternating — which some upstream filters require.
-func appendToolResult(msgs []anthropicMessage, block anthropicContentBlock) []anthropicMessage {
-	if n := len(msgs); n > 0 && msgs[n-1].Role == "user" {
-		if existing, ok := msgs[n-1].Content.([]anthropicContentBlock); ok {
-			msgs[n-1].Content = append(existing, block)
-			return msgs
-		}
-	}
-	return append(msgs, anthropicMessage{
-		Role:    "user",
-		Content: []anthropicContentBlock{block},
-	})
-}
-
-func convertOpenAITools(toolsJSON string) []anthropicTool {
-	if toolsJSON == "" {
-		return nil
-	}
-	var raw []openAITool
-	if err := json.Unmarshal([]byte(toolsJSON), &raw); err != nil {
-		xlog.Warn("cloud-proxy: anthropic translate: unparseable tools JSON, dropping", "error", err)
-		return nil
-	}
-	tools := make([]anthropicTool, 0, len(raw))
-	for _, t := range raw {
-		if t.Function.Name == "" {
-			continue
-		}
-		schema := t.Function.Parameters
-		if len(schema) == 0 {
-			schema = emptyObjectSchema
-		}
-		tools = append(tools, anthropicTool{
-			Name:        t.Function.Name,
-			Description: t.Function.Description,
-			InputSchema: schema,
-		})
-	}
-	return tools
-}
-
-// convertOpenAIToolChoice accepts the spec form
-// ({type:function, function:{name:X}}) and the flat legacy form
-// ({type:function, name:X}) some clients send. Unknown object shapes
-// are warned and dropped rather than silently treated as auto.
-func convertOpenAIToolChoice(toolChoiceJSON string) *anthropicToolChoice {
-	if toolChoiceJSON == "" {
-		return nil
-	}
-	var asString string
-	if err := json.Unmarshal([]byte(toolChoiceJSON), &asString); err == nil {
-		switch asString {
-		case "auto":
-			return &anthropicToolChoice{Type: "auto"}
-		case "none":
-			return &anthropicToolChoice{Type: anthropicToolChoiceNone}
-		case "required":
-			return &anthropicToolChoice{Type: "any"}
-		}
-		return nil
-	}
-	var asObj struct {
-		Type     string `json:"type"`
-		Name     string `json:"name"`
-		Function struct {
-			Name string `json:"name"`
-		} `json:"function"`
-	}
-	if err := json.Unmarshal([]byte(toolChoiceJSON), &asObj); err != nil {
-		xlog.Warn("cloud-proxy: anthropic translate: unparseable tool_choice, dropping", "error", err)
-		return nil
-	}
-	if name := asObj.Function.Name; name != "" {
-		return &anthropicToolChoice{Type: "tool", Name: name}
-	}
-	if asObj.Name != "" {
-		return &anthropicToolChoice{Type: "tool", Name: asObj.Name}
-	}
-	xlog.Warn("cloud-proxy: anthropic translate: unrecognised tool_choice shape, dropping", "shape", toolChoiceJSON)
-	return nil
-}
-
-// openAITool mirrors pkg/functions.Tool but keeps Parameters as
-// json.RawMessage so the input_schema passes through verbatim — no
-// re-marshal cost, no fidelity loss on exotic schemas.
-type openAITool struct {
-	Type     string `json:"type"`
-	Function struct {
-		Name        string          `json:"name"`
-		Description string          `json:"description"`
-		Parameters  json.RawMessage `json:"parameters"`
-	} `json:"function"`
-}
-
-func assistantBlocks(m *pb.Message) []anthropicContentBlock {
-	toolCallsJSON := m.GetToolCalls()
-	if toolCallsJSON == "" {
-		return nil
-	}
-	var toolCalls []openAIToolCall
-	if err := json.Unmarshal([]byte(toolCallsJSON), &toolCalls); err != nil || len(toolCalls) == 0 {
-		return nil
-	}
-	blocks := make([]anthropicContentBlock, 0, len(toolCalls)+1)
-	if text := m.GetContent(); text != "" {
-		blocks = append(blocks, anthropicContentBlock{Type: "text", Text: text})
-	}
-	for _, tc := range toolCalls {
-		// OpenAI's arguments are a JSON-encoded string. Reuse it as a
-		// RawMessage only when it is valid JSON: json.Marshal rejects
-		// an invalid RawMessage, so fall back to {} for empty or bad input.
-		args := json.RawMessage(tc.Function.Arguments)
-		if !json.Valid(args) {
-			args = emptyJSONObject
-		}
-		blocks = append(blocks, anthropicContentBlock{
-			Type:  "tool_use",
-			ID:    tc.ID,
-			Name:  tc.Function.Name,
-			Input: args,
-		})
-	}
-	return blocks
-}
-
-// doAnthropicRequest is the Anthropic counterpart of doOpenAIRequest.
-// applyAuthHeader sets x-api-key and anthropic-version when provider
-// is anthropic, so this method doesn't need to duplicate that.
-func (c *CloudProxy) doAnthropicRequest(ctx context.Context, cfg *proxyConfig, body []byte) (*http.Response, error) {
-	req, err := http.NewRequestWithContext(ctx, http.MethodPost, cfg.upstreamURL, bytes.NewReader(body))
-	if err != nil {
-		return nil, fmt.Errorf("cloud-proxy: build request: %w", err)
-	}
-	req.Header.Set("Content-Type", "application/json")
-	req.Header.Set("Accept", "*/*")
-	if cfg.apiKey != "" {
-		applyAuthHeader(req, cfg.provider, cfg.apiKey)
-	}
-	resp, err := c.client.Do(req)
-	if err != nil {
-		return nil, fmt.Errorf("cloud-proxy: upstream request: %w", err)
-	}
-	return resp, nil
-}
-
-// predictAnthropicRich returns the full Reply: joined text from all
-// text blocks, tool_use blocks mapped to ToolCallDelta, and usage
-// tokens.
-func (c *CloudProxy) predictAnthropicRich(ctx context.Context, cfg *proxyConfig, opts *pb.PredictOptions) (*pb.Reply, error) {
-	body, err := buildAnthropicRequest(opts, cfg, false)
-	if err != nil {
-		return nil, fmt.Errorf("cloud-proxy: marshal request: %w", err)
-	}
-	resp, err := c.doAnthropicRequest(ctx, cfg, body)
-	if err != nil {
-		return nil, err
-	}
-	defer func() { _ = resp.Body.Close() }()
-
-	if resp.StatusCode >= 400 {
-		errBody, _ := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
-		return nil, fmt.Errorf("cloud-proxy: upstream %d: %s", resp.StatusCode, string(errBody))
-	}
-
-	var parsed anthropicResponse
-	if err := json.NewDecoder(resp.Body).Decode(&parsed); err != nil {
-		return nil, fmt.Errorf("cloud-proxy: decode response: %w", err)
-	}
-
-	reply := &pb.Reply{}
-	if parsed.Usage != nil {
-		reply.PromptTokens = int32(parsed.Usage.InputTokens)
-		reply.Tokens = int32(parsed.Usage.OutputTokens)
-	}
-
-	var content strings.Builder
-	var toolCalls []*pb.ToolCallDelta
-	toolIdx := 0
-	for _, b := range parsed.Content {
-		switch b.Type {
-		case "text":
-			content.WriteString(b.Text)
-		case "tool_use":
-			// Input is a structured JSON object; we serialise to a
-			// string so it fits the OpenAI-shaped arguments field
-			// downstream consumers expect.
-			args := ""
-			if len(b.Input) > 0 {
-				args = string(b.Input)
-			}
-			toolCalls = append(toolCalls, newToolCallDelta(toolIdx, b.ID, b.Name, args))
-			toolIdx++
-		}
-	}
-	reply.Message = []byte(content.String())
-	if len(toolCalls) > 0 {
-		reply.ChatDeltas = []*pb.ChatDelta{{ToolCalls: toolCalls}}
-	}
-	return reply, nil
-}
-
-// predictAnthropicStreamRich streams Reply chunks from Anthropic's SSE.
-// Four event types matter: content_block_start (initialises tool_use
-// id+name), content_block_delta (carries text or input_json_delta),
-// message_delta (final usage), message_stop (terminates). The block
-// index from the wire feeds straight into ToolCallDelta.Index so
-// downstream consumers can reassemble multiple parallel tool calls.
-func (c *CloudProxy) predictAnthropicStreamRich(ctx context.Context, cfg *proxyConfig, opts *pb.PredictOptions, results chan<- *pb.Reply) error {
-	body, err := buildAnthropicRequest(opts, cfg, true)
-	if err != nil {
-		return fmt.Errorf("cloud-proxy: marshal request: %w", err)
-	}
-	resp, err := c.doAnthropicRequest(ctx, cfg, body)
-	if err != nil {
-		return err
-	}
-	defer func() { _ = resp.Body.Close() }()
-
-	if resp.StatusCode >= 400 {
-		errBody, _ := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
-		return fmt.Errorf("cloud-proxy: upstream %d: %s", resp.StatusCode, string(errBody))
-	}
-
-	scanner := bufio.NewScanner(resp.Body)
-	scanner.Buffer(make([]byte, 0, 64*1024), 1<<20)
-	for scanner.Scan() {
-		line := scanner.Text()
-		if !strings.HasPrefix(line, "data:") {
-			continue
-		}
-		payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
-		if payload == "" {
-			continue
-		}
-		var ev anthropicStreamEvent
-		if err := json.Unmarshal([]byte(payload), &ev); err != nil {
-			xlog.Debug("cloud-proxy: skip malformed SSE chunk", "error", err)
-			continue
-		}
-		switch ev.Type {
-		case "content_block_start":
-			// tool_use blocks announce id + name here; arguments arrive
-			// in subsequent input_json_delta events. Emit a Reply with
-			// just the tool_call init fields so consumers can allocate
-			// a slot at this index.
-			if ev.ContentBlock != nil && ev.ContentBlock.Type == "tool_use" {
-				if !sendReply(ctx, results, &pb.Reply{
-					ChatDeltas: []*pb.ChatDelta{{ToolCalls: []*pb.ToolCallDelta{
-						newToolCallDelta(ev.Index, ev.ContentBlock.ID, ev.ContentBlock.Name, ""),
-					}}},
-				}) {
-					return ctx.Err()
-				}
-			}
-		case "content_block_delta":
-			if ev.Delta == nil {
-				continue
-			}
-			switch ev.Delta.Type {
-			case "text_delta":
-				if ev.Delta.Text == "" {
-					continue
-				}
-				if !sendReply(ctx, results, &pb.Reply{
-					Message:    []byte(ev.Delta.Text),
-					ChatDeltas: []*pb.ChatDelta{{Content: ev.Delta.Text}},
-				}) {
-					return ctx.Err()
-				}
-			case "input_json_delta":
-				if ev.Delta.PartialJSON == "" {
-					continue
-				}
-				if !sendReply(ctx, results, &pb.Reply{
-					ChatDeltas: []*pb.ChatDelta{{ToolCalls: []*pb.ToolCallDelta{
-						newToolCallDelta(ev.Index, "", "", ev.Delta.PartialJSON),
-					}}},
-				}) {
-					return ctx.Err()
-				}
-			}
-		case "message_delta":
-			// Anthropic sends final usage in message_delta.usage. Emit
-			// a usage-only Reply so the consumer can record totals.
-			if ev.Usage != nil {
-				if !sendReply(ctx, results, &pb.Reply{
-					Tokens: int32(ev.Usage.OutputTokens),
-				}) {
-					return ctx.Err()
-				}
-			}
-		case "message_stop":
-			return nil
-		}
-	}
-	return scanner.Err()
-}
--- a/backend/go/cloud-proxy/provider_anthropic_test.go
+++ b/backend/go/cloud-proxy/provider_anthropic_test.go
@@ -1,334 +0,0 @@
-package main
-
-import (
-	"encoding/json"
-	"io"
-	"math"
-	"net/http"
-	"net/http/httptest"
-	"strings"
-	"testing"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	. "github.com/onsi/gomega"
-)
-
-// fakeAnthropicUpstream mirrors fakeOpenAIUpstream but decodes the
-// request body as an anthropicRequest so tests can assert on the
-// translated wire shape (system field, max_tokens, etc.).
-func fakeAnthropicUpstream(t *testing.T, handler func(req anthropicRequest) (status int, body string, contentType string)) (*httptest.Server, *anthropicRequest) {
-	t.Helper()
-	var captured anthropicRequest
-	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-		raw, _ := io.ReadAll(r.Body)
-		_ = json.Unmarshal(raw, &captured)
-		status, body, ct := handler(captured)
-		w.Header().Set("Content-Type", ct)
-		w.WriteHeader(status)
-		_, _ = io.WriteString(w, body)
-	}))
-	return srv, &captured
-}
-
-func newAnthropicTranslateCloudProxy(t *testing.T, upstreamURL string) *CloudProxy {
-	t.Helper()
-	g := NewWithT(t)
-	t.Setenv("CLOUD_PROXY_ANTHROPIC_FAKE", "sk-ant-fake")
-	cp := NewCloudProxy()
-	err := cp.Load(&pb.ModelOptions{
-		Model: "claude-local",
-		Proxy: &pb.ProxyOptions{
-			UpstreamUrl:   upstreamURL,
-			Mode:          modeTranslate,
-			Provider:      providerAnthropic,
-			ApiKeyEnv:     "CLOUD_PROXY_ANTHROPIC_FAKE",
-			UpstreamModel: "claude-3-5-sonnet-20241022",
-		},
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	return cp
-}
-
-func TestPredict_Anthropic_BasicMessages(t *testing.T) {
-	g := NewWithT(t)
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"id":"msg_1","type":"message","role":"assistant","content":[{"type":"text","text":"hi there"}],"model":"claude-3-5-sonnet-20241022","usage":{"input_tokens":5,"output_tokens":2}}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	got, err := cp.Predict(&pb.PredictOptions{
-		Messages: []*pb.Message{
-			{Role: "system", Content: "be brief"},
-			{Role: "user", Content: "hello"},
-		},
-		Temperature: 0.5,
-		TopP:        0.9,
-		Tokens:      32,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(got).To(Equal("hi there"))
-
-	g.Expect(captured.Model).To(Equal("claude-3-5-sonnet-20241022"))
-	// System message must be hoisted out of Messages into top-level field.
-	g.Expect(captured.System).To(Equal("be brief"))
-	g.Expect(captured.Messages).To(HaveLen(1))
-	g.Expect(captured.Messages[0].Role).To(Equal("user"))
-	g.Expect(captured.MaxTokens).To(Equal(int32(32)))
-	g.Expect(captured.Temperature).NotTo(BeNil())
-	g.Expect(*captured.Temperature).To(Equal(0.5))
-	// Anthropic 400s when both temperature and top_p are set; the
-	// translator must prefer temperature and drop top_p.
-	g.Expect(captured.TopP).To(BeNil())
-	g.Expect(captured.Stream).To(BeFalse())
-}
-
-// When only top_p is set, it should be forwarded.
-func TestPredict_Anthropic_TopPOnly(t *testing.T) {
-	g := NewWithT(t)
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[{"type":"text","text":"ok"}]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	_, err := cp.Predict(&pb.PredictOptions{
-		Messages: []*pb.Message{{Role: "user", Content: "hello"}},
-		TopP:     0.9,
-		Tokens:   16,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(captured.Temperature).To(BeNil())
-	// PredictOptions.TopP is float32 on the wire; the translator widens
-	// to float64 so 0.9 round-trips as 0.8999999761581421… — compare
-	// with a small tolerance rather than exact equality.
-	g.Expect(captured.TopP).NotTo(BeNil())
-	g.Expect(math.Abs(*captured.TopP - 0.9)).To(BeNumerically("<=", 1e-6))
-}
-
-func TestPredict_Anthropic_DefaultsMaxTokens(t *testing.T) {
-	g := NewWithT(t)
-	// Anthropic 400s without max_tokens. The translator must default
-	// it when the caller doesn't supply Tokens.
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[{"type":"text","text":"ok"}]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	_, err := cp.Predict(&pb.PredictOptions{Messages: []*pb.Message{{Role: "user", Content: "x"}}})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(captured.MaxTokens).To(Equal(anthropicDefaultMaxTokens))
-}
-
-func TestPredict_Anthropic_PromptFallback(t *testing.T) {
-	g := NewWithT(t)
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[{"type":"text","text":"ok"}]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	_, err := cp.Predict(&pb.PredictOptions{Prompt: "what time is it?", Tokens: 16})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(captured.Messages).To(HaveLen(1))
-	g.Expect(captured.Messages[0].Role).To(Equal("user"))
-	g.Expect(captured.Messages[0].Content).To(Equal("what time is it?"))
-}
-
-func TestPredict_Anthropic_ConcatenatesContentBlocks(t *testing.T) {
-	g := NewWithT(t)
-	// Anthropic may return multiple text blocks; the translator joins
-	// them so the Predict() string return is the full assistant message.
-	srv, _ := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[{"type":"text","text":"hello "},{"type":"text","text":"world"}]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	got, err := cp.Predict(&pb.PredictOptions{Messages: []*pb.Message{{Role: "user", Content: "x"}}, Tokens: 16})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(got).To(Equal("hello world"))
-}
-
-func TestPredict_Anthropic_UpstreamError(t *testing.T) {
-	g := NewWithT(t)
-	srv, _ := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 401, `{"error":{"type":"authentication_error","message":"bad key"}}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	_, err := cp.Predict(&pb.PredictOptions{Messages: []*pb.Message{{Role: "user", Content: "x"}}, Tokens: 16})
-	g.Expect(err).To(HaveOccurred())
-	g.Expect(err.Error()).To(ContainSubstring("401"))
-}
-
-func TestPredictStream_Anthropic_StreamsTextDeltas(t *testing.T) {
-	g := NewWithT(t)
-	// Real Anthropic SSE has event: lines + data: lines. The translator
-	// only needs the data: payload; only content_block_delta with
-	// delta.type=text_delta carries content. message_stop ends.
-	frames := []string{
-		"event: message_start\ndata: {\"type\":\"message_start\"}\n\n",
-		"event: content_block_start\ndata: {\"type\":\"content_block_start\",\"index\":0}\n\n",
-		"event: content_block_delta\ndata: {\"type\":\"content_block_delta\",\"index\":0,\"delta\":{\"type\":\"text_delta\",\"text\":\"hello\"}}\n\n",
-		"event: content_block_delta\ndata: {\"type\":\"content_block_delta\",\"index\":0,\"delta\":{\"type\":\"text_delta\",\"text\":\" \"}}\n\n",
-		"event: content_block_delta\ndata: {\"type\":\"content_block_delta\",\"index\":0,\"delta\":{\"type\":\"text_delta\",\"text\":\"world\"}}\n\n",
-		"event: content_block_stop\ndata: {\"type\":\"content_block_stop\",\"index\":0}\n\n",
-		"event: message_stop\ndata: {\"type\":\"message_stop\"}\n\n",
-	}
-	body := strings.Join(frames, "")
-
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, body, "text/event-stream"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	results := make(chan string, 8)
-	done := make(chan error, 1)
-	go func() {
-		done <- cp.PredictStream(&pb.PredictOptions{
-			Messages: []*pb.Message{{Role: "user", Content: "hi"}},
-			Tokens:   16,
-		}, results)
-	}()
-
-	var got []string
-	for s := range results {
-		got = append(got, s)
-	}
-	err := <-done
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(strings.Join(got, "")).To(Equal("hello world"))
-	g.Expect(captured.Stream).To(BeTrue())
-}
-
-func TestBuildAnthropic_TranslatesOpenAITools(t *testing.T) {
-	g := NewWithT(t)
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[{"type":"text","text":"ok"}]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	tools := `[{"type":"function","function":{"name":"get_weather","description":"Get weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}]`
-	_, err := cp.Predict(&pb.PredictOptions{
-		Messages:   []*pb.Message{{Role: "user", Content: "weather in Paris?"}},
-		Tools:      tools,
-		ToolChoice: `"auto"`,
-		Tokens:     32,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(captured.Tools).To(HaveLen(1))
-	g.Expect(captured.Tools[0].Name).To(Equal("get_weather"))
-	g.Expect(captured.Tools[0].Description).To(Equal("Get weather"))
-	// input_schema must be the parameters object verbatim.
-	g.Expect(string(captured.Tools[0].InputSchema)).To(ContainSubstring(`"city"`))
-	g.Expect(captured.ToolChoice).NotTo(BeNil())
-	g.Expect(captured.ToolChoice.Type).To(Equal("auto"))
-}
-
-func TestBuildAnthropic_ToolChoice_RequiredMapsToAny(t *testing.T) {
-	g := NewWithT(t)
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-	_, err := cp.Predict(&pb.PredictOptions{
-		Messages:   []*pb.Message{{Role: "user", Content: "x"}},
-		Tools:      `[{"type":"function","function":{"name":"t","parameters":{"type":"object"}}}]`,
-		ToolChoice: `"required"`,
-		Tokens:     16,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(captured.ToolChoice).NotTo(BeNil())
-	g.Expect(captured.ToolChoice.Type).To(Equal("any"))
-}
-
-func TestBuildAnthropic_ToolChoice_NoneDropsTools(t *testing.T) {
-	g := NewWithT(t)
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-	_, err := cp.Predict(&pb.PredictOptions{
-		Messages:   []*pb.Message{{Role: "user", Content: "x"}},
-		Tools:      `[{"type":"function","function":{"name":"t","parameters":{"type":"object"}}}]`,
-		ToolChoice: `"none"`,
-		Tokens:     16,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(captured.Tools).To(BeNil())
-	g.Expect(captured.ToolChoice).To(BeNil())
-}
-
-func TestBuildAnthropic_ToolChoice_NamedFunction(t *testing.T) {
-	g := NewWithT(t)
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-	_, err := cp.Predict(&pb.PredictOptions{
-		Messages:   []*pb.Message{{Role: "user", Content: "x"}},
-		Tools:      `[{"type":"function","function":{"name":"weather","parameters":{"type":"object"}}}]`,
-		ToolChoice: `{"type":"function","function":{"name":"weather"}}`,
-		Tokens:     16,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(captured.ToolChoice).NotTo(BeNil())
-	g.Expect(captured.ToolChoice.Type).To(Equal("tool"))
-	g.Expect(captured.ToolChoice.Name).To(Equal("weather"))
-}
-
-func TestBuildAnthropic_RoundTripsAssistantToolCalls(t *testing.T) {
-	g := NewWithT(t)
-	// LocalAI Assistant's second turn: the LLM previously emitted a
-	// tool_use, the server executed it, and the conversation now
-	// includes the assistant turn (with tool_calls) plus a tool-role
-	// result message. Both must convert to Anthropic block form.
-	srv, captured := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"content":[{"type":"text","text":"ok"}]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	tools := `[{"type":"function","function":{"name":"list_models","parameters":{"type":"object"}}}]`
-	toolCallsJSON := `[{"id":"call_abc","type":"function","function":{"name":"list_models","arguments":"{}"}}]`
-	_, err := cp.Predict(&pb.PredictOptions{
-		Tools: tools,
-		Messages: []*pb.Message{
-			{Role: "user", Content: "what models are installed?"},
-			{Role: "assistant", Content: "", ToolCalls: toolCallsJSON},
-			{Role: "tool", Content: `{"models":["a","b"]}`, ToolCallId: "call_abc"},
-		},
-		Tokens: 64,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-
-	g.Expect(captured.Messages).To(HaveLen(3))
-	// 1. user text — bare string
-	s, ok := captured.Messages[0].Content.(string)
-	g.Expect(ok).To(BeTrue())
-	g.Expect(s).To(Equal("what models are installed?"))
-	// 2. assistant: must be a content-block list with one tool_use.
-	// Decoding JSON into an `any` field yields []any, not []anthropicContentBlock.
-	blocks, ok := captured.Messages[1].Content.([]any)
-	g.Expect(ok).To(BeTrue())
-	g.Expect(blocks).To(HaveLen(1))
-	b0, _ := blocks[0].(map[string]any)
-	g.Expect(b0["type"]).To(Equal("tool_use"))
-	g.Expect(b0["id"]).To(Equal("call_abc"))
-	g.Expect(b0["name"]).To(Equal("list_models"))
-	// 3. tool → user with tool_result block
-	g.Expect(captured.Messages[2].Role).To(Equal("user"))
-	resBlocks, _ := captured.Messages[2].Content.([]any)
-	r0, _ := resBlocks[0].(map[string]any)
-	g.Expect(r0["type"]).To(Equal("tool_result"))
-	g.Expect(r0["tool_use_id"]).To(Equal("call_abc"))
-	g.Expect(r0["content"]).To(Equal(`{"models":["a","b"]}`))
-}
--- a/backend/go/cloud-proxy/provider_edge_test.go
+++ b/backend/go/cloud-proxy/provider_edge_test.go
@@ -1,119 +0,0 @@
-package main
-
-import (
-	"encoding/json"
-	"strings"
-	"testing"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	. "github.com/onsi/gomega"
-)
-
-// Verify buildOpenAIRequest preserves caller-supplied tools and
-// tool_choice as opaque JSON. PredictOptions carries them as strings;
-// they must land in the outbound request body unchanged so the
-// upstream sees the caller's intent verbatim. A regression here would
-// silently disable function calling for translate-mode clients.
-func TestBuildOpenAIRequest_ToolsAndToolChoicePassthrough(t *testing.T) {
-	g := NewWithT(t)
-	cfg := &proxyConfig{upstreamModel: "gpt-4o"}
-	toolsJSON := `[{"type":"function","function":{"name":"search","parameters":{"type":"object"}}}]`
-	choiceJSON := `{"type":"function","function":{"name":"search"}}`
-
-	body, err := buildOpenAIRequest(&pb.PredictOptions{
-		Messages:   []*pb.Message{{Role: "user", Content: "find x"}},
-		Tools:      toolsJSON,
-		ToolChoice: choiceJSON,
-	}, cfg, false)
-	g.Expect(err).NotTo(HaveOccurred())
-
-	var decoded openAIRequest
-	err = json.Unmarshal(body, &decoded)
-	g.Expect(err).NotTo(HaveOccurred())
-	// Compare the JSON-canonical form so whitespace differences are ignored.
-	gotTools, _ := json.Marshal(json.RawMessage(decoded.Tools))
-	wantTools, _ := json.Marshal(json.RawMessage(toolsJSON))
-	g.Expect(string(gotTools)).To(Equal(string(wantTools)))
-	gotChoice, _ := json.Marshal(json.RawMessage(decoded.ToolChoice))
-	wantChoice, _ := json.Marshal(json.RawMessage(choiceJSON))
-	g.Expect(string(gotChoice)).To(Equal(string(wantChoice)))
-}
-
-// Garbage JSON in tools / tool_choice is silently dropped (omitted)
-// rather than blowing up the request. Documents the parseRawJSON
-// behaviour — operators shouldn't see hard failures from an upstream
-// caller's mis-formatted tools field.
-func TestBuildOpenAIRequest_InvalidToolsJSONDropped(t *testing.T) {
-	g := NewWithT(t)
-	cfg := &proxyConfig{upstreamModel: "gpt-4o"}
-	body, err := buildOpenAIRequest(&pb.PredictOptions{
-		Messages:   []*pb.Message{{Role: "user", Content: "x"}},
-		Tools:      "this is not json",
-		ToolChoice: "{also bad",
-	}, cfg, false)
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(string(body)).NotTo(ContainSubstring("this is not json"))
-	g.Expect(string(body)).NotTo(ContainSubstring("{also bad"))
-}
-
-// An empty Anthropic content array yields an empty Reply (not an
-// error): no message text and no chat deltas, while the usage counts
-// from the response are still reported.
-func TestPredictRich_Anthropic_EmptyContent(t *testing.T) {
-	g := NewWithT(t)
-	srv, _ := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{"id":"m1","type":"message","role":"assistant","content":[],"usage":{"input_tokens":3,"output_tokens":0}}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	reply, err := cp.PredictRich(&pb.PredictOptions{
-		Messages: []*pb.Message{{Role: "user", Content: "x"}},
-		Tokens:   16,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(string(reply.GetMessage())).To(Equal(""))
-	g.Expect(reply.GetChatDeltas()).To(HaveLen(0))
-	g.Expect(reply.GetPromptTokens()).To(Equal(int32(3)))
-}
-
-// A truncated / malformed SSE payload mid-stream should be tolerated:
-// the malformed chunk gets skipped (xlog.Debug logged), valid chunks
-// before AND after it still reach the channel.
-func TestPredictStreamRich_OpenAI_TolerantOfBadChunks(t *testing.T) {
-	g := NewWithT(t)
-	body := strings.Join([]string{
-		`data: {"choices":[{"index":0,"delta":{"content":"hello"}}]}`,
-		``,
-		`data: this-is-not-json{{`,
-		``,
-		`data: {"choices":[{"index":0,"delta":{"content":" world"}}]}`,
-		``,
-		`data: [DONE]`,
-		``,
-	}, "\n")
-
-	srv, _ := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
-		return 200, body, "text/event-stream"
-	})
-	defer srv.Close()
-	cp := newTranslateCloudProxy(t, srv.URL)
-
-	results := make(chan *pb.Reply, 8)
-	done := make(chan error, 1)
-	go func() {
-		done <- cp.PredictStreamRich(&pb.PredictOptions{
-			Messages: []*pb.Message{{Role: "user", Content: "hi"}},
-		}, results)
-		close(results)
-	}()
-
-	var assembled strings.Builder
-	for reply := range results {
-		assembled.Write(reply.GetMessage())
-	}
-	err := <-done
-	g.Expect(err).NotTo(HaveOccurred())
-	// The good chunks before and after the malformed one both made it through.
-	g.Expect(assembled.String()).To(Equal("hello world"))
-}
--- a/backend/go/cloud-proxy/provider_openai.go
+++ b/backend/go/cloud-proxy/provider_openai.go
@@ -1,320 +0,0 @@
-package main
-
-import (
-	"bufio"
-	"bytes"
-	"context"
-	"encoding/json"
-	"errors"
-	"fmt"
-	"io"
-	"net/http"
-	"strings"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	"github.com/mudler/xlog"
-)
-
-// OpenAI Chat Completions wire-format types. Narrowed to the fields
-// translate mode needs to preserve through the Reply proto: content,
-// role, tool_calls (typed so we can map them to pb.ToolCallDelta),
-// and sampling params copied verbatim from PredictOptions.
-//
-// Provider-specific extensions (logit_bias, function calling beyond
-// tool_calls, etc.) are not modelled — passthrough mode covers callers
-// that need full upstream fidelity.
-
-type openAIRequest struct {
-	Model            string          `json:"model"`
-	Messages         []openAIMessage `json:"messages"`
-	Stream           bool            `json:"stream,omitempty"`
-	Temperature      *float64        `json:"temperature,omitempty"`
-	TopP             *float64        `json:"top_p,omitempty"`
-	MaxTokens        *int32          `json:"max_tokens,omitempty"`
-	Stop             []string        `json:"stop,omitempty"`
-	FrequencyPenalty *float64        `json:"frequency_penalty,omitempty"`
-	PresencePenalty  *float64        `json:"presence_penalty,omitempty"`
-	Tools            json.RawMessage `json:"tools,omitempty"`
-	ToolChoice       json.RawMessage `json:"tool_choice,omitempty"`
-}
-
-type openAIMessage struct {
-	Role       string           `json:"role"`
-	Content    string           `json:"content,omitempty"`
-	Name       string           `json:"name,omitempty"`
-	ToolCallID string           `json:"tool_call_id,omitempty"`
-	ToolCalls  []openAIToolCall `json:"tool_calls,omitempty"`
-}
-
-// openAIToolCall covers both the non-streaming response shape (full
-// id+function+arguments) and the streaming-delta shape (sparse fields,
-// index assignment). The proto's ToolCallDelta absorbs both — name is
-// set on first appearance, arguments arrive incrementally in streaming.
-type openAIToolCall struct {
-	Index    int                `json:"index,omitempty"`
-	ID       string             `json:"id,omitempty"`
-	Type     string             `json:"type,omitempty"`
-	Function openAIFunctionCall `json:"function,omitempty"`
-}
-
-type openAIFunctionCall struct {
-	Name      string `json:"name,omitempty"`
-	Arguments string `json:"arguments,omitempty"`
-}
-
-type openAIChoice struct {
-	Index        int           `json:"index"`
-	Message      openAIMessage `json:"message"`
-	FinishReason string        `json:"finish_reason"`
-}
-
-type openAIResponse struct {
-	ID      string         `json:"id"`
-	Choices []openAIChoice `json:"choices"`
-	Usage   *openAIUsage   `json:"usage,omitempty"`
-}
-
-type openAIStreamChoice struct {
-	Index int `json:"index"`
-	Delta struct {
-		Content   string           `json:"content,omitempty"`
-		Role      string           `json:"role,omitempty"`
-		ToolCalls []openAIToolCall `json:"tool_calls,omitempty"`
-	} `json:"delta"`
-	FinishReason string `json:"finish_reason,omitempty"`
-}
-
-type openAIStreamChunk struct {
-	Choices []openAIStreamChoice `json:"choices"`
-	Usage   *openAIUsage         `json:"usage,omitempty"`
-}
-
-type openAIUsage struct {
-	PromptTokens     int `json:"prompt_tokens"`
-	CompletionTokens int `json:"completion_tokens"`
-	TotalTokens      int `json:"total_tokens"`
-}
-
-// buildOpenAIRequest converts pb.PredictOptions into the OpenAI Chat
-// Completions request body. Prefers Messages when non-empty; falls
-// back to wrapping Prompt as a single user message so plain
-// /completions-style calls still work in translate mode.
-func buildOpenAIRequest(opts *pb.PredictOptions, cfg *proxyConfig, stream bool) ([]byte, error) {
-	req := openAIRequest{
-		Model:      modelName(cfg, opts),
-		Stream:     stream,
-		Stop:       opts.GetStopPrompts(),
-		Tools:      parseRawJSON(opts.GetTools()),
-		ToolChoice: parseRawJSON(opts.GetToolChoice()),
-	}
-	if t := opts.GetTemperature(); t != 0 {
-		v := float64(t)
-		req.Temperature = &v
-	}
-	if t := opts.GetTopP(); t != 0 {
-		v := float64(t)
-		req.TopP = &v
-	}
-	if n := opts.GetTokens(); n > 0 {
-		req.MaxTokens = &n
-	}
-	if p := opts.GetFrequencyPenalty(); p != 0 {
-		v := float64(p)
-		req.FrequencyPenalty = &v
-	}
-	if p := opts.GetPresencePenalty(); p != 0 {
-		v := float64(p)
-		req.PresencePenalty = &v
-	}
-
-	for _, m := range opts.GetMessages() {
-		msg := openAIMessage{
-			Role:       m.GetRole(),
-			Content:    m.GetContent(),
-			Name:       m.GetName(),
-			ToolCallID: m.GetToolCallId(),
-		}
-		// Pre-existing tool_calls arrive as a JSON string from the
-		// upstream caller's previous assistant turn; pass-through as-is.
-		if tc := m.GetToolCalls(); tc != "" {
-			_ = json.Unmarshal([]byte(tc), &msg.ToolCalls)
-		}
-		req.Messages = append(req.Messages, msg)
-	}
-	// Fallback for plain Prompt requests (no Messages array). LocalAI
-	// templating may have produced a flat prompt; rewrap as a single
-	// user message so the upstream chat endpoint accepts it.
-	if len(req.Messages) == 0 && opts.GetPrompt() != "" {
-		req.Messages = []openAIMessage{{Role: "user", Content: opts.GetPrompt()}}
-	}
-
-	return json.Marshal(req)
-}
-
-// modelName picks the upstream model: upstream_model from the proxy
-// config wins (operator override), else the local model name captured
-// at LoadModel time. Operator sets upstream_model to map LocalAI's
-// alias (e.g. "claude-strict") to the upstream's canonical name
-// (e.g. "claude-3-5-sonnet-20241022").
-func modelName(cfg *proxyConfig, _ *pb.PredictOptions) string {
-	if cfg.upstreamModel != "" {
-		return cfg.upstreamModel
-	}
-	return cfg.localModel
-}
-
-// parseRawJSON parses a JSON string into a RawMessage so it round-trips
-// into the upstream body. Returns nil for empty/invalid input so the
-// field is omitted (omitempty).
-func parseRawJSON(s string) json.RawMessage {
-	if s == "" {
-		return nil
-	}
-	var probe json.RawMessage
-	if err := json.Unmarshal([]byte(s), &probe); err != nil {
-		return nil
-	}
-	return probe
-}
-
-// doOpenAIRequest builds + sends the upstream request. Returns the
-// raw response on success; caller handles status / body.
-func (c *CloudProxy) doOpenAIRequest(ctx context.Context, cfg *proxyConfig, body []byte) (*http.Response, error) {
-	req, err := http.NewRequestWithContext(ctx, http.MethodPost, cfg.upstreamURL, bytes.NewReader(body))
-	if err != nil {
-		return nil, fmt.Errorf("cloud-proxy: build request: %w", err)
-	}
-	req.Header.Set("Content-Type", "application/json")
-	req.Header.Set("Accept", "*/*")
-	if cfg.apiKey != "" {
-		applyAuthHeader(req, cfg.provider, cfg.apiKey)
-	}
-	resp, err := c.client.Do(req)
-	if err != nil {
-		return nil, fmt.Errorf("cloud-proxy: upstream request: %w", err)
-	}
-	return resp, nil
-}
-
-// predictOpenAIRich is the non-streaming translate path. Returns a
-// fully-populated *pb.Reply with assistant content, tool calls, and
-// token usage. The gRPC server forwards the Reply verbatim.
-func (c *CloudProxy) predictOpenAIRich(ctx context.Context, cfg *proxyConfig, opts *pb.PredictOptions) (*pb.Reply, error) {
-	body, err := buildOpenAIRequest(opts, cfg, false)
-	if err != nil {
-		return nil, fmt.Errorf("cloud-proxy: marshal request: %w", err)
-	}
-	resp, err := c.doOpenAIRequest(ctx, cfg, body)
-	if err != nil {
-		return nil, err
-	}
-	defer func() { _ = resp.Body.Close() }()
-
-	if resp.StatusCode >= 400 {
-		errBody, _ := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
-		return nil, fmt.Errorf("cloud-proxy: upstream %d: %s", resp.StatusCode, string(errBody))
-	}
-
-	var parsed openAIResponse
-	if err := json.NewDecoder(resp.Body).Decode(&parsed); err != nil {
-		return nil, fmt.Errorf("cloud-proxy: decode response: %w", err)
-	}
-	if len(parsed.Choices) == 0 {
-		return nil, errors.New("cloud-proxy: upstream returned no choices")
-	}
-
-	choice := parsed.Choices[0]
-	reply := &pb.Reply{
-		Message: []byte(choice.Message.Content),
-	}
-	if parsed.Usage != nil {
-		reply.PromptTokens = int32(parsed.Usage.PromptTokens)
-		reply.Tokens = int32(parsed.Usage.CompletionTokens)
-	}
-	if len(choice.Message.ToolCalls) > 0 {
-		// Non-streaming: a single ChatDelta carries the full tool-call
-		// set. Index/Name/Arguments are populated together; downstream
-		// consumers don't need to assemble streaming deltas.
-		delta := &pb.ChatDelta{}
-		for _, tc := range choice.Message.ToolCalls {
-			delta.ToolCalls = append(delta.ToolCalls,
-				newToolCallDelta(tc.Index, tc.ID, tc.Function.Name, tc.Function.Arguments))
-		}
-		reply.ChatDeltas = []*pb.ChatDelta{delta}
-	}
-	return reply, nil
-}
-
-// predictOpenAIStreamRich streams *pb.Reply chunks. Each chunk carries
-// either a content delta (Message + ChatDeltas[].Content) or tool-call
-// deltas (ChatDeltas[].ToolCalls). The final Reply carries usage tokens
-// when the upstream sends them (stream_options.include_usage).
-func (c *CloudProxy) predictOpenAIStreamRich(ctx context.Context, cfg *proxyConfig, opts *pb.PredictOptions, results chan<- *pb.Reply) error {
-	body, err := buildOpenAIRequest(opts, cfg, true)
-	if err != nil {
-		return fmt.Errorf("cloud-proxy: marshal request: %w", err)
-	}
-	resp, err := c.doOpenAIRequest(ctx, cfg, body)
-	if err != nil {
-		return err
-	}
-	defer func() { _ = resp.Body.Close() }()
-
-	if resp.StatusCode >= 400 {
-		errBody, _ := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
-		return fmt.Errorf("cloud-proxy: upstream %d: %s", resp.StatusCode, string(errBody))
-	}
-
-	scanner := bufio.NewScanner(resp.Body)
-	scanner.Buffer(make([]byte, 0, 64*1024), 1<<20)
-	for scanner.Scan() {
-		line := scanner.Text()
-		if !strings.HasPrefix(line, "data:") {
-			continue
-		}
-		payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
-		if payload == "" || payload == "[DONE]" {
-			return nil
-		}
-		var chunk openAIStreamChunk
-		if err := json.Unmarshal([]byte(payload), &chunk); err != nil {
-			xlog.Debug("cloud-proxy: skip malformed SSE chunk", "error", err)
-			continue
-		}
-		// Usage frames may arrive separately from content frames when
-		// stream_options.include_usage is set; emit a usage-only Reply
-		// in that case so the consumer sees the totals.
-		if chunk.Usage != nil && len(chunk.Choices) == 0 {
-			if !sendReply(ctx, results, &pb.Reply{
-				PromptTokens: int32(chunk.Usage.PromptTokens),
-				Tokens:       int32(chunk.Usage.CompletionTokens),
-			}) {
-				return ctx.Err()
-			}
-			continue
-		}
-		for _, ch := range chunk.Choices {
-			reply := &pb.Reply{}
-			if ch.Delta.Content != "" {
-				reply.Message = []byte(ch.Delta.Content)
-				reply.ChatDeltas = []*pb.ChatDelta{{Content: ch.Delta.Content}}
-			}
-			if len(ch.Delta.ToolCalls) > 0 {
-				if len(reply.ChatDeltas) == 0 {
-					reply.ChatDeltas = []*pb.ChatDelta{{}}
-				}
-				for _, tc := range ch.Delta.ToolCalls {
-					reply.ChatDeltas[0].ToolCalls = append(reply.ChatDeltas[0].ToolCalls,
-						newToolCallDelta(tc.Index, tc.ID, tc.Function.Name, tc.Function.Arguments))
-				}
-			}
-			if reply.Message == nil && len(reply.ChatDeltas) == 0 {
-				continue
-			}
-			if !sendReply(ctx, results, reply) {
-				return ctx.Err()
-			}
-		}
-	}
-	return scanner.Err()
-}
--- a/backend/go/cloud-proxy/provider_openai_test.go
+++ b/backend/go/cloud-proxy/provider_openai_test.go
@@ -1,170 +0,0 @@
-package main
-
-import (
-	"encoding/json"
-	"io"
-	"net/http"
-	"net/http/httptest"
-	"strings"
-	"testing"
-
-	. "github.com/onsi/gomega"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-)
-
-// fakeOpenAIUpstream returns an httptest.Server that decodes the
-// inbound request as an openAIRequest, calls handler with it, and
-// writes the handler's reply as the response.
-func fakeOpenAIUpstream(t *testing.T, handler func(req openAIRequest) (status int, body string, contentType string)) (*httptest.Server, *openAIRequest) {
-	t.Helper()
-	var captured openAIRequest
-	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-		raw, _ := io.ReadAll(r.Body)
-		_ = json.Unmarshal(raw, &captured)
-		status, body, ct := handler(captured)
-		w.Header().Set("Content-Type", ct)
-		w.WriteHeader(status)
-		_, _ = io.WriteString(w, body)
-	}))
-	return srv, &captured
-}
-
-func newTranslateCloudProxy(t *testing.T, upstreamURL string) *CloudProxy {
-	t.Helper()
-	g := NewWithT(t)
-	t.Setenv("CLOUD_PROXY_OPENAI_FAKE", "sk-fake-openai")
-	cp := NewCloudProxy()
-	err := cp.Load(&pb.ModelOptions{
-		Model: "gpt-4o-local",
-		Proxy: &pb.ProxyOptions{
-			UpstreamUrl:   upstreamURL,
-			Mode:          modeTranslate,
-			Provider:      providerOpenAI,
-			ApiKeyEnv:     "CLOUD_PROXY_OPENAI_FAKE",
-			UpstreamModel: "gpt-4o",
-		},
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	return cp
-}
-
-func TestPredict_OpenAI_BasicChat(t *testing.T) {
-	g := NewWithT(t)
-	srv, captured := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
-		return 200, `{"id":"resp-1","choices":[{"index":0,"message":{"role":"assistant","content":"hi there"},"finish_reason":"stop"}],"usage":{"prompt_tokens":5,"completion_tokens":2,"total_tokens":7}}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newTranslateCloudProxy(t, srv.URL)
-
-	got, err := cp.Predict(&pb.PredictOptions{
-		Messages: []*pb.Message{
-			{Role: "system", Content: "be brief"},
-			{Role: "user", Content: "hello"},
-		},
-		Temperature: 0.5,
-		TopP:        0.9,
-		Tokens:      32,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(got).To(Equal("hi there"))
-
-	// Verify the upstream saw a properly-translated request.
-	g.Expect(captured.Model).To(Equal("gpt-4o"))
-	g.Expect(captured.Messages).To(HaveLen(2))
-	g.Expect(captured.Messages[0].Role).To(Equal("system"))
-	g.Expect(captured.Messages[1].Role).To(Equal("user"))
-	g.Expect(captured.Temperature).NotTo(BeNil())
-	g.Expect(*captured.Temperature).To(Equal(0.5))
-	g.Expect(captured.MaxTokens).NotTo(BeNil())
-	g.Expect(*captured.MaxTokens).To(Equal(int32(32)))
-	g.Expect(captured.Stream).To(BeFalse())
-}
-
-func TestPredict_OpenAI_PromptFallback(t *testing.T) {
-	g := NewWithT(t)
-	// No Messages array — backend should synth a single user message
-	// from Prompt so non-chat clients still route through translate.
-	srv, captured := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
-		return 200, `{"choices":[{"message":{"role":"assistant","content":"ok"}}]}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newTranslateCloudProxy(t, srv.URL)
-
-	_, err := cp.Predict(&pb.PredictOptions{Prompt: "what time is it?"})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(captured.Messages).To(HaveLen(1))
-	g.Expect(captured.Messages[0].Role).To(Equal("user"))
-	g.Expect(captured.Messages[0].Content).To(Equal("what time is it?"))
-}
-
-func TestPredict_OpenAI_UpstreamError(t *testing.T) {
-	g := NewWithT(t)
-	srv, _ := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
-		return 401, `{"error":{"message":"bad key"}}`, "application/json"
-	})
-	defer srv.Close()
-	cp := newTranslateCloudProxy(t, srv.URL)
-
-	_, err := cp.Predict(&pb.PredictOptions{Messages: []*pb.Message{{Role: "user", Content: "x"}}})
-	g.Expect(err).To(HaveOccurred())
-	g.Expect(err.Error()).To(ContainSubstring("401"))
-}
-
-func TestPredictStream_OpenAI_StreamsContent(t *testing.T) {
-	g := NewWithT(t)
-	// Stream three content deltas then [DONE]. Verify the channel
-	// receives them in order with no missing pieces.
-	chunks := []string{
-		`{"choices":[{"index":0,"delta":{"role":"assistant"}}]}`,
-		`{"choices":[{"index":0,"delta":{"content":"hello"}}]}`,
-		`{"choices":[{"index":0,"delta":{"content":" "}}]}`,
-		`{"choices":[{"index":0,"delta":{"content":"world"}}]}`,
-		`{"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}`,
-	}
-	body := ""
-	for _, c := range chunks {
-		body += "data: " + c + "\n\n"
-	}
-	body += "data: [DONE]\n\n"
-
-	srv, captured := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
-		return 200, body, "text/event-stream"
-	})
-	defer srv.Close()
-	cp := newTranslateCloudProxy(t, srv.URL)
-
-	results := make(chan string, 8)
-	done := make(chan error, 1)
-	go func() {
-		done <- cp.PredictStream(&pb.PredictOptions{
-			Messages: []*pb.Message{{Role: "user", Content: "hi"}},
-		}, results)
-	}()
-
-	var got []string
-	for s := range results {
-		got = append(got, s)
-	}
-	err := <-done
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(strings.Join(got, "")).To(Equal("hello world"))
-	g.Expect(captured.Stream).To(BeTrue())
-}
-
-func TestPredict_RejectedInPassthroughMode(t *testing.T) {
-	g := NewWithT(t)
-	t.Setenv("CLOUD_PROXY_FAKE", "k")
-	cp := NewCloudProxy()
-	err := cp.Load(&pb.ModelOptions{
-		Proxy: &pb.ProxyOptions{
-			UpstreamUrl: "https://example.com",
-			Mode:        modePassthrough,
-			ApiKeyEnv:   "CLOUD_PROXY_FAKE",
-		},
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	_, err = cp.Predict(&pb.PredictOptions{})
-	g.Expect(err).To(HaveOccurred())
-	g.Expect(err.Error()).To(ContainSubstring("only valid in translate"))
-}
--- a/backend/go/cloud-proxy/proxy.go
+++ b/backend/go/cloud-proxy/proxy.go
@@ -1,436 +0,0 @@
-package main
-
-import (
-	"context"
-	"errors"
-	"fmt"
-	"io"
-	"net/http"
-	"net/url"
-	"os"
-	"strings"
-	"sync/atomic"
-
-	"github.com/mudler/xlog"
-
-	"github.com/mudler/LocalAI/pkg/grpc/base"
-	"github.com/mudler/LocalAI/pkg/grpc/grpcerrors"
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	"github.com/mudler/LocalAI/pkg/httpclient"
-)
-
-// Mirror of core/config.Proxy{Mode,Provider}* — backends don't
-// import core to keep the boundary clean.
-const (
-	modePassthrough = "passthrough"
-	modeTranslate   = "translate"
-
-	providerOpenAI    = "openai"
-	providerAnthropic = "anthropic"
-)
-
-// CloudProxy is the LocalAI backend that proxies model traffic to a
-// configured upstream HTTP provider. Concurrency: base.SingleThread is
-// NOT embedded — forward calls are independent and HTTP transport is
-// goroutine-safe, so multiple Forward streams can run in parallel.
-// Locking would serialise requests to a chat provider for no benefit.
-type CloudProxy struct {
-	base.Base
-
-	cfg    atomic.Pointer[proxyConfig]
-	client *http.Client
-}
-
-type proxyConfig struct {
-	upstreamURL   string
-	mode          string
-	provider      string
-	upstreamModel string
-	localModel    string // ModelOptions.Model — fallback when upstream_model is unset
-	apiKey        string // resolved at Load time
-}
-
-func NewCloudProxy() *CloudProxy {
-	// httpclient.New refuses redirects outright: the proxy talks to a
-	// single configured upstream API (OpenAI/Anthropic/...) that answers
-	// directly, so a 3xx means misconfiguration, a hijacked upstream, or
-	// DNS trickery — never normal operation. Following it would replay the
-	// request, including the operator's x-api-key (which Go does NOT strip
-	// on cross-host redirects), to an unvetted host and leak the key
-	// (GHSA-3mj3-57v2-4636). It also imposes no body deadline, so streaming
-	// SSE responses that legitimately last minutes are not truncated.
-	return &CloudProxy{client: httpclient.New()}
-}
-
-func (c *CloudProxy) Load(opts *pb.ModelOptions) error {
-	po := opts.GetProxy()
-	if po == nil {
-		return errors.New("cloud-proxy: Load requires ProxyOptions to be set")
-	}
-	if po.GetUpstreamUrl() == "" {
-		return errors.New("cloud-proxy: upstream_url is required")
-	}
-	if _, err := url.ParseRequestURI(po.GetUpstreamUrl()); err != nil {
-		return fmt.Errorf("cloud-proxy: upstream_url %q invalid: %w", po.GetUpstreamUrl(), err)
-	}
-
-	mode := po.GetMode()
-	if mode == "" {
-		mode = modePassthrough
-	}
-	switch mode {
-	case modePassthrough:
-	case modeTranslate:
-		switch po.GetProvider() {
-		case providerOpenAI:
-			// implemented in provider_openai.go
-		case providerAnthropic:
-			// implemented in provider_anthropic.go
-		default:
-			return fmt.Errorf("cloud-proxy: translate mode requires provider in {%s, %s}, got %q",
-				providerOpenAI, providerAnthropic, po.GetProvider())
-		}
-	default:
-		return fmt.Errorf("cloud-proxy: unknown mode %q", mode)
-	}
-
-	key, err := resolveAPIKey(po.GetApiKeyEnv(), po.GetApiKeyFile())
-	if err != nil {
-		return err
-	}
-
-	c.cfg.Store(&proxyConfig{
-		upstreamURL:   po.GetUpstreamUrl(),
-		mode:          mode,
-		provider:      po.GetProvider(),
-		upstreamModel: po.GetUpstreamModel(),
-		localModel:    opts.GetModel(),
-		apiKey:        key,
-	})
-	xlog.Info("cloud-proxy: ready",
-		"upstream", po.GetUpstreamUrl(),
-		"mode", mode,
-		"provider", po.GetProvider(),
-		"has_key", key != "")
-	return nil
-}
-
-// resolveAPIKey mirrors config.ProxyConfig.ResolveAPIKey. Duplicated
-// (a few lines) rather than importing core/config from a backend
-// binary — keeps backends independent of core's package layout.
-// Mutual-exclusion is enforced upstream in core/config.Validate.
-func resolveAPIKey(envName, filePath string) (string, error) {
-	if envName != "" {
-		v := os.Getenv(envName)
-		if v == "" {
-			return "", fmt.Errorf("cloud-proxy: api_key_env %q is unset", envName)
-		}
-		return v, nil
-	}
-	if filePath != "" {
-		b, err := os.ReadFile(filePath)
-		if err != nil {
-			return "", fmt.Errorf("cloud-proxy: read api_key_file %q: %w", filePath, err)
-		}
-		return strings.TrimSpace(string(b)), nil
-	}
-	return "", nil
-}
-
-// PredictRich is the non-streaming translate path. Returns a fully-
-// populated *pb.Reply: content, tool-call deltas (ChatDeltas), and
-// usage tokens. Implements the optional grpc.AIModelRich interface;
-// the gRPC server prefers this path over Predict when present so
-// tool calls survive the round-trip. Passthrough mode rejects
-// PredictRich — callers must use Forward.
-func (c *CloudProxy) PredictRich(opts *pb.PredictOptions) (reply *pb.Reply, err error) {
-	cfg := c.cfg.Load()
-	if cfg == nil {
-		return nil, grpcerrors.ModelNotLoaded("cloud-proxy")
-	}
-	if cfg.mode != modeTranslate {
-		return nil, fmt.Errorf("cloud-proxy: Predict only valid in translate mode (have %s)", cfg.mode)
-	}
-	xlog.Info("cloud-proxy: predict", "provider", cfg.provider, "upstream", cfg.upstreamURL, "upstream_model", cfg.upstreamModel)
-	defer func() {
-		if err != nil {
-			xlog.Warn("cloud-proxy: predict failed", "provider", cfg.provider, "error", err)
-		}
-	}()
-	ctx := context.Background()
-	switch cfg.provider {
-	case providerOpenAI:
-		return c.predictOpenAIRich(ctx, cfg, opts)
-	case providerAnthropic:
-		return c.predictAnthropicRich(ctx, cfg, opts)
-	default:
-		return nil, fmt.Errorf("cloud-proxy: predict not implemented for provider %q", cfg.provider)
-	}
-}
-
-// PredictStreamRich is the rich streaming counterpart of PredictRich.
-// Each emitted Reply carries either a content delta, tool-call deltas,
-// or usage tokens (the final upstream frame). base.Base.PredictStream
-// is bypassed when AIModelRich is implemented, so the channel is
-// closed by the gRPC server pump.
-func (c *CloudProxy) PredictStreamRich(opts *pb.PredictOptions, results chan<- *pb.Reply) (err error) {
-	cfg := c.cfg.Load()
-	if cfg == nil {
-		return grpcerrors.ModelNotLoaded("cloud-proxy")
-	}
-	if cfg.mode != modeTranslate {
-		return fmt.Errorf("cloud-proxy: PredictStream only valid in translate mode (have %s)", cfg.mode)
-	}
-	xlog.Info("cloud-proxy: predict-stream", "provider", cfg.provider, "upstream", cfg.upstreamURL, "upstream_model", cfg.upstreamModel)
-	defer func() {
-		if err != nil {
-			xlog.Warn("cloud-proxy: predict-stream failed", "provider", cfg.provider, "error", err)
-		}
-	}()
-	ctx := context.Background()
-	switch cfg.provider {
-	case providerOpenAI:
-		return c.predictOpenAIStreamRich(ctx, cfg, opts, results)
-	case providerAnthropic:
-		return c.predictAnthropicStreamRich(ctx, cfg, opts, results)
-	default:
-		return fmt.Errorf("cloud-proxy: predictStream not implemented for provider %q", cfg.provider)
-	}
-}
-
-// Predict is the legacy (string, error) AIModel signature. Used only
-// if a caller goes through the non-rich path (it shouldn't, since
-// server.go prefers PredictRich). Provided so the AIModel interface
-// is satisfied for backends that haven't opted into the rich variant.
-func (c *CloudProxy) Predict(opts *pb.PredictOptions) (string, error) {
-	reply, err := c.PredictRich(opts)
-	if err != nil {
-		return "", err
-	}
-	return string(reply.GetMessage()), nil
-}
-
-// PredictStream is the legacy chan-string streaming path. Adapts the
-// rich stream by extracting only content text — tool-call-only chunks
-// (no Message bytes) and usage-only chunks are silently dropped, since
-// the legacy chan-string contract cannot represent them. Consumers
-// that need tool calls must call PredictStreamRich directly.
-func (c *CloudProxy) PredictStream(opts *pb.PredictOptions, results chan string) error {
-	defer close(results)
-	richCh := make(chan *pb.Reply)
-	errCh := make(chan error, 1)
-	go func() {
-		errCh <- c.PredictStreamRich(opts, richCh)
-		close(richCh)
-	}()
-	for reply := range richCh {
-		if msg := reply.GetMessage(); len(msg) > 0 {
-			results <- string(msg)
-		}
-	}
-	return <-errCh
-}
-
-// sendReply pushes one Reply onto a stream channel honouring ctx
-// cancellation. Returns false on cancel so the caller can exit with
-// ctx.Err(). Used by both translate-mode providers.
-func sendReply(ctx context.Context, results chan<- *pb.Reply, reply *pb.Reply) bool {
-	select {
-	case results <- reply:
-		return true
-	case <-ctx.Done():
-		return false
-	}
-}
-
-// newToolCallDelta is a small constructor for the cross-provider
-// tool-call delta shape. Centralised so the int32 cast and the four
-// fields stay consistent across the OpenAI / Anthropic translators.
-// Empty name/args are valid — Anthropic streaming announces the call
-// with id+name then sends arguments incrementally; OpenAI's reverse
-// pattern (args without name) also lands here.
-func newToolCallDelta(index int, id, name, args string) *pb.ToolCallDelta {
-	return &pb.ToolCallDelta{
-		Index:     int32(index),
-		Id:        id,
-		Name:      name,
-		Arguments: args,
-	}
-}
-
-// Forward shovels bytes between a Forward gRPC stream and an upstream
-// HTTP request. First request message carries path/method/headers and
-// the initial body chunk; subsequent messages append body chunks. The
-// first reply carries upstream status + response headers; subsequent
-// replies stream body chunks until the upstream connection closes.
-// Cancellation of ctx (the gRPC stream context) closes the upstream
-// connection.
-func (c *CloudProxy) Forward(ctx context.Context, in <-chan *pb.ForwardRequest, out chan<- *pb.ForwardReply) error {
-	defer close(out)
-
-	cfg := c.cfg.Load()
-	if cfg == nil {
-		return grpcerrors.ModelNotLoaded("cloud-proxy")
-	}
-	if cfg.mode != modePassthrough {
-		return fmt.Errorf("cloud-proxy: Forward only valid in passthrough mode (have %s)", cfg.mode)
-	}
-
-	first, ok := <-in
-	if !ok {
-		return errors.New("cloud-proxy: Forward stream closed before first request")
-	}
-
-	// Honour the per-request path only when the configured upstream_url
-	// has no path of its own — gallery convention is to put the
-	// canonical path in upstream_url.
-	fullURL, err := composeURL(cfg.upstreamURL, first.GetPath())
-	if err != nil {
-		return err
-	}
-
-	method := first.GetMethod()
-	if method == "" {
-		method = http.MethodPost
-	}
-
-	// Pipe the body in from the gRPC stream so the HTTP request can
-	// start before the client finishes sending. The pipe-reader is
-	// closed via CloseWithError on the error paths so the writer
-	// goroutine doesn't block forever.
-	pr, pw := io.Pipe()
-
-	go func() {
-		var writeErr error
-		defer func() { _ = pw.CloseWithError(writeErr) }()
-		if len(first.GetBodyChunk()) > 0 {
-			if _, writeErr = pw.Write(first.GetBodyChunk()); writeErr != nil {
-				return
-			}
-		}
-		for req := range in {
-			if len(req.GetBodyChunk()) == 0 {
-				continue
-			}
-			if _, writeErr = pw.Write(req.GetBodyChunk()); writeErr != nil {
-				return
-			}
-		}
-	}()
-
-	req, err := http.NewRequestWithContext(ctx, method, fullURL, pr)
-	if err != nil {
-		_ = pr.CloseWithError(err) // unblocks the body-pump's pw.Write
-		return fmt.Errorf("cloud-proxy: build request: %w", err)
-	}
-
-	// Apply caller-supplied headers, then override with the
-	// authorization header derived from the resolved key. Caller-
-	// supplied Authorization is always replaced — operators may not
-	// know the backend's auth scheme, and silently leaking through a
-	// client Authorization header to a different upstream would
-	// confuse the upstream and could leak credentials.
-	for _, h := range first.GetHeaders() {
-		if h == nil || h.GetName() == "" {
-			continue
-		}
-		// Strip hop-by-hop headers that aren't meaningful to the
-		// upstream (Host is set by the http client from the URL;
-		// Content-Length is computed from the body).
-		if isHopByHopHeader(h.GetName()) {
-			continue
-		}
-		req.Header.Add(h.GetName(), h.GetValue())
-	}
-	if cfg.apiKey != "" {
-		applyAuthHeader(req, cfg.provider, cfg.apiKey)
-	}
-
-	xlog.Info("cloud-proxy: forward", "method", method, "url", fullURL, "provider", cfg.provider)
-	resp, err := c.client.Do(req)
-	if err != nil {
-		xlog.Warn("cloud-proxy: forward upstream failed", "url", fullURL, "error", err)
-		return fmt.Errorf("cloud-proxy: upstream request failed: %w", err)
-	}
-	defer func() { _ = resp.Body.Close() }()
-
-	logFn := xlog.Info
-	if resp.StatusCode >= 400 {
-		logFn = xlog.Warn
-	}
-	logFn("cloud-proxy: forward response", "url", fullURL, "status", resp.StatusCode)
-
-	// First reply: status + response headers, no body.
-	headers := make([]*pb.ForwardHeader, 0, len(resp.Header))
-	for k, vs := range resp.Header {
-		for _, v := range vs {
-			headers = append(headers, &pb.ForwardHeader{Name: k, Value: v})
-		}
-	}
-	out <- &pb.ForwardReply{Status: int32(resp.StatusCode), Headers: headers}
-
-	// Subsequent replies: body chunks. Use a fixed 8KB buffer — small
-	// enough that SSE token frames flush promptly, large enough that
-	// long chunked-transfer bodies aren't death by a thousand reads.
-	buf := make([]byte, 8*1024)
-	for {
-		n, rerr := resp.Body.Read(buf)
-		if n > 0 {
-			chunk := make([]byte, n)
-			copy(chunk, buf[:n])
-			out <- &pb.ForwardReply{BodyChunk: chunk}
-		}
-		if rerr != nil {
-			if errors.Is(rerr, io.EOF) {
-				return nil
-			}
-			return fmt.Errorf("cloud-proxy: upstream body read: %w", rerr)
-		}
-	}
-}
-
-// composeURL combines the configured upstream URL with the per-request
-// path. The upstream URL typically already includes the canonical path
-// (e.g. https://api.openai.com/v1/chat/completions) so the per-request
-// path is ignored in that case. When upstream_url is a bare host
-// (https://api.openai.com), the request path is appended.
-func composeURL(upstream, reqPath string) (string, error) {
-	u, err := url.Parse(upstream)
-	if err != nil {
-		return "", fmt.Errorf("cloud-proxy: parse upstream_url %q: %w", upstream, err)
-	}
-	if u.Path == "" || u.Path == "/" {
-		u.Path = reqPath
-	}
-	return u.String(), nil
-}
-
-// applyAuthHeader writes the appropriate authorization header for the
-// provider. OpenAI/Anthropic/most providers use Bearer; Anthropic
-// historically uses x-api-key + anthropic-version, but accepts Bearer
-// too via the OpenAI-compatible path. Default to Bearer when provider
-// is empty (passthrough mode where the operator doesn't claim a
-// provider).
-func applyAuthHeader(req *http.Request, provider, key string) {
-	switch provider {
-	case providerAnthropic:
-		req.Header.Set("x-api-key", key)
-		if req.Header.Get("anthropic-version") == "" {
-			req.Header.Set("anthropic-version", "2023-06-01")
-		}
-	default:
-		req.Header.Set("Authorization", "Bearer "+key)
-	}
-}
-
-// isHopByHopHeader returns true for headers that should not be
-// forwarded from the client request to the upstream (RFC 7230 §6.1
-// hop-by-hop list, plus a few that the http.Client sets itself).
-func isHopByHopHeader(name string) bool {
-	switch strings.ToLower(name) {
-	case "connection", "proxy-connection", "keep-alive", "transfer-encoding",
-		"te", "trailer", "upgrade", "host", "content-length":
-		return true
-	}
-	return false
-}
--- a/backend/go/cloud-proxy/proxy_test.go
+++ b/backend/go/cloud-proxy/proxy_test.go
@@ -1,206 +0,0 @@
-package main
-
-import (
-	"context"
-	"errors"
-	"io"
-	"net/http"
-	"net/http/httptest"
-	"strings"
-	"testing"
-
-	grpc "github.com/mudler/LocalAI/pkg/grpc"
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-
-	. "github.com/onsi/gomega"
-)
-
-// helper: run a CloudProxy in-process via grpc.Provide so tests can
-// call Forward through the public Backend interface without listening
-// on a real socket.
-func newInProcClient(t *testing.T, proxy *CloudProxy) grpc.Backend {
-	t.Helper()
-	addr := "test://" + t.Name()
-	grpc.Provide(addr, proxy)
-	return grpc.NewClient(addr, true, nil, false)
-}
-
-func TestForward_PassthroughEcho(t *testing.T) {
-	g := NewWithT(t)
-	// Fake upstream: echoes the request body back, prefixed with a
-	// canary so the test can assert both that the body reached the
-	// upstream and the response made it back to the client.
-	gotBody := make(chan string, 1)
-	gotAuth := make(chan string, 1)
-	gotPath := make(chan string, 1)
-	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-		body, _ := io.ReadAll(r.Body)
-		gotBody <- string(body)
-		gotAuth <- r.Header.Get("Authorization")
-		gotPath <- r.URL.Path
-		w.Header().Set("X-Echo", "true")
-		w.WriteHeader(http.StatusOK)
-		_, _ = w.Write([]byte("echo: " + string(body)))
-	}))
-	defer upstream.Close()
-
-	t.Setenv("CLOUD_PROXY_FAKE_KEY", "sk-fake")
-
-	cp := NewCloudProxy()
-	err := cp.Load(&pb.ModelOptions{
-		Proxy: &pb.ProxyOptions{
-			UpstreamUrl: upstream.URL,
-			Mode:        modePassthrough,
-			ApiKeyEnv:   "CLOUD_PROXY_FAKE_KEY",
-		},
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-
-	c := newInProcClient(t, cp)
-	stream, err := c.Forward(context.Background())
-	g.Expect(err).NotTo(HaveOccurred())
-
-	err = stream.Send(&pb.ForwardRequest{
-		Path:      "/v1/chat/completions",
-		Method:    "POST",
-		Headers:   []*pb.ForwardHeader{{Name: "Content-Type", Value: "application/json"}},
-		BodyChunk: []byte(`{"prompt":`),
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	err = stream.Send(&pb.ForwardRequest{BodyChunk: []byte(`"hi"}`)})
-	g.Expect(err).NotTo(HaveOccurred())
-	err = stream.CloseSend()
-	g.Expect(err).NotTo(HaveOccurred())
-
-	// First reply: status + headers.
-	first, err := stream.Recv()
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(first.Status).To(Equal(int32(http.StatusOK)))
-	g.Expect(hasHeader(first.Headers, "X-Echo", "true")).To(BeTrue())
-
-	// Subsequent replies: body.
-	var body []byte
-	for {
-		r, err := stream.Recv()
-		if errors.Is(err, io.EOF) {
-			break
-		}
-		g.Expect(err).NotTo(HaveOccurred())
-		body = append(body, r.BodyChunk...)
-	}
-	g.Expect(string(body)).To(Equal(`echo: {"prompt":"hi"}`))
-
-	// Upstream observations.
-	var gotBodyVal, gotAuthVal, gotPathVal string
-	g.Eventually(gotBody).Should(Receive(&gotBodyVal), "upstream never saw body")
-	g.Expect(gotBodyVal).To(Equal(`{"prompt":"hi"}`))
-	g.Eventually(gotAuth).Should(Receive(&gotAuthVal), "upstream never saw auth header")
-	g.Expect(gotAuthVal).To(Equal("Bearer sk-fake"))
-	g.Eventually(gotPath).Should(Receive(&gotPathVal), "upstream never saw path")
-	g.Expect(gotPathVal).To(Equal("/v1/chat/completions"))
-}
-
-func TestForward_AnthropicAuthHeader(t *testing.T) {
-	g := NewWithT(t)
-	gotXAPIKey := make(chan string, 1)
-	gotVersion := make(chan string, 1)
-	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-		gotXAPIKey <- r.Header.Get("x-api-key")
-		gotVersion <- r.Header.Get("anthropic-version")
-		w.WriteHeader(http.StatusOK)
-	}))
-	defer upstream.Close()
-
-	t.Setenv("CLOUD_PROXY_ANTHROPIC_KEY", "sk-ant-fake")
-
-	cp := NewCloudProxy()
-	err := cp.Load(&pb.ModelOptions{
-		Proxy: &pb.ProxyOptions{
-			UpstreamUrl: upstream.URL,
-			Mode:        modePassthrough,
-			Provider:    providerAnthropic,
-			ApiKeyEnv:   "CLOUD_PROXY_ANTHROPIC_KEY",
-		},
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-
-	c := newInProcClient(t, cp)
-	stream, err := c.Forward(context.Background())
-	g.Expect(err).NotTo(HaveOccurred())
-	err = stream.Send(&pb.ForwardRequest{Path: "/v1/messages", Method: "POST"})
-	g.Expect(err).NotTo(HaveOccurred())
-	_ = stream.CloseSend()
-	_, _ = stream.Recv() // drain status
-	for {
-		if _, err := stream.Recv(); err != nil { // io.EOF or any other error ends the drain
-			break
-		}
-	}
-
-	g.Expect(<-gotXAPIKey).To(Equal("sk-ant-fake"))
-	g.Expect(<-gotVersion).NotTo(BeEmpty())
-}
-
-func TestLoad_ValidatesConfig(t *testing.T) {
-	g := NewWithT(t)
-	cp := NewCloudProxy()
-
-	err := cp.Load(&pb.ModelOptions{})
-	g.Expect(err).To(HaveOccurred())
-	g.Expect(err.Error()).To(ContainSubstring("ProxyOptions"))
-
-	err = cp.Load(&pb.ModelOptions{Proxy: &pb.ProxyOptions{}})
-	g.Expect(err).To(HaveOccurred())
-	g.Expect(err.Error()).To(ContainSubstring("upstream_url"))
-
-	err = cp.Load(&pb.ModelOptions{Proxy: &pb.ProxyOptions{
-		UpstreamUrl: "https://example.com",
-		Mode:        "rewrite",
-	}})
-	g.Expect(err).To(HaveOccurred())
-	g.Expect(err.Error()).To(ContainSubstring("unknown mode"))
-
-	// translate + openai should load successfully (Phase 5).
-	err = cp.Load(&pb.ModelOptions{Proxy: &pb.ProxyOptions{
-		UpstreamUrl: "https://example.com/v1/chat/completions",
-		Mode:        modeTranslate,
-		Provider:    providerOpenAI,
-	}})
-	g.Expect(err).NotTo(HaveOccurred())
-
-	// translate + anthropic should load successfully (Phase 6).
-	err = cp.Load(&pb.ModelOptions{Proxy: &pb.ProxyOptions{
-		UpstreamUrl: "https://example.com/v1/messages",
-		Mode:        modeTranslate,
-		Provider:    providerAnthropic,
-	}})
-	g.Expect(err).NotTo(HaveOccurred())
-
-	err = cp.Load(&pb.ModelOptions{Proxy: &pb.ProxyOptions{
-		UpstreamUrl: "https://example.com",
-		ApiKeyEnv:   "DEFINITELY_UNSET_ENV_VAR_XYZ",
-	}})
-	g.Expect(err).To(HaveOccurred())
-	g.Expect(err.Error()).To(ContainSubstring("unset"))
-}
-
-func TestForward_RejectsWithoutLoad(t *testing.T) {
-	g := NewWithT(t)
-	cp := NewCloudProxy()
-	c := newInProcClient(t, cp)
-	stream, err := c.Forward(context.Background())
-	g.Expect(err).NotTo(HaveOccurred())
-	_ = stream.CloseSend()
-	_, err = stream.Recv()
-	g.Expect(err).To(HaveOccurred())
-	g.Expect(err.Error()).To(ContainSubstring("not loaded"))
-}
-
-func hasHeader(hs []*pb.ForwardHeader, name, value string) bool {
-	for _, h := range hs {
-		if strings.EqualFold(h.GetName(), name) && h.GetValue() == value {
-			return true
-		}
-	}
-	return false
-}
--- a/backend/go/cloud-proxy/run.sh
+++ b/backend/go/cloud-proxy/run.sh
@@ -1,6 +0,0 @@
-#!/bin/bash
-set -ex
-
-CURDIR=$(dirname "$(realpath "$0")")
-
-exec "$CURDIR/cloud-proxy" "$@"
--- a/backend/go/cloud-proxy/toolcalls_test.go
+++ b/backend/go/cloud-proxy/toolcalls_test.go
@@ -1,232 +0,0 @@
-package main
-
-import (
-	"strings"
-	"testing"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	. "github.com/onsi/gomega"
-)
-
-// OpenAI: non-streaming tool call response. Verify the response is
-// mapped to Reply.ChatDeltas[].ToolCalls with id/name/arguments intact,
-// and usage tokens land on Reply.PromptTokens / Reply.Tokens.
-func TestPredictRich_OpenAI_ToolCalls(t *testing.T) {
-	srv, _ := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
-		return 200, `{
-			"id":"resp-1",
-			"choices":[{
-				"index":0,
-				"message":{
-					"role":"assistant",
-					"content":"",
-					"tool_calls":[
-						{"id":"call_abc","type":"function","function":{"name":"get_weather","arguments":"{\"location\":\"SF\"}"}},
-						{"id":"call_def","type":"function","function":{"name":"get_time","arguments":"{\"tz\":\"PT\"}"}}
-					]
-				},
-				"finish_reason":"tool_calls"
-			}],
-			"usage":{"prompt_tokens":42,"completion_tokens":18,"total_tokens":60}
-		}`, "application/json"
-	})
-	defer srv.Close()
-	g := NewWithT(t)
-	cp := newTranslateCloudProxy(t, srv.URL)
-
-	reply, err := cp.PredictRich(&pb.PredictOptions{
-		Messages: []*pb.Message{{Role: "user", Content: "what's the weather?"}},
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(string(reply.GetMessage())).To(Equal(""))
-	g.Expect(reply.GetPromptTokens()).To(Equal(int32(42)))
-	g.Expect(reply.GetTokens()).To(Equal(int32(18)))
-	g.Expect(reply.GetChatDeltas()).To(HaveLen(1))
-	tcs := reply.GetChatDeltas()[0].GetToolCalls()
-	g.Expect(tcs).To(HaveLen(2))
-	g.Expect(tcs[0].GetId()).To(Equal("call_abc"))
-	g.Expect(tcs[0].GetName()).To(Equal("get_weather"))
-	g.Expect(tcs[0].GetArguments()).To(ContainSubstring(`"location":"SF"`))
-	g.Expect(tcs[1].GetId()).To(Equal("call_def"))
-	g.Expect(tcs[1].GetName()).To(Equal("get_time"))
-}
-
-// OpenAI: streaming tool call. Arguments arrive as a sequence of
-// delta chunks; the consumer is expected to concatenate by tool index.
-// Verify each chunk reaches the channel and the assembled arguments
-// match the input.
-func TestPredictStreamRich_OpenAI_ToolCallDeltas(t *testing.T) {
-	chunks := []string{
-		// Frame 0: announce the tool call (id + name, no args yet).
-		`{"choices":[{"index":0,"delta":{"role":"assistant","tool_calls":[{"index":0,"id":"call_xyz","type":"function","function":{"name":"search"}}]}}]}`,
-		// Frames 1-3: arguments arrive in fragments.
-		`{"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\"q\":"}}]}}]}`,
-		`{"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"clo"}}]}}]}`,
-		`{"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"uds\"}"}}]}}]}`,
-		// Stop frame.
-		`{"choices":[{"index":0,"delta":{},"finish_reason":"tool_calls"}]}`,
-	}
-	body := ""
-	for _, c := range chunks {
-		body += "data: " + c + "\n\n"
-	}
-	body += "data: [DONE]\n\n"
-
-	srv, _ := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
-		return 200, body, "text/event-stream"
-	})
-	defer srv.Close()
-	g := NewWithT(t)
-	cp := newTranslateCloudProxy(t, srv.URL)
-
-	results := make(chan *pb.Reply, 16)
-	done := make(chan error, 1)
-	go func() {
-		done <- cp.PredictStreamRich(&pb.PredictOptions{
-			Messages: []*pb.Message{{Role: "user", Content: "find something"}},
-		}, results)
-		close(results)
-	}()
-
-	var (
-		toolName  string
-		toolID    string
-		toolIndex int32 = -1
-		argsBuf   strings.Builder
-	)
-	for reply := range results {
-		for _, cd := range reply.GetChatDeltas() {
-			for _, tc := range cd.GetToolCalls() {
-				if tc.GetName() != "" {
-					toolName = tc.GetName()
-				}
-				if tc.GetId() != "" {
-					toolID = tc.GetId()
-				}
-				if toolIndex == -1 {
-					toolIndex = tc.GetIndex()
-				}
-				argsBuf.WriteString(tc.GetArguments())
-			}
-		}
-	}
-	err := <-done
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(toolID).To(Equal("call_xyz"))
-	g.Expect(toolName).To(Equal("search"))
-	g.Expect(toolIndex).To(Equal(int32(0)))
-	g.Expect(argsBuf.String()).To(Equal(`{"q":"clouds"}`))
-}
-
-// Anthropic: non-streaming tool_use block. The block appears in
-// Content[] alongside text blocks; the input field is a structured
-// JSON object. Map to ToolCallDelta with arguments as serialised JSON
-// so downstream OpenAI-shaped consumers see a familiar format.
-func TestPredictRich_Anthropic_ToolUse(t *testing.T) {
-	srv, _ := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, `{
-			"id":"msg_1","type":"message","role":"assistant",
-			"content":[
-				{"type":"text","text":"Let me check that."},
-				{"type":"tool_use","id":"toolu_01","name":"weather","input":{"location":"SF"}}
-			],
-			"model":"claude","usage":{"input_tokens":12,"output_tokens":34}
-		}`, "application/json"
-	})
-	defer srv.Close()
-	g := NewWithT(t)
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	reply, err := cp.PredictRich(&pb.PredictOptions{
-		Messages: []*pb.Message{{Role: "user", Content: "what's the weather?"}},
-		Tokens:   64,
-	})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(string(reply.GetMessage())).To(Equal("Let me check that."))
-	g.Expect(reply.GetPromptTokens()).To(Equal(int32(12)))
-	g.Expect(reply.GetTokens()).To(Equal(int32(34)))
-	g.Expect(reply.GetChatDeltas()).To(HaveLen(1))
-	g.Expect(reply.GetChatDeltas()[0].GetToolCalls()).To(HaveLen(1))
-	tc := reply.GetChatDeltas()[0].GetToolCalls()[0]
-	g.Expect(tc.GetId()).To(Equal("toolu_01"))
-	g.Expect(tc.GetName()).To(Equal("weather"))
-	g.Expect(tc.GetArguments()).To(ContainSubstring(`"location":"SF"`))
-}
-
-// Anthropic: streaming tool_use. content_block_start announces the
-// tool's id + name; input_json_delta events carry argument fragments
-// which the consumer accumulates. message_delta carries final usage.
-func TestPredictStreamRich_Anthropic_InputJSONDelta(t *testing.T) {
-	frames := []string{
-		"event: message_start\ndata: {\"type\":\"message_start\"}\n\n",
-		// Block 0 is a tool_use; consumer should allocate a slot.
-		"event: content_block_start\ndata: {\"type\":\"content_block_start\",\"index\":0,\"content_block\":{\"type\":\"tool_use\",\"id\":\"toolu_42\",\"name\":\"lookup\"}}\n\n",
-		"event: content_block_delta\ndata: {\"type\":\"content_block_delta\",\"index\":0,\"delta\":{\"type\":\"input_json_delta\",\"partial_json\":\"{\\\"q\\\":\"}}\n\n",
-		"event: content_block_delta\ndata: {\"type\":\"content_block_delta\",\"index\":0,\"delta\":{\"type\":\"input_json_delta\",\"partial_json\":\"\\\"rain\\\"}\"}}\n\n",
-		"event: content_block_stop\ndata: {\"type\":\"content_block_stop\",\"index\":0}\n\n",
-		"event: message_delta\ndata: {\"type\":\"message_delta\",\"usage\":{\"output_tokens\":7}}\n\n",
-		"event: message_stop\ndata: {\"type\":\"message_stop\"}\n\n",
-	}
-	body := strings.Join(frames, "")
-
-	srv, _ := fakeAnthropicUpstream(t, func(_ anthropicRequest) (int, string, string) {
-		return 200, body, "text/event-stream"
-	})
-	defer srv.Close()
-	g := NewWithT(t)
-	cp := newAnthropicTranslateCloudProxy(t, srv.URL)
-
-	results := make(chan *pb.Reply, 16)
-	done := make(chan error, 1)
-	go func() {
-		done <- cp.PredictStreamRich(&pb.PredictOptions{
-			Messages: []*pb.Message{{Role: "user", Content: "rain?"}},
-			Tokens:   64,
-		}, results)
-		close(results)
-	}()
-
-	var (
-		toolID, toolName string
-		argsBuf          strings.Builder
-		finalTokens      int32
-	)
-	for reply := range results {
-		if reply.GetTokens() > 0 && len(reply.GetChatDeltas()) == 0 {
-			finalTokens = reply.GetTokens()
-			continue
-		}
-		for _, cd := range reply.GetChatDeltas() {
-			for _, tc := range cd.GetToolCalls() {
-				if tc.GetId() != "" {
-					toolID = tc.GetId()
-				}
-				if tc.GetName() != "" {
-					toolName = tc.GetName()
-				}
-				argsBuf.WriteString(tc.GetArguments())
-			}
-		}
-	}
-	err := <-done
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(toolID).To(Equal("toolu_42"))
-	g.Expect(toolName).To(Equal("lookup"))
-	g.Expect(argsBuf.String()).To(Equal(`{"q":"rain"}`))
-	g.Expect(finalTokens).To(Equal(int32(7)))
-}
-
-// Sanity: the legacy Predict() (string, error) signature still works
-// — it delegates to PredictRich and extracts Message.
-func TestPredict_LegacyWrapper_OpenAI(t *testing.T) {
-	srv, _ := fakeOpenAIUpstream(t, func(_ openAIRequest) (int, string, string) {
-		return 200, `{"choices":[{"message":{"role":"assistant","content":"hello"}}]}`, "application/json"
-	})
-	defer srv.Close()
-	g := NewWithT(t)
-	cp := newTranslateCloudProxy(t, srv.URL)
-
-	got, err := cp.Predict(&pb.PredictOptions{Messages: []*pb.Message{{Role: "user", Content: "hi"}}})
-	g.Expect(err).NotTo(HaveOccurred())
-	g.Expect(got).To(Equal("hello"))
-}
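The streaming tests above leave it to the consumer to concatenate argument fragments by tool index. The sketch below shows one way to do that for several interleaved tool calls. The `fragment` and `assembled` types and the `assemble` function are hypothetical stand-ins, not the generated `pb` types; they assume only what the tests show, namely that id and name arrive once and arguments arrive in ordered pieces.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// fragment is a hypothetical stand-in for one streamed tool-call delta.
type fragment struct {
	Index     int32
	ID        string
	Name      string
	Arguments string
}

// assembled is the fully concatenated tool call.
type assembled struct {
	ID, Name, Arguments string
}

// assemble merges fragments that share an Index, keeping the first non-empty
// id/name seen and appending argument pieces in arrival order.
func assemble(frags []fragment) []assembled {
	type slot struct {
		id, name string
		args     strings.Builder
	}
	slots := map[int32]*slot{}
	for _, f := range frags {
		s, ok := slots[f.Index]
		if !ok {
			s = &slot{}
			slots[f.Index] = s
		}
		if s.id == "" {
			s.id = f.ID
		}
		if s.name == "" {
			s.name = f.Name
		}
		s.args.WriteString(f.Arguments)
	}
	// Emit in index order so the result is deterministic.
	idx := make([]int32, 0, len(slots))
	for i := range slots {
		idx = append(idx, i)
	}
	sort.Slice(idx, func(a, b int) bool { return idx[a] < idx[b] })
	out := make([]assembled, 0, len(idx))
	for _, i := range idx {
		s := slots[i]
		out = append(out, assembled{ID: s.id, Name: s.name, Arguments: s.args.String()})
	}
	return out
}

func main() {
	calls := assemble([]fragment{
		{Index: 0, ID: "call_xyz", Name: "search"},
		{Index: 0, Arguments: `{"q":`},
		{Index: 0, Arguments: `"clo`},
		{Index: 0, Arguments: `uds"}`},
	})
	fmt.Printf("%+v\n", calls)
}
```

Keying on the index rather than the id matters because, in the frames used by the tests, only the first fragment carries the id; later fragments carry the index alone.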
--- a/backend/go/crispasr/.gitignore
+++ b/backend/go/crispasr/.gitignore
@@ -1,5 +0,0 @@
-sources
-build*
-libgocrispasr*.so
-crispasr
-package
--- a/backend/go/crispasr/CMakeLists.txt
+++ b/backend/go/crispasr/CMakeLists.txt
@@ -1,30 +0,0 @@
-cmake_minimum_required(VERSION 3.12)
-project(gocrispasr LANGUAGES C CXX)
-set(CMAKE_POSITION_INDEPENDENT_CODE ON)
-set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
-
-add_subdirectory(./sources/CrispASR)
-
-add_library(gocrispasr MODULE cpp/crispasr_shim.cpp)
-target_include_directories(gocrispasr PRIVATE
-    ${CMAKE_CURRENT_SOURCE_DIR}/sources/CrispASR/include
-    ${CMAKE_CURRENT_SOURCE_DIR}/sources/CrispASR/ggml/include)
-# Link the same backend set as crispasr-cli (examples/cli/CMakeLists.txt) so
-# the session API can dispatch to every compiled-in architecture, not just
-# whisper. crispasr is the referencer; the backend static libs supply the
-# per-architecture symbols; ggml is the math/runtime base.
-target_link_libraries(gocrispasr PRIVATE
-    crispasr-lib
-    parakeet canary canary_ctc cohere granite_speech granite_nle
-    voxtral voxtral4b qwen3_asr qwen3_tts orpheus chatterbox indextts
-    kokoro voxcpm2_tts m2m100 t5_translate wav2vec2-ggml vibevoice
-    silero-lid pyannote-seg funasr paraformer sensevoice
-    crisp_audio
-    ggml)
-
-if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
-    target_link_libraries(gocrispasr PRIVATE stdc++fs)
-endif()
-
-set_property(TARGET gocrispasr PROPERTY CXX_STANDARD 17)
-set_target_properties(gocrispasr PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
--- a/backend/go/crispasr/Makefile
+++ b/backend/go/crispasr/Makefile
@@ -1,132 +0,0 @@
-CMAKE_ARGS?=
-BUILD_TYPE?=
-NATIVE?=false
-
-GOCMD?=go
-GO_TAGS?=
-JOBS?=$(shell nproc --ignore=1)
-
-# CrispASR version (pinned commit hash)
-CRISPASR_REPO?=https://github.com/CrispStrobe/CrispASR
-CRISPASR_VERSION?=c29f6653a516a3001d923944dad8892072cc7334
-SO_TARGET?=libgocrispasr.so
-
-CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
-# Keep the build lean: no tests/examples/server/SDL2/curl/ffmpeg (the FROM scratch
-# image cannot satisfy those runtime deps). All ASR/TTS model backends stay enabled.
-CMAKE_ARGS+=-DCRISPASR_BUILD_TESTS=OFF -DCRISPASR_BUILD_EXAMPLES=OFF -DCRISPASR_BUILD_SERVER=OFF
-CMAKE_ARGS+=-DCRISPASR_SDL2=OFF -DCRISPASR_CURL=OFF -DCRISPASR_FFMPEG=OFF
-
-ifeq ($(NATIVE),false)
-	CMAKE_ARGS+=-DGGML_NATIVE=OFF
-endif
-
-ifeq ($(BUILD_TYPE),cublas)
-	CMAKE_ARGS+=-DGGML_CUDA=ON
-else ifeq ($(BUILD_TYPE),openblas)
-	CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
-else ifeq ($(BUILD_TYPE),clblas)
-	CMAKE_ARGS+=-DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path
-else ifeq ($(BUILD_TYPE),hipblas)
-	CMAKE_ARGS+=-DGGML_HIPBLAS=ON
-else ifeq ($(BUILD_TYPE),vulkan)
-	CMAKE_ARGS+=-DGGML_VULKAN=ON
-else ifeq ($(OS),Darwin)
-	ifneq ($(BUILD_TYPE),metal)
-		CMAKE_ARGS+=-DGGML_METAL=OFF
-	else
-		CMAKE_ARGS+=-DGGML_METAL=ON
-		CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
-	endif
-endif
-
-ifeq ($(BUILD_TYPE),sycl_f16)
-	CMAKE_ARGS+=-DGGML_SYCL=ON \
-		-DCMAKE_C_COMPILER=icx \
-		-DCMAKE_CXX_COMPILER=icpx \
-		-DGGML_SYCL_F16=ON
-endif
-
-ifeq ($(BUILD_TYPE),sycl_f32)
-	CMAKE_ARGS+=-DGGML_SYCL=ON \
-		-DCMAKE_C_COMPILER=icx \
-		-DCMAKE_CXX_COMPILER=icpx
-endif
-
-sources/CrispASR:
-	mkdir -p sources/CrispASR
-	cd sources/CrispASR && \
-	git init && \
-	git remote add origin $(CRISPASR_REPO) && \
-	git fetch origin && \
-	git checkout $(CRISPASR_VERSION) && \
-	git submodule update --init --recursive --depth 1 --single-branch
-	# CrispASR's src/CMakeLists.txt locates its vendored llama.cpp
-	# (crispasr-llama-core, used by the chat C-ABI) via ${CMAKE_SOURCE_DIR},
-	# which assumes CrispASR is the top-level CMake project. We add_subdirectory
-	# it, so ${CMAKE_SOURCE_DIR} is THIS backend dir and the talk-llama sources
-	# aren't found. Rewrite to ${PROJECT_SOURCE_DIR} (the crispasr project root),
-	# which is correct both standalone and as a subproject. Idempotent.
-	sed -i 's#\$${CMAKE_SOURCE_DIR}/examples/talk-llama#\$${PROJECT_SOURCE_DIR}/examples/talk-llama#' sources/CrispASR/src/CMakeLists.txt
-
-# Detect OS
-UNAME_S := $(shell uname -s)
-
-ifeq ($(UNAME_S),Linux)
-	VARIANT_TARGETS = libgocrispasr-avx.so libgocrispasr-avx2.so libgocrispasr-avx512.so libgocrispasr-fallback.so
-else
-	VARIANT_TARGETS = libgocrispasr-fallback.so
-endif
-
-crispasr: main.go gocrispasr.go $(VARIANT_TARGETS)
-	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o crispasr ./
-
-package: crispasr
-	bash package.sh
-
-build: package
-
-clean: purge
-	rm -rf libgocrispasr*.so package sources/CrispASR crispasr
-
-purge:
-	rm -rf build*
-
-ifeq ($(UNAME_S),Linux)
-libgocrispasr-avx.so: sources/CrispASR
-	$(MAKE) purge
-	$(info ${GREEN}I crispasr build info:avx${RESET})
-	SO_TARGET=libgocrispasr-avx.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgocrispasr-custom
-	rm -rfv build*
-
-libgocrispasr-avx2.so: sources/CrispASR
-	$(MAKE) purge
-	$(info ${GREEN}I crispasr build info:avx2${RESET})
-	SO_TARGET=libgocrispasr-avx2.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgocrispasr-custom
-	rm -rfv build*
-
-libgocrispasr-avx512.so: sources/CrispASR
-	$(MAKE) purge
-	$(info ${GREEN}I crispasr build info:avx512${RESET})
-	SO_TARGET=libgocrispasr-avx512.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgocrispasr-custom
-	rm -rfv build*
-endif
-
-libgocrispasr-fallback.so: sources/CrispASR
-	$(MAKE) purge
-	$(info ${GREEN}I crispasr build info:fallback${RESET})
-	SO_TARGET=libgocrispasr-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgocrispasr-custom
-	rm -rfv build*
-
-libgocrispasr-custom: CMakeLists.txt cpp/crispasr_shim.cpp cpp/crispasr_shim.h
-	mkdir -p build-$(SO_TARGET) && \
-	cd build-$(SO_TARGET) && \
-	cmake .. $(CMAKE_ARGS) && \
-	cmake --build . --config Release -j$(JOBS) && \
-	cd .. && \
-	mv build-$(SO_TARGET)/libgocrispasr.so ./$(SO_TARGET)
-
-test: crispasr
-	CGO_ENABLED=0 $(GOCMD) test -v ./...
-
-all: crispasr package
--- a/backend/go/crispasr/cpp/crispasr_shim.cpp
+++ b/backend/go/crispasr/cpp/crispasr_shim.cpp
@@ -1,253 +0,0 @@
-#include "crispasr_shim.h"
-#include "ggml-backend.h"
-#include "crispasr.h"
-#include <atomic>
-#include <vector>
-
-// Opaque session types. crispasr.h declares `struct crispasr_session;` but not
-// the result type nor the open/transcribe/result accessors — those are
-// CA_EXPORT extern "C" symbols in src/crispasr_c_api.cpp, so we forward-declare
-// exactly the ones we use. Signatures verified against
-// sources/CrispASR/src/crispasr_c_api.cpp.
-struct crispasr_session_result;
-extern "C" {
-crispasr_session *crispasr_session_open(const char *model_path, int n_threads);
-crispasr_session *crispasr_session_open_explicit(const char *model_path,
-                                                 const char *backend_name,
-                                                 int n_threads);
-int crispasr_session_set_codec_path(crispasr_session *s, const char *path);
-void crispasr_session_close(crispasr_session *s);
-const char *crispasr_session_backend(crispasr_session *s);
-int crispasr_session_set_translate(crispasr_session *s, int enable);
-crispasr_session_result *crispasr_session_transcribe_lang(
-    crispasr_session *s, const float *pcm, int n_samples, const char *language);
-int crispasr_session_result_n_segments(crispasr_session_result *r);
-const char *crispasr_session_result_segment_text(crispasr_session_result *r,
-                                                  int i);
-int64_t crispasr_session_result_segment_t0(crispasr_session_result *r, int i);
-int64_t crispasr_session_result_segment_t1(crispasr_session_result *r, int i);
-void crispasr_session_result_free(crispasr_session_result *r);
-float *crispasr_session_synthesize(crispasr_session *s, const char *text,
-                                   int *out_n_samples);
-void crispasr_pcm_free(float *pcm);
-int crispasr_session_set_speaker_name(crispasr_session *s, const char *name);
-int crispasr_session_set_voice(crispasr_session *s, const char *path,
-                               const char *ref_text_or_null);
-}
-
-static crispasr_session *g_session = nullptr;
-static crispasr_session_result *g_result = nullptr;
-
-static struct whisper_vad_context *vctx;
-static std::vector<float> flat_segs;
-
-static std::atomic<int> g_abort{0};
-
-extern "C" void set_abort(int v) {
-  g_abort.store(v, std::memory_order_relaxed);
-}
-
-static void ggml_log_cb(enum ggml_log_level level, const char *log,
-                        void *data) {
-  const char *level_str;
-
-  if (!log) {
-    return;
-  }
-
-  switch (level) {
-  case GGML_LOG_LEVEL_DEBUG:
-    level_str = "DEBUG";
-    break;
-  case GGML_LOG_LEVEL_INFO:
-    level_str = "INFO";
-    break;
-  case GGML_LOG_LEVEL_WARN:
-    level_str = "WARN";
-    break;
-  case GGML_LOG_LEVEL_ERROR:
-    level_str = "ERROR";
-    break;
-  default: /* Potential future-proofing */
-    level_str = "?????";
-    break;
-  }
-
-  fprintf(stderr, "[%-5s] ", level_str);
-  fputs(log, stderr);
-  fflush(stderr);
-}
-
-int load_model(const char *const model_path, int threads,
-               const char *backend_name) {
-  whisper_log_set(ggml_log_cb, nullptr);
-  ggml_backend_load_all();
-
-  if (backend_name && *backend_name) {
-    g_session =
-        crispasr_session_open_explicit(model_path, backend_name, threads);
-  } else {
-    g_session = crispasr_session_open(model_path, threads);
-  }
-  if (g_session == nullptr) {
-    fprintf(stderr, "error: failed to open CrispASR session for model\n");
-    return 1;
-  }
-
-  fprintf(stderr, "info: CrispASR backend selected: %s\n",
-          crispasr_session_backend(g_session));
-  return 0;
-}
-
-// set_codec_path forwards a companion file (qwen3-tts codec, orpheus SNAC,
-// chatterbox s3gen, or mimo-asr tokenizer) to the active session. Returns 0 on
-// success or when the active backend needs no companion, negative on failure,
-// and -1 when no session is open.
-int set_codec_path(const char *path) {
-  return g_session ? crispasr_session_set_codec_path(g_session, path) : -1;
-}
-
-int load_model_vad(const char *const model_path) {
-  whisper_log_set(ggml_log_cb, nullptr);
-  ggml_backend_load_all();
-
-  struct whisper_vad_context_params vcparams =
-      whisper_vad_default_context_params();
-
-  // XXX: Overridden to false in upstream due to performance?
-  // vcparams.use_gpu = true;
-
-  vctx = whisper_vad_init_from_file_with_params(model_path, vcparams);
-  if (vctx == nullptr) {
-    fprintf(stderr, "error: Failed to init model as VAD\n");
-    return 1;
-  }
-
-  return 0;
-}
-
-int vad(float pcmf32[], size_t pcmf32_len, float **segs_out,
-        size_t *segs_out_len) {
-  if (!whisper_vad_detect_speech(vctx, pcmf32, pcmf32_len)) {
-    fprintf(stderr, "error: failed to detect speech\n");
-    return 1;
-  }
-
-  struct whisper_vad_params params = whisper_vad_default_params();
-  struct whisper_vad_segments *segs =
-      whisper_vad_segments_from_probs(vctx, params);
-  size_t segn = whisper_vad_segments_n_segments(segs);
-
-  // fprintf(stderr, "Got segments %zd\n", segn);
-
-  flat_segs.clear();
-
-  for (int i = 0; i < (int)segn; i++) {
-    flat_segs.push_back(whisper_vad_segments_get_segment_t0(segs, i));
-    flat_segs.push_back(whisper_vad_segments_get_segment_t1(segs, i));
-  }
-
-  // fprintf(stderr, "setting out variables: %p=%p -> %p, %p=%zx -> %zx\n",
-  //         segs_out, *segs_out, flat_segs.data(), segs_out_len, *segs_out_len,
-  //         flat_segs.size());
-  *segs_out = flat_segs.data();
-  *segs_out_len = flat_segs.size();
-
-  // fprintf(stderr, "freeing segs\n");
-  whisper_vad_free_segments(segs);
-
-  // fprintf(stderr, "returning\n");
-  return 0;
-}
-
-// threads, diarize and prompt are accepted for Go-side API parity but unused
-// in Phase 1: the thread count is fixed at session open, and diarization and
-// the initial prompt are separate CrispASR features not yet wired through the
-// session ASR path.
-int transcribe(uint32_t threads, char *lang, bool translate, bool diarize,
-               float pcmf32[], size_t pcmf32_len, size_t *segs_out_len,
-               char *prompt) {
-  (void)threads;
-  (void)diarize;
-  (void)prompt;
-
-  if (!g_session) {
-    return 1;
-  }
-
-  // Reset stale abort flag from any prior cancelled call. set_abort remains
-  // best-effort: the session transcribe call is blocking and exposes no abort
-  // hook, so a mid-decode abort cannot interrupt it.
-  g_abort.store(0, std::memory_order_relaxed);
-
-  crispasr_session_set_translate(g_session, translate ? 1 : 0);
-
-  if (g_result) {
-    crispasr_session_result_free(g_result);
-    g_result = nullptr;
-  }
-
-  const char *language = (lang && *lang) ? lang : nullptr;
-  g_result = crispasr_session_transcribe_lang(g_session, pcmf32, (int)pcmf32_len,
-                                              language);
-  if (!g_result) {
-    fprintf(stderr, "error: transcription failed\n");
-    return 1;
-  }
-
-  *segs_out_len = crispasr_session_result_n_segments(g_result);
-  return 0;
-}
-
-const char *get_segment_text(int i) {
-  if (!g_result) {
-    return "";
-  }
-  return crispasr_session_result_segment_text(g_result, i);
-}
-
-int64_t get_segment_t0(int i) {
-  if (!g_result) {
-    return 0;
-  }
-  return crispasr_session_result_segment_t0(g_result, i);
-}
-
-int64_t get_segment_t1(int i) {
-  if (!g_result) {
-    return 0;
-  }
-  return crispasr_session_result_segment_t1(g_result, i);
-}
-
-const char *get_backend(void) {
-  return g_session ? crispasr_session_backend(g_session) : "";
-}
-
-// TTS uses the already-open session (crispasr_session_open auto-detects a TTS
-// model). Output is 24 kHz mono float PCM (upstream CrispASR convention),
-// malloc'd by the C API; the caller must release it via tts_free.
-float *tts_synthesize(const char *text, int *out_n_samples) {
-  if (out_n_samples) *out_n_samples = 0;
-  if (!g_session || !text) return nullptr;
-  return crispasr_session_synthesize(g_session, text, out_n_samples);
-}
-
-void tts_free(float *pcm) {
-  if (pcm) crispasr_pcm_free(pcm);
-}
-
-int tts_set_voice(const char *name) {
-  if (!g_session || !name || !*name) return 0;
-  return crispasr_session_set_speaker_name(g_session, name);
-}
-
-// tts_set_voice_file loads a voice from a file: a .gguf path selects a voice
-// pack, a .wav path with a non-empty ref_text performs zero-shot voice cloning
-// (the C API returns -2 when ref_text is required but missing). Returns -1 when
-// no session is open or path is null.
-int tts_set_voice_file(const char *path, const char *ref_text) {
-  if (!g_session || !path) return -1;
-  const char *ref = (ref_text && *ref_text) ? ref_text : nullptr;
-  return crispasr_session_set_voice(g_session, path, ref);
-}
--- a/backend/go/crispasr/cpp/crispasr_shim.h
+++ b/backend/go/crispasr/cpp/crispasr_shim.h
@@ -1,23 +0,0 @@
-#include <cstddef>
-#include <cstdint>
-
-extern "C" {
-int load_model(const char *const model_path, int threads,
-               const char *backend_name);
-int set_codec_path(const char *path);
-int load_model_vad(const char *const model_path);
-int vad(float pcmf32[], size_t pcmf32_size, float **segs_out,
-        size_t *segs_out_len);
-int transcribe(uint32_t threads, char *lang, bool translate, bool diarize,
-               float pcmf32[], size_t pcmf32_len, size_t *segs_out_len,
-               char *prompt);
-const char *get_segment_text(int i);
-int64_t get_segment_t0(int i);
-int64_t get_segment_t1(int i);
-const char *get_backend(void);
-void set_abort(int v);
-float *tts_synthesize(const char *text, int *out_n_samples); // 24kHz mono float, malloc'd; NULL on failure
-void tts_free(float *pcm);
-int tts_set_voice(const char *name); // best-effort speaker selection; 0 ok
-int tts_set_voice_file(const char *path, const char *ref_text); // load voice pack (.gguf) or zero-shot clone (.wav + ref_text)
-}
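The header above documents `tts_synthesize` as returning 24 kHz mono float PCM. A caller that wants 16-bit WAV output has to convert those samples at some point. The sketch below shows a plain clamp-and-scale conversion. `floatToPCM16` is a hypothetical name, and the assumption that samples are nominally in [-1, 1] is not stated in the header; how the removed Go backend actually performed this step is not visible in this part of the diff.

```go
package main

import "fmt"

// floatToPCM16 converts float samples to signed 16-bit PCM.
// Assumption: samples are nominally in [-1, 1]; out-of-range values are
// clamped instead of being allowed to wrap around.
func floatToPCM16(in []float32) []int16 {
	out := make([]int16, len(in))
	for i, s := range in {
		if s > 1 {
			s = 1
		} else if s < -1 {
			s = -1
		}
		out[i] = int16(s * 32767)
	}
	return out
}

func main() {
	fmt.Println(floatToPCM16([]float32{0, 1, -1, 2, -2}))
}
```

Scaling by 32767 rather than 32768 keeps the conversion symmetric and avoids overflow at +1.0.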
--- a/backend/go/crispasr/gocrispasr.go
+++ b/backend/go/crispasr/gocrispasr.go
@@ -1,497 +0,0 @@
-package main
-
-import (
-	"context"
-	"fmt"
-	"os"
-	"path/filepath"
-	"strings"
-	"sync"
-	"unsafe"
-
-	"github.com/go-audio/audio"
-	"github.com/go-audio/wav"
-	"github.com/mudler/LocalAI/pkg/grpc/base"
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	"github.com/mudler/LocalAI/pkg/utils"
-	"google.golang.org/grpc/codes"
-	"google.golang.org/grpc/status"
-)
-
-var (
-	CppLoadModel       func(modelPath string, threads int, backendName string) int
-	CppSetCodecPath    func(path string) int
-	CppLoadModelVAD    func(modelPath string) int
-	CppVAD             func(pcmf32 []float32, pcmf32Size uintptr, segsOut unsafe.Pointer, segsOutLen unsafe.Pointer) int
-	CppTranscribe      func(threads uint32, lang string, translate bool, diarize bool, pcmf32 []float32, pcmf32Len uintptr, segsOutLen unsafe.Pointer, prompt string) int
-	CppGetSegmentText  func(i int) string
-	CppGetSegmentStart func(i int) int64
-	CppGetSegmentEnd   func(i int) int64
-	CppGetBackend      func() string
-	CppSetAbort        func(v int)
-	CppTTSSynthesize   func(text string, outNSamples unsafe.Pointer) uintptr
-	CppTTSFree         func(ptr uintptr)
-	CppTTSSetVoice     func(name string) int
-	CppTTSSetVoiceFile func(path string, refText string) int
-)
-
-type CrispASR struct {
-	base.SingleThread
-}
-
-// splitOption splits a "prefix:value" model option into its key and value,
-// matching the convention used by other backends (see sherpa-onnx). It returns
-// ok=false when the option carries no ':' separator.
-func splitOption(oo string) (key, value string, ok bool) {
-	parts := strings.SplitN(oo, ":", 2)
-	if len(parts) != 2 {
-		return "", "", false
-	}
-	return parts[0], parts[1], true
-}
-
-func (w *CrispASR) Load(opts *pb.ModelOptions) error {
-	vadOnly := false
-	backendName := ""
-	codecPath := ""
-	speakerName := ""
-	voicePath := ""
-	voiceRefText := ""
-
-	for _, oo := range opts.Options {
-		if oo == "vad_only" {
-			vadOnly = true
-			continue
-		}
-		switch key, value, ok := splitOption(oo); {
-		case ok && key == "backend":
-			backendName = value
-		case ok && key == "codec":
-			codecPath = value
-		case ok && key == "speaker":
-			speakerName = value
-		case ok && key == "voice":
-			voicePath = value
-		case ok && key == "voice_text":
-			voiceRefText = value
-		default:
-			fmt.Fprintf(os.Stderr, "Unrecognized option: %v\n", oo)
-		}
-	}
-
-	if vadOnly {
-		if ret := CppLoadModelVAD(opts.ModelFile); ret != 0 {
-			return fmt.Errorf("crispasr: failed to load VAD model")
-		}
-
-		return nil
-	}
-
-	// Resolve a relative companion path against the model directory so a config
-	// can reference a sibling codec/tokenizer file by name alone.
-	if codecPath != "" && !filepath.IsAbs(codecPath) {
-		codecPath = filepath.Join(filepath.Dir(opts.ModelFile), codecPath)
-	}
-
-	// A voice file (.gguf pack or .wav prompt) is resolved against the model
-	// directory just like the codec, so a config can reference a sibling file.
-	if voicePath != "" && !filepath.IsAbs(voicePath) {
-		voicePath = filepath.Join(filepath.Dir(opts.ModelFile), voicePath)
-	}
-
-	if ret := CppLoadModel(opts.ModelFile, int(opts.Threads), backendName); ret != 0 {
-		return fmt.Errorf("crispasr: failed to load transcription model")
-	}
-
-	// Load the companion file (codec/tokenizer/s3gen) after the session is open.
-	// rc==0 means success or "not applicable" for the active backend; only a
-	// negative code is fatal.
-	if codecPath != "" {
-		if rc := CppSetCodecPath(codecPath); rc < 0 {
-			return fmt.Errorf("crispasr: failed to load companion file %q (rc=%d)", codecPath, rc)
-		}
-		fmt.Fprintf(os.Stderr, "CrispASR companion file loaded: %s\n", codecPath)
-	}
-
-	// Apply the Load-time default voice. A baked speaker (speaker:) is selected
-	// by name and is best-effort: a backend that can't honor it is logged, not
-	// fatal. A voice file (voice:) is a hard requirement once configured, so a
-	// negative rc fails Load.
-	if speakerName != "" {
-		if rc := CppTTSSetVoice(speakerName); rc != 0 {
-			fmt.Fprintf(os.Stderr, "crispasr: speaker %q not applied (rc=%d)\n", speakerName, rc)
-		}
-	}
-	if voicePath != "" {
-		if rc := CppTTSSetVoiceFile(voicePath, voiceRefText); rc < 0 {
-			return fmt.Errorf("crispasr: failed to load voice %q (rc=%d)", voicePath, rc)
-		}
-		fmt.Fprintf(os.Stderr, "CrispASR voice loaded: %s\n", voicePath)
-	}
-
-	fmt.Fprintf(os.Stderr, "CrispASR backend selected: %s\n", CppGetBackend())
-
-	return nil
-}
-
-func (w *CrispASR) VAD(req *pb.VADRequest) (pb.VADResponse, error) {
-	audio := req.Audio
-	// We expect 0xdeadbeef to be overwritten and if we see it in a stack trace we know it wasn't
-	segsPtr, segsLen := uintptr(0xdeadbeef), uintptr(0xdeadbeef)
-	segsPtrPtr, segsLenPtr := unsafe.Pointer(&segsPtr), unsafe.Pointer(&segsLen)
-
-	if ret := CppVAD(audio, uintptr(len(audio)), segsPtrPtr, segsLenPtr); ret != 0 {
-		return pb.VADResponse{}, fmt.Errorf("crispasr: VAD failed")
-	}
-
-	// Happens when CPP vector has not had any elements pushed to it
-	if segsPtr == 0 {
-		return pb.VADResponse{
-			Segments: []*pb.VADSegment{},
-		}, nil
-	}
-
-	// unsafeptr warning is caused by segsPtr being on the stack and therefore being subject to stack copying AFAICT
-	// however the stack shouldn't have grown between setting segsPtr and now, also the memory pointed to is allocated by C++
-	segs := unsafe.Slice((*float32)(unsafe.Pointer(segsPtr)), segsLen) //nolint:govet // segsPtr addresses C++-owned heap memory passed back through the cgo-free purego boundary; the uintptr->Pointer round-trip is intentional and the buffer outlives this read.
-
-	vadSegments := []*pb.VADSegment{}
-	for i := range len(segs) >> 1 {
-		s := segs[2*i] / 100
-		t := segs[2*i+1] / 100
-		vadSegments = append(vadSegments, &pb.VADSegment{
-			Start: s,
-			End:   t,
-		})
-	}
-
-	return pb.VADResponse{
-		Segments: vadSegments,
-	}, nil
-}
-
-func (w *CrispASR) AudioTranscription(ctx context.Context, opts *pb.TranscriptRequest) (pb.TranscriptResult, error) {
-	if err := ctx.Err(); err != nil {
-		return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
-	}
-
-	dir, err := os.MkdirTemp("", "crispasr")
-	if err != nil {
-		return pb.TranscriptResult{}, err
-	}
-	defer func() { _ = os.RemoveAll(dir) }()
-
-	convertedPath := filepath.Join(dir, "converted.wav")
-
-	if err := utils.AudioToWav(opts.Dst, convertedPath); err != nil {
-		return pb.TranscriptResult{}, err
-	}
-
-	fh, err := os.Open(convertedPath)
-	if err != nil {
-		return pb.TranscriptResult{}, err
-	}
-	defer func() { _ = fh.Close() }()
-
-	d := wav.NewDecoder(fh)
-	buf, err := d.FullPCMBuffer()
-	if err != nil {
-		return pb.TranscriptResult{}, err
-	}
-
-	data := buf.AsFloat32Buffer().Data
-	var duration float32
-	if buf.Format != nil && buf.Format.SampleRate > 0 {
-		duration = float32(len(data)) / float32(buf.Format.SampleRate)
-	}
-	segsLen := uintptr(0xdeadbeef)
-	segsLenPtr := unsafe.Pointer(&segsLen)
-
-	// Watcher: flips the C-side abort flag when ctx is cancelled. The
-	// goroutine is joined synchronously (close(done) signals it to exit,
-	// wg.Wait() blocks until it has) so a late CppSetAbort(1) cannot fire
-	// after the function returns and corrupt the next transcription call.
-	done := make(chan struct{})
-	var wg sync.WaitGroup
-	wg.Add(1)
-	go func() {
-		defer wg.Done()
-		select {
-		case <-ctx.Done():
-			CppSetAbort(1)
-		case <-done:
-		}
-	}()
-	defer func() {
-		close(done)
-		wg.Wait()
-	}()
-
-	ret := CppTranscribe(opts.Threads, opts.Language, opts.Translate, opts.Diarize, data, uintptr(len(data)), segsLenPtr, opts.Prompt)
-	if ret == 2 {
-		return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
-	}
-	if ret != 0 {
-		return pb.TranscriptResult{}, fmt.Errorf("crispasr: transcription failed")
-	}
-
-	segments := []*pb.TranscriptSegment{}
-	text := ""
-	for i := range int(segsLen) {
-		// segment start/end conversion factor taken from https://github.com/ggml-org/whisper.cpp/blob/master/examples/cli/cli.cpp#L895
-		s := CppGetSegmentStart(i) * (10000000)
-		t := CppGetSegmentEnd(i) * (10000000)
-		// The session result can emit bytes that aren't valid UTF-8 (e.g. a
-		// multibyte codepoint split across token boundaries); protobuf string
-		// fields reject those at marshal time. Scrub before the value escapes
-		// cgo. The session result is segment+word based and exposes no token
-		// IDs, so Tokens is left empty.
-		txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "\uFFFD")
-
-		segment := &pb.TranscriptSegment{
-			Id:    int32(i),
-			Text:  txt,
-			Start: s, End: t,
-		}
-
-		segments = append(segments, segment)
-
-		text += " " + strings.TrimSpace(txt)
-	}
-
-	return pb.TranscriptResult{
-		Segments: segments,
-		Text:     strings.TrimSpace(text),
-		Language: opts.Language,
-		Duration: duration,
-	}, nil
-}
-
-// AudioTranscriptionStream runs the session transcribe to completion and then
-// emits one delta per non-empty segment, followed by a final TranscriptResult.
-// Progressive/real-time streaming isn't available via the session API (there
-// is no per-decode callback), so deltas are emitted per-segment after the
-// blocking decode returns rather than as segments are produced. The offline
-// AudioTranscription is unchanged; both paths share the session and the
-// SingleThread concurrency model.
-func (w *CrispASR) AudioTranscriptionStream(ctx context.Context, opts *pb.TranscriptRequest, results chan *pb.TranscriptStreamResponse) error {
-	defer close(results)
-
-	if err := ctx.Err(); err != nil {
-		return status.Error(codes.Canceled, "transcription cancelled")
-	}
-
-	dir, err := os.MkdirTemp("", "crispasr")
-	if err != nil {
-		return err
-	}
-	defer func() { _ = os.RemoveAll(dir) }()
-
-	convertedPath := filepath.Join(dir, "converted.wav")
-	if err := utils.AudioToWav(opts.Dst, convertedPath); err != nil {
-		return err
-	}
-
-	fh, err := os.Open(convertedPath)
-	if err != nil {
-		return err
-	}
-	defer func() { _ = fh.Close() }()
-
-	d := wav.NewDecoder(fh)
-	buf, err := d.FullPCMBuffer()
-	if err != nil {
-		return err
-	}
-	data := buf.AsFloat32Buffer().Data
-	var duration float32
-	if buf.Format != nil && buf.Format.SampleRate > 0 {
-		duration = float32(len(data)) / float32(buf.Format.SampleRate)
-	}
-
-	// Same abort-watcher pattern as AudioTranscription. Joined synchronously
-	// so a late CppSetAbort(1) cannot fire after this function returns.
-	// Best-effort only: the session transcribe is blocking with no abort hook.
-	done := make(chan struct{})
-	var wg sync.WaitGroup
-	wg.Add(1)
-	go func() {
-		defer wg.Done()
-		select {
-		case <-ctx.Done():
-			CppSetAbort(1)
-		case <-done:
-		}
-	}()
-	defer func() {
-		close(done)
-		wg.Wait()
-	}()
-
-	segsLen := uintptr(0xdeadbeef)
-	segsLenPtr := unsafe.Pointer(&segsLen)
-	ret := CppTranscribe(opts.Threads, opts.Language, opts.Translate, opts.Diarize, data, uintptr(len(data)), segsLenPtr, opts.Prompt)
-	if ret == 2 {
-		return status.Error(codes.Canceled, "transcription cancelled")
-	}
-	if ret != 0 {
-		return fmt.Errorf("crispasr: transcription failed")
-	}
-
-	// Walk the segments once: emit a delta per non-empty segment and build the
-	// final TranscriptResult.Segments alongside. The first delta has no leading
-	// space and subsequent ones are prefixed with a single space, so
-	// concat(deltas) == final.Text exactly, matching the e2e contract.
-	segments := []*pb.TranscriptSegment{}
-	var assembled strings.Builder
-	for i := range int(segsLen) {
-		s := CppGetSegmentStart(i) * 10000000
-		t := CppGetSegmentEnd(i) * 10000000
-		txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "\uFFFD")
-		segments = append(segments, &pb.TranscriptSegment{
-			Id:    int32(i),
-			Text:  txt,
-			Start: s, End: t,
-		})
-
-		trimmed := strings.TrimSpace(txt)
-		if trimmed == "" {
-			continue
-		}
-		var delta string
-		if assembled.Len() == 0 {
-			delta = trimmed
-		} else {
-			delta = " " + trimmed
-		}
-		results <- &pb.TranscriptStreamResponse{Delta: delta}
-		assembled.WriteString(delta)
-	}
-
-	final := &pb.TranscriptResult{
-		Segments: segments,
-		Text:     assembled.String(),
-		Language: opts.Language,
-		Duration: duration,
-	}
-	results <- &pb.TranscriptStreamResponse{FinalResult: final}
-	return nil
-}
-
-// synthesize returns 24 kHz mono float32 PCM for text via the open session.
-func (w *CrispASR) synthesize(text string) ([]float32, error) {
-	if text == "" {
-		return nil, fmt.Errorf("crispasr: TTS requires non-empty text")
-	}
-	var n int32
-	ptr := CppTTSSynthesize(text, unsafe.Pointer(&n))
-	if ptr == 0 || n <= 0 {
-		return nil, fmt.Errorf("crispasr: synthesis failed (the loaded model may not be a supported TTS backend, or needs extra config e.g. orpheus SNAC codec)")
-	}
-	defer CppTTSFree(ptr)
-	src := unsafe.Slice((*float32)(unsafe.Pointer(ptr)), int(n)) //nolint:govet // ptr addresses C-allocated PCM returned across the purego boundary; copied out immediately below, before tts_free.
-	out := make([]float32, int(n)) // copy out of C memory before free
-	copy(out, src)
-	return out, nil
-}
-
-// setVoice applies a per-call speaker/voice override (best effort). CrispASR
-// returns a negative code when the active backend can't honor the name; we log
-// it rather than fail, so an unknown voice falls back to the default speaker.
-func setVoice(voice string) {
-	v := strings.TrimSpace(voice)
-	if v == "" {
-		return
-	}
-	if rc := CppTTSSetVoice(v); rc != 0 {
-		fmt.Fprintf(os.Stderr, "crispasr: voice %q not applied by the active TTS backend (rc=%d); using default\n", v, rc)
-	}
-}
-
-func (w *CrispASR) TTS(req *pb.TTSRequest) error {
-	if req.Dst == "" {
-		return fmt.Errorf("crispasr: TTS requires a destination path")
-	}
-	setVoice(req.Voice)
-	pcm, err := w.synthesize(req.Text)
-	if err != nil {
-		return err
-	}
-	return writeWAV24k(req.Dst, pcm)
-}
-
-// TTSStream is the streaming counterpart to TTS. CrispASR has no progressive
-// (native streaming) synth, so we synthesize the whole utterance, encode it to
-// a 24 kHz WAV, and emit the encoded bytes as a single chunk. The gRPC server
-// wrapper (pkg/grpc/server.go:TTSStream) ranges over the channel until it is
-// closed, so this method owns the close - mirrors vibevoice-cpp's TTSStream.
-func (w *CrispASR) TTSStream(req *pb.TTSRequest, results chan []byte) error {
-	defer close(results)
-
-	if req.Text == "" {
-		return fmt.Errorf("crispasr: TTSStream requires text")
-	}
-	setVoice(req.Voice)
-	pcm, err := w.synthesize(req.Text)
-	if err != nil {
-		return err
-	}
-
-	tmp, err := os.CreateTemp("", "crispasr-tts-stream-*.wav")
-	if err != nil {
-		return fmt.Errorf("crispasr: tempfile: %w", err)
-	}
-	dst := tmp.Name()
-	if err := tmp.Close(); err != nil {
-		return fmt.Errorf("crispasr: close tempfile: %w", err)
-	}
-	defer func() { _ = os.Remove(dst) }()
-
-	if err := writeWAV24k(dst, pcm); err != nil {
-		return err
-	}
-
-	encoded, err := os.ReadFile(dst)
-	if err != nil {
-		return fmt.Errorf("crispasr: read tempfile: %w", err)
-	}
-	results <- encoded
-	return nil
-}
-
-// writeWAV24k writes pcm as a 24000 Hz, mono, 16-bit PCM WAV at dst.
-func writeWAV24k(dst string, pcm []float32) error {
-	f, err := os.Create(dst)
-	if err != nil {
-		return fmt.Errorf("crispasr: create %q: %w", dst, err)
-	}
-
-	enc := wav.NewEncoder(f, 24000, 16, 1, 1)
-	ints := make([]int, len(pcm))
-	for i, s := range pcm {
-		if s > 1 {
-			s = 1
-		} else if s < -1 {
-			s = -1
-		}
-		ints[i] = int(s * 32767)
-	}
-	buf := &audio.IntBuffer{
-		Format:         &audio.Format{NumChannels: 1, SampleRate: 24000},
-		Data:           ints,
-		SourceBitDepth: 16,
-	}
-	if err := enc.Write(buf); err != nil {
-		_ = enc.Close()
-		_ = f.Close()
-		return fmt.Errorf("crispasr: encode WAV: %w", err)
-	}
-	if err := enc.Close(); err != nil {
-		_ = f.Close()
-		return fmt.Errorf("crispasr: finalize WAV: %w", err)
-	}
-	if err := f.Close(); err != nil {
-		return fmt.Errorf("crispasr: close %q: %w", dst, err)
-	}
-	return nil
-}
--- a/backend/go/crispasr/gocrispasr_test.go
+++ b/backend/go/crispasr/gocrispasr_test.go
@@ -1,193 +0,0 @@
-package main
-
-import (
-	"context"
-	"os"
-	"path/filepath"
-	"strings"
-	"sync"
-	"testing"
-
-	"github.com/ebitengine/purego"
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-	"google.golang.org/grpc/codes"
-	"google.golang.org/grpc/status"
-)
-
-func TestCrispASR(t *testing.T) {
-	RegisterFailHandler(Fail)
-	RunSpecs(t, "CrispASR Backend Suite")
-}
-
-var (
-	libLoadOnce sync.Once
-	libLoadErr  error
-)
-
-// ensureLibLoaded mirrors main.go's bootstrap so a Go test can drive the
-// bridge without spinning up the gRPC server. Skips the current spec when the
-// shared library isn't present (e.g. running before `make backends/whisper`).
-func ensureLibLoaded() {
-	libLoadOnce.Do(func() {
-		libName := os.Getenv("CRISPASR_LIBRARY")
-		if libName == "" {
-			libName = "./libgocrispasr-fallback.so"
-		}
-		if _, err := os.Stat(libName); err != nil {
-			libLoadErr = err
-			return
-		}
-		gosd, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
-		if err != nil {
-			libLoadErr = err
-			return
-		}
-		purego.RegisterLibFunc(&CppLoadModel, gosd, "load_model")
-		purego.RegisterLibFunc(&CppSetCodecPath, gosd, "set_codec_path")
-		purego.RegisterLibFunc(&CppTranscribe, gosd, "transcribe")
-		purego.RegisterLibFunc(&CppGetSegmentText, gosd, "get_segment_text")
-		purego.RegisterLibFunc(&CppGetSegmentStart, gosd, "get_segment_t0")
-		purego.RegisterLibFunc(&CppGetSegmentEnd, gosd, "get_segment_t1")
-		purego.RegisterLibFunc(&CppGetBackend, gosd, "get_backend")
-		purego.RegisterLibFunc(&CppSetAbort, gosd, "set_abort")
-		purego.RegisterLibFunc(&CppTTSSynthesize, gosd, "tts_synthesize")
-		purego.RegisterLibFunc(&CppTTSFree, gosd, "tts_free")
-		purego.RegisterLibFunc(&CppTTSSetVoice, gosd, "tts_set_voice")
-		purego.RegisterLibFunc(&CppTTSSetVoiceFile, gosd, "tts_set_voice_file")
-	})
-	if libLoadErr != nil {
-		Skip("crispasr library not loadable: " + libLoadErr.Error())
-	}
-}
-
-// fixturesOrSkip returns the model + audio paths or skips the spec if either
-// env var is unset. The test never runs in default CI — it requires a real
-// whisper model and a long audio file (~3 minutes) on disk.
-func fixturesOrSkip() (string, string) {
-	modelPath := os.Getenv("CRISPASR_MODEL_PATH")
-	audioPath := os.Getenv("CRISPASR_AUDIO_PATH")
-	if modelPath == "" || audioPath == "" {
-		Skip("set CRISPASR_MODEL_PATH and CRISPASR_AUDIO_PATH to run this spec")
-	}
-	return modelPath, audioPath
-}
-
-// ttsModelOrSkip returns the TTS model path or skips the spec when the env var
-// is unset. Like the transcription fixtures, this never runs in default CI — it
-// needs a real TTS model (e.g. a vibevoice GGUF) on disk.
-func ttsModelOrSkip() string {
-	modelPath := os.Getenv("CRISPASR_TTS_MODEL_PATH")
-	if modelPath == "" {
-		Skip("set CRISPASR_TTS_MODEL_PATH to run this spec")
-	}
-	return modelPath
-}
-
-var _ = Describe("CrispASR", func() {
-	Context("AudioTranscription cancellation", func() {
-		It("returns codes.Canceled on a pre-cancelled context and still succeeds afterwards", func() {
-			modelPath, audioPath := fixturesOrSkip()
-			ensureLibLoaded()
-
-			w := &CrispASR{}
-			Expect(w.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
-
-			// The session transcribe is blocking and exposes no abort hook, so
-			// a mid-decode cancel can't interrupt it. The contract we can rely
-			// on is the pre-call ctx.Err() check: a context cancelled before
-			// the call must yield codes.Canceled without starting a decode.
-			ctx, cancel := context.WithCancel(context.Background())
-			cancel()
-
-			_, err := w.AudioTranscription(ctx, &pb.TranscriptRequest{
-				Dst:      audioPath,
-				Threads:  4,
-				Language: "en",
-			})
-			Expect(err).To(HaveOccurred(), "expected pre-cancelled context to fail")
-			st, ok := status.FromError(err)
-			Expect(ok).To(BeTrue(), "expected gRPC status error, got %v", err)
-			Expect(st.Code()).To(Equal(codes.Canceled), "expected codes.Canceled, got %v", err)
-
-			// Subsequent transcription must succeed — proves g_abort reset.
-			res, err := w.AudioTranscription(context.Background(), &pb.TranscriptRequest{
-				Dst:      audioPath,
-				Threads:  4,
-				Language: "en",
-			})
-			Expect(err).ToNot(HaveOccurred(), "post-cancel transcription failed")
-			Expect(res.Text).ToNot(BeEmpty(), "post-cancel transcription returned empty text")
-		})
-	})
-
-	Context("AudioTranscriptionStream", func() {
-		It("emits one delta per non-empty segment for a multi-segment clip", func() {
-			modelPath, audioPath := fixturesOrSkip()
-			ensureLibLoaded()
-
-			w := &CrispASR{}
-			Expect(w.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
-
-			results := make(chan *pb.TranscriptStreamResponse, 64)
-			done := make(chan error, 1)
-			go func() {
-				done <- w.AudioTranscriptionStream(context.Background(), &pb.TranscriptRequest{
-					Dst:      audioPath,
-					Threads:  4,
-					Language: "en",
-					Stream:   true,
-				}, results)
-			}()
-
-			var deltas []string
-			var assembled strings.Builder
-			var finalText string
-			var finalSegmentCount int
-			for chunk := range results {
-				if d := chunk.GetDelta(); d != "" {
-					deltas = append(deltas, d)
-					assembled.WriteString(d)
-				}
-				if final := chunk.GetFinalResult(); final != nil {
-					finalText = final.GetText()
-					finalSegmentCount = len(final.GetSegments())
-				}
-			}
-			Expect(<-done).ToNot(HaveOccurred())
-
-			// One delta per non-empty segment is emitted after the blocking
-			// decode returns (the session API has no per-decode callback), so a
-			// multi-segment clip MUST produce >=2 delta events, and
-			// concat(deltas) MUST equal final.Text exactly.
-			Expect(len(deltas)).To(BeNumerically(">=", 2),
-				"expected multiple deltas from a multi-segment clip, got %d (assembled=%q)",
-				len(deltas), assembled.String())
-			Expect(finalSegmentCount).To(BeNumerically(">=", 2),
-				"expected final to carry multiple segments")
-			Expect(assembled.String()).To(Equal(finalText),
-				"concat(deltas) must equal final.Text")
-		})
-	})
-
-	Context("TTS", func() {
-		It("synthesizes a non-empty WAV", func() {
-			ttsModel := ttsModelOrSkip()
-			ensureLibLoaded()
-
-			w := &CrispASR{}
-			Expect(w.Load(&pb.ModelOptions{ModelFile: ttsModel})).To(Succeed())
-
-			dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
-			Expect(w.TTS(&pb.TTSRequest{Text: "Hello from CrispASR.", Dst: dst})).To(Succeed())
-
-			info, err := os.Stat(dst)
-			Expect(err).ToNot(HaveOccurred(), "synthesized WAV should exist at %q", dst)
-			// A real 24 kHz mono WAV is a 44-byte header plus samples; a file of
-			// 1 KiB or less would mean an empty/failed synth.
-			Expect(info.Size()).To(BeNumerically(">", 1024),
-				"expected a non-trivial WAV, got %d bytes", info.Size())
-		})
-	})
-})
--- a/backend/go/crispasr/main.go
+++ b/backend/go/crispasr/main.go
@@ -1,58 +0,0 @@
-package main
-
-// Note: this is started internally by LocalAI and a server is allocated for each model
-import (
-	"flag"
-	"os"
-
-	"github.com/ebitengine/purego"
-	grpc "github.com/mudler/LocalAI/pkg/grpc"
-)
-
-var (
-	addr = flag.String("addr", "localhost:50051", "the address to connect to")
-)
-
-type LibFuncs struct {
-	FuncPtr any
-	Name    string
-}
-
-func main() {
-	libName := os.Getenv("CRISPASR_LIBRARY")
-	if libName == "" {
-		libName = "./libgocrispasr-fallback.so"
-	}
-
-	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
-	if err != nil {
-		panic(err)
-	}
-
-	libFuncs := []LibFuncs{
-		{&CppLoadModel, "load_model"},
-		{&CppSetCodecPath, "set_codec_path"},
-		{&CppLoadModelVAD, "load_model_vad"},
-		{&CppVAD, "vad"},
-		{&CppTranscribe, "transcribe"},
-		{&CppGetSegmentText, "get_segment_text"},
-		{&CppGetSegmentStart, "get_segment_t0"},
-		{&CppGetSegmentEnd, "get_segment_t1"},
-		{&CppGetBackend, "get_backend"},
-		{&CppSetAbort, "set_abort"},
-		{&CppTTSSynthesize, "tts_synthesize"},
-		{&CppTTSFree, "tts_free"},
-		{&CppTTSSetVoice, "tts_set_voice"},
-		{&CppTTSSetVoiceFile, "tts_set_voice_file"},
-	}
-
-	for _, lf := range libFuncs {
-		purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
-	}
-
-	flag.Parse()
-
-	if err := grpc.StartServer(*addr, &CrispASR{}); err != nil {
-		panic(err)
-	}
-}
--- a/backend/go/crispasr/package.sh
+++ b/backend/go/crispasr/package.sh
@@ -1,65 +0,0 @@
-#!/bin/bash
-
-# Script to copy the appropriate libraries based on architecture
-# This script is used in the final stage of the Dockerfile
-
-set -e
-
-CURDIR=$(dirname "$(realpath "$0")")
-REPO_ROOT="${CURDIR}/../../.."
-
-# Create lib directory
-mkdir -p $CURDIR/package/lib
-
-cp -avf $CURDIR/crispasr $CURDIR/package/
-cp -fv $CURDIR/libgocrispasr-*.so $CURDIR/package/
-cp -fv $CURDIR/run.sh $CURDIR/package/
-
-# Detect architecture and copy appropriate libraries
-if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
-    # x86_64 architecture
-    echo "Detected x86_64 architecture, copying x86_64 libraries..."
-    cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
-    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
-    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
-    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
-    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
-    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
-    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
-    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
-    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
-    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
-    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
-elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
-    # ARM64 architecture
-    echo "Detected ARM64 architecture, copying ARM64 libraries..."
-    cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
-    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
-    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
-    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
-    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
-    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
-    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
-    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
-    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
-    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
-    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
-elif [ "$(uname -s)" = "Darwin" ]; then
-    echo "Detected Darwin"
-else
-    echo "Error: Could not detect architecture"
-    exit 1
-fi
-
-# Package GPU libraries based on BUILD_TYPE
-# The GPU library packaging script will detect BUILD_TYPE and copy appropriate GPU libraries
-GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
-if [ -f "$GPU_LIB_SCRIPT" ]; then
-    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
-    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
-    package_gpu_libs
-fi
-
-echo "Packaging completed successfully"
-ls -liah $CURDIR/package/
-ls -liah $CURDIR/package/lib/
--- a/backend/go/crispasr/run.sh
+++ b/backend/go/crispasr/run.sh
@@ -1,52 +0,0 @@
-#!/bin/bash
-set -ex
-
-# Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath "$0")")
-
-cd /
-
-echo "CPU info:"
-if [ "$(uname)" != "Darwin" ]; then
-	grep -e "model\sname" /proc/cpuinfo | head -1
-	grep -e "flags" /proc/cpuinfo | head -1
-fi
-
-LIBRARY="$CURDIR/libgocrispasr-fallback.so"
-
-if [ "$(uname)" != "Darwin" ]; then
-	if grep -q -e "\savx\s" /proc/cpuinfo ; then
-		echo "CPU:    AVX    found OK"
-		if [ -e $CURDIR/libgocrispasr-avx.so ]; then
-			LIBRARY="$CURDIR/libgocrispasr-avx.so"
-		fi
-	fi
-
-	if grep -q -e "\savx2\s" /proc/cpuinfo ; then
-		echo "CPU:    AVX2   found OK"
-		if [ -e $CURDIR/libgocrispasr-avx2.so ]; then
-			LIBRARY="$CURDIR/libgocrispasr-avx2.so"
-		fi
-	fi
-
-	# Check avx 512
-	if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
-		echo "CPU:    AVX512F found OK"
-		if [ -e $CURDIR/libgocrispasr-avx512.so ]; then
-			LIBRARY="$CURDIR/libgocrispasr-avx512.so"
-		fi
-	fi
-fi
-
-export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
-export CRISPASR_LIBRARY=$LIBRARY
-
-# If there is a lib/ld.so, use it
-if [ -f $CURDIR/lib/ld.so ]; then
-	echo "Using lib/ld.so"
-	echo "Using library: $LIBRARY"
-	exec $CURDIR/lib/ld.so $CURDIR/crispasr "$@"
-fi
-
-echo "Using library: $LIBRARY"
-exec $CURDIR/crispasr "$@"
--- a/backend/go/dllm/.gitignore
+++ b/backend/go/dllm/.gitignore
@@ -1,10 +0,0 @@
-.cache/
-sources/
-build/
-package/
-dllm-grpc
-# build artifacts staged in-tree by the Makefile (cp from sources/) or
-# symlinked for local dev; the real sources live in dllm.cpp upstream.
-*.so
-*.so.*
-compile_commands.json
--- a/backend/go/dllm/Makefile
+++ b/backend/go/dllm/Makefile
@@ -1,93 +0,0 @@
-# dllm backend Makefile.
-#
-# Upstream pin lives below as DLLM_VERSION?=<sha> so .github/bump_deps.sh
-# can find and update it - matches the whisper.cpp / parakeet-cpp / ds4
-# convention.
-#
-# Local dev shortcut: if you already have an out-of-tree dllm.cpp build,
-# you can symlink the .so into this directory and skip the clone/cmake
-# steps entirely, e.g.:
-#
-#   ln -sf /path/to/dllm.cpp/build/libdllm.so .
-#   go build -o dllm-grpc .
-#
-# That's what the gated C-ABI binding smoke uses (DLLM_TEST_LIBRARY). The
-# default target below does the proper clone-at-pin + cmake build so CI
-# doesn't need a side-checkout.
-#
-# NOTE: github.com/mudler/dllm.cpp is still private (publishing is planned);
-# until then the anonymous clone below fails. Use the symlink shortcut above
-# with a local checkout, or a git credential helper with access to the repo.
-
-DLLM_VERSION?=b22fcebebfb225131113188599a9ae542b2935d7
-DLLM_REPO?=https://github.com/mudler/dllm.cpp
-
-GOCMD?=go
-GO_TAGS?=
-JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
-
-BUILD_TYPE?=
-NATIVE?=false
-
-# libdllm.so is self-contained: dllm.cpp's CMakeLists statically absorbs ggml
-# (BUILD_SHARED_LIBS=OFF + PIC) into the shared lib, so dlopen needs no
-# libggml*.so alongside it, only system libs (libstdc++/libgomp/libc) the
-# runtime image already provides. Tests/CLI are upstream-only concerns.
-CMAKE_ARGS?=-DCMAKE_BUILD_TYPE=Release -DDLLM_BUILD_TESTS=OFF
-
-ifeq ($(NATIVE),false)
-	CMAKE_ARGS+=-DGGML_NATIVE=OFF
-endif
-
-# Same arch set the sibling ggml backends (acestep/vibevoice/qwen3-tts) bake
-# for their cublas images; override for a native build.
-CUDA_ARCHITECTURES?=75-virtual;80-virtual;86-real;89-real
-
-# dllm.cpp gates CUDA behind DLLM_CUDA (set(GGML_CUDA ... CACHE FORCE)), so
-# forward that instead of a bare -DGGML_CUDA=ON.
-ifeq ($(BUILD_TYPE),cublas)
-	CMAKE_ARGS+=-DDLLM_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$(CUDA_ARCHITECTURES)"
-endif
-
-.PHONY: dllm-grpc package build clean purge test all
-
-all: dllm-grpc
-
-# Clone the upstream dllm.cpp source at the pinned commit (ggml comes in as
-# a submodule). Directory acts as the target so make only re-clones when
-# missing. After a DLLM_VERSION bump, run 'make purge && make' to refetch.
-sources/dllm.cpp:
-	mkdir -p sources/dllm.cpp
-	cd sources/dllm.cpp && \
-	git init -q && \
-	git remote add origin $(DLLM_REPO) && \
-	git fetch --depth 1 origin $(DLLM_VERSION) && \
-	git checkout FETCH_HEAD && \
-	git submodule update --init --recursive --depth 1 --single-branch
-
-# Build the shared lib out-of-tree, then stage it next to the Go sources so
-# purego.Dlopen("libdllm.so") and the packaging step both pick it up.
-libdllm.so: sources/dllm.cpp
-	cmake -B sources/dllm.cpp/build -S sources/dllm.cpp $(CMAKE_ARGS)
-	cmake --build sources/dllm.cpp/build --config Release -j$(JOBS)
-	cp -fv sources/dllm.cpp/build/libdllm.so ./
-
-dllm-grpc: libdllm.so main.go capi.go
-	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o dllm-grpc .
-
-package: dllm-grpc
-	bash package.sh
-
-build: package
-
-# Test target. The C-ABI binding smoke is gated on DLLM_TEST_LIBRARY +
-# DLLM_TEST_TINY_MODEL; without them the gated specs auto-skip and only the
-# pure-Go helper specs run.
-test:
-	LD_LIBRARY_PATH=$(CURDIR):$$LD_LIBRARY_PATH $(GOCMD) test ./... -count=1
-
-clean: purge
-	rm -rf libdllm.so* package dllm-grpc
-
-purge:
-	rm -rf sources/dllm.cpp
--- a/backend/go/dllm/capi.go
+++ b/backend/go/dllm/capi.go
@@ -1,256 +0,0 @@
-package main
-
-// Typed Go wrappers over dllm.cpp's flat C-ABI (include/dllm_capi.h, ABI v1).
-//
-// Contract highlights the wrappers encode (see the header + src/capi.cpp):
-//   - tokenize_json/generate return malloc'd char* the CALLER owns: bound as
-//     uintptr, copied with goStringFromCPtr, released via dllm_capi_free_string.
-//   - last_error returns a BORROWED pointer (valid until the next call on the
-//     same ctx): bound as a plain string (purego copies), never freed, and only
-//     read AFTER the failing call has returned - reading it while a generate is
-//     in flight on the same ctx violates the per-ctx serialization contract.
-//   - All entry points except dllm_capi_cancel must be externally serialized
-//     per ctx (one ctx = one concurrent generate/tokenize). Cancel only flips
-//     an atomic and may be called from any goroutine mid-generate.
-//   - No C++ exception crosses the boundary; failures land in last_error.
-
-import (
-	"encoding/json"
-	"fmt"
-	"sync"
-	"sync/atomic"
-	"unsafe"
-
-	"github.com/ebitengine/purego"
-)
-
-// dllmABIVersion is the DLLM_CAPI_ABI_VERSION this binding was written
-// against; main.go refuses to start against a libdllm.so reporting another.
-const dllmABIVersion = 1
-
-// purego-bound entry points from libdllm.so. Names match dllm_capi.h
-// exactly; loadCAPI (main.go) fills these in at boot.
-var (
-	cppAbiVersion func() int32
-	cppLoad       func(ggufPath, paramsJSON string) uintptr
-	cppFree       func(ctx uintptr)
-	cppLastError  func(ctx uintptr) string // borrowed pointer: purego copies, do NOT free
-	cppFreeString func(s uintptr)
-	// malloc'd char* returns, hence uintptr (see loadCAPI's doc comment).
-	cppTokenizeJSON func(ctx uintptr, text string) uintptr
-	cppGenerate     func(ctx uintptr, prompt, optsJSON string) uintptr
-	// on_block/on_step are C function pointers produced by purego.NewCallback;
-	// userData carries the streamCallStates registry key.
-	cppGenerateStream func(ctx uintptr, prompt, optsJSON string, onBlock, onStep, userData uintptr) int32
-	cppCancel         func(ctx uintptr)
-)
-
-// cAbiVersion returns the library's DLLM_CAPI_ABI_VERSION.
-func cAbiVersion() int32 {
-	return cppAbiVersion()
-}
-
-// cLoad opens the GGUF at path with the flat params JSON (e.g.
-// {"n_gpu_layers":99}). Returns 0 on failure; per the header contract there
-// is no ctx to carry the reason, the C side logs it to stderr (and
-// cLastError(0) only yields the static NULL-ctx message).
-func cLoad(path, paramsJSON string) uintptr {
-	return cppLoad(path, paramsJSON)
-}
-
-// cFree releases a ctx; safe on 0 (delete nullptr).
-func cFree(h uintptr) {
-	cppFree(h)
-}
-
-// cLastError returns the ctx's last error message (or the static NULL-ctx
-// message for h==0). The C pointer is borrowed and only valid until the next
-// call on the same ctx; purego's string return copies it immediately, so the
-// returned Go string is safe to keep. Must not be called while another call
-// on the same ctx is in flight.
-func cLastError(h uintptr) string {
-	return cppLastError(h)
-}
-
-// lastErrorOr is cLastError with a fallback for the empty-message case, so
-// wrapped errors never end in ": ".
-func lastErrorOr(h uintptr, fallback string) string {
-	if msg := cLastError(h); msg != "" {
-		return msg
-	}
-	return fallback
-}
-
-// cTokenizeJSON tokenizes text (the C side prepends bos per vocab.add_bos)
-// and returns the token ids as a JSON array string, e.g. "[2,18]".
-func cTokenizeJSON(h uintptr, text string) (string, error) {
-	ret := cppTokenizeJSON(h, text)
-	if ret == 0 {
-		return "", fmt.Errorf("dllm: tokenize failed: %s", lastErrorOr(h, "unknown error"))
-	}
-	out := goStringFromCPtr(ret)
-	cppFreeString(ret)
-	return out, nil
-}
-
-// cGenerate runs a blocking generation and returns the detokenized text.
-// optsJSON must be a FLAT JSON object of scalars (use buildOptsJSON); the C
-// parser rejects nested objects/arrays. NULL return -> last_error (read only
-// after the call returned, per the serialization contract); a cancelled call
-// surfaces as the "cancelled" message.
-func cGenerate(h uintptr, prompt, optsJSON string) (string, error) {
-	ret := cppGenerate(h, prompt, optsJSON)
-	if ret == 0 {
-		return "", fmt.Errorf("dllm: generate failed: %s", lastErrorOr(h, "unknown error"))
-	}
-	out := goStringFromCPtr(ret)
-	cppFreeString(ret)
-	return out, nil
-}
-
-// streamCallState carries the Go callbacks for one in-flight
-// cGenerateStream call; the registry key travels through C as user_data.
-// The map shape mirrors the whisper backend's streamCallStates: only one
-// entry per ctx is ever live (the C-ABI is serialized per ctx), but keying
-// by call survives multiple models/processes sharing the package.
-type streamCallState struct {
-	onBlock func(text string)
-	onStep  func(step, total int, preview string)
-}
-
-var (
-	streamCallStates sync.Map // uint64 -> *streamCallState
-	streamCallSeq    atomic.Uint64
-
-	// purego.NewCallback allocates a finite, never-released callback slot, so
-	// the two trampolines are created exactly once and reused across calls.
-	streamCbOnce sync.Once
-	blockCbPtr   uintptr
-	stepCbPtr    uintptr
-)
-
-// onBlockTrampoline is the Go side of dllm_block_cb. It runs on the C
-// calling thread, mid-generate: keep it tiny and non-blocking (callers that
-// bridge to goroutines must hand off via buffered channels). The text
-// pointer is only valid for the duration of the invocation, so it is copied
-// to a Go string immediately.
-func onBlockTrampoline(text uintptr, userData uintptr) {
-	v, ok := streamCallStates.Load(uint64(userData))
-	if !ok {
-		return // call already torn down
-	}
-	state := v.(*streamCallState)
-	if state.onBlock != nil {
-		state.onBlock(goStringFromCPtr(text))
-	}
-}
-
-// onStepTrampoline is the Go side of dllm_step_cb; same threading and
-// lifetime caveats as onBlockTrampoline.
-func onStepTrampoline(step int32, totalSteps int32, canvasPreview uintptr, userData uintptr) {
-	v, ok := streamCallStates.Load(uint64(userData))
-	if !ok {
-		return
-	}
-	state := v.(*streamCallState)
-	if state.onStep != nil {
-		state.onStep(int(step), int(totalSteps), goStringFromCPtr(canvasPreview))
-	}
-}
-
-// cGenerateStream runs a generation with per-committed-block (onBlock) and
-// per-denoising-step (onStep) callbacks; either may be nil. The callbacks
-// run on the C thread (see the trampoline docs). Returns an error carrying
-// last_error on failure; cancellation surfaces as the "cancelled" message.
-func cGenerateStream(h uintptr, prompt, optsJSON string, onBlock func(text string), onStep func(step, total int, preview string)) error {
-	streamCbOnce.Do(func() {
-		blockCbPtr = purego.NewCallback(onBlockTrampoline)
-		stepCbPtr = purego.NewCallback(onStepTrampoline)
-	})
-
-	id := streamCallSeq.Add(1)
-	streamCallStates.Store(id, &streamCallState{onBlock: onBlock, onStep: onStep})
-	defer streamCallStates.Delete(id)
-
-	// Pass NULL for absent callbacks so the C side skips the per-block /
-	// per-step detokenize work entirely.
-	var blockPtr, stepPtr uintptr
-	if onBlock != nil {
-		blockPtr = blockCbPtr
-	}
-	if onStep != nil {
-		stepPtr = stepCbPtr
-	}
-
-	if rc := cppGenerateStream(h, prompt, optsJSON, blockPtr, stepPtr, uintptr(id)); rc != 0 {
-		return fmt.Errorf("dllm: generate_stream failed: %s", lastErrorOr(h, "unknown error"))
-	}
-	return nil
-}
-
-// cCancel requests cancellation of the in-flight generate on h. This is the
-// ONE entry point safe to call from any goroutine while a generate runs (it
-// only flips an atomic). Note the cancel-reset race from the header: each
-// generate resets the flag on entry, so a watchdog should re-issue cancel if
-// the call has not returned.
-func cCancel(h uintptr) {
-	cppCancel(h)
-}
-
-// buildOptsJSON renders generation options as the flat JSON object the
-// C-ABI expects (known keys: n_predict, blocks, seed, eb_*, kv_cache). The
-// C-side scanner only understands scalar number/string values and rejects
-// nested objects/arrays loudly; bools are rejected here too because the
-// scanner has no concept of them. Fail loud rather than let an option be
-// silently misread.
-//
-// CAVEAT: json.Marshal HTML-escapes <, > and & inside string values (e.g.
-// "<" becomes the six-byte \u003c sequence). None of the known string-valued keys
-// (kv_cache: auto|on|off) can contain those bytes today; if one ever does,
-// switch to an Encoder with SetEscapeHTML(false) like gemma4JSONString.
-func buildOptsJSON(opts map[string]any) (string, error) {
-	if len(opts) == 0 {
-		return "{}", nil
-	}
-	for k, v := range opts {
-		switch v.(type) {
-		case string,
-			int, int8, int16, int32, int64,
-			uint, uint8, uint16, uint32, uint64,
-			float32, float64,
-			json.Number:
-			// scalar: fine
-		default:
-			return "", fmt.Errorf("dllm: opts key %q has non-scalar value %T (the C-ABI only accepts flat number/string scalars)", k, v)
-		}
-	}
-	b, err := json.Marshal(opts)
-	if err != nil {
-		return "", fmt.Errorf("dllm: marshal opts: %w", err)
-	}
-	return string(b), nil
-}
-
-// goStringFromCPtr copies a NUL-terminated C string into Go memory. cptr is
-// the raw pointer returned by purego from the C-ABI (a malloc'd buffer the
-// caller owns, or a callback argument only valid during the invocation);
-// owning callers must free it via cppFreeString after the copy lands.
-//
-// A direct unsafe.Pointer(cptr) conversion trips go vet's unsafeptr check,
-// which can't distinguish a C-owned heap pointer from Go-managed memory (the
-// parakeet-cpp and whisper backends tolerate that warning). Reinterpreting
-// through &cptr below is equivalent at runtime and keeps plain `go vet`
-// clean. It is safe either way: the pointer addresses C memory the Go GC
-// neither tracks nor moves, and we dereference it immediately to copy the
-// bytes out.
-func goStringFromCPtr(cptr uintptr) string {
-	if cptr == 0 {
-		return ""
-	}
-	p := *(*unsafe.Pointer)(unsafe.Pointer(&cptr)) // C-owned buffer, not Go-GC memory (see doc above)
-	n := 0
-	for *(*byte)(unsafe.Add(p, n)) != 0 {
-		n++
-	}
-	return string(unsafe.Slice((*byte)(p), n))
-}
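Note on the buildOptsJSON caveat in the removed capi.go: the comment suggests switching to an Encoder with SetEscapeHTML(false) if a string option ever contains <, > or &. The following is a minimal sketch of that alternative, not the removed implementation. The helper name encodeFlatOpts and the reduced scalar type list are assumptions made for illustration; only the standard library's encoding/json API is relied on.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

// encodeFlatOpts is a hypothetical stand-in for buildOptsJSON. It accepts
// only a few scalar types (an illustrative subset) and writes <, > and &
// verbatim instead of as \u003c-style escapes.
func encodeFlatOpts(opts map[string]any) (string, error) {
	if len(opts) == 0 {
		return "{}", nil
	}
	for k, v := range opts {
		switch v.(type) {
		case string, int, int64, float64:
			// flat scalar: accepted
		default:
			return "", fmt.Errorf("opts key %q has non-scalar value %T", k, v)
		}
	}
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // keep <, > and & as single bytes
	if err := enc.Encode(opts); err != nil {
		return "", fmt.Errorf("encode opts: %w", err)
	}
	// Encoder.Encode appends a newline; the flat object should not carry it.
	return strings.TrimSuffix(buf.String(), "\n"), nil
}

func main() {
	out, err := encodeFlatOpts(map[string]any{"kv_cache": "<auto>", "n_predict": 16})
	fmt.Println(out, err)
}
```

With json.Marshal the same string value would be written with \u003c and \u003e escapes, which is valid JSON but only safe if the consuming scanner decodes \u escapes.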
--- a/backend/go/dllm/dllm.go
+++ b/backend/go/dllm/dllm.go
@@ -1,553 +0,0 @@
-package main
-
-// LocalAI gRPC backend for dllm.cpp (DiffusionGemma block-diffusion models).
-//
-// Wiring overview:
-//   - Load opens the GGUF via dllm_capi_load and starts the per-model worker
-//     goroutine that serializes every C call (see submit).
-//   - PredictRich / PredictStreamRich implement grpc.AIModelRich: when the
-//     request carries raw messages (use_tokenizer_template), the backend owns
-//     templating (RenderGemma4) and output parsing (Gemma4Parser) and replies
-//     with ChatDeltas, like the llama.cpp autoparser and the ds4 backend.
-//   - The legacy Predict / PredictStream methods delegate to the rich pair
-//     (cloud-proxy precedent); the gRPC server prefers the rich path anyway.
-
-import (
-	"encoding/json"
-	"errors"
-	"fmt"
-	"strconv"
-	"strings"
-	"sync"
-	"unicode/utf8"
-
-	grpc "github.com/mudler/LocalAI/pkg/grpc"
-	"github.com/mudler/LocalAI/pkg/grpc/base"
-	"github.com/mudler/LocalAI/pkg/grpc/grpcerrors"
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	"github.com/mudler/xlog"
-)
-
-// The gRPC server cancels in-flight generations on client disconnect only
-// for backends advertising the Cancellable capability; keep Dllm pinned to
-// it so a signature drift fails the build, not the disconnect path.
-var _ grpc.Cancellable = (*Dllm)(nil)
-
-// generator is the seam between the backend wiring and the dllm.cpp C-ABI:
-// the real implementation (capiGenerator) wraps the cGenerate/cTokenizeJSON
-// family, while tests substitute a fake to exercise prompt construction,
-// parsing and serialization without libdllm.so.
-type generator interface {
-	generate(prompt, optsJSON string) (string, error)
-	// generateStream invokes onBlock once per committed diffusion block, on
-	// the thread running the C call, before returning.
-	generateStream(prompt, optsJSON string, onBlock func(text string)) error
-	tokenizeJSON(text string) (string, error)
-	// cancel is the ONE entry point safe to call concurrently with an
-	// in-flight generate on the same ctx (dllm_capi.h: it only flips an
-	// atomic; everything else must be externally serialized per ctx).
-	cancel()
-	free()
-}
-
-// capiGenerator is the production generator over one dllm_ctx handle.
-type capiGenerator struct {
-	h uintptr
-}
-
-func (g *capiGenerator) generate(prompt, optsJSON string) (string, error) {
-	return cGenerate(g.h, prompt, optsJSON)
-}
-
-func (g *capiGenerator) generateStream(prompt, optsJSON string, onBlock func(text string)) error {
-	// on_step (per-denoise-step canvas preview, dllm.cpp's --visual) is
-	// passed as nil for now: a future progress hook for the React UI can
-	// plumb it through without touching the C binding.
-	return cGenerateStream(g.h, prompt, optsJSON, onBlock, nil)
-}
-
-func (g *capiGenerator) tokenizeJSON(text string) (string, error) {
-	return cTokenizeJSON(g.h, text)
-}
-
-func (g *capiGenerator) cancel() {
-	cCancel(g.h)
-}
-
-func (g *capiGenerator) free() {
-	cFree(g.h)
-}
-
-// Dllm is the gRPC backend instance: one per loaded model (LocalAI starts
-// one backend process per model).
-type Dllm struct {
-	base.Base
-
-	gen generator
-	// genOpts holds the model-level generation overrides parsed from
-	// ModelOptions.Options at Load (eb_*, blocks, kv_cache). The C-ABI takes
-	// them per-generate, not per-load, so they are merged into every
-	// request's opts JSON (requestOptsJSON).
-	genOpts map[string]any
-
-	// jobs is the per-model worker queue. dllm_capi.h requires every entry
-	// point EXCEPT dllm_capi_cancel to be externally serialized per ctx (one
-	// ctx = one concurrent generate/tokenize; last_error is unsafe to read
-	// while a call is in flight). A single goroutine owning all C calls makes
-	// that contract structural instead of relying on lock discipline.
-	jobs     chan func()
-	workerWG sync.WaitGroup
-
-	// genMu guards gen against Free racing in-flight requests: requests hold
-	// the read lock for their full duration (they stay concurrent with each
-	// other - the worker still serializes the C calls), Free takes the write
-	// lock so it can only run when no request is in flight.
-	genMu sync.RWMutex
-}
-
-func (d *Dllm) startWorker() {
-	d.jobs = make(chan func())
-	d.workerWG.Add(1)
-	go func() {
-		defer d.workerWG.Done()
-		for job := range d.jobs {
-			job()
-		}
-	}()
-}
-
-// submit runs job on the worker goroutine and waits for it to finish.
-// Concurrent gRPC requests therefore queue up and execute one at a time
-// against the single dllm_ctx.
-func (d *Dllm) submit(job func()) {
-	done := make(chan struct{})
-	d.jobs <- func() {
-		defer close(done)
-		job()
-	}
-	<-done
-}
-
-// Load opens the GGUF and prepares the worker. Load-time engine parameters
-// travel as the flat params JSON of dllm_capi_load; generation overrides
-// from Options are stored for per-request opts JSON instead (the C-ABI has
-// no per-load sampler state).
-func (d *Dllm) Load(opts *pb.ModelOptions) error {
-	if d.gen != nil {
-		return errors.New("dllm: model already loaded")
-	}
-
-	params := map[string]any{
-		"n_gpu_layers": opts.GetNGPULayers(),
-	}
-	if opts.GetThreads() > 0 {
-		params["n_threads"] = opts.GetThreads()
-	}
-	if opts.GetContextSize() > 0 {
-		params["ctx_len"] = opts.GetContextSize()
-	}
-	paramsJSON, err := buildOptsJSON(params)
-	if err != nil {
-		return err
-	}
-
-	d.genOpts = parseModelGenOpts(opts.GetOptions())
-
-	h := cLoad(opts.GetModelFile(), paramsJSON)
-	if h == 0 {
-		// No ctx exists on load failure, so last_error(NULL) only carries the
-		// static NULL-ctx message; the real reason is on the backend's stderr.
-		return fmt.Errorf("dllm: load %q failed: %s (see backend log for details)",
-			opts.GetModelFile(), lastErrorOr(0, "unknown error"))
-	}
-	d.gen = &capiGenerator{h: h}
-	d.startWorker()
-	xlog.Info("dllm: model loaded", "model", opts.GetModelFile(), "params", paramsJSON, "gen_opts", d.genOpts)
-	return nil
-}
-
-// Free releases the dllm ctx and stops the worker. Safe when never loaded.
-//
-// The write lock is essential: the gRPC server (pkg/grpc/server.go, see the
-// model-unload path around line 764) calls Free with no locking of its own,
-// and base.Base provides none either. Without it a request racing Free would
-// panic sending on the closed jobs channel - or worse, generate on a freed C
-// ctx. Holding genMu until gen is nil also turns post-Free requests into a
-// clean "model not loaded" error instead of a crash.
-func (d *Dllm) Free() error {
-	d.genMu.Lock()
-	defer d.genMu.Unlock()
-	if d.gen == nil {
-		return nil
-	}
-	d.submit(d.gen.free)
-	close(d.jobs)
-	d.workerWG.Wait()
-	d.gen = nil
-	return nil
-}
-
-// Cancel requests cancellation of the in-flight generate (the
-// grpc.Cancellable capability). The gRPC server arms it via
-// context.AfterFunc on the request/stream context, so a client
-// disconnect or timeout aborts the generation server-side - the same
-// semantics the llama.cpp C++ backend gets from polling IsCancelled().
-// It deliberately bypasses the worker queue: dllm_capi_cancel is the one
-// call the C-ABI allows from any goroutine mid-generate (it only flips
-// an atomic).
-//
-// Note dllm_capi.h's cancel-reset race: each generate resets the flag on
-// entry, so a Cancel racing a NEW generate on the same ctx can be lost
-// (and, with requests queued on the worker, it aborts whichever generate
-// is currently running). The single-flag granularity is acceptable here
-// because the server de-registers the hook on normal completion and one
-// backend process serves one model.
-func (d *Dllm) Cancel() {
-	// RLock so a server-side AfterFunc firing in the window between a
-	// request finishing and a model unload cannot touch a freed C ctx
-	// (Free holds the write lock while tearing gen down). cancel() is the
-	// one C call that is safe concurrently with an in-flight generate, so
-	// taking a read lock here cannot deadlock against request holders.
-	d.genMu.RLock()
-	defer d.genMu.RUnlock()
-	if d.gen != nil {
-		d.gen.cancel()
-	}
-}
-
-// dllmGenOptKeys are the ModelOptions.Options keys this backend forwards to
-// the engine. Options is a shared free-form bag (other layers put their own
-// entries there), so unknown keys are skipped with a debug log, not an error.
-var dllmGenOptKeys = map[string]bool{
-	"blocks":   true,
-	"kv_cache": true, // "auto"|"on"|"off"; honored by the engine from P3
-}
-
-// parseModelGenOpts parses "key:value" Options entries into the flat scalar
-// map merged into every generate's opts JSON. eb_* (Entropy-Bound sampler
-// knobs) and the keys in dllmGenOptKeys are recognized; values are typed by
-// first successful parse (int, then float, else string) to match the C
-// scanner's number/string scalars.
-func parseModelGenOpts(options []string) map[string]any {
-	out := map[string]any{}
-	for _, o := range options {
-		key, val, found := strings.Cut(o, ":")
-		if !found {
-			xlog.Warn("dllm: ignoring malformed option (want key:value)", "option", o)
-			continue
-		}
-		if !strings.HasPrefix(key, "eb_") && !dllmGenOptKeys[key] {
-			xlog.Debug("dllm: ignoring unrecognized option", "key", key)
-			continue
-		}
-		out[key] = parseScalarOpt(val)
-	}
-	return out
-}
-
-func parseScalarOpt(v string) any {
-	if iv, err := strconv.ParseInt(v, 10, 64); err == nil {
-		return iv
-	}
-	if fv, err := strconv.ParseFloat(v, 64); err == nil {
-		return fv
-	}
-	return v
-}
-
-// metadataEnableThinking reads the enable_thinking gate. Unlike ds4 (default
-// ON, matching ds4-server), dllm defaults OFF: DiffusionGemma's chat
-// template guards every thinking branch with `enable_thinking is defined and
-// enable_thinking`, i.e. thinking is opt-in for this model family, and the
-// no-thinking render pre-closes an empty thought channel that the OFF
-// default must produce.
-func metadataEnableThinking(opts *pb.PredictOptions) bool {
-	v := opts.GetMetadata()["enable_thinking"]
-	return v == "true" || v == "1"
-}
-
-// buildPrompt resolves the prompt for a request. With use_tokenizer_template
-// and raw messages the backend owns templating (RenderGemma4) and the output
-// is in the known gemma4 format, so parse=true. Without it the caller
-// templated the prompt themselves (LocalAI's Go templates + PEG fallback, or
-// a bare completion): the prompt passes through verbatim and the output is
-// NOT gemma4-parsed - it is emitted as plain content and the Go side's
-// extraction applies, as for any non-autoparsing backend.
-func buildPrompt(opts *pb.PredictOptions) (prompt string, parse bool, err error) {
-	if opts.GetUseTokenizerTemplate() && len(opts.GetMessages()) > 0 {
-		prompt, err = RenderGemma4(opts.GetMessages(), opts.GetTools(), metadataEnableThinking(opts), true)
-		return prompt, true, err
-	}
-	return opts.GetPrompt(), false, nil
-}
-
-// requestOptsJSON merges the model-level overrides with the request's
-// sampling fields into the flat opts JSON for one generate call.
-func (d *Dllm) requestOptsJSON(opts *pb.PredictOptions) (string, error) {
-	m := make(map[string]any, len(d.genOpts)+2)
-	for k, v := range d.genOpts {
-		m[k] = v
-	}
-	if n := opts.GetTokens(); n > 0 {
-		// The engine rounds n_predict UP to a whole number of diffusion
-		// blocks (the canvas is denoised block-wise), so the completion may
-		// run slightly past the requested budget. Tokens==0 omits the key so
-		// the C-ABI default of 256 applies (hardcoded in capi.cpp's
-		// parse_gen_opts, independent of canvas_length).
-		m["n_predict"] = n
-	}
-	if s := opts.GetSeed(); s > 0 {
-		// The engine seeds mt19937 with explicit non-negative seeds. Seed<=0
-		// is omitted: proto3 cannot distinguish 0 from unset, and negative
-		// values conventionally mean "random" across LocalAI backends.
-		m["seed"] = s
-	}
-	return buildOptsJSON(m)
-}
-
-// prepareRequest is the shared prologue of the rich methods: resolve the
-// prompt (and whether the output gets gemma4-parsed) and build the per-call
-// opts JSON.
-func (d *Dllm) prepareRequest(opts *pb.PredictOptions) (prompt string, parse bool, optsJSON string, err error) {
-	prompt, parse, err = buildPrompt(opts)
-	if err != nil {
-		return "", false, "", err
-	}
-	optsJSON, err = d.requestOptsJSON(opts)
-	if err != nil {
-		return "", false, "", err
-	}
-	return prompt, parse, optsJSON, nil
-}
-
-// sanitizeUTF8 makes s safe for a proto3 string field. Block-boundary
-// detokenization and byte-fallback tokens can produce invalid UTF-8, and
-// grpc-go refuses to marshal it ("string field contains invalid UTF-8"), so
-// every string destined for a Reply/ChatDelta must pass through here (or
-// through splitValidUTF8, which calls it). Lone malformed bytes are genuinely
-// undecodable: replace with U+FFFD rather than crash the stream.
-func sanitizeUTF8(s string) string {
-	if utf8.ValidString(s) {
-		return s
-	}
-	return strings.ToValidUTF8(s, "\uFFFD")
-}
-
-// utf8SeqLen returns the declared sequence length of a UTF-8 leading byte
-// (1 for bytes that can never lead a multi-byte sequence, so they are never
-// held back and fall through to sanitizeUTF8's replacement).
-func utf8SeqLen(b byte) int {
-	switch {
-	case b&0xE0 == 0xC0:
-		return 2
-	case b&0xF0 == 0xE0:
-		return 3
-	case b&0xF8 == 0xF0:
-		return 4
-	default:
-		return 1
-	}
-}
-
-// splitValidUTF8 prepends the previous block's carry to the new block and
-// splits the result into text safe to emit now and a trailing INCOMPLETE
-// UTF-8 sequence (at most utf8.UTFMax-1 bytes) to carry into the next block:
-// the per-block detokenize can split a multi-byte character across block
-// boundaries (llama.cpp's grpc-server holds back the same way). Only a
-// suffix that can still become a valid rune is withheld; bytes that are
-// already undecodable are replaced immediately so the carry stays bounded.
-func splitValidUTF8(carry, block string) (emit, newCarry string) {
-	s := carry + block
-	cut := len(s)
-	for i := len(s) - 1; i >= 0 && len(s)-i < utf8.UTFMax; i-- {
-		b := s[i]
-		if b < utf8.RuneSelf {
-			break // ASCII: everything before the tail scan is complete
-		}
-		if !utf8.RuneStart(b) {
-			continue // continuation byte: keep looking for its leading byte
-		}
-		// Leading byte: hold the sequence back iff it declares more bytes
-		// than the stream has produced so far (it may complete next block).
-		if utf8SeqLen(b) > len(s)-i {
-			cut = i
-		}
-		break
-	}
-	return sanitizeUTF8(s[:cut]), s[cut:]
-}
-
-// PredictRich is the non-streaming inference path (grpc.AIModelRich).
-// Returns one Reply whose Message is the aggregated assistant content and
-// whose ChatDeltas carry the parsed content/reasoning/tool-call events.
-func (d *Dllm) PredictRich(opts *pb.PredictOptions) (*pb.Reply, error) {
-	d.genMu.RLock()
-	defer d.genMu.RUnlock()
-	if d.gen == nil {
-		return nil, grpcerrors.ModelNotLoaded("dllm")
-	}
-	prompt, parse, optsJSON, err := d.prepareRequest(opts)
-	if err != nil {
-		return nil, err
-	}
-
-	var out string
-	var genErr error
-	d.submit(func() {
-		out, genErr = d.gen.generate(prompt, optsJSON)
-	})
-	if genErr != nil {
-		return nil, genErr
-	}
-	// Byte-fallback tokens can detokenize to invalid UTF-8; proto3 strings
-	// must be valid or grpc-go fails the whole reply at marshal time.
-	out = sanitizeUTF8(out)
-
-	if !parse {
-		// Raw-prompt mode: plain content, no gemma4 parsing (see buildPrompt).
-		return &pb.Reply{Message: []byte(out), ChatDeltas: []*pb.ChatDelta{{Content: out}}}, nil
-	}
-
-	// The prompt renders with add_generation_prompt; both thinking modes
-	// leave the model starting in content state (see the Gemma4Parser header
-	// comment), hence NewGemma4Parser(false).
-	parser := NewGemma4Parser(false)
-	if reply := replyFromDeltas(append(parser.Feed(out), parser.Close()...)); reply != nil {
-		return reply, nil
-	}
-	// Everything was markers (or out was empty): an empty but non-nil Reply.
-	return &pb.Reply{}, nil
-}
-
-// PredictStreamRich is the streaming counterpart (grpc.AIModelRich): one
-// Reply per committed diffusion block that produced deltas. Per the
-// interface contract the channel is only sent into here - the gRPC server
-// closes it after this returns (opposite to legacy PredictStream).
-func (d *Dllm) PredictStreamRich(opts *pb.PredictOptions, results chan<- *pb.Reply) error {
-	d.genMu.RLock()
-	defer d.genMu.RUnlock()
-	if d.gen == nil {
-		return grpcerrors.ModelNotLoaded("dllm")
-	}
-	prompt, parse, optsJSON, err := d.prepareRequest(opts)
-	if err != nil {
-		return err
-	}
-
-	var parser *Gemma4Parser
-	if parse {
-		parser = NewGemma4Parser(false)
-	}
-	// emit runs inside onBlock, i.e. on the thread driving the C generate.
-	// Sending on results can block on a slow consumer, but the server-side
-	// pump (pkg/grpc/server.go PredictStream) drains continuously and drops
-	// undeliverable sends, so this backpressure is brief and bounded - and
-	// pausing the diffusion loop under it is the desired behavior anyway.
-	emit := func(text string) {
-		if !parse {
-			if text != "" {
-				results <- &pb.Reply{Message: []byte(text), ChatDeltas: []*pb.ChatDelta{{Content: text}}}
-			}
-			return
-		}
-		deltas := parser.Feed(text)
-		if reply := replyFromDeltas(deltas); reply != nil {
-			results <- reply
-		}
-	}
-	// onBlock guards emit (and through it the parser) against invalid UTF-8:
-	// a multi-byte character split across block boundaries is held back until
-	// it completes (see splitValidUTF8), so proto3 marshaling never fails.
-	var carry string
-	onBlock := func(block string) {
-		var text string
-		text, carry = splitValidUTF8(carry, block)
-		emit(text)
-	}
-
-	var genErr error
-	d.submit(func() {
-		genErr = d.gen.generateStream(prompt, optsJSON, onBlock)
-	})
-	if genErr != nil {
-		return genErr
-	}
-	if carry != "" {
-		// The stream ended mid-sequence: the held-back bytes can no longer
-		// complete, so flush them through the U+FFFD last resort.
-		emit(sanitizeUTF8(carry))
-	}
-	if parse {
-		if reply := replyFromDeltas(parser.Close()); reply != nil {
-			results <- reply
-		}
-	}
-	return nil
-}
-
-// replyFromDeltas wraps one batch of parsed deltas into a streaming Reply,
-// or nil when the batch is empty (markers consumed, nothing emitted yet).
-// Message mirrors the batch's content text so legacy chan-string consumers
-// see exactly the displayed tokens.
-func replyFromDeltas(deltas []*pb.ChatDelta) *pb.Reply {
-	if len(deltas) == 0 {
-		return nil
-	}
-	var content strings.Builder
-	for _, delta := range deltas {
-		content.WriteString(delta.GetContent())
-	}
-	return &pb.Reply{Message: []byte(content.String()), ChatDeltas: deltas}
-}
-
-// Predict is the legacy (string, error) signature; the gRPC server prefers
-// PredictRich, this exists for non-rich callers (cloud-proxy precedent).
-func (d *Dllm) Predict(opts *pb.PredictOptions) (string, error) {
-	reply, err := d.PredictRich(opts)
-	if err != nil {
-		return "", err
-	}
-	return string(reply.GetMessage()), nil
-}
-
-// PredictStream is the legacy chan-string path: rich replies reduced to
-// their content text. Note the inverted channel ownership - the LEGACY
-// contract requires the impl to close the channel.
-func (d *Dllm) PredictStream(opts *pb.PredictOptions, results chan string) error {
-	defer close(results)
-	richCh := make(chan *pb.Reply)
-	errCh := make(chan error, 1)
-	go func() {
-		errCh <- d.PredictStreamRich(opts, richCh)
-		close(richCh)
-	}()
-	for reply := range richCh {
-		if msg := reply.GetMessage(); len(msg) > 0 {
-			results <- string(msg)
-		}
-	}
-	return <-errCh
-}
-
-// TokenizeString tokenizes opts.Prompt via dllm_capi_tokenize_json (the C
-// side prepends bos per the vocab) and decodes the returned id array.
-func (d *Dllm) TokenizeString(opts *pb.PredictOptions) (pb.TokenizationResponse, error) {
-	d.genMu.RLock()
-	defer d.genMu.RUnlock()
-	if d.gen == nil {
-		return pb.TokenizationResponse{}, grpcerrors.ModelNotLoaded("dllm")
-	}
-	var out string
-	var tokErr error
-	d.submit(func() {
-		out, tokErr = d.gen.tokenizeJSON(opts.GetPrompt())
-	})
-	if tokErr != nil {
-		return pb.TokenizationResponse{}, tokErr
-	}
-	var tokens []int32
-	if err := json.Unmarshal([]byte(out), &tokens); err != nil {
-		return pb.TokenizationResponse{}, fmt.Errorf("dllm: decode tokenize result %q: %w", out, err)
-	}
-	return pb.TokenizationResponse{Length: int32(len(tokens)), Tokens: tokens}, nil
-}
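Note on the removed splitValidUTF8: the hold-back it describes (withhold a trailing incomplete UTF-8 sequence, emit everything else, replace undecodable bytes with U+FFFD) can be sketched with the standard library alone. This is an illustrative variant, not the removed code: the name holdBackIncomplete is hypothetical, and it uses utf8.FullRuneInString in place of the hand-written utf8SeqLen check, so edge cases with malformed continuation bytes may be handled differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// holdBackIncomplete (hypothetical name) joins the previous carry with the
// new block, emits the text that is complete, and returns any trailing
// unfinished multi-byte sequence as the carry for the next block.
func holdBackIncomplete(carry, block string) (emit, newCarry string) {
	s := carry + block
	cut := len(s)
	// An unfinished rune occupies at most utf8.UTFMax-1 trailing bytes.
	for i := len(s) - 1; i >= 0 && len(s)-i < utf8.UTFMax; i-- {
		b := s[i]
		if b < utf8.RuneSelf {
			break // ASCII: the tail is complete
		}
		if !utf8.RuneStart(b) {
			continue // continuation byte: keep looking for its leading byte
		}
		// FullRuneInString is false only for a valid prefix that still needs
		// more bytes; invalid bytes count as "full" and are replaced below.
		if !utf8.FullRuneInString(s[i:]) {
			cut = i
		}
		break
	}
	return strings.ToValidUTF8(s[:cut], "\uFFFD"), s[cut:]
}

func main() {
	// U+00E9 is the two bytes 0xC3 0xA9, split here across two blocks.
	emit, carry := holdBackIncomplete("", "caf\xc3")
	fmt.Printf("%q %q\n", emit, carry)
	emit, carry = holdBackIncomplete(carry, "\xa9!")
	fmt.Printf("%q %q\n", emit, carry)
}
```

A caller would still need to flush a non-empty carry when the stream ends, as the removed PredictStreamRich did through sanitizeUTF8.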
--- a/backend/go/dllm/dllm_test.go
+++ b/backend/go/dllm/dllm_test.go
@@ -1,807 +0,0 @@
-package main
-
-import (
-	"errors"
-	"os"
-	"runtime"
-	"sync"
-	"testing"
-	"time"
-	"unicode/utf8"
-	"unsafe"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-)
-
-func TestDllm(t *testing.T) {
-	RegisterFailHandler(Fail)
-	RunSpecs(t, "dllm Backend Suite")
-}
-
-var (
-	libLoadOnce sync.Once
-	libLoadErr  error
-)
-
-// ensureLibLoaded mirrors main.go's bootstrap so a Go test can drive the
-// C-ABI bridge without spinning up the gRPC server. The library path comes
-// from DLLM_TEST_LIBRARY (gated specs Skip when it is unset).
-func ensureLibLoaded() {
-	libLoadOnce.Do(func() {
-		libLoadErr = loadCAPI(os.Getenv("DLLM_TEST_LIBRARY"))
-	})
-}
-
-// C-ABI binding smoke: drives the real libdllm.so against the tiny GGUF
-// fixture from dllm.cpp (tests/fixtures/tiny_with_vocab.gguf). Gated on:
-//
-//	DLLM_TEST_LIBRARY    absolute path to libdllm.so
-//	DLLM_TEST_TINY_MODEL absolute path to tiny_with_vocab.gguf
-var _ = Describe("C-ABI binding", func() {
-	BeforeEach(func() {
-		if os.Getenv("DLLM_TEST_LIBRARY") == "" || os.Getenv("DLLM_TEST_TINY_MODEL") == "" {
-			Skip("set DLLM_TEST_LIBRARY and DLLM_TEST_TINY_MODEL to run the C-ABI binding smoke")
-		}
-		ensureLibLoaded()
-		Expect(libLoadErr).ToNot(HaveOccurred())
-	})
-
-	It("binds the 9 symbols and round-trips the tiny model", func() {
-		Expect(cAbiVersion()).To(Equal(int32(1)))
-
-		h := cLoad(os.Getenv("DLLM_TEST_TINY_MODEL"), "{}")
-		Expect(h).ToNot(BeZero(), "dllm_capi_load of the tiny fixture")
-
-		// Tiny fixture vocab: "hello" tokenizes to ids [2,18] (bos prepended
-		// by the C side: vocab.add_bos).
-		toks, err := cTokenizeJSON(h, "hello")
-		Expect(err).ToNot(HaveOccurred())
-		Expect(toks).To(Equal("[2,18]"))
-
-		// Deterministic generation: an explicit non-negative seed seeds
-		// mt19937, so two identical calls must produce identical text.
-		out1, err := cGenerate(h, "hello", `{"n_predict":16,"seed":7}`)
-		Expect(err).ToNot(HaveOccurred())
-		Expect(out1).ToNot(BeEmpty())
-		// Cancel with no call in flight is dropped: each generate resets the
-		// cancel flag on entry (header contract), so this must not affect
-		// the next call. Also binds the 9th symbol; safe on NULL too.
-		cCancel(h)
-		cCancel(0)
-
-		out2, err := cGenerate(h, "hello", `{"n_predict":16,"seed":7}`)
-		Expect(err).ToNot(HaveOccurred())
-		Expect(out2).To(Equal(out1))
-
-		// Streaming variant: same opts, blocks arrive via the purego
-		// callback trampoline. The per-block detokenize can differ from the
-		// seamless full-text decode at block boundaries, so only assert that
-		// blocks arrived and were non-trivial, not byte equality with out1.
-		var blocks []string
-		var steps int
-		err = cGenerateStream(h, "hello", `{"n_predict":16,"seed":7}`,
-			func(text string) { blocks = append(blocks, text) },
-			func(step, total int, preview string) { steps++ },
-		)
-		Expect(err).ToNot(HaveOccurred())
-		Expect(blocks).ToNot(BeEmpty())
-		Expect(steps).To(BeNumerically(">", 0))
-
-		// Load failure path: NULL ctx back, and last_error(NULL) returns the
-		// static NULL-ctx message (there is no ctx to carry the real reason).
-		bad := cLoad("/nonexistent/dllm-model.gguf", "{}")
-		Expect(bad).To(BeZero())
-		Expect(cLastError(0)).ToNot(BeEmpty())
-
-		// Free is safe on a live handle and a NULL one (delete nullptr).
-		cFree(h)
-		cFree(0)
-	})
-})
-
-// Ungated specs for the pure-Go helpers (no libdllm.so required).
-var _ = Describe("buildOptsJSON", func() {
-	It("renders flat scalars as a JSON object", func() {
-		out, err := buildOptsJSON(map[string]any{
-			"n_predict": 16,
-			"seed":      int64(7),
-			"eb_t_min":  0.5,
-			"kv_cache":  "auto",
-		})
-		Expect(err).ToNot(HaveOccurred())
-		Expect(out).To(MatchJSON(`{"n_predict":16,"seed":7,"eb_t_min":0.5,"kv_cache":"auto"}`))
-	})
-
-	It("renders an empty object for no options", func() {
-		out, err := buildOptsJSON(nil)
-		Expect(err).ToNot(HaveOccurred())
-		Expect(out).To(Equal("{}"))
-	})
-
-	It("rejects nested objects (the C-side scanner only reads flat scalars)", func() {
-		_, err := buildOptsJSON(map[string]any{"sampler": map[string]any{"seed": 1}})
-		Expect(err).To(HaveOccurred())
-	})
-
-	It("rejects arrays", func() {
-		_, err := buildOptsJSON(map[string]any{"stop": []string{"a"}})
-		Expect(err).To(HaveOccurred())
-	})
-
-	It("rejects booleans (the C-side scanner only understands numbers and strings)", func() {
-		_, err := buildOptsJSON(map[string]any{"flag": true})
-		Expect(err).To(HaveOccurred())
-	})
-})
-
-var _ = Describe("splitValidUTF8", func() {
-	It("holds back a trailing incomplete sequence and completes it next block", func() {
-		emit, carry := splitValidUTF8("", "caf\xe2")
-		Expect(emit).To(Equal("caf"))
-		Expect(carry).To(Equal("\xe2"))
-
-		emit, carry = splitValidUTF8(carry, "\x82")
-		Expect(emit).To(BeEmpty())
-		Expect(carry).To(Equal("\xe2\x82"))
-
-		emit, carry = splitValidUTF8(carry, "\xac!")
-		Expect(emit).To(Equal("€!"))
-		Expect(carry).To(BeEmpty())
-	})
-
-	It("holds back up to 3 bytes of a 4-byte sequence", func() {
-		emit, carry := splitValidUTF8("", "x\xf0\x9f\x98") // 😀 missing its last byte
-		Expect(emit).To(Equal("x"))
-		Expect(carry).To(Equal("\xf0\x9f\x98"))
-
-		emit, carry = splitValidUTF8(carry, "\x80")
-		Expect(emit).To(Equal("😀"))
-		Expect(carry).To(BeEmpty())
-	})
-
-	It("replaces undecodable bytes immediately instead of carrying them", func() {
-		// A mid-string invalid byte can never complete: carrying it would let
-		// the carry grow unboundedly, so it is substituted on the spot.
-		emit, carry := splitValidUTF8("", "a\xe2bc")
-		Expect(emit).To(Equal("a\ufffdbc"))
-		Expect(carry).To(BeEmpty())
-
-		// Orphan continuation bytes at the end have no leading byte to wait
-		// for either.
-		emit, carry = splitValidUTF8("", "a\x82")
-		Expect(emit).To(Equal("a\ufffd"))
-		Expect(carry).To(BeEmpty())
-	})
-
-	It("passes pure ASCII and complete UTF-8 through untouched", func() {
-		emit, carry := splitValidUTF8("", "héllo €")
-		Expect(emit).To(Equal("héllo €"))
-		Expect(carry).To(BeEmpty())
-	})
-})
-
-var _ = Describe("goStringFromCPtr", func() {
-	It("copies a NUL-terminated buffer", func() {
-		buf := []byte("dllm\x00")
-		s := goStringFromCPtr(uintptr(unsafe.Pointer(&buf[0])))
-		// The uintptr round-trip hides buf from the GC's liveness analysis;
-		// keep it reachable until after the copy.
-		runtime.KeepAlive(buf)
-		Expect(s).To(Equal("dllm"))
-	})
-
-	It("returns the empty string for NULL", func() {
-		Expect(goStringFromCPtr(0)).To(Equal(""))
-	})
-})
-
-// ---------------------------------------------------------------------------
-// Backend wiring (T4): fake-generator specs, no libdllm.so required.
-// ---------------------------------------------------------------------------
-
-type fakeGenCall struct {
-	prompt   string
-	optsJSON string
-}
-
-// fakeGen implements generator in-process. It records every call (prompt +
-// opts JSON), tracks concurrent in-flight calls to prove worker
-// serialization, and replays canned output (out for generate/tokenize,
-// blocks for generateStream).
-type fakeGen struct {
-	mu          sync.Mutex
-	calls       []fakeGenCall
-	inFlight    int
-	maxInFlight int
-
-	out    string
-	blocks []string
-	err    error
-	delay  time.Duration
-}
-
-func (f *fakeGen) begin(prompt, optsJSON string) {
-	f.mu.Lock()
-	defer f.mu.Unlock()
-	f.calls = append(f.calls, fakeGenCall{prompt: prompt, optsJSON: optsJSON})
-	f.inFlight++
-	if f.inFlight > f.maxInFlight {
-		f.maxInFlight = f.inFlight
-	}
-}
-
-func (f *fakeGen) end() {
-	f.mu.Lock()
-	defer f.mu.Unlock()
-	f.inFlight--
-}
-
-func (f *fakeGen) snapshot() (calls []fakeGenCall, maxInFlight int) {
-	f.mu.Lock()
-	defer f.mu.Unlock()
-	return append([]fakeGenCall(nil), f.calls...), f.maxInFlight
-}
-
-func (f *fakeGen) generate(prompt, optsJSON string) (string, error) {
-	f.begin(prompt, optsJSON)
-	defer f.end()
-	if f.delay > 0 {
-		time.Sleep(f.delay)
-	}
-	return f.out, f.err
-}
-
-func (f *fakeGen) generateStream(prompt, optsJSON string, onBlock func(text string)) error {
-	f.begin(prompt, optsJSON)
-	defer f.end()
-	if f.err != nil {
-		return f.err
-	}
-	for _, b := range f.blocks {
-		onBlock(b)
-	}
-	return nil
-}
-
-func (f *fakeGen) tokenizeJSON(text string) (string, error) {
-	f.begin(text, "")
-	defer f.end()
-	return f.out, f.err
-}
-
-func (f *fakeGen) cancel() {}
-func (f *fakeGen) free()   {}
-
-// newTestDllm assembles a backend around a fake generator (bypassing Load,
-// which needs libdllm.so) and registers cleanup of the worker goroutine.
-func newTestDllm(g generator, genOpts map[string]any) *Dllm {
-	d := &Dllm{gen: g, genOpts: genOpts}
-	d.startWorker()
-	DeferCleanup(func() { Expect(d.Free()).To(Succeed()) })
-	return d
-}
-
-// drainReplies empties ch without blocking, failing the spec if the channel
-// was closed (PredictStreamRich must NOT close it - interface.go contract).
-// Size ch above the expected reply count: an overflow deadlocks the spec on
-// the producer's send instead of failing it.
-func drainReplies(ch chan *pb.Reply) []*pb.Reply {
-	var out []*pb.Reply
-	for {
-		select {
-		case r, ok := <-ch:
-			if !ok {
-				Fail("PredictStreamRich closed the results channel (the gRPC server owns the close)")
-			}
-			expectValidUTF8Reply(r)
-			out = append(out, r)
-		default:
-			return out
-		}
-	}
-}
-
-// expectValidUTF8Reply is the blanket guard for the proto3 marshaling
-// contract: grpc-go rejects any string field carrying invalid UTF-8, so every
-// reply field that ends up in a proto string must validate.
-func expectValidUTF8Reply(r *pb.Reply) {
-	GinkgoHelper()
-	Expect(utf8.ValidString(string(r.GetMessage()))).To(BeTrue(), "Reply.Message carries invalid UTF-8")
-	for _, delta := range r.GetChatDeltas() {
-		Expect(utf8.ValidString(delta.GetContent())).To(BeTrue(), "ChatDelta.Content carries invalid UTF-8")
-		Expect(utf8.ValidString(delta.GetReasoningContent())).To(BeTrue(), "ChatDelta.ReasoningContent carries invalid UTF-8")
-		for _, tc := range delta.GetToolCalls() {
-			Expect(utf8.ValidString(tc.GetName())).To(BeTrue(), "ToolCallDelta.Name carries invalid UTF-8")
-			Expect(utf8.ValidString(tc.GetArguments())).To(BeTrue(), "ToolCallDelta.Arguments carries invalid UTF-8")
-		}
-	}
-}
-
-var _ = Describe("Dllm backend wiring", func() {
-	Describe("PredictRich", func() {
-		It("renders gemma4 from raw messages and parses the output when use_tokenizer_template is set", func() {
-			fake := &fakeGen{out: "<|channel>thought\npondering<channel|>The answer.<turn|>"}
-			d := newTestDllm(fake, nil)
-
-			reply, err := d.PredictRich(&pb.PredictOptions{
-				UseTokenizerTemplate: true,
-				Messages:             []*pb.Message{{Role: "user", Content: "Write a long essay about Portugal."}},
-				Metadata:             map[string]string{"enable_thinking": "true"},
-			})
-			Expect(err).ToNot(HaveOccurred())
-
-			calls, _ := fake.snapshot()
-			Expect(calls).To(HaveLen(1))
-			// The enable_thinking=true render from the transformers fixture.
-			Expect(calls[0].prompt).To(Equal(
-				"<|turn>system\n<|think|>\n<turn|>\n<|turn>user\nWrite a long essay about Portugal.<turn|>\n<|turn>model\n"))
-
-			Expect(string(reply.GetMessage())).To(Equal("The answer."))
-			Expect(reply.GetChatDeltas()).To(HaveLen(2))
-			Expect(reply.GetChatDeltas()[0].GetReasoningContent()).To(Equal("pondering"))
-			Expect(reply.GetChatDeltas()[1].GetContent()).To(Equal("The answer."))
-		})
-
-		It("defaults enable_thinking OFF (the gemma4 template treats thinking as opt-in)", func() {
-			fake := &fakeGen{out: "hi"}
-			d := newTestDllm(fake, nil)
-
-			_, err := d.PredictRich(&pb.PredictOptions{
-				UseTokenizerTemplate: true,
-				Messages:             []*pb.Message{{Role: "user", Content: "Write a long essay about Portugal."}},
-			})
-			Expect(err).ToNot(HaveOccurred())
-
-			calls, _ := fake.snapshot()
-			// No-thinking render: the template pre-opens AND pre-closes an
-			// empty thought channel in the generation prompt.
-			Expect(calls[0].prompt).To(Equal(
-				"<|turn>user\nWrite a long essay about Portugal.<turn|>\n<|turn>model\n<|channel>thought\n<channel|>"))
-		})
-
-		It("passes the raw prompt verbatim and skips gemma4 parsing without use_tokenizer_template", func() {
-			// Marker-looking text must survive untouched: in raw-prompt mode
-			// the caller templates themselves and the Go-side extraction
-			// applies, so the backend must not interpret the output.
-			fake := &fakeGen{out: "<|channel>thought\nnot parsed<channel|>tail"}
-			d := newTestDllm(fake, nil)
-
-			reply, err := d.PredictRich(&pb.PredictOptions{Prompt: "my raw prompt"})
-			Expect(err).ToNot(HaveOccurred())
-
-			calls, _ := fake.snapshot()
-			Expect(calls[0].prompt).To(Equal("my raw prompt"))
-			Expect(string(reply.GetMessage())).To(Equal(fake.out))
-			Expect(reply.GetChatDeltas()).To(HaveLen(1))
-			Expect(reply.GetChatDeltas()[0].GetContent()).To(Equal(fake.out))
-		})
-
-		It("sanitizes invalid UTF-8 in the non-streaming output", func() {
-			// Byte-fallback tokens can decode to lone malformed bytes; the
-			// whole-output sanitize must replace them so proto3 marshaling of
-			// Message/ChatDeltas cannot fail.
-			fake := &fakeGen{out: "a\xe2b"}
-			d := newTestDllm(fake, nil)
-
-			reply, err := d.PredictRich(&pb.PredictOptions{Prompt: "p"})
-			Expect(err).ToNot(HaveOccurred())
-			expectValidUTF8Reply(reply)
-			Expect(string(reply.GetMessage())).To(Equal("a\ufffdb"))
-			Expect(reply.GetChatDeltas()[0].GetContent()).To(Equal("a\ufffdb"))
-		})
-
-		It("maps Tokens and Seed into the opts JSON on top of the model-level overrides", func() {
-			fake := &fakeGen{out: "x"}
-			d := newTestDllm(fake, map[string]any{"eb_t_min": 0.5, "kv_cache": "auto"})
-
-			_, err := d.PredictRich(&pb.PredictOptions{Prompt: "p", Tokens: 32, Seed: 7})
-			Expect(err).ToNot(HaveOccurred())
-
-			calls, _ := fake.snapshot()
-			Expect(calls[0].optsJSON).To(MatchJSON(`{"n_predict":32,"seed":7,"eb_t_min":0.5,"kv_cache":"auto"}`))
-		})
-
-		It("omits n_predict and seed when unset so the engine defaults apply", func() {
-			fake := &fakeGen{out: "x"}
-			d := newTestDllm(fake, nil)
-
-			_, err := d.PredictRich(&pb.PredictOptions{Prompt: "p"})
-			Expect(err).ToNot(HaveOccurred())
-
-			calls, _ := fake.snapshot()
-			Expect(calls[0].optsJSON).To(MatchJSON(`{}`))
-		})
-
-		It("surfaces generator errors", func() {
-			fake := &fakeGen{err: errors.New("boom")}
-			d := newTestDllm(fake, nil)
-
-			_, err := d.PredictRich(&pb.PredictOptions{Prompt: "p"})
-			Expect(err).To(MatchError("boom"))
-		})
-
-		It("errors before generating when no model is loaded", func() {
-			d := &Dllm{} // no Load, no worker: must fail fast, not hang
-			_, err := d.PredictRich(&pb.PredictOptions{Prompt: "p"})
-			Expect(err).To(HaveOccurred())
-		})
-
-		It("makes a concurrent Free wait for the in-flight request (both finish cleanly)", func() {
-			// server.go's Free has no locking of its own: the backend's genMu
-			// must hold Free back until the racing generate drains, instead of
-			// closing the jobs channel (panic) or freeing the C ctx under it.
-			fake := &fakeGen{out: "x", delay: 50 * time.Millisecond}
-			d := newTestDllm(fake, nil)
-
-			predictDone := make(chan error, 1)
-			go func() {
-				defer GinkgoRecover()
-				_, err := d.PredictRich(&pb.PredictOptions{Prompt: "p"})
-				predictDone <- err
-			}()
-			// Wait until the fake generate is actually in flight (the read
-			// lock is held from before submit until PredictRich returns).
-			Eventually(func() int {
-				_, maxInFlight := fake.snapshot()
-				return maxInFlight
-			}).Should(Equal(1))
-
-			Expect(d.Free()).To(Succeed())
-			// Free's write lock means the request finished before Free did.
-			var predictErr error
-			Eventually(predictDone).Should(Receive(&predictErr))
-			Expect(predictErr).ToNot(HaveOccurred())
-		})
-
-		It("returns model-not-loaded for requests after Free", func() {
-			fake := &fakeGen{out: "x"}
-			d := newTestDllm(fake, nil)
-			Expect(d.Free()).To(Succeed())
-
-			_, err := d.PredictRich(&pb.PredictOptions{Prompt: "p"})
-			Expect(err).To(MatchError(ContainSubstring("model not loaded")))
-		})
-
-		It("serializes concurrent requests through the worker goroutine", func() {
-			// dllm_capi.h: one ctx = one concurrent generate. Two overlapping
-			// PredictRich calls must execute the C calls one at a time.
-			fake := &fakeGen{out: "x", delay: 30 * time.Millisecond}
-			d := newTestDllm(fake, nil)
-
-			var wg sync.WaitGroup
-			for range 2 {
-				wg.Add(1)
-				go func() {
-					defer wg.Done()
-					defer GinkgoRecover()
-					_, err := d.PredictRich(&pb.PredictOptions{Prompt: "p"})
-					Expect(err).ToNot(HaveOccurred())
-				}()
-			}
-			wg.Wait()
-
-			calls, maxInFlight := fake.snapshot()
-			Expect(calls).To(HaveLen(2))
-			Expect(maxInFlight).To(Equal(1), "generate calls overlapped despite the worker queue")
-		})
-	})
-
-	Describe("PredictStreamRich", func() {
-		It("emits one reply per delta-producing block and leaves the channel open", func() {
-			// Blocks split mid-marker and mid-payload: the parser's holdback
-			// must keep marker fragments out of the emitted deltas.
-			fake := &fakeGen{blocks: []string{
-				"<|channel>thou",        // partial channel open: no deltas yet
-				"ght\nponder",           // header completes, reasoning starts
-				"ing<channel|>Hi ",      // reasoning ends, content starts
-				"there<turn|>discarded", // turn ends: trailing text dropped
-			}}
-			d := newTestDllm(fake, nil)
-
-			ch := make(chan *pb.Reply, 16)
-			err := d.PredictStreamRich(&pb.PredictOptions{
-				UseTokenizerTemplate: true,
-				Messages:             []*pb.Message{{Role: "user", Content: "hi"}},
-			}, ch)
-			Expect(err).ToNot(HaveOccurred())
-
-			replies := drainReplies(ch)
-			Expect(replies).To(HaveLen(3), "block 1 completes no delta and must not produce a reply")
-
-			var content, reasoning string
-			for _, r := range replies {
-				for _, delta := range r.GetChatDeltas() {
-					content += delta.GetContent()
-					reasoning += delta.GetReasoningContent()
-				}
-			}
-			Expect(reasoning).To(Equal("pondering"))
-			Expect(content).To(Equal("Hi there"))
-			// Message mirrors each reply's content so legacy consumers see
-			// exactly the displayed tokens.
-			Expect(string(replies[1].GetMessage())).To(Equal("Hi "))
-			Expect(string(replies[2].GetMessage())).To(Equal("there"))
-		})
-
-		It("streams raw blocks verbatim without use_tokenizer_template", func() {
-			fake := &fakeGen{blocks: []string{"abc", "", "<|channel>def"}}
-			d := newTestDllm(fake, nil)
-
-			ch := make(chan *pb.Reply, 16)
-			err := d.PredictStreamRich(&pb.PredictOptions{Prompt: "raw"}, ch)
-			Expect(err).ToNot(HaveOccurred())
-
-			replies := drainReplies(ch)
-			Expect(replies).To(HaveLen(2), "empty blocks produce no reply")
-			Expect(string(replies[0].GetMessage())).To(Equal("abc"))
-			Expect(string(replies[1].GetMessage())).To(Equal("<|channel>def"))
-			Expect(replies[1].GetChatDeltas()).To(HaveLen(1))
-		})
-
-		It("flushes parser holdback after the stream ends", func() {
-			// The unterminated partial marker "<chan" is held back during the
-			// stream and must come out as content on the final flush.
-			fake := &fakeGen{blocks: []string{"tail<chan"}}
-			d := newTestDllm(fake, nil)
-
-			ch := make(chan *pb.Reply, 16)
-			err := d.PredictStreamRich(&pb.PredictOptions{
-				UseTokenizerTemplate: true,
-				Messages:             []*pb.Message{{Role: "user", Content: "hi"}},
-			}, ch)
-			Expect(err).ToNot(HaveOccurred())
-
-			var content string
-			for _, r := range drainReplies(ch) {
-				content += string(r.GetMessage())
-			}
-			Expect(content).To(Equal("tail<chan"))
-		})
-
-		It("reassembles a multi-byte character split across block boundaries", func() {
-			// Per-block detokenize can split "€" (E2 82 AC) as E2 | 82 AC.
-			// Emitting the lone E2 would make grpc-go fail the marshal of the
-			// whole reply; the trailing incomplete sequence must be held back
-			// and completed by the next block.
-			fake := &fakeGen{blocks: []string{"caf\xe2", "\x82\xac ok"}}
-			d := newTestDllm(fake, nil)
-
-			ch := make(chan *pb.Reply, 16)
-			err := d.PredictStreamRich(&pb.PredictOptions{Prompt: "raw"}, ch)
-			Expect(err).ToNot(HaveOccurred())
-
-			var content string
-			for _, r := range drainReplies(ch) { // drain asserts ValidString per reply
-				content += string(r.GetMessage())
-			}
-			Expect(content).To(Equal("caf€ ok"))
-		})
-
-		It("reassembles a split multi-byte character in parsed (gemma4) mode too", func() {
-			fake := &fakeGen{blocks: []string{"caf\xe2", "\x82\xac<turn|>"}}
-			d := newTestDllm(fake, nil)
-
-			ch := make(chan *pb.Reply, 16)
-			err := d.PredictStreamRich(&pb.PredictOptions{
-				UseTokenizerTemplate: true,
-				Messages:             []*pb.Message{{Role: "user", Content: "hi"}},
-			}, ch)
-			Expect(err).ToNot(HaveOccurred())
-
-			var content string
-			for _, r := range drainReplies(ch) {
-				for _, delta := range r.GetChatDeltas() {
-					content += delta.GetContent()
-				}
-			}
-			Expect(content).To(Equal("caf€"))
-		})
-
-		It("replaces an incomplete sequence left at stream end with U+FFFD", func() {
-			// A byte-fallback token can leave a lone leading byte (0xE2) that
-			// no later block completes: the final flush must substitute it,
-			// never emit it raw and never drop into a marshal error.
-			fake := &fakeGen{blocks: []string{"ok\xe2"}}
-			d := newTestDllm(fake, nil)
-
-			ch := make(chan *pb.Reply, 16)
-			err := d.PredictStreamRich(&pb.PredictOptions{Prompt: "raw"}, ch)
-			Expect(err).ToNot(HaveOccurred())
-
-			var content string
-			for _, r := range drainReplies(ch) {
-				content += string(r.GetMessage())
-			}
-			Expect(content).To(Equal("ok\ufffd"))
-		})
-
-		It("surfaces generator errors without sending replies", func() {
-			fake := &fakeGen{err: errors.New("stream boom")}
-			d := newTestDllm(fake, nil)
-
-			ch := make(chan *pb.Reply, 16)
-			err := d.PredictStreamRich(&pb.PredictOptions{Prompt: "p"}, ch)
-			Expect(err).To(MatchError("stream boom"))
-			Expect(drainReplies(ch)).To(BeEmpty())
-		})
-
-		It("errors before generating when no model is loaded", func() {
-			d := &Dllm{} // no Load, no worker: must fail fast, not hang
-			ch := make(chan *pb.Reply, 1)
-			err := d.PredictStreamRich(&pb.PredictOptions{Prompt: "p"}, ch)
-			Expect(err).To(MatchError(ContainSubstring("model not loaded")))
-			Expect(drainReplies(ch)).To(BeEmpty())
-		})
-	})
-
-	Describe("legacy Predict/PredictStream adapters", func() {
-		It("Predict returns the aggregated content string", func() {
-			fake := &fakeGen{out: "plain text"}
-			d := newTestDllm(fake, nil)
-
-			out, err := d.Predict(&pb.PredictOptions{Prompt: "p"})
-			Expect(err).ToNot(HaveOccurred())
-			Expect(out).To(Equal("plain text"))
-		})
-
-		It("PredictStream forwards content strings and closes the channel (legacy ownership)", func() {
-			fake := &fakeGen{blocks: []string{"a", "b"}}
-			d := newTestDllm(fake, nil)
-
-			ch := make(chan string, 16)
-			Expect(d.PredictStream(&pb.PredictOptions{Prompt: "p"}, ch)).To(Succeed())
-
-			var got []string
-			for s := range ch { // terminates only if the impl closed ch
-				got = append(got, s)
-			}
-			Expect(got).To(Equal([]string{"a", "b"}))
-		})
-	})
-
-	Describe("TokenizeString", func() {
-		It("decodes the C-side JSON id array", func() {
-			fake := &fakeGen{out: "[2,18]"}
-			d := newTestDllm(fake, nil)
-
-			resp, err := d.TokenizeString(&pb.PredictOptions{Prompt: "hello"})
-			Expect(err).ToNot(HaveOccurred())
-			Expect(resp.Length).To(Equal(int32(2)))
-			Expect(resp.Tokens).To(Equal([]int32{2, 18}))
-
-			calls, _ := fake.snapshot()
-			Expect(calls[0].prompt).To(Equal("hello"))
-		})
-
-		It("fails loud on a malformed id array", func() {
-			fake := &fakeGen{out: "not json"}
-			d := newTestDllm(fake, nil)
-
-			_, err := d.TokenizeString(&pb.PredictOptions{Prompt: "hello"})
-			Expect(err).To(HaveOccurred())
-		})
-
-		It("errors before tokenizing when no model is loaded", func() {
-			d := &Dllm{} // no Load, no worker: must fail fast, not hang
-			_, err := d.TokenizeString(&pb.PredictOptions{Prompt: "hello"})
-			Expect(err).To(MatchError(ContainSubstring("model not loaded")))
-		})
-	})
-
-	Describe("parseModelGenOpts", func() {
-		It("parses eb_*/blocks/kv_cache entries and types values by first successful parse", func() {
-			got := parseModelGenOpts([]string{
-				"eb_max_steps:16",
-				"eb_t_min:0.25",
-				"kv_cache:auto",
-				"blocks:4",
-				"unrelated_key:1", // other layers' options: skipped
-				"malformed",       // no colon: skipped
-			})
-			Expect(got).To(Equal(map[string]any{
-				"eb_max_steps": int64(16),
-				"eb_t_min":     0.25,
-				"kv_cache":     "auto",
-				"blocks":       int64(4),
-			}))
-		})
-
-		It("round-trips through buildOptsJSON (only flat scalars are produced)", func() {
-			got := parseModelGenOpts([]string{"eb_entropy_bound:0.8", "kv_cache:off"})
-			out, err := buildOptsJSON(got)
-			Expect(err).ToNot(HaveOccurred())
-			Expect(out).To(MatchJSON(`{"eb_entropy_bound":0.8,"kv_cache":"off"}`))
-		})
-	})
-})
-
-// ---------------------------------------------------------------------------
-// Gated backend round-trip against the real libdllm.so + tiny GGUF fixture.
-// ---------------------------------------------------------------------------
-
-var _ = Describe("Dllm backend (real tiny model)", func() {
-	BeforeEach(func() {
-		if os.Getenv("DLLM_TEST_LIBRARY") == "" || os.Getenv("DLLM_TEST_TINY_MODEL") == "" {
-			Skip("set DLLM_TEST_LIBRARY and DLLM_TEST_TINY_MODEL to run the backend round-trip")
-		}
-		ensureLibLoaded()
-		Expect(libLoadErr).ToNot(HaveOccurred())
-	})
-
-	It("round-trips Load, PredictRich, PredictStreamRich and TokenizeString", func() {
-		d := &Dllm{}
-		Expect(d.Load(&pb.ModelOptions{ModelFile: os.Getenv("DLLM_TEST_TINY_MODEL")})).To(Succeed())
-		DeferCleanup(func() { Expect(d.Free()).To(Succeed()) })
-
-		// TokenizeString: tiny fixture vocab tokenizes "hello" to [2,18].
-		resp, err := d.TokenizeString(&pb.PredictOptions{Prompt: "hello"})
-		Expect(err).ToNot(HaveOccurred())
-		Expect(resp.Tokens).To(Equal([]int32{2, 18}))
-		Expect(resp.Length).To(Equal(int32(2)))
-
-		req := &pb.PredictOptions{
-			UseTokenizerTemplate: true,
-			Messages:             []*pb.Message{{Role: "user", Content: "hello"}},
-			Tokens:               16,
-			Seed:                 7,
-		}
-
-		// Non-streaming: the tiny random-weight model emits arbitrary vocab
-		// words; with no gemma4 markers in them everything is content.
-		reply, err := d.PredictRich(req)
-		Expect(err).ToNot(HaveOccurred())
-		Expect(string(reply.GetMessage())).ToNot(BeEmpty())
-		Expect(reply.GetChatDeltas()).ToNot(BeEmpty())
-
-		// Streaming: at least one reply, and the channel-ownership rule is
-		// honored (drainReplies fails the spec on a closed channel).
-		ch := make(chan *pb.Reply, 64)
-		Expect(d.PredictStreamRich(req, ch)).To(Succeed())
-		replies := drainReplies(ch)
-		Expect(replies).ToNot(BeEmpty())
-		var streamed string
-		for _, r := range replies {
-			streamed += string(r.GetMessage())
-		}
-		Expect(streamed).ToNot(BeEmpty())
-	})
-
-	It("aborts an in-flight generation promptly on Cancel", func() {
-		d := &Dllm{}
-		// eb_max_steps inflates the per-block denoise loop so the full run
-		// takes ~10s on the tiny fixture (vs ~40ms at engine defaults; 16
-		// blocks, first block after ~0.7s) - long enough that a prompt
-		// post-cancel return is distinguishable from the generation simply
-		// finishing.
-		Expect(d.Load(&pb.ModelOptions{
-			ModelFile: os.Getenv("DLLM_TEST_TINY_MODEL"),
-			Options:   []string{"eb_max_steps:256"},
-		})).To(Succeed())
-		DeferCleanup(func() { Expect(d.Free()).To(Succeed()) })
-
-		ch := make(chan *pb.Reply, 64)
-		errCh := make(chan error, 1)
-		go func() {
-			defer GinkgoRecover()
-			errCh <- d.PredictStreamRich(&pb.PredictOptions{Prompt: "hello", Tokens: 256, Seed: 7}, ch)
-		}()
-
-		// Cancel only once the first block proves the generate is in
-		// flight: the C side resets the cancel flag on generate entry, so
-		// an earlier Cancel would be swallowed (dllm_capi.h race note).
-		Eventually(ch, "60s").Should(Receive())
-		cancelAt := time.Now()
-		d.Cancel()
-
-		// Uncancelled, ~10s of generation remain; the cancelled call must
-		// come back in milliseconds (the flag is checked per denoise step).
-		var genErr error
-		Eventually(errCh, "5s").Should(Receive(&genErr))
-		latency := time.Since(cancelAt)
-		Expect(genErr).To(MatchError(ContainSubstring("cancelled")))
-		GinkgoWriter.Printf("dllm cancel: PredictStreamRich returned %v after Cancel\n", latency)
-	})
-})
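
The `splitValidUTF8` specs in the removed test file pin down a carry-over contract: a trailing incomplete UTF-8 sequence is held back for the next block, while bytes that can never complete are replaced with U+FFFD on the spot. The helper's real implementation is not part of this hunk, so the following is only a minimal sketch of one way to satisfy those specs, built on the standard `unicode/utf8` package. The name `splitValidUTF8Sketch` is hypothetical and chosen to avoid implying it is the removed function.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// splitValidUTF8Sketch is a hypothetical stand-in for the removed helper.
// It prepends the previous carry to the new block, emits everything that is
// already valid UTF-8, and returns a trailing incomplete sequence as the new
// carry so the next block can complete it.
func splitValidUTF8Sketch(carry, block string) (emit, rest string) {
	data := carry + block
	var b strings.Builder
	for i := 0; i < len(data); {
		r, size := utf8.DecodeRuneInString(data[i:])
		if r == utf8.RuneError && size == 1 {
			// FullRuneInString is false only when the remaining bytes are a
			// valid but incomplete prefix of a rune: hold them back.
			if !utf8.FullRuneInString(data[i:]) {
				return b.String(), data[i:]
			}
			// Otherwise the byte can never complete: substitute U+FFFD now
			// so the carry cannot grow without bound.
			b.WriteRune(utf8.RuneError)
			i++
			continue
		}
		b.WriteString(data[i : i+size])
		i += size
	}
	return b.String(), ""
}

func main() {
	// "€" (E2 82 AC) split across two blocks as E2 | 82 AC.
	emit, rest := splitValidUTF8Sketch("", "caf\xe2")
	fmt.Printf("%q %q\n", emit, rest)
	emit, rest = splitValidUTF8Sketch(rest, "\x82\xac ok")
	fmt.Printf("%q %q\n", emit, rest)
}
```

At stream end the caller still has to flush any remaining carry, substituting U+FFFD for it, which is what the "incomplete sequence left at stream end" spec exercises; the sketch leaves that step to the caller.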
--- a/backend/go/dllm/gemma4_parser.go
+++ b/backend/go/dllm/gemma4_parser.go
@@ -1,562 +0,0 @@
-// Gemma4 (DiffusionGemma) streaming output parser: raw model text, fed in
-// arbitrary fragments (per committed diffusion block; a fragment can split
-// anywhere, including mid-marker and mid-payload), is turned into
-// pb.ChatDelta events (content / reasoning_content / tool_calls).
-//
-// Normative sources:
-//   - The chat template embedded at the top of gemma4_renderer.go ("tpl L<n>"
-//     citations below refer to its numbered lines). The OUTPUT format mirrors
-//     what the template renders for assistant history: thought channels
-//     (<|channel>thought\n ... <channel|>, tpl L240), tool calls
-//     (<|tool_call>call:name{...}<tool_call|>, tpl L246-L257) and turn ends
-//     (<turn|>, tpl L351).
-//   - vLLM PR #45163: vllm/tool_parsers/gemma4_tool_parser.py (marker
-//     handling, the call:name{...} argument grammar and its decoder, ported
-//     below) and vllm/reasoning/gemma4_reasoning_parser.py (channel markers,
-//     the "thought\n" role label, is_reasoning_end semantics).
-//
-// Initial state (derived from the generation prompt, tpl L356-L362, see
-// RenderGemma4):
-//   - enable_thinking=false: the prompt ends with "<|turn>model\n" +
-//     "<|channel>thought\n<channel|>" - an EMPTY thought channel, pre-opened
-//     AND pre-closed by the template. The model's output therefore starts in
-//     plain content. Use NewGemma4Parser(false).
-//   - enable_thinking=true: the prompt ends at "<|turn>model\n" and the model
-//     opens and closes its own thought channel in the OUTPUT
-//     ("<|channel>thought\n...reasoning...<channel|>final answer", per the
-//     vLLM Gemma4ReasoningParser docstring). The parser still starts in
-//     content state - the channel markers in the output drive the switch.
-//     Use NewGemma4Parser(false) here too.
-//   - NewGemma4Parser(true) is for callers that pre-open the thought channel
-//     in the prompt themselves (appending "<|channel>thought\n" after the
-//     generation prompt to force thinking): the output then begins mid-thought
-//     and everything is reasoning until the first <channel|>.
-//
-// State diagram (markers are consumed, never emitted):
-//
-//	             <|channel>                  \n (channel name dropped: the
-//	[content] --------------> [chan-header] ----> [thought]   "thought\n" role
-//	   ^ |  <channel|> (stray close: swallowed,                label, stripped
-//	   +-+  strip_thinking semantics, tpl L148-L158)           like vLLM does)
-//	   ^                  <channel|>
-//	   +----------------------------------------- [thought]
-//	   ^                  <tool_call|>                 | <|tool_call> (implicit
-//	   +-------------- [tool-call] <-------------------+  reasoning end, vLLM
-//	   |  <|tool_call>     ^                               is_reasoning_end)
-//	   +-------------------+
-//	[content]/[thought] --- <turn|> ---> [done]  (everything after is dropped)
-//
-// Buffering rules:
-//   - content/thought states hold back at most len(longest marker)-1 bytes:
-//     the longest tail that is still a proper prefix of a watched marker.
-//     Content is otherwise emitted immediately (no unbounded buffering).
-//   - the tool-call state buffers the whole payload until <tool_call|>. This
-//     is unbounded in principle but bounded in practice by the model's
-//     diffusion canvas, and is required because the call:name{...} payload
-//     only becomes decodable (and trustworthy) once complete - the same
-//     reason vLLM's parser accumulates before parsing.
-//   - Close() flushes whatever is still held: partial markers come out as
-//     content/reasoning (per the state that held them); an unterminated
-//     channel header or tool-call payload is re-emitted RAW (including its
-//     opening marker) as content - malformed output is never silently
-//     dropped (mirrors vLLM extract_tool_calls returning the raw text as
-//     content when its regex does not match).
-//
-// Streaming granularity DIVERGENCE from vLLM: vLLM re-parses the partial
-// payload on every token and streams argument-JSON diffs (its `partial=True`
-// decoder mode plus withholding logic exist only for that). Our fragments are
-// whole committed diffusion blocks, so each completed tool call is emitted
-// once, as a single ToolCallDelta carrying index + id + name + the full
-// arguments JSON - exactly the shape backend/python/vllm/backend.py emits
-// per call and pkg/functions.ToolCallsFromChatDeltas re-accumulates.
-package main
-
-import (
-	"encoding/json"
-	"regexp"
-	"strconv"
-	"strings"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-)
-
-// gemma4CallRE is vLLM's tool_call_regex
-// (`<\|tool_call>call:([\w\-\.]+)\{(.*?)\}<tool_call\|>`, DOTALL) anchored to
-// a single already-extracted payload: name charset [\w\-.], braces mandatory.
-var gemma4CallRE = regexp.MustCompile(`(?s)^call:([\w\-.]+)\{(.*)\}$`)
-
-type g4State int
-
-const (
-	g4Content g4State = iota
-	g4ChanHeader
-	g4Thought
-	g4ToolCall
-	g4Done
-)
-
-// Markers watched per emitting state. A stray <tool_call|> outside a tool
-// call is deliberately NOT watched: it passes through verbatim, consistent
-// with the malformed-payload fallback re-emitting it as content.
-var (
-	gemma4ContentMarkers = []string{gemma4ChannelOpen, gemma4ChannelClose, gemma4ToolCallOpen, gemma4TurnEnd}
-	gemma4ThoughtMarkers = []string{gemma4ChannelClose, gemma4ToolCallOpen, gemma4TurnEnd}
-)
-
-type Gemma4Parser struct {
-	state g4State
-	// held is the per-state carry-over between Feed calls: a partial marker
-	// (content/thought), a partial channel header (chan-header) or the
-	// payload accumulated so far (tool-call).
-	held    string
-	toolIdx int
-}
-
-// NewGemma4Parser returns a parser positioned per the initial-state rules in
-// the header comment: startInThought=true only when the caller pre-opened a
-// thought channel in the prompt.
-func NewGemma4Parser(startInThought bool) *Gemma4Parser {
-	state := g4Content
-	if startInThought {
-		state = g4Thought
-	}
-	return &Gemma4Parser{state: state}
-}
-
-// Feed consumes the next output fragment and returns the deltas it completes.
-func (p *Gemma4Parser) Feed(text string) []*pb.ChatDelta {
-	if text == "" || p.state == g4Done {
-		return nil
-	}
-	pending := p.held + text
-	p.held = ""
-	var em g4Emitter
-	for pending != "" {
-		switch p.state {
-		case g4Content, g4Thought:
-			markers := gemma4ContentMarkers
-			if p.state == g4Thought {
-				markers = gemma4ThoughtMarkers
-			}
-			idx, marker := findEarliestGemma4Marker(pending, markers)
-			if idx == -1 {
-				hold := gemma4MarkerHoldback(pending, markers)
-				p.emitText(&em, pending[:len(pending)-hold])
-				p.held = pending[len(pending)-hold:]
-				pending = ""
-				continue
-			}
-			p.emitText(&em, pending[:idx])
-			pending = pending[idx+len(marker):]
-			switch marker {
-			case gemma4ChannelOpen:
-				p.state = g4ChanHeader
-			case gemma4ChannelClose:
-				// In thought: channel ends. In content: stray close,
-				// swallowed (strip_thinking keeps both sides, tpl L148-L158).
-				p.state = g4Content
-			case gemma4ToolCallOpen:
-				p.state = g4ToolCall
-			case gemma4TurnEnd:
-				p.state = g4Done
-			}
-		case g4ChanHeader:
-			// The channel header is "<name>\n"; the template only ever writes
-			// "thought" (tpl L240/L360) and the label is structural, so it is
-			// dropped, not emitted (vLLM strips the same "thought\n" prefix).
-			nl := strings.IndexByte(pending, '\n')
-			if nl == -1 {
-				p.held = pending
-				pending = ""
-				continue
-			}
-			pending = pending[nl+1:]
-			p.state = g4Thought
-		case g4ToolCall:
-			end := strings.Index(pending, gemma4ToolCallClose)
-			if end == -1 {
-				p.held = pending
-				pending = ""
-				continue
-			}
-			p.emitToolCall(&em, pending[:end])
-			pending = pending[end+len(gemma4ToolCallClose):]
-			p.state = g4Content
-		case g4Done:
-			pending = ""
-		}
-	}
-	return em.deltas
-}
-
-// Close flushes held-back partials. Incomplete structures (open channel
-// header, unterminated tool payload) are re-emitted raw as content rather
-// than dropped. The parser is finished afterwards.
-func (p *Gemma4Parser) Close() []*pb.ChatDelta {
-	var em g4Emitter
-	switch p.state {
-	case g4Content:
-		em.content(p.held)
-	case g4Thought:
-		em.reasoning(p.held)
-	case g4ChanHeader:
-		em.content(gemma4ChannelOpen + p.held)
-	case g4ToolCall:
-		em.content(gemma4ToolCallOpen + p.held)
-	case g4Done:
-	}
-	p.held = ""
-	p.state = g4Done
-	return em.deltas
-}
-
-func (p *Gemma4Parser) emitText(em *g4Emitter, s string) {
-	if p.state == g4Thought {
-		em.reasoning(s)
-		return
-	}
-	em.content(s)
-}
-
-// emitToolCall decodes one complete <|tool_call>...<tool_call|> payload. On a
-// payload that does not match call:name{...} the raw text (markers included)
-// is emitted as content, mirroring vLLM's extract_tool_calls fallback.
-func (p *Gemma4Parser) emitToolCall(em *g4Emitter, payload string) {
-	m := gemma4CallRE.FindStringSubmatch(payload)
-	if m == nil {
-		em.content(gemma4ToolCallOpen + payload + gemma4ToolCallClose)
-		return
-	}
-	// Index-based ids: deterministic (the split-invariance property relies
-	// on it) and matching the call_<n> convention of pkg/grpc/rich_test.go;
-	// core only needs ids to be non-empty and unique within the response.
-	em.tool(p.toolIdx, "call_"+strconv.Itoa(p.toolIdx), m[1], decodeGemma4Args(m[2], 0))
-	p.toolIdx++
-}
-
-// g4Emitter collects ChatDeltas; empty text events are dropped.
-type g4Emitter struct {
-	deltas []*pb.ChatDelta
-}
-
-func (e *g4Emitter) content(s string) {
-	if s != "" {
-		e.deltas = append(e.deltas, &pb.ChatDelta{Content: s})
-	}
-}
-
-func (e *g4Emitter) reasoning(s string) {
-	if s != "" {
-		e.deltas = append(e.deltas, &pb.ChatDelta{ReasoningContent: s})
-	}
-}
-
-func (e *g4Emitter) tool(index int, id, name, argsJSON string) {
-	e.deltas = append(e.deltas, &pb.ChatDelta{ToolCalls: []*pb.ToolCallDelta{{
-		Index:     int32(index),
-		Id:        id,
-		Name:      name,
-		Arguments: argsJSON,
-	}}})
-}
-
-// findEarliestGemma4Marker returns the position and value of the first
-// complete marker occurrence, or (-1, "").
-func findEarliestGemma4Marker(s string, markers []string) (int, string) {
-	best, bestMarker := -1, ""
-	for _, m := range markers {
-		if idx := strings.Index(s, m); idx >= 0 && (best == -1 || idx < best) {
-			best, bestMarker = idx, m
-		}
-	}
-	return best, bestMarker
-}
-
-// gemma4MarkerHoldback returns the length of the longest suffix of s that is
-// a proper prefix of a watched marker - the only bytes that may still grow
-// into a marker and therefore must not be emitted yet (bounded by the
-// longest marker, so content is never buffered unboundedly).
-func gemma4MarkerHoldback(s string, markers []string) int {
-	maxHold := 0
-	for _, m := range markers {
-		if len(m)-1 > maxHold {
-			maxHold = len(m) - 1
-		}
-	}
-	if len(s) < maxHold {
-		maxHold = len(s)
-	}
-	for k := maxHold; k >= 1; k-- {
-		tail := s[len(s)-k:]
-		for _, m := range markers {
-			if strings.HasPrefix(m, tail) {
-				return k
-			}
-		}
-	}
-	return 0
-}
-
-// ---------------------------------------------------------------------------
-// call:name{...} argument decoder
-//
-// Port of vLLM's _parse_gemma4_args / _parse_gemma4_array /
-// _parse_gemma4_value (gemma4_tool_parser.py) in non-partial mode only: this
-// parser decodes exclusively COMPLETE payloads (incomplete ones fall back to
-// raw content at Close), so vLLM's partial-withholding machinery
-// (trailing-dot floats, withheld bare tails) is intentionally not ported.
-//
-// Grammar (inverse of the renderer's formatGemma4Argument, tpl L118-L147):
-//
-//	args    := pair (',' pair)*
-//	pair    := key ':' value          (keys unquoted, up to the first ':')
-//	value   := string | object | array | bare
-//	string  := '<|"|>' ... '<|"|>'    (no escapes; unterminated -> rest)
-//	object  := '{' args '}'           (delimited strings skipped when
-//	array   := '[' value,* ']'         counting braces/brackets)
-//	bare    := true | false | null/none/nil | number | bare-string
-//
-// Output is a JSON object/array string with keys in payload order (Python
-// dict insertion order), built with HTML escaping off so payload text
-// survives byte-for-byte.
-// ---------------------------------------------------------------------------
-
-func isGemma4Space(c byte) bool { return c == ' ' || c == '\n' || c == '\t' }
-
-// gemma4MaxArgsDepth caps the mutual recursion between decodeGemma4Args and
-// decodeGemma4Array. Defense against model-generated deep nesting: a Go stack
-// overflow is a fatal process kill, not a recoverable error, so past the cap
-// a nested body gracefully degrades to a JSON string of its raw text.
-const gemma4MaxArgsDepth = 100
-
-// decodeGemma4Args decodes one args body (the text between the outer braces
-// of call:name{...}) into a JSON object string. depth is the current nesting
-// level (0 at the payload root); see gemma4MaxArgsDepth.
-func decodeGemma4Args(s string, depth int) string {
-	if depth > gemma4MaxArgsDepth {
-		return gemma4JSONString(s)
-	}
-	var b strings.Builder
-	b.WriteString("{")
-	first := true
-	pair := func(key, val string) {
-		if !first {
-			b.WriteString(",")
-		}
-		first = false
-		b.WriteString(gemma4JSONString(key))
-		b.WriteString(":")
-		b.WriteString(val)
-	}
-	i, n := 0, len(s)
-	for i < n {
-		for i < n && (isGemma4Space(s[i]) || s[i] == ',') {
-			i++
-		}
-		if i >= n {
-			break
-		}
-		keyStart := i
-		for i < n && s[i] != ':' {
-			i++
-		}
-		if i >= n {
-			break // no ':' -> trailing junk, dropped (vLLM does the same)
-		}
-		key := strings.TrimSpace(s[keyStart:i])
-		i++ // skip ':'
-		for i < n && isGemma4Space(s[i]) {
-			i++
-		}
-		if i >= n {
-			pair(key, `""`) // "key:" with nothing after -> empty string
-			break
-		}
-		switch {
-		case strings.HasPrefix(s[i:], gemma4StringDelim):
-			i += len(gemma4StringDelim)
-			if end := strings.Index(s[i:], gemma4StringDelim); end == -1 {
-				pair(key, gemma4JSONString(s[i:])) // unterminated -> take rest
-				i = n
-			} else {
-				pair(key, gemma4JSONString(s[i:i+end]))
-				i += end + len(gemma4StringDelim)
-			}
-		case s[i] == '{':
-			inner, next := scanGemma4Balanced(s, i, '{', '}')
-			pair(key, decodeGemma4Args(inner, depth+1))
-			i = next
-		case s[i] == '[':
-			inner, next := scanGemma4Balanced(s, i, '[', ']')
-			pair(key, decodeGemma4Array(inner, depth+1))
-			i = next
-		default:
-			valStart := i
-			for i < n && s[i] != ',' && s[i] != '}' && s[i] != ']' {
-				i++
-			}
-			if i == valStart {
-				// No progress (value starts on a stray '}'/']'): abort on
-				// malformed input rather than loop, like vLLM.
-				i = n
-				continue
-			}
-			pair(key, decodeGemma4Bare(s[valStart:i]))
-		}
-	}
-	b.WriteString("}")
-	return b.String()
-}
-
-// decodeGemma4Array decodes one array body (the text between '[' and ']')
-// into a JSON array string. depth is the current nesting level; see
-// gemma4MaxArgsDepth.
-func decodeGemma4Array(s string, depth int) string {
-	if depth > gemma4MaxArgsDepth {
-		return gemma4JSONString(s)
-	}
-	var b strings.Builder
-	b.WriteString("[")
-	first := true
-	item := func(val string) {
-		if !first {
-			b.WriteString(",")
-		}
-		first = false
-		b.WriteString(val)
-	}
-	i, n := 0, len(s)
-	for i < n {
-		for i < n && (isGemma4Space(s[i]) || s[i] == ',') {
-			i++
-		}
-		if i >= n {
-			break
-		}
-		switch {
-		case strings.HasPrefix(s[i:], gemma4StringDelim):
-			i += len(gemma4StringDelim)
-			if end := strings.Index(s[i:], gemma4StringDelim); end == -1 {
-				item(gemma4JSONString(s[i:]))
-				i = n
-			} else {
-				item(gemma4JSONString(s[i : i+end]))
-				i += end + len(gemma4StringDelim)
-			}
-		case s[i] == '{':
-			inner, next := scanGemma4Balanced(s, i, '{', '}')
-			item(decodeGemma4Args(inner, depth+1))
-			i = next
-		case s[i] == '[':
-			inner, next := scanGemma4Balanced(s, i, '[', ']')
-			item(decodeGemma4Array(inner, depth+1))
-			i = next
-		default:
-			valStart := i
-			for i < n && s[i] != ',' && s[i] != ']' {
-				i++
-			}
-			if i == valStart {
-				i = n // no progress: abort on malformed input, like vLLM
-				continue
-			}
-			item(decodeGemma4Bare(s[valStart:i]))
-		}
-	}
-	b.WriteString("]")
-	return b.String()
-}
-
-// scanGemma4Balanced scans a brace/bracket-balanced span starting at the
-// opener s[start], skipping over <|"|>-delimited strings so structural
-// characters inside them do not count (vLLM's depth scan). Returns the inner
-// text and the index just past the closer; an unterminated span yields the
-// rest of the string (the inner decoder still extracts what is there - this
-// path is only reachable from genuinely malformed complete payloads).
-func scanGemma4Balanced(s string, start int, open, close byte) (string, int) {
-	depth := 1
-	i := start + 1
-	innerStart := i
-	n := len(s)
-	for i < n && depth > 0 {
-		if strings.HasPrefix(s[i:], gemma4StringDelim) {
-			i += len(gemma4StringDelim)
-			if nd := strings.Index(s[i:], gemma4StringDelim); nd == -1 {
-				i = n
-			} else {
-				i += nd + len(gemma4StringDelim)
-			}
-			continue
-		}
-		switch s[i] {
-		case open:
-			depth++
-		case close:
-			depth--
-		}
-		i++
-	}
-	if depth > 0 {
-		return s[innerStart:], n
-	}
-	return s[innerStart : i-1], i
-}
-
-// decodeGemma4Bare maps an undelimited value to its JSON form: booleans,
-// null aliases (null/none/nil, case-insensitive - the renderer writes
-// Python None as "None", tpl L144-L145 via format_argument's else branch),
-// and numbers (vLLM's rule: a value containing '.' is tried as a float,
-// otherwise as an int). Anything that fails to parse becomes a JSON string.
-func decodeGemma4Bare(raw string) string {
-	v := strings.TrimSpace(raw)
-	if v == "" {
-		return `""`
-	}
-	if v == "true" || v == "false" {
-		return v
-	}
-	switch strings.ToLower(v) {
-	case "null", "none", "nil":
-		return "null"
-	}
-	if strings.Contains(v, ".") {
-		if f, err := strconv.ParseFloat(v, 64); err == nil {
-			return formatGemma4Float(f)
-		}
-	} else if iv, err := strconv.ParseInt(v, 10, 64); err == nil {
-		return strconv.FormatInt(iv, 10)
-	}
-	return gemma4JSONString(v)
-}
-
-// formatGemma4Float renders like Python's json.dumps(float): integral floats
-// keep a ".0" suffix ("108." decodes to 108.0, not 108), so the arguments
-// JSON matches what vLLM would have produced for the same payload.
-func formatGemma4Float(f float64) string {
-	s := strconv.FormatFloat(f, 'g', -1, 64)
-	if !strings.ContainsAny(s, ".eE") {
-		s += ".0"
-	}
-	return s
-}
-
-// gemma4JSONString encodes a JSON string WITHOUT HTML escaping (json.Marshal
-// would escape the angle brackets in "<div>" to \u003c / \u003e sequences;
-// payload text should survive byte-for-byte, like Python's
-// json.dumps(ensure_ascii=False)).
-func gemma4JSONString(s string) string {
-	var sb strings.Builder
-	enc := json.NewEncoder(&sb)
-	enc.SetEscapeHTML(false)
-	if err := enc.Encode(s); err != nil {
-		// Unreachable for plain strings; fall back to default escaping
-		// rather than emitting invalid JSON.
-		b, mErr := json.Marshal(s)
-		if mErr != nil {
-			return `""`
-		}
-		return string(b)
-	}
-	// Encode appends a trailing newline.
-	return strings.TrimSuffix(sb.String(), "\n")
-}
--- a/backend/go/dllm/gemma4_parser_test.go
+++ b/backend/go/dllm/gemma4_parser_test.go
@@ -1,592 +0,0 @@
-package main
-
-// Parser specs for Gemma4Parser (model output text -> pb.ChatDelta events).
-//
-// Fixture provenance:
-//   - Entries marked "vLLM: <name>" are direct ports of the named test from
-//     vLLM PR #45163, tests/tool_parsers/test_gemma4_tool_parser.py (the
-//     authoritative test-suite for the gemma4 tool-call wire format). The
-//     streaming tests' chunk lists are reused verbatim as Feed fragments.
-//   - Decoder entries port the TestParseGemma4Args / TestParseGemma4Array
-//     classes from the same file (non-partial mode only; this parser never
-//     decodes partial payloads, see the divergence note in gemma4_parser.go).
-//   - Channel/turn-marker expectations come from the chat template embedded
-//     in gemma4_renderer.go (tpl L356-L362 generation prompt, L148-L158
-//     strip_thinking) and vLLM's Gemma4ReasoningParser
-//     (vllm/reasoning/gemma4_reasoning_parser.py).
-
-import (
-	"encoding/json"
-	"fmt"
-	"strings"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-)
-
-// flatGemma4Tool is one accumulated tool call, mirroring how LocalAI core
-// folds ToolCallDelta streams (pkg/functions/chat_deltas.go
-// ToolCallsFromChatDeltas: name/id latch on first non-empty, arguments
-// concatenate per index). Tests flatten through the same rules so they
-// assert exactly what core will reconstruct.
-type flatGemma4Tool struct {
-	id   string
-	name string
-	args string
-}
-
-func flattenGemma4Deltas(deltas []*pb.ChatDelta) (string, string, []flatGemma4Tool) {
-	var content, reasoning strings.Builder
-	byIndex := map[int32]*flatGemma4Tool{}
-	maxIdx := int32(-1)
-	for _, d := range deltas {
-		content.WriteString(d.GetContent())
-		reasoning.WriteString(d.GetReasoningContent())
-		for _, tc := range d.GetToolCalls() {
-			acc, ok := byIndex[tc.GetIndex()]
-			if !ok {
-				acc = &flatGemma4Tool{}
-				byIndex[tc.GetIndex()] = acc
-			}
-			if tc.GetName() != "" {
-				acc.name = tc.GetName()
-			}
-			if tc.GetId() != "" {
-				acc.id = tc.GetId()
-			}
-			acc.args += tc.GetArguments()
-			if tc.GetIndex() > maxIdx {
-				maxIdx = tc.GetIndex()
-			}
-		}
-	}
-	var tools []flatGemma4Tool
-	for i := int32(0); i <= maxIdx; i++ {
-		if acc, ok := byIndex[i]; ok {
-			tools = append(tools, *acc)
-		}
-	}
-	return content.String(), reasoning.String(), tools
-}
-
-type wantGemma4Tool struct {
-	name     string
-	argsJSON string // compared with MatchJSON (key order irrelevant)
-}
-
-type parseGemma4Case struct {
-	startInThought bool
-	fragments      []string
-	wantContent    string
-	wantReasoning  string
-	wantTools      []wantGemma4Tool
-}
-
-func parseGemma4Fragments(startInThought bool, fragments []string) []*pb.ChatDelta {
-	p := NewGemma4Parser(startInThought)
-	var all []*pb.ChatDelta
-	for _, f := range fragments {
-		all = append(all, p.Feed(f)...)
-	}
-	return append(all, p.Close()...)
-}
-
-var _ = Describe("Gemma4Parser", func() {
-	DescribeTable("parses streamed gemma4 output into ChatDeltas",
-		func(c parseGemma4Case) {
-			content, reasoning, tools := flattenGemma4Deltas(parseGemma4Fragments(c.startInThought, c.fragments))
-			Expect(content).To(Equal(c.wantContent))
-			Expect(reasoning).To(Equal(c.wantReasoning))
-			Expect(tools).To(HaveLen(len(c.wantTools)))
-			seenIDs := map[string]bool{}
-			for i, want := range c.wantTools {
-				Expect(tools[i].name).To(Equal(want.name), "tool %d name", i)
-				Expect(tools[i].args).To(MatchJSON(want.argsJSON), "tool %d arguments", i)
-				Expect(tools[i].id).ToNot(BeEmpty(), "tool %d id", i)
-				Expect(seenIDs).ToNot(HaveKey(tools[i].id), "tool %d id must be unique", i)
-				seenIDs[tools[i].id] = true
-			}
-		},
-
-		// --- (1) pure content -------------------------------------------------
-		// vLLM: test_no_tool_calls
-		Entry("pure content, single fragment", parseGemma4Case{
-			fragments:   []string{"Hello, how can I help you today?"},
-			wantContent: "Hello, how can I help you today?",
-		}),
-
-		// --- (2) thought -> final transition ----------------------------------
-		// enable_thinking render: prompt ends at <|turn>model\n and the model
-		// opens/closes its own thought channel in the OUTPUT (vLLM
-		// Gemma4ReasoningParser docstring; tpl L356-L362). The "thought\n"
-		// role label after <|channel> is structural and must be stripped
-		// (vLLM _THOUGHT_PREFIX handling).
-		Entry("thought channel then final content", parseGemma4Case{
-			fragments:     []string{"<|channel>thought\nLet me think about this.\n<channel|>The answer is 42."},
-			wantReasoning: "Let me think about this.\n",
-			wantContent:   "The answer is 42.",
-		}),
-
-		// --- (3) startInThought both ways -------------------------------------
-		Entry("startInThought=true routes initial text to reasoning until <channel|>", parseGemma4Case{
-			startInThought: true,
-			fragments:      []string{"I am thinking hard.<channel|>Done."},
-			wantReasoning:  "I am thinking hard.",
-			wantContent:    "Done.",
-		}),
-		// A stray <channel|> with no open channel is swallowed, matching the
-		// template's strip_thinking (tpl L148-L158: the marker is dropped,
-		// text on both sides is kept).
-		Entry("startInThought=false keeps the same text as content, stray <channel|> swallowed", parseGemma4Case{
-			startInThought: false,
-			fragments:      []string{"I am thinking hard.<channel|>Done."},
-			wantContent:    "I am thinking hard.Done.",
-		}),
-
-		// --- (4) one tool call, full payload type zoo --------------------------
-		Entry("single tool call: strings, numbers, bools, null, nested object and array", parseGemma4Case{
-			fragments: []string{`<|tool_call>call:complex_function{text:<|"|>with, comma and {braces}<|"|>,count:42,score:3.14,yes:true,no:false,nothing:null,obj:{inner:<|"|>v<|"|>,k:1},arr:[<|"|>a<|"|>,2,true]}<tool_call|>`},
-			wantTools: []wantGemma4Tool{{
-				name:     "complex_function",
-				argsJSON: `{"text":"with, comma and {braces}","count":42,"score":3.14,"yes":true,"no":false,"nothing":null,"obj":{"inner":"v","k":1},"arr":["a",2,true]}`,
-			}},
-		}),
-
-		// --- (5) payload split across 3 fragments ------------------------------
-		Entry("tool-call payload split across three fragments", parseGemma4Case{
-			fragments: []string{
-				"<|tool_call>call:get_weather{loc",
-				`ation:<|"|>Paris, Fra`,
-				`nce<|"|>}<tool_call|>`,
-			},
-			wantTools: []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"Paris, France"}`}},
-		}),
-
-		// --- (6) marker split across fragments ----------------------------------
-		Entry("tool-call open marker split across fragments", parseGemma4Case{
-			fragments: []string{
-				"<|tool_ca",
-				`ll>call:get_weather{location:<|"|>London<|"|>}<tool_call|>`,
-			},
-			wantTools: []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"London"}`}},
-		}),
-		Entry("channel open marker split across fragments", parseGemma4Case{
-			fragments: []string{
-				"<|chan",
-				"nel>thought\ndeep thought<channel|>final",
-			},
-			wantReasoning: "deep thought",
-			wantContent:   "final",
-		}),
-
-		// --- (7) trailing partial marker held, flushed by Close -----------------
-		Entry("trailing partial marker is held back and flushed by Close", parseGemma4Case{
-			fragments:   []string{"Hello <|tool"},
-			wantContent: "Hello <|tool",
-		}),
-
-		// --- (8) malformed/incomplete payload -> content fallback ---------------
-		// vLLM: test_incomplete_tool_call (no end marker: the whole text stays
-		// content, never silently dropped).
-		Entry("incomplete tool payload at Close is emitted as raw content", parseGemma4Case{
-			fragments:   []string{`<|tool_call>call:get_weather{location:<|"|>London`},
-			wantContent: `<|tool_call>call:get_weather{location:<|"|>London`,
-		}),
-		Entry("malformed complete payload is emitted as raw content, parsing continues", parseGemma4Case{
-			fragments:   []string{"<|tool_call>oops no call syntax<tool_call|> done"},
-			wantContent: "<|tool_call>oops no call syntax<tool_call|> done",
-		}),
-
-		// --- (9) <turn|> ends the turn -------------------------------------------
-		Entry("text after <turn|> is ignored, including later fragments", parseGemma4Case{
-			fragments: []string{
-				"before<turn|>after",
-				`more <|tool_call>call:f{}<tool_call|>`,
-			},
-			wantContent: "before",
-		}),
-		Entry("<turn|> inside a thought channel ends the turn", parseGemma4Case{
-			startInThought: true,
-			fragments:      []string{"thinking<turn|>ignored"},
-			wantReasoning:  "thinking",
-		}),
-
-		// --- (10) ported vLLM non-streaming cases ---------------------------------
-		// vLLM: test_single_tool_call
-		Entry("vLLM: test_single_tool_call", parseGemma4Case{
-			fragments: []string{`<|tool_call>call:get_weather{location:<|"|>London<|"|>}<tool_call|>`},
-			wantTools: []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"London"}`}},
-		}),
-		// vLLM: test_multiple_arguments
-		Entry("vLLM: test_multiple_arguments", parseGemma4Case{
-			fragments: []string{`<|tool_call>call:get_weather{location:<|"|>San Francisco<|"|>,unit:<|"|>celsius<|"|>}<tool_call|>`},
-			wantTools: []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"San Francisco","unit":"celsius"}`}},
-		}),
-		// vLLM: test_text_before_tool_call. DIVERGENCE: vLLM's non-streaming
-		// extractor trims the content ("...you."); a streaming parser cannot
-		// retroactively trim already-emitted text, so the trailing space is
-		// kept (vLLM's own streaming path keeps it too, see
-		// test_streaming_text_before_tool_call which only checks a prefix).
-		Entry("vLLM: test_text_before_tool_call (streaming semantics: no trim)", parseGemma4Case{
-			fragments:   []string{`Let me check the weather for you. <|tool_call>call:get_weather{location:<|"|>Paris<|"|>}<tool_call|>`},
-			wantContent: "Let me check the weather for you. ",
-			wantTools:   []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"Paris"}`}},
-		}),
-		// vLLM: test_multiple_tool_calls (also covers case 11: multi-tool sequence)
-		Entry("vLLM: test_multiple_tool_calls", parseGemma4Case{
-			fragments: []string{`<|tool_call>call:get_weather{location:<|"|>London<|"|>}<tool_call|><|tool_call>call:get_time{location:<|"|>London<|"|>}<tool_call|>`},
-			wantTools: []wantGemma4Tool{
-				{name: "get_weather", argsJSON: `{"location":"London"}`},
-				{name: "get_time", argsJSON: `{"location":"London"}`},
-			},
-		}),
-		// vLLM: test_nested_arguments
-		Entry("vLLM: test_nested_arguments", parseGemma4Case{
-			fragments: []string{`<|tool_call>call:complex_function{nested:{inner:<|"|>value<|"|>},list:[<|"|>a<|"|>,<|"|>b<|"|>]}<tool_call|>`},
-			wantTools: []wantGemma4Tool{{name: "complex_function", argsJSON: `{"nested":{"inner":"value"},"list":["a","b"]}`}},
-		}),
-		// vLLM: test_tool_call_with_number_and_boolean
-		Entry("vLLM: test_tool_call_with_number_and_boolean", parseGemma4Case{
-			fragments: []string{`<|tool_call>call:set_status{is_active:true,count:42,score:3.14}<tool_call|>`},
-			wantTools: []wantGemma4Tool{{name: "set_status", argsJSON: `{"is_active":true,"count":42,"score":3.14}`}},
-		}),
-		// vLLM: test_hyphenated_function_name
-		Entry("vLLM: test_hyphenated_function_name", parseGemma4Case{
-			fragments: []string{`<|tool_call>call:get-weather{location:<|"|>London<|"|>}<tool_call|>`},
-			wantTools: []wantGemma4Tool{{name: "get-weather", argsJSON: `{"location":"London"}`}},
-		}),
-		// vLLM: test_dotted_function_name
-		Entry("vLLM: test_dotted_function_name", parseGemma4Case{
-			fragments: []string{`<|tool_call>call:weather.get{location:<|"|>London<|"|>}<tool_call|>`},
-			wantTools: []wantGemma4Tool{{name: "weather.get", argsJSON: `{"location":"London"}`}},
-		}),
-		// vLLM: test_no_arguments
-		Entry("vLLM: test_no_arguments", parseGemma4Case{
-			fragments: []string{"<|tool_call>call:get_status{}<tool_call|>"},
-			wantTools: []wantGemma4Tool{{name: "get_status", argsJSON: `{}`}},
-		}),
-
-		// --- ported vLLM streaming cases (chunk lists reused as fragments) --------
-		// vLLM: test_basic_streaming_single_tool
-		Entry("vLLM: test_basic_streaming_single_tool", parseGemma4Case{
-			fragments: []string{
-				"<|tool_call>",
-				"call:get_weather{",
-				`location:<|"|>Paris`,
-				", France",
-				`<|"|>}`,
-				"<tool_call|>",
-			},
-			wantTools: []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"Paris, France"}`}},
-		}),
-		// vLLM: test_streaming_multi_arg
-		Entry("vLLM: test_streaming_multi_arg", parseGemma4Case{
-			fragments: []string{
-				"<|tool_call>",
-				"call:get_weather{",
-				`location:<|"|>Tokyo<|"|>,`,
-				`unit:<|"|>celsius<|"|>}`,
-				"<tool_call|>",
-			},
-			wantTools: []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"Tokyo","unit":"celsius"}`}},
-		}),
-		// vLLM: test_streaming_text_before_tool_call
-		Entry("vLLM: test_streaming_text_before_tool_call", parseGemma4Case{
-			fragments: []string{
-				"Let me check ",
-				"the weather. ",
-				"<|tool_call>",
-				"call:get_weather{",
-				`location:<|"|>London<|"|>}`,
-				"<tool_call|>",
-			},
-			wantContent: "Let me check the weather. ",
-			wantTools:   []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"London"}`}},
-		}),
-		// vLLM: test_streaming_numeric_args
-		Entry("vLLM: test_streaming_numeric_args", parseGemma4Case{
-			fragments: []string{
-				"<|tool_call>",
-				"call:set_config{",
-				"count:42,",
-				"active:true}",
-				"<tool_call|>",
-			},
-			wantTools: []wantGemma4Tool{{name: "set_config", argsJSON: `{"count":42,"active":true}`}},
-		}),
-		// vLLM: test_streaming_boolean_split_across_chunks
-		Entry("vLLM: test_streaming_boolean_split_across_chunks", parseGemma4Case{
-			fragments: []string{
-				"<|tool_call>",
-				"call:search{input:{all:tru",
-				"e}}",
-				"<tool_call|>",
-			},
-			wantTools: []wantGemma4Tool{{name: "search", argsJSON: `{"input":{"all":true}}`}},
-		}),
-		// vLLM: test_streaming_false_split_across_chunks
-		Entry("vLLM: test_streaming_false_split_across_chunks", parseGemma4Case{
-			fragments: []string{
-				"<|tool_call>",
-				"call:set{flag:fals",
-				"e}",
-				"<tool_call|>",
-			},
-			wantTools: []wantGemma4Tool{{name: "set", argsJSON: `{"flag":false}`}},
-		}),
-		// vLLM: test_streaming_number_split_across_chunks
-		Entry("vLLM: test_streaming_number_split_across_chunks", parseGemma4Case{
-			fragments: []string{
-				"<|tool_call>",
-				"call:set{count:4",
-				"2}",
-				"<tool_call|>",
-			},
-			wantTools: []wantGemma4Tool{{name: "set", argsJSON: `{"count":42}`}},
-		}),
-		// vLLM: test_streaming_empty_args
-		Entry("vLLM: test_streaming_empty_args", parseGemma4Case{
-			fragments: []string{
-				"<|tool_call>",
-				"call:get_status{}",
-				"<tool_call|>",
-			},
-			wantTools: []wantGemma4Tool{{name: "get_status", argsJSON: `{}`}},
-		}),
-		// vLLM: test_streaming_split_delimiter_no_invalid_json (a string
-		// delimiter <|"|> split across fragments must not leak into the JSON).
-		Entry("vLLM: test_streaming_split_delimiter_no_invalid_json", parseGemma4Case{
-			fragments: []string{
-				"<|tool_call>",
-				"call:todowrite{",
-				`content:<|"|>Buy milk<|`,
-				`"|>}`,
-				"<tool_call|>",
-			},
-			wantTools: []wantGemma4Tool{{name: "todowrite", argsJSON: `{"content":"Buy milk"}`}},
-		}),
-		// vLLM: test_streaming_does_not_duplicate_plain_text_after_tool_call
-		Entry("vLLM: test_streaming_does_not_duplicate_plain_text_after_tool_call", parseGemma4Case{
-			fragments: []string{
-				"<|tool_call>",
-				"call:get_weather{",
-				`location:<|"|>Paris<|"|>}`,
-				"<tool_call|><",
-				"div>",
-			},
-			wantContent: "<div>",
-			wantTools:   []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"Paris"}`}},
-		}),
-		// vLLM: test_streaming_html_argument_does_not_duplicate_tag_prefixes
-		Entry("vLLM: test_streaming_html_argument_does_not_duplicate_tag_prefixes", parseGemma4Case{
-			fragments: []string{
-				"<|tool_call>",
-				"call:write_file{",
-				`path:<|"|>index.html<|"|>,`,
-				`content:<|"|><!DOCTYPE html>` + "\n<",
-				`html lang="zh-CN">` + "\n<",
-				"head>\n    <",
-				`meta charset="UTF-8">` + "\n    <",
-				`meta name="viewport" content="width=device-width">` + "\n",
-				`<|"|>}`,
-				"<tool_call|>",
-			},
-			wantTools: []wantGemma4Tool{{
-				name:     "write_file",
-				argsJSON: `{"path":"index.html","content":"<!DOCTYPE html>\n<html lang=\"zh-CN\">\n<head>\n    <meta charset=\"UTF-8\">\n    <meta name=\"viewport\" content=\"width=device-width\">\n"}`,
-			}},
-		}),
-		// vLLM: test_streaming_single_chunk_complete_tool_call
-		Entry("vLLM: test_streaming_single_chunk_complete_tool_call", parseGemma4Case{
-			fragments: []string{`<|tool_call>call:name_a_color{color_hex:<|"|>00ff11<|"|>}<tool_call|>`},
-			wantTools: []wantGemma4Tool{{name: "name_a_color", argsJSON: `{"color_hex":"00ff11"}`}},
-		}),
-		// vLLM: test_streaming_multi_chunk_batched_tool_calls (two complete
-		// calls in ONE fragment; both must come out with distinct indices)
-		Entry("vLLM: test_streaming_multi_chunk_batched_tool_calls", parseGemma4Case{
-			fragments: []string{
-				`<|tool_call>call:get_weather{location:<|"|>London<|"|>}<tool_call|>` +
-					`<|tool_call>call:get_time{timezone:<|"|>GMT<|"|>}<tool_call|>`,
-			},
-			wantTools: []wantGemma4Tool{
-				{name: "get_weather", argsJSON: `{"location":"London"}`},
-				{name: "get_time", argsJSON: `{"timezone":"GMT"}`},
-			},
-		}),
-		// vLLM: test_streaming_trailing_bare_bool_not_duplicated
-		Entry("vLLM: test_streaming_trailing_bare_bool_not_duplicated", parseGemma4Case{
-			fragments: []string{
-				"<|tool_call>",
-				"call:Edit{",
-				`file_path:<|"|>src/env.py<|"|>,`,
-				`old_string:<|"|>old_val<|"|>,`,
-				`new_string:<|"|>new_val<|"|>,`,
-				"replace_all:",
-				"false}",
-				"<tool_call|>",
-			},
-			wantTools: []wantGemma4Tool{{
-				name:     "Edit",
-				argsJSON: `{"file_path":"src/env.py","old_string":"old_val","new_string":"new_val","replace_all":false}`,
-			}},
-		}),
-
-		// --- implicit reasoning end on <|tool_call> (vLLM is_reasoning_end:
-		// a tool_call token means reasoning is over) -----------------------------
-		Entry("tool call inside an open thought channel ends the reasoning", parseGemma4Case{
-			startInThought: true,
-			fragments:      []string{`need the weather<|tool_call>call:get_weather{location:<|"|>Rome<|"|>}<tool_call|>`},
-			wantReasoning:  "need the weather",
-			wantTools:      []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"Rome"}`}},
-		}),
-
-		// --- (12) empty fragments are no-ops --------------------------------------
-		Entry("empty fragments are no-ops", parseGemma4Case{
-			fragments:   []string{"", "Hello", "", "", " world", ""},
-			wantContent: "Hello world",
-		}),
-	)
-
-	It("returns no deltas for an empty fragment and after Close", func() {
-		p := NewGemma4Parser(false)
-		Expect(p.Feed("")).To(BeEmpty())
-		Expect(p.Feed("hi")).ToNot(BeEmpty())
-		Expect(p.Close()).To(BeEmpty()) // nothing held back
-		// The parser is finished after Close: further input is dropped.
-		Expect(p.Feed("more")).To(BeEmpty())
-		Expect(p.Close()).To(BeEmpty())
-	})
-
-	It("generates index-based tool call ids (call_<index>)", func() {
-		// Mirrors the index-based id convention of pkg/grpc/rich_test.go and
-		// keeps ids deterministic for the split-invariance property below.
-		deltas := parseGemma4Fragments(false, []string{
-			`<|tool_call>call:a{}<tool_call|><|tool_call>call:b{}<tool_call|>`,
-		})
-		_, _, tools := flattenGemma4Deltas(deltas)
-		Expect(tools).To(HaveLen(2))
-		Expect(tools[0].id).To(Equal("call_0"))
-		Expect(tools[1].id).To(Equal("call_1"))
-	})
-
-	// Property: for a fixed full output, EVERY 2-split position must yield
-	// exactly the same flattened result as the unsplit parse. This kills
-	// fragment-boundary bugs (mid-marker, mid-delimiter, mid-payload splits).
-	DescribeTable("2-split fragment invariance",
-		func(startInThought bool, full string) {
-			refContent, refReasoning, refTools := flattenGemma4Deltas(
-				parseGemma4Fragments(startInThought, []string{full}))
-			for i := 0; i <= len(full); i++ {
-				content, reasoning, tools := flattenGemma4Deltas(
-					parseGemma4Fragments(startInThought, []string{full[:i], full[i:]}))
-				Expect(content).To(Equal(refContent), fmt.Sprintf("content diverged at split %d", i))
-				Expect(reasoning).To(Equal(refReasoning), fmt.Sprintf("reasoning diverged at split %d", i))
-				Expect(tools).To(Equal(refTools), fmt.Sprintf("tool calls diverged at split %d", i))
-			}
-		},
-		Entry("thought + content + two tool calls + turn end", false,
-			"<|channel>thought\nPondering the request...\n<channel|>Sure - calling tools now. "+
-				`<|tool_call>call:get_weather{location:<|"|>Paris, France<|"|>,unit:<|"|>celsius<|"|>,days:3,detailed:true}<tool_call|>`+
-				`<|tool_call>call:get_time{timezone:<|"|>Europe/Lisbon<|"|>,nested:{flag:false,vals:[1,2.5,<|"|>x<|"|>]}}<tool_call|>`+
-				"Done.<turn|>ignored tail"),
-		Entry("startInThought + tool call + trailing partial marker", true,
-			`Deep thought<channel|>final answer <|tool_call>call:noop{}<tool_call|> trailing <|tool`),
-		Entry("malformed payload fallback", false,
-			`pre <|tool_call>not a call<tool_call|> post`),
-	)
-})
-
-// Decoder-level ports of vLLM's TestParseGemma4Args / TestParseGemma4Array
-// (non-partial mode; the partial-withholding tests do not apply because this
-// parser only ever decodes COMPLETE payloads, see gemma4_parser.go).
-var _ = Describe("decodeGemma4Args", func() {
-	DescribeTable("decodes the gemma4 call syntax into JSON arguments",
-		func(in, wantJSON string) {
-			Expect(decodeGemma4Args(in, 0)).To(MatchJSON(wantJSON))
-		},
-		// vLLM: test_empty_string / test_whitespace_only
-		Entry("empty string", "", `{}`),
-		Entry("whitespace only", "   ", `{}`),
-		// vLLM: test_single_string_value
-		Entry("single string value", `location:<|"|>Paris<|"|>`, `{"location":"Paris"}`),
-		// vLLM: test_string_value_with_comma
-		Entry("string value with comma", `location:<|"|>Paris, France<|"|>`, `{"location":"Paris, France"}`),
-		// vLLM: test_multiple_string_values
-		Entry("multiple string values", `location:<|"|>San Francisco<|"|>,unit:<|"|>celsius<|"|>`, `{"location":"San Francisco","unit":"celsius"}`),
-		// vLLM: test_integer_value / test_float_value
-		Entry("integer value", "count:42", `{"count":42}`),
-		Entry("float value", "score:3.14", `{"score":3.14}`),
-		// vLLM: test_boolean_true / test_boolean_false
-		Entry("boolean true", "flag:true", `{"flag":true}`),
-		Entry("boolean false", "flag:false", `{"flag":false}`),
-		// vLLM: test_null_value (bare null must become JSON null, not "null")
-		Entry("null value", "param:null", `{"param":null}`),
-		// vLLM: test_mixed_types
-		Entry("mixed types", `name:<|"|>test<|"|>,count:42,active:true,score:3.14`,
-			`{"name":"test","count":42,"active":true,"score":3.14}`),
-		// vLLM: test_nested_object
-		Entry("nested object", `nested:{inner:<|"|>value<|"|>}`, `{"nested":{"inner":"value"}}`),
-		// vLLM: test_array_of_strings
-		Entry("array of strings", `items:[<|"|>a<|"|>,<|"|>b<|"|>]`, `{"items":["a","b"]}`),
-		// vLLM: test_unterminated_string (take everything after the delimiter)
-		Entry("unterminated string", `key:<|"|>unterminated`, `{"key":"unterminated"}`),
-		// vLLM: test_empty_value (key with no value after colon)
-		Entry("empty value", "key:", `{"key":""}`),
-		// vLLM: test_trailing_dot_float_partial_withheld, non-partial branch
-		// (trailing-dot floats parse normally outside streaming).
-		Entry("trailing dot float, complete payload", "left:108.,right:22.8", `{"left":108.0,"right":22.8}`),
-	)
-
-	It("terminates and yields valid JSON on malformed input", func() {
-		// vLLM: test_malformed_partial_array (the assertion there is only
-		// "returns a dict without hanging"; ours is "valid JSON object").
-		out := decodeGemma4Args(":[t:[]", 0)
-		var v map[string]any
-		Expect(json.Unmarshal([]byte(out), &v)).To(Succeed())
-	})
-
-	It("degrades nesting beyond the recursion cap to a string value", func() {
-		// 200 levels of a:{a:{...a:1...}}. Without the depth cap the mutual
-		// recursion would grow the stack with the model's output; a Go stack
-		// overflow is a fatal process kill, so levels past gemma4MaxArgsDepth
-		// must gracefully fall back to the raw inner text as a JSON string.
-		const depth = 200
-		body := strings.Repeat("a:{", depth-1) + "a:1" + strings.Repeat("}", depth-1)
-		out := decodeGemma4Args(body, 0)
-		var v map[string]any
-		Expect(json.Unmarshal([]byte(out), &v)).To(Succeed())
-		levels := 0
-		var cur any = v
-		for {
-			m, ok := cur.(map[string]any)
-			if !ok {
-				break
-			}
-			Expect(m).To(HaveKey("a"))
-			cur = m["a"]
-			levels++
-		}
-		Expect(levels).To(Equal(gemma4MaxArgsDepth + 1))
-		Expect(cur).To(BeAssignableToTypeOf(""))
-		Expect(cur).To(ContainSubstring("a:{"))
-	})
-})
-
-var _ = Describe("decodeGemma4Array", func() {
-	DescribeTable("decodes gemma4 array bodies into JSON arrays",
-		func(in, wantJSON string) {
-			Expect(decodeGemma4Array(in, 0)).To(MatchJSON(wantJSON))
-		},
-		// vLLM: test_string_array / test_empty_array / test_bare_values
-		Entry("string array", `<|"|>a<|"|>,<|"|>b<|"|>`, `["a","b"]`),
-		Entry("empty array", "", `[]`),
-		Entry("bare values", "42,true,3.14", `[42,true,3.14]`),
-		// vLLM: test_string_element_with_closing_bracket (a ']' inside a
-		// delimited string must not close the array)
-		Entry("string element with closing bracket", `[<|"|>a]b<|"|>,<|"|>c<|"|>],<|"|>tail<|"|>`, `[["a]b","c"],"tail"]`),
-		// vLLM: test_stray_closing_bracket (no-progress abort, keep prefix)
-		Entry("stray closing bracket", "42,]trailing", `[42]`),
-	)
-})
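Reviewer note on the removed "2-split fragment invariance" table: the property it pinned can be reproduced on a much smaller parser. The sketch below is not the gemma4 parser. It is a toy filter that strips a hypothetical `<stop>` marker from a stream, and every name in it (`stripper`, `run`, `checkSplitInvariance`) is an assumption made for illustration. It shows the two ingredients the deleted tests relied on: holding back a tail that could still grow into a marker, and comparing every 2-split against the unsplit parse.

```go
package main

import (
	"fmt"
	"strings"
)

// marker is a hypothetical control token; it is not one of the gemma4 markers.
const marker = "<stop>"

// stripper is a toy streaming filter that removes every marker from its
// input, even when a marker is split across two fragments.
type stripper struct {
	pending string // raw tail that might still grow into a marker
}

// Feed consumes one fragment and returns the text that is safe to emit.
func (s *stripper) Feed(frag string) string {
	buf := s.pending + frag
	var out strings.Builder
	// Consume complete markers left to right.
	for {
		i := strings.Index(buf, marker)
		if i < 0 {
			break
		}
		out.WriteString(buf[:i])
		buf = buf[i+len(marker):]
	}
	// Hold back the longest suffix that is a proper prefix of marker.
	hold := 0
	for n := len(marker) - 1; n > 0; n-- {
		if n <= len(buf) && strings.HasSuffix(buf, marker[:n]) {
			hold = n
			break
		}
	}
	out.WriteString(buf[:len(buf)-hold])
	s.pending = buf[len(buf)-hold:]
	return out.String()
}

// Close flushes whatever was held back; a partial marker is plain text.
func (s *stripper) Close() string {
	out := s.pending
	s.pending = ""
	return out
}

// run feeds all fragments through a fresh stripper and joins the output.
func run(frags []string) string {
	var s stripper
	var out strings.Builder
	for _, f := range frags {
		out.WriteString(s.Feed(f))
	}
	out.WriteString(s.Close())
	return out.String()
}

// checkSplitInvariance compares every 2-split of full with the unsplit run.
func checkSplitInvariance(full string) error {
	ref := run([]string{full})
	for i := 0; i <= len(full); i++ {
		if got := run([]string{full[:i], full[i:]}); got != ref {
			return fmt.Errorf("diverged at split %d: got %q, want %q", i, got, ref)
		}
	}
	return nil
}
```

Calling `checkSplitInvariance` on an input that mixes complete markers, text that merely resembles a marker, and a trailing partial marker exercises the same mid-marker boundaries the removed table targeted.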
--- a/backend/go/dllm/gemma4_renderer.go
+++ b/backend/go/dllm/gemma4_renderer.go
--- a/backend/go/dllm/gemma4_renderer_test.go
+++ b/backend/go/dllm/gemma4_renderer_test.go
@@ -1,347 +0,0 @@
-package main
-
-// Renderer specs for RenderGemma4 against the canonical gemma4 chat template
-// (see the normative template comment in gemma4_renderer.go).
-//
-// Fixture provenance:
-//   - "single user message" and "enable_thinking" are the EXACT expected
-//     decodes from transformers tests/models/diffusion_gemma/
-//     test_modeling_diffusion_gemma.py (test_diffusion_gemma_chat_template
-//     and ..._with_thinking) with ONE difference: the transformers fixtures
-//     start with "<bos>" because apply_chat_template tokenizes the rendered
-//     text with add_bos. Our prompt goes through dllm_capi_generate, whose
-//     run_generate already tokenizes with prepend_bos = vocab.add_bos
-//     (dllm.cpp src/capi.cpp:230-231, true for gemma4), so the renderer must
-//     NOT emit a literal <bos> (it would double) and every expected string
-//     here drops that leading token.
-//   - All other expected strings were produced by rendering the verbatim
-//     GGUF template with jinja2 3.1.2 (bos_token="<bos>") and dropping the
-//     leading "<bos>" for the same reason.
-
-import (
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-)
-
-// Two-function tools array used by the tool fixtures (OpenAI wire shape, as
-// LocalAI passes it through PredictOptions.Tools).
-const testToolsJSON = `[{"type":"function","function":{"name":"get_weather","description":"Get the current weather in a location.","parameters":{"type":"object","properties":{"location":{"type":"string","description":"The city name."},"unit":{"type":"string","enum":["celsius","fahrenheit"]}},"required":["location"]}}},{"type":"function","function":{"name":"get_time","description":"Get the current time in a timezone.","parameters":{"type":"object","properties":{"timezone":{"type":"string","description":"IANA timezone name."}},"required":["timezone"]}}}]`
-
-// The <|tool>...<tool|> block the template renders for testToolsJSON inside
-// the system turn (jinja2-verified).
-const testToolsBlock = `<|tool>declaration:get_weather{description:<|"|>Get the current weather in a location.<|"|>,parameters:{properties:{location:{description:<|"|>The city name.<|"|>,type:<|"|>STRING<|"|>},unit:{enum:[<|"|>celsius<|"|>,<|"|>fahrenheit<|"|>],type:<|"|>STRING<|"|>}},required:[<|"|>location<|"|>],type:<|"|>OBJECT<|"|>}}<tool|><|tool>declaration:get_time{description:<|"|>Get the current time in a timezone.<|"|>,parameters:{properties:{timezone:{description:<|"|>IANA timezone name.<|"|>,type:<|"|>STRING<|"|>}},required:[<|"|>timezone<|"|>],type:<|"|>OBJECT<|"|>}}<tool|>`
-
-// A single tool exercising the deep format_parameters branches: array items
-// (string-typed and nested-array), nullable, enum+nullable, nested object
-// properties/required, and a response declaration.
-const complexToolsJSON = `[{"type":"function","function":{"name":"complex_tool","description":"A complex tool.","parameters":{"type":"object","properties":{"tags":{"type":"array","description":"Tags.","items":{"type":"string"}},"matrix":{"type":"array","items":{"type":"array","items":{"type":"number"}}},"opts":{"type":"object","description":"Options.","properties":{"depth":{"type":"integer","nullable":true}},"required":["depth"]},"mode":{"type":"string","enum":["a","b"],"nullable":true}},"required":["tags","opts"]},"response":{"description":"The result.","type":"object"}}}]`
-
-// jinja2-verified render of complexToolsJSON. Notable template quirks pinned
-// here: nested array items go through format_argument with ESCAPED keys and
-// an un-uppercased type (<|"|>type<|"|>:<|"|>number<|"|>), while direct item
-// types are uppercased; properties dictsort case-insensitively.
-const complexToolsBlock = `<|tool>declaration:complex_tool{description:<|"|>A complex tool.<|"|>,parameters:{properties:{matrix:{items:{items:{<|"|>type<|"|>:<|"|>number<|"|>},type:<|"|>ARRAY<|"|>},type:<|"|>ARRAY<|"|>},mode:{enum:[<|"|>a<|"|>,<|"|>b<|"|>],nullable:true,type:<|"|>STRING<|"|>},opts:{description:<|"|>Options.<|"|>,properties:{depth:{nullable:true,type:<|"|>INTEGER<|"|>}},required:[<|"|>depth<|"|>],type:<|"|>OBJECT<|"|>},tags:{description:<|"|>Tags.<|"|>,items:{type:<|"|>STRING<|"|>},type:<|"|>ARRAY<|"|>}},required:[<|"|>tags<|"|>,<|"|>opts<|"|>],type:<|"|>OBJECT<|"|>},response:{description:<|"|>The result.<|"|>,type:<|"|>OBJECT<|"|>}}<tool|>`
-
-type renderGemma4Case struct {
-	msgs               []*pb.Message
-	toolsJSON          string
-	enableThinking     bool
-	noGenerationPrompt bool // inverted so the zero value is the common case
-	expected           string
-}
-
-var _ = Describe("RenderGemma4", func() {
-	DescribeTable("renders the canonical gemma4 prompt",
-		func(c renderGemma4Case) {
-			out, err := RenderGemma4(c.msgs, c.toolsJSON, c.enableThinking, !c.noGenerationPrompt)
-			Expect(err).ToNot(HaveOccurred())
-			Expect(out).To(Equal(c.expected))
-			// The C-ABI generate prepends BOS itself: a literal <bos>
-			// anywhere in the rendered prompt would double-encode it.
-			Expect(out).ToNot(ContainSubstring("<bos>"))
-		},
-
-		// transformers fixture (test_diffusion_gemma_chat_template), sans <bos>:
-		// default thinking pre-opens an EMPTY thought channel in the
-		// generation prompt.
-		Entry("single user message, default (no thinking)", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "Write a long essay about Portugal."},
-			},
-			expected: "<|turn>user\nWrite a long essay about Portugal.<turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
-		}),
-
-		// transformers fixture (test_diffusion_gemma_chat_template_with_thinking),
-		// sans <bos>: a system turn carrying <|think|> and NO auto-opened
-		// thought channel.
-		Entry("enable_thinking=true", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "Write a long essay about Portugal."},
-			},
-			enableThinking: true,
-			expected:       "<|turn>system\n<|think|>\n<turn|>\n<|turn>user\nWrite a long essay about Portugal.<turn|>\n<|turn>model\n",
-		}),
-
-		Entry("multi-turn user/assistant/user", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "Hello, who are you?"},
-				{Role: "assistant", Content: "I am Gemma, a helpful assistant."},
-				{Role: "user", Content: "Tell me a joke."},
-			},
-			expected: "<|turn>user\nHello, who are you?<turn|>\n<|turn>model\nI am Gemma, a helpful assistant.<turn|>\n<|turn>user\nTell me a joke.<turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
-		}),
-
-		// tpl L178-L195: a leading system message is folded into the system
-		// turn (trimmed) and consumed from the loop.
-		Entry("system message folds into the system turn", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "system", Content: "You are a pirate."},
-				{Role: "user", Content: "Hello!"},
-			},
-			expected: "<|turn>system\nYou are a pirate.<turn|>\n<|turn>user\nHello!<turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
-		}),
-
-		// tpl L182-L185: <|think|> goes at the very top of the SAME system
-		// turn, before the system prompt text.
-		Entry("system message with enable_thinking shares the turn", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "system", Content: "You are a pirate."},
-				{Role: "user", Content: "Hello!"},
-			},
-			enableThinking: true,
-			expected:       "<|turn>system\n<|think|>\nYou are a pirate.<turn|>\n<|turn>user\nHello!<turn|>\n<|turn>model\n",
-		}),
-
-		// tpl L196-L203: tool declarations render in the system turn, one
-		// <|tool>declaration:...<tool|> block per tool, no separators.
-		Entry("tools array (two functions)", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "What is the weather in Tokyo?"},
-			},
-			toolsJSON: testToolsJSON,
-			expected:  "<|turn>system\n" + testToolsBlock + "<turn|>\n<|turn>user\nWhat is the weather in Tokyo?<turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
-		}),
-
-		// format_parameters deep branches (tpl L1-L85) + response declaration
-		// (tpl L106-L116).
-		Entry("complex tool schema (array items, nullable, nested object, response)", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "go"},
-			},
-			toolsJSON: complexToolsJSON,
-			expected:  "<|turn>system\n" + complexToolsBlock + "<turn|>\n<|turn>user\ngo<turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
-		}),
-
-		// tpl L243-L313: assistant tool_calls render as
-		// <|tool_call>call:name{args}<tool_call|>; the following role=tool
-		// message renders inline as <|tool_response>response:name{value:..}
-		// <tool_response|>; the model turn stays OPEN (no <turn|>, no new
-		// generation prompt) so the model continues after the response.
-		Entry("assistant tool_calls + role=tool result", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "What is the weather in Tokyo?"},
-				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"call_1","type":"function","function":{"name":"get_weather","arguments":"{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"}}]`},
-				{Role: "tool", ToolCallId: "call_1", Content: "Sunny, 22 degrees celsius."},
-			},
-			toolsJSON: testToolsJSON,
-			expected:  "<|turn>system\n" + testToolsBlock + "<turn|>\n<|turn>user\nWhat is the weather in Tokyo?<turn|>\n<|turn>model\n" + `<|tool_call>call:get_weather{location:<|"|>Tokyo<|"|>,unit:<|"|>celsius<|"|>}<tool_call|><|tool_response>response:get_weather{value:<|"|>Sunny, 22 degrees celsius.<|"|>}<tool_response|>`,
-		}),
-
-		// tpl L348-L349: a tool_calls turn with no rendered responses ends
-		// on an OPEN <|tool_response> marker for the runtime to fill, and
-		// add_generation_prompt adds nothing (tpl L357).
-		Entry("assistant tool_calls without a result leaves <|tool_response> open", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "What is the weather in Tokyo?"},
-				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"call_1","type":"function","function":{"name":"get_weather","arguments":"{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"}}]`},
-			},
-			toolsJSON: testToolsJSON,
-			expected:  "<|turn>system\n" + testToolsBlock + "<turn|>\n<|turn>user\nWhat is the weather in Tokyo?<turn|>\n<|turn>model\n" + `<|tool_call>call:get_weather{location:<|"|>Tokyo<|"|>,unit:<|"|>celsius<|"|>}<tool_call|><|tool_response>`,
-		}),
-
-		// tpl L237-L241: reasoning_content renders as a thought channel only
-		// on a tool-calling turn after the last user message.
-		Entry("reasoning_content with tool_calls renders the thought channel", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "weather?"},
-				{Role: "assistant", Content: "", ReasoningContent: "I should call the tool", ToolCalls: `[{"index":0,"id":"c1","type":"function","function":{"name":"get_weather","arguments":"{\"location\":\"Tokyo\"}"}}]`},
-				{Role: "tool", ToolCallId: "c1", Content: "Sunny"},
-			},
-			expected: "<|turn>user\nweather?<turn|>\n<|turn>model\n<|channel>thought\nI should call the tool\n<channel|>" + `<|tool_call>call:get_weather{location:<|"|>Tokyo<|"|>}<tool_call|><|tool_response>response:get_weather{value:<|"|>Sunny<|"|>}<tool_response|>`,
-		}),
-
-		// tpl L220-L235: the assistant answer following its own tool round
-		// continues the SAME model turn (no second <|turn>model).
-		Entry("tool round then final assistant answer then user", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "weather?"},
-				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"c1","type":"function","function":{"name":"get_weather","arguments":"{\"location\":\"Tokyo\"}"}}]`},
-				{Role: "tool", ToolCallId: "c1", Content: "Sunny"},
-				{Role: "assistant", Content: "It is sunny."},
-				{Role: "user", Content: "thanks"},
-			},
-			expected: "<|turn>user\nweather?<turn|>\n<|turn>model\n" + `<|tool_call>call:get_weather{location:<|"|>Tokyo<|"|>}<tool_call|><|tool_response>response:get_weather{value:<|"|>Sunny<|"|>}<tool_response|>` + "It is sunny.<turn|>\n<|turn>user\nthanks<turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
-		}),
-
-		// format_argument (tpl L118-L147): numbers keep their JSON literal,
-		// booleans lower-case, nested maps have unquoted dictsorted keys,
-		// arrays bracketed; top-level args are dictsorted case-insensitively.
-		Entry("tool_call argument types (number/bool/nested/array)", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "go"},
-				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"c1","type":"function","function":{"name":"f","arguments":"{\"count\":42,\"ratio\":3.5,\"flag\":true,\"off\":false,\"nested\":{\"x\":\"y\",\"n\":7},\"list\":[\"a\",1,true]}"}}]`},
-			},
-			expected: "<|turn>user\ngo<turn|>\n<|turn>model\n" + `<|tool_call>call:f{count:42,flag:true,list:[<|"|>a<|"|>,1,true],nested:{n:7,x:<|"|>y<|"|>},off:false,ratio:3.5}<tool_call|><|tool_response>`,
-		}),
-
-		// jinja dictsort is case-insensitive: alpha sorts before Beta.
-		Entry("tool_call argument dictsort is case-insensitive", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "go"},
-				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"c1","type":"function","function":{"name":"f","arguments":"{\"Beta\":1,\"alpha\":2}"}}]`},
-			},
-			expected: "<|turn>user\ngo<turn|>\n<|turn>model\n<|tool_call>call:f{alpha:2,Beta:1}<tool_call|><|tool_response>",
-		}),
-
-		// jinja renders Python None as "None" (round-trips through vLLM's
-		// parser, which lowers "none" back to null).
-		Entry("tool_call null argument renders as None", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "go"},
-				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"c1","type":"function","function":{"name":"f","arguments":"{\"maybe\":null}"}}]`},
-			},
-			expected: "<|turn>user\ngo<turn|>\n<|turn>model\n<|tool_call>call:f{maybe:None}<tool_call|><|tool_response>",
-		}),
-
-		Entry("tool_call empty arguments render empty braces", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "go"},
-				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"c1","type":"function","function":{"name":"f","arguments":"{}"}}]`},
-			},
-			expected: "<|turn>user\ngo<turn|>\n<|turn>model\n<|tool_call>call:f{}<tool_call|><|tool_response>",
-		}),
-
-		// tpl L253-L254: a non-object arguments string renders verbatim.
-		Entry("tool_call non-object string arguments render verbatim", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "go"},
-				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"c1","type":"function","function":{"name":"f","arguments":"just text"}}]`},
-			},
-			expected: "<|turn>user\ngo<turn|>\n<|turn>model\n<|tool_call>call:f{just text}<tool_call|><|tool_response>",
-		}),
-
-		// tpl L278-L285: unmatched tool_call_id falls back to the tool
-		// message's own name.
-		Entry("tool result name falls back when tool_call_id does not match", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "go"},
-				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"c1","type":"function","function":{"name":"f","arguments":"{}"}}]`},
-				{Role: "tool", ToolCallId: "OTHER", Name: "named_tool", Content: "out"},
-			},
-			expected: "<|turn>user\ngo<turn|>\n<|turn>model\n" + `<|tool_call>call:f{}<tool_call|><|tool_response>response:named_tool{value:<|"|>out<|"|>}<tool_response|>`,
-		}),
-
-		// strip_thinking (tpl L148-L158): historical assistant content loses
-		// its <|channel>...<channel|> spans.
-		Entry("assistant content thinking channels are stripped", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "hi"},
-				{Role: "assistant", Content: "<|channel>thought\nsecret\n<channel|>visible answer"},
-				{Role: "user", Content: "more"},
-			},
-			expected: "<|turn>user\nhi<turn|>\n<|turn>model\nvisible answer<turn|>\n<|turn>user\nmore<turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
-		}),
-
-		// tpl L220-L235: consecutive assistant messages suppress the second
-		// <|turn>model (continuation), but each still closes with <turn|>.
-		Entry("consecutive assistant messages continue the model turn", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "hi"},
-				{Role: "assistant", Content: "part one"},
-				{Role: "assistant", Content: "part two"},
-				{Role: "user", Content: "ok"},
-			},
-			expected: "<|turn>user\nhi<turn|>\n<|turn>model\npart one<turn|>\npart two<turn|>\n<|turn>user\nok<turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
-		}),
-
-		Entry("add_generation_prompt=false renders no model turn", renderGemma4Case{
-			msgs: []*pb.Message{
-				{Role: "user", Content: "hi"},
-			},
-			noGenerationPrompt: true,
-			expected:           "<|turn>user\nhi<turn|>\n",
-		}),
-	)
-
-	Describe("error handling", func() {
-		It("fails loud on an unknown role", func() {
-			_, err := RenderGemma4([]*pb.Message{
-				{Role: "narrator", Content: "Meanwhile..."},
-			}, "", false, true)
-			Expect(err).To(HaveOccurred())
-			Expect(err.Error()).To(ContainSubstring(`unknown role "narrator"`))
-		})
-
-		It("fails on invalid tools JSON", func() {
-			_, err := RenderGemma4([]*pb.Message{
-				{Role: "user", Content: "hi"},
-			}, "{not json", false, true)
-			Expect(err).To(HaveOccurred())
-			Expect(err.Error()).To(ContainSubstring("tools JSON"))
-		})
-
-		It("fails on invalid tool_calls JSON", func() {
-			_, err := RenderGemma4([]*pb.Message{
-				{Role: "user", Content: "hi"},
-				{Role: "assistant", Content: "", ToolCalls: "{not json"},
-			}, "", false, true)
-			Expect(err).To(HaveOccurred())
-			Expect(err.Error()).To(ContainSubstring("tool_calls JSON"))
-		})
-
-		It("fails on an orphan tool message, naming its index", func() {
-			// A role:tool message with no preceding assistant tool_calls turn
-			// would be silently dropped by the jinja; we fail loud instead.
-			_, err := RenderGemma4([]*pb.Message{
-				{Role: "user", Content: "hi"},
-				{Role: "tool", Content: `{"temp": 20}`, ToolCallId: "call_1"},
-			}, "", false, true)
-			Expect(err).To(HaveOccurred())
-			Expect(err.Error()).To(ContainSubstring("orphan tool message 1"))
-		})
-
-		It("fails on trailing garbage after the tools JSON array", func() {
-			_, err := RenderGemma4([]*pb.Message{
-				{Role: "user", Content: "hi"},
-			}, "[] junk", false, true)
-			Expect(err).To(HaveOccurred())
-			Expect(err.Error()).To(ContainSubstring("tools JSON"))
-		})
-
-		It("fails when the tools JSON is not an array", func() {
-			_, err := RenderGemma4([]*pb.Message{
-				{Role: "user", Content: "hi"},
-			}, `{"type":"function"}`, false, true)
-			Expect(err).To(HaveOccurred())
-			Expect(err.Error()).To(ContainSubstring("tools JSON is not an array"))
-		})
-
-		It("fails when a tools array element is not an object", func() {
-			_, err := RenderGemma4([]*pb.Message{
-				{Role: "user", Content: "hi"},
-			}, `[42]`, false, true)
-			Expect(err).To(HaveOccurred())
-			Expect(err.Error()).To(ContainSubstring("tools[0] is not an object"))
-		})
-
-		It("rejects a nil message via the unknown-role check", func() {
-			// Pins current behavior: pb getters are nil-safe, so a nil message
-			// reads as role "" and trips the fail-loud unknown-role guard.
-			_, err := RenderGemma4([]*pb.Message{nil}, "", false, true)
-			Expect(err).To(HaveOccurred())
-			Expect(err.Error()).To(ContainSubstring(`unknown role "" in message 0`))
-		})
-	})
-})
--- a/backend/go/dllm/main.go
+++ b/backend/go/dllm/main.go
@@ -1,85 +0,0 @@
-package main
-
-// Started internally by LocalAI - one gRPC server per loaded model.
-//
-// Loads libdllm.so via purego and registers the 9-symbol flat C-ABI
-// declared in dllm.cpp's include/dllm_capi.h (ABI v1). The library name can
-// be overridden with DLLM_LIBRARY (mirrors the PARAKEET_LIBRARY /
-// WHISPER_LIBRARY convention in the sibling backends); the default looks
-// for the .so next to this binary (run.sh puts the package dir on
-// LD_LIBRARY_PATH).
-import (
-	"flag"
-	"fmt"
-	"os"
-
-	"github.com/ebitengine/purego"
-	grpc "github.com/mudler/LocalAI/pkg/grpc"
-)
-
-var (
-	addr = flag.String("addr", "localhost:50051", "the address to connect to")
-)
-
-type LibFuncs struct {
-	FuncPtr any
-	Name    string
-}
-
-// loadCAPI dlopens libName and binds the 9 dllm_capi_* entry points 1:1 to
-// dllm_capi.h, so an `nm libdllm.so | grep dllm_capi` is enough to spot
-// drift. Shared with the test suite (ensureLibLoaded), which drives the
-// bridge without the gRPC server.
-//
-// The C-ABI returns malloc'd char* buffers from tokenize_json/generate; we
-// register those as uintptr so we get the raw pointer back and can call
-// dllm_capi_free_string on it (purego's string return would copy and forget
-// the original pointer, leaking it on every call). last_error returns a
-// BORROWED pointer instead, so it is registered as a plain string: purego
-// copies it and nothing must be freed.
-func loadCAPI(libName string) error {
-	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
-	if err != nil {
-		return fmt.Errorf("dllm: dlopen %q: %w", libName, err)
-	}
-
-	libFuncs := []LibFuncs{
-		{&cppAbiVersion, "dllm_capi_abi_version"},
-		{&cppLoad, "dllm_capi_load"},
-		{&cppFree, "dllm_capi_free"},
-		{&cppLastError, "dllm_capi_last_error"},
-		{&cppFreeString, "dllm_capi_free_string"},
-		{&cppTokenizeJSON, "dllm_capi_tokenize_json"},
-		{&cppGenerate, "dllm_capi_generate"},
-		{&cppGenerateStream, "dllm_capi_generate_stream"},
-		{&cppCancel, "dllm_capi_cancel"},
-	}
-	for _, lf := range libFuncs {
-		purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
-	}
-	return nil
-}
-
-func main() {
-	libName := os.Getenv("DLLM_LIBRARY")
-	if libName == "" {
-		libName = "libdllm.so"
-	}
-
-	if err := loadCAPI(libName); err != nil {
-		panic(err)
-	}
-
-	// Hard-fail on an ABI mismatch: the flat-pointer bindings above would
-	// otherwise misbehave silently against a future libdllm.so.
-	if v := cAbiVersion(); v != dllmABIVersion {
-		panic(fmt.Errorf("dllm: libdllm.so ABI=%d, this backend speaks ABI=%d", v, dllmABIVersion))
-	}
-	fmt.Fprintf(os.Stderr, "[dllm] ABI=%d\n", cAbiVersion())
-
-	flag.Parse()
-
-	if err := grpc.StartServer(*addr, &Dllm{}); err != nil {
-		panic(err)
-	}
-}
--- a/backend/go/dllm/package.sh
+++ b/backend/go/dllm/package.sh
@@ -1,24 +0,0 @@
-#!/bin/bash
-#
-# T1 packaging stub: copy the binary, run.sh and libdllm.so into package/.
-# The full ldd walk (libc, libstdc++, libgomp, GPU runtimes, arch
-# detection) lands with the registration task, mirroring
-# backend/go/whisper/package.sh.
-
-set -e
-
-CURDIR=$(dirname "$(realpath "$0")")
-
-mkdir -p "$CURDIR/package/lib"
-
-cp -avf "$CURDIR/dllm-grpc" "$CURDIR/package/"
-cp -avf "$CURDIR/run.sh" "$CURDIR/package/"
-
-# libdllm.so + any soname symlinks, should upstream ever add them.
-cp -avf "$CURDIR"/libdllm.so* "$CURDIR/package/lib/" 2>/dev/null || {
-	echo "ERROR: libdllm.so not found in $CURDIR, run 'make' first" >&2
-	exit 1
-}
-
-echo "T1 package layout (full ldd walk lands with registration):"
-ls -liah "$CURDIR/package/" "$CURDIR/package/lib/"
--- a/backend/go/dllm/run.sh
+++ b/backend/go/dllm/run.sh
@@ -1,16 +0,0 @@
-#!/bin/bash
-set -e
-
-CURDIR=$(dirname "$(realpath "$0")")
-
-export LD_LIBRARY_PATH="$CURDIR/lib:$CURDIR:${LD_LIBRARY_PATH:-}"
-
-# If a self-contained ld.so was packaged, route through it so the
-# packaged libc / libstdc++ are used instead of the host's (matches the
-# whisper / parakeet-cpp backends' runtime layout).
-if [ -f "$CURDIR/lib/ld.so" ]; then
-	echo "Using lib/ld.so"
-	exec "$CURDIR/lib/ld.so" "$CURDIR/dllm-grpc" "$@"
-fi
-
-exec "$CURDIR/dllm-grpc" "$@"
--- a/backend/go/local-store/debug.go
+++ b/backend/go/local-store/debug.go
@@ -8,6 +8,6 @@ import (

 func assert(cond bool, msg string) {
 	if !cond {
-		xlog.Fatal(msg)
+		xlog.Fatal().Stack().Msg(msg)
 	}
 }
--- a/backend/go/local-store/store.go
+++ b/backend/go/local-store/store.go
@@ -1,22 +1,7 @@
 package main

-// LocalAI's in-process vector store, exposed as a gRPC backend. Keep
-// the implementation here — NOT in a pkg/ library imported by the main
-// LocalAI process. The whole point of the gRPC surface is that vector
-// storage is a backend like any other (local-store, qdrant, pinecone,
-// ...) and can be swapped without changing the routing/recognition
-// code that consumes it.
-//
-// Storage is a sorted parallel-slice (keys [][]float32, values
-// [][]byte). Set/Delete preserve the sort so Get can binary-search.
-// Find scans linearly and uses a heap to keep the top-K — fine for
-// the tens-to-thousands range. The "normalized fast path" (Find when
-// every stored key has unit magnitude AND the query is normalized)
-// skips the per-item magnitude calculation.
-//
-// Concurrency: base.SingleThread serialises gRPC calls so the
-// non-thread-safe slice/heap manipulation here is sound.
-
+// This is a wrapper to satisfy the gRPC service interface.
+// It is meant to be used by the main executable that is the server for the specific backend type (falcon, gpt3, etc.).
 import (
 	"container/heap"
 	"fmt"
@@ -25,29 +10,32 @@ import (

 	"github.com/mudler/LocalAI/pkg/grpc/base"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	"github.com/mudler/LocalAI/pkg/store"
+
+	"github.com/mudler/xlog"
 )

 type Store struct {
 	base.SingleThread

-	keys   [][]float32
+	// The sorted keys
+	keys [][]float32
+	// The sorted values
 	values [][]byte

-	// keysAreNormalized stays true until any non-unit-magnitude key
-	// is added; once false, the magnitude-aware fallback path is
-	// used by Find. Re-evaluated only at Set time, never again on
-	// its own — a deletion of the offending key does NOT flip it
-	// back to true (the bookkeeping cost would dominate the gain).
+	// If every key k satisfies ||k||^2 = 1, we can use the normalized similarity functions
+	// TODO: Should we normalize incoming keys if they are not instead?
 	keysAreNormalized bool
-
-	// keyLen is the dimension of every stored key. -1 means "no
-	// keys yet, dimension is open". Dimension mismatch on Set is
-	// rejected so cosine similarity (which requires equal-length
-	// vectors) doesn't silently mis-match.
+	// The first key decides the length of the keys
 	keyLen int
 }

+// TODO: Only used for sorting using Go's builtin implementation. The interfaces are columnar because
+// that's theoretically best for memory layout and cache locality, but this isn't optimized yet.
+type Pair struct {
+	Key   []float32
+	Value []byte
+}
+
 func NewStore() *Store {
 	return &Store{
 		keys:              make([][]float32, 0),
@@ -57,278 +45,334 @@ func NewStore() *Store {
 	}
 }

-// Load is a no-op — local-store has no on-disk artefact. opts.Model is
-// just a namespace identifier; isolation is already handled upstream
-// (ModelLoader spawns a fresh local-store process per (backend,
-// model) tuple, so each namespace is its own Store{} instance).
+func compareSlices(k1, k2 []float32) int {
+	assert(len(k1) == len(k2), fmt.Sprintf("compareSlices: len(k1) = %d, len(k2) = %d", len(k1), len(k2)))
+
+	return slices.Compare(k1, k2)
+}
+
+func hasKey(unsortedSlice [][]float32, target []float32) bool {
+	return slices.ContainsFunc(unsortedSlice, func(k []float32) bool {
+		return compareSlices(k, target) == 0
+	})
+}
+
+func findInSortedSlice(sortedSlice [][]float32, target []float32) (int, bool) {
+	return slices.BinarySearchFunc(sortedSlice, target, func(k, t []float32) int {
+		return compareSlices(k, t)
+	})
+}
+
+func isSortedPairs(kvs []Pair) bool {
+	for i := 1; i < len(kvs); i++ {
+		if compareSlices(kvs[i-1].Key, kvs[i].Key) > 0 {
+			return false
+		}
+	}
+
+	return true
+}
+
+func isSortedKeys(keys [][]float32) bool {
+	for i := 1; i < len(keys); i++ {
+		if compareSlices(keys[i-1], keys[i]) > 0 {
+			return false
+		}
+	}
+
+	return true
+}
+
+func sortIntoKeySlicese(keys []*pb.StoresKey) [][]float32 {
+	ks := make([][]float32, len(keys))
+
+	for i, k := range keys {
+		ks[i] = k.Floats
+	}
+
+	slices.SortFunc(ks, compareSlices)
+
+	assert(len(ks) == len(keys), fmt.Sprintf("len(ks) = %d, len(keys) = %d", len(ks), len(keys)))
+	assert(isSortedKeys(ks), "keys are not sorted")
+
+	return ks
+}
+
 func (s *Store) Load(opts *pb.ModelOptions) error {
+	// local-store is an in-memory vector store with no on-disk artefact to
+	// load — opts.Model is just a namespace identifier. The old `!= ""` guard
+	// rejected any non-empty model name with "not implemented", which broke
+	// callers that pass a namespace to isolate embedding spaces (face vs.
+	// voice biometrics both go through local-store but need distinct stores
+	// so ArcFace 512-D and ECAPA-TDNN 192-D don't collide). Namespace
+	// isolation is already handled upstream: ModelLoader spawns a fresh
+	// local-store process per (backend, model) tuple, so each namespace is
+	// its own Store{} instance. Nothing to do here beyond accepting the load.
 	_ = opts
 	return nil
 }

+// StoresSet sorts the incoming kvs and merges them into the existing sorted kvs; on equal keys the incoming value wins.
 func (s *Store) StoresSet(opts *pb.StoresSetOptions) error {
-	keys := store.UnwrapKeys(opts.Keys)
-	values := store.UnwrapValues(opts.Values)
-	if len(keys) == 0 {
-		return fmt.Errorf("local-store: Set: no keys to add")
+	if len(opts.Keys) == 0 {
+		return fmt.Errorf("no keys to add")
 	}
-	if len(keys) != len(values) {
-		return fmt.Errorf("local-store: Set: len(keys) = %d, len(values) = %d", len(keys), len(values))
+
+	if len(opts.Keys) != len(opts.Values) {
+		return fmt.Errorf("len(keys) = %d, len(values) = %d", len(opts.Keys), len(opts.Values))
 	}

 	if s.keyLen == -1 {
-		s.keyLen = len(keys[0])
-	} else if len(keys[0]) != s.keyLen {
-		return fmt.Errorf("local-store: Set: key length %d does not match existing %d", len(keys[0]), s.keyLen)
+		s.keyLen = len(opts.Keys[0].Floats)
+	} else {
+		if len(opts.Keys[0].Floats) != s.keyLen {
+			return fmt.Errorf("trying to add key with length %d when existing length is %d", len(opts.Keys[0].Floats), s.keyLen)
+		}
 	}

-	kvs := make([]incomingPair, len(keys))
-	for i, k := range keys {
-		if len(k) != s.keyLen {
-			return fmt.Errorf("local-store: Set: key %d length %d does not match existing %d", i, len(k), s.keyLen)
-		}
-		if s.keysAreNormalized && !isNormalized(k) {
+	kvs := make([]Pair, len(opts.Keys))
+
+	for i, k := range opts.Keys {
+		if s.keysAreNormalized && !isNormalized(k.Floats) {
 			s.keysAreNormalized = false
+			var sample []float32
+			if len(k.Floats) > 5 {
+				sample = k.Floats[:5]
+			} else {
+				sample = k.Floats
+			}
+			xlog.Debug("Key is not normalized", "sample", sample)
+		}
+
+		kvs[i] = Pair{
+			Key:   k.Floats,
+			Value: opts.Values[i].Bytes,
 		}
-		kvs[i] = incomingPair{key: k, value: values[i]}
 	}

-	slices.SortFunc(kvs, func(a, b incomingPair) int { return slices.Compare(a.key, b.key) })
+	slices.SortFunc(kvs, func(a, b Pair) int {
+		return compareSlices(a.Key, b.Key)
+	})
+
+	assert(len(kvs) == len(opts.Keys), fmt.Sprintf("len(kvs) = %d, len(opts.Keys) = %d", len(kvs), len(opts.Keys)))
+	assert(isSortedPairs(kvs), "keys are not sorted")
+
+	l := len(kvs) + len(s.keys)
+	merge_ks := make([][]float32, 0, l)
+	merge_vs := make([][]byte, 0, l)
+
+	i, j := 0, 0
+	for {
+		if i+j >= l {
+			break
+		}
+
+		if i >= len(kvs) {
+			merge_ks = append(merge_ks, s.keys[j])
+			merge_vs = append(merge_vs, s.values[j])
+			j++
+			continue
+		}
+
+		if j >= len(s.keys) {
+			merge_ks = append(merge_ks, kvs[i].Key)
+			merge_vs = append(merge_vs, kvs[i].Value)
+			i++
+			continue
+		}
+
+		c := compareSlices(kvs[i].Key, s.keys[j])
+		if c < 0 {
+			merge_ks = append(merge_ks, kvs[i].Key)
+			merge_vs = append(merge_vs, kvs[i].Value)
+			i++
+		} else if c > 0 {
+			merge_ks = append(merge_ks, s.keys[j])
+			merge_vs = append(merge_vs, s.values[j])
+			j++
+		} else {
+			merge_ks = append(merge_ks, kvs[i].Key)
+			merge_vs = append(merge_vs, kvs[i].Value)
+			i++
+			j++
+		}
+	}
+
+	assert(len(merge_ks) <= l, fmt.Sprintf("len(merge_ks) = %d, l = %d", len(merge_ks), l))
+	assert(isSortedKeys(merge_ks), "merge keys are not sorted")
+
+	s.keys = merge_ks
+	s.values = merge_vs

-	merged := mergeSortedPairs(s.keys, s.values, kvs)
-	s.keys = merged.keys
-	s.values = merged.values
-	assert(slices.IsSortedFunc(s.keys, slices.Compare[[]float32]), "Set: s.keys not sorted post-merge")
-	assert(len(s.keys) == len(s.values), "Set: keys/values length skew")
 	return nil
 }

 func (s *Store) StoresDelete(opts *pb.StoresDeleteOptions) error {
-	keys := store.UnwrapKeys(opts.Keys)
-	if len(keys) == 0 {
-		return fmt.Errorf("local-store: Delete: no keys to delete")
+	if len(opts.Keys) == 0 {
+		return fmt.Errorf("no keys to delete")
 	}
-	if s.keyLen != -1 {
-		for i, k := range keys {
-			if len(k) != s.keyLen {
-				return fmt.Errorf("local-store: Delete: key %d length %d does not match existing %d", i, len(k), s.keyLen)
+
+	if len(opts.Keys) == 0 {
+		return fmt.Errorf("no keys to delete")
+	}
+
+	if s.keyLen == -1 {
+		s.keyLen = len(opts.Keys[0].Floats)
+	} else {
+		if len(opts.Keys[0].Floats) != s.keyLen {
+			return fmt.Errorf("Trying to delete key with length %d when existing length is %d", len(opts.Keys[0].Floats), s.keyLen)
+		}
+	}
+
+	ks := sortIntoKeySlicese(opts.Keys)
+
+	l := len(s.keys) - len(ks)
+	merge_ks := make([][]float32, 0, l)
+	merge_vs := make([][]byte, 0, l)
+
+	tail_ks := s.keys
+	tail_vs := s.values
+	for _, k := range ks {
+		j, found := findInSortedSlice(tail_ks, k)
+
+		if found {
+			merge_ks = append(merge_ks, tail_ks[:j]...)
+			merge_vs = append(merge_vs, tail_vs[:j]...)
+			tail_ks = tail_ks[j+1:]
+			tail_vs = tail_vs[j+1:]
+		} else {
+			assert(!hasKey(s.keys, k), fmt.Sprintf("Key exists, but was not found: t=%d, %v", len(tail_ks), k))
+		}
+
+		xlog.Debug("Delete", "found", found, "tailLen", len(tail_ks), "j", j, "mergeKeysLen", len(merge_ks), "mergeValuesLen", len(merge_vs))
+	}
+
+	merge_ks = append(merge_ks, tail_ks...)
+	merge_vs = append(merge_vs, tail_vs...)
+
+	assert(len(merge_ks) <= len(s.keys), fmt.Sprintf("len(merge_ks) = %d, len(s.keys) = %d", len(merge_ks), len(s.keys)))
+
+	s.keys = merge_ks
+	s.values = merge_vs
+
+	assert(len(s.keys) >= l, fmt.Sprintf("len(s.keys) = %d, l = %d", len(s.keys), l))
+	assert(isSortedKeys(s.keys), "keys are not sorted")
+	assert(func() bool {
+		for _, k := range ks {
+			if _, found := findInSortedSlice(s.keys, k); found {
+				return false
 			}
 		}
-	}
-	sortedKeys := append([][]float32(nil), keys...)
-	slices.SortFunc(sortedKeys, slices.Compare[[]float32])
+		return true
+	}(), "Keys to delete still present")

-	mergedK := make([][]float32, 0, len(s.keys))
-	mergedV := make([][]byte, 0, len(s.keys))
-	tailK := s.keys
-	tailV := s.values
-	for _, k := range sortedKeys {
-		j, ok := slices.BinarySearchFunc(tailK, k, slices.Compare[[]float32])
-		if ok {
-			mergedK = append(mergedK, tailK[:j]...)
-			mergedV = append(mergedV, tailV[:j]...)
-			tailK = tailK[j+1:]
-			tailV = tailV[j+1:]
-		}
+	if len(s.keys) != l {
+		xlog.Debug("Delete: Some keys not found", "keysLen", len(s.keys), "expectedLen", l)
 	}
-	mergedK = append(mergedK, tailK...)
-	mergedV = append(mergedV, tailV...)
-	s.keys = mergedK
-	s.values = mergedV
-	assert(slices.IsSortedFunc(s.keys, slices.Compare[[]float32]), "Delete: s.keys not sorted post-merge")
-	assert(len(s.keys) == len(s.values), "Delete: keys/values length skew")
+
 	return nil
 }

-// StoresGet fetches values for the given keys. Missing keys are
-// omitted from the result rather than reported as an error — callers
-// compare returned-key length against requested-key length to detect
-// them. Returned slices are aligned.
 func (s *Store) StoresGet(opts *pb.StoresGetOptions) (pb.StoresGetResult, error) {
-	keys := store.UnwrapKeys(opts.Keys)
+	pbKeys := make([]*pb.StoresKey, 0, len(opts.Keys))
+	pbValues := make([]*pb.StoresValue, 0, len(opts.Keys))
+	ks := sortIntoKeySlicese(opts.Keys)
+
 	if len(s.keys) == 0 {
-		return pb.StoresGetResult{}, nil
-	}
-	if s.keyLen != -1 {
-		for i, k := range keys {
-			if len(k) != s.keyLen {
-				return pb.StoresGetResult{}, fmt.Errorf("local-store: Get: key %d length %d does not match existing %d", i, len(k), s.keyLen)
-			}
-		}
-	}
-	sortedKeys := append([][]float32(nil), keys...)
-	slices.SortFunc(sortedKeys, slices.Compare[[]float32])
-
-	var foundKeys [][]float32
-	var foundValues [][]byte
-	tailK := s.keys
-	tailV := s.values
-	for _, k := range sortedKeys {
-		j, ok := slices.BinarySearchFunc(tailK, k, slices.Compare[[]float32])
-		if !ok {
-			continue
-		}
-		foundKeys = append(foundKeys, tailK[j])
-		foundValues = append(foundValues, tailV[j])
-		tailK = tailK[j+1:]
-		tailV = tailV[j+1:]
-	}
-	return pb.StoresGetResult{
-		Keys:   store.WrapKeys(foundKeys),
-		Values: store.WrapValues(foundValues),
-	}, nil
-}
-
-// StoresFind returns the topK nearest stored entries by cosine
-// similarity, ordered most-similar first. An empty store returns
-// empty slices and no error.
-func (s *Store) StoresFind(opts *pb.StoresFindOptions) (pb.StoresFindResult, error) {
-	query := opts.Key.Floats
-	topK := int(opts.TopK)
-	if topK < 1 {
-		return pb.StoresFindResult{}, fmt.Errorf("local-store: Find: topK = %d, must be >= 1", topK)
-	}
-	if len(s.keys) == 0 {
-		return pb.StoresFindResult{}, nil
-	}
-	if len(query) != s.keyLen {
-		return pb.StoresFindResult{}, fmt.Errorf("local-store: Find: query length %d does not match existing %d", len(query), s.keyLen)
+		xlog.Debug("Get: No keys in store")
 	}

-	var keys [][]float32
-	var values [][]byte
-	var sims []float32
-	if s.keysAreNormalized && isNormalized(query) {
-		keys, values, sims = s.findNormalized(query, topK)
+	if s.keyLen == -1 {
+		s.keyLen = len(opts.Keys[0].Floats)
 	} else {
-		keys, values, sims = s.findFallback(query, topK)
+		if len(opts.Keys[0].Floats) != s.keyLen {
+			return pb.StoresGetResult{}, fmt.Errorf("trying to get a key with length %d when existing length is %d", len(opts.Keys[0].Floats), s.keyLen)
+		}
 	}
-	return pb.StoresFindResult{
-		Keys:         store.WrapKeys(keys),
-		Values:       store.WrapValues(values),
-		Similarities: sims,
+
+	tail_k := s.keys
+	tail_v := s.values
+	for i, k := range ks {
+		j, found := findInSortedSlice(tail_k, k)
+
+		if found {
+			pbKeys = append(pbKeys, &pb.StoresKey{
+				Floats: k,
+			})
+			pbValues = append(pbValues, &pb.StoresValue{
+				Bytes: tail_v[j],
+			})
+
+			tail_k = tail_k[j+1:]
+			tail_v = tail_v[j+1:]
+		} else {
+			assert(!hasKey(s.keys, k), fmt.Sprintf("Key exists, but was not found: i=%d, %v", i, k))
+		}
+	}
+
+	if len(pbKeys) != len(opts.Keys) {
+		xlog.Debug("Get: Some keys not found", "pbKeysLen", len(pbKeys), "optsKeysLen", len(opts.Keys), "storeKeysLen", len(s.keys))
+	}
+
+	return pb.StoresGetResult{
+		Keys:   pbKeys,
+		Values: pbValues,
 	}, nil
 }

-func (s *Store) findNormalized(query []float32, topK int) (keys [][]float32, values [][]byte, similarities []float32) {
-	assert(s.keysAreNormalized, "findNormalized: s.keysAreNormalized is false")
-	assert(isNormalized(query), "findNormalized: query is not unit-length")
-	pq := make(priorityQueue, 0, topK)
-	heap.Init(&pq)
-	for i, k := range s.keys {
-		var dot float32
-		for j := range k {
-			dot += query[j] * k[j]
-		}
-		assert(dot >= -1.01 && dot <= 1.01, fmt.Sprintf("findNormalized: dot %f out of [-1, 1] — keysAreNormalized invariant violated", dot))
-		heap.Push(&pq, &priorityItem{similarity: dot, key: k, value: s.values[i]})
-		if pq.Len() > topK {
-			heap.Pop(&pq)
-		}
-	}
-	return drainPQ(&pq)
-}
-
-func (s *Store) findFallback(query []float32, topK int) (keys [][]float32, values [][]byte, similarities []float32) {
-	var qmag float64
-	for _, v := range query {
-		qmag += float64(v) * float64(v)
-	}
-	qmag = math.Sqrt(qmag)
-	pq := make(priorityQueue, 0, topK)
-	heap.Init(&pq)
-	for i, k := range s.keys {
-		var dot, kmag float64
-		for j := range k {
-			dot += float64(query[j]) * float64(k[j])
-			kmag += float64(k[j]) * float64(k[j])
-		}
-		denom := qmag * math.Sqrt(kmag)
-		var sim float32
-		if denom > 0 {
-			sim = float32(dot / denom)
-		}
-		heap.Push(&pq, &priorityItem{similarity: sim, key: k, value: s.values[i]})
-		if pq.Len() > topK {
-			heap.Pop(&pq)
-		}
-	}
-	return drainPQ(&pq)
-}
-
 func isNormalized(k []float32) bool {
 	var sum float64
+
 	for _, v := range k {
-		sum += float64(v) * float64(v)
+		v64 := float64(v)
+		sum += v64 * v64
 	}
-	mag := math.Sqrt(sum)
-	return mag >= 0.99 && mag <= 1.01
+
+	s := math.Sqrt(sum)
+
+	return s >= 0.99 && s <= 1.01
 }

-type incomingPair struct {
-	key   []float32
-	value []byte
-}
+// TODO: This we could replace with handwritten SIMD code
+func normalizedCosineSimilarity(k1, k2 []float32) float32 {
+	assert(len(k1) == len(k2), fmt.Sprintf("normalizedCosineSimilarity: len(k1) = %d, len(k2) = %d", len(k1), len(k2)))

-type pairs struct {
-	keys   [][]float32
-	values [][]byte
-}
-
-// mergeSortedPairs merges (existing, incoming) into a fresh sorted
-// slice. Equal keys take the incoming value — Set is upsert.
-func mergeSortedPairs(existingK [][]float32, existingV [][]byte, incoming []incomingPair) pairs {
-	assert(slices.IsSortedFunc(existingK, slices.Compare[[]float32]), "mergeSortedPairs: existing not sorted")
-	assert(slices.IsSortedFunc(incoming, func(a, b incomingPair) int { return slices.Compare(a.key, b.key) }), "mergeSortedPairs: incoming not sorted")
-	l := len(existingK) + len(incoming)
-	mk := make([][]float32, 0, l)
-	mv := make([][]byte, 0, l)
-	i, j := 0, 0
-	for i < len(incoming) || j < len(existingK) {
-		switch {
-		case j >= len(existingK):
-			mk = append(mk, incoming[i].key)
-			mv = append(mv, incoming[i].value)
-			i++
-		case i >= len(incoming):
-			mk = append(mk, existingK[j])
-			mv = append(mv, existingV[j])
-			j++
-		default:
-			c := slices.Compare(incoming[i].key, existingK[j])
-			switch {
-			case c < 0:
-				mk = append(mk, incoming[i].key)
-				mv = append(mv, incoming[i].value)
-				i++
-			case c > 0:
-				mk = append(mk, existingK[j])
-				mv = append(mv, existingV[j])
-				j++
-			default:
-				mk = append(mk, incoming[i].key)
-				mv = append(mv, incoming[i].value)
-				i++
-				j++
-			}
-		}
+	var dot float32
+	for i := range len(k1) {
+		dot += k1[i] * k2[i]
 	}
-	return pairs{keys: mk, values: mv}
+
+	assert(dot >= -1.01 && dot <= 1.01, fmt.Sprintf("dot = %f", dot))
+
+	// 2.0 * (1.0 - dot) would be the squared Euclidean distance for unit-length vectors
+	return dot
 }

-type priorityItem struct {
-	similarity float32
-	key        []float32
-	value      []byte
+type PriorityItem struct {
+	Similarity float32
+	Key        []float32
+	Value      []byte
 }

-type priorityQueue []*priorityItem
+type PriorityQueue []*PriorityItem

-func (pq priorityQueue) Len() int           { return len(pq) }
-func (pq priorityQueue) Less(i, j int) bool { return pq[i].similarity < pq[j].similarity }
-func (pq priorityQueue) Swap(i, j int)      { pq[i], pq[j] = pq[j], pq[i] }
-func (pq *priorityQueue) Push(x any)        { *pq = append(*pq, x.(*priorityItem)) }
-func (pq *priorityQueue) Pop() any {
+func (pq PriorityQueue) Len() int { return len(pq) }
+
+func (pq PriorityQueue) Less(i, j int) bool {
+	// Min-heap on similarity: the least similar item sits at the root, so Pop evicts it when the queue exceeds TopK
+	return pq[i].Similarity < pq[j].Similarity
+}
+
+func (pq PriorityQueue) Swap(i, j int) {
+	pq[i], pq[j] = pq[j], pq[i]
+}
+
+func (pq *PriorityQueue) Push(x any) {
+	item := x.(*PriorityItem)
+	*pq = append(*pq, item)
+}
+
+func (pq *PriorityQueue) Pop() any {
 	old := *pq
 	n := len(old)
 	item := old[n-1]
@@ -336,16 +380,142 @@ func (pq *priorityQueue) Pop() any {
 	return item
 }

-func drainPQ(pq *priorityQueue) (keys [][]float32, values [][]byte, similarities []float32) {
-	n := pq.Len()
-	keys = make([][]float32, n)
-	values = make([][]byte, n)
-	similarities = make([]float32, n)
-	for i := n - 1; i >= 0; i-- {
-		item := heap.Pop(pq).(*priorityItem)
-		keys[i] = item.key
-		values[i] = item.value
-		similarities[i] = item.similarity
+func (s *Store) StoresFindNormalized(opts *pb.StoresFindOptions) (pb.StoresFindResult, error) {
+	tk := opts.Key.Floats
+	top_ks := make(PriorityQueue, 0, int(opts.TopK))
+	heap.Init(&top_ks)
+
+	for i, k := range s.keys {
+		sim := normalizedCosineSimilarity(tk, k)
+		heap.Push(&top_ks, &PriorityItem{
+			Similarity: sim,
+			Key:        k,
+			Value:      s.values[i],
+		})
+
+		if top_ks.Len() > int(opts.TopK) {
+			heap.Pop(&top_ks)
+		}
+	}
+
+	similarities := make([]float32, top_ks.Len())
+	pbKeys := make([]*pb.StoresKey, top_ks.Len())
+	pbValues := make([]*pb.StoresValue, top_ks.Len())
+
+	for i := top_ks.Len() - 1; i >= 0; i-- {
+		item := heap.Pop(&top_ks).(*PriorityItem)
+
+		similarities[i] = item.Similarity
+		pbKeys[i] = &pb.StoresKey{
+			Floats: item.Key,
+		}
+		pbValues[i] = &pb.StoresValue{
+			Bytes: item.Value,
+		}
+	}
+
+	return pb.StoresFindResult{
+		Keys:         pbKeys,
+		Values:       pbValues,
+		Similarities: similarities,
+	}, nil
+}
+
+func cosineSimilarity(k1, k2 []float32, mag1 float64) float32 {
+	assert(len(k1) == len(k2), fmt.Sprintf("cosineSimilarity: len(k1) = %d, len(k2) = %d", len(k1), len(k2)))
+
+	var dot, mag2 float64
+	for i := range len(k1) {
+		dot += float64(k1[i]) * float64(k2[i])
+		mag2 += float64(k2[i]) * float64(k2[i])
+	}
+
+	sim := float32(dot / (mag1 * math.Sqrt(mag2)))
+
+	assert(sim >= -1.01 && sim <= 1.01, fmt.Sprintf("sim = %f", sim))
+
+	return sim
+}
+
+func (s *Store) StoresFindFallback(opts *pb.StoresFindOptions) (pb.StoresFindResult, error) {
+	tk := opts.Key.Floats
+	top_ks := make(PriorityQueue, 0, int(opts.TopK))
+	heap.Init(&top_ks)
+
+	var mag1 float64
+	for _, v := range tk {
+		mag1 += float64(v) * float64(v)
+	}
+	mag1 = math.Sqrt(mag1)
+
+	for i, k := range s.keys {
+		sim := cosineSimilarity(tk, k, mag1)
+		heap.Push(&top_ks, &PriorityItem{
+			Similarity: sim,
+			Key:        k,
+			Value:      s.values[i],
+		})
+
+		if top_ks.Len() > int(opts.TopK) {
+			heap.Pop(&top_ks)
+		}
+	}
+
+	similarities := make([]float32, top_ks.Len())
+	pbKeys := make([]*pb.StoresKey, top_ks.Len())
+	pbValues := make([]*pb.StoresValue, top_ks.Len())
+
+	for i := top_ks.Len() - 1; i >= 0; i-- {
+		item := heap.Pop(&top_ks).(*PriorityItem)
+
+		similarities[i] = item.Similarity
+		pbKeys[i] = &pb.StoresKey{
+			Floats: item.Key,
+		}
+		pbValues[i] = &pb.StoresValue{
+			Bytes: item.Value,
+		}
+	}
+
+	return pb.StoresFindResult{
+		Keys:         pbKeys,
+		Values:       pbValues,
+		Similarities: similarities,
+	}, nil
+}
+
+func (s *Store) StoresFind(opts *pb.StoresFindOptions) (pb.StoresFindResult, error) {
+	tk := opts.Key.Floats
+
+	if len(tk) != s.keyLen {
+		return pb.StoresFindResult{}, fmt.Errorf("trying to find key with length %d when existing length is %d", len(tk), s.keyLen)
+	}
+
+	if opts.TopK < 1 {
+		return pb.StoresFindResult{}, fmt.Errorf("opts.TopK = %d, must be >= 1", opts.TopK)
+	}
+
+	if s.keyLen == -1 {
+		s.keyLen = len(opts.Key.Floats)
+	} else {
+		if len(opts.Key.Floats) != s.keyLen {
+			return pb.StoresFindResult{}, fmt.Errorf("trying to find key with length %d when existing length is %d", len(opts.Key.Floats), s.keyLen)
+		}
+	}
+
+	if s.keysAreNormalized && isNormalized(tk) {
+		return s.StoresFindNormalized(opts)
+	} else {
+		if s.keysAreNormalized {
+			var sample []float32
+			if len(tk) > 5 {
+				sample = tk[:5]
+			} else {
+				sample = tk
+			}
+			xlog.Debug("Trying to compare non-normalized key with normalized keys", "sample", sample)
+		}
+
+		return s.StoresFindFallback(opts)
 	}
-	return keys, values, similarities
 }
--- a/backend/go/local-store/store_suite_test.go
+++ b/backend/go/local-store/store_suite_test.go
@@ -1,13 +0,0 @@
-package main
-
-import (
-	"testing"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-func TestLocalStore(t *testing.T) {
-	RegisterFailHandler(Fail)
-	RunSpecs(t, "local-store test suite")
-}
--- a/backend/go/local-store/store_test.go
+++ b/backend/go/local-store/store_test.go
@@ -1,284 +0,0 @@
-package main
-
-// Regression suite for the local-store gRPC backend. Exercises the
-// Stores{Set,Get,Find,Delete} surface — the only public contract.
-// Callers (face/voice recognition, the routing KNN classifier) reach
-// this code via grpc.Backend, so testing at the wire-shaped boundary
-// matches the production import shape.
-
-import (
-	"math"
-	"math/rand/v2"
-	"testing"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-var _ = Describe("StoresSet", func() {
-	It("rejects empty input", func() {
-		Expect(NewStore().StoresSet(&pb.StoresSetOptions{})).NotTo(Succeed(), "Set with no keys should fail")
-	})
-
-	It("rejects key/value length mismatch", func() {
-		err := NewStore().StoresSet(&pb.StoresSetOptions{
-			Keys:   wrapKeys([][]float32{{1, 0, 0}}),
-			Values: wrapValues([][]byte{[]byte("a"), []byte("b")}),
-		})
-		Expect(err).To(HaveOccurred(), "len(keys) != len(values) should fail")
-	})
-
-	It("rejects dimension mismatch on later add", func() {
-		s := NewStore()
-		mustSet(s, [][]float32{{1, 0, 0}}, [][]byte{[]byte("3d")})
-		err := s.StoresSet(&pb.StoresSetOptions{
-			Keys:   wrapKeys([][]float32{{1, 0}}),
-			Values: wrapValues([][]byte{[]byte("2d")}),
-		})
-		Expect(err).To(HaveOccurred(), "dimension mismatch on later Set should fail")
-	})
-
-	It("rejects dimension mismatch within batch", func() {
-		err := NewStore().StoresSet(&pb.StoresSetOptions{
-			Keys:   wrapKeys([][]float32{{1, 0, 0}, {1, 0}}),
-			Values: wrapValues([][]byte{[]byte("3d"), []byte("2d")}),
-		})
-		Expect(err).To(HaveOccurred(), "mixed-dimension within one batch should fail")
-	})
-
-	It("merges sorted and updates existing key", func() {
-		s := NewStore()
-		mustSet(s, [][]float32{{0.3, 0, 0}, {0.1, 0, 0}}, [][]byte{[]byte("c"), []byte("a")})
-		mustSet(s, [][]float32{{0.2, 0, 0}, {0.1, 0, 0}}, [][]byte{[]byte("b"), []byte("a-updated")})
-		Expect(s.keys).To(HaveLen(3))
-		got := singleGet(s, []float32{0.1, 0, 0})
-		Expect(string(got)).To(Equal("a-updated"))
-	})
-})
-
-var _ = Describe("StoresGet", func() {
-	It("round-trips multi-key", func() {
-		s := NewStore()
-		mustSet(s,
-			[][]float32{{0.1, 0.2, 0.3}, {0.4, 0.5, 0.6}, {0.7, 0.8, 0.9}},
-			[][]byte{[]byte("a"), []byte("b"), []byte("c")},
-		)
-		res, err := s.StoresGet(&pb.StoresGetOptions{
-			Keys: wrapKeys([][]float32{{0.7, 0.8, 0.9}, {0.1, 0.2, 0.3}}),
-		})
-		Expect(err).NotTo(HaveOccurred())
-		Expect(res.Keys).To(HaveLen(2))
-	})
-
-	It("omits missing keys rather than erroring", func() {
-		s := NewStore()
-		mustSet(s, [][]float32{{0.1, 0, 0}}, [][]byte{[]byte("a")})
-		res, err := s.StoresGet(&pb.StoresGetOptions{
-			Keys: wrapKeys([][]float32{{0.1, 0, 0}, {0.9, 0, 0}}),
-		})
-		Expect(err).NotTo(HaveOccurred())
-		Expect(res.Keys).To(HaveLen(1))
-	})
-})
-
-var _ = Describe("StoresDelete", func() {
-	It("removes and preserves sort", func() {
-		s := NewStore()
-		mustSet(s,
-			[][]float32{{0.1, 0, 0}, {0.2, 0, 0}, {0.3, 0, 0}, {0.4, 0, 0}},
-			[][]byte{[]byte("a"), []byte("b"), []byte("c"), []byte("d")},
-		)
-		Expect(s.StoresDelete(&pb.StoresDeleteOptions{
-			Keys: wrapKeys([][]float32{{0.2, 0, 0}, {0.4, 0, 0}}),
-		})).To(Succeed())
-		Expect(s.keys).To(HaveLen(2))
-	})
-
-	It("tolerates missing keys", func() {
-		s := NewStore()
-		mustSet(s, [][]float32{{0.1, 0, 0}}, [][]byte{[]byte("a")})
-		Expect(s.StoresDelete(&pb.StoresDeleteOptions{
-			Keys: wrapKeys([][]float32{{0.9, 0, 0}}),
-		})).To(Succeed(), "delete of missing key should succeed")
-		Expect(s.keys).To(HaveLen(1))
-	})
-})
-
-var _ = Describe("StoresFind", func() {
-	It("returns normalized top-K", func() {
-		s := NewStore()
-		mustSet(s,
-			[][]float32{
-				normalizeVec([]float32{1, 0, 0}),
-				normalizeVec([]float32{0, 1, 0}),
-				normalizeVec([]float32{0, 0, 1}),
-			},
-			[][]byte{[]byte("x"), []byte("y"), []byte("z")},
-		)
-		res, err := s.StoresFind(&pb.StoresFindOptions{
-			Key:  &pb.StoresKey{Floats: normalizeVec([]float32{0.9, 0.1, 0})},
-			TopK: 2,
-		})
-		Expect(err).NotTo(HaveOccurred())
-		Expect(res.Keys).To(HaveLen(2))
-		Expect(res.Similarities[0]).To(BeNumerically(">=", res.Similarities[1]), "results not sorted desc by similarity")
-		Expect(string(res.Values[0].Bytes)).To(Equal("x"))
-	})
-
-	It("falls back for non-normalized keys", func() {
-		s := NewStore()
-		mustSet(s, [][]float32{{2, 0, 0}, {0, 3, 0}}, [][]byte{[]byte("x"), []byte("y")})
-		Expect(s.keysAreNormalized).To(BeFalse(), "store should report non-normalized after Set with magnitude > 1")
-		res, err := s.StoresFind(&pb.StoresFindOptions{
-			Key:  &pb.StoresKey{Floats: []float32{4, 0, 0}},
-			TopK: 1,
-		})
-		Expect(err).NotTo(HaveOccurred())
-		Expect(string(res.Values[0].Bytes)).To(Equal("x"))
-		Expect(res.Similarities[0]).To(BeNumerically(">=", float32(0.99)))
-		Expect(res.Similarities[0]).To(BeNumerically("<=", float32(1.01)))
-	})
-
-	It("rejects zero topK", func() {
-		s := NewStore()
-		mustSet(s, [][]float32{{1, 0, 0}}, [][]byte{[]byte("x")})
-		_, err := s.StoresFind(&pb.StoresFindOptions{
-			Key:  &pb.StoresKey{Floats: []float32{1, 0, 0}},
-			TopK: 0,
-		})
-		Expect(err).To(HaveOccurred(), "Find with topK=0 should fail")
-	})
-
-	It("rejects dimension mismatch", func() {
-		s := NewStore()
-		mustSet(s, [][]float32{{1, 0, 0}}, [][]byte{[]byte("x")})
-		_, err := s.StoresFind(&pb.StoresFindOptions{
-			Key:  &pb.StoresKey{Floats: []float32{1, 0}},
-			TopK: 1,
-		})
-		Expect(err).To(HaveOccurred(), "Find with mismatched dimension should fail")
-	})
-
-	It("returns empty result on empty store", func() {
-		res, err := NewStore().StoresFind(&pb.StoresFindOptions{
-			Key:  &pb.StoresKey{Floats: []float32{1, 0, 0}},
-			TopK: 5,
-		})
-		Expect(err).NotTo(HaveOccurred(), "Find on empty store should succeed")
-		Expect(res.Keys).To(BeEmpty())
-	})
-
-	It("handles topK larger than store", func() {
-		s := NewStore()
-		mustSet(s,
-			[][]float32{normalizeVec([]float32{1, 0, 0}), normalizeVec([]float32{0, 1, 0})},
-			[][]byte{[]byte("x"), []byte("y")},
-		)
-		res, err := s.StoresFind(&pb.StoresFindOptions{
-			Key:  &pb.StoresKey{Floats: normalizeVec([]float32{1, 0, 0})},
-			TopK: 10,
-		})
-		Expect(err).NotTo(HaveOccurred())
-		Expect(res.Keys).To(HaveLen(2))
-	})
-})
-
-var _ = Describe("StoresLoad", func() {
-	It("is a no-op", func() {
-		Expect(NewStore().Load(&pb.ModelOptions{Model: "any-namespace"})).To(Succeed())
-	})
-})
-
-func BenchmarkStoresFindNormalized(b *testing.B) {
-	const dim = 768
-	for _, n := range []int{8, 32, 128, 512} {
-		b.Run(fmtN(n), func(b *testing.B) {
-			s := buildStore(b, n, dim)
-			query := normalizeVec(randVec(dim, 42))
-			req := &pb.StoresFindOptions{Key: &pb.StoresKey{Floats: query}, TopK: 1}
-			b.ResetTimer()
-			for i := 0; i < b.N; i++ {
-				if _, err := s.StoresFind(req); err != nil {
-					b.Fatal(err)
-				}
-			}
-		})
-	}
-}
-
-// --- test helpers ---
-
-func mustSet(s *Store, keys [][]float32, values [][]byte) {
-	ExpectWithOffset(1, s.StoresSet(&pb.StoresSetOptions{Keys: wrapKeys(keys), Values: wrapValues(values)})).To(Succeed())
-}
-
-func singleGet(s *Store, key []float32) []byte {
-	res, err := s.StoresGet(&pb.StoresGetOptions{Keys: wrapKeys([][]float32{key})})
-	ExpectWithOffset(1, err).NotTo(HaveOccurred())
-	if len(res.Values) == 0 {
-		return nil
-	}
-	return res.Values[0].Bytes
-}
-
-func wrapKeys(in [][]float32) []*pb.StoresKey {
-	out := make([]*pb.StoresKey, len(in))
-	for i, k := range in {
-		out[i] = &pb.StoresKey{Floats: k}
-	}
-	return out
-}
-
-func wrapValues(in [][]byte) []*pb.StoresValue {
-	out := make([]*pb.StoresValue, len(in))
-	for i, v := range in {
-		out[i] = &pb.StoresValue{Bytes: v}
-	}
-	return out
-}
-
-func buildStore(tb testing.TB, n, dim int) *Store {
-	tb.Helper()
-	s := NewStore()
-	keys := make([][]float32, n)
-	values := make([][]byte, n)
-	for i := 0; i < n; i++ {
-		keys[i] = normalizeVec(randVec(dim, int64(i)+1))
-		values[i] = []byte{byte(i)}
-	}
-	if err := s.StoresSet(&pb.StoresSetOptions{Keys: wrapKeys(keys), Values: wrapValues(values)}); err != nil {
-		tb.Fatal(err)
-	}
-	return s
-}
-
-func randVec(dim int, seed int64) []float32 {
-	r := rand.New(rand.NewPCG(uint64(seed), 0xabcdef))
-	v := make([]float32, dim)
-	for i := range v {
-		v[i] = float32(r.NormFloat64())
-	}
-	return v
-}
-
-func normalizeVec(v []float32) []float32 {
-	var sum float64
-	for _, x := range v {
-		sum += float64(x) * float64(x)
-	}
-	mag := math.Sqrt(sum)
-	if mag == 0 {
-		return v
-	}
-	out := make([]float32, len(v))
-	for i, x := range v {
-		out[i] = float32(float64(x) / mag)
-	}
-	return out
-}
-
-func fmtN(n int) string {
-	return map[int]string{8: "n=8", 32: "n=32", 128: "n=128", 512: "n=512"}[n]
-}
--- a/backend/go/localvqe/Makefile
+++ b/backend/go/localvqe/Makefile
@@ -9,7 +9,7 @@ JOBS?=$(shell nproc --ignore=1)
 # LocalVQE upstream version pin. Bump to a specific commit when picking up
 # a new release; `main` works for development but is not reproducible.
 LOCALVQE_REPO?=https://github.com/localai-org/LocalVQE
-LOCALVQE_VERSION?=b0f0378a450e87c871b85689554801601ca56d98
+LOCALVQE_VERSION?=72bfb4c6

 # LocalVQE handles CPU feature selection internally (it ships the multiple
 # libggml-cpu-*.so variants and its loader picks the best one at runtime
@@ -27,8 +27,7 @@ endif

 # LocalVQE upstream supports CPU + Vulkan only. Other BUILD_TYPE values
 # fall through to the default CPU build — Vulkan is already as fast as the
-# specialised GPU paths would be on these small (1.3 M–4.8 M parameter)
-# models.
+# specialised GPU paths would be on this 1.3 M-parameter model.
 ifeq ($(BUILD_TYPE),vulkan)
 	CMAKE_ARGS+=-DGGML_VULKAN=ON -DLOCALVQE_VULKAN=ON
 else ifeq ($(OS),Darwin)
--- a/backend/go/localvqe/golocalvqe.go
+++ b/backend/go/localvqe/golocalvqe.go
@@ -3,6 +3,7 @@ package main
 import (
 	"encoding/binary"
 	"fmt"
+	"io"
 	"os"
 	"path/filepath"
 	"runtime"
@@ -10,7 +11,6 @@ import (
 	"strings"
 	"unsafe"

-	"github.com/go-audio/wav"
 	"github.com/mudler/LocalAI/pkg/grpc/base"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 	"github.com/mudler/xlog"
@@ -46,24 +46,24 @@ const (
 // through the options builder (CppOptionsNew + setters + CppNewWithOptions)
 // — the bare localvqe_new path doesn't expose backend / device selection.
 var (
-	CppOptionsNew          func() uintptr
-	CppOptionsFree         func(opts uintptr)
-	CppOptionsSetModelPath func(opts uintptr, modelPath string) int32
-	CppOptionsSetBackend   func(opts uintptr, backend string) int32
-	CppOptionsSetDevice    func(opts uintptr, device int32) int32
-	CppNewWithOptions      func(opts uintptr) uintptr
-	CppFree                func(ctx uintptr)
-	CppProcessF32          func(ctx uintptr, mic, ref uintptr, nSamples int32, out uintptr) int32
-	CppProcessS16          func(ctx uintptr, mic, ref uintptr, nSamples int32, out uintptr) int32
-	CppProcessFrameF32     func(ctx uintptr, mic, ref uintptr, hopSamples int32, out uintptr) int32
-	CppProcessFrameS16     func(ctx uintptr, mic, ref uintptr, hopSamples int32, out uintptr) int32
-	CppReset               func(ctx uintptr)
-	CppLastError           func(ctx uintptr) string
-	CppSampleRate          func(ctx uintptr) int32
-	CppHopLength           func(ctx uintptr) int32
-	CppFFTSize             func(ctx uintptr) int32
-	CppSetNoiseGate        func(ctx uintptr, enabled int32, thresholdDBFS float32) int32
-	CppGetNoiseGate        func(ctx uintptr, enabledOut, thresholdDBFSOut uintptr) int32
+	CppOptionsNew           func() uintptr
+	CppOptionsFree          func(opts uintptr)
+	CppOptionsSetModelPath  func(opts uintptr, modelPath string) int32
+	CppOptionsSetBackend    func(opts uintptr, backend string) int32
+	CppOptionsSetDevice     func(opts uintptr, device int32) int32
+	CppNewWithOptions       func(opts uintptr) uintptr
+	CppFree                 func(ctx uintptr)
+	CppProcessF32           func(ctx uintptr, mic, ref uintptr, nSamples int32, out uintptr) int32
+	CppProcessS16           func(ctx uintptr, mic, ref uintptr, nSamples int32, out uintptr) int32
+	CppProcessFrameF32      func(ctx uintptr, mic, ref uintptr, hopSamples int32, out uintptr) int32
+	CppProcessFrameS16      func(ctx uintptr, mic, ref uintptr, hopSamples int32, out uintptr) int32
+	CppReset                func(ctx uintptr)
+	CppLastError            func(ctx uintptr) string
+	CppSampleRate           func(ctx uintptr) int32
+	CppHopLength            func(ctx uintptr) int32
+	CppFFTSize              func(ctx uintptr) int32
+	CppSetNoiseGate         func(ctx uintptr, enabled int32, thresholdDBFS float32) int32
+	CppGetNoiseGate         func(ctx uintptr, enabledOut, thresholdDBFSOut uintptr) int32
 )

 // LocalVQE speaks gRPC against LocalVQE's flat C ABI. The streaming
@@ -490,14 +490,11 @@ func (v *LocalVQE) applyStreamConfig(cfg *pb.AudioTransformStreamConfig) error {

 // ---- WAV I/O ----------------------------------------------------------
 //
-// Reader/writer for the mono 16-bit PCM shape LocalVQE works with. Decoding
-// goes through the shared go-audio/wav decoder (as the whisper and parakeet
-// backends do) so RIFF chunk walking is handled robustly — an 18/40-byte
-// extensible `fmt ` chunk, or JUNK/bext/LIST metadata before or after `data`
-// (e.g. ffmpeg's trailing "Lavf" tag), is skipped rather than spliced into
-// the PCM stream as an audible click. The HTTP layer normalises arbitrary
-// input to WAV before we see it, but that WAV is ffmpeg output and is not
-// guaranteed to be the canonical 44-byte layout.
+// Minimal mono PCM WAV reader/writer. Only handles the subset LocalVQE
+// cares about (mono, 16-bit signed, canonical 44-byte header). The reader
+// treats every byte after the header as PCM, so an extensible `fmt ` chunk
+// or metadata chunks around `data` are not skipped. It relies on the HTTP
+// layer's `audio.NormalizeAudioFile` producing that canonical shape.

 func readMonoWAVf32(path string) ([]float32, int, error) {
 	f, err := os.Open(path)
@@ -505,26 +502,35 @@ func readMonoWAVf32(path string) ([]float32, int, error) {
 		return nil, 0, err
 	}
 	defer func() { _ = f.Close() }()
-
-	buf, err := wav.NewDecoder(f).FullPCMBuffer()
-	if err != nil {
-		return nil, 0, fmt.Errorf("decode WAV: %w", err)
+	header := make([]byte, 44)
+	if _, err := io.ReadFull(f, header); err != nil {
+		return nil, 0, err
 	}
-	if buf == nil || buf.Format == nil {
+	if string(header[0:4]) != "RIFF" || string(header[8:12]) != "WAVE" {
 		return nil, 0, fmt.Errorf("not a WAV file")
 	}
-	if buf.Format.NumChannels != 1 {
-		return nil, 0, fmt.Errorf("only mono WAV supported (got %d channels)", buf.Format.NumChannels)
+	channels := binary.LittleEndian.Uint16(header[22:24])
+	sampleRate := binary.LittleEndian.Uint32(header[24:28])
+	bitsPerSample := binary.LittleEndian.Uint16(header[34:36])
+
+	if channels != 1 {
+		return nil, 0, fmt.Errorf("only mono WAV supported (got %d channels)", channels)
 	}
-	if buf.SourceBitDepth != 16 {
-		return nil, 0, fmt.Errorf("only 16-bit PCM supported (got %d bits)", buf.SourceBitDepth)
+	if bitsPerSample != 16 {
+		return nil, 0, fmt.Errorf("only 16-bit PCM supported (got %d bits)", bitsPerSample)
 	}
-	if len(buf.Data) == 0 {
-		return nil, 0, fmt.Errorf("WAV has no audio data")
+
+	rest, err := io.ReadAll(f)
+	if err != nil {
+		return nil, 0, err
 	}
-	// AsFloat32Buffer normalises by 2^(bitDepth-1) == /32768 for 16-bit,
-	// matching the model's expected [-1, 1) input range.
-	return buf.AsFloat32Buffer().Data, buf.Format.SampleRate, nil
+	n := len(rest) / 2
+	out := make([]float32, n)
+	for i := 0; i < n; i++ {
+		s := int16(binary.LittleEndian.Uint16(rest[i*2 : i*2+2]))
+		out[i] = float32(s) / 32768.0
+	}
+	return out, int(sampleRate), nil
 }

 func writeMonoWAVf32(path string, samples []float32, sampleRate int) error {
@@ -540,13 +546,13 @@ func writeMonoWAVf32(path string, samples []float32, sampleRate int) error {
 	binary.LittleEndian.PutUint32(header[4:8], 36+dataLen)
 	copy(header[8:12], []byte("WAVE"))
 	copy(header[12:16], []byte("fmt "))
-	binary.LittleEndian.PutUint32(header[16:20], 16) // fmt chunk size
-	binary.LittleEndian.PutUint16(header[20:22], 1)  // PCM
-	binary.LittleEndian.PutUint16(header[22:24], 1)  // mono
+	binary.LittleEndian.PutUint32(header[16:20], 16)        // fmt chunk size
+	binary.LittleEndian.PutUint16(header[20:22], 1)         // PCM
+	binary.LittleEndian.PutUint16(header[22:24], 1)         // mono
 	binary.LittleEndian.PutUint32(header[24:28], uint32(sampleRate))
 	binary.LittleEndian.PutUint32(header[28:32], uint32(sampleRate*2)) // byte rate
-	binary.LittleEndian.PutUint16(header[32:34], 2)                    // block align
-	binary.LittleEndian.PutUint16(header[34:36], 16)                   // bits per sample
+	binary.LittleEndian.PutUint16(header[32:34], 2)         // block align
+	binary.LittleEndian.PutUint16(header[34:36], 16)        // bits per sample
 	copy(header[36:40], []byte("data"))
 	binary.LittleEndian.PutUint32(header[40:44], dataLen)
 	if _, err := f.Write(header); err != nil {
--- a/backend/go/localvqe/localvqe_test.go
+++ b/backend/go/localvqe/localvqe_test.go
@@ -1,9 +1,7 @@
 package main

 import (
-	"encoding/binary"
 	"os"
-	"path/filepath"
 	"testing"

 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
@@ -94,147 +92,6 @@ var _ = Describe("LocalVQE-cpp", func() {
 		})
 	})

-	Context("readMonoWAVf32 chunk parsing", func() {
-		// chunk builds a word-aligned RIFF sub-chunk (id + size + body + pad).
-		chunk := func(id string, body []byte) []byte {
-			out := append([]byte(id), 0, 0, 0, 0)
-			binary.LittleEndian.PutUint32(out[4:8], uint32(len(body)))
-			out = append(out, body...)
-			if len(body)&1 == 1 {
-				out = append(out, 0) // pad byte for odd-sized chunks
-			}
-			return out
-		}
-		// fmtBody returns a PCM `fmt ` chunk body. extra bytes simulate the
-		// 18/40-byte extensible form (cbSize + extension).
-		fmtBody := func(channels, bits uint16, rate uint32, extra int) []byte {
-			b := make([]byte, 16+extra)
-			binary.LittleEndian.PutUint16(b[0:2], 1) // PCM
-			binary.LittleEndian.PutUint16(b[2:4], channels)
-			binary.LittleEndian.PutUint32(b[4:8], rate)
-			binary.LittleEndian.PutUint32(b[8:12], rate*uint32(channels)*uint32(bits)/8)
-			binary.LittleEndian.PutUint16(b[12:14], channels*bits/8)
-			binary.LittleEndian.PutUint16(b[14:16], bits)
-			if extra >= 2 {
-				binary.LittleEndian.PutUint16(b[16:18], uint16(extra-2)) // cbSize
-			}
-			return b
-		}
-		// pcm encodes int16 samples little-endian.
-		pcm := func(samples ...int16) []byte {
-			b := make([]byte, len(samples)*2)
-			for i, s := range samples {
-				binary.LittleEndian.PutUint16(b[i*2:i*2+2], uint16(s))
-			}
-			return b
-		}
-		riff := func(chunks ...[]byte) []byte {
-			body := []byte("WAVE")
-			for _, c := range chunks {
-				body = append(body, c...)
-			}
-			out := append([]byte("RIFF"), 0, 0, 0, 0)
-			binary.LittleEndian.PutUint32(out[4:8], uint32(len(body)))
-			return append(out, body...)
-		}
-		writeWAV := func(b []byte) string {
-			p := filepath.Join(GinkgoT().TempDir(), "in.wav")
-			Expect(os.WriteFile(p, b, 0o600)).To(Succeed())
-			return p
-		}
-		// A canonical sample run with distinct values so any off-by-one /
-		// misalignment shows up as wrong numbers, not just wrong length.
-		samples := []int16{1000, -2000, 3000, -4000, 5000, -6000}
-		expectSamples := func(got []float32) {
-			Expect(got).To(HaveLen(len(samples)))
-			for i, s := range samples {
-				Expect(got[i]).To(BeNumerically("~", float32(s)/32768.0, 1e-6))
-			}
-		}
-
-		It("reads a canonical 44-byte WAV", func() {
-			p := writeWAV(riff(chunk("fmt ", fmtBody(1, 16, 16000, 0)), chunk("data", pcm(samples...))))
-			out, sr, err := readMonoWAVf32(p)
-			Expect(err).ToNot(HaveOccurred())
-			Expect(sr).To(Equal(16000))
-			expectSamples(out)
-		})
-
-		It("ignores a LIST/JUNK chunk placed before data (no leading-impulse splice)", func() {
-			p := writeWAV(riff(
-				chunk("fmt ", fmtBody(1, 16, 16000, 0)),
-				chunk("JUNK", []byte("padding-bytes-here!")), // odd length → exercises pad
-				chunk("LIST", []byte("INFOISFTLavf60.0")),
-				chunk("data", pcm(samples...)),
-			))
-			out, sr, err := readMonoWAVf32(p)
-			Expect(err).ToNot(HaveOccurred())
-			Expect(sr).To(Equal(16000))
-			expectSamples(out) // not corrupted by the preceding chunks
-		})
-
-		It("honours the data chunk size and drops a trailing metadata chunk", func() {
-			p := writeWAV(riff(
-				chunk("fmt ", fmtBody(1, 16, 16000, 0)),
-				chunk("data", pcm(samples...)),
-				chunk("LIST", []byte("INFOISFTLavf60.16.100")), // ffmpeg trailer tag
-			))
-			out, _, err := readMonoWAVf32(p)
-			Expect(err).ToNot(HaveOccurred())
-			expectSamples(out) // trailing LIST bytes not decoded as PCM
-		})
-
-		It("handles the 18-byte extensible fmt chunk", func() {
-			p := writeWAV(riff(chunk("fmt ", fmtBody(1, 16, 16000, 2)), chunk("data", pcm(samples...))))
-			out, sr, err := readMonoWAVf32(p)
-			Expect(err).ToNot(HaveOccurred())
-			Expect(sr).To(Equal(16000))
-			expectSamples(out)
-		})
-
-		It("rejects non-mono input", func() {
-			p := writeWAV(riff(chunk("fmt ", fmtBody(2, 16, 16000, 0)), chunk("data", pcm(samples...))))
-			_, _, err := readMonoWAVf32(p)
-			Expect(err).To(HaveOccurred())
-			Expect(err.Error()).To(ContainSubstring("mono"))
-		})
-
-		It("rejects non-16-bit input", func() {
-			p := writeWAV(riff(chunk("fmt ", fmtBody(1, 8, 16000, 0)), chunk("data", pcm(samples...))))
-			_, _, err := readMonoWAVf32(p)
-			Expect(err).To(HaveOccurred())
-			Expect(err.Error()).To(ContainSubstring("16-bit"))
-		})
-
-		It("rejects a non-WAV file", func() {
-			p := writeWAV([]byte("not a riff file at all"))
-			_, _, err := readMonoWAVf32(p)
-			Expect(err).To(HaveOccurred())
-		})
-
-		It("errors when the data chunk is missing", func() {
-			// fmt but no data: the decoder must fail rather than return an
-			// empty (or garbage) sample slice. The exact message is the
-			// decoder's, so just assert it errors.
-			p := writeWAV(riff(chunk("fmt ", fmtBody(1, 16, 16000, 0))))
-			_, _, err := readMonoWAVf32(p)
-			Expect(err).To(HaveOccurred())
-		})
-
-		It("round-trips through writeMonoWAVf32", func() {
-			p := filepath.Join(GinkgoT().TempDir(), "rt.wav")
-			in := []float32{0.1, -0.2, 0.3, -0.4}
-			Expect(writeMonoWAVf32(p, in, 16000)).To(Succeed())
-			out, sr, err := readMonoWAVf32(p)
-			Expect(err).ToNot(HaveOccurred())
-			Expect(sr).To(Equal(16000))
-			Expect(out).To(HaveLen(len(in)))
-			for i := range in {
-				Expect(out[i]).To(BeNumerically("~", in[i], 1e-4))
-			}
-		})
-	})
-
 	Context("model-gated integration (LOCALVQE_MODEL_PATH)", func() {
 		It("load + sample rate + hop + fft", func() {
 			path := modelPathOrSkip()
--- a/backend/go/parakeet-cpp/.gitignore
+++ b/backend/go/parakeet-cpp/.gitignore
@@ -1,11 +0,0 @@
-.cache/
-sources/
-build/
-package/
-parakeet-cpp-grpc
-# build artifacts staged in-tree by the Makefile (cp from sources/) or
-# symlinked for local dev; the real sources live in parakeet.cpp upstream.
-*.so
-*.so.*
-parakeet_capi.h
-compile_commands.json
--- a/backend/go/parakeet-cpp/Makefile
+++ b/backend/go/parakeet-cpp/Makefile
@@ -1,93 +0,0 @@
-# parakeet-cpp backend Makefile.
-#
-# Upstream pin lives below as PARAKEET_VERSION?=e270af73b94c9a5c37ec516230219ed4580e1db6
-# (.github/bump_deps.sh) can find and update it - matches the
-# whisper.cpp / ds4 / vibevoice-cpp convention.
-#
-# Local dev shortcut: if you already have an out-of-tree parakeet.cpp
-# build, you can symlink the .so + header into this directory and skip
-# the clone/cmake steps entirely, e.g.:
-#
-#   ln -sf /path/to/parakeet.cpp/build-shared/libparakeet.so .
-#   ln -sf /path/to/parakeet.cpp/include/parakeet_capi.h .
-#   go build -o parakeet-cpp-grpc .
-#
-# That's what the L0 smoke test uses. The default target below does the
-# proper clone-at-pin + cmake build so CI doesn't need a side-checkout.
-
-PARAKEET_VERSION?=e270af73b94c9a5c37ec516230219ed4580e1db6
-PARAKEET_REPO?=https://github.com/mudler/parakeet.cpp
-
-GOCMD?=go
-GO_TAGS?=
-JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
-
-BUILD_TYPE?=
-NATIVE?=false
-
-# Build ggml statically into libparakeet.so (PIC) so the shared lib is
-# self-contained: dlopen needs no libggml*.so alongside it, only system libs
-# (libstdc++/libgomp/libc) that the runtime image already provides.
-CMAKE_ARGS?=-DCMAKE_BUILD_TYPE=Release -DPARAKEET_SHARED=ON -DPARAKEET_BUILD_CLI=OFF -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON
-
-ifeq ($(NATIVE),false)
-	CMAKE_ARGS+=-DGGML_NATIVE=OFF
-endif
-
-# parakeet.cpp gates its GGML backends behind PARAKEET_GGML_* options and does
-# set(GGML_CUDA ${PARAKEET_GGML_CUDA} CACHE BOOL "" FORCE), so a bare -DGGML_CUDA=ON
-# is overwritten back to OFF and the build silently falls back to CPU. Forward the
-# PARAKEET_GGML_* options instead. (openblas is not gated, so -DGGML_BLAS passes through.)
-ifeq ($(BUILD_TYPE),cublas)
-	CMAKE_ARGS+=-DPARAKEET_GGML_CUDA=ON
-else ifeq ($(BUILD_TYPE),openblas)
-	CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
-else ifeq ($(BUILD_TYPE),hipblas)
-	CMAKE_ARGS+=-DPARAKEET_GGML_HIP=ON
-else ifeq ($(BUILD_TYPE),vulkan)
-	CMAKE_ARGS+=-DPARAKEET_GGML_VULKAN=ON
-endif
-
-.PHONY: parakeet-cpp-grpc package build clean purge test all
-
-all: parakeet-cpp-grpc
-
-# Clone the upstream parakeet.cpp source at the pinned commit. Directory
-# acts as the target so make only re-clones when missing. After a
-# PARAKEET_VERSION bump, run 'make purge && make' to refetch.
-sources/parakeet.cpp:
-	mkdir -p sources/parakeet.cpp
-	cd sources/parakeet.cpp && \
-	git init -q && \
-	git remote add origin $(PARAKEET_REPO) && \
-	git fetch --depth 1 origin $(PARAKEET_VERSION) && \
-	git checkout FETCH_HEAD && \
-	git submodule update --init --recursive --depth 1 --single-branch
-
-# Build the shared lib + header out-of-tree, then stage them next to the
-# Go sources so purego.Dlopen("libparakeet.so") and the cgo-less build
-# both pick them up.
-libparakeet.so: sources/parakeet.cpp
-	cmake -B sources/parakeet.cpp/build-shared -S sources/parakeet.cpp $(CMAKE_ARGS)
-	cmake --build sources/parakeet.cpp/build-shared --config Release -j$(JOBS)
-	cp -fv sources/parakeet.cpp/build-shared/libparakeet.so* ./ 2>/dev/null || true
-	cp -fv sources/parakeet.cpp/include/parakeet_capi.h ./
-
-parakeet-cpp-grpc: libparakeet.so main.go goparakeetcpp.go
-	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o parakeet-cpp-grpc .
-
-package: parakeet-cpp-grpc
-	bash package.sh
-
-build: package
-
-# Test target. Smoke test is gated on PARAKEET_BACKEND_TEST_MODEL +
-# PARAKEET_BACKEND_TEST_WAV; without them the spec auto-skips.
-test:
-	LD_LIBRARY_PATH=$(CURDIR):$$LD_LIBRARY_PATH $(GOCMD) test ./... -count=1
-
-clean: purge
-	rm -rf libparakeet.so* parakeet_capi.h package parakeet-cpp-grpc
-
-purge:
-	rm -rf sources/parakeet.cpp
--- a/backend/go/parakeet-cpp/batcher.go
+++ b/backend/go/parakeet-cpp/batcher.go
@@ -1,105 +0,0 @@
-package main
-
-import "time"
-
-// batchRequest is one in-flight unary transcription waiting to be batched.
-// In production pcm/decoder are set; tag is an opaque marker used by tests.
-type batchRequest struct {
-	pcm     []float32
-	decoder int32
-	// language is the per-request target locale ("" means the model default).
-	// parakeet.cpp's batched C-API takes ONE target_lang for the whole batch,
-	// so the dispatcher only coalesces requests that share a language.
-	language string
-	tag      string
-	reply    chan batchReply
-}
-
-// batchReply carries one per-item JSON object string (an element of the C-API's
-// JSON array) or an error back to the waiting handler goroutine.
-type batchReply struct {
-	json string
-	err  error
-}
-
-// batcher coalesces concurrent batchRequests into batched runBatch calls. A
-// single run() goroutine is the sole caller of runBatch, so runBatch (which in
-// production calls the thread-unsafe C engine) is never entered concurrently.
-type batcher struct {
-	submit   chan *batchRequest
-	maxSize  int
-	maxWait  time.Duration
-	runBatch func(reqs []*batchRequest) // must deliver a reply to every req
-}
-
-func newBatcher(maxSize int, maxWait time.Duration, runBatch func([]*batchRequest)) *batcher {
-	if maxSize < 1 {
-		maxSize = 1
-	}
-	return &batcher{
-		submit:   make(chan *batchRequest),
-		maxSize:  maxSize,
-		maxWait:  maxWait,
-		runBatch: runBatch,
-	}
-}
-
-// run is the dispatcher loop: accumulate submitted requests until either maxSize
-// is reached or maxWait elapses since the first queued request, then dispatch.
-// Exits when stop is closed (draining any partially-filled batch first).
-//
-// A batch carries ONE language (parakeet.cpp's batched C-API takes a single
-// target_lang), so a request whose language differs from the batch leader is
-// not coalesced: it is held in carry and becomes the leader of the next batch.
-// carry is therefore never dropped and its caller never deadlocks: every batch
-// (including a lone carry on stop) is dispatched, and runBatch replies to all.
-func (b *batcher) run(stop <-chan struct{}) {
-	var carry *batchRequest
-	for {
-		var first *batchRequest
-		if carry != nil {
-			// A mismatched request from the previous fill leads this batch.
-			first, carry = carry, nil
-		} else {
-			select {
-			case first = <-b.submit:
-			case <-stop:
-				return
-			}
-		}
-		batch := []*batchRequest{first}
-
-		// maxSize==1 disables batching: dispatch immediately (passthrough).
-		if b.maxSize == 1 {
-			b.runBatch(batch)
-			continue
-		}
-
-		timer := time.NewTimer(b.maxWait)
-	fill:
-		for len(batch) < b.maxSize {
-			select {
-			case r := <-b.submit:
-				if r.language != first.language {
-					// Different language: carry it to the next batch so this
-					// batch stays single-language, then dispatch what we have.
-					carry = r
-					break fill
-				}
-				batch = append(batch, r)
-			case <-timer.C:
-				break fill
-			case <-stop:
-				timer.Stop()
-				b.runBatch(batch)
-				// Don't strand a carried request's caller on shutdown.
-				if carry != nil {
-					b.runBatch([]*batchRequest{carry})
-				}
-				return
-			}
-		}
-		timer.Stop()
-		b.runBatch(batch)
-	}
-}
--- a/backend/go/parakeet-cpp/batcher_test.go
+++ b/backend/go/parakeet-cpp/batcher_test.go
@@ -1,164 +0,0 @@
-package main
-
-import (
-	"sync"
-	"time"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-var _ = Describe("batcher", func() {
-	echoReply := func(reqs []*batchRequest) {
-		for _, r := range reqs {
-			r.reply <- batchReply{json: r.tag}
-		}
-	}
-
-	It("coalesces concurrent submits into batches", func() {
-		var mu sync.Mutex
-		var sizes []int
-		run := func(reqs []*batchRequest) {
-			mu.Lock()
-			sizes = append(sizes, len(reqs))
-			mu.Unlock()
-			echoReply(reqs)
-		}
-		b := newBatcher(4, 50*time.Millisecond, run)
-		stop := make(chan struct{})
-		go b.run(stop)
-		defer close(stop)
-
-		const N = 4
-		var wg sync.WaitGroup
-		got := make([]string, N)
-		for i := 0; i < N; i++ {
-			wg.Add(1)
-			go func(i int) {
-				defer wg.Done()
-				rep := make(chan batchReply, 1)
-				b.submit <- &batchRequest{tag: string(rune('a' + i)), reply: rep}
-				got[i] = (<-rep).json
-			}(i)
-		}
-		wg.Wait()
-
-		mu.Lock()
-		defer mu.Unlock()
-		total, maxBatch := 0, 0
-		for _, s := range sizes {
-			total += s
-			if s > maxBatch {
-				maxBatch = s
-			}
-		}
-		Expect(total).To(Equal(N))
-		Expect(maxBatch).To(BeNumerically(">=", 2), "expected at least one batch to coalesce >1 request")
-	})
-
-	It("dispatches when max size is reached", func() {
-		dispatched := make(chan int, 8)
-		run := func(reqs []*batchRequest) {
-			dispatched <- len(reqs)
-			echoReply(reqs)
-		}
-		b := newBatcher(2, time.Hour, run) // huge window: only size can trigger
-		stop := make(chan struct{})
-		go b.run(stop)
-		defer close(stop)
-		for i := 0; i < 2; i++ {
-			rep := make(chan batchReply, 1)
-			b.submit <- &batchRequest{tag: "x", reply: rep}
-			go func(rep chan batchReply) { <-rep }(rep)
-		}
-		Eventually(dispatched, "2s").Should(Receive(Equal(2)))
-	})
-
-	It("dispatches when the wait window elapses", func() {
-		dispatched := make(chan int, 8)
-		run := func(reqs []*batchRequest) {
-			dispatched <- len(reqs)
-			echoReply(reqs)
-		}
-		b := newBatcher(8, 20*time.Millisecond, run) // size unreachable; window fires
-		stop := make(chan struct{})
-		go b.run(stop)
-		defer close(stop)
-		rep := make(chan batchReply, 1)
-		b.submit <- &batchRequest{tag: "x", reply: rep}
-		go func() { <-rep }()
-		Eventually(dispatched, "2s").Should(Receive(Equal(1)))
-	})
-
-	It("bypasses batching when max size is 1", func() {
-		dispatched := make(chan int, 8)
-		run := func(reqs []*batchRequest) {
-			dispatched <- len(reqs)
-			echoReply(reqs)
-		}
-		b := newBatcher(1, time.Hour, run) // size 1 => immediate dispatch
-		stop := make(chan struct{})
-		go b.run(stop)
-		defer close(stop)
-		rep := make(chan batchReply, 1)
-		b.submit <- &batchRequest{tag: "x", reply: rep}
-		go func() { <-rep }()
-		Eventually(dispatched, "2s").Should(Receive(Equal(1)))
-	})
-
-	It("never coalesces requests with different languages into one batch", func() {
-		// parakeet.cpp's batched C-API takes ONE target_lang per batch, so the
-		// dispatcher must keep every dispatched batch single-language. Submit a
-		// mix of languages and assert (a) no batch ever carries more than one
-		// distinct language and (b) every submitted request still gets a reply
-		// (the mismatched carry-over is never dropped).
-		var mu sync.Mutex
-		var langsPerBatch [][]string
-		run := func(reqs []*batchRequest) {
-			seen := map[string]struct{}{}
-			var distinct []string
-			for _, r := range reqs {
-				if _, ok := seen[r.language]; !ok {
-					seen[r.language] = struct{}{}
-					distinct = append(distinct, r.language)
-				}
-			}
-			mu.Lock()
-			langsPerBatch = append(langsPerBatch, distinct)
-			mu.Unlock()
-			echoReply(reqs)
-		}
-		// Large window + size so the fill loop stays open across submits and the
-		// language constraint (not the timer) is what splits the batches.
-		b := newBatcher(16, 200*time.Millisecond, run)
-		stop := make(chan struct{})
-		go b.run(stop)
-		defer close(stop)
-
-		langs := []string{"en", "en", "de", "de", "en", "fr", "fr"}
-		const N = 7
-		var wg sync.WaitGroup
-		got := make([]string, N)
-		for i := 0; i < N; i++ {
-			wg.Add(1)
-			go func(i int) {
-				defer wg.Done()
-				rep := make(chan batchReply, 1)
-				b.submit <- &batchRequest{tag: string(rune('a' + i)), language: langs[i], reply: rep}
-				got[i] = (<-rep).json
-			}(i)
-		}
-		wg.Wait()
-
-		mu.Lock()
-		defer mu.Unlock()
-		// Invariant: every dispatched batch is single-language.
-		for _, distinct := range langsPerBatch {
-			Expect(len(distinct)).To(Equal(1), "a batch coalesced more than one language: %v", distinct)
-		}
-		// Liveness: every request got a reply (carry-over never stranded).
-		for i := 0; i < N; i++ {
-			Expect(got[i]).To(Equal(string(rune('a' + i))))
-		}
-	})
-})
--- a/backend/go/parakeet-cpp/goparakeetcpp.go
+++ b/backend/go/parakeet-cpp/goparakeetcpp.go
@@ -1,826 +0,0 @@
-package main
-
-import (
-	"context"
-	"encoding/json"
-	"errors"
-	"fmt"
-	"os"
-	"path/filepath"
-	"strconv"
-	"strings"
-	"sync"
-	"time"
-	"unsafe"
-
-	"github.com/go-audio/wav"
-	"github.com/mudler/LocalAI/pkg/grpc/base"
-	"github.com/mudler/LocalAI/pkg/grpc/grpcerrors"
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	"github.com/mudler/LocalAI/pkg/utils"
-	"github.com/mudler/xlog"
-	"google.golang.org/grpc/codes"
-	"google.golang.org/grpc/status"
-)
-
-// purego-bound entry points from libparakeet.so. Names match
-// parakeet_capi.h exactly so a `nm libparakeet.so | grep parakeet_capi`
-// is enough to spot drift.
-//
-// Functions that return char* are declared as uintptr so we can call
-// parakeet_capi_free_string on the same pointer after copying, the
-// C-API contract is "caller owns and must free the returned buffer".
-var (
-	CppAbiVersion         func() int32
-	CppLoad               func(ggufPath string) uintptr
-	CppFree               func(ctx uintptr)
-	CppTranscribePath     func(ctx uintptr, wavPath string, decoder int32) uintptr
-	CppTranscribePathJSON func(ctx uintptr, wavPath string, decoder int32) uintptr
-	CppFreeString         func(s uintptr)
-	CppLastError          func(ctx uintptr) string
-
-	// Batched JSON transcription: takes a concatenated float buffer of clips
-	// plus their per-clip sample counts (sum(nSamples)==len(samplesConcat))
-	// and returns a malloc'd char* JSON ARRAY of per-clip {"text","words",
-	// "tokens"} objects (uintptr, freed via CppFreeString). purego passes the
-	// Go slices as the base pointer of their backing array (kept alive for the
-	// call), matching the CppStreamFeed pcm []float32 binding pattern; the C
-	// side reads them as const float*/const int*.
-	CppTranscribePcmBatchJSON func(ctx uintptr, samplesConcat []float32, nSamples []int32, nClips int32, sampleRate int32, decoder int32) uintptr
-
-	// CppTranscribePcmBatchJSONLang is the multilingual variant of the batched
-	// JSON entry point: identical, plus a trailing target_lang. "" (the model
-	// default, "auto") is passed for non-prompt models, which ignore it; an
-	// unknown locale on a prompt model returns 0 and sets last_error. Present
-	// only in newer libparakeet.so; nil falls back to CppTranscribePcmBatchJSON.
-	CppTranscribePcmBatchJSONLang func(ctx uintptr, samplesConcat []float32, nSamples []int32, nClips int32, sampleRate int32, decoder int32, targetLang string) uintptr
-
-	// Cache-aware streaming (RNN-T) entry points. stream_begin returns 0 for
-	// non-streaming models. feed/finalize return a malloc'd char* (uintptr,
-	// freed via CppFreeString); feed writes 1 to *eouOut on an <EOU>/<EOB>.
-	CppStreamBegin    func(ctx uintptr) uintptr
-	CppStreamFeed     func(s uintptr, pcm []float32, nSamples int32, eouOut unsafe.Pointer) uintptr
-	CppStreamFinalize func(s uintptr) uintptr
-	CppStreamFree     func(s uintptr)
-
-	// CppStreamBeginLang is the multilingual variant of stream_begin: identical,
-	// plus a trailing target_lang ("" means the model default). Present only in
-	// newer libparakeet.so; nil falls back to CppStreamBegin.
-	CppStreamBeginLang func(ctx uintptr, targetLang string) uintptr
-
-	// Streaming JSON variants (ABI v4): feed/finalize returning a malloc'd char*
-	// JSON document {text,eou,frame_sec,words} (uintptr, freed via CppFreeString)
-	// so streaming segments can carry per-word timestamps. Present only in newer
-	// libparakeet.so; nil falls back to the text-only CppStreamFeed/Finalize path.
-	CppStreamFeedJSON     func(s uintptr, pcm []float32, nSamples int32) uintptr
-	CppStreamFinalizeJSON func(s uintptr) uintptr
-)
-
-// streamChunkSamples is how much 16 kHz mono PCM we hand to stream_feed per
-// call (1 s). The session buffers internally and decodes once a full
-// cache-aware encoder chunk is available, so this only bounds how often we
-// poll for newly-finalized text, not the model's actual chunk size.
-const streamChunkSamples = 16000
-
-// transcriptJSON mirrors the document returned by
-// parakeet_capi_transcribe_path_json (see parakeet_capi.h):
-//
-//	{"text":"...",
-//	 "words":[{"w":"...","start":0.480,"end":0.640,"conf":0.9100}, ...],
-//	 "tokens":[{"id":123,"t":0.480,"conf":0.9100}, ...]}
-//
-// "start"/"end"/"t" are seconds; "conf" is confidence in (0,1].
-type transcriptJSON struct {
-	Text     string            `json:"text"`
-	FrameSec float64           `json:"frame_sec"`
-	Words    []transcriptWord  `json:"words"`
-	Tokens   []transcriptToken `json:"tokens"`
-}
-
-// streamFeedJSON mirrors the document returned by
-// parakeet_capi_stream_feed_json / parakeet_capi_stream_finalize_json (ABI v4):
-//
-//	{"text":"...","eou":0,"frame_sec":0.080000,
-//	 "words":[{"w":"...","start":0.480,"end":0.640,"conf":0.9100}, ...]}
-//
-// "text" is the newly-finalized text since the last call; "eou" is 1 when an
-// <EOU>/<EOB> fired this feed; "words" are the words finalized this call with
-// absolute (stream-relative) start/end seconds.
-type streamFeedJSON struct {
-	Text     string           `json:"text"`
-	Eou      int              `json:"eou"`
-	FrameSec float64          `json:"frame_sec"`
-	Words    []transcriptWord `json:"words"`
-}
-
-type transcriptWord struct {
-	W     string  `json:"w"`
-	Start float64 `json:"start"`
-	End   float64 `json:"end"`
-	Conf  float64 `json:"conf"`
-}
-
-type transcriptToken struct {
-	ID   int32   `json:"id"`
-	T    float64 `json:"t"`
-	Conf float64 `json:"conf"`
-}
-
-// ParakeetCpp owns a single loaded parakeet_ctx. The C engine is a
-// thread-unsafe singleton (mirrors whisper.cpp / vibevoice.cpp). Rather than
-// serialize every call through base.SingleThread, we route unary
-// transcription through an in-process batcher (its sole dispatcher goroutine
-// is the only caller of the engine on that path) and guard the shared engine
-// with engineMu so a streaming session and a batched-unary dispatch never
-// touch it concurrently.
-type ParakeetCpp struct {
-	base.Base
-	ctxPtr   uintptr
-	engineMu sync.Mutex // sole guard of the one C engine (dispatcher + streaming)
-	bat      *batcher
-	batStop  chan struct{}
-	// segmentGapFrames is NeMo's segment_gap_threshold in ENCODER FRAMES (model
-	// YAML option, default 0=off). When >0 it adds NeMo's silence-gap split on
-	// top of the punctuation split; converted to seconds via the JSON frame_sec.
-	segmentGapFrames int
-}
-
-// Load is the LocalAI gRPC entry point for LoadModel: it calls
-// parakeet_capi_load with the GGUF path and stashes the resulting
-// opaque context pointer for AudioTranscription.
-func (p *ParakeetCpp) Load(opts *pb.ModelOptions) error {
-	if opts.ModelFile == "" {
-		return errors.New("parakeet-cpp: ModelFile is required")
-	}
-
-	ctx := CppLoad(opts.ModelFile)
-	if ctx == 0 {
-		// No ctx to ask for last_error (the C-API's last-error buffer
-		// lives on the ctx that was never returned). Surface the path
-		// so the operator at least knows which load failed.
-		return fmt.Errorf("parakeet-cpp: parakeet_capi_load failed for %q", opts.ModelFile)
-	}
-	p.ctxPtr = ctx
-
-	// Dynamic batching knobs (model YAML options:, key:value form). Batching is
-	// OFF by default (batch_max_size:1): each request runs on its own. On GPU,
-	// raising batch_max_size coalesces concurrent requests into one batched
-	// engine call and improves throughput under load; leave it at 1 on CPU and
-	// for low-concurrency setups, where batching only adds latency.
-	maxSize := optInt(opts, "batch_max_size", 1)
-	maxWaitMs := optInt(opts, "batch_max_wait_ms", 15)
-	if maxWaitMs < 0 {
-		maxWaitMs = 0
-	}
-
-	// NeMo's segment_gap_threshold (encoder frames, default 0=off). Off by
-	// default matches NeMo's default (punctuation-only segments); when set it
-	// additionally splits segments on inter-word silence (see transcriptResultFromDoc).
-	p.segmentGapFrames = optInt(opts, "segment_gap_threshold", 0)
-	if CppTranscribePcmBatchJSON != nil {
-		p.batStop = make(chan struct{})
-		p.bat = newBatcher(maxSize, time.Duration(maxWaitMs)*time.Millisecond, p.runBatch)
-		go p.bat.run(p.batStop) // dispatcher runs until Free closes batStop
-		if maxSize > 1 {
-			xlog.Info("parakeet-cpp: dynamic batching enabled",
-				"batch_max_size", maxSize, "batch_max_wait_ms", maxWaitMs)
-		} else {
-			xlog.Info("parakeet-cpp: dynamic batching off (batch_max_size=1); " +
-				"set batch_max_size>1 to coalesce concurrent requests on GPU")
-		}
-	} else {
-		xlog.Info("parakeet-cpp: batched C-API not present in libparakeet.so; " +
-			"batching disabled, using per-request transcription")
-	}
-	return nil
-}
-
-// optInt reads an integer model option (key:value form) from ModelOptions,
-// returning def when absent or unparseable. The options array carries the
-// model YAML's options: entries (see core/config; siblings such as
-// acestep-cpp parse the same key:value form via strings.Cut on ":").
-func optInt(opts *pb.ModelOptions, key string, def int) int {
-	for _, o := range opts.GetOptions() {
-		k, v, ok := strings.Cut(o, ":")
-		if ok && strings.TrimSpace(k) == key {
-			if n, err := strconv.Atoi(strings.TrimSpace(v)); err == nil {
-				return n
-			}
-		}
-	}
-	return def
-}
-
-// runBatch is the dispatcher's batch handler and the ONLY caller of the C
-// engine on the unary path. It concatenates the batch PCM, calls the batched
-// JSON C-API under engineMu, splits the JSON array, and replies to each request.
-func (p *ParakeetCpp) runBatch(reqs []*batchRequest) {
-	// Observability: the actual coalesced batch size per engine call. Debug-level
-	// so it stays silent in normal operation but lets operators confirm/tune batching.
-	xlog.Debug("parakeet-cpp: dispatching batch", "size", len(reqs))
-	nSamples := make([]int32, len(reqs))
-	total := 0
-	for i, r := range reqs {
-		nSamples[i] = int32(len(r.pcm))
-		total += len(r.pcm)
-	}
-	concat := make([]float32, 0, total)
-	for _, r := range reqs {
-		concat = append(concat, r.pcm...)
-	}
-	var dec int32
-	if len(reqs) > 0 {
-		dec = reqs[0].decoder
-	}
-	// All requests in a batch share one language (the batcher coalesces only
-	// same-language requests), so any element's language describes the batch.
-	lang := ""
-	if len(reqs) > 0 {
-		lang = reqs[0].language
-	}
-	p.engineMu.Lock()
-	var cstr uintptr
-	if CppTranscribePcmBatchJSONLang != nil {
-		cstr = CppTranscribePcmBatchJSONLang(p.ctxPtr, concat, nSamples, int32(len(reqs)), 16000, dec, lang)
-	} else {
-		cstr = CppTranscribePcmBatchJSON(p.ctxPtr, concat, nSamples, int32(len(reqs)), 16000, dec)
-	}
-	p.engineMu.Unlock()
-	if cstr == 0 {
-		err := fmt.Errorf("parakeet-cpp: batch transcribe failed: %s", CppLastError(p.ctxPtr))
-		for _, r := range reqs {
-			r.reply <- batchReply{err: err}
-		}
-		return
-	}
-	raw := goStringFromCPtr(cstr)
-	CppFreeString(cstr)
-	var docs []json.RawMessage
-	if err := json.Unmarshal([]byte(raw), &docs); err != nil || len(docs) != len(reqs) {
-		e := fmt.Errorf("parakeet-cpp: batch json: got %d results for %d reqs (%v)", len(docs), len(reqs), err)
-		for _, r := range reqs {
-			r.reply <- batchReply{err: e}
-		}
-		return
-	}
-	for i, r := range reqs {
-		r.reply <- batchReply{json: string(docs[i])}
-	}
-}
-
-// AudioTranscription decodes the wav at opts.Dst to 16 kHz mono PCM and
-// submits it to the in-process batcher, which coalesces concurrent requests
-// into a single batched engine call (parakeet_capi_transcribe_pcm_batch_json)
-// with the default decoder (decoder=0, which selects the right head per
-// architecture: transducer for tdt/rnnt/hybrid, CTC for ctc) and shapes the
-// per-word timestamps into a LocalAI TranscriptResult.
-//
-// Parakeet emits word- and token-level timestamps but no native segment
-// boundaries, so we synthesise a single whole-clip segment spanning the first
-// word start to the last word end. Word-level timings are attached only when
-// the caller opts in via timestamp_granularities=["word"] (matching the
-// OpenAI API, whose default is segment-level); token ids always populate
-// Segment.Tokens.
-//
-// translate/diarize/prompt/temperature/threads are not applicable to parakeet
-// and are ignored; language is honored on the batched + streaming paths (see
-// opts.GetLanguage() below); streaming is handled by AudioTranscriptionStream
-// (L2).
-func (p *ParakeetCpp) AudioTranscription(ctx context.Context, opts *pb.TranscriptRequest) (pb.TranscriptResult, error) {
-	if p.ctxPtr == 0 {
-		return pb.TranscriptResult{}, grpcerrors.ModelNotLoaded("parakeet-cpp")
-	}
-	if opts.Dst == "" {
-		return pb.TranscriptResult{}, errors.New("parakeet-cpp: TranscriptRequest.dst (audio path) is required")
-	}
-
-	// Fallback when the batched C-API is unavailable: transcribe from a file
-	// path (original behavior, no batching). The C library's audio loader only
-	// understands 16 kHz mono WAV/PCM, so convert the input first - otherwise
-	// any non-WAV upload (MP3, etc.) fails with "failed to load audio". This
-	// mirrors what every other audio backend (whisper, crispasr) does via
-	// utils.AudioToWav before handing the file to the engine.
-	if p.bat == nil {
-		converted, cleanup, err := convertToWavMono16k(opts.Dst)
-		if err != nil {
-			return pb.TranscriptResult{}, err
-		}
-		defer cleanup()
-		cstr := CppTranscribePathJSON(p.ctxPtr, converted, 0)
-		if cstr == 0 {
-			return pb.TranscriptResult{}, fmt.Errorf("parakeet-cpp: transcribe_path_json failed: %s", CppLastError(p.ctxPtr))
-		}
-		raw := goStringFromCPtr(cstr)
-		CppFreeString(cstr)
-		var doc transcriptJSON
-		if err := json.Unmarshal([]byte(raw), &doc); err != nil {
-			return pb.TranscriptResult{}, fmt.Errorf("parakeet-cpp: decode transcript json: %w", err)
-		}
-		return transcriptResultFromDoc(doc, opts, p.segmentGapFrames), nil
-	}
-
-	// Batched path: decode to PCM, submit to the batcher, wait for this request's
-	// JSON element. The dispatcher is the sole engine caller on this path; both
-	// sends honour ctx cancellation.
-	pcm, _, err := decodeWavMono16k(opts.Dst)
-	if err != nil {
-		return pb.TranscriptResult{}, err
-	}
-	rep := make(chan batchReply, 1)
-	select {
-	case p.bat.submit <- &batchRequest{pcm: pcm, decoder: 0, language: opts.GetLanguage(), reply: rep}:
-	case <-ctx.Done():
-		return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
-	}
-	var res batchReply
-	select {
-	case res = <-rep:
-	case <-ctx.Done():
-		return pb.TranscriptResult{}, status.Error(codes.Canceled, "transcription cancelled")
-	}
-	if res.err != nil {
-		return pb.TranscriptResult{}, res.err
-	}
-	var doc transcriptJSON
-	if err := json.Unmarshal([]byte(res.json), &doc); err != nil {
-		return pb.TranscriptResult{}, fmt.Errorf("parakeet-cpp: decode transcript json: %w", err)
-	}
-	return transcriptResultFromDoc(doc, opts, p.segmentGapFrames), nil
-}
-
-// segmentSeparators is NeMo's default segment_seperators (sentence-ending
-// punctuation). Splitting on these matches NeMo's default segment timestamps.
-var segmentSeparators = []rune{'.', '?', '!'}
-
-// transcriptResultFromDoc maps a decoded transcriptJSON to a TranscriptResult,
-// grouping words into NeMo-faithful segments (see splitWordsIntoSegments). The
-// optional gapFrames (NeMo's segment_gap_threshold, in encoder FRAMES; 0=off)
-// additionally splits on inter-word silence; it is converted to a seconds gap
-// with the document's frame_sec. Per-segment word timings are attached only when
-// the caller requested word granularity; token ids populate each segment's
-// Tokens by time-window membership. Shared by the batched and direct paths.
-func transcriptResultFromDoc(doc transcriptJSON, opts *pb.TranscriptRequest, gapFrames int) pb.TranscriptResult {
-	text := strings.TrimSpace(doc.Text)
-
-	// Frame-unit gap threshold -> seconds (NeMo segment_gap_threshold). 0 = off.
-	gapSeconds := 0.0
-	if gapFrames > 0 {
-		if doc.FrameSec > 0 {
-			gapSeconds = float64(gapFrames) * doc.FrameSec
-		} else {
-			xlog.Warn("parakeet-cpp: segment_gap_threshold set but libparakeet.so " +
-				"did not report frame_sec; falling back to punctuation-only segments")
-		}
-	}
-
-	groups := splitWordsIntoSegments(doc.Words, segmentSeparators, gapSeconds)
-	if len(groups) == 0 {
-		// No words (edge case): single whole-clip text segment.
-		return pb.TranscriptResult{
-			Text:     text,
-			Segments: []*pb.TranscriptSegment{{Id: 0, Text: text}},
-		}
-	}
-
-	wantWords := wordsRequested(opts.TimestampGranularities)
-	segments := make([]*pb.TranscriptSegment, 0, len(groups))
-	for id, group := range groups {
-		parts := make([]string, len(group))
-		for i, gw := range group {
-			parts[i] = gw.W
-		}
-		seg := &pb.TranscriptSegment{
-			Id:     int32(id),
-			Start:  secondsToNanos(group[0].Start),
-			End:    secondsToNanos(group[len(group)-1].End),
-			Text:   strings.TrimSpace(strings.Join(parts, " ")),
-			Tokens: tokensInWindow(doc.Tokens, group[0].Start, group[len(group)-1].End),
-		}
-		if wantWords {
-			ws := make([]*pb.TranscriptWord, len(group))
-			for i, gw := range group {
-				ws[i] = &pb.TranscriptWord{Start: secondsToNanos(gw.Start), End: secondsToNanos(gw.End), Text: gw.W}
-			}
-			seg.Words = ws
-		}
-		segments = append(segments, seg)
-	}
-	return pb.TranscriptResult{Text: text, Segments: segments}
-}
-
-// splitWordsIntoSegments groups words into segments exactly as NeMo's
-// get_segment_offsets does (nemo/collections/asr/parts/utils/timestamp_utils.py).
-// Walking the words, it closes a segment when (1) the gap rule is enabled
-// (gapSeconds > 0) and the segment already has words and the gap from the
-// previous word's end to this word's start is >= gapSeconds - the current word
-// then STARTS a new segment - or, checked only when the gap rule did not apply
-// (NeMo's elif), (2) the word ends with (or is) a separator, which closes the
-// segment INCLUDING that word. Trailing words flush into a final segment.
-// gapSeconds <= 0 disables the gap rule, matching NeMo's default
-// segment_gap_threshold=None (punctuation-only segments).
-func splitWordsIntoSegments(words []transcriptWord, separators []rune, gapSeconds float64) [][]transcriptWord {
-	var segments [][]transcriptWord
-	var cur []transcriptWord
-	for i, word := range words {
-		gapActive := gapSeconds > 0 && len(cur) > 0
-		if gapActive && (word.Start-words[i-1].End) >= gapSeconds {
-			segments = append(segments, cur)
-			cur = []transcriptWord{word}
-			continue
-		}
-		if !gapActive && endsWithSeparator(word.W, separators) {
-			cur = append(cur, word)
-			segments = append(segments, cur)
-			cur = nil
-			continue
-		}
-		cur = append(cur, word)
-	}
-	if len(cur) > 0 {
-		segments = append(segments, cur)
-	}
-	return segments
-}
-
-// endsWithSeparator reports whether w's last rune is in separators (matching
-// NeMo's `word[-1] in delims or word in delims`).
-func endsWithSeparator(w string, separators []rune) bool {
-	r := []rune(strings.TrimSpace(w))
-	if len(r) == 0 {
-		return false
-	}
-	last := r[len(r)-1]
-	for _, s := range separators {
-		if last == s {
-			return true
-		}
-	}
-	return false
-}
-
-// tokensInWindow returns the ids of tokens whose timestamp t falls in
-// [start, end] (inclusive), assigning each token to the segment that spans its
-// time. The last segment's end is the last word end, so the final token is
-// included.
-func tokensInWindow(tokens []transcriptToken, start, end float64) []int32 {
-	var ids []int32
-	for _, t := range tokens {
-		if t.T >= start && t.T <= end {
-			ids = append(ids, t.ID)
-		}
-	}
-	return ids
-}
-
-// streamSegmenter accumulates streaming words into per-utterance segments. EOU
-// is the model's own utterance boundary; each closed segment takes its start/end
-// from its first/last accumulated word.
-type streamSegmenter struct {
-	segs   []*pb.TranscriptSegment
-	cur    []transcriptWord
-	nextID int32
-}
-
-func (s *streamSegmenter) add(doc streamFeedJSON) {
-	s.cur = append(s.cur, doc.Words...)
-	if doc.Eou != 0 {
-		s.flush()
-	}
-}
-
-func (s *streamSegmenter) flush() {
-	if len(s.cur) == 0 {
-		return
-	}
-	parts := make([]string, len(s.cur))
-	for i, w := range s.cur {
-		parts[i] = w.W
-	}
-	s.segs = append(s.segs, &pb.TranscriptSegment{
-		Id:    s.nextID,
-		Start: secondsToNanos(s.cur[0].Start),
-		End:   secondsToNanos(s.cur[len(s.cur)-1].End),
-		Text:  strings.TrimSpace(strings.Join(parts, " ")),
-	})
-	s.nextID++
-	s.cur = nil
-}
-
-func (s *streamSegmenter) segments() []*pb.TranscriptSegment { return s.segs }
-
-// wordsRequested reports whether the caller asked for word-level timestamps.
-// The OpenAI transcription API gates word timings behind
-// timestamp_granularities[] containing "word" and defaults to segment-level
-// otherwise; we follow that contract.
-func wordsRequested(granularities []string) bool {
-	for _, g := range granularities {
-		if strings.EqualFold(strings.TrimSpace(g), "word") {
-			return true
-		}
-	}
-	return false
-}
-
-// secondsToNanos converts the C-API's fractional-second timestamps into the
-// int64 nanoseconds LocalAI carries on TranscriptSegment/TranscriptWord, the
-// same nanosecond convention the whisper backend uses.
-func secondsToNanos(sec float64) int64 {
-	return int64(sec * 1e9)
-}
-
-// AudioTranscriptionStream drives the cache-aware streaming RNN-T over the
-// audio at opts.Dst: it decodes the file to 16 kHz mono PCM, feeds it in
-// chunks to parakeet_capi_stream_feed, and emits each newly-finalized text
-// run as a TranscriptStreamResponse delta. <EOU>/<EOB> events close the
-// current segment; a closing FinalResult carries the full transcript and the
-// per-utterance segments.
-//
-// stream_begin returns 0 for models that are not cache-aware streaming models
-// (for example, nvidia/parakeet_realtime_eou_120m-v1 qualifies). For those we
-// fall back to a single offline transcription emitted as one delta plus a
-// closing FinalResult, matching LocalAI's streaming contract for non-streaming
-// models (and the whisper backend), so the streaming endpoint works for every model.
-func (p *ParakeetCpp) AudioTranscriptionStream(ctx context.Context, opts *pb.TranscriptRequest, results chan *pb.TranscriptStreamResponse) error {
-	defer close(results)
-
-	if p.ctxPtr == 0 {
-		return grpcerrors.ModelNotLoaded("parakeet-cpp")
-	}
-	if opts.Dst == "" {
-		return errors.New("parakeet-cpp: TranscriptRequest.dst (audio path) is required")
-	}
-	if err := ctx.Err(); err != nil {
-		return status.Error(codes.Canceled, "transcription cancelled")
-	}
-
-	var stream uintptr
-	if CppStreamBeginLang != nil {
-		stream = CppStreamBeginLang(p.ctxPtr, opts.GetLanguage())
-	} else {
-		stream = CppStreamBegin(p.ctxPtr)
-	}
-	if stream == 0 {
-		// Not a cache-aware streaming model: run a normal offline
-		// transcription and emit it as one delta + a closing final result.
-		res, err := p.AudioTranscription(ctx, opts)
-		if err != nil {
-			return err
-		}
-		if t := strings.TrimSpace(res.Text); t != "" {
-			results <- &pb.TranscriptStreamResponse{Delta: t}
-		}
-		results <- &pb.TranscriptStreamResponse{FinalResult: &res}
-		return nil
-	}
-	defer CppStreamFree(stream)
-	// The C engine is a single shared context: a streaming session and a batched
-	// unary dispatch must never touch it at once, so hold engineMu for the whole
-	// stream. This lock is intentionally taken AFTER the non-streaming fallback
-	// above returns: that fallback goes through AudioTranscription -> the batcher
-	// -> runBatch, which itself acquires engineMu, so locking here first would
-	// deadlock. Do not hoist this lock above the fallback.
-	p.engineMu.Lock()
-	defer p.engineMu.Unlock()
-
-	data, duration, err := decodeWavMono16k(opts.Dst)
-	if err != nil {
-		return err
-	}
-
-	// ABI v4: when the streaming JSON entry points are present, drive them so the
-	// per-utterance segments carry per-word start/end timestamps. Falls through to
-	// the text-only loop below against an older libparakeet.so. Runs under the
-	// engineMu already held above.
-	if CppStreamFeedJSON != nil {
-		return p.streamJSON(ctx, stream, data, duration, results)
-	}
-
-	var (
-		full     strings.Builder
-		segText  strings.Builder
-		segments []*pb.TranscriptSegment
-		segID    int32
-	)
-
-	flushSegment := func() {
-		t := strings.TrimSpace(segText.String())
-		segText.Reset()
-		if t == "" {
-			return
-		}
-		segments = append(segments, &pb.TranscriptSegment{Id: segID, Text: t})
-		segID++
-	}
-
-	// emitDelta consumes the malloc'd char* returned by feed/finalize: frees
-	// it, accumulates the text, and sends a delta when non-empty. A 0 return
-	// is an error (vs the "" empty-but-non-NULL no-new-text case).
-	emitDelta := func(ret uintptr) error {
-		if ret == 0 {
-			msg := CppLastError(p.ctxPtr)
-			if msg == "" {
-				msg = "unknown error"
-			}
-			return fmt.Errorf("parakeet-cpp: stream feed/finalize failed: %s", msg)
-		}
-		delta := goStringFromCPtr(ret)
-		CppFreeString(ret)
-		if delta == "" {
-			return nil
-		}
-		full.WriteString(delta)
-		segText.WriteString(delta)
-		results <- &pb.TranscriptStreamResponse{Delta: delta}
-		return nil
-	}
-
-	for off := 0; off < len(data); off += streamChunkSamples {
-		if err := ctx.Err(); err != nil {
-			return status.Error(codes.Canceled, "transcription cancelled")
-		}
-		end := min(off+streamChunkSamples, len(data))
-		chunk := data[off:end]
-
-		var eou int32
-		ret := CppStreamFeed(stream, chunk, int32(len(chunk)), unsafe.Pointer(&eou))
-		if err := emitDelta(ret); err != nil {
-			return err
-		}
-		if eou != 0 {
-			flushSegment()
-		}
-	}
-
-	// Flush the streaming tail (final encoder chunk).
-	if err := emitDelta(CppStreamFinalize(stream)); err != nil {
-		return err
-	}
-	flushSegment()
-
-	text := strings.TrimSpace(full.String())
-	if len(segments) == 0 && text != "" {
-		segments = append(segments, &pb.TranscriptSegment{Id: 0, Text: text})
-	}
-	results <- &pb.TranscriptStreamResponse{
-		FinalResult: &pb.TranscriptResult{
-			Text:     text,
-			Segments: segments,
-			Duration: duration,
-		},
-	}
-	return nil
-}
-
-// streamJSON drives the ABI v4 streaming JSON entry points: each feed/finalize
-// returns a {text,eou,frame_sec,words} document. The newly-finalized text is
-// emitted as a delta (unchanged streaming contract) while words are accumulated
-// into per-utterance segments (closed on EOU) so the closing FinalResult carries
-// timestamped segments. Runs under engineMu (already held by the caller).
-func (p *ParakeetCpp) streamJSON(ctx context.Context, stream uintptr, data []float32,
-	duration float32, results chan *pb.TranscriptStreamResponse) error {
-	var (
-		full strings.Builder
-		seg  streamSegmenter
-	)
-	// consume frees the malloc'd char* (a 0 return is an error), parses the JSON,
-	// emits the delta, and routes words through the segmenter.
-	consume := func(ret uintptr) error {
-		if ret == 0 {
-			msg := CppLastError(p.ctxPtr)
-			if msg == "" {
-				msg = "unknown error"
-			}
-			return fmt.Errorf("parakeet-cpp: stream feed/finalize failed: %s", msg)
-		}
-		raw := goStringFromCPtr(ret)
-		CppFreeString(ret)
-		var doc streamFeedJSON
-		if err := json.Unmarshal([]byte(raw), &doc); err != nil {
-			return fmt.Errorf("parakeet-cpp: decode stream json: %w", err)
-		}
-		if doc.Text != "" {
-			full.WriteString(doc.Text)
-			results <- &pb.TranscriptStreamResponse{Delta: doc.Text}
-		}
-		seg.add(doc)
-		return nil
-	}
-
-	for off := 0; off < len(data); off += streamChunkSamples {
-		if err := ctx.Err(); err != nil {
-			return status.Error(codes.Canceled, "transcription cancelled")
-		}
-		end := min(off+streamChunkSamples, len(data))
-		chunk := data[off:end]
-		if err := consume(CppStreamFeedJSON(stream, chunk, int32(len(chunk)))); err != nil {
-			return err
-		}
-	}
-	if err := consume(CppStreamFinalizeJSON(stream)); err != nil {
-		return err
-	}
-	seg.flush() // close any trailing utterance that never saw an EOU
-
-	text := strings.TrimSpace(full.String())
-	segments := seg.segments()
-	if len(segments) == 0 && text != "" {
-		segments = append(segments, &pb.TranscriptSegment{Id: 0, Text: text})
-	}
-	results <- &pb.TranscriptStreamResponse{
-		FinalResult: &pb.TranscriptResult{
-			Text:     text,
-			Segments: segments,
-			Duration: duration,
-		},
-	}
-	return nil
-}
-
-// convertToWavMono16k converts an arbitrary audio file to a 16 kHz mono WAV in
-// a fresh temp dir and returns the path together with a cleanup func the caller
-// must defer. WAV inputs already at 16 kHz/mono/16-bit are passed through by
-// utils.AudioToWav (hardlink/copy); everything else is transcoded via ffmpeg.
-// Used by the direct (non-batched) transcription path, which hands a file path
-// to the C library's WAV-only audio loader.
-func convertToWavMono16k(path string) (string, func(), error) {
-	dir, err := os.MkdirTemp("", "parakeet")
-	if err != nil {
-		return "", func() {}, err
-	}
-	cleanup := func() { _ = os.RemoveAll(dir) }
-
-	converted := filepath.Join(dir, "converted.wav")
-	if err := utils.AudioToWav(path, converted); err != nil {
-		cleanup()
-		return "", func() {}, err
-	}
-	return converted, cleanup, nil
-}
-
-// decodeWavMono16k converts any input audio to 16 kHz mono PCM and returns the
-// float samples plus the clip duration in seconds. Mirrors the whisper
-// backend: utils.AudioToWav (ffmpeg) normalises rate/channels, go-audio
-// decodes the PCM.
-func decodeWavMono16k(path string) ([]float32, float32, error) {
-	converted, cleanup, err := convertToWavMono16k(path)
-	if err != nil {
-		return nil, 0, err
-	}
-	defer cleanup()
-
-	fh, err := os.Open(converted)
-	if err != nil {
-		return nil, 0, err
-	}
-	defer func() { _ = fh.Close() }()
-
-	buf, err := wav.NewDecoder(fh).FullPCMBuffer()
-	if err != nil {
-		return nil, 0, err
-	}
-	data := buf.AsFloat32Buffer().Data
-	var duration float32
-	if buf.Format != nil && buf.Format.SampleRate > 0 {
-		duration = float32(len(data)) / float32(buf.Format.SampleRate)
-	}
-	return data, duration, nil
-}
-
-// Free releases the underlying parakeet_ctx. Called by LocalAI when the
-// model is unloaded.
-func (p *ParakeetCpp) Free() error {
-	// Stop the dispatcher before releasing the engine so no in-flight runBatch
-	// can touch a freed ctx (close leak / use-after-free on reload).
-	if p.batStop != nil {
-		close(p.batStop)
-		p.batStop = nil
-	}
-	if p.ctxPtr != 0 {
-		CppFree(p.ctxPtr)
-		p.ctxPtr = 0
-	}
-	return nil
-}
-
-// goStringFromCPtr copies a NUL-terminated C string into Go memory.
-// cptr is the raw pointer returned by purego from the C-API (a malloc'd
-// buffer the caller owns); callers must free it via CppFreeString after
-// the copy lands.
-//
-// The uintptr->unsafe.Pointer conversion below trips go vet's unsafeptr
-// check, which can't distinguish a C-owned heap pointer from Go-managed
-// memory. It is safe here: the pointer addresses a malloc'd C buffer the
-// Go GC neither tracks nor moves, and we dereference it immediately to
-// copy the bytes out, the same pattern (and the same tolerated warning)
-// as the whisper backend's unsafe.Slice over segsPtr.
-func goStringFromCPtr(cptr uintptr) string {
-	if cptr == 0 {
-		return ""
-	}
-	p := unsafe.Pointer(cptr) //nolint:govet // C-owned malloc'd buffer, not Go-GC memory (see doc above)
-	n := 0
-	for *(*byte)(unsafe.Add(p, n)) != 0 {
-		n++
-	}
-	return string(unsafe.Slice((*byte)(p), n))
-}
--- a/backend/go/parakeet-cpp/goparakeetcpp_test.go
+++ b/backend/go/parakeet-cpp/goparakeetcpp_test.go
@@ -1,247 +0,0 @@
-package main
-
-import (
-	"context"
-	"os"
-	"path/filepath"
-	"strings"
-	"sync"
-	"testing"
-
-	"github.com/ebitengine/purego"
-	"github.com/go-audio/audio"
-	"github.com/go-audio/wav"
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-func TestParakeetCpp(t *testing.T) {
-	RegisterFailHandler(Fail)
-	RunSpecs(t, "parakeet-cpp Backend Suite")
-}
-
-var (
-	libLoadOnce sync.Once
-	libLoadErr  error
-)
-
-// ensureLibLoaded mirrors main.go's bootstrap so a Go test can drive
-// the C-API bridge without spinning up the gRPC server. Skips the
-// current spec when the library ($PARAKEET_LIBRARY, default libparakeet.so)
-// isn't loadable via $LD_LIBRARY_PATH or a symlink in ./.
-func ensureLibLoaded() {
-	libLoadOnce.Do(func() {
-		libName := os.Getenv("PARAKEET_LIBRARY")
-		if libName == "" {
-			libName = "libparakeet.so"
-		}
-		lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
-		if err != nil {
-			libLoadErr = err
-			return
-		}
-		purego.RegisterLibFunc(&CppAbiVersion, lib, "parakeet_capi_abi_version")
-		purego.RegisterLibFunc(&CppLoad, lib, "parakeet_capi_load")
-		purego.RegisterLibFunc(&CppFree, lib, "parakeet_capi_free")
-		purego.RegisterLibFunc(&CppTranscribePath, lib, "parakeet_capi_transcribe_path")
-		purego.RegisterLibFunc(&CppTranscribePathJSON, lib, "parakeet_capi_transcribe_path_json")
-		if sym, err := purego.Dlsym(lib, "parakeet_capi_transcribe_pcm_batch_json"); err == nil && sym != 0 {
-			purego.RegisterLibFunc(&CppTranscribePcmBatchJSON, lib, "parakeet_capi_transcribe_pcm_batch_json")
-		}
-		purego.RegisterLibFunc(&CppStreamBegin, lib, "parakeet_capi_stream_begin")
-		purego.RegisterLibFunc(&CppStreamFeed, lib, "parakeet_capi_stream_feed")
-		purego.RegisterLibFunc(&CppStreamFinalize, lib, "parakeet_capi_stream_finalize")
-		purego.RegisterLibFunc(&CppStreamFree, lib, "parakeet_capi_stream_free")
-		if sym, err := purego.Dlsym(lib, "parakeet_capi_stream_feed_json"); err == nil && sym != 0 {
-			purego.RegisterLibFunc(&CppStreamFeedJSON, lib, "parakeet_capi_stream_feed_json")
-			purego.RegisterLibFunc(&CppStreamFinalizeJSON, lib, "parakeet_capi_stream_finalize_json")
-		}
-		purego.RegisterLibFunc(&CppFreeString, lib, "parakeet_capi_free_string")
-		purego.RegisterLibFunc(&CppLastError, lib, "parakeet_capi_last_error")
-	})
-	if libLoadErr != nil {
-		Skip("libparakeet.so not loadable: " + libLoadErr.Error())
-	}
-}
-
-// fixturesOrSkip returns the model + audio paths or skips the spec if
-// either env var is unset. The smoke test never runs in default CI; it
-// needs a real parakeet GGUF and a 16 kHz mono WAV on disk.
-func fixturesOrSkip() (string, string) {
-	modelPath := os.Getenv("PARAKEET_BACKEND_TEST_MODEL")
-	audioPath := os.Getenv("PARAKEET_BACKEND_TEST_WAV")
-	if modelPath == "" || audioPath == "" {
-		Skip("set PARAKEET_BACKEND_TEST_MODEL and PARAKEET_BACKEND_TEST_WAV to run this spec")
-	}
-	return modelPath, audioPath
-}
-
-// writeMono16kWav writes `samples` frames of 16 kHz mono 16-bit silence to
-// path. The result is already in AudioToWav's target format, so the conversion
-// helper copies it through without invoking ffmpeg.
-func writeMono16kWav(path string, samples int) {
-	GinkgoHelper()
-	f, err := os.Create(path)
-	Expect(err).ToNot(HaveOccurred())
-	enc := wav.NewEncoder(f, 16000, 16, 1, 1)
-	buf := &audio.IntBuffer{
-		Format:         &audio.Format{NumChannels: 1, SampleRate: 16000},
-		SourceBitDepth: 16,
-		Data:           make([]int, samples),
-	}
-	Expect(enc.Write(buf)).To(Succeed())
-	Expect(enc.Close()).To(Succeed())
-	Expect(f.Close()).To(Succeed())
-}
-
-var _ = Describe("ParakeetCpp", func() {
-	Context("AudioTranscription", func() {
-		It("transcribes a WAV via the parakeet C-API", func() {
-			modelPath, audioPath := fixturesOrSkip()
-			ensureLibLoaded()
-
-			p := &ParakeetCpp{}
-			Expect(p.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
-			defer func() { _ = p.Free() }()
-
-			res, err := p.AudioTranscription(context.Background(), &pb.TranscriptRequest{
-				Dst: audioPath,
-			})
-			Expect(err).ToNot(HaveOccurred())
-			Expect(strings.TrimSpace(res.Text)).ToNot(BeEmpty(),
-				"expected non-empty transcript for %s", audioPath)
-			// NeMo-faithful segmentation: one or more punctuation-delimited
-			// segments, each with text and a monotonically-advancing time span.
-			Expect(res.Segments).ToNot(BeEmpty(), "expected at least one segment")
-			var prevEnd int64
-			for i, seg := range res.Segments {
-				Expect(strings.TrimSpace(seg.Text)).ToNot(BeEmpty(),
-					"segment %d must have text", i)
-				Expect(seg.End).To(BeNumerically(">=", seg.Start),
-					"segment %d end must not precede its start", i)
-				Expect(seg.Start).To(BeNumerically(">=", prevEnd),
-					"segments must be in time order")
-				prevEnd = seg.End
-				// Default (no granularities) is segment-level: no per-word timings.
-				Expect(seg.Words).To(BeEmpty(),
-					"word timings are opt-in via timestamp_granularities")
-			}
-		})
-
-		It("emits word-level timestamps when granularity=word", func() {
-			modelPath, audioPath := fixturesOrSkip()
-			ensureLibLoaded()
-
-			p := &ParakeetCpp{}
-			Expect(p.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
-			defer func() { _ = p.Free() }()
-
-			res, err := p.AudioTranscription(context.Background(), &pb.TranscriptRequest{
-				Dst:                    audioPath,
-				TimestampGranularities: []string{"word"},
-			})
-			Expect(err).ToNot(HaveOccurred())
-			Expect(res.Segments).ToNot(BeEmpty())
-			// With word granularity every segment carries its own words, and each
-			// segment's span tracks its first/last word; word starts advance
-			// monotonically across the whole transcript.
-			totalWords := 0
-			var prevStart int64 = -1
-			for i, seg := range res.Segments {
-				Expect(seg.Words).ToNot(BeEmpty(),
-					"segment %d must carry per-word timestamps with granularity=word", i)
-				Expect(seg.Start).To(Equal(seg.Words[0].Start),
-					"segment %d start tracks its first word", i)
-				Expect(seg.End).To(Equal(seg.Words[len(seg.Words)-1].End),
-					"segment %d end tracks its last word", i)
-				for _, w := range seg.Words {
-					Expect(w.End).To(BeNumerically(">=", w.Start))
-					Expect(w.Start).To(BeNumerically(">=", prevStart))
-					prevStart = w.Start
-					totalWords++
-				}
-			}
-			Expect(totalWords).To(BeNumerically(">", 0))
-			Expect(res.Segments[0].Words[0].Start).To(BeNumerically(">=", int64(0)))
-		})
-	})
-
-	Context("convertToWavMono16k", func() {
-		// The non-batched transcription path hands a file path to the C
-		// library's WAV-only audio loader, so it must convert first.
-		// utils.AudioToWav passes an already-16kHz/mono/16-bit WAV through
-		// without ffmpeg, which lets us exercise the helper (and the
-		// regression: the direct path used to skip conversion entirely)
-		// without a model, the C library, or ffmpeg.
-		It("returns a decodable 16kHz mono WAV copy and cleans it up", func() {
-			dir := GinkgoT().TempDir()
-			src := filepath.Join(dir, "input.wav")
-			writeMono16kWav(src, 16000) // 1s of silence at 16 kHz
-
-			converted, cleanup, err := convertToWavMono16k(src)
-			Expect(err).ToNot(HaveOccurred())
-
-			// It must produce a fresh temp file, not return the original path.
-			Expect(converted).ToNot(Equal(src))
-			Expect(converted).To(BeAnExistingFile())
-
-			pcm, _, err := decodeWavMono16k(converted)
-			Expect(err).ToNot(HaveOccurred())
-			Expect(pcm).To(HaveLen(16000), "round-trips the sample count")
-
-			cleanup()
-			Expect(converted).ToNot(BeAnExistingFile(), "cleanup removes the temp dir")
-		})
-
-		It("errors on a non-existent input rather than passing the path through", func() {
-			_, _, err := convertToWavMono16k(filepath.Join(GinkgoT().TempDir(), "missing.mp3"))
-			Expect(err).To(HaveOccurred())
-		})
-	})
-
-	Context("AudioTranscriptionStream", func() {
-		It("streams deltas and a closing FinalResult from a cache-aware model", func() {
-			// Streaming needs a cache-aware streaming model (e.g.
-			// realtime_eou); the offline test model would fail stream_begin.
-			modelPath := os.Getenv("PARAKEET_BACKEND_TEST_STREAM_MODEL")
-			audioPath := os.Getenv("PARAKEET_BACKEND_TEST_WAV")
-			if modelPath == "" || audioPath == "" {
-				Skip("set PARAKEET_BACKEND_TEST_STREAM_MODEL (cache-aware streaming model) and PARAKEET_BACKEND_TEST_WAV")
-			}
-			ensureLibLoaded()
-
-			p := &ParakeetCpp{}
-			Expect(p.Load(&pb.ModelOptions{ModelFile: modelPath})).To(Succeed())
-			defer func() { _ = p.Free() }()
-
-			results := make(chan *pb.TranscriptStreamResponse, 64)
-			errCh := make(chan error, 1)
-			go func() {
-				errCh <- p.AudioTranscriptionStream(context.Background(),
-					&pb.TranscriptRequest{Dst: audioPath}, results)
-			}()
-
-			var deltas []string
-			var final *pb.TranscriptResult
-			for r := range results {
-				if r.Delta != "" {
-					deltas = append(deltas, r.Delta)
-				}
-				if r.FinalResult != nil {
-					final = r.FinalResult
-				}
-			}
-			Expect(<-errCh).ToNot(HaveOccurred())
-
-			Expect(final).ToNot(BeNil(), "expected a closing FinalResult")
-			Expect(strings.TrimSpace(final.Text)).ToNot(BeEmpty(),
-				"expected a non-empty streamed transcript")
-			Expect(final.Segments).ToNot(BeEmpty(),
-				"FinalResult always carries at least one segment")
-			// The concatenated deltas reconstruct the final transcript.
-			Expect(strings.TrimSpace(strings.Join(deltas, ""))).To(Equal(strings.TrimSpace(final.Text)),
-				"deltas should reconstruct the final text")
-		})
-	})
-})
--- a/backend/go/parakeet-cpp/main.go
+++ b/backend/go/parakeet-cpp/main.go
@@ -1,94 +0,0 @@
-package main
-
-// Started internally by LocalAI - one gRPC server per loaded model.
-//
-// Loads libparakeet.so via purego and registers the flat C-API entry
-// points declared in parakeet_capi.h. The library name can be overridden
-// with PARAKEET_LIBRARY (mirrors the WHISPER_LIBRARY / VIBEVOICECPP_LIBRARY
-// convention in the sibling backends); the default looks for the .so next
-// to this binary.
-import (
-	"flag"
-	"fmt"
-	"os"
-
-	"github.com/ebitengine/purego"
-	grpc "github.com/mudler/LocalAI/pkg/grpc"
-)
-
-var (
-	addr = flag.String("addr", "localhost:50051", "the address to connect to")
-)
-
-type LibFuncs struct {
-	FuncPtr any
-	Name    string
-}
-
-func main() {
-	libName := os.Getenv("PARAKEET_LIBRARY")
-	if libName == "" {
-		libName = "libparakeet.so"
-	}
-
-	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
-	if err != nil {
-		panic(fmt.Errorf("parakeet-cpp: dlopen %q: %w", libName, err))
-	}
-
-	// Bound 1:1 to parakeet_capi.h. The C-API returns malloc'd char*
-	// buffers from transcribe_*; we register those as uintptr so we get
-	// the raw pointer back and can call parakeet_capi_free_string on it
-	// (purego's string return would copy and forget the original pointer,
-	// leaking it on every call).
-	libFuncs := []LibFuncs{
-		{&CppAbiVersion, "parakeet_capi_abi_version"},
-		{&CppLoad, "parakeet_capi_load"},
-		{&CppFree, "parakeet_capi_free"},
-		{&CppTranscribePath, "parakeet_capi_transcribe_path"},
-		{&CppTranscribePathJSON, "parakeet_capi_transcribe_path_json"},
-		{&CppStreamBegin, "parakeet_capi_stream_begin"},
-		{&CppStreamFeed, "parakeet_capi_stream_feed"},
-		{&CppStreamFinalize, "parakeet_capi_stream_finalize"},
-		{&CppStreamFree, "parakeet_capi_stream_free"},
-		{&CppFreeString, "parakeet_capi_free_string"},
-		{&CppLastError, "parakeet_capi_last_error"},
-	}
-	for _, lf := range libFuncs {
-		purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
-	}
-
-	// The batched-JSON entry point exists only in newer libparakeet.so (ABI >= 2).
-	// Probe with Dlsym and register only if present, so the backend still loads
-	// against an older library (it falls back to per-request transcription).
-	if sym, err := purego.Dlsym(lib, "parakeet_capi_transcribe_pcm_batch_json"); err == nil && sym != 0 {
-		purego.RegisterLibFunc(&CppTranscribePcmBatchJSON, lib, "parakeet_capi_transcribe_pcm_batch_json")
-	}
-
-	// Per-request language variants (multilingual nemotron). Same probe pattern:
-	// present only in libparakeet.so built with multilingual support, so the
-	// backend still loads against an older library and falls back to the
-	// non-lang batched + streaming entry points (model default / "auto").
-	if sym, err := purego.Dlsym(lib, "parakeet_capi_transcribe_pcm_batch_json_lang"); err == nil && sym != 0 {
-		purego.RegisterLibFunc(&CppTranscribePcmBatchJSONLang, lib, "parakeet_capi_transcribe_pcm_batch_json_lang")
-	}
-	if sym, err := purego.Dlsym(lib, "parakeet_capi_stream_begin_lang"); err == nil && sym != 0 {
-		purego.RegisterLibFunc(&CppStreamBeginLang, lib, "parakeet_capi_stream_begin_lang")
-	}
-
-	// Streaming JSON entry points (ABI v4): surface per-word timestamps on the
-	// streaming path. Same probe pattern; absent in older libparakeet.so, where
-	// the backend falls back to the text-only streaming feed.
-	if sym, err := purego.Dlsym(lib, "parakeet_capi_stream_feed_json"); err == nil && sym != 0 {
-		purego.RegisterLibFunc(&CppStreamFeedJSON, lib, "parakeet_capi_stream_feed_json")
-		purego.RegisterLibFunc(&CppStreamFinalizeJSON, lib, "parakeet_capi_stream_finalize_json")
-	}
-
-	fmt.Fprintf(os.Stderr, "[parakeet-cpp] ABI=%d\n", CppAbiVersion())
-
-	flag.Parse()
-
-	if err := grpc.StartServer(*addr, &ParakeetCpp{}); err != nil {
-		panic(err)
-	}
-}
--- a/backend/go/parakeet-cpp/package.sh
+++ b/backend/go/parakeet-cpp/package.sh
@@ -1,23 +0,0 @@
-#!/bin/bash
-#
-# L0 packaging stub: copy the binary, run.sh and libparakeet.so* into
-# package/. The full ldd walk (libc, libstdc++, libgomp, GPU runtimes,
-# arch detection) lands in L3, mirroring backend/go/whisper/package.sh.
-
-set -e
-
-CURDIR=$(dirname "$(realpath "$0")")
-
-mkdir -p "$CURDIR/package/lib"
-
-cp -avf "$CURDIR/parakeet-cpp-grpc" "$CURDIR/package/"
-cp -avf "$CURDIR/run.sh" "$CURDIR/package/"
-
-# libparakeet.so + any soname symlinks (libparakeet.so.X, libparakeet.so.X.Y).
-cp -avf "$CURDIR"/libparakeet.so* "$CURDIR/package/lib/" 2>/dev/null || {
-	echo "ERROR: libparakeet.so not found in $CURDIR, run 'make' first" >&2
-	exit 1
-}
-
-echo "L0 package layout (full ldd walk lands in L3):"
-ls -liah "$CURDIR/package/" "$CURDIR/package/lib/"
--- a/backend/go/parakeet-cpp/run.sh
+++ b/backend/go/parakeet-cpp/run.sh
@@ -1,16 +0,0 @@
-#!/bin/bash
-set -e
-
-CURDIR=$(dirname "$(realpath "$0")")
-
-export LD_LIBRARY_PATH="$CURDIR/lib:$CURDIR:${LD_LIBRARY_PATH:-}"
-
-# If a self-contained ld.so was packaged, route through it so the
-# packaged libc / libstdc++ are used instead of the host's (matches the
-# whisper backend's runtime layout).
-if [ -f "$CURDIR/lib/ld.so" ]; then
-	echo "Using lib/ld.so"
-	exec "$CURDIR/lib/ld.so" "$CURDIR/parakeet-cpp-grpc" "$@"
-fi
-
-exec "$CURDIR/parakeet-cpp-grpc" "$@"
--- a/backend/go/parakeet-cpp/segments_test.go
+++ b/backend/go/parakeet-cpp/segments_test.go
@@ -1,127 +0,0 @@
-package main
-
-import (
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-func tw(text string, start, end float64) transcriptWord {
-	return transcriptWord{W: text, Start: start, End: end}
-}
-
-var _ = Describe("splitWordsIntoSegments (NeMo get_segment_offsets parity)", func() {
-	seps := []rune{'.', '?', '!'}
-
-	It("splits on sentence-ending punctuation, including the delimiter word", func() {
-		words := []transcriptWord{tw("hello", 0, 0.4), tw("world.", 0.4, 0.8), tw("bye", 1.0, 1.3)}
-		segs := splitWordsIntoSegments(words, seps, 0)
-		Expect(segs).To(HaveLen(2))
-		Expect(segs[0]).To(HaveLen(2))
-		Expect(segs[0][1].W).To(Equal("world."))
-		Expect(segs[1]).To(HaveLen(1))
-		Expect(segs[1][0].W).To(Equal("bye"))
-	})
-
-	It("keeps a single segment with no terminal punctuation and gap off", func() {
-		words := []transcriptWord{tw("a", 0, 0.2), tw("b", 0.2, 0.4), tw("c", 5.0, 5.2)}
-		segs := splitWordsIntoSegments(words, seps, 0)
-		Expect(segs).To(HaveLen(1))
-	})
-
-	It("splits on the gap rule when enabled, the gapped word starting the next segment", func() {
-		words := []transcriptWord{tw("a", 0, 0.2), tw("b", 0.2, 0.4), tw("c", 5.0, 5.2)}
-		segs := splitWordsIntoSegments(words, seps, 1.0) // c is 4.6s after b
-		Expect(segs).To(HaveLen(2))
-		Expect(segs[0]).To(HaveLen(2)) // a b
-		Expect(segs[1]).To(HaveLen(1)) // c
-		Expect(segs[1][0].W).To(Equal("c"))
-	})
-
-	It("checks the gap rule before punctuation (NeMo elif order)", func() {
-		// "b." would terminate, but c is far after it -> gap closes [a b.] at b.
-		words := []transcriptWord{tw("a", 0, 0.2), tw("b.", 0.2, 0.4), tw("c", 9.0, 9.2)}
-		segs := splitWordsIntoSegments(words, seps, 1.0)
-		Expect(segs).To(HaveLen(2))
-		Expect(segs[0]).To(HaveLen(2))
-		Expect(segs[1][0].W).To(Equal("c"))
-	})
-
-	It("still splits on punctuation when the gap rule is enabled but does not fire", func() {
-		words := []transcriptWord{tw("hi.", 0, 0.4), tw("bye", 0.4, 0.8)}
-		segs := splitWordsIntoSegments(words, seps, 5.0) // gap never reached
-		Expect(segs).To(HaveLen(2))
-		Expect(segs[0][0].W).To(Equal("hi."))
-	})
-
-	It("returns nothing for empty input", func() {
-		Expect(splitWordsIntoSegments(nil, seps, 0)).To(BeEmpty())
-	})
-})
-
-var _ = Describe("transcriptResultFromDoc (multi-segment)", func() {
-	doc := transcriptJSON{
-		Text:     "hello world. bye now",
-		FrameSec: 0.08,
-		Words: []transcriptWord{
-			{W: "hello", Start: 0.0, End: 0.4},
-			{W: "world.", Start: 0.4, End: 0.8},
-			{W: "bye", Start: 1.0, End: 1.3},
-			{W: "now", Start: 1.3, End: 1.6},
-		},
-		Tokens: []transcriptToken{{ID: 1, T: 0.1}, {ID: 2, T: 0.5}, {ID: 3, T: 1.1}, {ID: 4, T: 1.4}},
-	}
-
-	It("emits one segment per punctuation-delimited group with start/end", func() {
-		res := transcriptResultFromDoc(doc, &pb.TranscriptRequest{}, 0)
-		Expect(res.Segments).To(HaveLen(2))
-		Expect(res.Segments[0].Text).To(Equal("hello world."))
-		Expect(res.Segments[0].Start).To(Equal(int64(0)))
-		Expect(res.Segments[0].End).To(Equal(secondsToNanos(0.8)))
-		Expect(res.Segments[1].Text).To(Equal("bye now"))
-		Expect(res.Segments[1].Start).To(Equal(secondsToNanos(1.0)))
-		Expect(res.Segments[1].Id).To(Equal(int32(1)))
-	})
-
-	It("assigns tokens to the segment whose time window contains them", func() {
-		res := transcriptResultFromDoc(doc, &pb.TranscriptRequest{}, 0)
-		Expect(res.Segments[0].Tokens).To(Equal([]int32{1, 2}))
-		Expect(res.Segments[1].Tokens).To(Equal([]int32{3, 4}))
-	})
-
-	It("attaches per-segment words only when word granularity requested", func() {
-		plain := transcriptResultFromDoc(doc, &pb.TranscriptRequest{}, 0)
-		Expect(plain.Segments[0].Words).To(BeEmpty())
-		withWords := transcriptResultFromDoc(doc, &pb.TranscriptRequest{TimestampGranularities: []string{"word"}}, 0)
-		Expect(withWords.Segments[0].Words).To(HaveLen(2))
-	})
-
-	It("falls back to a single text segment when there are no words", func() {
-		res := transcriptResultFromDoc(transcriptJSON{Text: "hi"}, &pb.TranscriptRequest{}, 0)
-		Expect(res.Segments).To(HaveLen(1))
-		Expect(res.Segments[0].Text).To(Equal("hi"))
-	})
-})
-
-var _ = Describe("streaming segment assembly", func() {
-	It("closes a segment with start/end from its words on EOU", func() {
-		acc := &streamSegmenter{}
-		acc.add(streamFeedJSON{Text: "hello world", Eou: 1, Words: []transcriptWord{
-			{W: "hello", Start: 0.0, End: 0.4}, {W: "world", Start: 0.4, End: 0.9},
-		}})
-		segs := acc.segments()
-		Expect(segs).To(HaveLen(1))
-		Expect(segs[0].Text).To(Equal("hello world"))
-		Expect(segs[0].Start).To(Equal(int64(0)))
-		Expect(segs[0].End).To(Equal(secondsToNanos(0.9)))
-	})
-
-	It("buffers words across feeds until EOU", func() {
-		acc := &streamSegmenter{}
-		acc.add(streamFeedJSON{Text: "hi", Eou: 0, Words: []transcriptWord{{W: "hi", Start: 0, End: 0.3}}})
-		Expect(acc.segments()).To(BeEmpty())
-		acc.add(streamFeedJSON{Text: "there", Eou: 1, Words: []transcriptWord{{W: "there", Start: 0.3, End: 0.7}}})
-		Expect(acc.segments()).To(HaveLen(1))
-		Expect(acc.segments()[0].Text).To(Equal("hi there"))
-	})
-})
--- a/backend/go/qwen3-tts-cpp/Makefile
+++ b/backend/go/qwen3-tts-cpp/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # qwen3-tts.cpp version
 QWEN3TTS_REPO?=https://github.com/predict-woo/qwen3-tts.cpp
-QWEN3TTS_CPP_VERSION?=136e5d36c17083da0321fd96512dc7b263f94a44
+QWEN3TTS_CPP_VERSION?=7a762e2ad4bacc6fdda81d81bf10a09ffb546f29
 SO_TARGET?=libgoqwen3ttscpp.so

 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/go/qwen3-tts-cpp/goqwen3ttscpp.go
+++ b/backend/go/qwen3-tts-cpp/goqwen3ttscpp.go
@@ -4,7 +4,6 @@ import (
 	"fmt"
 	"os"
 	"path/filepath"
-	"strings"

 	"github.com/mudler/LocalAI/pkg/grpc/base"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
@@ -22,43 +21,6 @@ type Qwen3TtsCpp struct {
 	threads int
 }

-// languageNameAliases maps common full language names to the canonical
-// two-letter code understood by the C++ language_to_id table.
-var languageNameAliases = map[string]string{
-	"english":    "en",
-	"russian":    "ru",
-	"chinese":    "zh",
-	"japanese":   "ja",
-	"korean":     "ko",
-	"german":     "de",
-	"french":     "fr",
-	"spanish":    "es",
-	"italian":    "it",
-	"portuguese": "pt",
-}
-
-// normalizeLanguage coerces a caller-supplied language into the canonical code
-// the model expects. It lowercases, trims, strips any region/locale suffix
-// (en-US, en_US, ja.JP -> en/ja), and resolves common full names (english -> en).
-// An empty input stays empty so the C++ side applies its English default; an
-// unrecognized value is returned normalized so C++ can log it and default.
-func normalizeLanguage(lang string) string {
-	lang = strings.ToLower(strings.TrimSpace(lang))
-	if lang == "" {
-		return ""
-	}
-
-	// Strip region/locale suffix: keep the segment before the first separator.
-	if i := strings.IndexAny(lang, "-_."); i >= 0 {
-		lang = lang[:i]
-	}
-
-	if code, ok := languageNameAliases[lang]; ok {
-		return code
-	}
-	return lang
-}
-
 func (q *Qwen3TtsCpp) Load(opts *pb.ModelOptions) error {
 	// ModelFile is the model directory path (containing GGUF files)
 	modelDir := opts.ModelFile
@@ -92,7 +54,7 @@ func (q *Qwen3TtsCpp) TTS(req *pb.TTSRequest) error {
 	dst := req.Dst
 	language := ""
 	if req.Language != nil {
-		language = normalizeLanguage(*req.Language)
+		language = *req.Language
 	}

 	// Synthesis parameters with sensible defaults
--- a/backend/go/qwen3-tts-cpp/language_test.go
+++ b/backend/go/qwen3-tts-cpp/language_test.go
@@ -1,53 +0,0 @@
-package main
-
-import (
-	"testing"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-func TestLanguageNormalization(t *testing.T) {
-	RegisterFailHandler(Fail)
-	RunSpecs(t, "qwen3-tts-cpp language normalization")
-}
-
-var _ = Describe("normalizeLanguage", func() {
-	DescribeTable("maps caller input to the canonical model language code",
-		func(input, expected string) {
-			Expect(normalizeLanguage(input)).To(Equal(expected))
-		},
-		// Canonical codes pass through unchanged
-		Entry("canonical en", "en", "en"),
-		Entry("canonical zh", "zh", "zh"),
-		Entry("canonical pt", "pt", "pt"),
-
-		// Case-insensitive
-		Entry("uppercase", "EN", "en"),
-		Entry("mixed case", "Ja", "ja"),
-
-		// Surrounding whitespace
-		Entry("trims whitespace", "  en  ", "en"),
-
-		// Region/locale stripping
-		Entry("BCP-47 region", "en-US", "en"),
-		Entry("underscore region", "en_US", "en"),
-		Entry("dotted locale", "ja.JP", "ja"),
-		Entry("region + case", "ZH-CN", "zh"),
-
-		// Full-name aliases
-		Entry("english name", "english", "en"),
-		Entry("chinese name cased", "Chinese", "zh"),
-		Entry("japanese name", "japanese", "ja"),
-		Entry("russian name", "russian", "ru"),
-		Entry("portuguese name", "portuguese", "pt"),
-
-		// Empty stays empty (C++ applies the English default)
-		Entry("empty", "", ""),
-		Entry("whitespace only", "   ", ""),
-
-		// Unknown values pass through normalized so C++ can log + default
-		Entry("unknown code", "klingon", "klingon"),
-		Entry("unknown with region", "xx-YY", "xx"),
-	)
-})
--- a/backend/go/rfdetr-cpp/.gitignore
+++ b/backend/go/rfdetr-cpp/.gitignore
@@ -1,7 +0,0 @@
-sources/
-build*/
-package/
-librfdetrcpp*.so
-rfdetr-cpp
-test-models/
-test-data/
--- a/backend/go/rfdetr-cpp/CMakeLists.txt
+++ b/backend/go/rfdetr-cpp/CMakeLists.txt
@@ -1,79 +0,0 @@
-cmake_minimum_required(VERSION 3.18)
-project(librfdetrcpp LANGUAGES C CXX)
-
-set(CMAKE_POSITION_INDEPENDENT_CODE ON)
-set(CMAKE_CXX_STANDARD 17)
-set(CMAKE_CXX_STANDARD_REQUIRED ON)
-
-# Static-link ggml + rfdetr so the resulting .so has no runtime dependency on
-# extra ggml/rfdetr shared libraries — only on libc/libstdc++/libgomp, which
-# the LocalAI package step bundles into the docker image.
-set(BUILD_SHARED_LIBS OFF CACHE BOOL "Build static libraries" FORCE)
-
-# rfdetr.cpp build switches: skip CLI/tests, keep static lib.
-set(RFDETR_BUILD_CLI OFF CACHE BOOL "Disable rfdetr CLI" FORCE)
-set(RFDETR_BUILD_TESTS OFF CACHE BOOL "Disable rfdetr tests" FORCE)
-set(RFDETR_SHARED OFF CACHE BOOL "Build rfdetr as static lib" FORCE)
-
-# rt-detr.cpp's top-level CMakeLists invokes
-# `bash ${CMAKE_SOURCE_DIR}/scripts/apply_ggml_patches.sh` to apply its
-# in-tree ggml patches before descending into the submodule. When we
-# `add_subdirectory` it from a parent project, `CMAKE_SOURCE_DIR` points
-# at *our* directory, not theirs, so the script path resolves wrong.
-#
-# Run the patches script ourselves up front (it's idempotent — re-running
-# is a no-op once patches are applied) so the rt-detr.cpp configure step
-# is essentially a no-op for the patch hook.
-set(RFDETR_CPP_SRC ${CMAKE_CURRENT_SOURCE_DIR}/sources/rt-detr.cpp)
-if(EXISTS ${RFDETR_CPP_SRC}/scripts/apply_ggml_patches.sh)
-    execute_process(
-        COMMAND bash ${RFDETR_CPP_SRC}/scripts/apply_ggml_patches.sh
-        RESULT_VARIABLE _rfdetr_patch_result
-        OUTPUT_VARIABLE _rfdetr_patch_output
-        ERROR_VARIABLE  _rfdetr_patch_error
-        OUTPUT_STRIP_TRAILING_WHITESPACE
-        ERROR_STRIP_TRAILING_WHITESPACE)
-    if(NOT _rfdetr_patch_result EQUAL 0)
-        message(FATAL_ERROR
-            "Failed to apply ggml patches (exit ${_rfdetr_patch_result}):\n"
-            "stdout:\n${_rfdetr_patch_output}\n"
-            "stderr:\n${_rfdetr_patch_error}")
-    endif()
-    message(STATUS "${_rfdetr_patch_output}")
-endif()
-
-# Stage a shim 'scripts/apply_ggml_patches.sh' under our source dir so that
-# rt-detr.cpp's CMakeLists — which calls
-#   bash ${CMAKE_SOURCE_DIR}/scripts/apply_ggml_patches.sh
-# — finds an idempotent no-op there. The real patches have already been
-# applied above; this just satisfies the path lookup.
-file(MAKE_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/scripts)
-file(WRITE ${CMAKE_CURRENT_SOURCE_DIR}/scripts/apply_ggml_patches.sh
-"#!/usr/bin/env bash
-# Shim - patches were already applied by the parent CMakeLists.
-exit 0
-")
-execute_process(COMMAND chmod +x ${CMAKE_CURRENT_SOURCE_DIR}/scripts/apply_ggml_patches.sh)
-
-add_subdirectory(./sources/rt-detr.cpp)
-
-# rfdetr.cpp's C-API symbols already live inside librfdetr (src/rfdetr_capi.cpp
-# is compiled into the lib). We re-export them via a MODULE library that
-# whole-archive-links rfdetr so the symbols are visible at dlopen time.
-add_library(rfdetrcpp MODULE
-    sources/rt-detr.cpp/src/rfdetr_capi.cpp)
-
-target_include_directories(rfdetrcpp PRIVATE
-    sources/rt-detr.cpp/include
-    sources/rt-detr.cpp/src
-    sources/rt-detr.cpp/third_party/stb
-)
-
-target_link_libraries(rfdetrcpp PRIVATE rfdetr ggml)
-
-if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
-    target_link_libraries(rfdetrcpp PRIVATE stdc++fs)
-endif()
-
-set_property(TARGET rfdetrcpp PROPERTY CXX_STANDARD 17)
-set_target_properties(rfdetrcpp PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
--- a/backend/go/rfdetr-cpp/Makefile
+++ b/backend/go/rfdetr-cpp/Makefile
@@ -1,135 +0,0 @@
-CMAKE_ARGS?=
-BUILD_TYPE?=
-NATIVE?=false
-
-GOCMD?=go
-GO_TAGS?=
-JOBS?=$(shell nproc --ignore=1)
-
-# rt-detr.cpp (GitHub redirects the historical mudler/rt-detr.cpp to the new
-# mudler/rf-detr.cpp slug). RFDETR_VERSION is pinned to a specific commit for a
-# stable build; override it (e.g. with `master`) to pick up the latest C-API
-# surface (incl. the per-detection accessor functions used by gorfdetrcpp.go).
-RFDETR_REPO?=https://github.com/mudler/rf-detr.cpp.git
-RFDETR_VERSION?=65c0ffcc9a9bc9dae38252f63d0417c9845a6cf7
-
-ifeq ($(NATIVE),false)
-	CMAKE_ARGS+=-DGGML_NATIVE=OFF
-endif
-
-# Forward LocalAI's BUILD_TYPE to the matching ggml backend switch.
-ifeq ($(BUILD_TYPE),cublas)
-	CMAKE_ARGS+=-DGGML_CUDA=ON -DRFDETR_GGML_CUDA=ON
-else ifeq ($(BUILD_TYPE),openblas)
-	CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
-else ifeq ($(BUILD_TYPE),clblas)
-	CMAKE_ARGS+=-DGGML_CLBLAST=ON
-else ifeq ($(BUILD_TYPE),hipblas)
-	ROCM_HOME ?= /opt/rocm
-	ROCM_PATH ?= /opt/rocm
-	export CXX=$(ROCM_HOME)/llvm/bin/clang++
-	export CC=$(ROCM_HOME)/llvm/bin/clang
-	AMDGPU_TARGETS?=gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1200,gfx1201
-	CMAKE_ARGS+=-DGGML_HIPBLAS=ON -DRFDETR_GGML_HIPBLAS=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
-else ifeq ($(BUILD_TYPE),vulkan)
-	CMAKE_ARGS+=-DGGML_VULKAN=ON -DRFDETR_GGML_VULKAN=ON
-else ifeq ($(OS),Darwin)
-	ifneq ($(BUILD_TYPE),metal)
-		CMAKE_ARGS+=-DGGML_METAL=OFF
-	else
-		CMAKE_ARGS+=-DGGML_METAL=ON
-		CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
-		CMAKE_ARGS+=-DRFDETR_GGML_METAL=ON
-	endif
-endif
-
-ifeq ($(BUILD_TYPE),sycl_f16)
-	CMAKE_ARGS+=-DGGML_SYCL=ON \
-		-DCMAKE_C_COMPILER=icx \
-		-DCMAKE_CXX_COMPILER=icpx \
-		-DGGML_SYCL_F16=ON
-endif
-
-ifeq ($(BUILD_TYPE),sycl_f32)
-	CMAKE_ARGS+=-DGGML_SYCL=ON \
-		-DCMAKE_C_COMPILER=icx \
-		-DCMAKE_CXX_COMPILER=icpx
-endif
-
-sources/rt-detr.cpp:
-	mkdir -p sources && \
-	git clone --recursive $(RFDETR_REPO) sources/rt-detr.cpp && \
-	cd sources/rt-detr.cpp && \
-	git checkout $(RFDETR_VERSION) && \
-	git submodule update --init --recursive --depth 1 --single-branch
-
-# Detect OS
-UNAME_S := $(shell uname -s)
-
-# Only build CPU variants on Linux
-ifeq ($(UNAME_S),Linux)
-	VARIANT_TARGETS = librfdetrcpp-avx.so librfdetrcpp-avx2.so librfdetrcpp-avx512.so librfdetrcpp-fallback.so
-else
-	# On non-Linux (e.g., Darwin), build only fallback variant
-	VARIANT_TARGETS = librfdetrcpp-fallback.so
-endif
-
-rfdetr-cpp: main.go gorfdetrcpp.go $(VARIANT_TARGETS)
-	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o rfdetr-cpp ./
-
-package: rfdetr-cpp
-	bash package.sh
-
-build: package
-
-clean: purge
-	rm -rf librfdetrcpp*.so rfdetr-cpp package sources
-
-purge:
-	rm -rf build*
-
-# Build all variants (Linux only)
-ifeq ($(UNAME_S),Linux)
-librfdetrcpp-avx.so: sources/rt-detr.cpp
-	rm -rfv build-$@
-	$(info ${GREEN}I rfdetr-cpp build info:avx${RESET})
-	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) librfdetrcpp-custom
-	rm -rfv build-$@
-
-librfdetrcpp-avx2.so: sources/rt-detr.cpp
-	rm -rfv build-$@
-	$(info ${GREEN}I rfdetr-cpp build info:avx2${RESET})
-	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) librfdetrcpp-custom
-	rm -rfv build-$@
-
-librfdetrcpp-avx512.so: sources/rt-detr.cpp
-	rm -rfv build-$@
-	$(info ${GREEN}I rfdetr-cpp build info:avx512${RESET})
-	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) librfdetrcpp-custom
-	rm -rfv build-$@
-endif
-
-# Build fallback variant (all platforms)
-librfdetrcpp-fallback.so: sources/rt-detr.cpp
-	rm -rfv build-$@
-	$(info ${GREEN}I rfdetr-cpp build info:fallback${RESET})
-	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) librfdetrcpp-custom
-	rm -rfv build-$@
-
-librfdetrcpp-custom: CMakeLists.txt
-	mkdir -p build-$(SO_TARGET) && \
-	cd build-$(SO_TARGET) && \
-	cmake .. $(CMAKE_ARGS) && \
-	cmake --build . --config Release -j$(JOBS) && \
-	cd .. && \
-	mv build-$(SO_TARGET)/librfdetrcpp.so ./$(SO_TARGET)
-
-all: rfdetr-cpp package
-
-# `test` is invoked by the top-level Makefile's `test-extra` target. It builds
-# the backend binary + the fallback shared library (needed for dlopen at
-# runtime), then runs test.sh which downloads the test models + COCO image
-# and exercises the gRPC Load/Detect wire path via the Go smoke test in
-# main_test.go for both the detection and segmentation models.
-test: rfdetr-cpp librfdetrcpp-fallback.so
-	bash test.sh
--- a/backend/go/rfdetr-cpp/gorfdetrcpp.go
+++ b/backend/go/rfdetr-cpp/gorfdetrcpp.go
@@ -1,195 +0,0 @@
-package main
-
-// gorfdetrcpp.go - gRPC handlers (Load, Detect) for the rfdetr-cpp backend.
-//
-// Embeds base.SingleThread to default unimplemented RPCs to "not supported"
-// while we only implement object detection.
-
-import (
-	"encoding/base64"
-	"fmt"
-	"os"
-	"path/filepath"
-	"strconv"
-	"unsafe"
-
-	"github.com/mudler/LocalAI/pkg/grpc/base"
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-)
-
-// Default upper bound on detections returned per image. RF-DETR's decoder
-// queries are limited to a few hundred; 300 is a safe ceiling.
-const defaultTopK = 300
-
-// rfdetr_handle_t is a uintptr-typed opaque handle (see include/rfdetr_capi.h).
-var (
-	// rfdetr_capi_load(const char* model_path, int n_threads, rfdetr_handle_t* out_handle) -> int
-	CapiLoad func(modelPath string, nThreads int32, outHandle *uintptr) int32
-	// rfdetr_capi_unload(rfdetr_handle_t handle) -> int
-	CapiUnload func(handle uintptr) int32
-	// rfdetr_capi_detect_path(handle, image_path, threshold, top_k, out_json) -> int
-	CapiDetectPath func(handle uintptr, imagePath string, threshold float32, topK uint32, outJSON *uintptr) int32
-	// rfdetr_capi_detect_buffer(handle, bytes, len, threshold, top_k, out_json) -> int
-	CapiDetectBuffer func(handle uintptr, bytes uintptr, length uintptr, threshold float32, topK uint32, outJSON *uintptr) int32
-	// rfdetr_capi_free_string(char* s)
-	CapiFreeString func(s uintptr)
-	// rfdetr_capi_get_n_detections(handle) -> int
-	CapiGetNDetections func(handle uintptr) int32
-	// rfdetr_capi_get_detection_class_id(handle, i) -> int
-	CapiGetDetectionClassID func(handle uintptr, i int32) int32
-	// rfdetr_capi_get_detection_box(handle, i, out_xyxy[4]) -> int (0 on success)
-	CapiGetDetectionBox func(handle uintptr, i int32, outXYXY uintptr) int32
-	// rfdetr_capi_get_detection_score(handle, i) -> float
-	CapiGetDetectionScore func(handle uintptr, i int32) float32
-	// rfdetr_capi_get_detection_class_name(handle, i, buf, buf_size) -> int (needed/written; two-call sizing)
-	CapiGetDetectionClassName func(handle uintptr, i int32, buf uintptr, bufSize int32) int32
-	// rfdetr_capi_get_detection_mask_png(handle, i, buf, buf_size) -> int (needed/written; 0 means no mask)
-	CapiGetDetectionMaskPNG func(handle uintptr, i int32, buf uintptr, bufSize int32) int32
-)
-
-type RFDetrCpp struct {
-	base.SingleThread
-	handle uintptr
-}
-
-// Load loads the GGUF model at opts.ModelFile, falling back to opts.Model (joined
-// with opts.ModelPath if relative), and stores the handle for later Detect calls.
-func (r *RFDetrCpp) Load(opts *pb.ModelOptions) error {
-	modelFile := opts.ModelFile
-	if modelFile == "" {
-		modelFile = opts.Model
-	}
-	if modelFile == "" {
-		return fmt.Errorf("rfdetr-cpp: ModelFile is empty")
-	}
-
-	var modelPath string
-	if filepath.IsAbs(modelFile) {
-		modelPath = modelFile
-	} else {
-		modelPath = filepath.Join(opts.ModelPath, modelFile)
-	}
-
-	if _, err := os.Stat(modelPath); err != nil {
-		return fmt.Errorf("rfdetr-cpp: model file not found: %s: %w", modelPath, err)
-	}
-
-	threads := opts.Threads
-	if threads <= 0 {
-		threads = 4
-	}
-
-	// Release previous model if any (re-Load).
-	if r.handle != 0 {
-		CapiUnload(r.handle)
-		r.handle = 0
-	}
-
-	var h uintptr
-	rc := CapiLoad(modelPath, threads, &h)
-	if rc != 0 || h == 0 {
-		return fmt.Errorf("rfdetr-cpp: rfdetr_capi_load failed with rc=%d for %s", rc, modelPath)
-	}
-	r.handle = h
-	return nil
-}
-
-// Detect runs object detection on the base64-encoded image in opts.Src at
-// opts.Threshold (0.5 when unset or non-positive), returning one pb.Detection
-// per result. Seg models also populate Detection.Mask with PNG-encoded mask bytes.
-func (r *RFDetrCpp) Detect(opts *pb.DetectOptions) (pb.DetectResponse, error) {
-	if r.handle == 0 {
-		return pb.DetectResponse{}, fmt.Errorf("rfdetr-cpp: model not loaded")
-	}
-
-	// Decode base64 image and write to temp file.
-	imgData, err := base64.StdEncoding.DecodeString(opts.Src)
-	if err != nil {
-		return pb.DetectResponse{}, fmt.Errorf("rfdetr-cpp: failed to decode base64 image: %w", err)
-	}
-
-	tmpFile, err := os.CreateTemp("", "rfdetr-*.img")
-	if err != nil {
-		return pb.DetectResponse{}, fmt.Errorf("rfdetr-cpp: failed to create temp file: %w", err)
-	}
-	defer func() { _ = os.Remove(tmpFile.Name()) }()
-
-	if _, err := tmpFile.Write(imgData); err != nil {
-		_ = tmpFile.Close()
-		return pb.DetectResponse{}, fmt.Errorf("rfdetr-cpp: failed to write temp file: %w", err)
-	}
-	if err := tmpFile.Close(); err != nil {
-		return pb.DetectResponse{}, fmt.Errorf("rfdetr-cpp: failed to close temp file: %w", err)
-	}
-
-	threshold := opts.Threshold
-	if threshold <= 0 {
-		threshold = 0.5
-	}
-
-	// JSON output from detect_path is unused: we read structured detections via
-	// the accessor functions. Still must free the returned string.
-	var jsonPtr uintptr
-	rc := CapiDetectPath(r.handle, tmpFile.Name(), threshold, uint32(defaultTopK), &jsonPtr)
-	if jsonPtr != 0 {
-		CapiFreeString(jsonPtr)
-	}
-	if rc != 0 {
-		return pb.DetectResponse{}, fmt.Errorf("rfdetr-cpp: detect failed with rc=%d", rc)
-	}
-
-	n := CapiGetNDetections(r.handle)
-	if n < 0 {
-		return pb.DetectResponse{}, fmt.Errorf("rfdetr-cpp: invalid n_detections=%d", n)
-	}
-
-	detections := make([]*pb.Detection, 0, n)
-	for i := int32(0); i < n; i++ {
-		var bbox [4]float32 // x1, y1, x2, y2
-		if rc := CapiGetDetectionBox(r.handle, i, uintptr(unsafe.Pointer(&bbox[0]))); rc != 0 {
-			continue
-		}
-		cid := CapiGetDetectionClassID(r.handle, i)
-		score := CapiGetDetectionScore(r.handle, i)
-
-		// Two-call sizing for class_name.
-		var className string
-		nameSize := CapiGetDetectionClassName(r.handle, i, 0, 0)
-		if nameSize > 1 {
-			buf := make([]byte, nameSize)
-			written := CapiGetDetectionClassName(r.handle, i, uintptr(unsafe.Pointer(&buf[0])), nameSize)
-			// `written` is the same number (needed bytes including NUL); strip NUL.
-			if written > 0 && int(written) <= len(buf) {
-				className = string(buf[:written-1])
-			} else {
-				className = string(buf[:len(buf)-1])
-			}
-		}
-		if className == "" {
-			className = strconv.Itoa(int(cid))
-		}
-
-		// Two-call sizing for mask PNG (returns 0 when no mask).
-		var mask []byte
-		maskSize := CapiGetDetectionMaskPNG(r.handle, i, 0, 0)
-		if maskSize > 0 {
-			maskBuf := make([]byte, maskSize)
-			CapiGetDetectionMaskPNG(r.handle, i, uintptr(unsafe.Pointer(&maskBuf[0])), maskSize)
-			mask = maskBuf
-		}
-
-		detections = append(detections, &pb.Detection{
-			X:          bbox[0],
-			Y:          bbox[1],
-			Width:      bbox[2] - bbox[0],
-			Height:     bbox[3] - bbox[1],
-			Confidence: score,
-			ClassName:  className,
-			Mask:       mask,
-		})
-	}
-
-	return pb.DetectResponse{
-		Detections: detections,
-	}, nil
-}
--- a/backend/go/rfdetr-cpp/main.go
+++ b/backend/go/rfdetr-cpp/main.go
@@ -1,61 +0,0 @@
-package main
-
-// main.go - entry point for the rfdetr-cpp gRPC backend.
-//
-// Dlopens librfdetrcpp-<variant>.so via purego at the path in
-// RFDETR_LIBRARY (set by run.sh based on /proc/cpuinfo), registers the
-// rfdetr_capi_* C ABI symbols, then starts the gRPC server.
-
-import (
-	"flag"
-	"os"
-
-	"github.com/ebitengine/purego"
-	grpc "github.com/mudler/LocalAI/pkg/grpc"
-)
-
-var (
-	addr = flag.String("addr", "localhost:50051", "the address to listen on")
-)
-
-type LibFuncs struct {
-	FuncPtr any
-	Name    string
-}
-
-func main() {
-	// Get library name from environment variable, default to fallback
-	libName := os.Getenv("RFDETR_LIBRARY")
-	if libName == "" {
-		libName = "./librfdetrcpp-fallback.so"
-	}
-
-	rfdetrLib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
-	if err != nil {
-		panic(err)
-	}
-
-	libFuncs := []LibFuncs{
-		{&CapiLoad, "rfdetr_capi_load"},
-		{&CapiUnload, "rfdetr_capi_unload"},
-		{&CapiDetectPath, "rfdetr_capi_detect_path"},
-		{&CapiDetectBuffer, "rfdetr_capi_detect_buffer"},
-		{&CapiFreeString, "rfdetr_capi_free_string"},
-		{&CapiGetNDetections, "rfdetr_capi_get_n_detections"},
-		{&CapiGetDetectionClassID, "rfdetr_capi_get_detection_class_id"},
-		{&CapiGetDetectionBox, "rfdetr_capi_get_detection_box"},
-		{&CapiGetDetectionScore, "rfdetr_capi_get_detection_score"},
-		{&CapiGetDetectionClassName, "rfdetr_capi_get_detection_class_name"},
-		{&CapiGetDetectionMaskPNG, "rfdetr_capi_get_detection_mask_png"},
-	}
-
-	for _, lf := range libFuncs {
-		purego.RegisterLibFunc(lf.FuncPtr, rfdetrLib, lf.Name)
-	}
-
-	flag.Parse()
-
-	if err := grpc.StartServer(*addr, &RFDetrCpp{}); err != nil {
-		panic(err)
-	}
-}
--- a/backend/go/rfdetr-cpp/main_test.go
+++ b/backend/go/rfdetr-cpp/main_test.go
@@ -1,220 +0,0 @@
-package main
-
-// main_test.go - end-to-end smoke test for the rfdetr-cpp gRPC backend.
-//
-// Spawns the compiled rfdetr-cpp binary on a free local port, dials it via
-// gRPC, and exercises LoadModel + Detect against the test fixtures
-// downloaded by test.sh. Two scenarios:
-//
-//   1. detection — loads rfdetr-nano-q8_0.gguf and asserts at least one
-//      detection comes back with a non-empty class name and a bounding box
-//      of non-zero size.
-//   2. segmentation — loads rfdetr-seg-nano-q8_0.gguf and additionally
-//      asserts that at least one detection carries a PNG-encoded mask blob
-//      (verified by PNG magic bytes).
-//
-// Both specs Skip cleanly if their fixtures are missing so the test target
-// stays usable on a fresh checkout where models haven't been downloaded.
-
-import (
-	"context"
-	"encoding/base64"
-	"fmt"
-	"net"
-	"os"
-	"os/exec"
-	"path/filepath"
-	"testing"
-	"time"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-	"google.golang.org/grpc"
-	"google.golang.org/grpc/credentials/insecure"
-)
-
-func TestRFDetrCpp(t *testing.T) {
-	RegisterFailHandler(Fail)
-	RunSpecs(t, "rfdetr-cpp backend smoke suite")
-}
-
-// freePort grabs an ephemeral TCP port and immediately releases it so the
-// spawned backend can bind to it. There is a tiny TOCTOU window here but in
-// practice it's adequate for a smoke test on a quiet runner.
-func freePort() int {
-	l, err := net.Listen("tcp", "127.0.0.1:0")
-	Expect(err).ToNot(HaveOccurred(), "freePort listen")
-	port := l.Addr().(*net.TCPAddr).Port
-	Expect(l.Close()).To(Succeed())
-	return port
-}
-
-// startBackend spawns the rfdetr-cpp binary on the given port and waits
-// until it accepts TCP connections (up to 10s). The returned cleanup func
-// kills the process and reaps it.
-func startBackend(port int) func() {
-	binary, err := filepath.Abs("./rfdetr-cpp")
-	Expect(err).ToNot(HaveOccurred())
-	if _, err := os.Stat(binary); err != nil {
-		Skip(fmt.Sprintf("backend binary not built: %s (run `make rfdetr-cpp` first)", binary))
-	}
-
-	libPath, err := filepath.Abs("./librfdetrcpp-fallback.so")
-	Expect(err).ToNot(HaveOccurred())
-	if _, err := os.Stat(libPath); err != nil {
-		Skip(fmt.Sprintf("fallback library not built: %s (run `make librfdetrcpp-fallback.so` first)", libPath))
-	}
-
-	addr := fmt.Sprintf("127.0.0.1:%d", port)
-	cmd := exec.Command(binary, "--addr", addr)
-	cmd.Env = append(os.Environ(), "RFDETR_LIBRARY="+libPath)
-	cmd.Stdout = os.Stderr
-	cmd.Stderr = os.Stderr
-	Expect(cmd.Start()).To(Succeed())
-
-	cleanup := func() {
-		if cmd.Process != nil {
-			_ = cmd.Process.Kill()
-			_, _ = cmd.Process.Wait()
-		}
-	}
-
-	deadline := time.Now().Add(10 * time.Second)
-	for time.Now().Before(deadline) {
-		c, err := net.DialTimeout("tcp", addr, 200*time.Millisecond)
-		if err == nil {
-			_ = c.Close()
-			return cleanup
-		}
-		time.Sleep(200 * time.Millisecond)
-	}
-
-	cleanup()
-	Fail(fmt.Sprintf("backend did not become ready on %s within 10s", addr))
-	return func() {}
-}
-
-// loadTestImage reads the COCO test image downloaded by test.sh and returns
-// its base64-encoded content (the wire format accepted by the Detect RPC).
-func loadTestImage() string {
-	imgPath, err := filepath.Abs("test-data/test.jpg")
-	Expect(err).ToNot(HaveOccurred())
-	imgBytes, err := os.ReadFile(imgPath)
-	if err != nil {
-		Skip(fmt.Sprintf("test image not present: %s (run test.sh first)", imgPath))
-	}
-	return base64.StdEncoding.EncodeToString(imgBytes)
-}
-
-// dialBackend opens a gRPC client connection to the spawned backend.
-func dialBackend(port int) (pb.BackendClient, func()) {
-	addr := fmt.Sprintf("127.0.0.1:%d", port)
-	conn, err := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
-	Expect(err).ToNot(HaveOccurred())
-	return pb.NewBackendClient(conn), func() { _ = conn.Close() }
-}
-
-// modelPathOrSkip resolves a model file under ./test-models/ and Skip()s
-// the current spec if it's missing.
-func modelPathOrSkip(name string) string {
-	modelDir, err := filepath.Abs("test-models")
-	Expect(err).ToNot(HaveOccurred())
-	modelPath := filepath.Join(modelDir, name)
-	if _, err := os.Stat(modelPath); err != nil {
-		Skip(fmt.Sprintf("model not present: %s (run test.sh first)", modelPath))
-	}
-	return modelPath
-}
-
-var _ = Describe("rfdetr-cpp backend", func() {
-	It("runs object detection against a known-good COCO image", func() {
-		modelPath := modelPathOrSkip("rfdetr-nano-q8_0.gguf")
-		imgB64 := loadTestImage()
-
-		port := freePort()
-		cleanup := startBackend(port)
-		defer cleanup()
-
-		client, closeConn := dialBackend(port)
-		defer closeConn()
-
-		ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
-		defer cancel()
-
-		loadResp, err := client.LoadModel(ctx, &pb.ModelOptions{
-			Model:     "rfdetr-nano-q8_0.gguf",
-			ModelFile: modelPath,
-			Threads:   2,
-		})
-		Expect(err).ToNot(HaveOccurred(), "LoadModel")
-		Expect(loadResp.GetSuccess()).To(BeTrue(), "LoadModel reported failure: %s", loadResp.GetMessage())
-
-		detResp, err := client.Detect(ctx, &pb.DetectOptions{
-			Src:       imgB64,
-			Threshold: 0.5,
-		})
-		Expect(err).ToNot(HaveOccurred(), "Detect")
-		Expect(detResp.GetDetections()).ToNot(BeEmpty(), "no detections returned on a known-good COCO image")
-
-		_, _ = fmt.Fprintf(GinkgoWriter, "detection OK: %d detections\n", len(detResp.GetDetections()))
-		for i, d := range detResp.GetDetections() {
-			Expect(d.GetClassName()).ToNot(BeEmpty(), "detection %d has empty class_name", i)
-			Expect(d.GetConfidence()).To(BeNumerically(">=", float32(0.5)),
-				"detection %d below threshold", i)
-			Expect(d.GetWidth()).To(BeNumerically(">", float32(0)),
-				"detection %d has non-positive width", i)
-			Expect(d.GetHeight()).To(BeNumerically(">", float32(0)),
-				"detection %d has non-positive height", i)
-		}
-	})
-
-	It("runs segmentation and returns PNG-encoded masks", func() {
-		modelPath := modelPathOrSkip("rfdetr-seg-nano-q8_0.gguf")
-		imgB64 := loadTestImage()
-
-		port := freePort()
-		cleanup := startBackend(port)
-		defer cleanup()
-
-		client, closeConn := dialBackend(port)
-		defer closeConn()
-
-		ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
-		defer cancel()
-
-		loadResp, err := client.LoadModel(ctx, &pb.ModelOptions{
-			Model:     "rfdetr-seg-nano-q8_0.gguf",
-			ModelFile: modelPath,
-			Threads:   2,
-		})
-		Expect(err).ToNot(HaveOccurred(), "LoadModel")
-		Expect(loadResp.GetSuccess()).To(BeTrue(), "LoadModel reported failure: %s", loadResp.GetMessage())
-
-		detResp, err := client.Detect(ctx, &pb.DetectOptions{
-			Src:       imgB64,
-			Threshold: 0.5,
-		})
-		Expect(err).ToNot(HaveOccurred(), "Detect")
-		Expect(detResp.GetDetections()).ToNot(BeEmpty(), "no detections returned from segmentation model")
-
-		haveMask := false
-		for i, d := range detResp.GetDetections() {
-			m := d.GetMask()
-			if len(m) == 0 {
-				continue
-			}
-			haveMask = true
-			// Verify PNG magic: 89 50 4E 47 ("\x89PNG").
-			Expect(len(m)).To(BeNumerically(">=", 4), "detection %d mask too short", i)
-			Expect([]byte{m[0], m[1], m[2], m[3]}).To(Equal([]byte{0x89, 'P', 'N', 'G'}),
-				"detection %d mask is not a PNG", i)
-		}
-		Expect(haveMask).To(BeTrue(),
-			"segmentation model returned %d detections but none carried a mask",
-			len(detResp.GetDetections()))
-
-		_, _ = fmt.Fprintf(GinkgoWriter, "segmentation OK: %d detections, at least one with PNG mask\n",
-			len(detResp.GetDetections()))
-	})
-})
--- a/backend/go/rfdetr-cpp/package.sh
+++ b/backend/go/rfdetr-cpp/package.sh
@@ -1,59 +0,0 @@
-#!/bin/bash
-
-# Script to copy the appropriate libraries based on architecture
-
-set -e
-
-CURDIR=$(dirname "$(realpath "$0")")
-REPO_ROOT="${CURDIR}/../../.."
-
-# Create lib directory
-mkdir -p $CURDIR/package/lib
-
-cp -avf $CURDIR/librfdetrcpp-*.so $CURDIR/package/
-cp -avf $CURDIR/rfdetr-cpp $CURDIR/package/
-cp -fv $CURDIR/run.sh $CURDIR/package/
-
-# Detect architecture and copy appropriate libraries
-if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
-    # x86_64 architecture
-    echo "Detected x86_64 architecture, copying x86_64 libraries..."
-    cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
-    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
-    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
-    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
-    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
-    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
-    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
-    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
-    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
-elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
-    # ARM64 architecture
-    echo "Detected ARM64 architecture, copying ARM64 libraries..."
-    cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
-    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
-    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
-    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
-    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
-    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
-    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
-    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
-    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
-elif [ "$(uname -s)" = "Darwin" ]; then
-    echo "Detected Darwin"
-else
-    echo "Error: Could not detect architecture"
-    exit 1
-fi
-
-# Package GPU libraries based on BUILD_TYPE
-GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
-if [ -f "$GPU_LIB_SCRIPT" ]; then
-    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
-    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
-    package_gpu_libs
-fi
-
-echo "Packaging completed successfully"
-ls -liah $CURDIR/package/
-ls -liah $CURDIR/package/lib/