feat(dllm): image input through the backend (multimodal C-ABI)

Routes PredictOptions.Images (raw base64, the core convention) through dllm.cpp's probed multimodal entry points as data: URIs. The gemma4 renderer appends one engine-side <image> marker per image after the last user message (llama.cpp attachment convention; the template's content-parts branch is unreachable through the flattened pb shape). The engine expands markers to boi + soft*n + eoi and splices the vision-tower embeddings.

An older libdllm.so without the mm symbols fails with an actionable error (Dlsym probe). The DLLM_VERSION pin is bumped to the engine's vision-capable commit.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
feat(dllm): default gallery entry on Q4_K_M; add Q8_0 variant
2026-06-12 18:58:49 -04:00 · 2026-06-12 00:41:04 +00:00 · 2026-06-11 20:24:26 +00:00 · 2026-06-11 19:22:02 +00:00 · 2026-06-11 17:50:04 +00:00 · 2026-06-11 17:17:54 +00:00
172 changed files with 6482 additions and 5825 deletions
--- a/.agents/dllm-backend.md
+++ b/.agents/dllm-backend.md
@@ -0,0 +1,138 @@
+# Working on the dllm Backend
+
+`mudler/dllm.cpp` is a standalone C++/ggml engine for DiffusionGemma
+block-diffusion models. LocalAI wraps it with a **pure-Go** backend at
+`backend/go/dllm/` that dlopens `libdllm.so` via purego (ebitengine/purego) -
+NOT cgo, and NOT a C++ grpc-server fork. The Go side owns chat templating
+(gemma4 renderer) and output parsing (gemma4 streaming parser) and implements
+the rich gRPC interface (`PredictRich`/`PredictStreamRich`, ChatDelta replies).
+
+> NOTE: github.com/mudler/dllm.cpp is still **private** (publishing is
+> planned). Until then the Makefile's anonymous clone fails; use the local-dev
+> symlink shortcut documented at the top of `backend/go/dllm/Makefile`
+> (symlink an out-of-tree `build/libdllm.so` into the backend dir and skip the
+> clone), or a git credential helper with repo access.
+
+## Pin
+
+`backend/go/dllm/Makefile` pins `DLLM_VERSION?=<sha>` at the top
+(whisper / parakeet-cpp / ds4 convention). The bump-deps bot
+(`.github/workflows/bump_deps.yaml`) tracks `mudler/dllm.cpp` `main` and
+rewrites that variable. After a manual bump, run `make -C backend/go/dllm purge &&
+make -C backend/go/dllm`: the clone is keyed on the directory existing, not
+the sha, so a stale checkout would otherwise be reused.
+
+## C-ABI and the serialization contract
+
+The binding covers the 9-symbol flat C-ABI from dllm.cpp's
+`include/dllm_capi.h` (ABI v1; `main.go` hard-fails on a version mismatch):
+`abi_version, load, free, last_error, free_string, tokenize_json, generate,
+generate_stream, cancel`. Contract points the Go wiring encodes (`capi.go`
+header comment has the full list):
+
+- **One ctx = one concurrent generate/tokenize.** A per-model worker
+  goroutine (`Dllm.jobs` in `dllm.go`) owns ALL C calls, so serialization
+  is structural rather than a matter of lock discipline.
+- **`dllm_capi_cancel` is the ONE exception**: it only flips an atomic and may
+  be called from any goroutine mid-generate, so `Dllm.Cancel` bypasses the
+  worker queue. The flag resets at the start of each generate, so a watchdog
+  racing a new generate must re-issue cancel.
+- **`last_error` is a borrowed pointer** and must only be read AFTER the
+  failing call has returned (never while a generate is in flight on the
+  same ctx).
+- **Free vs in-flight requests**: requests hold `genMu.RLock` for their full
+  duration; `Free` takes the write lock, so it only runs when nothing is in
+  flight, then drains and closes the worker. Post-Free requests get a clean
+  "model not loaded" error.
+- `tokenize_json`/`generate` return malloc'd `char*` (bound as `uintptr`,
+  copied, then `dllm_capi_free_string`d); opts/params JSON must be a FLAT
+  object of scalars (`buildOptsJSON` rejects anything else).
+
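The serialization contract above (one worker goroutine owning every engine call, requests holding a read lock, `Free` taking the write lock) can be sketched as follows. This is a minimal illustration, not the backend's code: the `model`, `job`, and `result` types and the `Do` method are hypothetical names, and plain Go closures stand in for the C calls.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// result and job are hypothetical helper types for this sketch.
type result struct {
	out string
	err error
}

type job struct {
	run  func() (string, error) // stands in for a C call on the ctx
	done chan result
}

// model is a hypothetical stand-in for the per-model backend state.
type model struct {
	genMu  sync.RWMutex // requests hold RLock; Free takes Lock
	jobs   chan job     // consumed by exactly one goroutine
	loaded bool
}

func newModel() *model {
	m := &model{jobs: make(chan job), loaded: true}
	go func() {
		// The only goroutine that would ever touch the engine context.
		for j := range m.jobs {
			out, err := j.run()
			j.done <- result{out, err}
		}
	}()
	return m
}

// Do submits fn to the worker and waits for its result.
func (m *model) Do(fn func() (string, error)) (string, error) {
	m.genMu.RLock()
	defer m.genMu.RUnlock()
	if !m.loaded {
		return "", errors.New("model not loaded")
	}
	done := make(chan result, 1) // buffered so the worker never blocks
	m.jobs <- job{run: fn, done: done}
	r := <-done
	return r.out, r.err
}

// Free waits for in-flight requests, then stops the worker.
func (m *model) Free() {
	m.genMu.Lock()
	defer m.genMu.Unlock()
	if m.loaded {
		m.loaded = false
		close(m.jobs)
	}
}

func main() {
	m := newModel()
	out, err := m.Do(func() (string, error) { return "generated", nil })
	fmt.Println(out, err)
	m.Free()
	_, err = m.Do(func() (string, error) { return "", nil })
	fmt.Println(err)
}
```

Because every request funnels through the single `jobs` channel, two engine calls cannot overlap even when many goroutines call `Do` at once. A cancel path would deliberately skip this queue, since it only flips an atomic flag.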
+## Wire shape
+
+| RPC | Implementation |
+|---|---|
+| LoadModel | `dllm_capi_load` (params: `n_gpu_layers`, `n_threads`, `ctx_len`); `Options[]` parsed into per-request gen opts (`eb_*`, `blocks`, `kv_cache`) by `parseModelGenOpts` |
+| PredictRich | render (if templated) → `dllm_capi_generate` → parse → ONE Reply with aggregated ChatDeltas + legacy `Message` bytes |
+| PredictStreamRich | `dllm_capi_generate_stream`; per committed diffusion block → UTF-8 holdback → parser.Feed → one Reply per non-empty delta batch (channel closed by the CALLER, per `pkg/grpc/interface.go`) |
+| Predict / PredictStream | Legacy paths, delegate to the rich pair (legacy stream INVERTS channel ownership: the impl closes) |
+| TokenizeString | `dllm_capi_tokenize_json` (C side prepends BOS per `vocab.add_bos`) |
+| Cancel | `dllm_capi_cancel`, exposed as the `grpc.Cancellable` capability (`pkg/grpc/interface.go`): the gRPC server arms it via `context.AfterFunc` on the Predict/PredictStream context, so client disconnects/timeouts abort the in-flight generate - llama.cpp `IsCancelled()` parity for Go backends |
+
+`n_threads` and `ctx_len` are accepted-but-ignored by the engine at the
+current pin (the context bound comes from GGUF `n_ctx_train`); they are sent
+for forward compatibility.
+
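The Cancel row above describes arming the cancel hook with `context.AfterFunc` on the request context. A minimal sketch of that pattern, assuming a stand-in engine: `fakeEngine`, `predict`, and `runWithTimeout` are hypothetical names, and an `atomic.Bool` plays the role of the flag that the real cancel call flips.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// fakeEngine is a hypothetical stand-in for a loaded model.
type fakeEngine struct{ cancelled atomic.Bool }

// Cancel only flips the flag, so it is safe from any goroutine.
func (e *fakeEngine) Cancel() { e.cancelled.Store(true) }

// Generate pretends to run denoise steps, polling the flag between them.
// Like the engine described above, it clears the flag on entry.
func (e *fakeEngine) Generate(steps int) (done int, aborted bool) {
	e.cancelled.Store(false)
	for i := 0; i < steps; i++ {
		if e.cancelled.Load() {
			return i, true
		}
		time.Sleep(time.Millisecond)
	}
	return steps, false
}

// predict arms Cancel on ctx for one generate and de-registers the hook
// on normal completion.
func predict(ctx context.Context, e *fakeEngine, steps int) (int, bool) {
	stop := context.AfterFunc(ctx, e.Cancel)
	defer stop()
	return e.Generate(steps)
}

// runWithTimeout simulates a client that gives up after timeoutMs.
func runWithTimeout(e *fakeEngine, timeoutMs, steps int) (int, bool) {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	return predict(ctx, e, steps)
}

func main() {
	e := &fakeEngine{}
	done, aborted := runWithTimeout(e, 50, 5000)
	fmt.Printf("stopped after %d of 5000 steps, aborted=%v\n", done, aborted)
}
```

Since `Generate` clears the flag on entry, a context that is already cancelled before `predict` runs can have its cancel overwritten. That is the same race the cancel-granularity limitation describes.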
+## Renderer / parser (the templated chat path)
+
+With `use_tokenizer_template` + raw Messages, the backend owns templating and
+parsing (the ds4 precedent, but in Go):
+
+- `gemma4_renderer.go` - `RenderGemma4(msgs, toolsJSON, enableThinking,
+  addGenerationPrompt)`. The file embeds the FULL `tokenizer.chat_template`
+  jinja (17466 bytes, md5 `8c34cf93c7a7815b3fdb300a009c4c17`) extracted
+  verbatim from `diffusiongemma-26B-A4B-it-BF16.gguf` via gguf-py - e.g.
+  `python scripts/dump_gguf.py model.gguf | grep -A400 chat_template` in the
+  dllm.cpp checkout - as a numbered comment block; every Go rule cites its
+  "tpl L<n>" line. Re-verify the md5 before blaming the renderer for a
+  mismatch with a new GGUF. **BOS exception**: the template emits
+  `{{- bos_token -}}` but the renderer deliberately does NOT - dllm.cpp's
+  `run_generate` tokenizes with `prepend_bos = vocab.add_bos` (true for
+  gemma4), so a literal `<bos>` would double it.
+- `gemma4_parser.go` - streaming state machine turning raw model text
+  (fragments can split anywhere, including mid-marker) into ChatDeltas:
+  thought channels → `reasoning_content`, `<|tool_call>call:name{...}` →
+  ToolCallDelta, `<turn|>` → done. Marker grammar cross-checked against vLLM
+  PR #45163's gemma4 tool/reasoning parsers. Malformed payloads are re-emitted
+  raw as content, never dropped.
+- Thinking is **opt-in** for this family (`Metadata["enable_thinking"]`,
+  default OFF - the inverse of ds4): the template gates every thinking branch
+  on `enable_thinking`, and the no-thinking render pre-closes an empty thought
+  channel, so the parser always starts in content state.
+- **UTF-8 boundary holdback** (`splitValidUTF8` in `dllm.go`): per-block
+  detokenization can split a multi-byte character across block boundaries, and
+  grpc-go refuses to marshal invalid UTF-8 in proto3 strings. An incomplete
+  trailing sequence (at most 3 bytes) is carried into the next block; genuinely
+  undecodable bytes become U+FFFD.
+
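The UTF-8 boundary holdback above can be illustrated with the standard library alone. This is a sketch under assumptions, not the backend's `splitValidUTF8`: the names `splitHoldback` and `feed` are hypothetical, and the real function's signature may differ.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// splitHoldback returns the text that is safe to emit and the trailing
// bytes to carry into the next block. Only an incomplete multi-byte
// sequence at the very end is held back; invalid bytes elsewhere are
// replaced with U+FFFD.
func splitHoldback(buf []byte) (emit string, carry []byte) {
	cut := len(buf)
	// An incomplete sequence is at most 3 bytes, so scan only that far.
	for i := len(buf) - 1; i >= 0 && i >= len(buf)-3; i-- {
		if utf8.RuneStart(buf[i]) {
			if !utf8.FullRune(buf[i:]) {
				cut = i // lead byte present, continuation bytes missing
			}
			break
		}
	}
	return strings.ToValidUTF8(string(buf[:cut]), "\uFFFD"), buf[cut:]
}

// feed prepends the previous carry to the next block and splits again.
func feed(carry, block []byte) (string, []byte) {
	joined := append(append([]byte{}, carry...), block...)
	return splitHoldback(joined)
}

func main() {
	full := []byte("caff\u00e8 \U0001F642 ok")
	var carry []byte
	var out strings.Builder
	// Deliver the bytes in 3-byte blocks to force mid-character splits.
	for i := 0; i < len(full); i += 3 {
		end := i + 3
		if end > len(full) {
			end = len(full)
		}
		var emit string
		emit, carry = feed(carry, full[i:end])
		out.WriteString(emit)
	}
	fmt.Println(out.String() == string(full), len(carry))
}
```

Every emitted fragment is valid UTF-8 on its own, which is the property the gRPC marshalling step needs. A stream that ends with a non-empty carry would still need a final flush, which this sketch leaves out.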
+Without `use_tokenizer_template`, the prompt passes through verbatim and the
+output is NOT gemma4-parsed (plain content, like any non-autoparsing backend).
+
+## Tests
+
+| Layer | Gate | What |
+|---|---|---|
+| `backend/go/dllm/*_test.go` (renderer/parser/wiring) | none - run in plain `go test ./backend/go/dllm/...` | Ginkgo specs over a fake `generator` seam; canonical renderer fixtures from transformers' `test_modeling_diffusion_gemma.py`, parser tables from the vLLM gemma4 parsers |
+| `backend/go/dllm/dllm_test.go` C-ABI smoke | `DLLM_TEST_LIBRARY` + `DLLM_TEST_TINY_MODEL` (dllm.cpp's `tests/fixtures/tiny_with_vocab.gguf`); Skips when unset | Drives the real `libdllm.so`: ABI check, load, tokenize `[2,18]`, deterministic generate, cancel (incl. mid-stream `Dllm.Cancel` aborting a deliberately slow `eb_max_steps:256` run in ~10ms) |
+| `tests/e2e-backends/dllm_test.go` | `BACKEND_TEST_DLLM=1` + `BACKEND_BINARY` (packaged run.sh) + `BACKEND_TEST_MODEL_FILE` (tiny fixture) | Templated chat round trip (Messages + UseTokenizerTemplate) over the real gRPC binary, non-streaming + streaming; plus client-context cancellation mid-stream (proves the `Cancellable` server plumbing end to end) |
+| Real-model e2e | `BACKEND_TEST_DLLM_REAL_MODEL_FILE` (26B BF16, ~50 GB) + `BACKEND_TEST_DLLM_REAL_GPU_LAYERS` | CUDA-13-class hardware only |
+
+Tool-call e2e is deliberately absent from the tiny-model spec: the fixture has
+random weights and cannot be coaxed into emitting tool markup; the unit tables
+carry that coverage.
+
+## Build matrix
+
+`cpu-dllm` (amd64 + arm64), `cuda13-dllm` (amd64), and
+`cuda13-nvidia-l4t-arm64-dllm` (arm64 CUDA: Jetson / DGX Spark GB10), via
+`.github/backend-matrix.yml`. No darwin/Metal. CUDA builds forward
+`-DDLLM_CUDA=ON` (dllm.cpp gates ggml's CUDA behind its own flag - a bare
+`-DGGML_CUDA=ON` is overridden by the cache FORCE). `libdllm.so` is
+self-contained (ggml statically absorbed, PIC), so `package.sh` only ships
+the binary, `run.sh` and that one .so (the parakeet-cpp-style stub layout;
+no ldd walk yet).
+
+## Known limitations
+
+- **Cancel granularity**: the C-ABI cancel flag is per-ctx and resets on
+  every generate entry, so a Cancel racing a NEW generate can be lost, and
+  with requests queued on the worker it aborts whichever generate is
+  currently running (acceptable: the server de-registers the hook on normal
+  completion, one process serves one model).
+- **Throughput**: ~0.15 tok/s on the 26B at default settings (GB10) - every
+  denoise step recomputes the full prompt+canvas. The upstream prefix-KV
+  cache (dllm.cpp P3) is the fix; `kv_cache:on` errors until it lands
+  (`auto`/`off` are accepted no-ops).
+- **Repo privacy**: see the note at the top - CI clone of dllm.cpp needs the
+  repo published (or credentials) before the backend images can build.
+- Engine spec/validation references: dllm.cpp `docs/validation.md` and
+  LocalAI `docs/superpowers/specs/2026-06-10-dllm-cpp-design.md`.
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -703,19 +703,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
-  - build-type: 'cublas'
-    cuda-major-version: "12"
-    cuda-minor-version: "8"
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-nvidia-cuda-12-locate-anything-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "locate-anything-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "8"
@@ -1556,19 +1543,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
-  - build-type: 'cublas'
-    cuda-major-version: "13"
-    cuda-minor-version: "0"
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-nvidia-cuda-13-locate-anything-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "locate-anything-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -1595,19 +1569,6 @@ include:
    backend: "rfdetr-cpp"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
-  - build-type: 'cublas'
-    cuda-major-version: "13"
-    cuda-minor-version: "0"
-    platforms: 'linux/arm64'
-    skip-drivers: 'false'
-    tag-latest: 'auto'
-    tag-suffix: '-nvidia-l4t-cuda-13-arm64-locate-anything-cpp'
-    base-image: "ubuntu:24.04"
-    ubuntu-version: '2404'
-    runs-on: 'ubuntu-24.04-arm'
-    backend: "locate-anything-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -1647,6 +1608,19 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-13-dllm'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "dllm"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -1686,6 +1660,19 @@ include:
    backend: "parakeet-cpp"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/arm64'
+    skip-drivers: 'false'
+    tag-latest: 'auto'
+    tag-suffix: '-nvidia-l4t-cuda-13-arm64-dllm'
+    base-image: "ubuntu:24.04"
+    ubuntu-version: '2404'
+    runs-on: 'ubuntu-24.04-arm'
+    backend: "dllm"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -2845,74 +2832,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
-  # locate-anything-cpp
-  - build-type: ''
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-cpu-locate-anything-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "locate-anything-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: 'sycl_f32'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-intel-sycl-f32-locate-anything-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
-    skip-drivers: 'false'
-    backend: "locate-anything-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: 'sycl_f16'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-intel-sycl-f16-locate-anything-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
-    skip-drivers: 'false'
-    backend: "locate-anything-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: 'vulkan'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/amd64'
-    platform-tag: 'amd64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-vulkan-locate-anything-cpp'
-    runs-on: 'ubuntu-latest'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "locate-anything-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
-  - build-type: 'vulkan'
-    cuda-major-version: ""
-    cuda-minor-version: ""
-    platforms: 'linux/arm64'
-    platform-tag: 'arm64'
-    tag-latest: 'auto'
-    tag-suffix: '-gpu-vulkan-locate-anything-cpp'
-    runs-on: 'ubuntu-24.04-arm'
-    base-image: "ubuntu:24.04"
-    skip-drivers: 'false'
-    backend: "locate-anything-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2404'
  - build-type: 'sycl_f32'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -3006,19 +2925,6 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2204'
-  - build-type: 'cublas'
-    cuda-major-version: "12"
-    cuda-minor-version: "0"
-    platforms: 'linux/arm64'
-    skip-drivers: 'false'
-    tag-latest: 'auto'
-    tag-suffix: '-nvidia-l4t-arm64-locate-anything-cpp'
-    base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
-    runs-on: 'ubuntu-24.04-arm'
-    backend: "locate-anything-cpp"
-    dockerfile: "./backend/Dockerfile.golang"
-    context: "./"
-    ubuntu-version: '2204'
  # whisper
  - build-type: ''
    cuda-major-version: ""
@@ -3265,6 +3171,35 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  # dllm
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    platform-tag: 'amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-dllm'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "dllm"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/arm64'
+    platform-tag: 'arm64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-dllm'
+    runs-on: 'ubuntu-24.04-arm'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "dllm"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: 'sycl_f32'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -4461,10 +4396,6 @@ includeDarwin:
    tag-suffix: "-metal-darwin-arm64-silero-vad"
    build-type: "metal"
    lang: "go"
-  - backend: "sherpa-onnx"
-    tag-suffix: "-metal-darwin-arm64-sherpa-onnx"
-    build-type: "metal"
-    lang: "go"
  - backend: "local-store"
    tag-suffix: "-metal-darwin-arm64-local-store"
    build-type: "metal"
@@ -4472,6 +4403,3 @@ includeDarwin:
  - backend: "llama-cpp-quantization"
    tag-suffix: "-metal-darwin-arm64-llama-cpp-quantization"
    build-type: "mps"
-  - backend: "speaker-recognition"
-    tag-suffix: "-metal-darwin-arm64-speaker-recognition"
-    build-type: "mps"
--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -38,6 +38,10 @@ jobs:
            variable: "PARAKEET_VERSION"
            branch: "master"
            file: "backend/go/parakeet-cpp/Makefile"
+          - repository: "mudler/dllm.cpp"
+            variable: "DLLM_VERSION"
+            branch: "main"
+            file: "backend/go/dllm/Makefile"
          - repository: "leejet/stable-diffusion.cpp"
            variable: "STABLEDIFFUSION_GGML_VERSION"
            branch: "master"
@@ -62,10 +66,6 @@ jobs:
            variable: "RFDETR_VERSION"
            branch: "main"
            file: "backend/go/rfdetr-cpp/Makefile"
-          - repository: "mudler/locate-anything.cpp"
-            variable: "LOCATEANYTHING_VERSION"
-            branch: "master"
-            file: "backend/go/locate-anything-cpp/Makefile"
          - repository: "predict-woo/qwen3-tts.cpp"
            variable: "QWEN3TTS_CPP_VERSION"
            branch: "main"
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
@@ -38,7 +38,6 @@ jobs:
      acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
      qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
      rfdetr-cpp: ${{ steps.detect.outputs.rfdetr-cpp }}
-      locate-anything-cpp: ${{ steps.detect.outputs.locate-anything-cpp }}
      vibevoice-cpp: ${{ steps.detect.outputs.vibevoice-cpp }}
      localvqe: ${{ steps.detect.outputs.localvqe }}
      voxtral: ${{ steps.detect.outputs.voxtral }}
@@ -564,7 +563,7 @@ jobs:
      - name: Run e2e-backends smoke
        env:
          BACKEND_IMAGE: quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp
-          BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias,tokenize
+          BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias
        run: |
          make test-extra-backend
  # Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked LLM.
@@ -902,45 +901,6 @@ jobs:
      - name: Test rfdetr-cpp
        run: |
          make --jobs=5 --output-sync=target -C backend/go/rfdetr-cpp test
-  # Per-backend e2e for locate-anything-cpp: builds the .so + Go binary and
-  # runs `make -C backend/go/locate-anything-cpp test`. test.sh fetches the
-  # locate-anything-q8_0 GGUF (~6.3 GB, NVIDIA LocateAnything-3B) from the
-  # published mudler/locate-anything.cpp-gguf HF repo + a COCO image, then the
-  # Go wire test loads the model and runs an open-vocabulary Detect, asserting
-  # at least one labeled box. Heavier than the other Go backends (it is a 3B),
-  # so it is gated to changes under backend/go/locate-anything-cpp/.
-  tests-locate-anything-cpp:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.locate-anything-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y build-essential cmake curl libopenblas-dev
-      - name: Setup Go
-        uses: actions/setup-go@v5
-      - name: Display Go version
-        run: go version
-      - name: Proto Dependencies
-        run: |
-          # Install protoc
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Build locate-anything-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/locate-anything-cpp
-      - name: Test locate-anything-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/locate-anything-cpp test
  # Per-backend smoke for vibevoice-cpp: builds the .so + Go binary and
  # runs `make -C backend/go/vibevoice-cpp test`. test.sh auto-downloads
  # the published mudler/vibevoice.cpp-models bundle (TTS Q8_0 + ASR Q4_K
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -26,6 +26,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 | [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
 | [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
 | [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
+| [.agents/dllm-backend.md](.agents/dllm-backend.md) | Working on the dllm backend (DiffusionGemma block-diffusion) - purego C-ABI binding, per-ctx serialization contract, gemma4 renderer/parser, gated test layers |
 | [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
 | [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
 | [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
--- a/1
+++ b/1
@@ -108,7 +108,6 @@ RUN <<EOT bash
        apt-get update && \
        apt-get install -y --no-install-recommends \
            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
-            cuda-nvrtc-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
--- a/8
+++ b/8
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/dllm backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio

 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -566,7 +566,6 @@ prepare-test-extra: protogen-python
 	$(MAKE) -C backend/python/speaker-recognition
 	$(MAKE) -C backend/rust/kokoros kokoros-grpc
 	$(MAKE) -C backend/go/rfdetr-cpp
-	$(MAKE) -C backend/go/locate-anything-cpp

 test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/transformers test
@@ -594,7 +593,6 @@ test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/speaker-recognition test
 	$(MAKE) -C backend/rust/kokoros test
 	$(MAKE) -C backend/go/rfdetr-cpp test
-	$(MAKE) -C backend/go/locate-anything-cpp test

 ##
 ## End-to-end gRPC tests that exercise a built backend container image.
@@ -1173,6 +1171,9 @@ BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|tr
 BACKEND_WHISPER = whisper|golang|.|false|true
 BACKEND_CRISPASR = crispasr|golang|.|false|true
 BACKEND_PARAKEET_CPP = parakeet-cpp|golang|.|false|true
+# dllm is mudler/dllm.cpp, the DiffusionGemma block-diffusion engine,
+# wrapped by the purego backend at backend/go/dllm.
+BACKEND_DLLM = dllm|golang|.|false|true
 BACKEND_VOXTRAL = voxtral|golang|.|false|true
 BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
 BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
@@ -1262,6 +1263,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_STABLEDIFFUSION_GGML)))
 $(eval $(call generate-docker-build-target,$(BACKEND_WHISPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_CRISPASR)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PARAKEET_CPP)))
+$(eval $(call generate-docker-build-target,$(BACKEND_DLLM)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VOXTRAL)))
 $(eval $(call generate-docker-build-target,$(BACKEND_OPUS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_RERANKERS)))
--- a/backend/Dockerfile.golang
+++ b/backend/Dockerfile.golang
@@ -206,16 +206,6 @@ RUN if [ "${BACKEND}" = "opus" ]; then \
    apt-get clean && rm -rf /var/lib/apt/lists/*; \
 fi

-# CrispASR's piper TTS backend dlopens libespeak-ng at runtime to phonemize
-# non-English text (the MIT-clean path; English uses a built-in G2P). Install
-# the espeak-ng runtime + its libpcaudio/libsonic deps + voice data so
-# package.sh can bundle them into the FROM scratch image.
-RUN if [ "${BACKEND}" = "crispasr" ]; then \
-    apt-get update && apt-get install -y --no-install-recommends \
-        espeak-ng-data libespeak-ng1 libpcaudio0 libsonic0 && \
-    apt-get clean && rm -rf /var/lib/apt/lists/*; \
-fi
-
 COPY . /LocalAI

 RUN git config --global --add safe.directory /LocalAI
--- a/backend/Dockerfile.python
+++ b/backend/Dockerfile.python
@@ -126,7 +126,6 @@ RUN <<EOT bash
        apt-get update && \
        apt-get install -y --no-install-recommends \
            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
-            cuda-nvrtc-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
--- a/backend/cpp/ds4/Makefile
+++ b/backend/cpp/ds4/Makefile
@@ -1,10 +1,10 @@
 # ds4 backend Makefile.
 #
-# Upstream pin lives below as DS4_VERSION?=d881f2a05e8ff6bec001315a36b794b4aa310173
+# Upstream pin lives below as DS4_VERSION?=8384adf0f9fa0f3bb342dd925372de778b95b263
 # (.github/bump_deps.sh) can find and update it - matches the
 # llama-cpp / ik-llama-cpp / turboquant convention.

-DS4_VERSION?=d881f2a05e8ff6bec001315a36b794b4aa310173
+DS4_VERSION?=8384adf0f9fa0f3bb342dd925372de778b95b263
 DS4_REPO?=https://github.com/antirez/ds4

 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=4c6595503fe45d5a39f88d194e270f64c7424677
+LLAMA_VERSION?=039e20a2db9e87b2477c76cc04905f3e1acad77f
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -3486,7 +3486,7 @@ public:
        if (body.count("prompt") != 0) {
            const bool add_special = json_value(body, "add_special", false);

-            llama_tokens tokens = tokenize_mixed(ctx_server.impl->vocab, body.at("prompt"), add_special, true);
+            llama_tokens tokens = tokenize_mixed(ctx_server.impl->vocab, body.at("content"), add_special, true);


            for (const auto& token : tokens) {
--- a/backend/go/crispasr/Makefile
+++ b/backend/go/crispasr/Makefile
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # CrispASR version (release tag)
 CRISPASR_REPO?=https://github.com/CrispStrobe/CrispASR
-CRISPASR_VERSION?=d745bda4386ae0f9d1d2f23fff8ec95d76428221
+CRISPASR_VERSION?=c29f6653a516a3001d923944dad8892072cc7334
 SO_TARGET?=libgocrispasr.so

 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
--- a/backend/go/crispasr/gocrispasr.go
+++ b/backend/go/crispasr/gocrispasr.go
@@ -11,7 +11,6 @@ import (

 	"github.com/go-audio/audio"
 	"github.com/go-audio/wav"
-	gguf "github.com/gpustack/gguf-parser-go"
 	"github.com/mudler/LocalAI/pkg/grpc/base"
 	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 	"github.com/mudler/LocalAI/pkg/utils"
@@ -38,39 +37,6 @@ var (

 type CrispASR struct {
 	base.SingleThread
-	// sampleRate is the output rate (Hz) of the loaded TTS engine's PCM, used to
-	// write a correct WAV header. Most CrispASR TTS backends emit 24 kHz, but
-	// piper returns its model's native rate (16 kHz for x_low/low voices,
-	// 22.05 kHz for medium/high), so it is read from the GGUF metadata at Load.
-	sampleRate int
-}
-
-// defaultTTSSampleRate is the output rate assumed for CrispASR TTS engines that
-// don't advertise one in GGUF metadata (vibevoice/orpheus/chatterbox/qwen3-tts
-// all emit 24 kHz). piper is the exception and carries piper.sample_rate.
-const defaultTTSSampleRate = 24000
-
-// piperSampleRate reads the piper.sample_rate metadata key from a GGUF model.
-// CrispASR's piper backend returns PCM at the model's native rate without
-// resampling, so the WAV header must match it. Returns ok=false for non-piper
-// models (key absent) or an unreadable file, letting the caller fall back to
-// defaultTTSSampleRate.
-func piperSampleRate(modelPath string) (int, bool) {
-	// Only scalar architecture keys are read, so skip the large array metadata
-	// (phoneme map) and mmap the header - same rationale as pkg/vram's reader.
-	f, err := gguf.ParseGGUFFile(modelPath, gguf.UseMMap(), gguf.SkipLargeMetadata())
-	if err != nil {
-		return 0, false
-	}
-	kv, ok := f.Header.MetadataKV.Get("piper.sample_rate")
-	if !ok || kv.ValueType != gguf.GGUFMetadataValueTypeUint32 {
-		return 0, false
-	}
-	rate := int(kv.ValueUint32())
-	if rate <= 0 {
-		return 0, false
-	}
-	return rate, true
 }

 // splitOption splits a "prefix:value" model option into its key and value,
@@ -137,14 +103,6 @@ func (w *CrispASR) Load(opts *pb.ModelOptions) error {
 		return fmt.Errorf("Failed to load CrispASR transcription model")
 	}

-	// Determine the TTS output sample rate for the WAV header. piper voices
-	// carry their native rate in GGUF metadata and CrispASR does not resample;
-	// every other engine emits the 24 kHz default.
-	w.sampleRate = defaultTTSSampleRate
-	if rate, ok := piperSampleRate(opts.ModelFile); ok {
-		w.sampleRate = rate
-	}
-
 	// Load the companion file (codec/tokenizer/s3gen) after the session is open.
 	// rc==0 means success or "not applicable" for the active backend; only a
 	// negative code is fatal.
@@ -432,7 +390,7 @@ func (w *CrispASR) synthesize(text string) ([]float32, error) {
 	}
 	defer CppTTSFree(ptr)
 	src := unsafe.Slice((*float32)(unsafe.Pointer(ptr)), int(n)) //nolint:govet // ptr addresses C-allocated PCM returned across the purego boundary; copied out immediately below, before tts_free.
-	out := make([]float32, int(n))                               // copy out of C memory before free
+	out := make([]float32, int(n)) // copy out of C memory before free
 	copy(out, src)
 	return out, nil
 }
@@ -459,7 +417,7 @@ func (w *CrispASR) TTS(req *pb.TTSRequest) error {
 	if err != nil {
 		return err
 	}
-	return writeWAV(req.Dst, pcm, w.sampleRate)
+	return writeWAV24k(req.Dst, pcm)
 }

 // TTSStream is the streaming counterpart to TTS. CrispASR has no progressive
@@ -489,7 +447,7 @@ func (w *CrispASR) TTSStream(req *pb.TTSRequest, results chan []byte) error {
 	}
 	defer func() { _ = os.Remove(dst) }()

-	if err := writeWAV(dst, pcm, w.sampleRate); err != nil {
+	if err := writeWAV24k(dst, pcm); err != nil {
 		return err
 	}

@@ -501,14 +459,14 @@ func (w *CrispASR) TTSStream(req *pb.TTSRequest, results chan []byte) error {
 	return nil
 }

-// writeWAV writes pcm as a sampleRate Hz, mono, 16-bit PCM WAV at dst.
-func writeWAV(dst string, pcm []float32, sampleRate int) error {
+// writeWAV24k writes pcm as a 24000 Hz, mono, 16-bit PCM WAV at dst.
+func writeWAV24k(dst string, pcm []float32) error {
 	f, err := os.Create(dst)
 	if err != nil {
 		return fmt.Errorf("crispasr: create %q: %w", dst, err)
 	}

-	enc := wav.NewEncoder(f, sampleRate, 16, 1, 1)
+	enc := wav.NewEncoder(f, 24000, 16, 1, 1)
 	ints := make([]int, len(pcm))
 	for i, s := range pcm {
 		if s > 1 {
@@ -519,7 +477,7 @@ func writeWAV(dst string, pcm []float32, sampleRate int) error {
 		ints[i] = int(s * 32767)
 	}
 	buf := &audio.IntBuffer{
-		Format:         &audio.Format{NumChannels: 1, SampleRate: sampleRate},
+		Format:         &audio.Format{NumChannels: 1, SampleRate: 24000},
 		Data:           ints,
 		SourceBitDepth: 16,
 	}
--- a/backend/go/crispasr/gocrispasr_samplerate_test.go
+++ b/backend/go/crispasr/gocrispasr_samplerate_test.go
@@ -1,164 +0,0 @@
-package main
-
-import (
-	"bytes"
-	"encoding/binary"
-	"os"
-	"path/filepath"
-
-	"github.com/go-audio/wav"
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-// GGUF metadata value type tags (subset) from the GGUF spec.
-const (
-	ggufTypeUint32 uint32 = 4
-	ggufTypeString uint32 = 8
-)
-
-type ggufKV struct {
-	key   string
-	vtype uint32
-	val   any
-}
-
-// writeMinimalGGUF emits a valid, tensor-less GGUF file carrying only the given
-// metadata key-values. Enough for the header-only parse path piperSampleRate
-// uses; avoids pulling a real multi-MB voice into the test.
-func writeMinimalGGUF(path string, kvs []ggufKV) error {
-	var b bytes.Buffer
-	b.WriteString("GGUF")                                // magic
-	_ = binary.Write(&b, binary.LittleEndian, uint32(3)) // version
-	_ = binary.Write(&b, binary.LittleEndian, uint64(0)) // tensor count
-	_ = binary.Write(&b, binary.LittleEndian, uint64(len(kvs)))
-	for _, kv := range kvs {
-		_ = binary.Write(&b, binary.LittleEndian, uint64(len(kv.key)))
-		b.WriteString(kv.key)
-		_ = binary.Write(&b, binary.LittleEndian, kv.vtype)
-		switch v := kv.val.(type) {
-		case uint32:
-			_ = binary.Write(&b, binary.LittleEndian, v)
-		case string:
-			_ = binary.Write(&b, binary.LittleEndian, uint64(len(v)))
-			b.WriteString(v)
-		}
-	}
-	return os.WriteFile(path, b.Bytes(), 0o644)
-}
-
-// wavSampleRate decodes the WAV header at path and returns its sample rate.
-func wavSampleRate(path string) (int, error) {
-	f, err := os.Open(path)
-	if err != nil {
-		return 0, err
-	}
-	defer func() { _ = f.Close() }()
-	dec := wav.NewDecoder(f)
-	dec.ReadInfo()
-	return int(dec.SampleRate), nil
-}
-
-var _ = Describe("piper sample rate", func() {
-	Context("piperSampleRate", func() {
-		It("reads piper.sample_rate from a piper GGUF (medium = 22050)", func() {
-			p := filepath.Join(GinkgoT().TempDir(), "voice.gguf")
-			Expect(writeMinimalGGUF(p, []ggufKV{
-				{key: "general.architecture", vtype: ggufTypeString, val: "piper"},
-				{key: "piper.sample_rate", vtype: ggufTypeUint32, val: uint32(22050)},
-			})).To(Succeed())
-
-			rate, ok := piperSampleRate(p)
-			Expect(ok).To(BeTrue(), "piper.sample_rate should be found")
-			Expect(rate).To(Equal(22050))
-		})
-
-		It("reads the low-quality rate (16000)", func() {
-			p := filepath.Join(GinkgoT().TempDir(), "voice.gguf")
-			Expect(writeMinimalGGUF(p, []ggufKV{
-				{key: "piper.sample_rate", vtype: ggufTypeUint32, val: uint32(16000)},
-			})).To(Succeed())
-
-			rate, ok := piperSampleRate(p)
-			Expect(ok).To(BeTrue())
-			Expect(rate).To(Equal(16000))
-		})
-
-		It("returns ok=false for a non-piper GGUF (no piper.sample_rate key)", func() {
-			p := filepath.Join(GinkgoT().TempDir(), "other.gguf")
-			Expect(writeMinimalGGUF(p, []ggufKV{
-				{key: "general.architecture", vtype: ggufTypeString, val: "vibevoice"},
-			})).To(Succeed())
-
-			_, ok := piperSampleRate(p)
-			Expect(ok).To(BeFalse())
-		})
-
-		It("returns ok=false for an unreadable/non-GGUF file", func() {
-			p := filepath.Join(GinkgoT().TempDir(), "garbage.gguf")
-			Expect(os.WriteFile(p, []byte("not a gguf"), 0o644)).To(Succeed())
-
-			_, ok := piperSampleRate(p)
-			Expect(ok).To(BeFalse())
-		})
-	})
-
-	// End-to-end through the built .so. Gated on CRISPASR_PIPER_MODEL_PATH (a
-	// real piper voice GGUF) like the other model-backed specs; never runs in
-	// default CI. Proves CrispASR's piper backend output rate flows into the
-	// WAV header instead of the hardcoded 24 kHz default.
-	Context("piper TTS end-to-end", func() {
-		It("writes the WAV at the model's native piper.sample_rate", func() {
-			model := os.Getenv("CRISPASR_PIPER_MODEL_PATH")
-			if model == "" {
-				Skip("set CRISPASR_PIPER_MODEL_PATH to run the piper e2e spec")
-			}
-			ensureLibLoaded()
-
-			expected, ok := piperSampleRate(model)
-			Expect(ok).To(BeTrue(), "model should carry piper.sample_rate metadata")
-
-			w := &CrispASR{}
-			Expect(w.Load(&pb.ModelOptions{
-				ModelFile: model,
-				Options:   []string{"backend:piper"},
-				Threads:   4,
-			})).To(Succeed())
-
-			dst := filepath.Join(GinkgoT().TempDir(), "piper.wav")
-			Expect(w.TTS(&pb.TTSRequest{Text: "Hello from CrispASR piper.", Dst: dst})).To(Succeed())
-
-			info, err := os.Stat(dst)
-			Expect(err).ToNot(HaveOccurred())
-			Expect(info.Size()).To(BeNumerically(">", 1024), "expected a non-trivial WAV")
-
-			rate, err := wavSampleRate(dst)
-			Expect(err).ToNot(HaveOccurred())
-			Expect(rate).To(Equal(expected),
-				"WAV header rate must equal the model's native piper.sample_rate, not the 24k default")
-		})
-	})
-
-	Context("writeWAV", func() {
-		It("writes the WAV header at the given sample rate (22050 for piper, not the 24k default)", func() {
-			dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
-			pcm := make([]float32, 220) // 10 ms of silence is enough for a header
-			Expect(writeWAV(dst, pcm, 22050)).To(Succeed())
-
-			rate, err := wavSampleRate(dst)
-			Expect(err).ToNot(HaveOccurred())
-			Expect(rate).To(Equal(22050))
-		})
-
-		It("writes a 16000 Hz header for low-quality piper voices", func() {
-			dst := filepath.Join(GinkgoT().TempDir(), "out.wav")
-			pcm := make([]float32, 160)
-			Expect(writeWAV(dst, pcm, 16000)).To(Succeed())
-
-			rate, err := wavSampleRate(dst)
-			Expect(err).ToNot(HaveOccurred())
-			Expect(rate).To(Equal(16000))
-		})
-	})
-})
--- a/backend/go/crispasr/package.sh
+++ b/backend/go/crispasr/package.sh
@@ -51,32 +51,6 @@ else
    exit 1
 fi

-# Bundle espeak-ng (+ its libpcaudio/libsonic runtime deps) and its voice data so
-# the piper TTS backend can phonemize non-English text. CrispASR dlopens
-# libespeak-ng.so.1 at runtime (the MIT-clean path); the dlopen succeeds loading
-# libespeak-ng but FAILS if libpcaudio/libsonic are absent, so all three .so are
-# required. run.sh points CRISPASR_ESPEAK_DATA_PATH at the bundled data dir.
-# Best-effort: only copied when present, so a local dev build without espeak-ng
-# installed still packages the rest (English voices keep working).
-ESPEAK_LIBDIR=""
-for d in /usr/lib/x86_64-linux-gnu /usr/lib/aarch64-linux-gnu; do
-    if [ -f "$d/libespeak-ng.so.1" ]; then
-        ESPEAK_LIBDIR="$d"
-        break
-    fi
-done
-if [ -n "$ESPEAK_LIBDIR" ]; then
-    echo "Bundling espeak-ng from $ESPEAK_LIBDIR ..."
-    cp -arfLv "$ESPEAK_LIBDIR/libespeak-ng.so.1" $CURDIR/package/lib/
-    cp -arfLv "$ESPEAK_LIBDIR/libpcaudio.so.0" $CURDIR/package/lib/
-    cp -arfLv "$ESPEAK_LIBDIR/libsonic.so.0" $CURDIR/package/lib/
-    if [ -d "$ESPEAK_LIBDIR/espeak-ng-data" ]; then
-        cp -arfLv "$ESPEAK_LIBDIR/espeak-ng-data" $CURDIR/package/
-    fi
-else
-    echo "espeak-ng not found; non-English piper voices will not phonemize"
-fi
-
 # Package GPU libraries based on BUILD_TYPE
 # The GPU library packaging script will detect BUILD_TYPE and copy appropriate GPU libraries
 GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
--- a/backend/go/crispasr/run.sh
+++ b/backend/go/crispasr/run.sh
@@ -41,11 +41,6 @@ fi
 export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 export CRISPASR_LIBRARY=$LIBRARY

-# Point piper's espeak-ng phonemizer at the bundled voice data. The variable
-# names the directory CONTAINING espeak-ng-data (package.sh drops it next to
-# this script). Harmless when espeak-ng wasn't bundled.
-export CRISPASR_ESPEAK_DATA_PATH=$CURDIR
-
 # If there is a lib/ld.so, use it
 if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
--- a/backend/go/dllm/.gitignore
+++ b/backend/go/dllm/.gitignore
@@ -0,0 +1,10 @@
+.cache/
+sources/
+build/
+package/
+dllm-grpc
+# build artifacts staged in-tree by the Makefile (cp from sources/) or
+# symlinked for local dev; the real sources live in dllm.cpp upstream.
+*.so
+*.so.*
+compile_commands.json
--- a/backend/go/dllm/Makefile
+++ b/backend/go/dllm/Makefile
@@ -0,0 +1,97 @@
+# dllm backend Makefile.
+#
+# Upstream pin lives below as DLLM_VERSION?=<sha> so .github/bump_deps.sh
+# can find and update it - matches the whisper.cpp / parakeet-cpp / ds4
+# convention.
+#
+# Local dev shortcut: if you already have an out-of-tree dllm.cpp build,
+# you can symlink the .so into this directory and skip the clone/cmake
+# steps entirely, e.g.:
+#
+#   ln -sf /path/to/dllm.cpp/build/libdllm.so .
+#   go build -o dllm-grpc .
+#
+# That's what the gated C-ABI binding smoke uses (DLLM_TEST_LIBRARY). The
+# default target below does the proper clone-at-pin + cmake build so CI
+# doesn't need a side-checkout.
+#
+# NOTE: github.com/mudler/dllm.cpp is still private (publishing is planned);
+# until then the anonymous clone below fails. Use the symlink shortcut above
+# with a local checkout, or a git credential helper with access to the repo.
+
+# The pin below is the first commit carrying the multimodal C-ABI entry
+# points (dllm_capi_generate_mm / dllm_capi_generate_stream_mm) the
+# image-input path probes for; older libs still load, but image requests
+# then fail with "library predates the multimodal entry points".
+DLLM_VERSION?=e6dcf44cddd65845e3a0814a1c2282a5d90ee98a
+DLLM_REPO?=https://github.com/mudler/dllm.cpp
+
+GOCMD?=go
+GO_TAGS?=
+JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
+
+BUILD_TYPE?=
+NATIVE?=false
+
+# libdllm.so is self-contained: dllm.cpp's CMakeLists statically absorbs ggml
+# (BUILD_SHARED_LIBS=OFF + PIC) into the shared lib, so dlopen needs no
+# libggml*.so alongside it, only system libs (libstdc++/libgomp/libc) the
+# runtime image already provides. Tests/CLI are upstream-only concerns.
+CMAKE_ARGS?=-DCMAKE_BUILD_TYPE=Release -DDLLM_BUILD_TESTS=OFF
+
+ifeq ($(NATIVE),false)
+	CMAKE_ARGS+=-DGGML_NATIVE=OFF
+endif
+
+# Same arch set the sibling ggml backends (acestep/vibevoice/qwen3-tts) bake
+# for their cublas images; override for a native build.
+CUDA_ARCHITECTURES?=75-virtual;80-virtual;86-real;89-real
+
+# dllm.cpp gates CUDA behind DLLM_CUDA (set(GGML_CUDA ... CACHE FORCE)), so
+# forward that instead of a bare -DGGML_CUDA=ON.
+ifeq ($(BUILD_TYPE),cublas)
+	CMAKE_ARGS+=-DDLLM_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$(CUDA_ARCHITECTURES)"
+endif
+
+.PHONY: dllm-grpc package build clean purge test all
+
+all: dllm-grpc
+
+# Clone the upstream dllm.cpp source at the pinned commit (ggml comes in as
+# a submodule). Directory acts as the target so make only re-clones when
+# missing. After a DLLM_VERSION bump, run 'make purge && make' to refetch.
+sources/dllm.cpp:
+	mkdir -p sources/dllm.cpp
+	cd sources/dllm.cpp && \
+	git init -q && \
+	git remote add origin $(DLLM_REPO) && \
+	git fetch --depth 1 origin $(DLLM_VERSION) && \
+	git checkout FETCH_HEAD && \
+	git submodule update --init --recursive --depth 1 --single-branch
+
+# Build the shared lib out-of-tree, then stage it next to the Go sources so
+# purego.Dlopen("libdllm.so") and the packaging step both pick it up.
+libdllm.so: sources/dllm.cpp
+	cmake -B sources/dllm.cpp/build -S sources/dllm.cpp $(CMAKE_ARGS)
+	cmake --build sources/dllm.cpp/build --config Release -j$(JOBS)
+	cp -fv sources/dllm.cpp/build/libdllm.so ./
+
+dllm-grpc: libdllm.so main.go capi.go
+	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o dllm-grpc .
+
+package: dllm-grpc
+	bash package.sh
+
+build: package
+
+# Test target. The C-ABI binding smoke is gated on DLLM_TEST_LIBRARY +
+# DLLM_TEST_TINY_MODEL; without them the gated specs auto-skip and only the
+# pure-Go helper specs run.
+test:
+	LD_LIBRARY_PATH=$(CURDIR):$$LD_LIBRARY_PATH $(GOCMD) test ./... -count=1
+
+clean: purge
+	rm -rf libdllm.so* package dllm-grpc
+
+purge:
+	rm -rf sources/dllm.cpp
--- a/backend/go/dllm/capi.go
+++ b/backend/go/dllm/capi.go
@@ -0,0 +1,326 @@
+package main
+
+// Typed Go wrappers over dllm.cpp's flat C-ABI (include/dllm_capi.h, ABI v1).
+//
+// Contract highlights the wrappers encode (see the header + src/capi.cpp):
+//   - tokenize_json/generate return malloc'd char* the CALLER owns: bound as
+//     uintptr, copied with goStringFromCPtr, released via dllm_capi_free_string.
+//   - last_error returns a BORROWED pointer (valid until the next call on the
+//     same ctx): bound as a plain string (purego copies), never freed, and only
+//     read AFTER the failing call has returned - reading it while a generate is
+//     in flight on the same ctx violates the per-ctx serialization contract.
+//   - All entry points except dllm_capi_cancel must be externally serialized
+//     per ctx (one ctx = one concurrent generate/tokenize). Cancel only flips
+//     an atomic and may be called from any goroutine mid-generate.
+//   - No C++ exception crosses the boundary; failures land in last_error.
+
+import (
+	"encoding/json"
+	"errors"
+	"fmt"
+	"sync"
+	"sync/atomic"
+	"unsafe"
+
+	"github.com/ebitengine/purego"
+)
+
+// dllmABIVersion is the DLLM_CAPI_ABI_VERSION this binding was written
+// against; main.go refuses to start against a libdllm.so reporting another.
+const dllmABIVersion = 1
+
+// purego-bound entry points from libdllm.so. Names match dllm_capi.h
+// exactly; loadCAPI (main.go) fills these in at boot.
+var (
+	cppAbiVersion func() int32
+	cppLoad       func(ggufPath, paramsJSON string) uintptr
+	cppFree       func(ctx uintptr)
+	cppLastError  func(ctx uintptr) string // borrowed pointer: purego copies, do NOT free
+	cppFreeString func(s uintptr)
+	// malloc'd char* returns, hence uintptr (see loadCAPI's doc comment).
+	cppTokenizeJSON func(ctx uintptr, text string) uintptr
+	cppGenerate     func(ctx uintptr, prompt, optsJSON string) uintptr
+	// on_block/on_step are C function pointers produced by purego.NewCallback;
+	// userData carries the streamCallStates registry key.
+	cppGenerateStream func(ctx uintptr, prompt, optsJSON string, onBlock, onStep, userData uintptr) int32
+	cppCancel         func(ctx uintptr)
+)
+
+// Optional multimodal entry points (dllm_capi.h's P4 surface). The ABI
+// version stays 1: presence is detected by PROBING the symbols with Dlsym at
+// boot (loadCAPI, mirroring the parakeet-cpp optional-symbol pattern). nil
+// means the loaded libdllm.so predates the mm surface; the wrappers below
+// then fail with errMMUnsupported instead of crashing on a nil call.
+var (
+	cppGenerateMM       func(ctx uintptr, prompt, imagesJSON, optsJSON string) uintptr
+	cppGenerateStreamMM func(ctx uintptr, prompt, imagesJSON, optsJSON string, onBlock, onStep, userData uintptr) int32
+)
+
+// mmImageMarker is the literal placeholder dllm_capi_generate_mm expands to
+// <boi> + soft-token placeholders + <eoi> (dllm_capi.h placeholder contract;
+// capi.cpp MM_MARKER). The prompt must carry exactly one marker per
+// images_json entry, in image order.
+const mmImageMarker = "<image>"
+
+// errMMUnsupported is returned for image-bearing requests against an old
+// text-only libdllm.so (the Dlsym probe found no mm symbols).
+var errMMUnsupported = errors.New(
+	"dllm: image input requires libdllm.so with the multimodal entry points (dllm_capi_generate_mm), but the loaded library predates them - rebuild/upgrade the dllm backend to use images")
+
+// cMMSupported reports whether the loaded libdllm.so carries the multimodal
+// generate pair. Both symbols ship together (same dllm.cpp commit), but the
+// guard requires both anyway so a half-present surface can never dispatch.
+func cMMSupported() bool {
+	return cppGenerateMM != nil && cppGenerateStreamMM != nil
+}
+
+// cAbiVersion returns the library's DLLM_CAPI_ABI_VERSION.
+func cAbiVersion() int32 {
+	return cppAbiVersion()
+}
+
+// cLoad opens the GGUF at path with the flat params JSON (e.g.
+// {"n_gpu_layers":99}). Returns 0 on failure; per the header contract there
+// is no ctx to carry the reason, the C side logs it to stderr (and
+// cLastError(0) only yields the static NULL-ctx message).
+func cLoad(path, paramsJSON string) uintptr {
+	return cppLoad(path, paramsJSON)
+}
+
+// cFree releases a ctx; safe on 0 (delete nullptr).
+func cFree(h uintptr) {
+	cppFree(h)
+}
+
+// cLastError returns the ctx's last error message (or the static NULL-ctx
+// message for h==0). The C pointer is borrowed and only valid until the next
+// call on the same ctx; purego's string return copies it immediately, so the
+// returned Go string is safe to keep. Must not be called while another call
+// on the same ctx is in flight.
+func cLastError(h uintptr) string {
+	return cppLastError(h)
+}
+
+// lastErrorOr is cLastError with a fallback for the empty-message case, so
+// wrapped errors never end in ": ".
+func lastErrorOr(h uintptr, fallback string) string {
+	if msg := cLastError(h); msg != "" {
+		return msg
+	}
+	return fallback
+}
+
+// cTokenizeJSON tokenizes text (the C side prepends bos per vocab.add_bos)
+// and returns the token ids as a JSON array string, e.g. "[2,18]".
+func cTokenizeJSON(h uintptr, text string) (string, error) {
+	ret := cppTokenizeJSON(h, text)
+	if ret == 0 {
+		return "", fmt.Errorf("dllm: tokenize failed: %s", lastErrorOr(h, "unknown error"))
+	}
+	out := goStringFromCPtr(ret)
+	cppFreeString(ret)
+	return out, nil
+}
+
+// cGenerate runs a blocking generation and returns the detokenized text.
+// optsJSON must be a FLAT JSON object of scalars (use buildOptsJSON); the C
+// parser rejects nested objects/arrays. NULL return -> last_error (read only
+// after the call returned, per the serialization contract); a cancelled call
+// surfaces as the "cancelled" message.
+func cGenerate(h uintptr, prompt, optsJSON string) (string, error) {
+	ret := cppGenerate(h, prompt, optsJSON)
+	if ret == 0 {
+		return "", fmt.Errorf("dllm: generate failed: %s", lastErrorOr(h, "unknown error"))
+	}
+	out := goStringFromCPtr(ret)
+	cppFreeString(ret)
+	return out, nil
+}
+
+// cGenerateMM is cGenerate's multimodal counterpart. imagesJSON is the flat
+// JSON array of image entries (data: base64 URIs here; the C side also takes
+// file paths) and the prompt must carry one mmImageMarker per entry - the
+// engine enforces the 1:1 match and reports mismatches through last_error.
+func cGenerateMM(h uintptr, prompt, imagesJSON, optsJSON string) (string, error) {
+	if !cMMSupported() {
+		return "", errMMUnsupported
+	}
+	ret := cppGenerateMM(h, prompt, imagesJSON, optsJSON)
+	if ret == 0 {
+		return "", fmt.Errorf("dllm: generate_mm failed: %s", lastErrorOr(h, "unknown error"))
+	}
+	out := goStringFromCPtr(ret)
+	cppFreeString(ret)
+	return out, nil
+}
+
+// streamCallState carries the Go callbacks for one in-flight
+// cGenerateStream call; the registry key travels through C as user_data.
+// The map shape mirrors the whisper backend's streamCallStates: only one
+// entry per ctx is ever live (the C-ABI is serialized per ctx), but keying
+// by call survives multiple models/processes sharing the package.
+type streamCallState struct {
+	onBlock func(text string)
+	onStep  func(step, total int, preview string)
+}
+
+var (
+	streamCallStates sync.Map // uint64 -> *streamCallState
+	streamCallSeq    atomic.Uint64
+
+	// purego.NewCallback allocates a finite, never-released callback slot, so
+	// the two trampolines are created exactly once and reused across calls.
+	streamCbOnce sync.Once
+	blockCbPtr   uintptr
+	stepCbPtr    uintptr
+)
+
+// onBlockTrampoline is the Go side of dllm_block_cb. It runs on the C
+// calling thread, mid-generate: keep it tiny and non-blocking (callers that
+// bridge to goroutines must hand off via buffered channels). The text
+// pointer is only valid for the duration of the invocation, so it is copied
+// to a Go string immediately.
+func onBlockTrampoline(text uintptr, userData uintptr) {
+	v, ok := streamCallStates.Load(uint64(userData))
+	if !ok {
+		return // call already torn down
+	}
+	state := v.(*streamCallState)
+	if state.onBlock != nil {
+		state.onBlock(goStringFromCPtr(text))
+	}
+}
+
+// onStepTrampoline is the Go side of dllm_step_cb; same threading and
+// lifetime caveats as onBlockTrampoline.
+func onStepTrampoline(step int32, totalSteps int32, canvasPreview uintptr, userData uintptr) {
+	v, ok := streamCallStates.Load(uint64(userData))
+	if !ok {
+		return
+	}
+	state := v.(*streamCallState)
+	if state.onStep != nil {
+		state.onStep(int(step), int(totalSteps), goStringFromCPtr(canvasPreview))
+	}
+}
+
+// withStreamCallbacks registers onBlock/onStep in the trampoline registry
+// for the duration of one streaming C call and invokes call with the C
+// function pointers (NULL for absent callbacks, so the C side skips the
+// per-block / per-step detokenize work entirely) plus the registry key to
+// pass as user_data. Shared by the text and multimodal stream wrappers.
+func withStreamCallbacks(onBlock func(text string), onStep func(step, total int, preview string), call func(blockPtr, stepPtr, userData uintptr) int32) int32 {
+	streamCbOnce.Do(func() {
+		blockCbPtr = purego.NewCallback(onBlockTrampoline)
+		stepCbPtr = purego.NewCallback(onStepTrampoline)
+	})
+
+	id := streamCallSeq.Add(1)
+	streamCallStates.Store(id, &streamCallState{onBlock: onBlock, onStep: onStep})
+	defer streamCallStates.Delete(id)
+
+	var blockPtr, stepPtr uintptr
+	if onBlock != nil {
+		blockPtr = blockCbPtr
+	}
+	if onStep != nil {
+		stepPtr = stepCbPtr
+	}
+	return call(blockPtr, stepPtr, uintptr(id))
+}
+
+// cGenerateStream runs a generation with per-committed-block (onBlock) and
+// per-denoising-step (onStep) callbacks; either may be nil. The callbacks
+// run on the C thread (see the trampoline docs). Returns an error carrying
+// last_error on failure; cancellation surfaces as the "cancelled" message.
+func cGenerateStream(h uintptr, prompt, optsJSON string, onBlock func(text string), onStep func(step, total int, preview string)) error {
+	rc := withStreamCallbacks(onBlock, onStep, func(blockPtr, stepPtr, userData uintptr) int32 {
+		return cppGenerateStream(h, prompt, optsJSON, blockPtr, stepPtr, userData)
+	})
+	if rc != 0 {
+		return fmt.Errorf("dllm: generate_stream failed: %s", lastErrorOr(h, "unknown error"))
+	}
+	return nil
+}
+
+// cGenerateStreamMM is cGenerateStream's multimodal counterpart; see
+// cGenerateMM for the imagesJSON/marker contract.
+func cGenerateStreamMM(h uintptr, prompt, imagesJSON, optsJSON string, onBlock func(text string), onStep func(step, total int, preview string)) error {
+	if !cMMSupported() {
+		return errMMUnsupported
+	}
+	rc := withStreamCallbacks(onBlock, onStep, func(blockPtr, stepPtr, userData uintptr) int32 {
+		return cppGenerateStreamMM(h, prompt, imagesJSON, optsJSON, blockPtr, stepPtr, userData)
+	})
+	if rc != 0 {
+		return fmt.Errorf("dllm: generate_stream_mm failed: %s", lastErrorOr(h, "unknown error"))
+	}
+	return nil
+}
+
+// cCancel requests cancellation of the in-flight generate on h. This is the
+// ONE entry point safe to call from any goroutine while a generate runs (it
+// only flips an atomic). Note the cancel-reset race from the header: each
+// generate resets the flag on entry, so a watchdog should re-issue cancel if
+// the call has not returned.
+func cCancel(h uintptr) {
+	cppCancel(h)
+}
+
+// buildOptsJSON renders generation options as the flat JSON object the
+// C-ABI expects (known keys: n_predict, blocks, seed, eb_*, kv_cache). The
+// C-side scanner only understands scalar number/string values and rejects
+// nested objects/arrays loudly; bools are rejected here too because the
+// scanner has no concept of them. Fail loud rather than let an option be
+// silently misread.
+//
+// CAVEAT: json.Marshal HTML-escapes <, > and & inside string values (e.g.
+// "<" becomes the six-byte \u003c sequence). None of the known string-valued keys
+// (kv_cache: auto|on|off) can contain those bytes today; if one ever does,
+// switch to an Encoder with SetEscapeHTML(false) like gemma4JSONString.
+func buildOptsJSON(opts map[string]any) (string, error) {
+	if len(opts) == 0 {
+		return "{}", nil
+	}
+	for k, v := range opts {
+		switch v.(type) {
+		case string,
+			int, int8, int16, int32, int64,
+			uint, uint8, uint16, uint32, uint64,
+			float32, float64,
+			json.Number:
+			// scalar: fine
+		default:
+			return "", fmt.Errorf("dllm: opts key %q has non-scalar value %T (the C-ABI only accepts flat number/string scalars)", k, v)
+		}
+	}
+	b, err := json.Marshal(opts)
+	if err != nil {
+		return "", fmt.Errorf("dllm: marshal opts: %w", err)
+	}
+	return string(b), nil
+}
+
+// goStringFromCPtr copies a NUL-terminated C string into Go memory. cptr is
+// the raw pointer returned by purego from the C-ABI (a malloc'd buffer the
+// caller owns, or a callback argument only valid during the invocation);
+// owning callers must free it via cppFreeString after the copy lands.
+//
+// A direct unsafe.Pointer(cptr) conversion trips go vet's unsafeptr check,
+// which can't distinguish a C-owned heap pointer from Go-managed memory (the
+// parakeet-cpp and whisper backends tolerate that warning). Reinterpreting
+// through &cptr below is equivalent at runtime and keeps plain `go vet`
+// clean. It is safe either way: the pointer addresses C memory the Go GC
+// neither tracks nor moves, and we dereference it immediately to copy the
+// bytes out.
+func goStringFromCPtr(cptr uintptr) string {
+	if cptr == 0 {
+		return ""
+	}
+	p := *(*unsafe.Pointer)(unsafe.Pointer(&cptr)) // C-owned buffer, not Go-GC memory (see doc above)
+	n := 0
+	for *(*byte)(unsafe.Add(p, n)) != 0 {
+		n++
+	}
+	return string(unsafe.Slice((*byte)(p), n))
+}
--- a/backend/go/dllm/dllm.go
+++ b/backend/go/dllm/dllm.go
@@ -0,0 +1,622 @@
+package main
+
+// LocalAI gRPC backend for dllm.cpp (DiffusionGemma block-diffusion models).
+//
+// Wiring overview:
+//   - Load opens the GGUF via dllm_capi_load and starts the per-model worker
+//     goroutine that serializes every C call (see submit).
+//   - PredictRich / PredictStreamRich implement grpc.AIModelRich: when the
+//     request carries raw messages (use_tokenizer_template), the backend owns
+//     templating (RenderGemma4) and output parsing (Gemma4Parser) and replies
+//     with ChatDeltas, like the llama.cpp autoparser and the ds4 backend.
+//   - The legacy Predict / PredictStream methods delegate to the rich pair
+//     (cloud-proxy precedent); the gRPC server prefers the rich path anyway.
+
+import (
+	"encoding/json"
+	"errors"
+	"fmt"
+	"strconv"
+	"strings"
+	"sync"
+	"unicode/utf8"
+
+	grpc "github.com/mudler/LocalAI/pkg/grpc"
+	"github.com/mudler/LocalAI/pkg/grpc/base"
+	"github.com/mudler/LocalAI/pkg/grpc/grpcerrors"
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+	"github.com/mudler/xlog"
+)
+
+// The gRPC server cancels in-flight generations on client disconnect only
+// for backends advertising the Cancellable capability; keep Dllm pinned to
+// it so a signature drift fails the build, not the disconnect path.
+var _ grpc.Cancellable = (*Dllm)(nil)
+
+// generator is the seam between the backend wiring and the dllm.cpp C-ABI:
+// the real implementation (capiGenerator) wraps the cGenerate/cTokenizeJSON
+// family, while tests substitute a fake to exercise prompt construction,
+// parsing and serialization without libdllm.so.
+type generator interface {
+	generate(prompt, optsJSON string) (string, error)
+	// generateStream invokes onBlock once per committed diffusion block, on
+	// the thread running the C call, before returning.
+	generateStream(prompt, optsJSON string, onBlock func(text string)) error
+	// generateMM / generateStreamMM are the multimodal counterparts:
+	// imagesJSON is a flat JSON array of data: base64 URIs and the prompt
+	// carries one mmImageMarker per entry (dllm_capi.h placeholder
+	// contract). Against an old text-only libdllm.so they fail with
+	// errMMUnsupported.
+	generateMM(prompt, imagesJSON, optsJSON string) (string, error)
+	generateStreamMM(prompt, imagesJSON, optsJSON string, onBlock func(text string)) error
+	tokenizeJSON(text string) (string, error)
+	// cancel is the ONE entry point safe to call concurrently with an
+	// in-flight generate on the same ctx (dllm_capi.h: it only flips an
+	// atomic; everything else must be externally serialized per ctx).
+	cancel()
+	free()
+}
+
+// capiGenerator is the production generator over one dllm_ctx handle.
+type capiGenerator struct {
+	h uintptr
+}
+
+func (g *capiGenerator) generate(prompt, optsJSON string) (string, error) {
+	return cGenerate(g.h, prompt, optsJSON)
+}
+
+func (g *capiGenerator) generateStream(prompt, optsJSON string, onBlock func(text string)) error {
+	// on_step (per-denoise-step canvas preview, dllm.cpp's --visual) is
+	// passed as nil for now: a future progress hook for the React UI can
+	// plumb it through without touching the C binding.
+	return cGenerateStream(g.h, prompt, optsJSON, onBlock, nil)
+}
+
+func (g *capiGenerator) generateMM(prompt, imagesJSON, optsJSON string) (string, error) {
+	return cGenerateMM(g.h, prompt, imagesJSON, optsJSON)
+}
+
+func (g *capiGenerator) generateStreamMM(prompt, imagesJSON, optsJSON string, onBlock func(text string)) error {
+	// on_step is nil for the same reason as generateStream.
+	return cGenerateStreamMM(g.h, prompt, imagesJSON, optsJSON, onBlock, nil)
+}
+
+func (g *capiGenerator) tokenizeJSON(text string) (string, error) {
+	return cTokenizeJSON(g.h, text)
+}
+
+func (g *capiGenerator) cancel() {
+	cCancel(g.h)
+}
+
+func (g *capiGenerator) free() {
+	cFree(g.h)
+}
+
+// Dllm is the gRPC backend instance: one per loaded model (LocalAI starts
+// one backend process per model).
+type Dllm struct {
+	base.Base
+
+	gen generator
+	// genOpts holds the model-level generation overrides parsed from
+	// ModelOptions.Options at Load (eb_*, blocks, kv_cache). The C-ABI takes
+	// them per-generate, not per-load, so they are merged into every
+	// request's opts JSON (requestOptsJSON).
+	genOpts map[string]any
+
+	// jobs is the per-model worker queue. dllm_capi.h requires every entry
+	// point EXCEPT dllm_capi_cancel to be externally serialized per ctx (one
+	// ctx = one concurrent generate/tokenize; last_error is unsafe to read
+	// while a call is in flight). A single goroutine owning all C calls makes
+	// that contract structural instead of relying on lock discipline.
+	jobs     chan func()
+	workerWG sync.WaitGroup
+
+	// genMu guards gen against Free racing in-flight requests: requests hold
+	// the read lock for their full duration (they stay concurrent with each
+	// other - the worker still serializes the C calls), Free takes the write
+	// lock so it can only run when no request is in flight.
+	genMu sync.RWMutex
+}
+
+func (d *Dllm) startWorker() {
+	d.jobs = make(chan func())
+	d.workerWG.Add(1)
+	go func() {
+		defer d.workerWG.Done()
+		for job := range d.jobs {
+			job()
+		}
+	}()
+}
+
+// submit runs job on the worker goroutine and waits for it to finish.
+// Concurrent gRPC requests therefore queue up and execute one at a time
+// against the single dllm_ctx.
+func (d *Dllm) submit(job func()) {
+	done := make(chan struct{})
+	d.jobs <- func() {
+		defer close(done)
+		job()
+	}
+	<-done
+}
+
+// Load opens the GGUF and prepares the worker. Load-time engine parameters
+// travel as the flat params JSON of dllm_capi_load; generation overrides
+// from Options are stored for per-request opts JSON instead (the C-ABI has
+// no per-load sampler state).
+func (d *Dllm) Load(opts *pb.ModelOptions) error {
+	if d.gen != nil {
+		return errors.New("dllm: model already loaded")
+	}
+
+	params := map[string]any{
+		"n_gpu_layers": opts.GetNGPULayers(),
+	}
+	if opts.GetThreads() > 0 {
+		params["n_threads"] = opts.GetThreads()
+	}
+	if opts.GetContextSize() > 0 {
+		params["ctx_len"] = opts.GetContextSize()
+	}
+	paramsJSON, err := buildOptsJSON(params)
+	if err != nil {
+		return err
+	}
+
+	d.genOpts = parseModelGenOpts(opts.GetOptions())
+
+	h := cLoad(opts.GetModelFile(), paramsJSON)
+	if h == 0 {
+		// No ctx exists on load failure, so last_error(NULL) only carries the
+		// static NULL-ctx message; the real reason is on the backend's stderr.
+		return fmt.Errorf("dllm: load %q failed: %s (see backend log for details)",
+			opts.GetModelFile(), lastErrorOr(0, "unknown error"))
+	}
+	d.gen = &capiGenerator{h: h}
+	d.startWorker()
+	xlog.Info("dllm: model loaded", "model", opts.GetModelFile(), "params", paramsJSON, "gen_opts", d.genOpts)
+	return nil
+}
+
+// Free releases the dllm ctx and stops the worker. Safe when never loaded.
+//
+// The write lock is essential: the gRPC server (pkg/grpc/server.go, see the
+// model-unload path around line 764) calls Free with no locking of its own,
+// and base.Base provides none either. Without it a request racing Free would
+// panic sending on the closed jobs channel - or worse, generate on a freed C
+// ctx. Holding genMu until gen is nil also turns post-Free requests into a
+// clean "model not loaded" error instead of a crash.
+func (d *Dllm) Free() error {
+	d.genMu.Lock()
+	defer d.genMu.Unlock()
+	if d.gen == nil {
+		return nil
+	}
+	d.submit(d.gen.free)
+	close(d.jobs)
+	d.workerWG.Wait()
+	d.gen = nil
+	return nil
+}
+
+// Cancel requests cancellation of the in-flight generate (the
+// grpc.Cancellable capability). The gRPC server arms it via
+// context.AfterFunc on the request/stream context, so a client
+// disconnect or timeout aborts the generation server-side - the same
+// semantics the llama.cpp C++ backend gets from polling IsCancelled().
+// It deliberately bypasses the worker queue: dllm_capi_cancel is the one
+// call the C-ABI allows from any goroutine mid-generate (it only flips
+// an atomic).
+//
+// Note dllm_capi.h's cancel-reset race: each generate resets the flag on
+// entry, so a Cancel racing a NEW generate on the same ctx can be lost
+// (and, with requests queued on the worker, it aborts whichever generate
+// is currently running). The single-flag granularity is acceptable here
+// because the server de-registers the hook on normal completion and one
+// backend process serves one model.
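+func (d *Dllm) Cancel() {
+	// RLock so a server-side AfterFunc firing in the window between a
+	// request finishing and a model unload cannot touch a freed C ctx
+	// (Free holds the write lock while tearing gen down). Read locks are
+	// shared, so this cannot deadlock against request holders, and
+	// cancel() is the one C call that is safe concurrently with an
+	// in-flight generate. A Cancel that arrives while Free is already
+	// waiting for the write lock blocks until Free is done (sync.RWMutex
+	// queues new readers behind a pending writer) and then sees gen == nil.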
+	d.genMu.RLock()
+	defer d.genMu.RUnlock()
+	if d.gen != nil {
+		d.gen.cancel()
+	}
+}
+
+// dllmGenOptKeys are the ModelOptions.Options keys this backend forwards to
+// the engine. Options is a shared free-form bag (other layers put their own
+// entries there), so unknown keys are skipped with a debug log, not an error.
+var dllmGenOptKeys = map[string]bool{
+	"blocks":   true,
+	"kv_cache": true, // "auto"|"on"|"off"; honored by the engine from P3
+}
+
+// parseModelGenOpts parses "key:value" Options entries into the flat scalar
+// map merged into every generate's opts JSON. eb_* (Entropy-Bound sampler
+// knobs) and the keys in dllmGenOptKeys are recognized; values are typed by
+// first successful parse (int, then float, else string) to match the C
+// scanner's number/string scalars.
+func parseModelGenOpts(options []string) map[string]any {
+	out := map[string]any{}
+	for _, o := range options {
+		key, val, found := strings.Cut(o, ":")
+		if !found {
+			xlog.Warn("dllm: ignoring malformed option (want key:value)", "option", o)
+			continue
+		}
+		if !strings.HasPrefix(key, "eb_") && !dllmGenOptKeys[key] {
+			xlog.Debug("dllm: ignoring unrecognized option", "key", key)
+			continue
+		}
+		out[key] = parseScalarOpt(val)
+	}
+	return out
+}
+
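+// parseScalarOpt types one option value by first successful parse: base-10
+// int64, then float64, else the raw string (e.g. "4" -> int64(4),
+// "0.5" -> float64(0.5), "auto" -> "auto").
+func parseScalarOpt(v string) any {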
+	if iv, err := strconv.ParseInt(v, 10, 64); err == nil {
+		return iv
+	}
+	if fv, err := strconv.ParseFloat(v, 64); err == nil {
+		return fv
+	}
+	return v
+}
+
+// metadataEnableThinking reads the enable_thinking gate. Unlike ds4 (default
+// ON, matching ds4-server), dllm defaults OFF: DiffusionGemma's chat
+// template guards every thinking branch with `enable_thinking is defined and
+// enable_thinking`, i.e. thinking is opt-in for this model family, and the
+// no-thinking render pre-closes an empty thought channel that the OFF
+// default must produce.
+func metadataEnableThinking(opts *pb.PredictOptions) bool {
+	v := opts.GetMetadata()["enable_thinking"]
+	return v == "true" || v == "1"
+}
+
+// buildPrompt resolves the prompt for a request. With use_tokenizer_template
+// and raw messages the backend owns templating (RenderGemma4, including the
+// mmImageMarker injection for opts.Images) and the output is in the known
+// gemma4 format, so parse=true. Without it the caller templated the prompt
+// itself (LocalAI's Go templates + PEG fallback, or a bare completion): the
+// prompt passes through verbatim, so for image requests it must already
+// carry one literal mmImageMarker per image (the engine enforces the 1:1
+// match). The output is then NOT gemma4-parsed: it is emitted as plain
+// content and the Go side's extraction applies, as for any non-autoparsing
+// backend.
+func buildPrompt(opts *pb.PredictOptions) (prompt string, parse bool, err error) {
+	if opts.GetUseTokenizerTemplate() && len(opts.GetMessages()) > 0 {
+		prompt, err = RenderGemma4(opts.GetMessages(), opts.GetTools(), len(opts.GetImages()), metadataEnableThinking(opts), true)
+		return prompt, true, err
+	}
+	return opts.GetPrompt(), false, nil
+}
+
+// imagesJSON renders opts.Images as the flat JSON array of data: URIs the mm
+// C-ABI expects, or "" when the request carries no images. The entries arrive
+// as RAW base64 payloads: LocalAI's OpenAI layer decodes every image_url /
+// image content part (URL download or data: URI) to plain base64 via
+// utils.GetContentURIAsBase64 (core/http/middleware/request.go) and core
+// flattens them into PredictOptions.Images (core/backend/llm.go). The
+// hardcoded image/jpeg mime mirrors the llama.cpp backend's re-wrapping
+// convention (grpc-server.cpp, "data:image/jpeg;base64," + images(i)); the
+// engine ignores the declared mime and sniffs the real format from the
+// decoded bytes (stb_image), so PNG/BMP payloads work through it too.
+func imagesJSON(images []string) (string, error) {
+	if len(images) == 0 {
+		return "", nil
+	}
+	uris := make([]string, len(images))
+	for i, img := range images {
+		// dllm_capi.h: array entries are read VERBATIM up to the closing
+		// quote, with NO escape handling. json.Marshal would escape these
+		// bytes and the C side would misparse the entry, so fail loud (they
+		// can never appear in genuine base64 anyway).
+		if strings.ContainsAny(img, "\"\\") {
+			return "", fmt.Errorf("dllm: image %d is not base64 (contains a quote or backslash; PredictOptions.Images entries must be raw base64 payloads)", i)
+		}
+		uris[i] = "data:image/jpeg;base64," + img
+	}
+	b, err := json.Marshal(uris)
+	if err != nil {
+		return "", fmt.Errorf("dllm: marshal images: %w", err)
+	}
+	return string(b), nil
+}
+
+// requestOptsJSON merges the model-level overrides with the request's
+// sampling fields into the flat opts JSON for one generate call.
+func (d *Dllm) requestOptsJSON(opts *pb.PredictOptions) (string, error) {
+	m := make(map[string]any, len(d.genOpts)+2)
+	for k, v := range d.genOpts {
+		m[k] = v
+	}
+	if n := opts.GetTokens(); n > 0 {
+		// The engine rounds n_predict UP to a whole number of diffusion
+		// blocks (the canvas is denoised block-wise), so the completion may
+		// run slightly past the requested budget. Tokens==0 omits the key so
+		// the C-ABI default of 256 applies (hardcoded in capi.cpp's
+		// parse_gen_opts, independent of canvas_length).
+		m["n_predict"] = n
+	}
+	if s := opts.GetSeed(); s > 0 {
+		// The engine seeds mt19937 with explicit non-negative seeds. Seed<=0
+		// is omitted: proto3 cannot distinguish 0 from unset, and negative
+		// values conventionally mean "random" across LocalAI backends.
+		m["seed"] = s
+	}
+	return buildOptsJSON(m)
+}
+
+// prepareRequest is the shared prologue of the rich methods: resolve the
+// prompt (and whether the output gets gemma4-parsed) and build the per-call
+// opts JSON plus the images JSON ("" for text-only requests, which routes
+// the call through the text generate entry points).
+func (d *Dllm) prepareRequest(opts *pb.PredictOptions) (prompt string, parse bool, optsJSON, imgJSON string, err error) {
+	// Fail loud on media the engine has no path for, instead of silently
+	// generating from a prompt that ignores them.
+	if len(opts.GetVideos()) > 0 || len(opts.GetAudios()) > 0 {
+		return "", false, "", "", errors.New("dllm: video/audio input is not supported (images only)")
+	}
+	prompt, parse, err = buildPrompt(opts)
+	if err != nil {
+		return "", false, "", "", err
+	}
+	optsJSON, err = d.requestOptsJSON(opts)
+	if err != nil {
+		return "", false, "", "", err
+	}
+	imgJSON, err = imagesJSON(opts.GetImages())
+	if err != nil {
+		return "", false, "", "", err
+	}
+	return prompt, parse, optsJSON, imgJSON, nil
+}
+
+// sanitizeUTF8 makes s safe for a proto3 string field. Block-boundary
+// detokenization and byte-fallback tokens can produce invalid UTF-8, and
+// grpc-go refuses to marshal it ("string field contains invalid UTF-8"), so
+// every string destined for a Reply/ChatDelta must pass through here (or
+// through splitValidUTF8, which calls it). Lone malformed bytes are genuinely
+// undecodable: replace with U+FFFD rather than crash the stream.
+func sanitizeUTF8(s string) string {
+	if utf8.ValidString(s) {
+		return s
+	}
+	return strings.ToValidUTF8(s, "\uFFFD")
+}
+
+// utf8SeqLen returns the declared sequence length of a UTF-8 leading byte
+// (1 for bytes that can never lead a multi-byte sequence, so they are never
+// held back and fall through to sanitizeUTF8's replacement).
+func utf8SeqLen(b byte) int {
+	switch {
+	case b&0xE0 == 0xC0:
+		return 2
+	case b&0xF0 == 0xE0:
+		return 3
+	case b&0xF8 == 0xF0:
+		return 4
+	default:
+		return 1
+	}
+}
+
+// splitValidUTF8 prepends the previous block's carry to the new block and
+// splits the result into text safe to emit now and a trailing INCOMPLETE
+// UTF-8 sequence (at most utf8.UTFMax-1 bytes) to carry into the next block:
+// the per-block detokenize can split a multi-byte character across block
+// boundaries (llama.cpp's grpc-server holds back the same way). Only a
+// suffix that can still become a valid rune is withheld; bytes that are
+// already undecodable are replaced immediately so the carry stays bounded.
+func splitValidUTF8(carry, block string) (emit, newCarry string) {
+	s := carry + block
+	cut := len(s)
+	for i := len(s) - 1; i >= 0 && len(s)-i < utf8.UTFMax; i-- {
+		b := s[i]
+		if b < utf8.RuneSelf {
+			break // ASCII: everything before the tail scan is complete
+		}
+		if !utf8.RuneStart(b) {
+			continue // continuation byte: keep looking for its leading byte
+		}
+		// Leading byte: hold the sequence back iff it declares more bytes
+		// than the stream has produced so far (it may complete next block).
+		if utf8SeqLen(b) > len(s)-i {
+			cut = i
+		}
+		break
+	}
+	return sanitizeUTF8(s[:cut]), s[cut:]
+}
+
+// PredictRich is the non-streaming inference path (grpc.AIModelRich).
+// Returns one Reply whose Message is the aggregated assistant content and
+// whose ChatDeltas carry the parsed content/reasoning/tool-call events.
+func (d *Dllm) PredictRich(opts *pb.PredictOptions) (*pb.Reply, error) {
+	d.genMu.RLock()
+	defer d.genMu.RUnlock()
+	if d.gen == nil {
+		return nil, grpcerrors.ModelNotLoaded("dllm")
+	}
+	prompt, parse, optsJSON, imgJSON, err := d.prepareRequest(opts)
+	if err != nil {
+		return nil, err
+	}
+
+	var out string
+	var genErr error
+	d.submit(func() {
+		if imgJSON != "" {
+			out, genErr = d.gen.generateMM(prompt, imgJSON, optsJSON)
+		} else {
+			out, genErr = d.gen.generate(prompt, optsJSON)
+		}
+	})
+	if genErr != nil {
+		return nil, genErr
+	}
+	// Byte-fallback tokens can detokenize to invalid UTF-8; proto3 strings
+	// must be valid or grpc-go fails the whole reply at marshal time.
+	out = sanitizeUTF8(out)
+
+	if !parse {
+		// Raw-prompt mode: plain content, no gemma4 parsing (see buildPrompt).
+		return &pb.Reply{Message: []byte(out), ChatDeltas: []*pb.ChatDelta{{Content: out}}}, nil
+	}
+
+	// The prompt renders with add_generation_prompt; both thinking modes
+	// leave the model starting in content state (see the Gemma4Parser header
+	// comment), hence NewGemma4Parser(false).
+	parser := NewGemma4Parser(false)
+	if reply := replyFromDeltas(append(parser.Feed(out), parser.Close()...)); reply != nil {
+		return reply, nil
+	}
+	// Everything was markers (or out was empty): an empty but non-nil Reply.
+	return &pb.Reply{}, nil
+}
+
+// PredictStreamRich is the streaming counterpart (grpc.AIModelRich): one
+// Reply per committed diffusion block that produced deltas. Per the
+// interface contract the channel is only sent into here - the gRPC server
+// closes it after this returns (opposite to legacy PredictStream).
+func (d *Dllm) PredictStreamRich(opts *pb.PredictOptions, results chan<- *pb.Reply) error {
+	d.genMu.RLock()
+	defer d.genMu.RUnlock()
+	if d.gen == nil {
+		return grpcerrors.ModelNotLoaded("dllm")
+	}
+	prompt, parse, optsJSON, imgJSON, err := d.prepareRequest(opts)
+	if err != nil {
+		return err
+	}
+
+	var parser *Gemma4Parser
+	if parse {
+		parser = NewGemma4Parser(false)
+	}
+	// emit runs inside onBlock, i.e. on the thread driving the C generate.
+	// Sending on results can block on a slow consumer, but the server-side
+	// pump (pkg/grpc/server.go PredictStream) drains continuously and drops
+	// undeliverable sends, so this backpressure is brief and bounded - and
+	// pausing the diffusion loop under it is the desired behavior anyway.
+	emit := func(text string) {
+		if !parse {
+			if text != "" {
+				results <- &pb.Reply{Message: []byte(text), ChatDeltas: []*pb.ChatDelta{{Content: text}}}
+			}
+			return
+		}
+		deltas := parser.Feed(text)
+		if reply := replyFromDeltas(deltas); reply != nil {
+			results <- reply
+		}
+	}
+	// onBlock guards emit (and through it the parser) against invalid UTF-8:
+	// a multi-byte character split across block boundaries is held back until
+	// it completes (see splitValidUTF8), so proto3 marshaling never fails.
+	var carry string
+	onBlock := func(block string) {
+		var text string
+		text, carry = splitValidUTF8(carry, block)
+		emit(text)
+	}
+
+	var genErr error
+	d.submit(func() {
+		if imgJSON != "" {
+			genErr = d.gen.generateStreamMM(prompt, imgJSON, optsJSON, onBlock)
+		} else {
+			genErr = d.gen.generateStream(prompt, optsJSON, onBlock)
+		}
+	})
+	if genErr != nil {
+		return genErr
+	}
+	if carry != "" {
+		// The stream ended mid-sequence: the held-back bytes can no longer
+		// complete, so flush them through the U+FFFD last resort.
+		emit(sanitizeUTF8(carry))
+	}
+	if parse {
+		if reply := replyFromDeltas(parser.Close()); reply != nil {
+			results <- reply
+		}
+	}
+	return nil
+}
+
+// replyFromDeltas wraps one batch of parsed deltas into a streaming Reply,
+// or nil when the batch is empty (markers consumed, nothing emitted yet).
+// Message mirrors the batch's content text so legacy chan-string consumers
+// see exactly the displayed tokens.
+func replyFromDeltas(deltas []*pb.ChatDelta) *pb.Reply {
+	if len(deltas) == 0 {
+		return nil
+	}
+	var content strings.Builder
+	for _, delta := range deltas {
+		content.WriteString(delta.GetContent())
+	}
+	return &pb.Reply{Message: []byte(content.String()), ChatDeltas: deltas}
+}
+
+// Predict is the legacy (string, error) signature; the gRPC server prefers
+// PredictRich, this exists for non-rich callers (cloud-proxy precedent).
+func (d *Dllm) Predict(opts *pb.PredictOptions) (string, error) {
+	reply, err := d.PredictRich(opts)
+	if err != nil {
+		return "", err
+	}
+	return string(reply.GetMessage()), nil
+}
+
+// PredictStream is the legacy chan-string path: rich replies reduced to
+// their content text. Note the inverted channel ownership - the LEGACY
+// contract requires the impl to close the channel.
+func (d *Dllm) PredictStream(opts *pb.PredictOptions, results chan string) error {
+	defer close(results)
+	richCh := make(chan *pb.Reply)
+	errCh := make(chan error, 1)
+	go func() {
+		errCh <- d.PredictStreamRich(opts, richCh)
+		close(richCh)
+	}()
+	for reply := range richCh {
+		if msg := reply.GetMessage(); len(msg) > 0 {
+			results <- string(msg)
+		}
+	}
+	return <-errCh
+}
+
+// TokenizeString tokenizes opts.Prompt via dllm_capi_tokenize_json (the C
+// side prepends bos per the vocab) and decodes the returned id array.
+func (d *Dllm) TokenizeString(opts *pb.PredictOptions) (pb.TokenizationResponse, error) {
+	d.genMu.RLock()
+	defer d.genMu.RUnlock()
+	if d.gen == nil {
+		return pb.TokenizationResponse{}, grpcerrors.ModelNotLoaded("dllm")
+	}
+	var out string
+	var tokErr error
+	d.submit(func() {
+		out, tokErr = d.gen.tokenizeJSON(opts.GetPrompt())
+	})
+	if tokErr != nil {
+		return pb.TokenizationResponse{}, tokErr
+	}
+	var tokens []int32
+	if err := json.Unmarshal([]byte(out), &tokens); err != nil {
+		return pb.TokenizationResponse{}, fmt.Errorf("dllm: decode tokenize result %q: %w", out, err)
+	}
+	return pb.TokenizationResponse{Length: int32(len(tokens)), Tokens: tokens}, nil
+}
--- a/backend/go/dllm/dllm_test.go
+++ b/backend/go/dllm/dllm_test.go
--- a/backend/go/dllm/gemma4_parser.go
+++ b/backend/go/dllm/gemma4_parser.go
@@ -0,0 +1,562 @@
+// Gemma4 (DiffusionGemma) streaming output parser: raw model text, fed in
+// arbitrary fragments (per committed diffusion block; a fragment can split
+// anywhere, including mid-marker and mid-payload), is turned into
+// pb.ChatDelta events (content / reasoning_content / tool_calls).
+//
+// Normative sources:
+//   - The chat template embedded at the top of gemma4_renderer.go ("tpl L<n>"
+//     citations below refer to its numbered lines). The OUTPUT format mirrors
+//     what the template renders for assistant history: thought channels
+//     (<|channel>thought\n ... <channel|>, tpl L240), tool calls
+//     (<|tool_call>call:name{...}<tool_call|>, tpl L246-L257) and turn ends
+//     (<turn|>, tpl L351).
+//   - vLLM PR #45163: vllm/tool_parsers/gemma4_tool_parser.py (marker
+//     handling, the call:name{...} argument grammar and its decoder, ported
+//     below) and vllm/reasoning/gemma4_reasoning_parser.py (channel markers,
+//     the "thought\n" role label, is_reasoning_end semantics).
+//
+// Initial state (derived from the generation prompt, tpl L356-L362, see
+// RenderGemma4):
+//   - enable_thinking=false: the prompt ends with "<|turn>model\n" +
+//     "<|channel>thought\n<channel|>" - an EMPTY thought channel, pre-opened
+//     AND pre-closed by the template. The model's output therefore starts in
+//     plain content. Use NewGemma4Parser(false).
+//   - enable_thinking=true: the prompt ends at "<|turn>model\n" and the model
+//     opens and closes its own thought channel in the OUTPUT
+//     ("<|channel>thought\n...reasoning...<channel|>final answer", per the
+//     vLLM Gemma4ReasoningParser docstring). The parser still starts in
+//     content state - the channel markers in the output drive the switch.
+//     Use NewGemma4Parser(false) here too.
+//   - NewGemma4Parser(true) is for callers that pre-open the thought channel
+//     in the prompt themselves (appending "<|channel>thought\n" after the
+//     generation prompt to force thinking): the output then begins mid-thought
+//     and everything is reasoning until the first <channel|>.
+//
+// State diagram (markers are consumed, never emitted):
+//
+//	             <|channel>                  \n (channel name dropped: the
+//	[content] --------------> [chan-header] ----> [thought]   "thought\n" role
+//	   ^ |  <channel|> (stray close: swallowed,                label, stripped
+//	   +-+  strip_thinking semantics, tpl L148-L158)           like vLLM does)
+//	   ^                  <channel|>
+//	   +----------------------------------------- [thought]
+//	   ^                  <tool_call|>                 | <|tool_call> (implicit
+//	   +-------------- [tool-call] <-------------------+  reasoning end, vLLM
+//	   |  <|tool_call>     ^                               is_reasoning_end)
+//	   +-------------------+
+//	[content]/[thought] --- <turn|> ---> [done]  (everything after is dropped)
+//
+// Buffering rules:
+//   - content/thought states hold back at most len(longest marker)-1 bytes:
+//     the longest tail that is still a proper prefix of a watched marker.
+//     Content is otherwise emitted immediately (no unbounded buffering).
+//   - the tool-call state buffers the whole payload until <tool_call|>. This
+//     is unbounded in principle but bounded in practice by the model's
+//     diffusion canvas, and is required because the call:name{...} payload
+//     only becomes decodable (and trustworthy) once complete - the same
+//     reason vLLM's parser accumulates before parsing.
+//   - Close() flushes whatever is still held: partial markers come out as
+//     content/reasoning (per the state that held them); an unterminated
+//     channel header or tool-call payload is re-emitted RAW (including its
+//     opening marker) as content - malformed output is never silently
+//     dropped (mirrors vLLM extract_tool_calls returning the raw text as
+//     content when its regex does not match).
+//
+// Streaming granularity DIVERGENCE from vLLM: vLLM re-parses the partial
+// payload on every token and streams argument-JSON diffs (its `partial=True`
+// decoder mode plus withholding logic exist only for that). Our fragments are
+// whole committed diffusion blocks, so each completed tool call is emitted
+// once, as a single ToolCallDelta carrying index + id + name + the full
+// arguments JSON - exactly the shape backend/python/vllm/backend.py emits
+// per call and pkg/functions.ToolCallsFromChatDeltas re-accumulates.
+package main
+
+import (
+	"encoding/json"
+	"regexp"
+	"strconv"
+	"strings"
+
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+)
+
+// gemma4CallRE is vLLM's tool_call_regex
+// (`<\|tool_call>call:([\w\-\.]+)\{(.*?)\}<tool_call\|>`, DOTALL) anchored to
+// a single already-extracted payload: name charset [\w\-.], braces mandatory.
+var gemma4CallRE = regexp.MustCompile(`(?s)^call:([\w\-.]+)\{(.*)\}$`)
+
+type g4State int
+
+const (
+	g4Content g4State = iota
+	g4ChanHeader
+	g4Thought
+	g4ToolCall
+	g4Done
+)
+
+// Markers watched per emitting state. A stray <tool_call|> outside a tool
+// call is deliberately NOT watched: it passes through verbatim, consistent
+// with the malformed-payload fallback re-emitting it as content.
+var (
+	gemma4ContentMarkers = []string{gemma4ChannelOpen, gemma4ChannelClose, gemma4ToolCallOpen, gemma4TurnEnd}
+	gemma4ThoughtMarkers = []string{gemma4ChannelClose, gemma4ToolCallOpen, gemma4TurnEnd}
+)
+
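+// Gemma4Parser is the streaming state machine described in the header
+// comment. It is not safe for concurrent use: drive it from one goroutine,
+// calling Feed once per fragment and Close once at the end.
+type Gemma4Parser struct {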
+	state g4State
+	// held is the per-state carry-over between Feed calls: a partial marker
+	// (content/thought), a partial channel header (chan-header) or the
+	// payload accumulated so far (tool-call).
+	held    string
+	toolIdx int
+}
+
+// NewGemma4Parser returns a parser positioned per the initial-state rules in
+// the header comment: startInThought=true only when the caller pre-opened a
+// thought channel in the prompt.
+func NewGemma4Parser(startInThought bool) *Gemma4Parser {
+	state := g4Content
+	if startInThought {
+		state = g4Thought
+	}
+	return &Gemma4Parser{state: state}
+}
+
+// Feed consumes the next output fragment and returns the deltas it completes.
+func (p *Gemma4Parser) Feed(text string) []*pb.ChatDelta {
+	if text == "" || p.state == g4Done {
+		return nil
+	}
+	pending := p.held + text
+	p.held = ""
+	var em g4Emitter
+	for pending != "" {
+		switch p.state {
+		case g4Content, g4Thought:
+			markers := gemma4ContentMarkers
+			if p.state == g4Thought {
+				markers = gemma4ThoughtMarkers
+			}
+			idx, marker := findEarliestGemma4Marker(pending, markers)
+			if idx == -1 {
+				hold := gemma4MarkerHoldback(pending, markers)
+				p.emitText(&em, pending[:len(pending)-hold])
+				p.held = pending[len(pending)-hold:]
+				pending = ""
+				continue
+			}
+			p.emitText(&em, pending[:idx])
+			pending = pending[idx+len(marker):]
+			switch marker {
+			case gemma4ChannelOpen:
+				p.state = g4ChanHeader
+			case gemma4ChannelClose:
+				// In thought: channel ends. In content: stray close,
+				// swallowed (strip_thinking keeps both sides, tpl L148-L158).
+				p.state = g4Content
+			case gemma4ToolCallOpen:
+				p.state = g4ToolCall
+			case gemma4TurnEnd:
+				p.state = g4Done
+			}
+		case g4ChanHeader:
+			// The channel header is "<name>\n"; the template only ever writes
+			// "thought" (tpl L240/L360) and the label is structural, so it is
+			// dropped, not emitted (vLLM strips the same "thought\n" prefix).
+			nl := strings.IndexByte(pending, '\n')
+			if nl == -1 {
+				p.held = pending
+				pending = ""
+				continue
+			}
+			pending = pending[nl+1:]
+			p.state = g4Thought
+		case g4ToolCall:
+			end := strings.Index(pending, gemma4ToolCallClose)
+			if end == -1 {
+				p.held = pending
+				pending = ""
+				continue
+			}
+			p.emitToolCall(&em, pending[:end])
+			pending = pending[end+len(gemma4ToolCallClose):]
+			p.state = g4Content
+		case g4Done:
+			pending = ""
+		}
+	}
+	return em.deltas
+}
+
+// Close flushes held-back partials. Incomplete structures (open channel
+// header, unterminated tool payload) are re-emitted raw as content rather
+// than dropped. The parser is finished afterwards.
+func (p *Gemma4Parser) Close() []*pb.ChatDelta {
+	var em g4Emitter
+	switch p.state {
+	case g4Content:
+		em.content(p.held)
+	case g4Thought:
+		em.reasoning(p.held)
+	case g4ChanHeader:
+		em.content(gemma4ChannelOpen + p.held)
+	case g4ToolCall:
+		em.content(gemma4ToolCallOpen + p.held)
+	case g4Done:
+	}
+	p.held = ""
+	p.state = g4Done
+	return em.deltas
+}
+
+func (p *Gemma4Parser) emitText(em *g4Emitter, s string) {
+	if p.state == g4Thought {
+		em.reasoning(s)
+		return
+	}
+	em.content(s)
+}
+
+// emitToolCall decodes one complete <|tool_call>...<tool_call|> payload. On a
+// payload that does not match call:name{...} the raw text (markers included)
+// is emitted as content, mirroring vLLM's extract_tool_calls fallback.
+func (p *Gemma4Parser) emitToolCall(em *g4Emitter, payload string) {
+	m := gemma4CallRE.FindStringSubmatch(payload)
+	if m == nil {
+		em.content(gemma4ToolCallOpen + payload + gemma4ToolCallClose)
+		return
+	}
+	// Index-based ids: deterministic (the split-invariance property relies
+	// on it) and matching the call_<n> convention of pkg/grpc/rich_test.go;
+	// core only needs ids to be non-empty and unique within the response.
+	em.tool(p.toolIdx, "call_"+strconv.Itoa(p.toolIdx), m[1], decodeGemma4Args(m[2], 0))
+	p.toolIdx++
+}
+
+// g4Emitter collects ChatDeltas; empty text events are dropped.
+type g4Emitter struct {
+	deltas []*pb.ChatDelta
+}
+
+func (e *g4Emitter) content(s string) {
+	if s != "" {
+		e.deltas = append(e.deltas, &pb.ChatDelta{Content: s})
+	}
+}
+
+func (e *g4Emitter) reasoning(s string) {
+	if s != "" {
+		e.deltas = append(e.deltas, &pb.ChatDelta{ReasoningContent: s})
+	}
+}
+
+func (e *g4Emitter) tool(index int, id, name, argsJSON string) {
+	e.deltas = append(e.deltas, &pb.ChatDelta{ToolCalls: []*pb.ToolCallDelta{{
+		Index:     int32(index),
+		Id:        id,
+		Name:      name,
+		Arguments: argsJSON,
+	}}})
+}
+
+// findEarliestGemma4Marker returns the position and value of the first
+// complete marker occurrence, or (-1, "").
+func findEarliestGemma4Marker(s string, markers []string) (int, string) {
+	best, bestMarker := -1, ""
+	for _, m := range markers {
+		if idx := strings.Index(s, m); idx >= 0 && (best == -1 || idx < best) {
+			best, bestMarker = idx, m
+		}
+	}
+	return best, bestMarker
+}
+
+// gemma4MarkerHoldback returns the length of the longest suffix of s that is
+// a proper prefix of a watched marker - the only bytes that may still grow
+// into a marker and therefore must not be emitted yet (bounded by the
+// longest marker, so content is never buffered unboundedly).
+func gemma4MarkerHoldback(s string, markers []string) int {
+	maxHold := 0
+	for _, m := range markers {
+		if len(m)-1 > maxHold {
+			maxHold = len(m) - 1
+		}
+	}
+	if len(s) < maxHold {
+		maxHold = len(s)
+	}
+	for k := maxHold; k >= 1; k-- {
+		tail := s[len(s)-k:]
+		for _, m := range markers {
+			if strings.HasPrefix(m, tail) {
+				return k
+			}
+		}
+	}
+	return 0
+}
+
+// ---------------------------------------------------------------------------
+// call:name{...} argument decoder
+//
+// Port of vLLM's _parse_gemma4_args / _parse_gemma4_array /
+// _parse_gemma4_value (gemma4_tool_parser.py) in non-partial mode only: this
+// parser decodes exclusively COMPLETE payloads (incomplete ones fall back to
+// raw content at Close), so vLLM's partial-withholding machinery
+// (trailing-dot floats, withheld bare tails) is intentionally not ported.
+//
+// Grammar (inverse of the renderer's formatGemma4Argument, tpl L118-L147):
+//
+//	args    := pair (',' pair)*
+//	pair    := key ':' value          (keys unquoted, up to the first ':')
+//	value   := string | object | array | bare
+//	string  := '<|"|>' ... '<|"|>'    (no escapes; unterminated -> rest)
+//	object  := '{' args '}'           (delimited strings skipped when
+//	array   := '[' value,* ']'         counting braces/brackets)
+//	bare    := true | false | null/none/nil | number | bare-string
+//
+// Output is a JSON object/array string with keys in payload order (Python
+// dict insertion order), built with HTML escaping off so payload text
+// survives byte-for-byte.
+// ---------------------------------------------------------------------------
+
+func isGemma4Space(c byte) bool { return c == ' ' || c == '\n' || c == '\t' }
+
+// gemma4MaxArgsDepth caps the mutual recursion between decodeGemma4Args and
+// decodeGemma4Array. Defense against model-generated deep nesting: a Go stack
+// overflow is a fatal process kill, not a recoverable error, so past the cap
+// a nested body gracefully degrades to a JSON string of its raw text.
+const gemma4MaxArgsDepth = 100
+
+// decodeGemma4Args decodes one args body (the text between the outer braces
+// of call:name{...}) into a JSON object string. depth is the current nesting
+// level (0 at the payload root); see gemma4MaxArgsDepth.
+func decodeGemma4Args(s string, depth int) string {
+	if depth > gemma4MaxArgsDepth {
+		return gemma4JSONString(s)
+	}
+	var b strings.Builder
+	b.WriteString("{")
+	first := true
+	pair := func(key, val string) {
+		if !first {
+			b.WriteString(",")
+		}
+		first = false
+		b.WriteString(gemma4JSONString(key))
+		b.WriteString(":")
+		b.WriteString(val)
+	}
+	i, n := 0, len(s)
+	for i < n {
+		for i < n && (isGemma4Space(s[i]) || s[i] == ',') {
+			i++
+		}
+		if i >= n {
+			break
+		}
+		keyStart := i
+		for i < n && s[i] != ':' {
+			i++
+		}
+		if i >= n {
+			break // no ':' -> trailing junk, dropped (vLLM does the same)
+		}
+		key := strings.TrimSpace(s[keyStart:i])
+		i++ // skip ':'
+		for i < n && isGemma4Space(s[i]) {
+			i++
+		}
+		if i >= n {
+			pair(key, `""`) // "key:" with nothing after -> empty string
+			break
+		}
+		switch {
+		case strings.HasPrefix(s[i:], gemma4StringDelim):
+			i += len(gemma4StringDelim)
+			if end := strings.Index(s[i:], gemma4StringDelim); end == -1 {
+				pair(key, gemma4JSONString(s[i:])) // unterminated -> take rest
+				i = n
+			} else {
+				pair(key, gemma4JSONString(s[i:i+end]))
+				i += end + len(gemma4StringDelim)
+			}
+		case s[i] == '{':
+			inner, next := scanGemma4Balanced(s, i, '{', '}')
+			pair(key, decodeGemma4Args(inner, depth+1))
+			i = next
+		case s[i] == '[':
+			inner, next := scanGemma4Balanced(s, i, '[', ']')
+			pair(key, decodeGemma4Array(inner, depth+1))
+			i = next
+		default:
+			valStart := i
+			for i < n && s[i] != ',' && s[i] != '}' && s[i] != ']' {
+				i++
+			}
+			if i == valStart {
+				// No progress (value starts on a stray '}'/']'): abort on
+				// malformed input rather than loop, like vLLM.
+				i = n
+				continue
+			}
+			pair(key, decodeGemma4Bare(s[valStart:i]))
+		}
+	}
+	b.WriteString("}")
+	return b.String()
+}
+
+// decodeGemma4Array decodes one array body (the text between '[' and ']')
+// into a JSON array string. depth is the current nesting level; see
+// gemma4MaxArgsDepth.
+func decodeGemma4Array(s string, depth int) string {
+	if depth > gemma4MaxArgsDepth {
+		return gemma4JSONString(s)
+	}
+	var b strings.Builder
+	b.WriteString("[")
+	first := true
+	item := func(val string) {
+		if !first {
+			b.WriteString(",")
+		}
+		first = false
+		b.WriteString(val)
+	}
+	i, n := 0, len(s)
+	for i < n {
+		for i < n && (isGemma4Space(s[i]) || s[i] == ',') {
+			i++
+		}
+		if i >= n {
+			break
+		}
+		switch {
+		case strings.HasPrefix(s[i:], gemma4StringDelim):
+			i += len(gemma4StringDelim)
+			if end := strings.Index(s[i:], gemma4StringDelim); end == -1 {
+				item(gemma4JSONString(s[i:]))
+				i = n
+			} else {
+				item(gemma4JSONString(s[i : i+end]))
+				i += end + len(gemma4StringDelim)
+			}
+		case s[i] == '{':
+			inner, next := scanGemma4Balanced(s, i, '{', '}')
+			item(decodeGemma4Args(inner, depth+1))
+			i = next
+		case s[i] == '[':
+			inner, next := scanGemma4Balanced(s, i, '[', ']')
+			item(decodeGemma4Array(inner, depth+1))
+			i = next
+		default:
+			valStart := i
+			for i < n && s[i] != ',' && s[i] != ']' {
+				i++
+			}
+			if i == valStart {
+				i = n // no progress: abort on malformed input, like vLLM
+				continue
+			}
+			item(decodeGemma4Bare(s[valStart:i]))
+		}
+	}
+	b.WriteString("]")
+	return b.String()
+}
+
+// scanGemma4Balanced scans a brace/bracket-balanced span starting at the
+// opener s[start], skipping over <|"|>-delimited strings so structural
+// characters inside them do not count (vLLM's depth scan). Returns the inner
+// text and the index just past the closer; an unterminated span yields the
+// rest of the string (the inner decoder still extracts what is there - this
+// path is only reachable from genuinely malformed complete payloads).
+func scanGemma4Balanced(s string, start int, open, close byte) (string, int) {
+	depth := 1
+	i := start + 1
+	innerStart := i
+	n := len(s)
+	for i < n && depth > 0 {
+		if strings.HasPrefix(s[i:], gemma4StringDelim) {
+			i += len(gemma4StringDelim)
+			if nd := strings.Index(s[i:], gemma4StringDelim); nd == -1 {
+				i = n
+			} else {
+				i += nd + len(gemma4StringDelim)
+			}
+			continue
+		}
+		switch s[i] {
+		case open:
+			depth++
+		case close:
+			depth--
+		}
+		i++
+	}
+	if depth > 0 {
+		return s[innerStart:], n
+	}
+	return s[innerStart : i-1], i
+}
+
+// decodeGemma4Bare maps an undelimited value to its JSON form: booleans,
+// null aliases (null/none/nil, case-insensitive - the renderer writes
+// Python None as "None", tpl L144-L145 via format_argument's else branch),
+// and numbers (vLLM's rule: a '.' tries float, otherwise int). Anything
+// that fails to parse as a number is emitted as a JSON string.
+func decodeGemma4Bare(raw string) string {
+	v := strings.TrimSpace(raw)
+	if v == "" {
+		return `""`
+	}
+	if v == "true" || v == "false" {
+		return v
+	}
+	switch strings.ToLower(v) {
+	case "null", "none", "nil":
+		return "null"
+	}
+	if strings.Contains(v, ".") {
+		if f, err := strconv.ParseFloat(v, 64); err == nil {
+			return formatGemma4Float(f)
+		}
+	} else if iv, err := strconv.ParseInt(v, 10, 64); err == nil {
+		return strconv.FormatInt(iv, 10)
+	}
+	return gemma4JSONString(v)
+}
+
+// formatGemma4Float renders like Python's json.dumps(float): integral floats
+// keep a ".0" suffix ("108." decodes to 108.0, not 108), so the arguments
+// JSON matches what vLLM would have produced for the same payload.
+func formatGemma4Float(f float64) string {
+	s := strconv.FormatFloat(f, 'g', -1, 64)
+	if !strings.ContainsAny(s, ".eE") {
+		s += ".0"
+	}
+	return s
+}
+
+// gemma4JSONString encodes a JSON string WITHOUT HTML escaping (json.Marshal
+// would escape the angle brackets in "<div>" to \u003c / \u003e sequences;
+// payload text should survive byte-for-byte, like Python's
+// json.dumps(ensure_ascii=False)).
+func gemma4JSONString(s string) string {
+	var sb strings.Builder
+	enc := json.NewEncoder(&sb)
+	enc.SetEscapeHTML(false)
+	if err := enc.Encode(s); err != nil {
+		// Unreachable for plain strings; fall back to default escaping
+		// rather than emitting invalid JSON.
+		b, mErr := json.Marshal(s)
+		if mErr != nil {
+			return `""`
+		}
+		return string(b)
+	}
+	// Encode appends a trailing newline.
+	return strings.TrimSuffix(sb.String(), "\n")
+}
--- a/backend/go/dllm/gemma4_parser_test.go
+++ b/backend/go/dllm/gemma4_parser_test.go
@@ -0,0 +1,592 @@
+package main
+
+// Parser specs for Gemma4Parser (model output text -> pb.ChatDelta events).
+//
+// Fixture provenance:
+//   - Entries marked "vLLM: <name>" are direct ports of the named test from
+//     vLLM PR #45163, tests/tool_parsers/test_gemma4_tool_parser.py (the
+//     authoritative test suite for the gemma4 tool-call wire format). The
+//     streaming tests' chunk lists are reused verbatim as Feed fragments.
+//   - Decoder entries port the TestParseGemma4Args / TestParseGemma4Array
+//     classes from the same file (non-partial mode only; this parser never
+//     decodes partial payloads, see the divergence note in gemma4_parser.go).
+//   - Channel/turn-marker expectations come from the chat template embedded
+//     in gemma4_renderer.go (tpl L356-L362 generation prompt, L148-L158
+//     strip_thinking) and vLLM's Gemma4ReasoningParser
+//     (vllm/reasoning/gemma4_reasoning_parser.py).
+
+import (
+	"encoding/json"
+	"fmt"
+	"strings"
+
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+)
+
+// flatGemma4Tool is one accumulated tool call, mirroring how LocalAI core
+// folds ToolCallDelta streams (pkg/functions/chat_deltas.go
+// ToolCallsFromChatDeltas: name/id latch on first non-empty, arguments
+// concatenate per index). Tests flatten through the same rules so they
+// assert exactly what core will reconstruct.
+type flatGemma4Tool struct {
+	id   string
+	name string
+	args string
+}
+
+func flattenGemma4Deltas(deltas []*pb.ChatDelta) (string, string, []flatGemma4Tool) {
+	var content, reasoning strings.Builder
+	byIndex := map[int32]*flatGemma4Tool{}
+	maxIdx := int32(-1)
+	for _, d := range deltas {
+		content.WriteString(d.GetContent())
+		reasoning.WriteString(d.GetReasoningContent())
+		for _, tc := range d.GetToolCalls() {
+			acc, ok := byIndex[tc.GetIndex()]
+			if !ok {
+				acc = &flatGemma4Tool{}
+				byIndex[tc.GetIndex()] = acc
+			}
+			if tc.GetName() != "" {
+				acc.name = tc.GetName()
+			}
+			if tc.GetId() != "" {
+				acc.id = tc.GetId()
+			}
+			acc.args += tc.GetArguments()
+			if tc.GetIndex() > maxIdx {
+				maxIdx = tc.GetIndex()
+			}
+		}
+	}
+	var tools []flatGemma4Tool
+	for i := int32(0); i <= maxIdx; i++ {
+		if acc, ok := byIndex[i]; ok {
+			tools = append(tools, *acc)
+		}
+	}
+	return content.String(), reasoning.String(), tools
+}
+
+type wantGemma4Tool struct {
+	name     string
+	argsJSON string // compared with MatchJSON (key order irrelevant)
+}
+
+type parseGemma4Case struct {
+	startInThought bool
+	fragments      []string
+	wantContent    string
+	wantReasoning  string
+	wantTools      []wantGemma4Tool
+}
+
+func parseGemma4Fragments(startInThought bool, fragments []string) []*pb.ChatDelta {
+	p := NewGemma4Parser(startInThought)
+	var all []*pb.ChatDelta
+	for _, f := range fragments {
+		all = append(all, p.Feed(f)...)
+	}
+	return append(all, p.Close()...)
+}
+
+var _ = Describe("Gemma4Parser", func() {
+	DescribeTable("parses streamed gemma4 output into ChatDeltas",
+		func(c parseGemma4Case) {
+			content, reasoning, tools := flattenGemma4Deltas(parseGemma4Fragments(c.startInThought, c.fragments))
+			Expect(content).To(Equal(c.wantContent))
+			Expect(reasoning).To(Equal(c.wantReasoning))
+			Expect(tools).To(HaveLen(len(c.wantTools)))
+			seenIDs := map[string]bool{}
+			for i, want := range c.wantTools {
+				Expect(tools[i].name).To(Equal(want.name), "tool %d name", i)
+				Expect(tools[i].args).To(MatchJSON(want.argsJSON), "tool %d arguments", i)
+				Expect(tools[i].id).ToNot(BeEmpty(), "tool %d id", i)
+				Expect(seenIDs).ToNot(HaveKey(tools[i].id), "tool %d id must be unique", i)
+				seenIDs[tools[i].id] = true
+			}
+		},
+
+		// --- (1) pure content -------------------------------------------------
+		// vLLM: test_no_tool_calls
+		Entry("pure content, single fragment", parseGemma4Case{
+			fragments:   []string{"Hello, how can I help you today?"},
+			wantContent: "Hello, how can I help you today?",
+		}),
+
+		// --- (2) thought -> final transition ----------------------------------
+		// enable_thinking render: prompt ends at <|turn>model\n and the model
+		// opens/closes its own thought channel in the OUTPUT (vLLM
+		// Gemma4ReasoningParser docstring; tpl L356-L362). The "thought\n"
+		// role label after <|channel> is structural and must be stripped
+		// (vLLM _THOUGHT_PREFIX handling).
+		Entry("thought channel then final content", parseGemma4Case{
+			fragments:     []string{"<|channel>thought\nLet me think about this.\n<channel|>The answer is 42."},
+			wantReasoning: "Let me think about this.\n",
+			wantContent:   "The answer is 42.",
+		}),
+
+		// --- (3) startInThought both ways -------------------------------------
+		Entry("startInThought=true routes initial text to reasoning until <channel|>", parseGemma4Case{
+			startInThought: true,
+			fragments:      []string{"I am thinking hard.<channel|>Done."},
+			wantReasoning:  "I am thinking hard.",
+			wantContent:    "Done.",
+		}),
+		// A stray <channel|> with no open channel is swallowed, matching the
+		// template's strip_thinking (tpl L148-L158: the marker is dropped,
+		// text on both sides is kept).
+		Entry("startInThought=false keeps the same text as content, stray <channel|> swallowed", parseGemma4Case{
+			startInThought: false,
+			fragments:      []string{"I am thinking hard.<channel|>Done."},
+			wantContent:    "I am thinking hard.Done.",
+		}),
+
+		// --- (4) one tool call, full payload type zoo --------------------------
+		Entry("single tool call: strings, numbers, bools, null, nested object and array", parseGemma4Case{
+			fragments: []string{`<|tool_call>call:complex_function{text:<|"|>with, comma and {braces}<|"|>,count:42,score:3.14,yes:true,no:false,nothing:null,obj:{inner:<|"|>v<|"|>,k:1},arr:[<|"|>a<|"|>,2,true]}<tool_call|>`},
+			wantTools: []wantGemma4Tool{{
+				name:     "complex_function",
+				argsJSON: `{"text":"with, comma and {braces}","count":42,"score":3.14,"yes":true,"no":false,"nothing":null,"obj":{"inner":"v","k":1},"arr":["a",2,true]}`,
+			}},
+		}),
+
+		// --- (5) payload split across 3 fragments ------------------------------
+		Entry("tool-call payload split across three fragments", parseGemma4Case{
+			fragments: []string{
+				"<|tool_call>call:get_weather{loc",
+				`ation:<|"|>Paris, Fra`,
+				`nce<|"|>}<tool_call|>`,
+			},
+			wantTools: []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"Paris, France"}`}},
+		}),
+
+		// --- (6) marker split across fragments ----------------------------------
+		Entry("tool-call open marker split across fragments", parseGemma4Case{
+			fragments: []string{
+				"<|tool_ca",
+				`ll>call:get_weather{location:<|"|>London<|"|>}<tool_call|>`,
+			},
+			wantTools: []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"London"}`}},
+		}),
+		Entry("channel open marker split across fragments", parseGemma4Case{
+			fragments: []string{
+				"<|chan",
+				"nel>thought\ndeep thought<channel|>final",
+			},
+			wantReasoning: "deep thought",
+			wantContent:   "final",
+		}),
+
+		// --- (7) trailing partial marker held, flushed by Close -----------------
+		Entry("trailing partial marker is held back and flushed by Close", parseGemma4Case{
+			fragments:   []string{"Hello <|tool"},
+			wantContent: "Hello <|tool",
+		}),
+
+		// --- (8) malformed/incomplete payload -> content fallback ---------------
+		// vLLM: test_incomplete_tool_call (no end marker: the whole text stays
+		// content, never silently dropped).
+		Entry("incomplete tool payload at Close is emitted as raw content", parseGemma4Case{
+			fragments:   []string{`<|tool_call>call:get_weather{location:<|"|>London`},
+			wantContent: `<|tool_call>call:get_weather{location:<|"|>London`,
+		}),
+		Entry("malformed complete payload is emitted as raw content, parsing continues", parseGemma4Case{
+			fragments:   []string{"<|tool_call>oops no call syntax<tool_call|> done"},
+			wantContent: "<|tool_call>oops no call syntax<tool_call|> done",
+		}),
+
+		// --- (9) <turn|> ends the turn -------------------------------------------
+		Entry("text after <turn|> is ignored, including later fragments", parseGemma4Case{
+			fragments: []string{
+				"before<turn|>after",
+				`more <|tool_call>call:f{}<tool_call|>`,
+			},
+			wantContent: "before",
+		}),
+		Entry("<turn|> inside a thought channel ends the turn", parseGemma4Case{
+			startInThought: true,
+			fragments:      []string{"thinking<turn|>ignored"},
+			wantReasoning:  "thinking",
+		}),
+
+		// --- (10) ported vLLM non-streaming cases ---------------------------------
+		// vLLM: test_single_tool_call
+		Entry("vLLM: test_single_tool_call", parseGemma4Case{
+			fragments: []string{`<|tool_call>call:get_weather{location:<|"|>London<|"|>}<tool_call|>`},
+			wantTools: []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"London"}`}},
+		}),
+		// vLLM: test_multiple_arguments
+		Entry("vLLM: test_multiple_arguments", parseGemma4Case{
+			fragments: []string{`<|tool_call>call:get_weather{location:<|"|>San Francisco<|"|>,unit:<|"|>celsius<|"|>}<tool_call|>`},
+			wantTools: []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"San Francisco","unit":"celsius"}`}},
+		}),
+		// vLLM: test_text_before_tool_call. DIVERGENCE: vLLM's non-streaming
+		// extractor trims the content ("...you."); a streaming parser cannot
+		// retroactively trim already-emitted text, so the trailing space is
+		// kept (vLLM's own streaming path keeps it too, see
+		// test_streaming_text_before_tool_call which only checks a prefix).
+		Entry("vLLM: test_text_before_tool_call (streaming semantics: no trim)", parseGemma4Case{
+			fragments:   []string{`Let me check the weather for you. <|tool_call>call:get_weather{location:<|"|>Paris<|"|>}<tool_call|>`},
+			wantContent: "Let me check the weather for you. ",
+			wantTools:   []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"Paris"}`}},
+		}),
+		// vLLM: test_multiple_tool_calls (also covers case 11: multi-tool sequence)
+		Entry("vLLM: test_multiple_tool_calls", parseGemma4Case{
+			fragments: []string{`<|tool_call>call:get_weather{location:<|"|>London<|"|>}<tool_call|><|tool_call>call:get_time{location:<|"|>London<|"|>}<tool_call|>`},
+			wantTools: []wantGemma4Tool{
+				{name: "get_weather", argsJSON: `{"location":"London"}`},
+				{name: "get_time", argsJSON: `{"location":"London"}`},
+			},
+		}),
+		// vLLM: test_nested_arguments
+		Entry("vLLM: test_nested_arguments", parseGemma4Case{
+			fragments: []string{`<|tool_call>call:complex_function{nested:{inner:<|"|>value<|"|>},list:[<|"|>a<|"|>,<|"|>b<|"|>]}<tool_call|>`},
+			wantTools: []wantGemma4Tool{{name: "complex_function", argsJSON: `{"nested":{"inner":"value"},"list":["a","b"]}`}},
+		}),
+		// vLLM: test_tool_call_with_number_and_boolean
+		Entry("vLLM: test_tool_call_with_number_and_boolean", parseGemma4Case{
+			fragments: []string{`<|tool_call>call:set_status{is_active:true,count:42,score:3.14}<tool_call|>`},
+			wantTools: []wantGemma4Tool{{name: "set_status", argsJSON: `{"is_active":true,"count":42,"score":3.14}`}},
+		}),
+		// vLLM: test_hyphenated_function_name
+		Entry("vLLM: test_hyphenated_function_name", parseGemma4Case{
+			fragments: []string{`<|tool_call>call:get-weather{location:<|"|>London<|"|>}<tool_call|>`},
+			wantTools: []wantGemma4Tool{{name: "get-weather", argsJSON: `{"location":"London"}`}},
+		}),
+		// vLLM: test_dotted_function_name
+		Entry("vLLM: test_dotted_function_name", parseGemma4Case{
+			fragments: []string{`<|tool_call>call:weather.get{location:<|"|>London<|"|>}<tool_call|>`},
+			wantTools: []wantGemma4Tool{{name: "weather.get", argsJSON: `{"location":"London"}`}},
+		}),
+		// vLLM: test_no_arguments
+		Entry("vLLM: test_no_arguments", parseGemma4Case{
+			fragments: []string{"<|tool_call>call:get_status{}<tool_call|>"},
+			wantTools: []wantGemma4Tool{{name: "get_status", argsJSON: `{}`}},
+		}),
+
+		// --- ported vLLM streaming cases (chunk lists reused as fragments) --------
+		// vLLM: test_basic_streaming_single_tool
+		Entry("vLLM: test_basic_streaming_single_tool", parseGemma4Case{
+			fragments: []string{
+				"<|tool_call>",
+				"call:get_weather{",
+				`location:<|"|>Paris`,
+				", France",
+				`<|"|>}`,
+				"<tool_call|>",
+			},
+			wantTools: []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"Paris, France"}`}},
+		}),
+		// vLLM: test_streaming_multi_arg
+		Entry("vLLM: test_streaming_multi_arg", parseGemma4Case{
+			fragments: []string{
+				"<|tool_call>",
+				"call:get_weather{",
+				`location:<|"|>Tokyo<|"|>,`,
+				`unit:<|"|>celsius<|"|>}`,
+				"<tool_call|>",
+			},
+			wantTools: []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"Tokyo","unit":"celsius"}`}},
+		}),
+		// vLLM: test_streaming_text_before_tool_call
+		Entry("vLLM: test_streaming_text_before_tool_call", parseGemma4Case{
+			fragments: []string{
+				"Let me check ",
+				"the weather. ",
+				"<|tool_call>",
+				"call:get_weather{",
+				`location:<|"|>London<|"|>}`,
+				"<tool_call|>",
+			},
+			wantContent: "Let me check the weather. ",
+			wantTools:   []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"London"}`}},
+		}),
+		// vLLM: test_streaming_numeric_args
+		Entry("vLLM: test_streaming_numeric_args", parseGemma4Case{
+			fragments: []string{
+				"<|tool_call>",
+				"call:set_config{",
+				"count:42,",
+				"active:true}",
+				"<tool_call|>",
+			},
+			wantTools: []wantGemma4Tool{{name: "set_config", argsJSON: `{"count":42,"active":true}`}},
+		}),
+		// vLLM: test_streaming_boolean_split_across_chunks
+		Entry("vLLM: test_streaming_boolean_split_across_chunks", parseGemma4Case{
+			fragments: []string{
+				"<|tool_call>",
+				"call:search{input:{all:tru",
+				"e}}",
+				"<tool_call|>",
+			},
+			wantTools: []wantGemma4Tool{{name: "search", argsJSON: `{"input":{"all":true}}`}},
+		}),
+		// vLLM: test_streaming_false_split_across_chunks
+		Entry("vLLM: test_streaming_false_split_across_chunks", parseGemma4Case{
+			fragments: []string{
+				"<|tool_call>",
+				"call:set{flag:fals",
+				"e}",
+				"<tool_call|>",
+			},
+			wantTools: []wantGemma4Tool{{name: "set", argsJSON: `{"flag":false}`}},
+		}),
+		// vLLM: test_streaming_number_split_across_chunks
+		Entry("vLLM: test_streaming_number_split_across_chunks", parseGemma4Case{
+			fragments: []string{
+				"<|tool_call>",
+				"call:set{count:4",
+				"2}",
+				"<tool_call|>",
+			},
+			wantTools: []wantGemma4Tool{{name: "set", argsJSON: `{"count":42}`}},
+		}),
+		// vLLM: test_streaming_empty_args
+		Entry("vLLM: test_streaming_empty_args", parseGemma4Case{
+			fragments: []string{
+				"<|tool_call>",
+				"call:get_status{}",
+				"<tool_call|>",
+			},
+			wantTools: []wantGemma4Tool{{name: "get_status", argsJSON: `{}`}},
+		}),
+		// vLLM: test_streaming_split_delimiter_no_invalid_json (a string
+		// delimiter <|"|> split across fragments must not leak into the args).
+		Entry("vLLM: test_streaming_split_delimiter_no_invalid_json", parseGemma4Case{
+			fragments: []string{
+				"<|tool_call>",
+				"call:todowrite{",
+				`content:<|"|>Buy milk<|`,
+				`"|>}`,
+				"<tool_call|>",
+			},
+			wantTools: []wantGemma4Tool{{name: "todowrite", argsJSON: `{"content":"Buy milk"}`}},
+		}),
+		// vLLM: test_streaming_does_not_duplicate_plain_text_after_tool_call
+		Entry("vLLM: test_streaming_does_not_duplicate_plain_text_after_tool_call", parseGemma4Case{
+			fragments: []string{
+				"<|tool_call>",
+				"call:get_weather{",
+				`location:<|"|>Paris<|"|>}`,
+				"<tool_call|><",
+				"div>",
+			},
+			wantContent: "<div>",
+			wantTools:   []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"Paris"}`}},
+		}),
+		// vLLM: test_streaming_html_argument_does_not_duplicate_tag_prefixes
+		Entry("vLLM: test_streaming_html_argument_does_not_duplicate_tag_prefixes", parseGemma4Case{
+			fragments: []string{
+				"<|tool_call>",
+				"call:write_file{",
+				`path:<|"|>index.html<|"|>,`,
+				`content:<|"|><!DOCTYPE html>` + "\n<",
+				`html lang="zh-CN">` + "\n<",
+				"head>\n    <",
+				`meta charset="UTF-8">` + "\n    <",
+				`meta name="viewport" content="width=device-width">` + "\n",
+				`<|"|>}`,
+				"<tool_call|>",
+			},
+			wantTools: []wantGemma4Tool{{
+				name:     "write_file",
+				argsJSON: `{"path":"index.html","content":"<!DOCTYPE html>\n<html lang=\"zh-CN\">\n<head>\n    <meta charset=\"UTF-8\">\n    <meta name=\"viewport\" content=\"width=device-width\">\n"}`,
+			}},
+		}),
+		// vLLM: test_streaming_single_chunk_complete_tool_call
+		Entry("vLLM: test_streaming_single_chunk_complete_tool_call", parseGemma4Case{
+			fragments: []string{`<|tool_call>call:name_a_color{color_hex:<|"|>00ff11<|"|>}<tool_call|>`},
+			wantTools: []wantGemma4Tool{{name: "name_a_color", argsJSON: `{"color_hex":"00ff11"}`}},
+		}),
+		// vLLM: test_streaming_multi_chunk_batched_tool_calls (two complete
+		// calls in ONE fragment; both must come out with distinct indices)
+		Entry("vLLM: test_streaming_multi_chunk_batched_tool_calls", parseGemma4Case{
+			fragments: []string{
+				`<|tool_call>call:get_weather{location:<|"|>London<|"|>}<tool_call|>` +
+					`<|tool_call>call:get_time{timezone:<|"|>GMT<|"|>}<tool_call|>`,
+			},
+			wantTools: []wantGemma4Tool{
+				{name: "get_weather", argsJSON: `{"location":"London"}`},
+				{name: "get_time", argsJSON: `{"timezone":"GMT"}`},
+			},
+		}),
+		// vLLM: test_streaming_trailing_bare_bool_not_duplicated
+		Entry("vLLM: test_streaming_trailing_bare_bool_not_duplicated", parseGemma4Case{
+			fragments: []string{
+				"<|tool_call>",
+				"call:Edit{",
+				`file_path:<|"|>src/env.py<|"|>,`,
+				`old_string:<|"|>old_val<|"|>,`,
+				`new_string:<|"|>new_val<|"|>,`,
+				"replace_all:",
+				"false}",
+				"<tool_call|>",
+			},
+			wantTools: []wantGemma4Tool{{
+				name:     "Edit",
+				argsJSON: `{"file_path":"src/env.py","old_string":"old_val","new_string":"new_val","replace_all":false}`,
+			}},
+		}),
+
+		// --- implicit reasoning end on <|tool_call> (vLLM is_reasoning_end:
+		// a tool_call token means reasoning is over) -----------------------------
+		Entry("tool call inside an open thought channel ends the reasoning", parseGemma4Case{
+			startInThought: true,
+			fragments:      []string{`need the weather<|tool_call>call:get_weather{location:<|"|>Rome<|"|>}<tool_call|>`},
+			wantReasoning:  "need the weather",
+			wantTools:      []wantGemma4Tool{{name: "get_weather", argsJSON: `{"location":"Rome"}`}},
+		}),
+
+		// --- (12) empty fragments are no-ops --------------------------------------
+		Entry("empty fragments are no-ops", parseGemma4Case{
+			fragments:   []string{"", "Hello", "", "", " world", ""},
+			wantContent: "Hello world",
+		}),
+	)
+
+	It("returns no deltas for an empty fragment and after Close", func() {
+		p := NewGemma4Parser(false)
+		Expect(p.Feed("")).To(BeEmpty())
+		Expect(p.Feed("hi")).ToNot(BeEmpty())
+		Expect(p.Close()).To(BeEmpty()) // nothing held back
+		// The parser is finished after Close: further input is dropped.
+		Expect(p.Feed("more")).To(BeEmpty())
+		Expect(p.Close()).To(BeEmpty())
+	})
+
+	It("generates index-based tool call ids (call_<index>)", func() {
+		// Mirrors the index-based id convention of pkg/grpc/rich_test.go and
+		// keeps ids deterministic for the split-invariance property below.
+		deltas := parseGemma4Fragments(false, []string{
+			`<|tool_call>call:a{}<tool_call|><|tool_call>call:b{}<tool_call|>`,
+		})
+		_, _, tools := flattenGemma4Deltas(deltas)
+		Expect(tools).To(HaveLen(2))
+		Expect(tools[0].id).To(Equal("call_0"))
+		Expect(tools[1].id).To(Equal("call_1"))
+	})
+
+	// Property: for a fixed full output, EVERY 2-split position must yield
+	// exactly the same flattened result as the unsplit parse. This kills
+	// fragment-boundary bugs (mid-marker, mid-delimiter, mid-payload splits).
+	DescribeTable("2-split fragment invariance",
+		func(startInThought bool, full string) {
+			refContent, refReasoning, refTools := flattenGemma4Deltas(
+				parseGemma4Fragments(startInThought, []string{full}))
+			for i := 0; i <= len(full); i++ {
+				content, reasoning, tools := flattenGemma4Deltas(
+					parseGemma4Fragments(startInThought, []string{full[:i], full[i:]}))
+				Expect(content).To(Equal(refContent), fmt.Sprintf("content diverged at split %d", i))
+				Expect(reasoning).To(Equal(refReasoning), fmt.Sprintf("reasoning diverged at split %d", i))
+				Expect(tools).To(Equal(refTools), fmt.Sprintf("tool calls diverged at split %d", i))
+			}
+		},
+		Entry("thought + content + two tool calls + turn end", false,
+			"<|channel>thought\nPondering the request...\n<channel|>Sure - calling tools now. "+
+				`<|tool_call>call:get_weather{location:<|"|>Paris, France<|"|>,unit:<|"|>celsius<|"|>,days:3,detailed:true}<tool_call|>`+
+				`<|tool_call>call:get_time{timezone:<|"|>Europe/Lisbon<|"|>,nested:{flag:false,vals:[1,2.5,<|"|>x<|"|>]}}<tool_call|>`+
+				"Done.<turn|>ignored tail"),
+		Entry("startInThought + tool call + trailing partial marker", true,
+			`Deep thought<channel|>final answer <|tool_call>call:noop{}<tool_call|> trailing <|tool`),
+		Entry("malformed payload fallback", false,
+			`pre <|tool_call>not a call<tool_call|> post`),
+	)
+})
+
+// Decoder-level ports of vLLM's TestParseGemma4Args / TestParseGemma4Array
+// (non-partial mode; the partial-withholding tests do not apply because this
+// parser only ever decodes COMPLETE payloads, see gemma4_parser.go).
+var _ = Describe("decodeGemma4Args", func() {
+	DescribeTable("decodes the gemma4 call syntax into JSON arguments",
+		func(in, wantJSON string) {
+			Expect(decodeGemma4Args(in, 0)).To(MatchJSON(wantJSON))
+		},
+		// vLLM: test_empty_string / test_whitespace_only
+		Entry("empty string", "", `{}`),
+		Entry("whitespace only", "   ", `{}`),
+		// vLLM: test_single_string_value
+		Entry("single string value", `location:<|"|>Paris<|"|>`, `{"location":"Paris"}`),
+		// vLLM: test_string_value_with_comma
+		Entry("string value with comma", `location:<|"|>Paris, France<|"|>`, `{"location":"Paris, France"}`),
+		// vLLM: test_multiple_string_values
+		Entry("multiple string values", `location:<|"|>San Francisco<|"|>,unit:<|"|>celsius<|"|>`, `{"location":"San Francisco","unit":"celsius"}`),
+		// vLLM: test_integer_value / test_float_value
+		Entry("integer value", "count:42", `{"count":42}`),
+		Entry("float value", "score:3.14", `{"score":3.14}`),
+		// vLLM: test_boolean_true / test_boolean_false
+		Entry("boolean true", "flag:true", `{"flag":true}`),
+		Entry("boolean false", "flag:false", `{"flag":false}`),
+		// vLLM: test_null_value (bare null must become JSON null, not "null")
+		Entry("null value", "param:null", `{"param":null}`),
+		// vLLM: test_mixed_types
+		Entry("mixed types", `name:<|"|>test<|"|>,count:42,active:true,score:3.14`,
+			`{"name":"test","count":42,"active":true,"score":3.14}`),
+		// vLLM: test_nested_object
+		Entry("nested object", `nested:{inner:<|"|>value<|"|>}`, `{"nested":{"inner":"value"}}`),
+		// vLLM: test_array_of_strings
+		Entry("array of strings", `items:[<|"|>a<|"|>,<|"|>b<|"|>]`, `{"items":["a","b"]}`),
+		// vLLM: test_unterminated_string (take everything after the delimiter)
+		Entry("unterminated string", `key:<|"|>unterminated`, `{"key":"unterminated"}`),
+		// vLLM: test_empty_value (key with no value after colon)
+		Entry("empty value", "key:", `{"key":""}`),
+		// vLLM: test_trailing_dot_float_partial_withheld, non-partial branch
+		// (trailing-dot floats parse normally outside streaming).
+		Entry("trailing dot float, complete payload", "left:108.,right:22.8", `{"left":108.0,"right":22.8}`),
+	)
+
+	It("terminates and yields valid JSON on malformed input", func() {
+		// vLLM: test_malformed_partial_array (the assertion there is only
+		// "returns a dict without hanging"; ours is "valid JSON object").
+		out := decodeGemma4Args(":[t:[]", 0)
+		var v map[string]any
+		Expect(json.Unmarshal([]byte(out), &v)).To(Succeed())
+	})
+
+	It("degrades nesting beyond the recursion cap to a string value", func() {
+		// 200 levels of a:{a:{...a:1...}}. Without the depth cap the mutual
+		// recursion would grow the stack with the model's output; a Go stack
+		// overflow is a fatal process kill, so levels past gemma4MaxArgsDepth
+		// must gracefully fall back to the raw inner text as a JSON string.
+		const depth = 200
+		body := strings.Repeat("a:{", depth-1) + "a:1" + strings.Repeat("}", depth-1)
+		out := decodeGemma4Args(body, 0)
+		var v map[string]any
+		Expect(json.Unmarshal([]byte(out), &v)).To(Succeed())
+		levels := 0
+		var cur any = v
+		for {
+			m, ok := cur.(map[string]any)
+			if !ok {
+				break
+			}
+			Expect(m).To(HaveKey("a"))
+			cur = m["a"]
+			levels++
+		}
+		Expect(levels).To(Equal(gemma4MaxArgsDepth + 1))
+		Expect(cur).To(BeAssignableToTypeOf(""))
+		Expect(cur).To(ContainSubstring("a:{"))
+	})
+})
+
+var _ = Describe("decodeGemma4Array", func() {
+	DescribeTable("decodes gemma4 array bodies into JSON arrays",
+		func(in, wantJSON string) {
+			Expect(decodeGemma4Array(in, 0)).To(MatchJSON(wantJSON))
+		},
+		// vLLM: test_string_array / test_empty_array / test_bare_values
+		Entry("string array", `<|"|>a<|"|>,<|"|>b<|"|>`, `["a","b"]`),
+		Entry("empty array", "", `[]`),
+		Entry("bare values", "42,true,3.14", `[42,true,3.14]`),
+		// vLLM: test_string_element_with_closing_bracket (a ']' inside a
+		// delimited string must not close the array)
+		Entry("string element with closing bracket", `[<|"|>a]b<|"|>,<|"|>c<|"|>],<|"|>tail<|"|>`, `[["a]b","c"],"tail"]`),
+		// vLLM: test_stray_closing_bracket (no-progress abort, keep prefix)
+		Entry("stray closing bracket", "42,]trailing", `[42]`),
+	)
+})
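The depth-cap spec above ("degrades nesting beyond the recursion cap to a string value") can be illustrated with a minimal sketch. This is not the decoder from gemma4_parser.go: `decodeArgs`, `decodeValue`, `splitTopLevel` and the `maxDepth` constant are hypothetical names, and the grammar is deliberately reduced to bare scalars and nested `{...}` bodies (no `<|"|>` strings, no arrays). It only shows how a recursive decoder can bound its stack growth by falling back to the raw inner text.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// maxDepth is a hypothetical stand-in for gemma4MaxArgsDepth.
const maxDepth = 4

// decodeArgs parses "k:v,k:v" where v is a bare scalar or a nested {...}
// body. Malformed pairs (no colon, empty key) are skipped.
func decodeArgs(body string, depth int) map[string]any {
	out := map[string]any{}
	for _, pair := range splitTopLevel(body) {
		key, val, ok := strings.Cut(pair, ":")
		key = strings.TrimSpace(key)
		if !ok || key == "" {
			continue
		}
		out[key] = decodeValue(strings.TrimSpace(val), depth)
	}
	return out
}

// decodeValue recurses into nested objects until the cap is reached; past
// the cap the raw inner text is returned as a plain string instead.
func decodeValue(val string, depth int) any {
	if len(val) >= 2 && strings.HasPrefix(val, "{") && strings.HasSuffix(val, "}") {
		inner := val[1 : len(val)-1]
		if depth+1 > maxDepth {
			return inner // graceful fallback, no further recursion
		}
		return decodeArgs(inner, depth+1)
	}
	// Bare scalars that are valid JSON (numbers, true, false, null) keep
	// their JSON type; anything else is kept as a string.
	var v any
	if err := json.Unmarshal([]byte(val), &v); err == nil {
		return v
	}
	return val
}

// splitTopLevel splits on commas that are not nested inside braces.
func splitTopLevel(s string) []string {
	var parts []string
	level, start := 0, 0
	for i, r := range s {
		switch r {
		case '{':
			level++
		case '}':
			if level > 0 {
				level--
			}
		case ',':
			if level == 0 {
				parts = append(parts, s[start:i])
				start = i + 1
			}
		}
	}
	return append(parts, s[start:])
}

func main() {
	out, _ := json.Marshal(decodeArgs("name:x,count:42,nested:{flag:true}", 0))
	fmt.Println(string(out))
}
```

As in the spec, the cap here allows `maxDepth + 1` decoded object levels before the remainder is kept as text, so the recursion depth depends on the constant rather than on the model's output.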
--- a/backend/go/dllm/gemma4_renderer.go
+++ b/backend/go/dllm/gemma4_renderer.go
--- a/backend/go/dllm/gemma4_renderer_test.go
+++ b/backend/go/dllm/gemma4_renderer_test.go
@@ -0,0 +1,406 @@
+package main
+
+// Renderer specs for RenderGemma4 against the canonical gemma4 chat template
+// (see the normative template comment in gemma4_renderer.go).
+//
+// Fixture provenance:
+//   - "single user message" and "enable_thinking" are the EXACT expected
+//     decodes from transformers tests/models/diffusion_gemma/
+//     test_modeling_diffusion_gemma.py (test_diffusion_gemma_chat_template
+//     and ..._with_thinking) with ONE difference: the transformers fixtures
+//     start with "<bos>" because apply_chat_template tokenizes the rendered
+//     text with add_bos. Our prompt goes through dllm_capi_generate, whose
+//     run_generate already tokenizes with prepend_bos = vocab.add_bos
+//     (dllm.cpp src/capi.cpp:230-231, true for gemma4), so the renderer must
+//     NOT emit a literal <bos> (it would double) and every expected string
+//     here drops that leading token.
+//   - All other expected strings were produced by rendering the verbatim
+//     GGUF template with jinja2 3.1.2 (bos_token="<bos>") and dropping the
+//     leading "<bos>" for the same reason.
+
+import (
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+)
+
+// Two-function tools array used by the tool fixtures (OpenAI wire shape, as
+// LocalAI passes it through PredictOptions.Tools).
+const testToolsJSON = `[{"type":"function","function":{"name":"get_weather","description":"Get the current weather in a location.","parameters":{"type":"object","properties":{"location":{"type":"string","description":"The city name."},"unit":{"type":"string","enum":["celsius","fahrenheit"]}},"required":["location"]}}},{"type":"function","function":{"name":"get_time","description":"Get the current time in a timezone.","parameters":{"type":"object","properties":{"timezone":{"type":"string","description":"IANA timezone name."}},"required":["timezone"]}}}]`
+
+// The <|tool>...<tool|> block the template renders for testToolsJSON inside
+// the system turn (jinja2-verified).
+const testToolsBlock = `<|tool>declaration:get_weather{description:<|"|>Get the current weather in a location.<|"|>,parameters:{properties:{location:{description:<|"|>The city name.<|"|>,type:<|"|>STRING<|"|>},unit:{enum:[<|"|>celsius<|"|>,<|"|>fahrenheit<|"|>],type:<|"|>STRING<|"|>}},required:[<|"|>location<|"|>],type:<|"|>OBJECT<|"|>}}<tool|><|tool>declaration:get_time{description:<|"|>Get the current time in a timezone.<|"|>,parameters:{properties:{timezone:{description:<|"|>IANA timezone name.<|"|>,type:<|"|>STRING<|"|>}},required:[<|"|>timezone<|"|>],type:<|"|>OBJECT<|"|>}}<tool|>`
+
+// A single tool exercising the deep format_parameters branches: array items
+// (string-typed and nested-array), nullable, enum+nullable, nested object
+// properties/required, and a response declaration.
+const complexToolsJSON = `[{"type":"function","function":{"name":"complex_tool","description":"A complex tool.","parameters":{"type":"object","properties":{"tags":{"type":"array","description":"Tags.","items":{"type":"string"}},"matrix":{"type":"array","items":{"type":"array","items":{"type":"number"}}},"opts":{"type":"object","description":"Options.","properties":{"depth":{"type":"integer","nullable":true}},"required":["depth"]},"mode":{"type":"string","enum":["a","b"],"nullable":true}},"required":["tags","opts"]},"response":{"description":"The result.","type":"object"}}}]`
+
+// jinja2-verified render of complexToolsJSON. Notable template quirks pinned
+// here: nested array items go through format_argument with ESCAPED keys and
+// an un-uppercased type (<|"|>type<|"|>:<|"|>number<|"|>), while direct item
+// types are uppercased; properties dictsort case-insensitively.
+const complexToolsBlock = `<|tool>declaration:complex_tool{description:<|"|>A complex tool.<|"|>,parameters:{properties:{matrix:{items:{items:{<|"|>type<|"|>:<|"|>number<|"|>},type:<|"|>ARRAY<|"|>},type:<|"|>ARRAY<|"|>},mode:{enum:[<|"|>a<|"|>,<|"|>b<|"|>],nullable:true,type:<|"|>STRING<|"|>},opts:{description:<|"|>Options.<|"|>,properties:{depth:{nullable:true,type:<|"|>INTEGER<|"|>}},required:[<|"|>depth<|"|>],type:<|"|>OBJECT<|"|>},tags:{description:<|"|>Tags.<|"|>,items:{type:<|"|>STRING<|"|>},type:<|"|>ARRAY<|"|>}},required:[<|"|>tags<|"|>,<|"|>opts<|"|>],type:<|"|>OBJECT<|"|>},response:{description:<|"|>The result.<|"|>,type:<|"|>OBJECT<|"|>}}<tool|>`
+
+type renderGemma4Case struct {
+	msgs      []*pb.Message
+	toolsJSON string
+	// nImages mirrors len(PredictOptions.Images): the OpenAI layer strips
+	// image content parts out of the messages, so the renderer re-injects
+	// one engine marker per image on the last user message (see the IMAGE
+	// NOTE on RenderGemma4).
+	nImages            int
+	enableThinking     bool
+	noGenerationPrompt bool // inverted so the zero value is the common case
+	expected           string
+}
+
+var _ = Describe("RenderGemma4", func() {
+	DescribeTable("renders the canonical gemma4 prompt",
+		func(c renderGemma4Case) {
+			out, err := RenderGemma4(c.msgs, c.toolsJSON, c.nImages, c.enableThinking, !c.noGenerationPrompt)
+			Expect(err).ToNot(HaveOccurred())
+			Expect(out).To(Equal(c.expected))
+			// The C-ABI generate prepends BOS itself: a literal <bos>
+			// anywhere in the rendered prompt would double-encode it.
+			Expect(out).ToNot(ContainSubstring("<bos>"))
+		},
+
+		// transformers fixture (test_diffusion_gemma_chat_template), sans <bos>:
+		// with thinking off (the default) the generation prompt pre-opens
+		// an EMPTY thought channel.
+		Entry("single user message, default (no thinking)", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "Write a long essay about Portugal."},
+			},
+			expected: "<|turn>user\nWrite a long essay about Portugal.<turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
+		}),
+
+		// transformers fixture (test_diffusion_gemma_chat_template_with_thinking),
+		// sans <bos>: a system turn carrying <|think|> and NO auto-opened
+		// thought channel.
+		Entry("enable_thinking=true", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "Write a long essay about Portugal."},
+			},
+			enableThinking: true,
+			expected:       "<|turn>system\n<|think|>\n<turn|>\n<|turn>user\nWrite a long essay about Portugal.<turn|>\n<|turn>model\n",
+		}),
+
+		Entry("multi-turn user/assistant/user", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "Hello, who are you?"},
+				{Role: "assistant", Content: "I am Gemma, a helpful assistant."},
+				{Role: "user", Content: "Tell me a joke."},
+			},
+			expected: "<|turn>user\nHello, who are you?<turn|>\n<|turn>model\nI am Gemma, a helpful assistant.<turn|>\n<|turn>user\nTell me a joke.<turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
+		}),
+
+		// tpl L178-L195: a leading system message is folded into the system
+		// turn (trimmed) and consumed from the loop.
+		Entry("system message folds into the system turn", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "system", Content: "You are a pirate."},
+				{Role: "user", Content: "Hello!"},
+			},
+			expected: "<|turn>system\nYou are a pirate.<turn|>\n<|turn>user\nHello!<turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
+		}),
+
+		// tpl L182-L185: <|think|> goes at the very top of the SAME system
+		// turn, before the system prompt text.
+		Entry("system message with enable_thinking shares the turn", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "system", Content: "You are a pirate."},
+				{Role: "user", Content: "Hello!"},
+			},
+			enableThinking: true,
+			expected:       "<|turn>system\n<|think|>\nYou are a pirate.<turn|>\n<|turn>user\nHello!<turn|>\n<|turn>model\n",
+		}),
+
+		// tpl L196-L203: tool declarations render in the system turn, one
+		// <|tool>declaration:...<tool|> block per tool, no separators.
+		Entry("tools array (two functions)", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "What is the weather in Tokyo?"},
+			},
+			toolsJSON: testToolsJSON,
+			expected:  "<|turn>system\n" + testToolsBlock + "<turn|>\n<|turn>user\nWhat is the weather in Tokyo?<turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
+		}),
+
+		// format_parameters deep branches (tpl L1-L85) + response declaration
+		// (tpl L106-L116).
+		Entry("complex tool schema (array items, nullable, nested object, response)", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "go"},
+			},
+			toolsJSON: complexToolsJSON,
+			expected:  "<|turn>system\n" + complexToolsBlock + "<turn|>\n<|turn>user\ngo<turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
+		}),
+
+		// tpl L243-L313: assistant tool_calls render as
+		// <|tool_call>call:name{args}<tool_call|>; the following role=tool
+		// message renders inline as <|tool_response>response:name{value:..}
+		// <tool_response|>; the model turn stays OPEN (no <turn|>, no new
+		// generation prompt) so the model continues after the response.
+		Entry("assistant tool_calls + role=tool result", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "What is the weather in Tokyo?"},
+				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"call_1","type":"function","function":{"name":"get_weather","arguments":"{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"}}]`},
+				{Role: "tool", ToolCallId: "call_1", Content: "Sunny, 22 degrees celsius."},
+			},
+			toolsJSON: testToolsJSON,
+			expected:  "<|turn>system\n" + testToolsBlock + "<turn|>\n<|turn>user\nWhat is the weather in Tokyo?<turn|>\n<|turn>model\n" + `<|tool_call>call:get_weather{location:<|"|>Tokyo<|"|>,unit:<|"|>celsius<|"|>}<tool_call|><|tool_response>response:get_weather{value:<|"|>Sunny, 22 degrees celsius.<|"|>}<tool_response|>`,
+		}),
+
+		// tpl L348-L349: a tool_calls turn with no rendered responses ends
+		// on an OPEN <|tool_response> marker for the runtime to fill, and
+		// add_generation_prompt adds nothing (tpl L357).
+		Entry("assistant tool_calls without a result leaves <|tool_response> open", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "What is the weather in Tokyo?"},
+				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"call_1","type":"function","function":{"name":"get_weather","arguments":"{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"}}]`},
+			},
+			toolsJSON: testToolsJSON,
+			expected:  "<|turn>system\n" + testToolsBlock + "<turn|>\n<|turn>user\nWhat is the weather in Tokyo?<turn|>\n<|turn>model\n" + `<|tool_call>call:get_weather{location:<|"|>Tokyo<|"|>,unit:<|"|>celsius<|"|>}<tool_call|><|tool_response>`,
+		}),
+
+		// tpl L237-L241: reasoning_content renders as a thought channel only
+		// on a tool-calling turn after the last user message.
+		Entry("reasoning_content with tool_calls renders the thought channel", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "weather?"},
+				{Role: "assistant", Content: "", ReasoningContent: "I should call the tool", ToolCalls: `[{"index":0,"id":"c1","type":"function","function":{"name":"get_weather","arguments":"{\"location\":\"Tokyo\"}"}}]`},
+				{Role: "tool", ToolCallId: "c1", Content: "Sunny"},
+			},
+			expected: "<|turn>user\nweather?<turn|>\n<|turn>model\n<|channel>thought\nI should call the tool\n<channel|>" + `<|tool_call>call:get_weather{location:<|"|>Tokyo<|"|>}<tool_call|><|tool_response>response:get_weather{value:<|"|>Sunny<|"|>}<tool_response|>`,
+		}),
+
+		// tpl L220-L235: the assistant answer following its own tool round
+		// continues the SAME model turn (no second <|turn>model).
+		Entry("tool round then final assistant answer then user", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "weather?"},
+				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"c1","type":"function","function":{"name":"get_weather","arguments":"{\"location\":\"Tokyo\"}"}}]`},
+				{Role: "tool", ToolCallId: "c1", Content: "Sunny"},
+				{Role: "assistant", Content: "It is sunny."},
+				{Role: "user", Content: "thanks"},
+			},
+			expected: "<|turn>user\nweather?<turn|>\n<|turn>model\n" + `<|tool_call>call:get_weather{location:<|"|>Tokyo<|"|>}<tool_call|><|tool_response>response:get_weather{value:<|"|>Sunny<|"|>}<tool_response|>` + "It is sunny.<turn|>\n<|turn>user\nthanks<turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
+		}),
+
+		// format_argument (tpl L118-L147): numbers keep their JSON literal,
+		// booleans lower-case, nested maps have unquoted dictsorted keys,
+		// arrays bracketed; top-level args are dictsorted case-insensitively.
+		Entry("tool_call argument types (number/bool/nested/array)", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "go"},
+				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"c1","type":"function","function":{"name":"f","arguments":"{\"count\":42,\"ratio\":3.5,\"flag\":true,\"off\":false,\"nested\":{\"x\":\"y\",\"n\":7},\"list\":[\"a\",1,true]}"}}]`},
+			},
+			expected: "<|turn>user\ngo<turn|>\n<|turn>model\n" + `<|tool_call>call:f{count:42,flag:true,list:[<|"|>a<|"|>,1,true],nested:{n:7,x:<|"|>y<|"|>},off:false,ratio:3.5}<tool_call|><|tool_response>`,
+		}),
+
+		// jinja dictsort is case-insensitive: alpha sorts before Beta.
+		Entry("tool_call argument dictsort is case-insensitive", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "go"},
+				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"c1","type":"function","function":{"name":"f","arguments":"{\"Beta\":1,\"alpha\":2}"}}]`},
+			},
+			expected: "<|turn>user\ngo<turn|>\n<|turn>model\n<|tool_call>call:f{alpha:2,Beta:1}<tool_call|><|tool_response>",
+		}),
+
+		// jinja renders Python None as "None" (round-trips through vLLM's
+		// parser, which lowers "none" back to null).
+		Entry("tool_call null argument renders as None", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "go"},
+				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"c1","type":"function","function":{"name":"f","arguments":"{\"maybe\":null}"}}]`},
+			},
+			expected: "<|turn>user\ngo<turn|>\n<|turn>model\n<|tool_call>call:f{maybe:None}<tool_call|><|tool_response>",
+		}),
+
+		Entry("tool_call empty arguments render empty braces", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "go"},
+				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"c1","type":"function","function":{"name":"f","arguments":"{}"}}]`},
+			},
+			expected: "<|turn>user\ngo<turn|>\n<|turn>model\n<|tool_call>call:f{}<tool_call|><|tool_response>",
+		}),
+
+		// tpl L253-L254: a non-object arguments string renders verbatim.
+		Entry("tool_call non-object string arguments render verbatim", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "go"},
+				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"c1","type":"function","function":{"name":"f","arguments":"just text"}}]`},
+			},
+			expected: "<|turn>user\ngo<turn|>\n<|turn>model\n<|tool_call>call:f{just text}<tool_call|><|tool_response>",
+		}),
+
+		// tpl L278-L285: unmatched tool_call_id falls back to the tool
+		// message's own name.
+		Entry("tool result name falls back when tool_call_id does not match", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "go"},
+				{Role: "assistant", Content: "", ToolCalls: `[{"index":0,"id":"c1","type":"function","function":{"name":"f","arguments":"{}"}}]`},
+				{Role: "tool", ToolCallId: "OTHER", Name: "named_tool", Content: "out"},
+			},
+			expected: "<|turn>user\ngo<turn|>\n<|turn>model\n" + `<|tool_call>call:f{}<tool_call|><|tool_response>response:named_tool{value:<|"|>out<|"|>}<tool_response|>`,
+		}),
+
+		// strip_thinking (tpl L148-L158): historical assistant content loses
+		// its <|channel>...<channel|> spans.
+		Entry("assistant content thinking channels are stripped", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "hi"},
+				{Role: "assistant", Content: "<|channel>thought\nsecret\n<channel|>visible answer"},
+				{Role: "user", Content: "more"},
+			},
+			expected: "<|turn>user\nhi<turn|>\n<|turn>model\nvisible answer<turn|>\n<|turn>user\nmore<turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
+		}),
+
+		// tpl L220-L235: consecutive assistant messages suppress the second
+		// <|turn>model (continuation), but each still closes with <turn|>.
+		Entry("consecutive assistant messages continue the model turn", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "hi"},
+				{Role: "assistant", Content: "part one"},
+				{Role: "assistant", Content: "part two"},
+				{Role: "user", Content: "ok"},
+			},
+			expected: "<|turn>user\nhi<turn|>\n<|turn>model\npart one<turn|>\npart two<turn|>\n<|turn>user\nok<turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
+		}),
+
+		Entry("add_generation_prompt=false renders no model turn", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "hi"},
+			},
+			noGenerationPrompt: true,
+			expected:           "<|turn>user\nhi<turn|>\n",
+		}),
+
+		// One engine marker per image, appended directly after the user
+		// text with no separator (tpl L323-L341 emits parts back-to-back;
+		// "<image>" is dllm_capi.h's splice marker, not the template's
+		// <|image|> text token - see the IMAGE NOTE on RenderGemma4).
+		Entry("one image appends one engine marker to the user message", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "What is in this picture?"},
+			},
+			nImages:  1,
+			expected: "<|turn>user\nWhat is in this picture?<image><turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
+		}),
+
+		Entry("multiple images append markers in image order", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "Compare these."},
+			},
+			nImages:  3,
+			expected: "<|turn>user\nCompare these.<image><image><image><turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
+		}),
+
+		// Flattened delivery loses per-message attribution, so all images
+		// attach to the LAST user message (llama.cpp grpc-server convention).
+		Entry("images attach to the last user message in multi-turn", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: "hi"},
+				{Role: "assistant", Content: "hello"},
+				{Role: "user", Content: "and this?"},
+			},
+			nImages:  1,
+			expected: "<|turn>user\nhi<turn|>\n<|turn>model\nhello<turn|>\n<|turn>user\nand this?<image><turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
+		}),
+
+		// tpl L346: the markers count as captured_content, so an image-only
+		// user message still has content and closes its turn normally.
+		Entry("image with empty user text still closes the turn", renderGemma4Case{
+			msgs: []*pb.Message{
+				{Role: "user", Content: ""},
+			},
+			nImages:  1,
+			expected: "<|turn>user\n<image><turn|>\n<|turn>model\n<|channel>thought\n<channel|>",
+		}),
+	)
+
+	Describe("error handling", func() {
+		It("fails loud on an unknown role", func() {
+			_, err := RenderGemma4([]*pb.Message{
+				{Role: "narrator", Content: "Meanwhile..."},
+			}, "", 0, false, true)
+			Expect(err).To(HaveOccurred())
+			Expect(err.Error()).To(ContainSubstring(`unknown role "narrator"`))
+		})
+
+		It("fails on invalid tools JSON", func() {
+			_, err := RenderGemma4([]*pb.Message{
+				{Role: "user", Content: "hi"},
+			}, "{not json", 0, false, true)
+			Expect(err).To(HaveOccurred())
+			Expect(err.Error()).To(ContainSubstring("tools JSON"))
+		})
+
+		It("fails on invalid tool_calls JSON", func() {
+			_, err := RenderGemma4([]*pb.Message{
+				{Role: "user", Content: "hi"},
+				{Role: "assistant", Content: "", ToolCalls: "{not json"},
+			}, "", 0, false, true)
+			Expect(err).To(HaveOccurred())
+			Expect(err.Error()).To(ContainSubstring("tool_calls JSON"))
+		})
+
+		It("fails on an orphan tool message, naming its index", func() {
+			// A role:tool message with no preceding assistant tool_calls turn
+			// would be silently dropped by the jinja; we fail loud instead.
+			_, err := RenderGemma4([]*pb.Message{
+				{Role: "user", Content: "hi"},
+				{Role: "tool", Content: `{"temp": 20}`, ToolCallId: "call_1"},
+			}, "", 0, false, true)
+			Expect(err).To(HaveOccurred())
+			Expect(err.Error()).To(ContainSubstring("orphan tool message 1"))
+		})
+
+		It("fails on trailing garbage after the tools JSON array", func() {
+			_, err := RenderGemma4([]*pb.Message{
+				{Role: "user", Content: "hi"},
+			}, "[] junk", 0, false, true)
+			Expect(err).To(HaveOccurred())
+			Expect(err.Error()).To(ContainSubstring("tools JSON"))
+		})
+
+		It("fails when the tools JSON is not an array", func() {
+			_, err := RenderGemma4([]*pb.Message{
+				{Role: "user", Content: "hi"},
+			}, `{"type":"function"}`, 0, false, true)
+			Expect(err).To(HaveOccurred())
+			Expect(err.Error()).To(ContainSubstring("tools JSON is not an array"))
+		})
+
+		It("fails when a tools array element is not an object", func() {
+			_, err := RenderGemma4([]*pb.Message{
+				{Role: "user", Content: "hi"},
+			}, `[42]`, 0, false, true)
+			Expect(err).To(HaveOccurred())
+			Expect(err.Error()).To(ContainSubstring("tools[0] is not an object"))
+		})
+
+		It("rejects a nil message via the unknown-role check", func() {
+			// Pins current behavior: pb getters are nil-safe, so a nil message
+			// reads as role "" and trips the fail-loud unknown-role guard.
+			_, err := RenderGemma4([]*pb.Message{nil}, "", 0, false, true)
+			Expect(err).To(HaveOccurred())
+			Expect(err.Error()).To(ContainSubstring(`unknown role "" in message 0`))
+		})
+
+		It("fails loud on images with no user message to attach them to", func() {
+			// The engine would reject the markerless prompt anyway
+			// (marker/image count mismatch); the renderer surfaces the bad
+			// request with a usable message instead.
+			_, err := RenderGemma4([]*pb.Message{
+				{Role: "system", Content: "sys"},
+				{Role: "assistant", Content: "hi"},
+			}, "", 1, false, true)
+			Expect(err).To(HaveOccurred())
+			Expect(err.Error()).To(ContainSubstring("no user message"))
+		})
+	})
+})
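The image fixtures above pin one behavior: every image adds one `<image>` marker, all markers land on the last user message with no separator, and a request with images but no user message is rejected. A minimal sketch of that rule follows. The `message` type and `attachImageMarkers` are hypothetical stand-ins, not the RenderGemma4 implementation, which works on `pb.Message` and renders the full prompt.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// message is a hypothetical stand-in for the pb.Message fields used here.
type message struct {
	Role    string
	Content string
}

// imageMarker mirrors the "<image>" marker the fixtures above expect.
const imageMarker = "<image>"

// attachImageMarkers returns a copy of msgs with nImages markers appended
// to the content of the LAST user message. The input slice is not modified.
func attachImageMarkers(msgs []message, nImages int) ([]message, error) {
	if nImages <= 0 {
		return msgs, nil
	}
	last := -1
	for i, m := range msgs {
		if m.Role == "user" {
			last = i
		}
	}
	if last < 0 {
		return nil, errors.New("images supplied but no user message to attach them to")
	}
	out := make([]message, len(msgs))
	copy(out, msgs)
	out[last].Content += strings.Repeat(imageMarker, nImages)
	return out, nil
}

func main() {
	msgs := []message{{Role: "user", Content: "Compare these."}}
	out, err := attachImageMarkers(msgs, 3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out[0].Content)
}
```

Failing early here gives the caller a usable error, which is the same reasoning the "images with no user message" spec gives for not leaving the mismatch to the engine.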
--- a/backend/go/dllm/main.go
+++ b/backend/go/dllm/main.go
@@ -0,0 +1,98 @@
+package main
+
+// Started internally by LocalAI - one gRPC server per loaded model.
+//
+// Loads libdllm.so via purego and registers the flat C-ABI declared in
+// dllm.cpp's include/dllm_capi.h (ABI v1): 9 mandatory symbols plus the
+// Dlsym-probed optional multimodal pair. The library name can
+// be overridden with DLLM_LIBRARY (mirrors the PARAKEET_LIBRARY /
+// WHISPER_LIBRARY convention in the sibling backends); the default looks
+// for the .so next to this binary (run.sh puts the package dir on
+// LD_LIBRARY_PATH).
+import (
+	"flag"
+	"fmt"
+	"os"
+
+	"github.com/ebitengine/purego"
+	grpc "github.com/mudler/LocalAI/pkg/grpc"
+)
+
+var (
+	addr = flag.String("addr", "localhost:50051", "the address to listen on")
+)
+
+type LibFuncs struct {
+	FuncPtr any
+	Name    string
+}
+
+// loadCAPI dlopens libName and binds the 9 mandatory dllm_capi_* entry
+// points 1:1 to dllm_capi.h (plus the probed multimodal pair), so an
+// `nm -D libdllm.so | grep dllm_capi` is enough to spot drift. Shared with
+// the test suite (ensureLibLoaded), which drives the bridge without gRPC.
+//
+// The C-ABI returns malloc'd char* buffers from tokenize_json/generate; we
+// register those as uintptr so we get the raw pointer back and can call
+// dllm_capi_free_string on it (purego's string return would copy and forget
+// the original pointer, leaking it on every call). last_error returns a
+// BORROWED pointer instead, so it is registered as a plain string: purego
+// copies it and nothing must be freed.
+func loadCAPI(libName string) error {
+	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
+	if err != nil {
+		return fmt.Errorf("dllm: dlopen %q: %w", libName, err)
+	}
+
+	libFuncs := []LibFuncs{
+		{&cppAbiVersion, "dllm_capi_abi_version"},
+		{&cppLoad, "dllm_capi_load"},
+		{&cppFree, "dllm_capi_free"},
+		{&cppLastError, "dllm_capi_last_error"},
+		{&cppFreeString, "dllm_capi_free_string"},
+		{&cppTokenizeJSON, "dllm_capi_tokenize_json"},
+		{&cppGenerate, "dllm_capi_generate"},
+		{&cppGenerateStream, "dllm_capi_generate_stream"},
+		{&cppCancel, "dllm_capi_cancel"},
+	}
+	for _, lf := range libFuncs {
+		purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
+	}
+
+	// Multimodal entry points (dllm_capi.h's P4 surface). Additive: the ABI
+	// version stays 1 and consumers detect the surface by probing the symbols
+	// (the parakeet-cpp optional-symbol pattern), so the backend still loads
+	// against an older text-only libdllm.so - image requests then fail with
+	// errMMUnsupported instead of a boot failure.
+	if sym, err := purego.Dlsym(lib, "dllm_capi_generate_mm"); err == nil && sym != 0 {
+		purego.RegisterLibFunc(&cppGenerateMM, lib, "dllm_capi_generate_mm")
+	}
+	if sym, err := purego.Dlsym(lib, "dllm_capi_generate_stream_mm"); err == nil && sym != 0 {
+		purego.RegisterLibFunc(&cppGenerateStreamMM, lib, "dllm_capi_generate_stream_mm")
+	}
+	return nil
+}
+
+func main() {
+	libName := os.Getenv("DLLM_LIBRARY")
+	if libName == "" {
+		libName = "libdllm.so"
+	}
+
+	if err := loadCAPI(libName); err != nil {
+		panic(err)
+	}
+
+	// Hard-fail on an ABI mismatch: the flat-pointer bindings above would
+	// otherwise misbehave silently against a future libdllm.so.
+	if v := cAbiVersion(); v != dllmABIVersion {
+		panic(fmt.Errorf("dllm: libdllm.so ABI=%d, this backend speaks ABI=%d", v, dllmABIVersion))
+	}
+	fmt.Fprintf(os.Stderr, "[dllm] ABI=%d multimodal=%t\n", cAbiVersion(), cMMSupported())
+
+	flag.Parse()
+
+	if err := grpc.StartServer(*addr, &Dllm{}); err != nil {
+		panic(err)
+	}
+}
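The optional-symbol pattern in loadCAPI (bind the mandatory entry points, probe the multimodal ones, and let image requests fail with errMMUnsupported on an older library) can be sketched without purego. Everything below is an illustration under assumptions: `symbolTable` simulates a dlopen'd library as a name-to-address map, and `bindings`, `bind` and `generateWithImages` are hypothetical names. Only the symbol names and the `errMMUnsupported` identifier come from the diff, and the error text here is invented.

```go
package main

import (
	"errors"
	"fmt"
)

// symbolTable is a hypothetical stand-in for a dlopen'd library.
type symbolTable map[string]uintptr

// errMMUnsupported reuses the name from the diff; the message is illustrative.
var errMMUnsupported = errors.New("loaded libdllm.so has no multimodal entry points; upgrade the library")

type bindings struct {
	generate   uintptr // mandatory: binding fails without it
	generateMM uintptr // optional: stays 0 when the symbol is absent
}

// bind resolves the mandatory symbol and probes the optional one, so an
// older text-only library still loads.
func bind(lib symbolTable) (*bindings, error) {
	addr, ok := lib["dllm_capi_generate"]
	if !ok || addr == 0 {
		return nil, fmt.Errorf("missing mandatory symbol %q", "dllm_capi_generate")
	}
	b := &bindings{generate: addr}
	if mm, ok := lib["dllm_capi_generate_mm"]; ok && mm != 0 {
		b.generateMM = mm
	}
	return b, nil
}

func (b *bindings) mmSupported() bool { return b.generateMM != 0 }

// generateWithImages rejects image requests up front when the optional
// surface is missing, instead of failing at load time.
func (b *bindings) generateWithImages(nImages int) error {
	if nImages > 0 && !b.mmSupported() {
		return errMMUnsupported
	}
	return nil
}

func main() {
	textOnly, _ := bind(symbolTable{"dllm_capi_generate": 0x1000})
	full, _ := bind(symbolTable{"dllm_capi_generate": 0x1000, "dllm_capi_generate_mm": 0x2000})
	fmt.Println("text-only multimodal:", textOnly.mmSupported())
	fmt.Println("full multimodal:", full.mmSupported())
	fmt.Println("image request on text-only:", textOnly.generateWithImages(1))
}
```

The design choice this mirrors is additive ABI growth: the version number stays the same and capability is detected per symbol, so text-only requests keep working against an older library.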
--- a/backend/go/dllm/package.sh
+++ b/backend/go/dllm/package.sh
@@ -0,0 +1,24 @@
+#!/bin/bash
+#
+# T1 packaging stub: copy the binary, run.sh and libdllm.so into package/.
+# The full ldd walk (libc, libstdc++, libgomp, GPU runtimes, arch
+# detection) lands with the registration task, mirroring
+# backend/go/whisper/package.sh.
+
+set -e
+
+CURDIR=$(dirname "$(realpath "$0")")
+
+mkdir -p "$CURDIR/package/lib"
+
+cp -avf "$CURDIR/dllm-grpc" "$CURDIR/package/"
+cp -avf "$CURDIR/run.sh" "$CURDIR/package/"
+
+# libdllm.so + any soname symlinks, should upstream ever add them.
+cp -avf "$CURDIR"/libdllm.so* "$CURDIR/package/lib/" 2>/dev/null || {
+	echo "ERROR: libdllm.so not found in $CURDIR, run 'make' first" >&2
+	exit 1
+}
+
+echo "T1 package layout (full ldd walk lands with registration):"
+ls -liah "$CURDIR/package/" "$CURDIR/package/lib/"
--- a/backend/go/dllm/run.sh
+++ b/backend/go/dllm/run.sh
@@ -0,0 +1,16 @@
+#!/bin/bash
+set -e
+
+CURDIR=$(dirname "$(realpath "$0")")
+
+export LD_LIBRARY_PATH="$CURDIR/lib:$CURDIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
+
+# If a self-contained ld.so was packaged, route through it so the
+# packaged libc / libstdc++ are used instead of the host's (matches the
+# whisper / parakeet-cpp backends' runtime layout).
+if [ -f "$CURDIR/lib/ld.so" ]; then
+	echo "Using lib/ld.so"
+	exec "$CURDIR/lib/ld.so" "$CURDIR/dllm-grpc" "$@"
+fi
+
+exec "$CURDIR/dllm-grpc" "$@"
--- a/backend/go/locate-anything-cpp/.gitignore
+++ b/backend/go/locate-anything-cpp/.gitignore
@@ -1,7 +0,0 @@
-sources/
-build*/
-package/
-liblocateanythingcpp*.so
-locate-anything-cpp
-test-models/
-test-data/
--- a/backend/go/locate-anything-cpp/CMakeLists.txt
+++ b/backend/go/locate-anything-cpp/CMakeLists.txt
@@ -1,57 +0,0 @@
-cmake_minimum_required(VERSION 3.18)
-project(liblocateanythingcpp LANGUAGES C CXX)
-
-set(CMAKE_POSITION_INDEPENDENT_CODE ON)
-set(CMAKE_CXX_STANDARD 17)
-set(CMAKE_CXX_STANDARD_REQUIRED ON)
-
-# Static-link ggml + locate_anything so the resulting .so has no runtime
-# dependency on extra ggml/locate_anything shared libraries — only on
-# libc/libstdc++/libgomp, which the LocalAI package step bundles into the
-# docker image.
-set(BUILD_SHARED_LIBS OFF CACHE BOOL "Build static libraries" FORCE)
-
-# locate-anything.cpp build switches: skip CLI/tests, keep static lib.
-set(LA_BUILD_CLI OFF CACHE BOOL "Disable locate-anything CLI" FORCE)
-set(LA_BUILD_TESTS OFF CACHE BOOL "Disable locate-anything tests" FORCE)
-set(LA_SHARED OFF CACHE BOOL "Build locate_anything as static lib" FORCE)
-
-# Unlike rt-detr.cpp, locate-anything.cpp ships no in-tree ggml patches, so
-# there is no apply_ggml_patches.sh hook to shim here.
-add_subdirectory(./sources/locate-anything.cpp)
-
-# locate-anything.cpp's top-level CMakeLists points its own target's include
-# dirs at ${CMAKE_SOURCE_DIR}/{include,src,third_party,...}. CMAKE_SOURCE_DIR
-# is the *top-level* source dir of the whole CMake tree, so when we pull it in
-# via add_subdirectory it resolves to OUR directory, not theirs, and the
-# locate_anything target fails to find its own headers (la_capi.h, stb_image.h,
-# la_gguf_keys.h). Re-add the correct, subdir-relative include paths to the
-# already-defined target so it compiles regardless of where it's nested.
-set(LA_SRC ${CMAKE_CURRENT_SOURCE_DIR}/sources/locate-anything.cpp)
-target_include_directories(locate_anything PRIVATE
-    ${LA_SRC}/include
-    ${LA_SRC}/src
-    ${LA_SRC}/third_party
-    ${LA_SRC}/third_party/stb)
-
-# locate-anything.cpp's C-API symbols already live inside liblocate_anything
-# (src/la_capi.cpp is compiled into the lib). We re-export them via a MODULE
-# library that links locate_anything so the symbols are visible at dlopen time.
-add_library(locateanythingcpp MODULE
-    sources/locate-anything.cpp/src/la_capi.cpp)
-
-target_include_directories(locateanythingcpp PRIVATE
-    sources/locate-anything.cpp/include
-    sources/locate-anything.cpp/src
-    sources/locate-anything.cpp/third_party
-    sources/locate-anything.cpp/third_party/stb
-)
-
-target_link_libraries(locateanythingcpp PRIVATE locate_anything ggml)
-
-if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
-    target_link_libraries(locateanythingcpp PRIVATE stdc++fs)
-endif()
-
-set_property(TARGET locateanythingcpp PROPERTY CXX_STANDARD 17)
-set_target_properties(locateanythingcpp PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})
--- a/backend/go/locate-anything-cpp/Makefile
+++ b/backend/go/locate-anything-cpp/Makefile
@@ -1,134 +0,0 @@
-CMAKE_ARGS?=
-BUILD_TYPE?=
-NATIVE?=false
-
-GOCMD?=go
-GO_TAGS?=
-JOBS?=$(shell nproc --ignore=1)
-
-# locate-anything.cpp. Pin to a specific commit for a stable build; leaving
-# this on `master` always picks up the latest C-API surface (incl. the
-# per-detection accessor functions used by golocateanythingcpp.go).
-LOCATEANYTHING_REPO?=https://github.com/mudler/locate-anything.cpp.git
-LOCATEANYTHING_VERSION?=60e450945476d5e97e0754a8c0e71a9ea81690e0
-
-ifeq ($(NATIVE),false)
-	CMAKE_ARGS+=-DGGML_NATIVE=OFF
-endif
-
-# Forward LocalAI's BUILD_TYPE to the matching ggml backend switch.
-ifeq ($(BUILD_TYPE),cublas)
-	CMAKE_ARGS+=-DGGML_CUDA=ON -DLA_GGML_CUDA=ON
-else ifeq ($(BUILD_TYPE),openblas)
-	CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
-else ifeq ($(BUILD_TYPE),clblas)
-	CMAKE_ARGS+=-DGGML_CLBLAST=ON
-else ifeq ($(BUILD_TYPE),hipblas)
-	ROCM_HOME ?= /opt/rocm
-	ROCM_PATH ?= /opt/rocm
-	export CXX=$(ROCM_HOME)/llvm/bin/clang++
-	export CC=$(ROCM_HOME)/llvm/bin/clang
-	AMDGPU_TARGETS?=gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1200,gfx1201
-	CMAKE_ARGS+=-DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
-else ifeq ($(BUILD_TYPE),vulkan)
-	CMAKE_ARGS+=-DGGML_VULKAN=ON -DLA_GGML_VULKAN=ON
-else ifeq ($(OS),Darwin)
-	ifneq ($(BUILD_TYPE),metal)
-		CMAKE_ARGS+=-DGGML_METAL=OFF
-	else
-		CMAKE_ARGS+=-DGGML_METAL=ON
-		CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
-		CMAKE_ARGS+=-DLA_GGML_METAL=ON
-	endif
-endif
-
-ifeq ($(BUILD_TYPE),sycl_f16)
-	CMAKE_ARGS+=-DGGML_SYCL=ON \
-		-DCMAKE_C_COMPILER=icx \
-		-DCMAKE_CXX_COMPILER=icpx \
-		-DGGML_SYCL_F16=ON
-endif
-
-ifeq ($(BUILD_TYPE),sycl_f32)
-	CMAKE_ARGS+=-DGGML_SYCL=ON \
-		-DCMAKE_C_COMPILER=icx \
-		-DCMAKE_CXX_COMPILER=icpx
-endif
-
-sources/locate-anything.cpp:
-	mkdir -p sources && \
-	git clone --recursive $(LOCATEANYTHING_REPO) sources/locate-anything.cpp && \
-	cd sources/locate-anything.cpp && \
-	git checkout $(LOCATEANYTHING_VERSION) && \
-	git submodule update --init --recursive --depth 1 --single-branch
-
-# Detect OS
-UNAME_S := $(shell uname -s)
-
-# Only build CPU variants on Linux
-ifeq ($(UNAME_S),Linux)
-	VARIANT_TARGETS = liblocateanythingcpp-avx.so liblocateanythingcpp-avx2.so liblocateanythingcpp-avx512.so liblocateanythingcpp-fallback.so
-else
-	# On non-Linux (e.g., Darwin), build only fallback variant
-	VARIANT_TARGETS = liblocateanythingcpp-fallback.so
-endif
-
-locate-anything-cpp: main.go golocateanythingcpp.go $(VARIANT_TARGETS)
-	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o locate-anything-cpp ./
-
-package: locate-anything-cpp
-	bash package.sh
-
-build: package
-
-clean: purge
-	rm -rf liblocateanythingcpp*.so locate-anything-cpp package sources
-
-purge:
-	rm -rf build*
-
-# Build all variants (Linux only)
-ifeq ($(UNAME_S),Linux)
-liblocateanythingcpp-avx.so: sources/locate-anything.cpp
-	rm -rfv build-$@
-	$(info ${GREEN}I locate-anything-cpp build info:avx${RESET})
-	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) liblocateanythingcpp-custom
-	rm -rfv build-$@
-
-liblocateanythingcpp-avx2.so: sources/locate-anything.cpp
-	rm -rfv build-$@
-	$(info ${GREEN}I locate-anything-cpp build info:avx2${RESET})
-	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) liblocateanythingcpp-custom
-	rm -rfv build-$@
-
-liblocateanythingcpp-avx512.so: sources/locate-anything.cpp
-	rm -rfv build-$@
-	$(info ${GREEN}I locate-anything-cpp build info:avx512${RESET})
-	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) liblocateanythingcpp-custom
-	rm -rfv build-$@
-endif
-
-# Build fallback variant (all platforms)
-liblocateanythingcpp-fallback.so: sources/locate-anything.cpp
-	rm -rfv build-$@
-	$(info ${GREEN}I locate-anything-cpp build info:fallback${RESET})
-	SO_TARGET=$@ CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) liblocateanythingcpp-custom
-	rm -rfv build-$@
-
-liblocateanythingcpp-custom: CMakeLists.txt
-	mkdir -p build-$(SO_TARGET) && \
-	cd build-$(SO_TARGET) && \
-	cmake .. $(CMAKE_ARGS) && \
-	cmake --build . --config Release -j$(JOBS) && \
-	cd .. && \
-	mv build-$(SO_TARGET)/liblocateanythingcpp.so ./$(SO_TARGET)
-
-all: locate-anything-cpp package
-
-# `test` is invoked by the top-level Makefile's `test-extra` target. It builds
-# the backend binary + the fallback shared library (needed for dlopen at
-# runtime), then runs test.sh which downloads the q8_0 GGUF + COCO image and
-# exercises the gRPC Load/Detect wire path via the Go smoke test in
-# main_test.go.
-test: locate-anything-cpp liblocateanythingcpp-fallback.so
-	bash test.sh
--- a/backend/go/locate-anything-cpp/golocateanythingcpp.go
+++ b/backend/go/locate-anything-cpp/golocateanythingcpp.go
@@ -1,174 +0,0 @@
-package main
-
-// golocateanythingcpp.go - gRPC handlers (Load, Detect) for the
-// locate-anything-cpp backend.
-//
-// Embeds base.SingleThread to default unimplemented RPCs to "not supported"
-// while we only implement open-vocabulary object detection (Detect).
-
-import (
-	"encoding/base64"
-	"fmt"
-	"os"
-	"path/filepath"
-	"unsafe"
-
-	"github.com/mudler/LocalAI/pkg/grpc/base"
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-)
-
-// la_ctx* is an opaque handle. la_capi_load returns it directly (0 == failure),
-// unlike rfdetr's out-parameter convention.
-var (
-	// la_capi_load(const char* gguf_path, int n_threads) -> la_ctx* (0 = fail)
-	CapiLoad func(gguf string, nThreads int32) uintptr
-	// la_capi_free(la_ctx* ctx)
-	CapiFree func(handle uintptr)
-	// la_capi_locate_path(ctx, image_path, prompt, mode) -> char* json (0 = err)
-	CapiLocatePath func(handle uintptr, imagePath string, prompt string, mode int32) uintptr
-	// la_capi_locate_buffer(ctx, bytes, len, prompt, mode) -> char* json (0 = err)
-	CapiLocateBuffer func(handle uintptr, bytes uintptr, length uintptr, prompt string, mode int32) uintptr
-	// la_capi_get_n_detections(ctx) -> int
-	CapiGetNDetections func(handle uintptr) int32
-	// la_capi_get_detection_box(ctx, i, out_xyxy[4]) -> int (0 on success)
-	CapiGetDetectionBox func(handle uintptr, i int32, outXYXY uintptr) int32
-	// la_capi_get_detection_label(ctx, i, buf, buf_size) -> int (required size incl NUL; two-call sizing)
-	CapiGetDetectionLabel func(handle uintptr, i int32, buf uintptr, bufSize int32) int32
-	// la_capi_free_string(char* s)
-	CapiFreeString func(s uintptr)
-	// la_capi_last_error(ctx) -> const char* (owned by ctx, "" if none / null ctx).
-	// purego marshals the returned C string into a Go string (a copy), so we
-	// never free it and avoid raw pointer arithmetic.
-	CapiLastError func(handle uintptr) string
-)
-
-type LocateAnythingCpp struct {
-	base.SingleThread
-	handle uintptr
-}
-
-// Load loads the GGUF model at opts.ModelFile (joined with opts.ModelPath if
-// relative) and stores the la_ctx handle for later Detect calls.
-func (r *LocateAnythingCpp) Load(opts *pb.ModelOptions) error {
-	modelFile := opts.ModelFile
-	if modelFile == "" {
-		modelFile = opts.Model
-	}
-	if modelFile == "" {
-		return fmt.Errorf("locate-anything-cpp: ModelFile is empty")
-	}
-
-	var modelPath string
-	if filepath.IsAbs(modelFile) {
-		modelPath = modelFile
-	} else {
-		modelPath = filepath.Join(opts.ModelPath, modelFile)
-	}
-
-	if _, err := os.Stat(modelPath); err != nil {
-		return fmt.Errorf("locate-anything-cpp: model file not found: %s: %w", modelPath, err)
-	}
-
-	threads := opts.Threads
-	if threads <= 0 {
-		threads = 4
-	}
-
-	// Release previous model if any (re-Load).
-	if r.handle != 0 {
-		CapiFree(r.handle)
-		r.handle = 0
-	}
-
-	h := CapiLoad(modelPath, threads)
-	if h == 0 {
-		// la_capi_last_error needs a ctx; on a failed load we have none (it
-		// returns "" for a null ctx), so the text is best-effort. Surface it
-		// when present.
-		if msg := CapiLastError(0); msg != "" {
-			return fmt.Errorf("locate-anything-cpp: la_capi_load failed for %s: %s", modelPath, msg)
-		}
-		return fmt.Errorf("locate-anything-cpp: la_capi_load failed for %s", modelPath)
-	}
-	r.handle = h
-	return nil
-}
-
-// Detect runs open-vocabulary detection on the base64-encoded image in opts.Src
-// using the required text prompt in opts.Prompt, returning one pb.Detection per
-// located object with its predicted label as ClassName.
-func (r *LocateAnythingCpp) Detect(opts *pb.DetectOptions) (pb.DetectResponse, error) {
-	if r.handle == 0 {
-		return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: model not loaded")
-	}
-
-	// Open-vocabulary detection is prompt-driven; without a prompt there is
-	// nothing to locate.
-	prompt := opts.Prompt
-	if prompt == "" {
-		return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: a text prompt is required (open-vocabulary detection)")
-	}
-
-	// Decode base64 image and write to temp file.
-	imgData, err := base64.StdEncoding.DecodeString(opts.Src)
-	if err != nil {
-		return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: failed to decode base64 image: %w", err)
-	}
-
-	tmpFile, err := os.CreateTemp("", "locate-anything-*.img")
-	if err != nil {
-		return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: failed to create temp file: %w", err)
-	}
-	defer func() { _ = os.Remove(tmpFile.Name()) }()
-
-	if _, err := tmpFile.Write(imgData); err != nil {
-		_ = tmpFile.Close()
-		return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: failed to write temp file: %w", err)
-	}
-	if err := tmpFile.Close(); err != nil {
-		return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: failed to close temp file: %w", err)
-	}
-
-	// mode 0 = hybrid (Parallel Box Decoding). The JSON return value is unused:
-	// structured detections are read via the accessor functions. Still must
-	// free the returned string.
-	jsonPtr := CapiLocatePath(r.handle, tmpFile.Name(), prompt, 0)
-	if jsonPtr != 0 {
-		CapiFreeString(jsonPtr)
-	}
-
-	n := CapiGetNDetections(r.handle)
-	if n < 0 {
-		return pb.DetectResponse{}, fmt.Errorf("locate-anything-cpp: invalid n_detections=%d", n)
-	}
-
-	detections := make([]*pb.Detection, 0, n)
-	for i := int32(0); i < n; i++ {
-		var xyxy [4]float32 // x1, y1, x2, y2
-		if CapiGetDetectionBox(r.handle, i, uintptr(unsafe.Pointer(&xyxy[0]))) != 0 {
-			continue
-		}
-
-		// Two-call sizing for the label string.
-		label := ""
-		need := CapiGetDetectionLabel(r.handle, i, 0, 0)
-		if need > 0 {
-			buf := make([]byte, need)
-			CapiGetDetectionLabel(r.handle, i, uintptr(unsafe.Pointer(&buf[0])), need)
-			label = string(buf[:need-1])
-		}
-
-		detections = append(detections, &pb.Detection{
-			X:          xyxy[0],
-			Y:          xyxy[1],
-			Width:      xyxy[2] - xyxy[0],
-			Height:     xyxy[3] - xyxy[1],
-			Confidence: 1.0,
-			ClassName:  label,
-		})
-	}
-
-	return pb.DetectResponse{
-		Detections: detections,
-	}, nil
-}
--- a/backend/go/locate-anything-cpp/main.go
+++ b/backend/go/locate-anything-cpp/main.go
@@ -1,59 +0,0 @@
-package main
-
-// main.go - entry point for the locate-anything-cpp gRPC backend.
-//
-// Dlopens liblocateanythingcpp-<variant>.so via purego at the path in
-// LOCATEANYTHING_LIBRARY (set by run.sh based on /proc/cpuinfo), registers
-// the la_capi_* C ABI symbols, then starts the gRPC server.
-
-import (
-	"flag"
-	"os"
-
-	"github.com/ebitengine/purego"
-	grpc "github.com/mudler/LocalAI/pkg/grpc"
-)
-
-var (
-	addr = flag.String("addr", "localhost:50051", "the address to connect to")
-)
-
-type LibFuncs struct {
-	FuncPtr any
-	Name    string
-}
-
-func main() {
-	// Get library name from environment variable, default to fallback
-	libName := os.Getenv("LOCATEANYTHING_LIBRARY")
-	if libName == "" {
-		libName = "./liblocateanythingcpp-fallback.so"
-	}
-
-	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
-	if err != nil {
-		panic(err)
-	}
-
-	libFuncs := []LibFuncs{
-		{&CapiLoad, "la_capi_load"},
-		{&CapiFree, "la_capi_free"},
-		{&CapiLocatePath, "la_capi_locate_path"},
-		{&CapiLocateBuffer, "la_capi_locate_buffer"},
-		{&CapiGetNDetections, "la_capi_get_n_detections"},
-		{&CapiGetDetectionBox, "la_capi_get_detection_box"},
-		{&CapiGetDetectionLabel, "la_capi_get_detection_label"},
-		{&CapiFreeString, "la_capi_free_string"},
-		{&CapiLastError, "la_capi_last_error"},
-	}
-
-	for _, lf := range libFuncs {
-		purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
-	}
-
-	flag.Parse()
-
-	if err := grpc.StartServer(*addr, &LocateAnythingCpp{}); err != nil {
-		panic(err)
-	}
-}
--- a/backend/go/locate-anything-cpp/main_test.go
+++ b/backend/go/locate-anything-cpp/main_test.go
@@ -1,176 +0,0 @@
-package main
-
-// main_test.go - end-to-end smoke test for the locate-anything-cpp gRPC backend.
-//
-// Spawns the compiled locate-anything-cpp binary on a free local port, dials it
-// via gRPC, and exercises LoadModel + Detect against the test fixtures
-// downloaded by test.sh: the q8_0 GGUF of nvidia/LocateAnything-3B and a real
-// COCO image with people + cars. Asserts that open-vocabulary detection driven
-// by a text prompt returns at least one detection, each carrying a non-empty
-// class name and a bounding box of non-zero size.
-//
-// The spec Skip()s cleanly if its fixtures (the ~6.3 GB model, the test image,
-// the built binary, or the fallback .so) are missing, so the test target stays
-// usable on a fresh checkout / on CI runners where the large model hasn't been
-// downloaded.
-
-import (
-	"context"
-	"encoding/base64"
-	"fmt"
-	"net"
-	"os"
-	"os/exec"
-	"path/filepath"
-	"testing"
-	"time"
-
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-	"google.golang.org/grpc"
-	"google.golang.org/grpc/credentials/insecure"
-)
-
-func TestDetect(t *testing.T) {
-	RegisterFailHandler(Fail)
-	RunSpecs(t, "locate-anything-cpp backend smoke suite")
-}
-
-// freePort grabs an ephemeral TCP port and immediately releases it so the
-// spawned backend can bind to it. There is a tiny TOCTOU window here but in
-// practice it's adequate for a smoke test on a quiet runner.
-func freePort() int {
-	l, err := net.Listen("tcp", "127.0.0.1:0")
-	Expect(err).ToNot(HaveOccurred(), "freePort listen")
-	port := l.Addr().(*net.TCPAddr).Port
-	Expect(l.Close()).To(Succeed())
-	return port
-}
-
-// startBackend spawns the locate-anything-cpp binary on the given port and
-// waits until it accepts TCP connections (up to 10s). It mirrors how main.go
-// resolves the purego library: the LOCATEANYTHING_LIBRARY env var points the
-// dlopen at the freshly built fallback .so, and the la_capi_* symbols are
-// registered there. The returned cleanup func kills the process and reaps it.
-func startBackend(port int) func() {
-	binary, err := filepath.Abs("./locate-anything-cpp")
-	Expect(err).ToNot(HaveOccurred())
-	if _, err := os.Stat(binary); err != nil {
-		Skip(fmt.Sprintf("backend binary not built: %s (run `make locate-anything-cpp` first)", binary))
-	}
-
-	libPath, err := filepath.Abs("./liblocateanythingcpp-fallback.so")
-	Expect(err).ToNot(HaveOccurred())
-	if _, err := os.Stat(libPath); err != nil {
-		Skip(fmt.Sprintf("fallback library not built: %s (run `make liblocateanythingcpp-fallback.so` first)", libPath))
-	}
-
-	addr := fmt.Sprintf("127.0.0.1:%d", port)
-	cmd := exec.Command(binary, "--addr", addr)
-	cmd.Env = append(os.Environ(), "LOCATEANYTHING_LIBRARY="+libPath)
-	cmd.Stdout = os.Stderr
-	cmd.Stderr = os.Stderr
-	Expect(cmd.Start()).To(Succeed())
-
-	cleanup := func() {
-		if cmd.Process != nil {
-			_ = cmd.Process.Kill()
-			_, _ = cmd.Process.Wait()
-		}
-	}
-
-	deadline := time.Now().Add(10 * time.Second)
-	for time.Now().Before(deadline) {
-		c, err := net.DialTimeout("tcp", addr, 200*time.Millisecond)
-		if err == nil {
-			_ = c.Close()
-			return cleanup
-		}
-		time.Sleep(200 * time.Millisecond)
-	}
-
-	cleanup()
-	Fail(fmt.Sprintf("backend did not become ready on %s within 10s", addr))
-	return func() {}
-}
-
-// loadTestImage reads the COCO test image downloaded by test.sh and returns its
-// base64-encoded content (the wire format accepted by the Detect RPC).
-func loadTestImage() string {
-	imgPath, err := filepath.Abs("test-data/test.jpg")
-	Expect(err).ToNot(HaveOccurred())
-	imgBytes, err := os.ReadFile(imgPath)
-	if err != nil {
-		Skip(fmt.Sprintf("test image not present: %s (run test.sh first)", imgPath))
-	}
-	return base64.StdEncoding.EncodeToString(imgBytes)
-}
-
-// dialBackend opens a gRPC client connection to the spawned backend.
-func dialBackend(port int) (pb.BackendClient, func()) {
-	addr := fmt.Sprintf("127.0.0.1:%d", port)
-	conn, err := grpc.NewClient(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
-	Expect(err).ToNot(HaveOccurred())
-	return pb.NewBackendClient(conn), func() { _ = conn.Close() }
-}
-
-// modelPathOrSkip resolves the model file under ./test-models/ and Skip()s the
-// current spec if it's missing (the ~6.3 GB GGUF is not present on a fresh
-// checkout / on CI runners without the download).
-func modelPathOrSkip(name string) string {
-	modelDir, err := filepath.Abs("test-models")
-	Expect(err).ToNot(HaveOccurred())
-	modelPath := filepath.Join(modelDir, name)
-	if _, err := os.Stat(modelPath); err != nil {
-		Skip(fmt.Sprintf("model not present: %s (run test.sh first)", modelPath))
-	}
-	return modelPath
-}
-
-var _ = Describe("locate-anything-cpp backend", func() {
-	It("runs open-vocabulary detection against a known-good COCO image", func() {
-		modelPath := modelPathOrSkip("locate-anything-q8_0.gguf")
-		imgB64 := loadTestImage()
-
-		port := freePort()
-		cleanup := startBackend(port)
-		defer cleanup()
-
-		client, closeConn := dialBackend(port)
-		defer closeConn()
-
-		// The q8_0 model is ~6.3 GB and hybrid Parallel Box Decoding on CPU is
-		// not cheap, so give LoadModel + Detect a generous deadline.
-		ctx, cancel := context.WithTimeout(context.Background(), 20*time.Minute)
-		defer cancel()
-
-		loadResp, err := client.LoadModel(ctx, &pb.ModelOptions{
-			Model:     "locate-anything-q8_0.gguf",
-			ModelFile: modelPath,
-			Threads:   4,
-		})
-		Expect(err).ToNot(HaveOccurred(), "LoadModel")
-		Expect(loadResp.GetSuccess()).To(BeTrue(), "LoadModel reported failure: %s", loadResp.GetMessage())
-
-		// Open-vocabulary detection is prompt-driven; the prompt names the
-		// classes to locate (people + cars), separated by the </c> control token.
-		detResp, err := client.Detect(ctx, &pb.DetectOptions{
-			Src:    imgB64,
-			Prompt: "Locate all the instances that matches the following description: person</c>car.",
-		})
-		Expect(err).ToNot(HaveOccurred(), "Detect")
-		Expect(detResp.GetDetections()).ToNot(BeEmpty(), "no detections returned on a known-good COCO image")
-
-		_, _ = fmt.Fprintf(GinkgoWriter, "detection OK: %d detections\n", len(detResp.GetDetections()))
-		for i, d := range detResp.GetDetections() {
-			Expect(d.GetClassName()).ToNot(BeEmpty(), "detection %d has empty class_name", i)
-			Expect(d.GetWidth()).To(BeNumerically(">", float32(0)),
-				"detection %d has non-positive width", i)
-			Expect(d.GetHeight()).To(BeNumerically(">", float32(0)),
-				"detection %d has non-positive height", i)
-			_, _ = fmt.Fprintf(GinkgoWriter, "  [%d] %s box=(%.1f,%.1f,%.1fx%.1f)\n",
-				i, d.GetClassName(), d.GetX(), d.GetY(), d.GetWidth(), d.GetHeight())
-		}
-	})
-})
--- a/backend/go/locate-anything-cpp/package.sh
+++ b/backend/go/locate-anything-cpp/package.sh
@@ -1,59 +0,0 @@
-#!/bin/bash
-
-# Script to copy the appropriate libraries based on architecture
-
-set -e
-
-CURDIR=$(dirname "$(realpath $0)")
-REPO_ROOT="${CURDIR}/../../.."
-
-# Create lib directory
-mkdir -p $CURDIR/package/lib
-
-cp -avf $CURDIR/liblocateanythingcpp-*.so $CURDIR/package/
-cp -avf $CURDIR/locate-anything-cpp $CURDIR/package/
-cp -fv $CURDIR/run.sh $CURDIR/package/
-
-# Detect architecture and copy appropriate libraries
-if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
-    # x86_64 architecture
-    echo "Detected x86_64 architecture, copying x86_64 libraries..."
-    cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
-    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
-    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
-    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
-    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
-    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
-    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
-    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
-    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
-elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
-    # ARM64 architecture
-    echo "Detected ARM64 architecture, copying ARM64 libraries..."
-    cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
-    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
-    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
-    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
-    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
-    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
-    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
-    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
-    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
-elif [ $(uname -s) = "Darwin" ]; then
-    echo "Detected Darwin"
-else
-    echo "Error: Could not detect architecture"
-    exit 1
-fi
-
-# Package GPU libraries based on BUILD_TYPE
-GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
-if [ -f "$GPU_LIB_SCRIPT" ]; then
-    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
-    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
-    package_gpu_libs
-fi
-
-echo "Packaging completed successfully"
-ls -liah $CURDIR/package/
-ls -liah $CURDIR/package/lib/
--- a/backend/go/locate-anything-cpp/run.sh
+++ b/backend/go/locate-anything-cpp/run.sh
@@ -1,52 +0,0 @@
-#!/bin/bash
-set -ex
-
-# Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath $0)")
-
-cd /
-
-echo "CPU info:"
-if [ "$(uname)" != "Darwin" ]; then
-	grep -e "model\sname" /proc/cpuinfo | head -1
-	grep -e "flags" /proc/cpuinfo | head -1
-fi
-
-LIBRARY="$CURDIR/liblocateanythingcpp-fallback.so"
-
-if [ "$(uname)" != "Darwin" ]; then
-	if grep -q -e "\savx\s" /proc/cpuinfo ; then
-		echo "CPU:    AVX    found OK"
-		if [ -e $CURDIR/liblocateanythingcpp-avx.so ]; then
-			LIBRARY="$CURDIR/liblocateanythingcpp-avx.so"
-		fi
-	fi
-
-	if grep -q -e "\savx2\s" /proc/cpuinfo ; then
-		echo "CPU:    AVX2   found OK"
-		if [ -e $CURDIR/liblocateanythingcpp-avx2.so ]; then
-			LIBRARY="$CURDIR/liblocateanythingcpp-avx2.so"
-		fi
-	fi
-
-	# Check avx 512
-	if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
-		echo "CPU:    AVX512F found OK"
-		if [ -e $CURDIR/liblocateanythingcpp-avx512.so ]; then
-			LIBRARY="$CURDIR/liblocateanythingcpp-avx512.so"
-		fi
-	fi
-fi
-
-export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
-export LOCATEANYTHING_LIBRARY=$LIBRARY
-
-# If there is a lib/ld.so, use it
-if [ -f $CURDIR/lib/ld.so ]; then
-	echo "Using lib/ld.so"
-	echo "Using library: $LIBRARY"
-	exec $CURDIR/lib/ld.so $CURDIR/locate-anything-cpp "$@"
-fi
-
-echo "Using library: $LIBRARY"
-exec $CURDIR/locate-anything-cpp "$@"
--- a/backend/go/locate-anything-cpp/test.sh
+++ b/backend/go/locate-anything-cpp/test.sh
@@ -1,47 +0,0 @@
-#!/bin/bash
-set -e
-
-CURDIR=$(dirname "$(realpath $0)")
-
-echo "Running locate-anything-cpp backend tests..."
-
-# Test model from the mudler/locate-anything.cpp-gguf HuggingFace repo. This is
-# the q8_0 quantization of nvidia/LocateAnything-3B (~6.3 GB), so the download
-# is the slow step. It is resumed with `curl -C -` and skipped entirely if the
-# file is already present.
-LOCATEANYTHING_MODEL_DIR="${LOCATEANYTHING_MODEL_DIR:-$CURDIR/test-models}"
-
-LOCATEANYTHING_MODEL_FILE="${LOCATEANYTHING_MODEL_FILE:-locate-anything-q8_0.gguf}"
-LOCATEANYTHING_MODEL_URL="${LOCATEANYTHING_MODEL_URL:-https://huggingface.co/mudler/locate-anything.cpp-gguf/resolve/main/locate-anything-q8_0.gguf}"
-
-mkdir -p "$LOCATEANYTHING_MODEL_DIR"
-
-if [ ! -f "$LOCATEANYTHING_MODEL_DIR/$LOCATEANYTHING_MODEL_FILE" ]; then
-    echo "Downloading locate-anything q8_0 model (~6.3 GB, this is slow)..."
-    # -C - resumes a partial download so an interrupted run doesn't restart from 0.
-    curl -L -C - -o "$LOCATEANYTHING_MODEL_DIR/$LOCATEANYTHING_MODEL_FILE" "$LOCATEANYTHING_MODEL_URL" --progress-bar
-fi
-
-# Use a real COCO test image (people + cars) from the upstream rf-detr.cpp repo
-# (~46 KB). Open-vocabulary detection needs real content to locate, so a
-# synthetic image would trivially yield zero detections.
-TEST_IMAGE_DIR="$CURDIR/test-data"
-TEST_IMAGE_FILE="$TEST_IMAGE_DIR/test.jpg"
-TEST_IMAGE_URL="${TEST_IMAGE_URL:-https://raw.githubusercontent.com/mudler/rf-detr.cpp/main/tests/fixtures/ci/test_image.jpg}"
-
-mkdir -p "$TEST_IMAGE_DIR"
-if [ ! -f "$TEST_IMAGE_FILE" ]; then
-    echo "Downloading COCO test image..."
-    curl -L -o "$TEST_IMAGE_FILE" "$TEST_IMAGE_URL" --progress-bar
-fi
-
-echo "locate-anything-cpp test setup complete."
-echo "  model:      $LOCATEANYTHING_MODEL_DIR/$LOCATEANYTHING_MODEL_FILE"
-echo "  test image: $TEST_IMAGE_FILE"
-
-# Run the Go smoke test: spawns the backend binary on a free port, calls
-# LoadModel + Detect via gRPC against the downloaded GGUF + COCO image.
-echo ""
-echo "Running Go smoke test..."
-cd "$CURDIR"
-go test -v -timeout 30m ./...
--- a/backend/go/parakeet-cpp/Makefile
+++ b/backend/go/parakeet-cpp/Makefile
@@ -1,6 +1,6 @@
 # parakeet-cpp backend Makefile.
 #
-# Upstream pin lives below as PARAKEET_VERSION?=b8012f11e5269126eddb7f4fd02f891a2ccc29b0
+# Upstream pin lives below as PARAKEET_VERSION?=e270af73b94c9a5c37ec516230219ed4580e1db6
 # (.github/bump_deps.sh) can find and update it - matches the
 # whisper.cpp / ds4 / vibevoice-cpp convention.
 #
@@ -15,7 +15,7 @@
 # That's what the L0 smoke test uses. The default target below does the
 # proper clone-at-pin + cmake build so CI doesn't need a side-checkout.

-PARAKEET_VERSION?=b8012f11e5269126eddb7f4fd02f891a2ccc29b0
+PARAKEET_VERSION?=e270af73b94c9a5c37ec516230219ed4580e1db6
 PARAKEET_REPO?=https://github.com/mudler/parakeet.cpp

 GOCMD?=go
@@ -39,10 +39,7 @@ endif
 # is overwritten back to OFF and the build silently falls back to CPU. Forward the
 # PARAKEET_GGML_* options instead. (openblas is not gated, so -DGGML_BLAS passes through.)
 ifeq ($(BUILD_TYPE),cublas)
-	# GGML_CUDA_GRAPHS is OFF by ggml default; enabling it gives a small free
-	# speedup (~1% measured on GB10, never negative) by capturing/replaying the
-	# CUDA graph. Not gated by parakeet.cpp, so it passes straight through to ggml.
-	CMAKE_ARGS+=-DPARAKEET_GGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON
+	CMAKE_ARGS+=-DPARAKEET_GGML_CUDA=ON
 else ifeq ($(BUILD_TYPE),openblas)
 	CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
 else ifeq ($(BUILD_TYPE),hipblas)
--- a/backend/go/parakeet-cpp/goparakeetcpp.go
+++ b/backend/go/parakeet-cpp/goparakeetcpp.go
@@ -98,21 +98,17 @@ type transcriptJSON struct {
 }

 // streamFeedJSON mirrors the document returned by
-// parakeet_capi_stream_feed_json / parakeet_capi_stream_finalize_json (ABI v5):
+// parakeet_capi_stream_feed_json / parakeet_capi_stream_finalize_json (ABI v4):
 //
-//	{"text":"...","eou":0,"eob":0,"frame_sec":0.080000,
+//	{"text":"...","eou":0,"frame_sec":0.080000,
 //	 "words":[{"w":"...","start":0.480,"end":0.640,"conf":0.9100}, ...]}
 //
 // "text" is the newly-finalized text since the last call; "eou" is 1 when an
-// <EOU> (end of utterance) fired this feed and "eob" is 1 when an <EOB>
-// (backchannel) fired. ABI v4 conflated the two into "eou"; v5 split them, so
-// we read both and treat either as an utterance boundary for segmentation.
-// "words" are the words finalized this call with absolute (stream-relative)
-// start/end seconds.
+// <EOU>/<EOB> fired this feed; "words" are the words finalized this call with
+// absolute (stream-relative) start/end seconds.
 type streamFeedJSON struct {
 	Text     string           `json:"text"`
 	Eou      int              `json:"eou"`
-	Eob      int              `json:"eob"`
 	FrameSec float64          `json:"frame_sec"`
 	Words    []transcriptWord `json:"words"`
 }
@@ -487,10 +483,7 @@ type streamSegmenter struct {

 func (s *streamSegmenter) add(doc streamFeedJSON) {
 	s.cur = append(s.cur, doc.Words...)
-	// Close the segment on either turn signal: <EOU> (end of utterance) or
-	// <EOB> (backchannel). ABI v4 reported both via "eou"; v5 split them, so we
-	// OR them here to keep the v4 segmentation boundaries.
-	if doc.Eou != 0 || doc.Eob != 0 {
+	if doc.Eou != 0 {
 		s.flush()
 	}
 }
@@ -678,12 +671,11 @@ func (p *ParakeetCpp) AudioTranscriptionStream(ctx context.Context, opts *pb.Tra
 	return nil
 }

-// streamJSON drives the streaming JSON entry points (present since ABI v4): each
-// feed/finalize returns a {text,eou,eob,frame_sec,words} document. The
-// newly-finalized text is emitted as a delta (unchanged streaming contract)
-// while words are accumulated into per-utterance segments (closed on <EOU> or
-// <EOB>) so the closing FinalResult carries timestamped segments. Runs under
-// engineMu (already held by the caller).
+// streamJSON drives the ABI v4 streaming JSON entry points: each feed/finalize
+// returns a {text,eou,frame_sec,words} document. The newly-finalized text is
+// emitted as a delta (unchanged streaming contract) while words are accumulated
+// into per-utterance segments (closed on EOU) so the closing FinalResult carries
+// timestamped segments. Runs under engineMu (already held by the caller).
 func (p *ParakeetCpp) streamJSON(ctx context.Context, stream uintptr, data []float32,
 	duration float32, results chan *pb.TranscriptStreamResponse) error {
 	var (
--- a/backend/go/parakeet-cpp/segments_test.go
+++ b/backend/go/parakeet-cpp/segments_test.go
@@ -124,17 +124,4 @@ var _ = Describe("streaming segment assembly", func() {
 		Expect(acc.segments()).To(HaveLen(1))
 		Expect(acc.segments()[0].Text).To(Equal("hi there"))
 	})
-
-	// ABI v5 split <EOB> (backchannel) out of the "eou" flag into its own "eob"
-	// field; a backchannel must still close the segment as it did in v4.
-	It("closes a segment on EOB (backchannel) too", func() {
-		acc := &streamSegmenter{}
-		acc.add(streamFeedJSON{Text: "uh huh", Eou: 0, Eob: 1, Words: []transcriptWord{
-			{W: "uh", Start: 0.0, End: 0.2}, {W: "huh", Start: 0.2, End: 0.5},
-		}})
-		segs := acc.segments()
-		Expect(segs).To(HaveLen(1))
-		Expect(segs[0].Text).To(Equal("uh huh"))
-		Expect(segs[0].End).To(Equal(secondsToNanos(0.5)))
-	})
 })
--- a/backend/go/vibevoice-cpp/CMakeLists.txt
+++ b/backend/go/vibevoice-cpp/CMakeLists.txt
@@ -26,16 +26,8 @@ add_library(govibevoicecpp MODULE cpp/govibevoicecpp.cpp)
 # vv_capi_* symbols (purego dlopens them by name, nothing in our
 # translation unit references them). Force the static archive's
 # entire contents into the MODULE so dlsym finds vv_capi_load etc.
-#
-# Link the `vibevoice` TARGET (not a bare archive path) so CMake builds
-# libvibevoice.a first and tracks the dependency: the upstream project is added
-# with EXCLUDE_FROM_ALL, so without a target-level link there is no rule to
-# build it. Passing only $<TARGET_FILE:vibevoice> as a path on Apple left the
-# build with "No rule to make target 'vibevoice/libvibevoice.a'" (issue #10267).
-# force_load is then applied as a separate link option.
 if(APPLE)
-    target_link_libraries(govibevoicecpp PRIVATE vibevoice)
-    target_link_options(govibevoicecpp PRIVATE "-Wl,-force_load,$<TARGET_FILE:vibevoice>")
+    target_link_libraries(govibevoicecpp PRIVATE -Wl,-force_load $<TARGET_FILE:vibevoice>)
 elseif(MSVC)
    target_link_libraries(govibevoicecpp PRIVATE vibevoice)
    set_property(TARGET govibevoicecpp APPEND PROPERTY LINK_FLAGS "/WHOLEARCHIVE:vibevoice")
--- a/backend/go/vibevoice-cpp/Makefile
+++ b/backend/go/vibevoice-cpp/Makefile
@@ -94,30 +94,26 @@ purge:
 # Build all variants (Linux only)
 ifeq ($(UNAME_S),Linux)
 libgovibevoicecpp-avx.so: sources/vibevoice.cpp
-	$(MAKE) purge
 	$(info ${GREEN}I vibevoice-cpp build info:avx${RESET})
 	SO_TARGET=libgovibevoicecpp-avx.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgovibevoicecpp-custom
-	rm -rfv build*
+	rm -rf build-libgovibevoicecpp-avx.so

 libgovibevoicecpp-avx2.so: sources/vibevoice.cpp
-	$(MAKE) purge
 	$(info ${GREEN}I vibevoice-cpp build info:avx2${RESET})
 	SO_TARGET=libgovibevoicecpp-avx2.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgovibevoicecpp-custom
-	rm -rfv build*
+	rm -rf build-libgovibevoicecpp-avx2.so

 libgovibevoicecpp-avx512.so: sources/vibevoice.cpp
-	$(MAKE) purge
 	$(info ${GREEN}I vibevoice-cpp build info:avx512${RESET})
 	SO_TARGET=libgovibevoicecpp-avx512.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgovibevoicecpp-custom
-	rm -rfv build*
+	rm -rf build-libgovibevoicecpp-avx512.so
 endif

 # Build fallback variant (all platforms)
 libgovibevoicecpp-fallback.so: sources/vibevoice.cpp
-	$(MAKE) purge
 	$(info ${GREEN}I vibevoice-cpp build info:fallback${RESET})
 	SO_TARGET=libgovibevoicecpp-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgovibevoicecpp-custom
-	rm -rfv build*
+	rm -rf build-libgovibevoicecpp-fallback.so

 libgovibevoicecpp-custom: CMakeLists.txt cpp/govibevoicecpp.cpp cpp/govibevoicecpp.h
 	mkdir -p build-$(SO_TARGET) && \
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -95,6 +95,29 @@
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ds4"
    metal: "metal-ds4"
    metal-darwin-arm64: "metal-ds4"
+- &dllm
+  name: "dllm"
+  alias: "dllm"
+  license: mit
+  description: |
+    mudler/dllm.cpp - DiffusionGemma block-diffusion LLM inference engine
+    (C++/ggml, GGUF weights). Decodes whole token canvases per diffusion
+    round instead of autoregressive sampling. Runs on CPU and NVIDIA CUDA 13
+    (including Jetson/GB10 L4T targets).
+  urls:
+    - https://github.com/mudler/dllm.cpp
+  tags:
+    - text-to-text
+    - LLM
+    - gguf
+    - diffusion
+    - CPU
+    - CUDA
+  capabilities:
+    default: "cpu-dllm"
+    nvidia: "cuda13-dllm"
+    nvidia-cuda-13: "cuda13-dllm"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-dllm"
 - &whispercpp
  name: "whisper"
  alias: "whisper"
@@ -337,35 +360,6 @@
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-rfdetr-cpp"
    intel: "intel-sycl-f32-rfdetr-cpp"
    vulkan: "vulkan-rfdetr-cpp"
-- &locateanything
-  name: "locate-anything"
-  alias: "locate-anything"
-  license: apache-2.0
-  description: |
-    Open-vocabulary object detection and visual grounding (NVIDIA
-    LocateAnything-3B) in C/C++ using GGML. Loads pre-built GGUF weights
-    and, given an image and a free-form text prompt, returns bounding
-    boxes, class labels, and confidence scores for the referred objects.
-  urls:
-    - https://github.com/mudler/locate-anything.cpp
-    - https://huggingface.co/nvidia/LocateAnything-3B
-  tags:
-    - object-detection
-    - visual-grounding
-    - open-vocabulary
-    - locate-anything
-    - gpu
-    - cpu
-  capabilities:
-    default: "cpu-locate-anything-cpp"
-    nvidia: "cuda12-locate-anything-cpp"
-    nvidia-cuda-12: "cuda12-locate-anything-cpp"
-    nvidia-cuda-13: "cuda13-locate-anything-cpp"
-    nvidia-l4t: "nvidia-l4t-arm64-locate-anything-cpp"
-    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-locate-anything-cpp"
-    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-locate-anything-cpp"
-    intel: "intel-sycl-f32-locate-anything-cpp"
-    vulkan: "vulkan-locate-anything-cpp"
 - &vllm
  name: "vllm"
  license: apache-2.0
@@ -1254,7 +1248,6 @@
    default: "cpu-sherpa-onnx"
    nvidia: "cuda12-sherpa-onnx"
    nvidia-cuda-12: "cuda12-sherpa-onnx"
-    metal: "metal-sherpa-onnx"
 - !!merge <<: *neutts
  name: "neutts-development"
  capabilities:
@@ -1302,6 +1295,13 @@
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ds4-development"
    metal: "metal-ds4-development"
    metal-darwin-arm64: "metal-ds4-development"
+- !!merge <<: *dllm
+  name: "dllm-development"
+  capabilities:
+    default: "cpu-dllm-development"
+    nvidia: "cuda13-dllm-development"
+    nvidia-cuda-13: "cuda13-dllm-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-dllm-development"
 - !!merge <<: *stablediffusionggml
  name: "stablediffusion-ggml-development"
  capabilities:
@@ -1587,7 +1587,6 @@
    - localai/localai-backends:master-metal-darwin-arm64-kitten-tts
 - !!merge <<: *local-store
  name: "local-store-development"
-  alias: "local-store"
  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-local-store"
  mirrors:
    - localai/localai-backends:master-cpu-local-store
@@ -1598,7 +1597,6 @@
    - localai/localai-backends:latest-metal-darwin-arm64-local-store
 - !!merge <<: *local-store
  name: "metal-local-store-development"
-  alias: "local-store"
  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-local-store"
  mirrors:
    - localai/localai-backends:master-metal-darwin-arm64-local-store
@@ -1891,6 +1889,37 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-ds4"
  mirrors:
    - localai/localai-backends:master-metal-darwin-arm64-ds4
+## dllm
+- !!merge <<: *dllm
+  name: "cpu-dllm"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-dllm"
+  mirrors:
+    - localai/localai-backends:latest-cpu-dllm
+- !!merge <<: *dllm
+  name: "cpu-dllm-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-dllm"
+  mirrors:
+    - localai/localai-backends:master-cpu-dllm
+- !!merge <<: *dllm
+  name: "cuda13-dllm"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-dllm"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-13-dllm
+- !!merge <<: *dllm
+  name: "cuda13-dllm-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-dllm"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-13-dllm
+- !!merge <<: *dllm
+  name: "cuda13-nvidia-l4t-arm64-dllm"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-dllm"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-dllm
+- !!merge <<: *dllm
+  name: "cuda13-nvidia-l4t-arm64-dllm-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-dllm"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-dllm
 ## whisper
 - !!merge <<: *whispercpp
  name: "whisper-development"
@@ -4717,14 +4746,12 @@
    default: "cpu-speaker-recognition"
    nvidia: "cuda12-speaker-recognition"
    nvidia-cuda-12: "cuda12-speaker-recognition"
-    metal: "metal-speaker-recognition"
 - !!merge <<: *speakerrecognition
  name: "speaker-recognition-development"
  capabilities:
    default: "cpu-speaker-recognition-development"
    nvidia: "cuda12-speaker-recognition-development"
    nvidia-cuda-12: "cuda12-speaker-recognition-development"
-    metal: "metal-speaker-recognition-development"
 - !!merge <<: *speakerrecognition
  name: "cpu-speaker-recognition"
  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-speaker-recognition"
@@ -4745,16 +4772,6 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-speaker-recognition"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-speaker-recognition
-- !!merge <<: *speakerrecognition
-  name: "metal-speaker-recognition"
-  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-speaker-recognition"
-  mirrors:
-    - localai/localai-backends:latest-metal-darwin-arm64-speaker-recognition
-- !!merge <<: *speakerrecognition
-  name: "metal-speaker-recognition-development"
-  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-speaker-recognition"
-  mirrors:
-    - localai/localai-backends:master-metal-darwin-arm64-speaker-recognition
 ## sherpa-onnx
 - !!merge <<: *sherpa-onnx
  name: "sherpa-onnx-development"
@@ -4762,7 +4779,6 @@
    default: "cpu-sherpa-onnx-development"
    nvidia: "cuda12-sherpa-onnx-development"
    nvidia-cuda-12: "cuda12-sherpa-onnx-development"
-    metal: "metal-sherpa-onnx-development"
 - !!merge <<: *sherpa-onnx
  name: "cpu-sherpa-onnx"
  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-sherpa-onnx"
@@ -4783,13 +4799,3 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sherpa-onnx"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-sherpa-onnx
-- !!merge <<: *sherpa-onnx
-  name: "metal-sherpa-onnx"
-  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-sherpa-onnx"
-  mirrors:
-    - localai/localai-backends:latest-metal-darwin-arm64-sherpa-onnx
-- !!merge <<: *sherpa-onnx
-  name: "metal-sherpa-onnx-development"
-  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-sherpa-onnx"
-  mirrors:
-    - localai/localai-backends:master-metal-darwin-arm64-sherpa-onnx
--- a/backend/python/mlx/backend.py
+++ b/backend/python/mlx/backend.py
@@ -407,24 +407,6 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
        if not request.Prompt and request.UseTokenizerTemplate and request.Messages:
            messages = messages_to_dicts(request.Messages)

-            # The mlx-lm tokenizer only carries a text-LM chat template. A
-            # vision-language checkpoint (e.g. gemma-4 E4B) loaded here has no
-            # usable template, so apply_chat_template silently passes the raw
-            # text through and the model just echoes/loops (issue #10269).
-            # Warn loudly so the misroute is visible; such models belong on the
-            # mlx-vlm backend.
-            chat_template = getattr(self.tokenizer, "chat_template", None)
-            if not chat_template:
-                underlying = getattr(self.tokenizer, "_tokenizer", None)
-                chat_template = getattr(underlying, "chat_template", None)
-            if not chat_template:
-                print(
-                    "WARNING: this model has no chat template; output may be "
-                    "degenerate. Vision-language models (e.g. gemma-4 E4B) must "
-                    "use the 'mlx-vlm' backend instead of 'mlx'.",
-                    file=sys.stderr,
-                )
-
            kwargs = {"tokenize": False, "add_generation_prompt": True}
            if request.Tools:
                try:
--- a/backend/python/speaker-recognition/requirements-mps.txt
+++ b/backend/python/speaker-recognition/requirements-mps.txt
@@ -1,5 +0,0 @@
-torch
-torchaudio
-speechbrain
-transformers
-onnxruntime
--- a/backend/python/vllm/backend.py
+++ b/backend/python/vllm/backend.py
@@ -150,24 +150,9 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
                d["reasoning_content"] = msg.reasoning_content
            if msg.tool_calls:
                try:
-                    tool_calls = json.loads(msg.tool_calls)
+                    d["tool_calls"] = json.loads(msg.tool_calls)
                except json.JSONDecodeError:
                    pass
-                else:
-                    # OpenAI wire format carries function.arguments as a
-                    # JSON-encoded string, but chat templates (e.g. Qwen3)
-                    # iterate over it as a mapping. vLLM's own OpenAI server
-                    # parses arguments before applying the template, so do
-                    # the same here.
-                    if isinstance(tool_calls, list):
-                        for tc in tool_calls:
-                            func = tc.get("function") if isinstance(tc, dict) else None
-                            if isinstance(func, dict) and isinstance(func.get("arguments"), str):
-                                try:
-                                    func["arguments"] = json.loads(func["arguments"])
-                                except json.JSONDecodeError:
-                                    pass
-                    d["tool_calls"] = tool_calls
            result.append(d)
        return result

--- a/core/application/mitm.go
+++ b/core/application/mitm.go
@@ -11,29 +11,6 @@ import (
 	"github.com/mudler/xlog"
 )

-// startMITMIfConfigured brings up the cloudproxy MITM listener when an
-// address is configured, treating any startup failure as non-fatal.
-//
-// The listener is opt-in middleware whose address is persisted in runtime
-// settings (/api/settings → runtime_settings.json) and replayed on every
-// boot. A bad value — e.g. a host the process can't bind, like a LAN IP
-// inside a container — must NOT abort the whole server: doing so crash-loops
-// with no way out, because the Settings UI used to correct the address can't
-// load if startup never completes. So on failure we log loudly and carry on;
-// the admin fixes the address via /api/settings, which calls RestartMITM.
-func startMITMIfConfigured(app *Application, options *config.ApplicationConfig) {
-	if options.MITMListen == "" {
-		return
-	}
-	if err := startMITMProxy(app, options); err != nil {
-		xlog.Error("mitm: cloudproxy listener failed to start — continuing without it",
-			"listen", options.MITMListen,
-			"error", err,
-			"hint", "fix the address via Settings (e.g. \":8082\" to bind all interfaces) and the listener will restart",
-		)
-	}
-}
-
 func startMITMProxy(app *Application, options *config.ApplicationConfig) error {
 	app.mitmMutex.Lock()
 	defer app.mitmMutex.Unlock()
--- a/core/application/mitm_test.go
+++ b/core/application/mitm_test.go
@@ -1,58 +0,0 @@
-package application
-
-import (
-	"github.com/mudler/LocalAI/core/config"
-	"github.com/mudler/LocalAI/pkg/system"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-// minimal Application wired enough for startMITMProxy: an empty model
-// config loader (no host claims), CA written under a temp DataPath.
-func newMITMTestApp(dataPath string) (*Application, *config.ApplicationConfig) {
-	state, err := system.GetSystemState()
-	Expect(err).NotTo(HaveOccurred())
-	state.Model.ModelsPath = dataPath
-	opts := config.NewApplicationConfig(
-		config.WithSystemState(state),
-		config.WithDataPath(dataPath),
-	)
-	return newApplication(opts), opts
-}
-
-var _ = Describe("startMITMIfConfigured", func() {
-	It("does nothing when no listen address is configured", func() {
-		app, opts := newMITMTestApp(GinkgoT().TempDir())
-		opts.MITMListen = ""
-
-		Expect(func() { startMITMIfConfigured(app, opts) }).NotTo(Panic())
-		Expect(app.mitmServer.Load()).To(BeNil(), "no listener should be stored when disabled")
-	})
-
-	// Regression: a persisted-but-unbindable MITM address (e.g. a LAN host
-	// inside a container) must not abort startup. startMITMIfConfigured
-	// swallows the bind error so the rest of LocalAI still comes up and the
-	// admin can fix the address via the Settings UI.
-	It("logs and continues when the listen address cannot be bound", func() {
-		app, opts := newMITMTestApp(GinkgoT().TempDir())
-		// 192.0.2.1 is TEST-NET-1 (RFC 5737): guaranteed not assigned to any
-		// local interface, so bind fails deterministically without DNS.
-		opts.MITMListen = "192.0.2.1:8082"
-
-		Expect(func() { startMITMIfConfigured(app, opts) }).NotTo(Panic())
-		Expect(app.mitmServer.Load()).To(BeNil(), "failed listener must not be stored")
-	})
-
-	It("starts and stores the listener on a bindable address", func() {
-		app, opts := newMITMTestApp(GinkgoT().TempDir())
-		opts.MITMListen = "127.0.0.1:0" // OS-assigned free port
-
-		startMITMIfConfigured(app, opts)
-
-		srv := app.mitmServer.Load()
-		Expect(srv).NotTo(BeNil(), "listener should be stored on success")
-		DeferCleanup(srv.Stop)
-		Expect(srv.Addr()).NotTo(BeEmpty())
-	})
-})
--- a/core/application/router_factories.go
+++ b/core/application/router_factories.go
@@ -1,120 +1,63 @@
 package application

 import (
-	"context"
-	"fmt"
-
 	"github.com/mudler/LocalAI/core/backend"
 	"github.com/mudler/LocalAI/core/config"
 )

-// adapterConfig resolves a model name to its runtime ModelConfig, or nil when
-// unknown. LoadModelConfigFileByNameDefaultOptions never returns nil — for an
-// unknown name it returns a defaults-filled stub with an empty Name (the YAML
-// `name:` field is required by Validate), which is how we tell the two apart.
+// adapterConfig resolves a model name to its runtime ModelConfig, or
+// nil when the name is unknown. Shared by the router-facing factories
+// below and by ModelConfigLookup.
 func (a *Application) adapterConfig(modelName string) *config.ModelConfig {
 	cfg, err := a.backendLoader.LoadModelConfigFileByNameDefaultOptions(modelName, a.applicationConfig)
-	if err != nil || cfg == nil || cfg.Name == "" {
+	if err != nil || cfg == nil {
 		return nil
 	}
 	return cfg
 }

-// ModelConfigLookup is the lookup the router middleware's classifier validator
-// uses to confirm classifier_model declares FLAG_SCORE before binding it.
+// ModelConfigLookup is the lookup function the router middleware's
+// classifier validator uses to confirm classifier_model declares
+// FLAG_SCORE before binding it.
 func (a *Application) ModelConfigLookup() func(modelName string) *config.ModelConfig {
 	return a.adapterConfig
 }

-// The router-facing factories below (Scorer, Embedder, Reranker, TokenCounter)
-// bind a model NAME at construction and re-resolve the CONFIG on every call.
-// Capturing the config at construction would bake in whatever state
-// adapterConfig saw first — including a stub returned before the YAML reached
-// bcl.configs (e.g. /import-model or gallery install racing startup). The
-// classifier registry caches factories by router-config fingerprint, so a
-// once-stale capture stays stale until the router config is edited.
-
+// Scorer returns a backend.Scorer bound to the named model, or nil
+// when the model is unknown. Used as a method value (app.Scorer) by
+// router.ClassifierDeps — no factory-of-factory wrapper needed.
 func (a *Application) Scorer(modelName string) backend.Scorer {
-	if a.adapterConfig(modelName) == nil {
-		return nil
-	}
-	return &lazyScorer{app: a, modelName: modelName}
-}
-
-type lazyScorer struct {
-	app       *Application
-	modelName string
-}
-
-func (l *lazyScorer) Score(ctx context.Context, prompt string, candidates []string) ([]backend.CandidateScore, error) {
-	cfg := l.app.adapterConfig(l.modelName)
+	cfg := a.adapterConfig(modelName)
 	if cfg == nil {
-		return nil, fmt.Errorf("scorer: model %q no longer available", l.modelName)
-	}
-	return backend.NewScorer(l.app.modelLoader, *cfg, l.app.applicationConfig).Score(ctx, prompt, candidates)
-}
-
-// TokenCounter returns a func so the middleware's literal field type accepts
-// it as a method value without importing core/http/middleware from here.
-func (a *Application) TokenCounter(modelName string) func(string) (int, error) {
-	if a.adapterConfig(modelName) == nil {
 		return nil
 	}
-	return func(text string) (int, error) {
-		cfg := a.adapterConfig(modelName)
-		if cfg == nil {
-			return 0, fmt.Errorf("token counter: model %q no longer available", modelName)
-		}
-		resp, err := backend.ModelTokenize(text, a.modelLoader, *cfg, a.applicationConfig)
-		if err != nil {
-			return 0, err
-		}
-		return len(resp.Tokens), nil
-	}
+	return backend.NewScorer(a.modelLoader, *cfg, a.applicationConfig)
 }

+// Reranker returns a backend.Reranker bound to the named model, or
+// nil when unknown. The reranker model's `type:` (e.g. "colbert")
+// selects the scoring head inside the rerankers backend.
 func (a *Application) Reranker(modelName string) backend.Reranker {
-	if a.adapterConfig(modelName) == nil {
+	cfg := a.adapterConfig(modelName)
+	if cfg == nil {
 		return nil
 	}
-	return &lazyReranker{app: a, modelName: modelName}
-}
-
-type lazyReranker struct {
-	app       *Application
-	modelName string
-}
-
-func (l *lazyReranker) Rerank(ctx context.Context, query string, documents []string) ([]backend.RerankResult, error) {
-	cfg := l.app.adapterConfig(l.modelName)
-	if cfg == nil {
-		return nil, fmt.Errorf("reranker: model %q no longer available", l.modelName)
-	}
-	return backend.NewReranker(l.app.modelLoader, *cfg, l.app.applicationConfig).Rerank(ctx, query, documents)
+	return backend.NewReranker(a.modelLoader, *cfg, a.applicationConfig)
 }

+// Embedder returns a backend.Embedder bound to the named model, or
+// nil when unknown. Used by the router's L2 embedding cache.
 func (a *Application) Embedder(modelName string) backend.Embedder {
-	if a.adapterConfig(modelName) == nil {
+	cfg := a.adapterConfig(modelName)
+	if cfg == nil {
 		return nil
 	}
-	return &lazyEmbedder{app: a, modelName: modelName}
+	return backend.NewEmbedder(a.modelLoader, *cfg, a.applicationConfig)
 }

-type lazyEmbedder struct {
-	app       *Application
-	modelName string
-}
-
-func (l *lazyEmbedder) Embed(ctx context.Context, text string) ([]float32, error) {
-	cfg := l.app.adapterConfig(l.modelName)
-	if cfg == nil {
-		return nil, fmt.Errorf("embedder: model %q no longer available", l.modelName)
-	}
-	return backend.NewEmbedder(l.app.modelLoader, *cfg, l.app.applicationConfig).Embed(ctx, text)
-}
-
-// VectorStore takes a store name, not a model name — no adapterConfig, no
-// staleness to avoid.
+// VectorStore returns a backend.VectorStore for the named collection,
+// or nil when the name is empty. Each router model gets its own
+// backend process via the model loader's cache keyed by storeName.
 func (a *Application) VectorStore(storeName string) backend.VectorStore {
 	return backend.NewVectorStore(a.modelLoader, a.applicationConfig, storeName)
 }
--- a/core/application/router_factories_test.go
+++ b/core/application/router_factories_test.go
@@ -1,155 +0,0 @@
-package application
-
-import (
-	"context"
-	"os"
-	"path/filepath"
-
-	"github.com/mudler/LocalAI/core/config"
-	"github.com/mudler/LocalAI/pkg/model"
-	"github.com/mudler/LocalAI/pkg/system"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-// Regression: the router-facing factories used to capture
-// *config.ModelConfig at construction. A gallery install that raced
-// startup left a stub (Backend="") bound for the lifetime of the
-// classifier registry's cache entry, bypassing the user's `backend:`
-// config. These specs pin the lazy re-resolve.
-var _ = Describe("router_factories lazy config resolution", func() {
-	var (
-		tmpDir string
-		app    *Application
-	)
-
-	BeforeEach(func() {
-		var err error
-		tmpDir, err = os.MkdirTemp("", "router-factories-*")
-		Expect(err).NotTo(HaveOccurred())
-
-		appCfg := &config.ApplicationConfig{
-			Context:     context.Background(),
-			SystemState: &system.SystemState{Model: system.Model{ModelsPath: tmpDir}},
-		}
-		app = &Application{
-			backendLoader:     config.NewModelConfigLoader(tmpDir),
-			modelLoader:       model.NewModelLoader(appCfg.SystemState),
-			applicationConfig: appCfg,
-		}
-	})
-
-	AfterEach(func() {
-		_ = os.RemoveAll(tmpDir)
-	})
-
-	// writeCfg seeds both the on-disk YAML and the in-memory cache —
-	// removing only the cache would fall through to file-read.
-	writeCfg := func(name, backend string) {
-		yaml := "name: " + name + "\nbackend: " + backend + "\nparameters:\n  model: " + name + ".bin\n"
-		Expect(os.WriteFile(filepath.Join(tmpDir, name+".yaml"), []byte(yaml), 0644)).To(Succeed())
-		Expect(app.backendLoader.LoadModelConfigsFromPath(tmpDir)).To(Succeed())
-		cfg, ok := app.backendLoader.GetModelConfig(name)
-		Expect(ok).To(BeTrue(), "config must be loaded before the spec runs")
-		Expect(cfg.Backend).To(Equal(backend))
-	}
-
-	// removeCfg purges both the cache and the YAML so LoadModelConfigFileByName
-	// returns the empty-stub case and adapterConfig returns nil.
-	removeCfg := func(name string) {
-		app.backendLoader.RemoveModelConfig(name)
-		Expect(os.Remove(filepath.Join(tmpDir, name+".yaml"))).To(Succeed())
-	}
-
-	Context("Embedder", func() {
-		It("returns nil at construction for an unknown model", func() {
-			Expect(app.Embedder("missing")).To(BeNil())
-		})
-
-		It("re-resolves the model config on each Embed call", func() {
-			writeCfg("emb-test", "llama-cpp")
-			emb := app.Embedder("emb-test")
-			Expect(emb).NotTo(BeNil())
-
-			// The factory must hold the NAME, not a captured config —
-			// otherwise stale captures survive cache invalidation.
-			lazy, ok := emb.(*lazyEmbedder)
-			Expect(ok).To(BeTrue(), "Embedder must return *lazyEmbedder")
-			Expect(lazy.modelName).To(Equal("emb-test"))
-
-			// Mutate the cached config. A lazy implementation sees the
-			// update on the next adapterConfig call; a captured-at-
-			// construction implementation would still see "llama-cpp".
-			app.backendLoader.UpdateModelConfig("emb-test", func(c *config.ModelConfig) {
-				c.Backend = "rerankers"
-			})
-			Expect(lazy.app.adapterConfig("emb-test").Backend).To(Equal("rerankers"))
-
-			// Remove the config entirely → Embed must surface the disappearance.
-			removeCfg("emb-test")
-			_, err := emb.Embed(context.Background(), "anything")
-			Expect(err).To(HaveOccurred())
-			Expect(err.Error()).To(ContainSubstring("no longer available"))
-		})
-	})
-
-	Context("Scorer", func() {
-		It("returns nil at construction for an unknown model", func() {
-			Expect(app.Scorer("missing")).To(BeNil())
-		})
-
-		It("re-resolves the model config on each Score call", func() {
-			writeCfg("score-test", "llama-cpp")
-			sc := app.Scorer("score-test")
-			Expect(sc).NotTo(BeNil())
-
-			lazy, ok := sc.(*lazyScorer)
-			Expect(ok).To(BeTrue(), "Scorer must return *lazyScorer")
-			Expect(lazy.modelName).To(Equal("score-test"))
-
-			removeCfg("score-test")
-			_, err := sc.Score(context.Background(), "prompt", []string{"a"})
-			Expect(err).To(HaveOccurred())
-			Expect(err.Error()).To(ContainSubstring("no longer available"))
-		})
-	})
-
-	Context("Reranker", func() {
-		It("returns nil at construction for an unknown model", func() {
-			Expect(app.Reranker("missing")).To(BeNil())
-		})
-
-		It("re-resolves the model config on each Rerank call", func() {
-			writeCfg("rerank-test", "rerankers")
-			rr := app.Reranker("rerank-test")
-			Expect(rr).NotTo(BeNil())
-
-			lazy, ok := rr.(*lazyReranker)
-			Expect(ok).To(BeTrue(), "Reranker must return *lazyReranker")
-			Expect(lazy.modelName).To(Equal("rerank-test"))
-
-			removeCfg("rerank-test")
-			_, err := rr.Rerank(context.Background(), "q", []string{"d"})
-			Expect(err).To(HaveOccurred())
-			Expect(err.Error()).To(ContainSubstring("no longer available"))
-		})
-	})
-
-	Context("TokenCounter", func() {
-		It("returns nil at construction for an unknown model", func() {
-			Expect(app.TokenCounter("missing")).To(BeNil())
-		})
-
-		It("re-resolves the model config on each call", func() {
-			writeCfg("tok-test", "llama-cpp")
-			tc := app.TokenCounter("tok-test")
-			Expect(tc).NotTo(BeNil())
-
-			removeCfg("tok-test")
-			_, err := tc("anything")
-			Expect(err).To(HaveOccurred())
-			Expect(err.Error()).To(ContainSubstring("no longer available"))
-		})
-	})
-})
--- a/core/application/startup.go
+++ b/core/application/startup.go
@@ -462,7 +462,11 @@ func New(opts ...config.AppOption) (*Application, error) {
 	// traffic doesn't need a parallel config for MITM traffic.
 	// Runs after loadRuntimeSettingsFromFile so a listener configured
 	// via /api/settings is brought back up across restarts.
-	startMITMIfConfigured(application, options)
+	if options.MITMListen != "" {
+		if err := startMITMProxy(application, options); err != nil {
+			return nil, fmt.Errorf("mitm: startup: %w", err)
+		}
+	}

 	application.ModelLoader().SetBackendLoggingEnabled(options.EnableBackendLogging)

--- a/core/backend/embeddings.go
+++ b/core/backend/embeddings.go
@@ -100,13 +100,8 @@ func ModelEmbedding(ctx context.Context, s string, tokens []int, loader *model.M
 		trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems, appConfig.TracingMaxBodyBytes)

 		traceData := map[string]any{
-			"input_text": trace.TruncateString(s, 1000),
-		}
-		// Only present for token-mode callers (pre-tokenized override);
-		// emitting "0" alongside input_text would read as "consumed zero
-		// tokens", which is wrong.
-		if len(tokens) > 0 {
-			traceData["input_tokens_count"] = len(tokens)
+			"input_text":         trace.TruncateString(s, 1000),
+			"input_tokens_count": len(tokens),
 		}

 		startTime := time.Now()
--- a/core/backend/options.go
+++ b/core/backend/options.go
@@ -87,47 +87,11 @@ func getSeed(c config.ModelConfig) int32 {
 	return seed
 }

-// DefaultContextSize and DefaultBatchSize are the backend's fallbacks when a
-// model config leaves them unset. Exported so callers that must respect the
-// effective decode window — notably the router's prompt trimmer — resolve the
-// same numbers grpcModelOpts does instead of guessing.
-const (
-	DefaultContextSize = 4096
-	DefaultBatchSize   = 512
-)
-
-// EffectiveContextSize is the context window the backend will run with: the
-// configured value, or DefaultContextSize when unset.
-func EffectiveContextSize(c config.ModelConfig) int {
-	if c.ContextSize != nil {
-		return *c.ContextSize
-	}
-	return DefaultContextSize
-}
-
-// EffectiveBatchSize is the single-decode batch the backend will run with.
-// Score, embedding and rerank all process the whole input in one pass: score
-// decodes prompt+candidate (asserts n_tokens <= n_batch), and embedding/rerank
-// pool over the full sequence in one physical batch (n_ubatch). So the batch
-// is sized to the context — anything that fits the context fits one pass,
-// avoiding both the GGML_ASSERT crash and the "input is too large to process"
-// error. Explicit `batch:` always wins.
-func EffectiveBatchSize(c config.ModelConfig) int {
-	if c.Batch != 0 {
-		return c.Batch
-	}
-	singlePass := c.HasUsecases(config.FLAG_SCORE) ||
-		c.HasUsecases(config.FLAG_EMBEDDINGS) ||
-		c.HasUsecases(config.FLAG_RERANK)
-	if ctx := EffectiveContextSize(c); singlePass && ctx > DefaultBatchSize {
-		return ctx
-	}
-	return DefaultBatchSize
-}
-
 func grpcModelOpts(c config.ModelConfig, modelPath string) *pb.ModelOptions {
-	ctxSize := EffectiveContextSize(c)
-	b := EffectiveBatchSize(c)
+	b := 512
+	if c.Batch != 0 {
+		b = c.Batch
+	}

 	flashAttention := "auto"

@@ -170,6 +134,11 @@ func grpcModelOpts(c config.ModelConfig, modelPath string) *pb.ModelOptions {
 		}
 	}

+	ctxSize := 4096
+	if c.ContextSize != nil {
+		ctxSize = *c.ContextSize
+	}
+
 	mmlock := false
 	if c.MMlock != nil {
 		mmlock = *c.MMlock
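
Review note: with `EffectiveContextSize` and `EffectiveBatchSize` removed, `grpcModelOpts` resolves the two sizes inline and independently of the model's usecases. A minimal sketch of the resulting behaviour follows; `resolveSizes` is a hypothetical helper introduced only for illustration and is not part of the codebase.

```go
package main

// resolveSizes is a hypothetical stand-in for the two inlined blocks in
// grpcModelOpts after this change: the batch falls back to 512 and the
// context size to 4096, each resolved on its own.
func resolveSizes(batch int, contextSize *int) (nBatch, ctxSize int) {
	nBatch = 512
	if batch != 0 {
		nBatch = batch // an explicit `batch:` always wins
	}
	ctxSize = 4096
	if contextSize != nil {
		ctxSize = *contextSize
	}
	return nBatch, ctxSize
}
```

Under this sketch, a model with an 8192 context and no explicit batch gets a 512 batch against an 8192 window. Per the removed comments, score, embedding and rerank models process the whole input in one pass, so they may need an explicit `batch:` once the context-sized default is gone.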
--- a/core/backend/options_internal_test.go
+++ b/core/backend/options_internal_test.go
@@ -97,67 +97,3 @@ var _ = Describe("gRPCPredictOpts reasoning_effort metadata", func() {
 		Expect(opts.Metadata).ToNot(HaveKey("reasoning_effort"))
 	})
 })
-
-var _ = Describe("grpcModelOpts NBatch", func() {
-	scoreUsecase := config.FLAG_SCORE
-	threads := 1
-	ctx := 4096
-
-	It("defaults to 512 for an ordinary model", func() {
-		cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}}
-		opts := grpcModelOpts(cfg, "/tmp/models")
-		Expect(opts.NBatch).To(BeEquivalentTo(512))
-	})
-
-	It("sizes the batch to the context window for score models", func() {
-		// Score models decode the whole prompt+candidate in one
-		// llama_decode; n_batch must cover it or the backend aborts.
-		cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}, KnownUsecases: &scoreUsecase}
-		opts := grpcModelOpts(cfg, "/tmp/models")
-		Expect(opts.NBatch).To(BeEquivalentTo(4096))
-	})
-
-	It("keeps an explicit batch over the score default", func() {
-		cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}, KnownUsecases: &scoreUsecase}
-		cfg.Batch = 1024
-		opts := grpcModelOpts(cfg, "/tmp/models")
-		Expect(opts.NBatch).To(BeEquivalentTo(1024))
-	})
-
-	It("sizes the batch to the context window for embedding models", func() {
-		// Embedding/rerank pool over the whole sequence in one physical batch
-		// (n_ubatch); without this the input is capped at the 512 default and
-		// the backend returns "input is too large to process".
-		embeddings := true
-		cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}}
-		cfg.Embeddings = &embeddings
-		opts := grpcModelOpts(cfg, "/tmp/models")
-		Expect(opts.NBatch).To(BeEquivalentTo(4096))
-	})
-
-	It("sizes the batch to the context window for rerank models", func() {
-		reranking := true
-		cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}}
-		cfg.Reranking = &reranking
-		opts := grpcModelOpts(cfg, "/tmp/models")
-		Expect(opts.NBatch).To(BeEquivalentTo(4096))
-	})
-
-	It("does not raise the batch when a score model's context is below the default", func() {
-		small := 256
-		cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &small}, KnownUsecases: &scoreUsecase}
-		opts := grpcModelOpts(cfg, "/tmp/models")
-		Expect(opts.NBatch).To(BeEquivalentTo(512))
-	})
-
-	It("sizes the batch to the effective 4096 default for a score model with no explicit context_size", func() {
-		// The crash case: the backend defaults n_ctx to 4096, so n_batch must
-		// follow even when context_size is unset — otherwise n_batch stays 512
-		// against a 4096 window and the score decode hits the GGML_ASSERT.
-		cfg := config.ModelConfig{Threads: &threads, KnownUsecases: &scoreUsecase}
-		Expect(cfg.ContextSize).To(BeNil())
-		opts := grpcModelOpts(cfg, "/tmp/models")
-		Expect(opts.NBatch).To(BeEquivalentTo(4096))
-		Expect(opts.ContextSize).To(BeEquivalentTo(4096), "n_batch must match the effective n_ctx the backend receives")
-	})
-})
--- a/core/backend/stores.go
+++ b/core/backend/stores.go
@@ -3,10 +3,9 @@ package backend
 import (
 	"context"
 	"fmt"
-	"time"
+	"strings"

 	"github.com/mudler/LocalAI/core/config"
-	"github.com/mudler/LocalAI/core/trace"

 	"github.com/mudler/LocalAI/pkg/grpc"
 	"github.com/mudler/LocalAI/pkg/model"
@@ -40,85 +39,34 @@ func (s *localVectorStore) backend(_ context.Context) (grpc.Backend, error) {
 	return StoreBackend(s.loader, s.appConfig, s.storeName, "")
 }

-func (s *localVectorStore) Search(ctx context.Context, vec []float32) (sim float64, payload []byte, ok bool, err error) {
-	start := time.Now()
-	outcome := "hit"
-	defer func() {
-		s.recordTrace(start, "search", len(vec), sim, outcome, err)
-	}()
-	be, berr := s.backend(ctx)
-	if berr != nil {
-		outcome = "backend_load_error"
-		return 0, nil, false, fmt.Errorf("vector store load: %w", berr)
+func (s *localVectorStore) Search(ctx context.Context, vec []float32) (float64, []byte, bool, error) {
+	be, err := s.backend(ctx)
+	if err != nil {
+		return 0, nil, false, fmt.Errorf("vector store load: %w", err)
 	}
-	_, values, similarities, ferr := store.Find(ctx, be, vec, 1)
-	if ferr != nil {
-		outcome = "find_error"
-		return 0, nil, false, fmt.Errorf("vector store find: %w", ferr)
+	_, values, similarities, err := store.Find(ctx, be, vec, 1)
+	if err != nil {
+		// local-store's Find returns "existing length is -1" before
+		// any keys are inserted. Surface that as a clean miss so the
+		// cache layer treats it as an empty store and proceeds to
+		// Insert rather than skipping.
+		if strings.Contains(err.Error(), "existing length is -1") {
+			return 0, nil, false, nil
+		}
+		return 0, nil, false, fmt.Errorf("vector store find: %w", err)
 	}
 	if len(values) == 0 || len(similarities) == 0 {
-		outcome = "miss"
 		return 0, nil, false, nil
 	}
 	return float64(similarities[0]), values[0], true, nil
 }

-func (s *localVectorStore) Insert(ctx context.Context, vec []float32, payload []byte) (err error) {
-	start := time.Now()
-	outcome := "ok"
-	defer func() {
-		s.recordTrace(start, "insert", len(vec), 0, outcome, err)
-	}()
-	be, berr := s.backend(ctx)
-	if berr != nil {
-		outcome = "backend_load_error"
-		return fmt.Errorf("vector store load: %w", berr)
-	}
-	if serr := store.SetSingle(ctx, be, vec, payload); serr != nil {
-		outcome = "insert_error"
-		return serr
-	}
-	return nil
-}
-
-// recordTrace surfaces vector-store calls in /api/backend-traces, including
-// the backend-load-failure path that otherwise vanishes into an xlog.Warn.
-// modelName uses the store namespace (e.g. "router-cache-smart-router") so
-// admins can tell which router's cache misbehaved; the backend is always
-// "local-store" and can't disambiguate.
-func (s *localVectorStore) recordTrace(start time.Time, op string, vecDim int, sim float64, outcome string, err error) {
-	if s.appConfig == nil || !s.appConfig.EnableTracing {
-		return
-	}
-	trace.InitBackendTracingIfEnabled(s.appConfig.TracingMaxItems, s.appConfig.TracingMaxBodyBytes)
-	errStr := ""
+func (s *localVectorStore) Insert(ctx context.Context, vec []float32, payload []byte) error {
+	be, err := s.backend(ctx)
 	if err != nil {
-		errStr = err.Error()
+		return fmt.Errorf("vector store load: %w", err)
 	}
-	summary := op + " " + outcome
-	if op == "search" && outcome == "hit" {
-		summary = fmt.Sprintf("search hit (sim=%.3f)", sim)
-	}
-	data := map[string]any{
-		"op":         op,
-		"outcome":    outcome,
-		"vector_dim": vecDim,
-	}
-	// Only include similarity for a real neighbor — miss/empty_store would
-	// otherwise render "similarity: 0" and read as a measured value.
-	if op == "search" && outcome == "hit" {
-		data["similarity"] = sim
-	}
-	trace.RecordBackendTrace(trace.BackendTrace{
-		Timestamp: start,
-		Duration:  time.Since(start),
-		Type:      trace.BackendTraceVectorStore,
-		ModelName: s.storeName,
-		Backend:   model.LocalStoreBackend,
-		Summary:   summary,
-		Error:     errStr,
-		Data:      data,
-	})
+	return store.SetSingle(ctx, be, vec, payload)
 }

 func StoreBackend(sl *model.ModelLoader, appConfig *config.ApplicationConfig, storeName string, backend string) (grpc.Backend, error) {
--- a/core/backend/stores_test.go
+++ b/core/backend/stores_test.go
@@ -1,88 +0,0 @@
-package backend
-
-import (
-	"context"
-
-	"github.com/mudler/LocalAI/core/config"
-	"github.com/mudler/LocalAI/core/trace"
-	"github.com/mudler/LocalAI/pkg/model"
-	"github.com/mudler/LocalAI/pkg/system"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-// findVectorStoreTrace returns the most recent vector_store trace whose
-// model_name matches storeName, or nil if none was recorded. Used by
-// the specs below to assert the trace landed without relying on
-// ring-buffer ordering across other tests in the suite.
-func findVectorStoreTrace(storeName string) *trace.BackendTrace {
-	traces := trace.GetBackendTraces()
-	for i := range traces {
-		bt := &traces[i]
-		if bt.Type == trace.BackendTraceVectorStore && bt.ModelName == storeName {
-			return bt
-		}
-	}
-	return nil
-}
-
-var _ = Describe("localVectorStore tracing", func() {
-	// Pin the trace surface admins read from /api/backend-traces.
-	// The original failure mode that motivated these specs — the
-	// local-store backend not installed — was silent on every surface
-	// except a per-call xlog.Warn. With tracing wired in, the row
-	// appears next to the embedder/score traces for the same request.
-	BeforeEach(func() {
-		trace.ClearBackendTraces()
-	})
-
-	It("records a vector_store trace with outcome=backend_load_error when the backend can't be loaded", func() {
-		// nil ModelLoader → s.backend → StoreBackend → panics on load.
-		// Use a real-but-empty loader so the failure surfaces as an
-		// error instead, exercising the load-failure trace path the
-		// admin would hit when local-store isn't installed.
-		appCfg := &config.ApplicationConfig{
-			EnableTracing:       true,
-			TracingMaxItems:     16,
-			TracingMaxBodyBytes: 1024,
-		}
-		s := &localVectorStore{
-			loader:    model.NewModelLoader(&system.SystemState{}),
-			appConfig: appCfg,
-			storeName: "router-cache-test",
-		}
-
-		// Search must surface the error AND record a trace describing it.
-		_, _, _, err := s.Search(context.Background(), []float32{0.1, 0.2, 0.3})
-		Expect(err).To(HaveOccurred())
-
-		Eventually(func() *trace.BackendTrace {
-			return findVectorStoreTrace("router-cache-test")
-		}).ShouldNot(BeNil())
-
-		bt := findVectorStoreTrace("router-cache-test")
-		Expect(bt.Backend).To(Equal(model.LocalStoreBackend))
-		Expect(bt.Data["op"]).To(Equal("search"))
-		Expect(bt.Data["outcome"]).To(Equal("backend_load_error"))
-		Expect(bt.Data["vector_dim"]).To(Equal(3))
-		// Error is the wrapped "vector store load: …" surfaced to the caller.
-		Expect(bt.Error).To(ContainSubstring("vector store load"))
-	})
-
-	It("does not record a trace when tracing is disabled", func() {
-		// Opt-out path: appConfig.EnableTracing=false must short-circuit
-		// before InitBackendTracingIfEnabled, so a workload with tracing
-		// turned off doesn't pay the channel-send cost per cache call.
-		appCfg := &config.ApplicationConfig{EnableTracing: false}
-		s := &localVectorStore{
-			loader:    model.NewModelLoader(&system.SystemState{}),
-			appConfig: appCfg,
-			storeName: "router-cache-disabled",
-		}
-		_, _, _, _ = s.Search(context.Background(), []float32{1})
-		Consistently(func() *trace.BackendTrace {
-			return findVectorStoreTrace("router-cache-disabled")
-		}).Should(BeNil())
-	})
-})
--- a/core/backend/tokenize.go
+++ b/core/backend/tokenize.go
@@ -7,23 +7,9 @@ import (
 	"github.com/mudler/LocalAI/core/schema"
 	"github.com/mudler/LocalAI/core/trace"
 	"github.com/mudler/LocalAI/pkg/grpc"
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
 	"github.com/mudler/LocalAI/pkg/model"
 )

-// tokenizeTokenCount returns the number of tokens in a backend response,
-// treating a nil response as zero. The gRPC client returns (nil, err) on
-// failure, and the tracing block below runs before that error is returned —
-// so the count must be read nil-safely here. Reading resp.Tokens on a nil
-// resp previously panicked the whole HTTP handler when tracing was enabled
-// (e.g. a transient tokenize failure during router probe-budget sizing).
-func tokenizeTokenCount(resp *pb.TokenizationResponse) int {
-	if resp == nil {
-		return 0
-	}
-	return len(resp.Tokens)
-}
-
 func ModelTokenize(s string, loader *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (schema.TokenizeResponse, error) {

 	var inferenceModel grpc.Backend
@@ -54,7 +40,10 @@ func ModelTokenize(s string, loader *model.ModelLoader, modelConfig config.Model
 			errStr = err.Error()
 		}

-		tokenCount := tokenizeTokenCount(resp)
+		tokenCount := 0
+		if resp != nil {
+			tokenCount = len(resp.Tokens)
+		}

 		trace.RecordBackendTrace(trace.BackendTrace{
 			Timestamp: startTime,
@@ -75,8 +64,8 @@ func ModelTokenize(s string, loader *model.ModelLoader, modelConfig config.Model
 		return schema.TokenizeResponse{}, err
 	}

-	if resp == nil || resp.Tokens == nil {
-		return schema.TokenizeResponse{Tokens: make([]int32, 0)}, nil
+	if resp.Tokens == nil {
+		resp.Tokens = make([]int32, 0)
 	}

 	return schema.TokenizeResponse{
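
Review note: the comment removed in this file states that the gRPC client returns `(nil, err)` on failure and that the tracing block runs before the error is returned, so the token count has to be read nil-safely. A minimal sketch of such a read follows; `tokenizationResponse` is a hypothetical stand-in for `pb.TokenizationResponse`, and `tokenCount` is an illustrative name.

```go
package main

// tokenizationResponse is a hypothetical stand-in for the protobuf type;
// only the field read by the tracing block is modelled.
type tokenizationResponse struct {
	Tokens []int32
}

// tokenCount returns the number of tokens on resp, treating a nil response
// as zero instead of dereferencing it.
func tokenCount(resp *tokenizationResponse) int {
	if resp == nil {
		return 0
	}
	// len of a nil slice is 0, so no separate Tokens == nil check is needed.
	return len(resp.Tokens)
}
```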
--- a/core/backend/tokenize_test.go
+++ b/core/backend/tokenize_test.go
@@ -1,27 +0,0 @@
-package backend
-
-import (
-	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-var _ = Describe("tokenizeTokenCount", func() {
-	// Regression: the gRPC client returns (nil, err) when a tokenize call
-	// fails, and ModelTokenize's tracing block reads the token count before
-	// the error is returned. Dereferencing a nil response there panicked the
-	// HTTP handler (nil pointer dereference) — e.g. a transient tokenize
-	// failure while the router sized its probe-token budget.
-	It("returns zero for a nil response instead of panicking", func() {
-		Expect(tokenizeTokenCount(nil)).To(Equal(0))
-	})
-
-	It("returns zero when the response carries no tokens", func() {
-		Expect(tokenizeTokenCount(&pb.TokenizationResponse{})).To(Equal(0))
-	})
-
-	It("counts the tokens present on the response", func() {
-		Expect(tokenizeTokenCount(&pb.TokenizationResponse{Tokens: []int32{1, 2, 3}})).To(Equal(3))
-	})
-})
--- a/core/config/application_config.go
+++ b/core/config/application_config.go
@@ -65,7 +65,7 @@ type ApplicationConfig struct {
 	//
 	//   patterns:
 	//     - id: email
-	//       action: allow            # downgrade default mask -> allow (log only)
+	//       action: route_local      # downgrade default mask -> route_local
 	//     - id: ssn
 	//       action: block            # upgrade default mask -> block
 	//
@@ -488,16 +488,6 @@ func (o *ApplicationConfig) GetEffectiveMaxActiveBackends() int {
 	return 0
 }

-// WatchdogShouldRun reports whether the live watchdog process should be
-// running for the current config. It mirrors the gating in
-// (*Application).startWatchdog so the /api/settings start/stop decision and
-// the startup path agree on a single source of truth: the watchdog runs when
-// idle/busy checks are enabled (WatchDog), when LRU eviction is active
-// (effective max active backends > 0), or when the memory reclaimer is on.
-func (o *ApplicationConfig) WatchdogShouldRun() bool {
-	return o.WatchDog || o.GetEffectiveMaxActiveBackends() > 0 || o.MemoryReclaimerEnabled
-}
-
 // WithForceEvictionWhenBusy sets whether to force eviction even when models have active API calls
 func WithForceEvictionWhenBusy(enabled bool) AppOption {
 	return func(o *ApplicationConfig) {
@@ -1208,22 +1198,18 @@ func (o *ApplicationConfig) ApplyRuntimeSettings(settings *RuntimeSettings) (req
 	}
 	if settings.WatchdogIdleEnabled != nil {
 		o.WatchDogIdle = *settings.WatchdogIdleEnabled
+		if o.WatchDogIdle {
+			o.WatchDog = true
+		}
 		requireRestart = true
 	}
 	if settings.WatchdogBusyEnabled != nil {
 		o.WatchDogBusy = *settings.WatchdogBusyEnabled
+		if o.WatchDogBusy {
+			o.WatchDog = true
+		}
 		requireRestart = true
 	}
-	// The React Settings "Enable Watchdog" master toggle manages only the
-	// idle/busy checks — watchdog_enabled is vestigial in that UI. Whenever
-	// either idle/busy field is present in the body, derive the run-state from
-	// idle||busy so a cold enable starts the watchdog and a full disable stops
-	// it, instead of trusting the stale watchdog_enabled the UI never updates.
-	// This mirrors the startup invariant in startup.go. An API client posting
-	// only watchdog_enabled (idle/busy absent) keeps its explicit value.
-	if settings.WatchdogIdleEnabled != nil || settings.WatchdogBusyEnabled != nil {
-		o.WatchDog = o.WatchDogIdle || o.WatchDogBusy
-	}
 	if settings.WatchdogIdleTimeout != nil {
 		if dur, err := time.ParseDuration(*settings.WatchdogIdleTimeout); err == nil {
 			o.WatchDogIdleTimeout = dur
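
Review note: after this change the idle and busy settings can only turn `WatchDog` on, never off. A minimal sketch of just those added lines follows; `watchdogState` is a hypothetical subset of `ApplicationConfig`, and `applyIdleBusy` is an illustrative name that leaves out the rest of `ApplyRuntimeSettings`.

```go
package main

// watchdogState is a hypothetical subset of ApplicationConfig holding only
// the three flags touched by the idle/busy handling.
type watchdogState struct {
	WatchDog     bool
	WatchDogIdle bool
	WatchDogBusy bool
}

// applyIdleBusy mirrors the added lines: enabling either check also enables
// WatchDog, while disabling a check leaves WatchDog untouched.
func (s *watchdogState) applyIdleBusy(idle, busy *bool) {
	if idle != nil {
		s.WatchDogIdle = *idle
		if s.WatchDogIdle {
			s.WatchDog = true
		}
	}
	if busy != nil {
		s.WatchDogBusy = *busy
		if s.WatchDogBusy {
			s.WatchDog = true
		}
	}
}
```

In this sketch, turning both checks off on a running configuration leaves `WatchDog` set to true, which is the disable case the removed specs covered.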
--- a/core/config/application_config_test.go
+++ b/core/config/application_config_test.go
@@ -223,69 +223,6 @@ var _ = Describe("ApplicationConfig RuntimeSettings Conversion", func() {
 			Expect(appConfig.WatchDogBusy).To(BeTrue())
 		})

-		// Residual #9125: the React Settings "Enable Watchdog" master toggle
-		// manages only watchdog_idle_enabled / watchdog_busy_enabled — it never
-		// touches the vestigial watchdog_enabled field. On a cold enable the
-		// body therefore carries watchdog_enabled=false alongside idle/busy=true.
-		// The derived run-state (WatchDog) must follow idle||busy so the live
-		// watchdog actually starts, not the stale watchdog_enabled=false.
-		It("should derive WatchDog from idle||busy on a cold enable even when watchdog_enabled=false", func() {
-			appConfig := &ApplicationConfig{WatchDog: false}
-
-			watchdogEnabled := false
-			watchdogIdle := true
-			watchdogBusy := true
-			rs := &RuntimeSettings{
-				WatchdogEnabled:     &watchdogEnabled,
-				WatchdogIdleEnabled: &watchdogIdle,
-				WatchdogBusyEnabled: &watchdogBusy,
-			}
-
-			appConfig.ApplyRuntimeSettings(rs)
-
-			Expect(appConfig.WatchDog).To(BeTrue())
-			Expect(appConfig.WatchdogShouldRun()).To(BeTrue())
-		})
-
-		// The disable direction: the master toggle off sends idle=false,
-		// busy=false, but watchdog_enabled may still be the stale true loaded
-		// before the change. WatchDog must follow idle||busy down to false so
-		// the live watchdog is stopped (it stays stopped unless LRU / memory
-		// reclaimer keep it alive, which is gated by WatchdogShouldRun).
-		It("should disable WatchDog when both idle and busy are turned off", func() {
-			appConfig := &ApplicationConfig{WatchDog: true, WatchDogIdle: true, WatchDogBusy: true}
-
-			watchdogEnabled := true
-			watchdogIdle := false
-			watchdogBusy := false
-			rs := &RuntimeSettings{
-				WatchdogEnabled:     &watchdogEnabled,
-				WatchdogIdleEnabled: &watchdogIdle,
-				WatchdogBusyEnabled: &watchdogBusy,
-			}
-
-			appConfig.ApplyRuntimeSettings(rs)
-
-			Expect(appConfig.WatchDog).To(BeFalse())
-			Expect(appConfig.WatchdogShouldRun()).To(BeFalse())
-		})
-
-		// Backward compatibility: an API client that posts only watchdog_enabled
-		// (idle/busy nil) keeps the explicit value — the idle/busy derivation
-		// only kicks in when those fields are actually present in the body.
-		It("should preserve explicit watchdog_enabled when idle/busy are absent", func() {
-			appConfig := &ApplicationConfig{WatchDog: false}
-
-			watchdogEnabled := true
-			rs := &RuntimeSettings{
-				WatchdogEnabled: &watchdogEnabled,
-			}
-
-			appConfig.ApplyRuntimeSettings(rs)
-
-			Expect(appConfig.WatchDog).To(BeTrue())
-		})
-
 		It("should handle MaxActiveBackends and update SingleBackend accordingly", func() {
 			appConfig := &ApplicationConfig{}

--- a/core/config/meta/build.go
+++ b/core/config/meta/build.go
@@ -93,9 +93,6 @@ func applyOverride(f *FieldMeta, o FieldMetaOverride) {
 	if o.Component != "" {
 		f.Component = o.Component
 	}
-	if o.Language != "" {
-		f.Language = o.Language
-	}
 	if o.Placeholder != "" {
 		f.Placeholder = o.Placeholder
 	}
--- a/core/config/meta/constants.go
+++ b/core/config/meta/constants.go
@@ -8,7 +8,6 @@ const (
 	ProviderModelsTTS        = "models:tts"
 	ProviderModelsTranscript = "models:transcript"
 	ProviderModelsVAD        = "models:vad"
-	ProviderModelsScore      = "models:score"
 )

 // Static option lists embedded directly in field metadata.
--- a/core/config/meta/registry.go
+++ b/core/config/meta/registry.go
@@ -226,7 +226,6 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			Label:       "Chat Template",
 			Description: "Go template for chat completion requests",
 			Component:   "code-editor",
-			Language:    "gotemplate",
 			Order:       40,
 		},
 		"template.chat_message": {
@@ -234,7 +233,6 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			Label:       "Chat Message Template",
 			Description: "Go template for individual chat messages",
 			Component:   "code-editor",
-			Language:    "gotemplate",
 			Order:       41,
 		},
 		"template.completion": {
@@ -242,22 +240,13 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			Label:       "Completion Template",
 			Description: "Go template for completion requests",
 			Component:   "code-editor",
-			Language:    "gotemplate",
 			Order:       42,
 		},
-		"template.function": {
-			Section:     "templates",
-			Label:       "Functions Template",
-			Description: "Go template applied when tools/functions are present in the request",
-			Component:   "code-editor",
-			Language:    "gotemplate",
-			Order:       43,
-		},
 		"template.use_tokenizer_template": {
 			Section:     "templates",
 			Label:       "Use Tokenizer Template",
 			Description: "Use the chat template from the model's tokenizer config",
-			Order:       44,
+			Order:       43,
 		},
 		// Router section template — kept in the templates UI section
 		// (rather than the router section under "other") so operators
@@ -268,8 +257,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			Label:       "Router Classifier System Prompt",
 			Description: "Go text/template (with sprig functions) for the routing system prompt the score classifier feeds to its classifier_model. Executed with `.Policies` ([]{Label, Description}). Empty falls back to the built-in Arch-Router-shaped prompt (route-listing block + JSON output schema). Override when the classifier model was trained on a different schema or you need the routing instructions in a different language. The candidate format scored against the model is fixed at `{\"route\": \"<label>\"}` — keep your override's output schema instruction matching that.",
 			Component:   "code-editor",
-			Language:    "gotemplate",
-			Order:       45,
+			Order:       44,
 		},

 		// --- Pipeline ---
@@ -412,14 +400,14 @@ func DefaultRegistry() map[string]FieldMetaOverride {

 		// --- PII filtering (per-model) ---
 		"pii.enabled": {
-			Section:     "pii",
+			Section:     "other",
 			Label:       "PII Filtering Enabled",
 			Description: "Enable PII redaction middleware for this model. Unset means use the default (off for local backends, on for proxy-* / cloud-hosted backends).",
 			Component:   "toggle",
 			Order:       200,
 		},
 		"pii.patterns": {
-			Section:     "pii",
+			Section:     "other",
 			Label:       "PII Pattern Overrides",
 			Description: "Override the global default action for specific patterns on this model. Patterns not listed here inherit the global action (Settings → Middleware → Filtering).",
 			Component:   "pii-pattern-list",
@@ -432,7 +420,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 		// fails closed — the chat handler does NOT silently fall back
 		// to the local gRPC pipeline.
 		"proxy.mode": {
-			Section:     "proxy",
+			Section:     "other",
 			Label:       "Proxy Mode",
 			Description: "passthrough forwards the client's OpenAI body verbatim — point upstream_url at an OpenAI-compatible endpoint (incl. Anthropic's /v1/chat/completions compat layer). translate converts OpenAI ↔ Anthropic Messages so you can target a native API (/v1/messages); tool_calls and usage tokens survive the round-trip.",
 			Component:   "select",
@@ -444,7 +432,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			Order:   208,
 		},
 		"proxy.provider": {
-			Section:     "proxy",
+			Section:     "other",
 			Label:       "Proxy Provider",
 			Description: "Upstream API family. Drives auth header shape (Bearer vs x-api-key + anthropic-version) and, in translate mode, which request/response codec is used.",
 			Component:   "select",
@@ -456,28 +444,28 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			Order:   209,
 		},
 		"proxy.upstream_url": {
-			Section:     "proxy",
+			Section:     "other",
 			Label:       "Proxy Upstream URL",
 			Description: "Full POST endpoint of the upstream provider (e.g. https://api.openai.com/v1/chat/completions). Only used when Backend is cloud-proxy.",
 			Component:   "input",
 			Order:       210,
 		},
 		"proxy.api_key_env": {
-			Section:     "proxy",
+			Section:     "other",
 			Label:       "Proxy API Key Env Var",
 			Description: "Name of the environment variable holding the upstream API key. Reading from env keeps the secret out of the YAML and the admin UI.",
 			Component:   "input",
 			Order:       211,
 		},
 		"proxy.upstream_model": {
-			Section:     "proxy",
+			Section:     "other",
 			Label:       "Proxy Upstream Model",
 			Description: "Model name sent to the upstream. Leave empty to forward the client's model field unchanged. Useful when the LocalAI alias differs from the upstream's canonical name.",
 			Component:   "input",
 			Order:       212,
 		},
 		"proxy.request_timeout_seconds": {
-			Section:     "proxy",
+			Section:     "other",
 			Label:       "Proxy Request Timeout (seconds)",
 			Description: "Caps the upstream HTTP request duration. 0 disables the deadline; the request still ends when the client disconnects.",
 			Component:   "number",
@@ -492,7 +480,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 		// A host claimed by two configs is a critical error — the
 		// listener refuses to start until resolved.
 		"mitm.hosts": {
-			Section:     "mitm",
+			Section:     "other",
 			Label:       "MITM Intercept Hosts",
 			Description: "Hostnames the cloudproxy MITM proxy terminates TLS for on behalf of this model config. PII filtering and pattern overrides flow from this model when the host is intercepted. Each host must be unique across all configs.",
 			Component:   "string-list",
@@ -507,7 +495,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 		// the middleware admin page surfaces every model with a router
 		// block.
 		"router.classifier": {
-			Section:     "router",
+			Section:     "other",
 			Label:       "Classifier",
 			Description: "Picks a candidate by scoring every policy label against the prompt. Only \"score\" is shipped today; it asks the classifier_model to rank each label and reads off the softmax. Empty defaults to \"score\".",
 			Component:   "select",
@@ -517,15 +505,15 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			Order: 230,
 		},
 		"router.classifier_model": {
-			Section:              "router",
+			Section:              "other",
 			Label:                "Classifier Model",
 			Description:          "Loaded LocalAI model the score classifier asks to rank each policy label as a continuation. Must support the Score gRPC primitive (today: llama-cpp, vLLM) and use the ChatML template. Arch-Router-1.5B Q4_K_M is the canonical choice; any small ChatML instruct model also works at a higher activation_threshold.",
 			Component:            "model-select",
-			AutocompleteProvider: ProviderModelsScore,
+			AutocompleteProvider: ProviderModelsChat,
 			Order:                231,
 		},
 		"router.fallback": {
-			Section:              "router",
+			Section:              "other",
 			Label:                "Fallback Model",
 			Description:          "Model used when no candidate's labels cover the classifier's active label set, or when the classifier errors. Empty means router failures bubble up as HTTP 500 — fail-fast, not silent-bypass.",
 			Component:            "model-select",
@@ -533,7 +521,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			Order:                232,
 		},
 		"router.activation_threshold": {
-			Section:     "router",
+			Section:     "other",
 			Label:       "Activation Threshold",
 			Description: "Softmax-probability floor a policy must clear to join the active label set for a request. Higher → single-label dominant routes; lower → more multi-label activations. 0 picks the package default (0.15). On Arch-Router-1.5B a value around 0.40 keeps the dominant label clean without losing genuine compound activations.",
 			Component:   "slider",
@@ -543,7 +531,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			Order:       233,
 		},
 		"router.classifier_cache_size": {
-			Section:     "router",
+			Section:     "other",
 			Label:       "Classifier L1 Cache Size",
 			Description: "Bounded LRU keyed on (case-folded, whitespace-trimmed) prompt — amortises the classifier round-trip across verbatim repeats common in agent loops. 0 here means \"use the default\" (1024); the cache cannot be disabled from YAML.",
 			Component:   "number",
@@ -551,21 +539,21 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			Order:       234,
 		},
 		"router.policies": {
-			Section:     "router",
+			Section:     "other",
 			Label:       "Policies",
 			Description: "Label vocabulary the classifier scores over. Each policy has a label and a short natural-language description fed verbatim to the classifier model. Short action-oriented sentences work best (\"writing or debugging code\"; \"small talk\").",
 			Component:   "router-policies",
 			Order:       235,
 		},
 		"router.candidates": {
-			Section:     "router",
+			Section:     "other",
 			Label:       "Candidates",
 			Description: "Routing table: each entry binds a downstream model to a set of policy labels it can serve. Order matters — the middleware picks the FIRST candidate whose labels are a superset of the active set, so list candidates smallest → largest.",
 			Component:   "router-candidates",
 			Order:       236,
 		},
 		"router.score_normalization": {
-			Section:     "router",
+			Section:     "other",
 			Label:       "Score Normalization",
 			Description: "How the score classifier collapses per-candidate joint log-probs into the softmax input. \"raw\" (default) feeds joint log-prob as-is — on-distribution for Arch-Router (the route the model would actually emit if decoded freely). \"mean\" divides by candidate token count — fairer to long labels but off-distribution for models trained to emit fixed-format outputs.",
 			Component:   "select",
@@ -577,7 +565,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			Order: 240,
 		},
 		"router.embedding_cache.embedding_model": {
-			Section:              "router",
+			Section:              "other",
 			Label:                "L2 Cache: Embedding Model",
 			Description:          "Embedding model used by the L2 decision cache. Embeds incoming probes and looks them up in the per-router local-store collection. Empty disables the cache entirely. nomic-embed-text-v1.5 is the recommended default.",
 			Component:            "model-select",
@@ -585,7 +573,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			Order:                237,
 		},
 		"router.embedding_cache.similarity_threshold": {
-			Section:     "router",
+			Section:     "other",
 			Label:       "L2 Cache: Similarity Threshold",
 			Description: "Cosine-similarity floor a cache candidate must clear to count as a hit. 0 picks the package default (0.80). Re-tune per embedding model — the histogram on the Routing tab shows where the cosine distribution actually sits.",
 			Component:   "slider",
@@ -595,7 +583,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			Order:       238,
 		},
 		"router.embedding_cache.confidence_threshold": {
-			Section:     "router",
+			Section:     "other",
 			Label:       "L2 Cache: Confidence Threshold",
 			Description: "Minimum top-label probability a classifier decision must have to be inserted into the cache. 0 picks the package default (0.60). Uncertain decisions are skipped so they can't poison future paraphrases.",
 			Component:   "slider",
@@ -605,7 +593,7 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			Order:       239,
 		},
 		"router.embedding_cache.store_name": {
-			Section:     "router",
+			Section:     "other",
 			Label:       "L2 Cache: Store Name",
 			Description: "Optional override for the local-store collection used by this router's cache. Empty defaults to \"router-cache-<router-model-name>\". Two routers sharing a store_name share their cache (rare).",
 			Component:   "input",
--- a/core/config/meta/registry_coverage_test.go
+++ b/core/config/meta/registry_coverage_test.go
@@ -240,6 +240,7 @@ var grandfatheredUnregistered = []string{
 	"swap_space",
 	"system_prompt",
 	"template.edit",
+	"template.function",
 	"template.join_chat_messages_by_character",
 	"template.multimodal",
 	"template.reply_prefix",
--- a/core/config/meta/types.go
+++ b/core/config/meta/types.go
@@ -11,7 +11,6 @@ type FieldMeta struct {
 	Label       string        `json:"label"`                 // human-readable label
 	Description string        `json:"description,omitempty"` // help text
 	Component   string        `json:"component"`             // "input", "number", "toggle", "select", "slider", etc.
-	Language    string        `json:"language,omitempty"`    // syntax mode for code-editor fields: "yaml" (default), "gotemplate"
 	Placeholder string        `json:"placeholder,omitempty"`
 	Default     any           `json:"default,omitempty"`
 	Min         *float64      `json:"min,omitempty"`
@@ -52,7 +51,6 @@ type FieldMetaOverride struct {
 	Label                string
 	Description          string
 	Component            string
-	Language             string
 	Placeholder          string
 	Default              any
 	Min                  *float64
@@ -80,10 +78,6 @@ func DefaultSections() []Section {
 		{ID: "grpc", Label: "gRPC", Icon: "server", Order: 65},
 		{ID: "agent", Label: "Agent", Icon: "bot", Order: 70},
 		{ID: "mcp", Label: "MCP", Icon: "plug", Order: 75},
-		{ID: "router", Label: "Router", Icon: "git-merge", Order: 78},
-		{ID: "proxy", Label: "Proxy", Icon: "cloud", Order: 80},
-		{ID: "mitm", Label: "MITM Proxy", Icon: "shield", Order: 82},
-		{ID: "pii", Label: "PII", Icon: "shield", Order: 84},
 		{ID: "other", Label: "Other", Icon: "more-horizontal", Order: 100},
 	}
 }
--- a/core/config/model_config.go
+++ b/core/config/model_config.go
@@ -385,7 +385,7 @@ type PIIConfig struct {
 	Enabled *bool `yaml:"enabled,omitempty" json:"enabled,omitempty"`

 	// Patterns lets a model upgrade or downgrade individual pattern
-	// actions (mask | block | allow) relative to the global
+	// actions (mask | block | route_local) relative to the global
 	// defaults loaded from --pii-config / DefaultPatterns. Pattern IDs
 	// not listed inherit the global action. The regex itself stays
 	// global — only the action is settable per-model.
@@ -1274,20 +1274,14 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
 	}

 	if (u & FLAG_CHAT) == FLAG_CHAT {
-		// A router model is a chat dispatcher: it carries no chat
-		// template of its own (those live on the candidates it routes
-		// to) and is invoked through the chat endpoint, so the router
-		// block stands in for chat capability.
-		if !c.HasRouter() {
-			if c.TemplateConfig.Chat == "" && c.TemplateConfig.ChatMessage == "" && !c.TemplateConfig.UseTokenizerTemplate {
-				return false
-			}
-			if slices.Contains(nonTextGenBackends, c.Backend) {
-				return false
-			}
-			if c.Embeddings != nil && *c.Embeddings {
-				return false
-			}
+		if c.TemplateConfig.Chat == "" && c.TemplateConfig.ChatMessage == "" && !c.TemplateConfig.UseTokenizerTemplate {
+			return false
+		}
+		if slices.Contains(nonTextGenBackends, c.Backend) {
+			return false
+		}
+		if c.Embeddings != nil && *c.Embeddings {
+			return false
 		}
 	}
 	if (u & FLAG_COMPLETION) == FLAG_COMPLETION {
--- a/core/config/model_config_test.go
+++ b/core/config/model_config_test.go
@@ -283,18 +283,6 @@ parameters:
 		Expect(e.HasUsecases(FLAG_CHAT)).To(BeFalse())
 		Expect(e.HasUsecases(FLAG_EMBEDDINGS)).To(BeTrue())

-		// Router models are chat dispatchers: no chat template of their
-		// own, but invoked through the chat endpoint, so they default to
-		// chat-capable.
-		r := ModelConfig{
-			Name: "r",
-			Router: RouterConfig{
-				Candidates: []RouterCandidate{{Model: "downstream", Labels: []string{"general"}}},
-			},
-		}
-		Expect(r.HasUsecases(FLAG_ANY)).To(BeTrue())
-		Expect(r.HasUsecases(FLAG_CHAT)).To(BeTrue())
-
 		f := ModelConfig{
 			Name:    "f",
 			Backend: "piper",
--- a/core/gallery/backends_test.go
+++ b/core/gallery/backends_test.go
@@ -50,14 +50,7 @@ var _ = Describe("Runtime capability-based backend selection", func() {
 		must(os.WriteFile(filepath.Join(cudaDir, "metadata.json"), b, 0o644))
 		must(os.WriteFile(filepath.Join(cudaDir, "run.sh"), []byte(""), 0o755))

-		// Default system: alias should point to CPU. Force the capability to
-		// "cpu" so this is hermetic on hosts that actually have a GPU: backend
-		// preference keys off getSystemCapabilities() (env → real nvidia-smi
-		// detection), not GPUVendor, so without this a GPU dev box reports
-		// "nvidia" and the cuda alias wins. The NVIDIA case below overrides it.
-		must(os.Setenv("LOCALAI_FORCE_META_BACKEND_CAPABILITY", "cpu"))
-		defer func() { _ = os.Unsetenv("LOCALAI_FORCE_META_BACKEND_CAPABILITY") }()
-
+		// Default system: alias should point to CPU
 		sysDefault, err := system.GetSystemState(
 			system.WithBackendPath(tempDir),
 		)
--- a/core/gallery/importers/importers.go
+++ b/core/gallery/importers/importers.go
@@ -158,11 +158,6 @@ var defaultImporters = []Importer{
 	// RFDetrImporter must run before TransformersImporter — RF-DETR
 	// checkpoints may carry tokenizer-adjacent artefacts.
 	&RFDetrImporter{},
-	// LocateAnythingImporter (NVIDIA LocateAnything open-vocab detection,
-	// native C++/ggml port) must run before LlamaCPPImporter so its GGUF
-	// bundles aren't claimed by the generic .gguf importer; kept next to
-	// RFDetrImporter as both are detection models.
-	&LocateAnythingImporter{},
 	// Existing
 	// DS4Importer must precede LlamaCPPImporter - ds4 weights are GGUFs and
 	// would otherwise be claimed by the generic .gguf-handling llama-cpp
--- a/core/gallery/importers/locate-anything.go
+++ b/core/gallery/importers/locate-anything.go
@@ -1,137 +0,0 @@
-package importers
-
-import (
-	"encoding/json"
-	"path/filepath"
-	"strings"
-
-	"github.com/mudler/LocalAI/core/config"
-	"github.com/mudler/LocalAI/core/gallery"
-	"github.com/mudler/LocalAI/core/schema"
-	"go.yaml.in/yaml/v2"
-)
-
-var _ Importer = &LocateAnythingImporter{}
-
-// LocateAnythingImporter routes NVIDIA LocateAnything open-vocabulary
-// object-detection / visual-grounding repositories to the
-// "locate-anything-cpp" backend (a native C++/ggml port). It must be
-// registered BEFORE the generic GGUF matchers (LlamaCPPImporter) so its
-// GGUF bundles aren't swallowed by the generic .gguf-handling importer,
-// and alongside RFDetrImporter since both are detection models that may
-// carry tokenizer-adjacent artefacts.
-//
-// Detection signals:
-//   - preferences.backend="locate-anything-cpp" (explicit override);
-//   - repo name contains "locate-anything" or "locateanything"
-//     (case-insensitive).
-type LocateAnythingImporter struct{}
-
-func (i *LocateAnythingImporter) Name() string      { return "locate-anything-cpp" }
-func (i *LocateAnythingImporter) Modality() string  { return "detection" }
-func (i *LocateAnythingImporter) AutoDetects() bool { return true }
-
-func repoLooksLikeLocateAnything(repo string) bool {
-	lower := strings.ToLower(repo)
-	return strings.Contains(lower, "locate-anything") ||
-		strings.Contains(lower, "locateanything") ||
-		strings.Contains(lower, "locate-anything.cpp") ||
-		strings.Contains(lower, "locate-anything-cpp")
-}
-
-func (i *LocateAnythingImporter) Match(details Details) bool {
-	preferences, err := details.Preferences.MarshalJSON()
-	if err != nil {
-		return false
-	}
-	preferencesMap := make(map[string]any)
-	if len(preferences) > 0 {
-		if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
-			return false
-		}
-	}
-
-	if b, ok := preferencesMap["backend"].(string); ok && b == "locate-anything-cpp" {
-		return true
-	}
-
-	if details.HuggingFace != nil {
-		repoName := details.HuggingFace.ModelID
-		if idx := strings.Index(repoName, "/"); idx >= 0 {
-			repoName = repoName[idx+1:]
-		}
-		if repoLooksLikeLocateAnything(repoName) {
-			return true
-		}
-	}
-
-	// Fallback: hfapi recursion bug may leave HuggingFace nil — decide
-	// from the URI owner/repo.
-	if _, repo, ok := HFOwnerRepoFromURI(details.URI); ok {
-		if repoLooksLikeLocateAnything(repo) {
-			return true
-		}
-	}
-
-	return false
-}
-
-func (i *LocateAnythingImporter) Import(details Details) (gallery.ModelConfig, error) {
-	preferences, err := details.Preferences.MarshalJSON()
-	if err != nil {
-		return gallery.ModelConfig{}, err
-	}
-	preferencesMap := make(map[string]any)
-	if len(preferences) > 0 {
-		if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
-			return gallery.ModelConfig{}, err
-		}
-	}
-
-	name, ok := preferencesMap["name"].(string)
-	if !ok {
-		name = filepath.Base(details.URI)
-	}
-
-	description, ok := preferencesMap["description"].(string)
-	if !ok {
-		description = "Imported from " + details.URI
-	}
-
-	// Prefer the canonical HF "owner/repo" identifier so the emitted
-	// YAML mirrors gallery locate-anything entries.
-	model := details.URI
-	if details.HuggingFace != nil && details.HuggingFace.ModelID != "" {
-		model = details.HuggingFace.ModelID
-	} else if owner, repo, ok := HFOwnerRepoFromURI(details.URI); ok {
-		model = owner + "/" + repo
-	}
-
-	// Always the native C++/ggml backend; explicit preferences.backend
-	// overrides the default.
-	backend := "locate-anything-cpp"
-	if b, ok := preferencesMap["backend"].(string); ok && b != "" {
-		backend = b
-	}
-
-	modelConfig := config.ModelConfig{
-		Name:                name,
-		Description:         description,
-		Backend:             backend,
-		KnownUsecaseStrings: []string{"detection"},
-		PredictionOptions: schema.PredictionOptions{
-			BasicModelRequest: schema.BasicModelRequest{Model: model},
-		},
-	}
-
-	data, err := yaml.Marshal(modelConfig)
-	if err != nil {
-		return gallery.ModelConfig{}, err
-	}
-
-	return gallery.ModelConfig{
-		Name:        name,
-		Description: description,
-		ConfigFile:  string(data),
-	}, nil
-}
--- a/core/gallery/importers/locate-anything_test.go
+++ b/core/gallery/importers/locate-anything_test.go
@@ -1,218 +0,0 @@
-package importers_test
-
-import (
-	"encoding/json"
-	"fmt"
-
-	"github.com/mudler/LocalAI/core/gallery/importers"
-	hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-var _ = Describe("LocateAnythingImporter", func() {
-	Context("Importer interface metadata", func() {
-		It("exposes name/modality/autodetect", func() {
-			imp := &importers.LocateAnythingImporter{}
-			Expect(imp.Name()).To(Equal("locate-anything-cpp"))
-			Expect(imp.Modality()).To(Equal("detection"))
-			Expect(imp.AutoDetects()).To(BeTrue())
-		})
-	})
-
-	Context("Match", func() {
-		It("matches when backend preference is locate-anything-cpp", func() {
-			imp := &importers.LocateAnythingImporter{}
-			preferences := json.RawMessage(`{"backend": "locate-anything-cpp"}`)
-			details := importers.Details{
-				URI:         "https://example.com/some-model",
-				Preferences: preferences,
-			}
-
-			Expect(imp.Match(details)).To(BeTrue())
-		})
-
-		It("matches when the repo name contains 'locate-anything' (case-insensitive)", func() {
-			imp := &importers.LocateAnythingImporter{}
-			details := importers.Details{
-				URI: "https://huggingface.co/mudler/locate-anything-cpp-3b",
-				HuggingFace: &hfapi.ModelDetails{
-					ModelID: "mudler/Locate-Anything-CPP-3B",
-					Author:  "mudler",
-				},
-			}
-
-			Expect(imp.Match(details)).To(BeTrue())
-		})
-
-		It("matches when the repo name contains 'locateanything' (case-insensitive)", func() {
-			imp := &importers.LocateAnythingImporter{}
-			details := importers.Details{
-				URI: "https://huggingface.co/nvidia/LocateAnything-3B",
-				HuggingFace: &hfapi.ModelDetails{
-					ModelID: "nvidia/LocateAnything-3B",
-					Author:  "nvidia",
-				},
-			}
-
-			Expect(imp.Match(details)).To(BeTrue())
-		})
-
-		It("matches via URI fallback when HuggingFace details are missing", func() {
-			imp := &importers.LocateAnythingImporter{}
-			details := importers.Details{
-				URI: "https://huggingface.co/nvidia/LocateAnything-3B",
-			}
-
-			Expect(imp.Match(details)).To(BeTrue())
-		})
-
-		It("does not match unrelated repos without locate-anything signals", func() {
-			imp := &importers.LocateAnythingImporter{}
-			details := importers.Details{
-				URI: "https://huggingface.co/meta-llama/Llama-3-8B",
-				HuggingFace: &hfapi.ModelDetails{
-					ModelID: "meta-llama/Llama-3-8B",
-					Author:  "meta-llama",
-				},
-			}
-
-			Expect(imp.Match(details)).To(BeFalse())
-		})
-
-		It("does not match an rfdetr repo", func() {
-			imp := &importers.LocateAnythingImporter{}
-			details := importers.Details{
-				URI: "https://huggingface.co/mudler/rfdetr-cpp-nano",
-				HuggingFace: &hfapi.ModelDetails{
-					ModelID: "mudler/rfdetr-cpp-nano",
-					Author:  "mudler",
-				},
-			}
-
-			Expect(imp.Match(details)).To(BeFalse())
-		})
-
-		It("returns false for invalid preferences JSON", func() {
-			imp := &importers.LocateAnythingImporter{}
-			preferences := json.RawMessage(`not valid json`)
-			details := importers.Details{
-				URI:         "https://example.com/model",
-				Preferences: preferences,
-			}
-
-			Expect(imp.Match(details)).To(BeFalse())
-		})
-	})
-
-	Context("Import", func() {
-		It("produces a YAML with backend locate-anything-cpp and the repo as the model", func() {
-			imp := &importers.LocateAnythingImporter{}
-			details := importers.Details{
-				URI: "https://huggingface.co/nvidia/LocateAnything-3B",
-				HuggingFace: &hfapi.ModelDetails{
-					ModelID: "nvidia/LocateAnything-3B",
-					Author:  "nvidia",
-				},
-			}
-
-			modelConfig, err := imp.Import(details)
-
-			Expect(err).ToNot(HaveOccurred())
-			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: locate-anything-cpp"), fmt.Sprintf("Model config: %+v", modelConfig))
-			Expect(modelConfig.ConfigFile).To(ContainSubstring("nvidia/LocateAnything-3B"), fmt.Sprintf("Model config: %+v", modelConfig))
-			Expect(modelConfig.ConfigFile).To(ContainSubstring("detection"), fmt.Sprintf("Model config: %+v", modelConfig))
-		})
-
-		It("respects custom name and description from preferences", func() {
-			imp := &importers.LocateAnythingImporter{}
-			preferences := json.RawMessage(`{"name": "my-locate", "description": "Custom"}`)
-			details := importers.Details{
-				URI:         "https://huggingface.co/nvidia/LocateAnything-3B",
-				Preferences: preferences,
-				HuggingFace: &hfapi.ModelDetails{
-					ModelID: "nvidia/LocateAnything-3B",
-					Author:  "nvidia",
-				},
-			}
-
-			modelConfig, err := imp.Import(details)
-
-			Expect(err).ToNot(HaveOccurred())
-			Expect(modelConfig.Name).To(Equal("my-locate"))
-			Expect(modelConfig.Description).To(Equal("Custom"))
-		})
-	})
-
-	// Table-driven coverage of the backend routing: locate-anything repos
-	// always route to the native locate-anything-cpp backend, with an
-	// explicit preferences.backend override honoured.
-	//
-	// Cases are kept offline-deterministic by injecting Details directly
-	// rather than going through DiscoverModelConfig (which would hit live HF).
-	Context("backend routing (offline)", func() {
-		hfFile := func(path string) hfapi.ModelFile {
-			return hfapi.ModelFile{Path: path}
-		}
-
-		type tc struct {
-			name          string
-			uri           string
-			modelID       string
-			files         []hfapi.ModelFile
-			prefs         string
-			expectBackend string // expected `backend:` line content
-		}
-
-		entries := []tc{
-			{
-				name:          "canonical NVIDIA repo routes to locate-anything-cpp",
-				uri:           "https://huggingface.co/nvidia/LocateAnything-3B",
-				modelID:       "nvidia/LocateAnything-3B",
-				files:         []hfapi.ModelFile{hfFile("locate-anything-3b-q8_0.gguf"), hfFile("README.md")},
-				prefs:         "",
-				expectBackend: "backend: locate-anything-cpp",
-			},
-			{
-				name:          "GGUF bundle with locate-anything name routes to locate-anything-cpp",
-				uri:           "https://huggingface.co/mudler/locate-anything.cpp-3b",
-				modelID:       "mudler/locate-anything.cpp-3b",
-				files:         []hfapi.ModelFile{hfFile("model-f16.gguf")},
-				prefs:         "",
-				expectBackend: "backend: locate-anything-cpp",
-			},
-			{
-				name:          "explicit preferences.backend override is honoured",
-				uri:           "https://huggingface.co/nvidia/LocateAnything-3B",
-				modelID:       "nvidia/LocateAnything-3B",
-				files:         nil,
-				prefs:         `{"backend": "locate-anything-cpp"}`,
-				expectBackend: "backend: locate-anything-cpp",
-			},
-		}
-
-		for _, e := range entries {
-			e := e // capture for closure
-			It(e.name, func() {
-				imp := &importers.LocateAnythingImporter{}
-				details := importers.Details{
-					URI: e.uri,
-					HuggingFace: &hfapi.ModelDetails{
-						ModelID: e.modelID,
-						Files:   e.files,
-					},
-				}
-				if e.prefs != "" {
-					details.Preferences = json.RawMessage(e.prefs)
-				}
-
-				Expect(imp.Match(details)).To(BeTrue(), fmt.Sprintf("Match should fire for %+v", details))
-
-				modelConfig, err := imp.Import(details)
-				Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Import error: %v", err))
-				Expect(modelConfig.ConfigFile).To(ContainSubstring(e.expectBackend),
-					fmt.Sprintf("Model config: %+v", modelConfig))
-			})
-		}
-	})
-})
--- a/core/gallery/importers/mlx.go
+++ b/core/gallery/importers/mlx.go
@@ -64,17 +64,7 @@ func (i *MLXImporter) Import(details Details) (gallery.ModelConfig, error) {
 		description = "Imported from " + details.URI
 	}

-	// Vision-language checkpoints (e.g. gemma-4 E4B) declare the
-	// "image-text-to-text" pipeline tag on HuggingFace. The text-only mlx-lm
-	// tokenizer does not carry their processor chat template, so routing them
-	// through the plain mlx backend yields degenerate looping output
-	// (issue #10269). Send them to the mlx-vlm backend, which applies the
-	// processor-aware chat template.
 	backend := "mlx"
-	if details.HuggingFace != nil && details.HuggingFace.PipelineTag == "image-text-to-text" {
-		backend = "mlx-vlm"
-	}
-	// An explicit backend preference always wins.
 	b, ok := preferencesMap["backend"].(string)
 	if ok {
 		backend = b
--- a/core/gallery/importers/mlx_test.go
+++ b/core/gallery/importers/mlx_test.go
@@ -4,7 +4,6 @@ import (
 	"encoding/json"

 	"github.com/mudler/LocalAI/core/gallery/importers"
-	hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
 )
@@ -123,60 +122,6 @@ var _ = Describe("MLXImporter", func() {
 			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: mlx-vlm"))
 		})

-		It("should auto-route vision-language models to the mlx-vlm backend", func() {
-			// gemma-4 E4B and similar VLMs declare pipeline_tag
-			// "image-text-to-text" on HuggingFace. The text-only mlx-lm
-			// tokenizer does not carry their processor chat template, so
-			// routing them through the plain mlx backend produces degenerate
-			// looping output (issue #10269). They must go to mlx-vlm.
-			details := importers.Details{
-				URI: "https://huggingface.co/mlx-community/gemma-4-E4B-it-qat-4bit",
-				HuggingFace: &hfapi.ModelDetails{
-					ModelID:     "mlx-community/gemma-4-E4B-it-qat-4bit",
-					PipelineTag: "image-text-to-text",
-				},
-			}
-
-			modelConfig, err := importer.Import(details)
-
-			Expect(err).ToNot(HaveOccurred())
-			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: mlx-vlm"))
-		})
-
-		It("should keep text-only models on the plain mlx backend", func() {
-			details := importers.Details{
-				URI: "https://huggingface.co/mlx-community/Llama-3.2-1B-Instruct-4bit",
-				HuggingFace: &hfapi.ModelDetails{
-					ModelID:     "mlx-community/Llama-3.2-1B-Instruct-4bit",
-					PipelineTag: "text-generation",
-				},
-			}
-
-			modelConfig, err := importer.Import(details)
-
-			Expect(err).ToNot(HaveOccurred())
-			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: mlx"))
-			Expect(modelConfig.ConfigFile).ToNot(ContainSubstring("backend: mlx-vlm"))
-		})
-
-		It("should honor an explicit backend preference even for a VLM", func() {
-			preferences := json.RawMessage(`{"backend": "mlx"}`)
-			details := importers.Details{
-				URI:         "https://huggingface.co/mlx-community/gemma-4-E4B-it-qat-4bit",
-				Preferences: preferences,
-				HuggingFace: &hfapi.ModelDetails{
-					ModelID:     "mlx-community/gemma-4-E4B-it-qat-4bit",
-					PipelineTag: "image-text-to-text",
-				},
-			}
-
-			modelConfig, err := importer.Import(details)
-
-			Expect(err).ToNot(HaveOccurred())
-			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: mlx"))
-			Expect(modelConfig.ConfigFile).ToNot(ContainSubstring("backend: mlx-vlm"))
-		})
-
 		It("should handle invalid JSON preferences", func() {
 			preferences := json.RawMessage(`invalid json`)
 			details := importers.Details{
--- a/core/http/app_test.go
+++ b/core/http/app_test.go
@@ -383,13 +383,13 @@ var _ = Describe("API test", func() {
 			Expect(err).ToNot(HaveOccurred())

 			go func() {
-				if err := app.Start("127.0.0.1:9090"); err != nil && err != http.ErrServerClosed {
+				if err := app.Start(testHTTPAddr); err != nil && err != http.ErrServerClosed {
 					xlog.Error("server error", "error", err)
 				}
 			}()

 			defaultConfig := openai.DefaultConfig(apiKey)
-			defaultConfig.BaseURL = "http://127.0.0.1:9090/v1"
+			defaultConfig.BaseURL = testHTTPBase + "/v1"

 			client2 = openaigo.NewClient("")
 			client2.BaseURL = defaultConfig.BaseURL
@@ -418,7 +418,7 @@ var _ = Describe("API test", func() {

 		Context("Auth Tests", func() {
 			It("Should fail if the api key is missing", func() {
-				err, sc := postInvalidRequest("http://127.0.0.1:9090/models/available")
+				err, sc := postInvalidRequest(testHTTPBase + "/models/available")
 				Expect(err).ToNot(BeNil())
 				Expect(sc).To(Equal(401))
 			})
@@ -427,7 +427,7 @@ var _ = Describe("API test", func() {
 		Context("URL routing Tests", func() {
 			It("Should support reverse-proxy when unauthenticated", func() {

-				err, sc, body := getRequest("http://127.0.0.1:9090/myprefix/", http.Header{
+				err, sc, body := getRequest(testHTTPBase+"/myprefix/", http.Header{
 					"X-Forwarded-Proto":  {"https"},
 					"X-Forwarded-Host":   {"example.org"},
 					"X-Forwarded-Prefix": {"/myprefix/"},
@@ -441,7 +441,7 @@ var _ = Describe("API test", func() {

 			It("Should support reverse-proxy when authenticated", func() {

-				err, sc, body := getRequest("http://127.0.0.1:9090/myprefix/", http.Header{
+				err, sc, body := getRequest(testHTTPBase+"/myprefix/", http.Header{
 					"Authorization":      {bearerKey},
 					"X-Forwarded-Proto":  {"https"},
 					"X-Forwarded-Host":   {"example.org"},
@@ -459,7 +459,7 @@ var _ = Describe("API test", func() {
 			// requests them through the proxy.
 			It("Should support reverse-proxy when prefix is stripped by the proxy", func() {

-				err, sc, body := getRequest("http://127.0.0.1:9090/app", http.Header{
+				err, sc, body := getRequest(testHTTPBase+"/app", http.Header{
 					"X-Forwarded-Proto":  {"https"},
 					"X-Forwarded-Host":   {"example.org"},
 					"X-Forwarded-Prefix": {"/myprefix"},
@@ -477,7 +477,7 @@ var _ = Describe("API test", func() {
 			// from a foreign origin. BasePathPrefix must reject these via
 			// SafeForwardedPrefix and fall back to "/".
 			It("Should ignore an unsafe X-Forwarded-Prefix and not poison asset URLs", func() {
-				err, sc, body := getRequest("http://127.0.0.1:9090/app", http.Header{
+				err, sc, body := getRequest(testHTTPBase+"/app", http.Header{
 					"X-Forwarded-Proto":  {"https"},
 					"X-Forwarded-Host":   {"example.org"},
 					"X-Forwarded-Prefix": {"//evil.com"},
@@ -492,13 +492,13 @@ var _ = Describe("API test", func() {
 		Context("Applying models", func() {

 			It("applies models from a gallery", func() {
-				models, err := getModels("http://127.0.0.1:9090/models/available")
+				models, err := getModels(testHTTPBase + "/models/available")
 				Expect(err).To(BeNil())
 				Expect(len(models)).To(Equal(2), fmt.Sprint(models))
 				Expect(models[0].Installed).To(BeFalse(), fmt.Sprint(models))
 				Expect(models[1].Installed).To(BeFalse(), fmt.Sprint(models))

-				response := postModelApplyRequest("http://127.0.0.1:9090/models/apply", modelApplyRequest{
+				response := postModelApplyRequest(testHTTPBase+"/models/apply", modelApplyRequest{
 					ID: "test@bert2",
 				})

@@ -507,7 +507,7 @@ var _ = Describe("API test", func() {
 				uuid := response["uuid"].(string)
 				resp := map[string]any{}
 				Eventually(func() bool {
-					response := getModelStatus("http://127.0.0.1:9090/models/jobs/" + uuid)
+					response := getModelStatus(testHTTPBase + "/models/jobs/" + uuid)
 					fmt.Println(response)
 					resp = response
 					return response["processed"].(bool)
@@ -526,7 +526,7 @@ var _ = Describe("API test", func() {
 				Expect(content["usage"]).To(ContainSubstring("You can test this model with curl like this"))
 				Expect(content["foo"]).To(Equal("bar"))

-				models, err = getModels("http://127.0.0.1:9090/models/available")
+				models, err = getModels(testHTTPBase + "/models/available")
 				Expect(err).To(BeNil())
 				Expect(len(models)).To(Equal(2), fmt.Sprint(models))
 				Expect(models[0].Name).To(Or(Equal("bert"), Equal("bert2")))
@@ -541,7 +541,7 @@ var _ = Describe("API test", func() {
 			})
 			It("overrides models", func() {

-				response := postModelApplyRequest("http://127.0.0.1:9090/models/apply", modelApplyRequest{
+				response := postModelApplyRequest(testHTTPBase+"/models/apply", modelApplyRequest{
 					URL:  bertEmbeddingsURL,
 					Name: "bert",
 					Overrides: map[string]any{
@@ -554,7 +554,7 @@ var _ = Describe("API test", func() {
 				uuid := response["uuid"].(string)

 				Eventually(func() bool {
-					response := getModelStatus("http://127.0.0.1:9090/models/jobs/" + uuid)
+					response := getModelStatus(testHTTPBase + "/models/jobs/" + uuid)
 					return response["processed"].(bool)
 				}, "360s", "10s").Should(Equal(true))

@@ -567,7 +567,7 @@ var _ = Describe("API test", func() {
 				Expect(content["backend"]).To(Equal("llama"))
 			})
 			It("apply models without overrides", func() {
-				response := postModelApplyRequest("http://127.0.0.1:9090/models/apply", modelApplyRequest{
+				response := postModelApplyRequest(testHTTPBase+"/models/apply", modelApplyRequest{
 					URL:       bertEmbeddingsURL,
 					Name:      "bert",
 					Overrides: map[string]any{},
@@ -578,7 +578,7 @@ var _ = Describe("API test", func() {
 				uuid := response["uuid"].(string)

 				Eventually(func() bool {
-					response := getModelStatus("http://127.0.0.1:9090/models/jobs/" + uuid)
+					response := getModelStatus(testHTTPBase + "/models/jobs/" + uuid)
 					return response["processed"].(bool)
 				}, "360s", "10s").Should(Equal(true))

@@ -622,14 +622,14 @@ parameters:
 				}

 				var response schema.GalleryResponse
-				err := postRequestResponseJSON("http://127.0.0.1:9090/models/import-uri", &importReq, &response)
+				err := postRequestResponseJSON(testHTTPBase+"/models/import-uri", &importReq, &response)
 				Expect(err).ToNot(HaveOccurred())
 				Expect(response.ID).ToNot(BeEmpty())

 				uuid := response.ID
 				resp := map[string]any{}
 				Eventually(func() bool {
-					response := getModelStatus("http://127.0.0.1:9090/models/jobs/" + uuid)
+					response := getModelStatus(testHTTPBase + "/models/jobs/" + uuid)
 					resp = response
 					return response["processed"].(bool)
 				}, "360s", "10s").Should(Equal(true))
@@ -657,7 +657,7 @@ parameters:
 				}

 				var response schema.GalleryResponse
-				err := postRequestResponseJSON("http://127.0.0.1:9090/models/import-uri", &importReq, &response)
+				err := postRequestResponseJSON(testHTTPBase+"/models/import-uri", &importReq, &response)
 				// The endpoint should return an error immediately
 				Expect(err).To(HaveOccurred())
 				Expect(err.Error()).To(ContainSubstring("failed to discover model config"))
@@ -693,14 +693,14 @@ parameters:
 				}

 				var response schema.GalleryResponse
-				err := postRequestResponseJSON("http://127.0.0.1:9090/models/import-uri", &importReq, &response)
+				err := postRequestResponseJSON(testHTTPBase+"/models/import-uri", &importReq, &response)
 				Expect(err).ToNot(HaveOccurred())
 				Expect(response.ID).ToNot(BeEmpty())

 				uuid := response.ID
 				resp := map[string]any{}
 				Eventually(func() bool {
-					response := getModelStatus("http://127.0.0.1:9090/models/jobs/" + uuid)
+					response := getModelStatus(testHTTPBase + "/models/jobs/" + uuid)
 					resp = response
 					return response["processed"].(bool)
 				}, "360s", "10s").Should(Equal(true))
@@ -751,13 +751,13 @@ parameters:
 			app, err = API(localAIApp)
 			Expect(err).ToNot(HaveOccurred())
 			go func() {
-				if err := app.Start("127.0.0.1:9090"); err != nil && err != http.ErrServerClosed {
+				if err := app.Start(testHTTPAddr); err != nil && err != http.ErrServerClosed {
 					xlog.Error("server error", "error", err)
 				}
 			}()

 			defaultConfig := openai.DefaultConfig("")
-			defaultConfig.BaseURL = "http://127.0.0.1:9090/v1"
+			defaultConfig.BaseURL = testHTTPBase + "/v1"

 			client2 = openaigo.NewClient("")
 			client2.BaseURL = defaultConfig.BaseURL
@@ -801,7 +801,7 @@ parameters:
 			// Mock-backend is registered via SetExternalBackend so it appears
 			// alongside any built-in entries; verifying that string proves the
 			// endpoint is wired up regardless of which real backends exist.
-			resp, err := http.Get("http://127.0.0.1:9090/system")
+			resp, err := http.Get(testHTTPBase + "/system")
 			Expect(err).ToNot(HaveOccurred())
 			Expect(resp.StatusCode).To(Equal(200))
 			dat, err := io.ReadAll(resp.Body)
@@ -824,14 +824,14 @@ parameters:
 				}

 				var createResp map[string]any
-				err := postRequestResponseJSON("http://127.0.0.1:9090/api/agent/tasks", &taskBody, &createResp)
+				err := postRequestResponseJSON(testHTTPBase+"/api/agent/tasks", &taskBody, &createResp)
 				Expect(err).ToNot(HaveOccurred())
 				Expect(createResp["id"]).ToNot(BeEmpty())
 				taskID := createResp["id"].(string)

 				// Get the task
 				var task schema.Task
-				resp, err := http.Get("http://127.0.0.1:9090/api/agent/tasks/" + taskID)
+				resp, err := http.Get(testHTTPBase + "/api/agent/tasks/" + taskID)
 				Expect(err).ToNot(HaveOccurred())
 				Expect(resp.StatusCode).To(Equal(200))
 				body, _ := io.ReadAll(resp.Body)
@@ -839,7 +839,7 @@ parameters:
 				Expect(task.Name).To(Equal("Test Task"))

 				// List tasks
-				resp, err = http.Get("http://127.0.0.1:9090/api/agent/tasks")
+				resp, err = http.Get(testHTTPBase + "/api/agent/tasks")
 				Expect(err).ToNot(HaveOccurred())
 				Expect(resp.StatusCode).To(Equal(200))
 				var tasks []schema.Task
@@ -849,18 +849,18 @@ parameters:

 				// Update task
 				taskBody["name"] = "Updated Task"
-				err = putRequestJSON("http://127.0.0.1:9090/api/agent/tasks/"+taskID, &taskBody)
+				err = putRequestJSON(testHTTPBase+"/api/agent/tasks/"+taskID, &taskBody)
 				Expect(err).ToNot(HaveOccurred())

 				// Verify update
-				resp, err = http.Get("http://127.0.0.1:9090/api/agent/tasks/" + taskID)
+				resp, err = http.Get(testHTTPBase + "/api/agent/tasks/" + taskID)
 				Expect(err).ToNot(HaveOccurred())
 				body, _ = io.ReadAll(resp.Body)
 				json.Unmarshal(body, &task)
 				Expect(task.Name).To(Equal("Updated Task"))

 				// Delete task
-				req, _ := http.NewRequest("DELETE", "http://127.0.0.1:9090/api/agent/tasks/"+taskID, nil)
+				req, _ := http.NewRequest("DELETE", testHTTPBase+"/api/agent/tasks/"+taskID, nil)
 				req.Header.Set("Authorization", bearerKey)
 				resp, err = http.DefaultClient.Do(req)
 				Expect(err).ToNot(HaveOccurred())
@@ -877,7 +877,7 @@ parameters:
 				}

 				var createResp map[string]any
-				err := postRequestResponseJSON("http://127.0.0.1:9090/api/agent/tasks", &taskBody, &createResp)
+				err := postRequestResponseJSON(testHTTPBase+"/api/agent/tasks", &taskBody, &createResp)
 				Expect(err).ToNot(HaveOccurred())
 				taskID := createResp["id"].(string)

@@ -888,14 +888,14 @@ parameters:
 				}

 				var jobResp schema.JobExecutionResponse
-				err = postRequestResponseJSON("http://127.0.0.1:9090/api/agent/jobs/execute", &jobBody, &jobResp)
+				err = postRequestResponseJSON(testHTTPBase+"/api/agent/jobs/execute", &jobBody, &jobResp)
 				Expect(err).ToNot(HaveOccurred())
 				Expect(jobResp.JobID).ToNot(BeEmpty())
 				jobID := jobResp.JobID

 				// Get job status
 				var job schema.Job
-				resp, err := http.Get("http://127.0.0.1:9090/api/agent/jobs/" + jobID)
+				resp, err := http.Get(testHTTPBase + "/api/agent/jobs/" + jobID)
 				Expect(err).ToNot(HaveOccurred())
 				Expect(resp.StatusCode).To(Equal(200))
 				body, _ := io.ReadAll(resp.Body)
@@ -904,7 +904,7 @@ parameters:
 				Expect(job.TaskID).To(Equal(taskID))

 				// List jobs
-				resp, err = http.Get("http://127.0.0.1:9090/api/agent/jobs")
+				resp, err = http.Get(testHTTPBase + "/api/agent/jobs")
 				Expect(err).ToNot(HaveOccurred())
 				Expect(resp.StatusCode).To(Equal(200))
 				var jobs []schema.Job
@@ -914,7 +914,7 @@ parameters:

 				// Cancel job (if still pending/running)
 				if job.Status == schema.JobStatusPending || job.Status == schema.JobStatusRunning {
-					req, _ := http.NewRequest("POST", "http://127.0.0.1:9090/api/agent/jobs/"+jobID+"/cancel", nil)
+					req, _ := http.NewRequest("POST", testHTTPBase+"/api/agent/jobs/"+jobID+"/cancel", nil)
 					req.Header.Set("Authorization", bearerKey)
 					resp, err = http.DefaultClient.Do(req)
 					Expect(err).ToNot(HaveOccurred())
@@ -932,13 +932,13 @@ parameters:
 				}

 				var createResp map[string]any
-				err := postRequestResponseJSON("http://127.0.0.1:9090/api/agent/tasks", &taskBody, &createResp)
+				err := postRequestResponseJSON(testHTTPBase+"/api/agent/tasks", &taskBody, &createResp)
 				Expect(err).ToNot(HaveOccurred())

 				// Execute by name
 				paramsBody := map[string]string{"param1": "value1"}
 				var jobResp schema.JobExecutionResponse
-				err = postRequestResponseJSON("http://127.0.0.1:9090/api/agent/tasks/Named Task/execute", &paramsBody, &jobResp)
+				err = postRequestResponseJSON(testHTTPBase+"/api/agent/tasks/Named Task/execute", &paramsBody, &jobResp)
 				Expect(err).ToNot(HaveOccurred())
 				Expect(jobResp.JobID).ToNot(BeEmpty())
 			})
@@ -998,13 +998,13 @@ parameters:
 			Expect(err).ToNot(HaveOccurred())

 			go func() {
-				if err := app.Start("127.0.0.1:9090"); err != nil && err != http.ErrServerClosed {
+				if err := app.Start(testHTTPAddr); err != nil && err != http.ErrServerClosed {
 					xlog.Error("server error", "error", err)
 				}
 			}()

 			defaultConfig := openai.DefaultConfig("")
-			defaultConfig.BaseURL = "http://127.0.0.1:9090/v1"
+			defaultConfig.BaseURL = testHTTPBase + "/v1"
 			client2 = openaigo.NewClient("")
 			client2.BaseURL = defaultConfig.BaseURL
 			// Wait for API to be ready
--- a/core/http/endpoints/anthropic/messages.go
+++ b/core/http/endpoints/anthropic/messages.go
@@ -353,7 +353,7 @@ func handleAnthropicStream(c echo.Context, id string, input *schema.AnthropicReq
 			overrides = make(map[string]pii.Action, len(raw))
 			for ovid, action := range raw {
 				switch pii.Action(action) {
-				case pii.ActionMask, pii.ActionBlock, pii.ActionAllow:
+				case pii.ActionMask, pii.ActionBlock, pii.ActionRouteLocal:
 					overrides[ovid] = pii.Action(action)
 				}
 			}
--- a/core/http/endpoints/localai/api_instructions.go
+++ b/core/http/endpoints/localai/api_instructions.go
@@ -102,7 +102,7 @@ var instructionDefs = []instructionDef{
 		Name:        "pii-filtering",
 		Description: "Inspect and tune the regex PII filter applied to chat requests",
 		Tags:        []string{"pii"},
-		Intro:       "GET /api/pii/patterns lists the active pattern set with each one's action (mask, block, allow). GET /api/pii/events returns recent redaction events filtered by correlation_id / user_id / pattern_id (admin or local-user only). POST /api/pii/test dry-runs the redactor against an admin-supplied string. POST /api/pii/decide is the programmatic decision oracle for external routers: send `{text}`, receive `{findings, suggested_action, redacted_preview}` without LocalAI mutating, recording, or acting on the call — caller composes the action with its own policy. Default patterns: email, phone, SSN, credit card (Luhn), IPv4, common API key prefixes (sk-, pk-, ghp_, github_pat_). PII is per-model: by default it is OFF for non-proxy backends and ON for backends starting with proxy-* (cloud passthroughs). Opt in with `pii: { enabled: true }` in a model's YAML; use `pii: { patterns: [{id, action}] }` to upgrade or downgrade individual actions for that model. Override global default actions via --pii-config pii.yaml; --disable-pii turns the filter off entirely.",
+		Intro:       "GET /api/pii/patterns lists the active pattern set with each one's action (mask, block, route_local). GET /api/pii/events returns recent redaction events filtered by correlation_id / user_id / pattern_id (admin or local-user only). POST /api/pii/test dry-runs the redactor against an admin-supplied string. POST /api/pii/decide is the programmatic decision oracle for external routers: send `{text}`, receive `{findings, suggested_action, redacted_preview}` without LocalAI mutating, recording, or acting on the call — caller composes the action with its own policy. Default patterns: email, phone, SSN, credit card (Luhn), IPv4, common API key prefixes (sk-, pk-, ghp_, github_pat_). PII is per-model: by default it is OFF for non-proxy backends and ON for backends starting with proxy-* (cloud passthroughs). Opt in with `pii: { enabled: true }` in a model's YAML; use `pii: { patterns: [{id, action}] }` to upgrade or downgrade individual actions for that model. Override global default actions via --pii-config pii.yaml; --disable-pii turns the filter off entirely.",
 	},
 	{
 		Name:        "middleware-admin",
--- a/core/http/endpoints/localai/backend.go
+++ b/core/http/endpoints/localai/backend.go
@@ -25,6 +25,10 @@ var knownPrefOnlyBackends = []schema.KnownBackend{
 	// Text LLM
 	// ds4: antirez/ds4 - single-model DeepSeek V4 Flash engine; auto-detected via DS4Importer
 	{Name: "ds4", Modality: "text", AutoDetect: false, Description: "antirez/ds4 DeepSeek V4 Flash engine (auto-detected; pref-only fallback)"},
+	// dllm consumes GGUF weights like llama-cpp does, but only for the
+	// DiffusionGemma architecture - auto-detecting on .gguf would shadow
+	// llama-cpp, so it stays preference-only.
+	{Name: "dllm", Modality: "text", AutoDetect: false, Description: "dllm.cpp DiffusionGemma block-diffusion engine (preference-only)"},
 	{Name: "sglang", Modality: "text", AutoDetect: false, Description: "SGLang runtime (preference-only)"},
 	{Name: "tinygrad", Modality: "text", AutoDetect: false, Description: "tinygrad runtime (preference-only)"},
 	{Name: "trl", Modality: "text", AutoDetect: false, Description: "Transformers Reinforcement Learning (preference-only)"},
--- a/core/http/endpoints/localai/backend_test.go
+++ b/core/http/endpoints/localai/backend_test.go
@@ -135,6 +135,7 @@ var _ = Describe("Backend Endpoints", func() {
 				Expect(entry.Modality).To(Equal(modality))
 			}

+			expectPrefOnly("dllm", "text")
 			expectPrefOnly("sglang", "text")
 			expectPrefOnly("tinygrad", "text")
 			expectPrefOnly("trl", "text")
--- a/core/http/endpoints/localai/config_meta.go
+++ b/core/http/endpoints/localai/config_meta.go
@@ -124,8 +124,6 @@ func AutocompleteEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, a
 				filterFn = config.BuildUsecaseFilterFn(config.FLAG_VAD)
 			case config.UsecaseTranscript:
 				filterFn = config.BuildUsecaseFilterFn(config.FLAG_TRANSCRIPT)
-			case "score": // router classifier usecase (FLAG_SCORE); not in UsecaseInfoMap
-				filterFn = config.BuildUsecaseFilterFn(config.FLAG_SCORE)
 			default:
 				filterFn = config.NoFilterFn
 			}
--- a/core/http/endpoints/localai/pii_decide.go
+++ b/core/http/endpoints/localai/pii_decide.go
@@ -15,9 +15,9 @@ import (
 //
 // External routers (e.g. the localai-org/platform router) call this
 // before dispatching to learn whether to mask the prompt in place,
-// block the request, or pass it through. LocalAI's in-band PII
-// middleware is the alternative path for direct-to-LocalAI clients —
-// same Redactor, different framing.
+// route to a local-only backend, block the request, or pass it
+// through. LocalAI's in-band PII middleware is the alternative path
+// for direct-to-LocalAI clients — same Redactor, different framing.
 //
 // Takes the *pii.Redactor directly rather than the whole
 // *application.Application so the handler stays unit-testable with a
@@ -62,18 +62,24 @@ func PIIDecideEndpoint(redactor *pii.Redactor) echo.HandlerFunc {
 	}
 }

+// actionAllow is the wire-only value for "no findings". The other
+// three map to existing pii.Action* constants; allow has no in-band
+// counterpart because the in-band middleware simply passes through.
+const actionAllow = "allow"
+
 // suggestedAction collapses the Redactor's Result flags onto a single
-// wire-format action using the in-band ordering (block > mask >
-// allow). "allow" covers both "nothing matched" and "matched but every
-// span resolved to the allow action" — in both cases the caller may
-// dispatch unchanged, with the Findings list reporting what was seen.
+// wire-format action using the in-band ordering (block > route_local
+// > mask > allow). Non-empty Spans with neither Blocked nor LocalOnly
+// set means every match resolved to ActionMask.
 func suggestedAction(res pii.Result) string {
 	switch {
 	case res.Blocked:
 		return string(pii.ActionBlock)
-	case res.Masked:
+	case res.LocalOnly:
+		return string(pii.ActionRouteLocal)
+	case len(res.Spans) > 0:
 		return string(pii.ActionMask)
 	default:
-		return string(pii.ActionAllow)
+		return actionAllow
 	}
 }
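
Review note: the precedence in the hunk above (block > route_local > mask > allow) can be checked in isolation. The sketch below is a minimal stand-alone version, assuming a hypothetical `result` struct in place of `pii.Result` (only the three fields the mapping reads are modelled, and `Spans` is reduced to a count) and string literals in place of the `pii.Action*` constants.

```go
package main

import "fmt"

// result is a hypothetical stand-in for pii.Result; it models only the
// fields that the action mapping reads.
type result struct {
	Blocked   bool
	LocalOnly bool
	Spans     int // number of matched spans (the real type holds a slice)
}

// suggest collapses the flags onto one action, strongest first:
// block > route_local > mask > allow.
func suggest(r result) string {
	switch {
	case r.Blocked:
		return "block"
	case r.LocalOnly:
		return "route_local"
	case r.Spans > 0:
		// Matches exist, but none escalated to block or route_local.
		return "mask"
	default:
		// Nothing matched: the caller may dispatch unchanged.
		return "allow"
	}
}

func main() {
	fmt.Println(suggest(result{Blocked: true, LocalOnly: true, Spans: 2}))
	fmt.Println(suggest(result{LocalOnly: true, Spans: 1}))
	fmt.Println(suggest(result{Spans: 1}))
	fmt.Println(suggest(result{}))
}
```

When several flags are set at once, the first matching case wins, so a blocked result is reported as "block" even if it is also marked local-only.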
--- a/core/http/endpoints/localai/pii_decide_test.go
+++ b/core/http/endpoints/localai/pii_decide_test.go
@@ -16,8 +16,8 @@ import (

 // PIIDecideEndpoint exposes the redactor as a decision oracle. These
 // specs pin the validation surface and the suggested_action mapping
-// across the three actions (allow/mask/block). The redactor itself is
-// covered in core/services/routing/pii/redactor_test.go.
+// across all four actions (allow/mask/route_local/block). The redactor
+// itself is covered in core/services/routing/pii/redactor_test.go.

 var _ = Describe("PIIDecideEndpoint", func() {
 	var redactor *pii.Redactor
@@ -68,17 +68,16 @@ var _ = Describe("PIIDecideEndpoint", func() {
 		Expect(len(body.Findings)).To(BeNumerically(">=", 1))
 	})

-	It("returns allow when a matched pattern's action is allow", func() {
-		// Downgrade the email pattern to allow for this test —
-		// exercises the allow branch of suggestedAction: a match is
-		// found, but the strongest action is allow so the suggestion
-		// is "allow" and the text is left intact.
-		Expect(redactor.SetAction("email", pii.ActionAllow)).To(Succeed())
+	It("returns route_local when an override sets that action", func() {
+		// Promote the email pattern to route_local for this test —
+		// exercises the route_local branch of suggestedAction without
+		// needing a custom pattern set.
+		Expect(redactor.SetAction("email", pii.ActionRouteLocal)).To(Succeed())
 		rec, body := invokePIIDecide(redactor, `{"text":"contact alice@example.com"}`)
 		Expect(rec.Code).To(Equal(http.StatusOK))
-		Expect(body.SuggestedAction).To(Equal("allow"))
-		Expect(body.Findings).To(HaveLen(1), "allow still reports the finding")
-		// allow leaves the original text intact.
+		Expect(body.SuggestedAction).To(Equal("route_local"))
+		// route_local leaves the original text intact — caller decides
+		// whether to forward it to a local-only backend.
 		Expect(body.RedactedPreview).To(ContainSubstring("alice@example.com"))
 	})

--- a/core/http/endpoints/localai/settings.go
+++ b/core/http/endpoints/localai/settings.go
@@ -221,18 +221,9 @@ func UpdateSettingsEndpoint(app *application.Application) echo.HandlerFunc {
 		// Check if agent job retention changed
 		agentJobChanged := settings.AgentJobRetentionDays != nil

-		// Restart watchdog if settings changed.
-		//
-		// The live start/stop decision derives from the post-apply config
-		// (WatchdogShouldRun) rather than the raw watchdog_enabled request
-		// field: the React master toggle only ever writes the idle/busy flags,
-		// so keying off watchdog_enabled left the live watchdog stopped on a
-		// cold enable until the next restart (#9125). WatchdogShouldRun mirrors
-		// the gating in startWatchdog, so a cold enable starts it immediately
-		// and a full disable (both checks off, no LRU / memory reclaimer) stops
-		// it.
+		// Restart watchdog if settings changed
 		if watchdogChanged {
-			if !appConfig.WatchdogShouldRun() {
+			if settings.WatchdogEnabled != nil && !*settings.WatchdogEnabled {
 				if err := app.StopWatchdog(); err != nil {
 					xlog.Error("Failed to stop watchdog", "error", err)
 					return c.JSON(http.StatusInternalServerError, schema.SettingsResponse{
--- a/core/http/endpoints/localai/settings_test.go
+++ b/core/http/endpoints/localai/settings_test.go
@@ -108,20 +108,4 @@ var _ = Describe("Settings endpoints", func() {
 		_, err := os.Stat(filepath.Join(tmp, "runtime_settings.json"))
 		Expect(err).ToNot(HaveOccurred())
 	})
-
-	// Residual #9125: enabling the watchdog from a cold (off) state via the
-	// React master toggle must start the live watchdog immediately, without a
-	// restart. The toggle posts watchdog_idle_enabled/busy_enabled=true while
-	// the vestigial watchdog_enabled stays false (it was loaded false). The
-	// old handler keyed its stop decision off that raw watchdog_enabled=false
-	// and called StopWatchdog(), so the watchdog never started until restart.
-	It("starts the live watchdog on a cold enable even when watchdog_enabled=false", func() {
-		Expect(app.ModelLoader().GetWatchDog()).To(BeNil(), "precondition: watchdog should be off")
-
-		rec := post(`{"watchdog_enabled":false,"watchdog_idle_enabled":true,"watchdog_busy_enabled":true,"watchdog_idle_timeout":"15m","watchdog_busy_timeout":"5m","watchdog_interval":"1s"}`)
-		Expect(rec.Code).To(Equal(http.StatusOK))
-
-		Expect(app.ModelLoader().GetWatchDog()).ToNot(BeNil(),
-			"watchdog should be running after a cold enable, without waiting for a restart")
-	})
 })
--- a/core/http/endpoints/openai/completion.go
+++ b/core/http/endpoints/openai/completion.go
@@ -130,7 +130,7 @@ func CompletionEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, eva
 					overrides = make(map[string]pii.Action, len(raw))
 					for ovid, action := range raw {
 						switch pii.Action(action) {
-						case pii.ActionMask, pii.ActionBlock, pii.ActionAllow:
+						case pii.ActionMask, pii.ActionBlock, pii.ActionRouteLocal:
 							overrides[ovid] = pii.Action(action)
 						}
 					}
--- a/core/http/endpoints/openai/realtime_model.go
+++ b/core/http/endpoints/openai/realtime_model.go
@@ -451,14 +451,13 @@ func buildRealtimeRoutingContext(a *application.Application, sessionID string) *
 		return nil
 	}
 	deps := &middleware.ClassifierDeps{
-		Scorer:       a.Scorer,
-		TokenCounter: a.TokenCounter,
-		Embedder:     a.Embedder,
-		VectorStore:  a.VectorStore,
-		Reranker:     a.Reranker,
-		ModelLookup:  a.ModelConfigLookup(),
-		Registry:     a.RouterClassifierRegistry(),
-		Evaluator:    a.TemplatesEvaluator(),
+		Scorer:      a.Scorer,
+		Embedder:    a.Embedder,
+		VectorStore: a.VectorStore,
+		Reranker:    a.Reranker,
+		ModelLookup: a.ModelConfigLookup(),
+		Registry:    a.RouterClassifierRegistry(),
+		Evaluator:   a.TemplatesEvaluator(),
 	}
 	userID := ""
 	if u := a.FallbackUser(); u != nil {
--- a/core/http/http_suite_test.go
+++ b/core/http/http_suite_test.go
@@ -21,6 +21,20 @@ var (
 	mockBackendPath string
 )

+// testHTTPAddr is the listen address used by specs that start a full HTTP
+// server. Configurable so the suite can run on machines where the default
+// port is taken by an unrelated service (override: LOCALAI_TEST_HTTP_PORT).
+var testHTTPAddr = func() string {
+	port := os.Getenv("LOCALAI_TEST_HTTP_PORT")
+	if port == "" {
+		port = "9090"
+	}
+	return "127.0.0.1:" + port
+}()
+
+// testHTTPBase is the matching http://host:port prefix for client requests.
+var testHTTPBase = "http://" + testHTTPAddr
+
 // findMockBackendBinary locates the mock-backend binary built by
 // `make build-mock-backend`. Mirrors the lookup used by
 // tests/e2e/e2e_suite_test.go so both suites consume the same artifact.
--- a/core/http/middleware/probe_trim_test.go
+++ b/core/http/middleware/probe_trim_test.go
@@ -1,139 +0,0 @@
-package middleware
-
-import (
-	"strings"
-
-	"github.com/mudler/LocalAI/core/config"
-	"github.com/mudler/LocalAI/core/schema"
-
-	. "github.com/onsi/ginkgo/v2"
-	. "github.com/onsi/gomega"
-)
-
-var _ = Describe("routerConfigFingerprint", func() {
-	rc := config.RouterConfig{Classifier: "score", ClassifierModel: "arch-router"}
-	ctx4096 := 4096
-	ctx8192 := 8192
-
-	// Regression: the score classifier bakes context_size into its token
-	// budget at build time, and the built classifier is cached by this
-	// fingerprint. If context_size weren't hashed, editing it and reloading
-	// would return a classifier carrying the stale budget.
-	It("changes when the classifier model's context_size changes", func() {
-		cfgA := &config.ModelConfig{LLMConfig: config.LLMConfig{ContextSize: &ctx4096}}
-		cfgB := &config.ModelConfig{LLMConfig: config.LLMConfig{ContextSize: &ctx8192}}
-		Expect(routerConfigFingerprint(rc, cfgA)).NotTo(Equal(routerConfigFingerprint(rc, cfgB)))
-	})
-
-	It("is stable for identical classifier configs", func() {
-		cfgA := &config.ModelConfig{LLMConfig: config.LLMConfig{ContextSize: &ctx4096}}
-		cfgB := &config.ModelConfig{LLMConfig: config.LLMConfig{ContextSize: &ctx4096}}
-		Expect(routerConfigFingerprint(rc, cfgA)).To(Equal(routerConfigFingerprint(rc, cfgB)))
-	})
-})
-
-var _ = Describe("routing probe extraction and trimming", func() {
-	Describe("OpenAIProbeFromRequest", func() {
-		It("keeps a short conversation intact, newline-terminated per message", func() {
-			req := &schema.OpenAIRequest{Messages: []schema.Message{
-				{Role: "user", Content: "first"},
-				{Role: "assistant", Content: "second"},
-				{Role: "user", Content: "third"},
-			}}
-			Expect(OpenAIProbeFromRequest(req).Prompt).To(Equal("first\nsecond\nthird\n"))
-		})
-
-		It("flattens text blocks and skips image-only messages", func() {
-			req := &schema.OpenAIRequest{Messages: []schema.Message{
-				{Role: "user", Content: []any{
-					map[string]any{"type": "text", "text": "describe this"},
-					map[string]any{"type": "image_url", "image_url": map[string]any{"url": "data:..."}},
-				}},
-				{Role: "user", Content: []any{
-					map[string]any{"type": "image_url", "image_url": map[string]any{"url": "data:..."}},
-				}},
-			}}
-			// Second message contributes no text, so it neither adds a blank
-			// line nor a stray newline.
-			Expect(OpenAIProbeFromRequest(req).Prompt).To(Equal("describe this\n"))
-		})
-
-		It("carries the full conversation untrimmed — trimming is each classifier's job", func() {
-			// The middleware no longer caps the probe by a fixed rune budget;
-			// every turn reaches the Probe and each classifier trims to its own
-			// model's context (see modelTokenTrim / promptTrimmer).
-			block := strings.Repeat("x", 999)
-			msgs := make([]schema.Message, 0, 20)
-			msgs = append(msgs, schema.Message{Role: "user", Content: "OLDEST" + strings.Repeat("o", 994)})
-			for range 18 {
-				msgs = append(msgs, schema.Message{Role: "user", Content: block})
-			}
-			msgs = append(msgs, schema.Message{Role: "user", Content: "NEWEST" + strings.Repeat("n", 994)})
-
-			probe := OpenAIProbeFromRequest(&schema.OpenAIRequest{Messages: msgs})
-			Expect(probe.Prompt).To(ContainSubstring("OLDEST"), "no turn is dropped at probe-build time")
-			Expect(probe.Prompt).To(ContainSubstring("NEWEST"))
-			// Messages preserves the per-turn split the classifier trims from.
-			Expect(probe.Messages).To(HaveLen(20))
-			Expect(probe.Messages[0]).To(ContainSubstring("OLDEST"))
-			Expect(probe.Messages[19]).To(ContainSubstring("NEWEST"))
-		})
-	})
-
-	Describe("AnthropicProbe", func() {
-		It("extracts and trims the same way as the OpenAI path", func() {
-			req := &schema.AnthropicRequest{Messages: []schema.AnthropicMessage{
-				{Role: "user", Content: "alpha"},
-				{Role: "assistant", Content: []any{
-					map[string]any{"type": "text", "text": "beta"},
-				}},
-			}}
-			probe, ok := AnthropicProbe(req)
-			Expect(ok).To(BeTrue())
-			Expect(probe.Prompt).To(Equal("alpha\nbeta\n"))
-		})
-
-		It("returns ok=false for a non-Anthropic payload", func() {
-			_, ok := AnthropicProbe(&schema.OpenAIRequest{})
-			Expect(ok).To(BeFalse())
-		})
-	})
-
-	Describe("modelTokenTrim", func() {
-		tok := func(string) (int, error) { return 1, nil }
-		depsFor := func(cfg *config.ModelConfig) ClassifierDeps {
-			return ClassifierDeps{
-				ModelLookup:  func(string) *config.ModelConfig { return cfg },
-				TokenCounter: func(string) func(string) (int, error) { return tok },
-			}
-		}
-
-		It("still trims to the backend default when context_size is unset", func() {
-			// Regression: with the fixed middleware rune cap gone, an unset
-			// context_size must NOT disable trimming — otherwise a non-trivial
-			// prompt overflows the default 4096 window and every score fails.
-			score := config.FLAG_SCORE
-			cfg := &config.ModelConfig{KnownUsecases: &score} // FLAG_SCORE → batch follows context
-			count, ceiling := modelTokenTrim("classifier", depsFor(cfg))
-			Expect(count).NotTo(BeNil())
-			Expect(ceiling).To(Equal(4096), "unset context_size falls back to the backend default, not 0")
-		})
-
-		It("is bounded by the batch when the batch is smaller than the context", func() {
-			// The probe is one decode (n_tokens <= n_batch). A model with a
-			// large context but a small batch can only process the batch — the
-			// ceiling must follow it, not the context.
-			ctx8k := 8192
-			cfg := &config.ModelConfig{LLMConfig: config.LLMConfig{ContextSize: &ctx8k}}
-			cfg.Batch = 512
-			_, ceiling := modelTokenTrim("embedder", depsFor(cfg))
-			Expect(ceiling).To(Equal(512), "batch is the binding single-decode limit")
-		})
-
-		It("disables trimming only when no tokenizer is available", func() {
-			count, ceiling := modelTokenTrim("x", ClassifierDeps{ModelLookup: func(string) *config.ModelConfig { return &config.ModelConfig{} }})
-			Expect(count).To(BeNil())
-			Expect(ceiling).To(Equal(0))
-		})
-	})
-})
--- a/core/http/middleware/route_model.go
+++ b/core/http/middleware/route_model.go
@@ -6,7 +6,6 @@ import (
 	"encoding/hex"
 	"fmt"
 	"hash/fnv"
-	"strconv"
 	"strings"
 	"time"

@@ -87,12 +86,6 @@ type ClassifierDeps struct {
 	// templates.Evaluator so any model the operator points at gets
 	// its own chat template applied.
 	Evaluator *templates.Evaluator
-
-	// TokenCounter binds the classifier model's tokenizer for the score
-	// classifier's token-trim path. Optional; nil falls back to the
-	// backend's n_ctx guard. Plain func type so core/application supplies
-	// it as a method value without importing this package.
-	TokenCounter func(modelName string) func(text string) (int, error)
 }

 // ProbeExtractor pulls the prompt content out of a parsed request so
@@ -219,6 +212,7 @@ func recordHTTPDecision(c echo.Context, store router.DecisionStore, result *rout
 	_ = store.Record(context.Background(), result.ToDecisionRecord(newDecisionID(), correlationID, userID, source))
 }

+
 // GetOrBuildClassifier looks up a built Classifier for the named router
 // model in the registry and builds it on miss. Exported so the
 // /api/router/decide decision-oracle endpoint can share the same
@@ -268,10 +262,9 @@ func routerConfigFingerprint(rc config.RouterConfig, classifierCfg *config.Model
 	h := fnv.New64a()
 	h.Write(bytes)
 	if classifierCfg != nil {
-		// Narrow projection: only the fields buildClassifier reads (renderer,
-		// stop tokens, context_size → MaxContextTokens). Hashing the whole
-		// ModelConfig would invalidate the cache on irrelevant changes;
-		// omitting context_size would let a reload leave a stale token budget.
+		// Narrow projection: only the fields newTemplateRenderer and
+		// firstStopWord actually read. Hashing the whole ModelConfig
+		// would invalidate the cache on irrelevant parameter changes.
 		h.Write([]byte{0}) // separator so empty fields don't collide
 		h.Write([]byte(classifierCfg.TemplateConfig.Chat))
 		h.Write([]byte{0})
@@ -281,10 +274,6 @@ func routerConfigFingerprint(rc config.RouterConfig, classifierCfg *config.Model
 			h.Write([]byte(sw))
 			h.Write([]byte{0})
 		}
-		h.Write([]byte{0})
-		if classifierCfg.ContextSize != nil {
-			h.Write([]byte(strconv.Itoa(*classifierCfg.ContextSize)))
-		}
 	}
 	return h.Sum64()
 }
@@ -330,30 +319,11 @@ func buildClassifier(cfg *config.ModelConfig, deps ClassifierDeps) (router.Class
 		if deps.ModelLookup != nil {
 			if classifierCfg := deps.ModelLookup(rc.ClassifierModel); classifierCfg != nil {
 				if deps.Evaluator != nil {
-					// The router renders the scoring prompt client-side, so the
-					// classifier model MUST carry a chat template — refusing
-					// here beats silently falling back to a generic ChatML
-					// envelope the model may not have been trained on.
-					renderer := newTemplateRenderer(deps.Evaluator, classifierCfg)
-					if renderer == nil {
-						return nil, fmt.Errorf(
-							"router classifier score: classifier_model %q has no chat template "+
-								"(set template.chat and template.chat_message in its config). The router "+
-								"renders the scoring prompt with the classifier model's own template; "+
-								"without it the prompt format would not match the model",
-							rc.ClassifierModel)
-					}
-					opts.PromptRenderer = renderer
+					opts.PromptRenderer = newTemplateRenderer(deps.Evaluator, classifierCfg)
 				}
 				if st := pickAssistantTurnEnd(classifierCfg.StopWords, classifierCfg.TemplateConfig.ChatMessage); st != "" {
 					opts.StopToken = st
 				}
-				// Token-exact conversation trim — score classifier drops the
-				// oldest turns using the model's own tokenizer.
-				if count, ctxTokens := modelTokenTrim(rc.ClassifierModel, deps); count != nil {
-					opts.TokenCounter = count
-					opts.MaxContextTokens = ctxTokens
-				}
 			}
 		}
 		inner = router.NewScoreClassifier(policies, scorer, opts)
@@ -365,11 +335,7 @@ func buildClassifier(cfg *config.ModelConfig, deps ClassifierDeps) (router.Class
 		if reranker == nil {
 			return nil, fmt.Errorf("router classifier colbert: classifier_model %q not loadable", rc.ClassifierModel)
 		}
-		rerankClassifier := router.NewRerankClassifier(policies, reranker, cacheCap, rc.ActivationThreshold)
-		if count, ctxTokens := modelTokenTrim(rc.ClassifierModel, deps); count != nil {
-			rerankClassifier = rerankClassifier.WithTokenTrim(count, ctxTokens)
-		}
-		inner = rerankClassifier
+		inner = router.NewRerankClassifier(policies, reranker, cacheCap, rc.ActivationThreshold)
 	default:
 		return nil, fmt.Errorf("router: unknown classifier %q (supported: %s)", name, strings.Join([]string{router.ClassifierScore, router.ClassifierColbert}, ", "))
 	}
@@ -557,41 +523,7 @@ func wrapWithEmbeddingCache(cfg *config.ModelConfig, inner router.Classifier, de
 	if vstore == nil {
 		return nil, fmt.Errorf("vector store %q not loadable", storeName)
 	}
-	cache := router.NewEmbeddingCacheClassifier(inner, embedder, vstore, ec.SimilarityThreshold, ec.ConfidenceThreshold)
-	// Trim the probe to the embedder model's own context (e.g. nomic-embed at
-	// 8k) rather than a fixed guess — otherwise the cache key is an embedding
-	// of a silently-truncated conversation.
-	if count, ctxTokens := modelTokenTrim(ec.EmbeddingModel, deps); count != nil {
-		cache = cache.WithTokenTrim(count, ctxTokens)
-	}
-	return cache, nil
-}
-
-// modelTokenTrim returns a model's own tokenizer and the token ceiling its
-// probe must fit, or (nil, 0) when no tokenizer is available (only then can we
-// not trim exactly). The ceiling is min(effective context, effective batch):
-// score/embed/rerank all decode the whole prompt in one pass, so it must fit
-// both the context window and a single batch. Using the backend's *effective*
-// values — not the raw config fields — means trimming still works when
-// context_size and batch are unset; otherwise a non-trivial prompt overflows
-// the default window and every classification fails.
-func modelTokenTrim(modelName string, deps ClassifierDeps) (func(string) (int, error), int) {
-	if deps.TokenCounter == nil || deps.ModelLookup == nil {
-		return nil, 0
-	}
-	cfg := deps.ModelLookup(modelName)
-	if cfg == nil {
-		return nil, 0
-	}
-	count := deps.TokenCounter(modelName)
-	if count == nil {
-		return nil, 0
-	}
-	ceiling := backend.EffectiveContextSize(*cfg)
-	if b := backend.EffectiveBatchSize(*cfg); b < ceiling {
-		ceiling = b
-	}
-	return count, ceiling
+	return router.NewEmbeddingCacheClassifier(inner, embedder, vstore, ec.SimilarityThreshold, ec.ConfidenceThreshold), nil
 }

 func newDecisionID() string {
@@ -613,41 +545,6 @@ func OpenAIProbe(parsed any) (router.Probe, bool) {
 	return OpenAIProbeFromRequest(req), true
 }

-// messageText flattens a chat message's Content to plain text: string content
-// verbatim; []any structured content contributes only its "text" blocks.
-func messageText(content any) string {
-	switch ct := content.(type) {
-	case string:
-		return ct
-	case []any:
-		var b strings.Builder
-		for _, block := range ct {
-			if bm, ok := block.(map[string]any); ok && bm["type"] == "text" {
-				if t, ok := bm["text"].(string); ok {
-					if b.Len() > 0 {
-						b.WriteByte('\n')
-					}
-					b.WriteString(t)
-				}
-			}
-		}
-		return b.String()
-	}
-	return ""
-}
-
-// messageProbeParts drops empty (e.g. image-only) messages so they don't
-// consume budget or emit blank lines.
-func messageProbeParts(texts []string) []string {
-	parts := make([]string, 0, len(texts))
-	for _, t := range texts {
-		if t != "" {
-			parts = append(parts, t)
-		}
-	}
-	return parts
-}
-
 // OpenAIProbeFromRequest is the typed counterpart of OpenAIProbe — same
 // extraction logic, but takes the request struct directly. Realtime and
 // other non-HTTP callers use it to feed a probe to router.Resolve
@@ -656,15 +553,24 @@ func OpenAIProbeFromRequest(req *schema.OpenAIRequest) router.Probe {
 	if req == nil {
 		return router.Probe{}
 	}
-	texts := make([]string, len(req.Messages))
+	var b strings.Builder
 	for i := range req.Messages {
-		texts[i] = messageText(req.Messages[i].Content)
+		switch ct := req.Messages[i].Content.(type) {
+		case string:
+			b.WriteString(ct)
+			b.WriteByte('\n')
+		case []any:
+			for _, block := range ct {
+				if bm, ok := block.(map[string]any); ok && bm["type"] == "text" {
+					if t, ok := bm["text"].(string); ok {
+						b.WriteString(t)
+						b.WriteByte('\n')
+					}
+				}
+			}
+		}
 	}
-	parts := messageProbeParts(texts)
-	// Prompt carries the full conversation; each classifier trims it to its own
-	// model's context (see modelTokenTrim). Messages preserves the per-turn
-	// split the trimmer drops oldest-first.
-	return router.Probe{Prompt: router.JoinTurns(parts), Messages: parts}
+	return router.Probe{Prompt: b.String()}
 }

 // AnthropicProbe is the AnthropicRequest analogue of OpenAIProbe.
@@ -673,10 +579,25 @@ func AnthropicProbe(parsed any) (router.Probe, bool) {
 	if !ok || req == nil {
 		return router.Probe{}, false
 	}
-	texts := make([]string, len(req.Messages))
+	var b strings.Builder
 	for i := range req.Messages {
-		texts[i] = messageText(req.Messages[i].Content)
+		switch ct := req.Messages[i].Content.(type) {
+		case string:
+			b.WriteString(ct)
+			b.WriteByte('\n')
+		case []any:
+			for _, block := range ct {
+				if bm, ok := block.(map[string]any); ok && bm["type"] == "text" {
+					if t, ok := bm["text"].(string); ok {
+						b.WriteString(t)
+						b.WriteByte('\n')
+					}
+				}
+			}
+		}
 	}
-	parts := messageProbeParts(texts)
-	return router.Probe{Prompt: router.JoinTurns(parts), Messages: parts}, true
+	return router.Probe{
+		Prompt: b.String(),
+	}, true
 }
+
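Reviewer note: the two hunks above inline the same content-flattening switch in both `OpenAIProbeFromRequest` and `AnthropicProbe`. The sketch below shows that behavior in isolation: string content is written verbatim, structured content contributes only its `"text"` blocks, and every piece is followed by a newline. `flattenContent` and `buildPrompt` are hypothetical names used for illustration only and do not exist in the patch.

```go
package main

import (
	"fmt"
	"strings"
)

// flattenContent is a hypothetical helper (not part of the patch) that mirrors
// the inlined switch: a string is written verbatim, a []any contributes only
// its "text" blocks, and each written piece is terminated with '\n'.
func flattenContent(b *strings.Builder, content any) {
	switch ct := content.(type) {
	case string:
		b.WriteString(ct)
		b.WriteByte('\n')
	case []any:
		for _, block := range ct {
			bm, ok := block.(map[string]any)
			if !ok || bm["type"] != "text" {
				continue // non-text blocks (e.g. images) are skipped
			}
			if t, ok := bm["text"].(string); ok {
				b.WriteString(t)
				b.WriteByte('\n')
			}
		}
	}
}

// buildPrompt (also hypothetical) concatenates the flattened messages in order.
func buildPrompt(contents []any) string {
	var b strings.Builder
	for _, c := range contents {
		flattenContent(&b, c)
	}
	return b.String()
}

func main() {
	prompt := buildPrompt([]any{
		"hello world",
		[]any{
			map[string]any{"type": "image_url", "image_url": "data:..."},
			map[string]any{"type": "text", "text": "describe this"},
		},
	})
	fmt.Printf("%q\n", prompt)
}
```

Because every piece ends in a newline, a single-message request reaches the renderer as `"hello world\n"`, which is the trailing newline the updated test expectation in `route_model_test.go` accounts for.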
--- a/core/http/middleware/route_model_test.go
+++ b/core/http/middleware/route_model_test.go
@@ -246,12 +246,11 @@ var _ = Describe("RouteModel rendered classifier prompt", func() {
 			"rendered prompt must end at assistant-open marker. got: %q", s.lastPrompt)
 	})

-	It("refuses to build the router when the classifier model has no chat_message template", func() {
-		// Partial template config: only the outer Chat, no per-role piece.
-		// The router renders the scoring prompt client-side from the
-		// classifier model's own template, so a missing template is a hard
-		// error rather than a silent fall back to a generic ChatML envelope
-		// the model may not have been trained on.
+	It("falls back to chatMLRenderer when the classifier model has no chat_message template", func() {
+		// Partial template config: only the outer Chat, no per-role
+		// piece. The template renderer refuses to render rather than
+		// emit a prompt that drops the system turn, so the score
+		// classifier's built-in ChatML default takes over.
 		writePartialClassifierModel(modelDir, "arch-router")
 		routerCfg := newScoreRouterModel(modelDir, "smart-router")

@@ -267,9 +266,19 @@ var _ = Describe("RouteModel rendered classifier prompt", func() {
 				ModelLookup: loaderLookup(loader, appConfig),
 				Evaluator:   eval,
 			})
-		Expect(err).To(HaveOccurred())
-		Expect(err.Error()).To(ContainSubstring("no chat template"),
-			"missing classifier template must surface as a clear config error. got: %v", err)
+		Expect(err).NotTo(HaveOccurred())
+
+		// chatMLRenderer fallback emits its own envelope and still
+		// embeds the routing system prompt. OpenAIProbeFromRequest
+		// appends "\n" after each message body, so the user content
+		// reaches the renderer as "hello world\n" — the substring
+		// match accounts for that.
+		Expect(s.lastPrompt).To(ContainSubstring("<routes>"),
+			"fallback renderer also dropped the system prompt")
+		Expect(s.lastPrompt).To(ContainSubstring("<|im_start|>system\n"))
+		Expect(s.lastPrompt).To(ContainSubstring("<|im_start|>user\nhello world\n<|im_end|>"))
+		Expect(strings.HasSuffix(s.lastPrompt, "<|im_start|>assistant\n")).To(BeTrue(),
+			"chatMLRenderer fallback must end at assistant-open marker. got: %q", s.lastPrompt)
 	})

 	It("uses the classifier model's first stopword as the candidate suffix", func() {
@@ -524,8 +533,8 @@ template:

 // writePartialClassifierModel writes a classifier model that has the
 // outer Chat template but no ChatMessage — exercises the
-// newTemplateRenderer "refuse partial templating" branch, which makes
-// buildClassifier reject the router with a missing-template error.
+// newTemplateRenderer "refuse partial templating" branch that hands
+// off to chatMLRenderer.
 func writePartialClassifierModel(modelDir, name string) {
 	body := `name: ` + name + `
 backend: llama-cpp
--- a/core/http/openresponses_test.go
+++ b/core/http/openresponses_test.go
@@ -59,14 +59,14 @@ var _ = Describe("Open Responses API", func() {
 			Expect(err).ToNot(HaveOccurred())

 			go func() {
-				if err := app.Start("127.0.0.1:9090"); err != nil && err != http.ErrServerClosed {
+				if err := app.Start(testHTTPAddr); err != nil && err != http.ErrServerClosed {
 					xlog.Error("server error", "error", err)
 				}
 			}()

 			// Wait for API to be ready
 			Eventually(func() error {
-				resp, err := http.Get("http://127.0.0.1:9090/healthz")
+				resp, err := http.Get(testHTTPBase + "/healthz")
 				if err != nil {
 					return err
 				}
@@ -95,7 +95,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -118,7 +118,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -143,7 +143,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -168,7 +168,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -196,7 +196,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -241,7 +241,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -269,7 +269,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -297,7 +297,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -328,7 +328,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -358,7 +358,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -386,7 +386,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -418,7 +418,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -454,7 +454,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -490,7 +490,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -539,7 +539,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -590,7 +590,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -624,7 +624,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -658,7 +658,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -680,7 +680,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -727,7 +727,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -756,7 +756,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -799,7 +799,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -835,7 +835,7 @@ var _ = Describe("Open Responses API", func() {
 				payload1, err := json.Marshal(reqBody1)
 				Expect(err).ToNot(HaveOccurred())

-				req1, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload1))
+				req1, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload1))
 				Expect(err).ToNot(HaveOccurred())
 				req1.Header.Set("Content-Type", "application/json")
 				req1.Header.Set("Authorization", bearerKey)
@@ -869,7 +869,7 @@ var _ = Describe("Open Responses API", func() {
 				payload2, err := json.Marshal(reqBody2)
 				Expect(err).ToNot(HaveOccurred())

-				req2, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload2))
+				req2, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload2))
 				Expect(err).ToNot(HaveOccurred())
 				req2.Header.Set("Content-Type", "application/json")
 				req2.Header.Set("Authorization", bearerKey)
@@ -897,7 +897,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
@@ -933,7 +933,7 @@ var _ = Describe("Open Responses API", func() {
 				payload1, err := json.Marshal(reqBody1)
 				Expect(err).ToNot(HaveOccurred())

-				req1, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload1))
+				req1, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload1))
 				Expect(err).ToNot(HaveOccurred())
 				req1.Header.Set("Content-Type", "application/json")
 				req1.Header.Set("Authorization", bearerKey)
@@ -983,7 +983,7 @@ var _ = Describe("Open Responses API", func() {
 				payload2, err := json.Marshal(reqBody2)
 				Expect(err).ToNot(HaveOccurred())

-				req2, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload2))
+				req2, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload2))
 				Expect(err).ToNot(HaveOccurred())
 				req2.Header.Set("Content-Type", "application/json")
 				req2.Header.Set("Authorization", bearerKey)
@@ -1009,7 +1009,7 @@ var _ = Describe("Open Responses API", func() {
 				payload, err := json.Marshal(reqBody)
 				Expect(err).ToNot(HaveOccurred())

-				req, err := http.NewRequest("POST", "http://127.0.0.1:9090/v1/responses", bytes.NewBuffer(payload))
+				req, err := http.NewRequest("POST", testHTTPBase+"/v1/responses", bytes.NewBuffer(payload))
 				Expect(err).ToNot(HaveOccurred())
 				req.Header.Set("Content-Type", "application/json")
 				req.Header.Set("Authorization", bearerKey)
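Reviewer note: the definitions of `testHTTPAddr` and `testHTTPBase` are not visible in this part of the diff. One way such values could be produced, sketched here purely as an assumption, is to ask the kernel for an unused loopback port instead of hardcoding `127.0.0.1:9090`. `freeLoopbackAddr` is a hypothetical helper, not something the patch is known to contain.

```go
package main

import (
	"fmt"
	"net"
)

// freeLoopbackAddr is a hypothetical helper: it binds to port 0 so the OS
// picks an unused TCP port on 127.0.0.1, then releases the listener and
// returns the "host:port" string it was given.
func freeLoopbackAddr() (string, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer l.Close()
	return l.Addr().String(), nil
}

func main() {
	addr, err := freeLoopbackAddr()
	if err != nil {
		fmt.Println("could not reserve a port:", err)
		return
	}
	base := "http://" + addr // plays the role of testHTTPBase in this sketch
	fmt.Println(addr, base+"/healthz")
}
```

A caveat with this approach: the port is released before the server binds it, so another process could in principle claim it in between. For a test suite this is usually an acceptable trade-off against fixed-port collisions between parallel runs.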
--- a/core/http/react-ui/e2e/model-config.spec.js
+++ b/core/http/react-ui/e2e/model-config.spec.js
@@ -224,38 +224,4 @@ test.describe('Model Editor - Interactive Tab', () => {
    expect(estimateCalled).toBe(true)
  })

-  test('interactive tab scrolls at body height (no inner overflow pane) and tracks the active section', async ({ page }) => {
-    // Regression: the form sections used to live inside an overflow:auto pane
-    // with maxHeight: calc(100vh - 340px), which kept the global footer in
-    // view on every screen and ate ~50px of editing room on short windows.
-    // Pin two pieces of the fix:
-    //  1. The two-column container (sticky nav + content) has no scrollable
-    //     inner element on its content side — body-scroll handles overflow.
-    //  2. The active-section tracker now listens to window scroll. Scrolling
-    //     the window should run the tracker without throwing, and the
-    //     `<nav>` sidebar must still render.
-    const contentOverflowY = await page.evaluate(() => {
-      const sidebar = document.querySelector('nav')
-      // The content column is the next sibling of the sticky sidebar.
-      const content = sidebar?.nextElementSibling
-      return content ? getComputedStyle(content).overflowY : 'no-content'
-    })
-    expect(['visible', 'normal', 'auto', 'scroll', 'no-content']).toContain(contentOverflowY)
-    expect(contentOverflowY).not.toBe('scroll')
-    // 'auto' could exist on some browsers but should NOT — the fix removes it.
-    // We assert the strong invariant separately.
-    expect(['auto']).not.toContain(contentOverflowY)
-
-    // Add a couple of fields to give the page a touch more height, then
-    // force a window scroll. The tracker should run; the sidebar should
-    // remain visible.
-    const searchInput = page.locator('input[placeholder="Search fields to add..."]')
-    await searchInput.fill('Temperature')
-    const dropdown = searchInput.locator('..').locator('..')
-    await dropdown.locator('div', { hasText: 'Temperature' }).first().click()
-    await page.evaluate(() => window.scrollTo(0, 200))
-    await page.waitForTimeout(50)
-    await expect(page.locator('nav').first()).toBeVisible()
-  })
-
 })
--- a/core/http/react-ui/e2e/model-editor-back-nav.spec.js
+++ b/core/http/react-ui/e2e/model-editor-back-nav.spec.js
@@ -1,94 +0,0 @@
-import { test, expect } from './coverage-fixtures.js'
-
-// Exercises the "Back to <page>" navigation convention: whichever page links
-// into the Model Editor stamps its origin as react-router location state, and
-// the editor's Back button returns there (captioned with the origin) instead
-// of a hardcoded route. Also covers the Middleware page's ?tab= persistence,
-// which is what lets the editor return you to the exact tab you came from.
-
-const MOCK_METADATA = {
-  sections: [{ id: 'general', label: 'General', icon: 'settings', order: 0 }],
-  fields: [
-    { path: 'name', yaml_key: 'name', go_type: 'string', ui_type: 'string', section: 'general', label: 'Model Name', description: 'id', component: 'input', order: 0 },
-  ],
-}
-const MOCK_YAML = 'name: mock-model\nbackend: mock-backend\n'
-
-// Router config with one model, so the Routing tab renders an editable model
-// link we can click through to the editor.
-const MOCK_MIDDLEWARE_STATUS = {
-  pii: { enabled_globally: false, default_enabled_for_backends: [], patterns: [], models: [], recent_event_count: 0 },
-  router: {
-    configured: true,
-    models: [{ name: 'smart-router', classifier: 'score', fallback: 'qwen-7b', policies: [], candidates: [] }],
-    recent_decision_count: 0,
-    available_classifiers: ['score'],
-  },
-}
-
-// Make the editor render for any model name (the header — and thus the Back
-// button — only appears once metadata + config have loaded).
-async function mockEditorEndpoints(page) {
-  await page.route('**/api/models/config-metadata*', (route) =>
-    route.fulfill({ contentType: 'application/json', body: JSON.stringify(MOCK_METADATA) }))
-  await page.route('**/api/models/edit/**', (route) =>
-    route.fulfill({ contentType: 'application/json', body: JSON.stringify({ config: MOCK_YAML, name: 'mock-model' }) }))
-  await page.route('**/api/models/config-json/**', (route) =>
-    route.fulfill({ contentType: 'application/json', body: '{}' }))
-}
-
-test.describe('Model Editor — Back navigation', () => {
-  test.beforeEach(async ({ page }) => {
-    await page.route('**/api/auth/status', (route) =>
-      route.fulfill({ contentType: 'application/json', body: JSON.stringify({ authEnabled: false, staticApiKeyRequired: false, providers: [] }) }))
-    await mockEditorEndpoints(page)
-  })
-
-  test('Back returns to Manage with a "Back to Manage" caption', async ({ page }) => {
-    await page.goto('/app/manage')
-    await expect(page.locator('.table')).toBeVisible({ timeout: 10_000 })
-
-    // Open the first row's action menu and pick "Edit configuration".
-    const trigger = page.locator('button.action-menu__trigger').first()
-    await expect(trigger).toBeVisible()
-    await trigger.click()
-    await page.getByRole('menuitem', { name: 'Edit configuration' }).click()
-
-    await expect(page).toHaveURL(/\/app\/model-editor\//)
-    const back = page.getByRole('button', { name: /Back to Manage/ })
-    await expect(back).toBeVisible({ timeout: 10_000 })
-
-    await back.click()
-    await expect(page).toHaveURL(/\/app\/manage/)
-  })
-
-  test('returns to the originating Middleware tab (?tab=routing) it was opened from', async ({ page }) => {
-    await page.route('**/api/middleware/status', (route) =>
-      route.fulfill({ contentType: 'application/json', body: JSON.stringify(MOCK_MIDDLEWARE_STATUS) }))
-    await page.route('**/api/pii/events?**', (route) =>
-      route.fulfill({ contentType: 'application/json', body: JSON.stringify({ events: [] }) }))
-    await page.route('**/api/router/decisions?**', (route) =>
-      route.fulfill({ contentType: 'application/json', body: JSON.stringify({ decisions: [] }) }))
-
-    await page.goto('/app/middleware')
-    // Switching to Routing must push the tab into the URL.
-    await page.getByRole('button', { name: /Routing/i }).click()
-    await expect(page).toHaveURL(/[?&]tab=routing/)
-
-    // Click through to the router model's config, then back.
-    await page.getByRole('link', { name: 'smart-router' }).click()
-    await expect(page).toHaveURL(/\/app\/model-editor\/smart-router/)
-    const back = page.getByRole('button', { name: /Back to Middleware/ })
-    await expect(back).toBeVisible({ timeout: 10_000 })
-
-    await back.click()
-    // Returns to the exact tab, not the default Filtering tab.
-    await expect(page).toHaveURL(/\/app\/middleware\?tab=routing/)
-    await expect(page.getByText('smart-router').first()).toBeVisible()
-  })
-
-  test('falls back to "Back to Manage" on a direct visit with no origin state', async ({ page }) => {
-    await page.goto('/app/model-editor/mock-model')
-    await expect(page.getByRole('button', { name: /Back to Manage/ })).toBeVisible({ timeout: 10_000 })
-  })
-})
--- a/core/http/react-ui/e2e/traces-errors.spec.js
+++ b/core/http/react-ui/e2e/traces-errors.spec.js
@@ -48,77 +48,3 @@ test.describe('Traces - Error Display', () => {
    await expect(page.locator('th', { hasText: 'Type' })).toBeVisible()
  })
 })
-
-// Pin the BackendTraceDetail expansion path for a vector_store trace —
-// the type that surfaces the router's embedding-cache plumbing. The
-// row click triggers the detail render, which exercises typeBadgeStyle
-// (with the new vector_store badge color), the DataFields component
-// (op / outcome / vector_dim / similarity), and the "View backend
-// logs" link that resolves to the store namespace. Without this spec
-// the new color entry plus the data-field render branches stay
-// uncovered, dragging UI line coverage below the regression gate.
-test.describe('Traces - vector_store backend trace detail', () => {
-  test.beforeEach(async ({ page }) => {
-    await page.route('**/api/traces', (route) => {
-      route.fulfill({ contentType: 'application/json', body: '[]' })
-    })
-    await page.route('**/api/backend-traces', (route) => {
-      route.fulfill({
-        contentType: 'application/json',
-        body: JSON.stringify([
-          {
-            type: 'vector_store',
-            timestamp: '2026-05-28T13:56:25.558Z',
-            model_name: 'router-cache-smart-router',
-            backend: 'local-store',
-            summary: 'search hit (sim=0.989)',
-            duration: 160_000_000,
-            error: '',
-            data: {
-              op: 'search',
-              outcome: 'hit',
-              vector_dim: 768,
-              similarity: 0.9899752140045166,
-            },
-          },
-          {
-            type: 'vector_store',
-            timestamp: '2026-05-28T13:49:07.545Z',
-            model_name: 'router-cache-smart-router',
-            backend: 'local-store',
-            summary: 'search miss',
-            duration: 100_000_000,
-            error: '',
-            data: {
-              op: 'search',
-              outcome: 'miss',
-              vector_dim: 768,
-            },
-          },
-        ]),
-      })
-    })
-    await page.goto('/app/traces')
-    await expect(page.locator('text=Tracing is')).toBeVisible({ timeout: 10_000 })
-    await page.locator('button', { hasText: 'Backend Traces' }).click()
-  })
-
-  test('renders type badge and expands data fields on row click', async ({ page }) => {
-    // The vector_store badge appears in the type column.
-    await expect(page.locator('td span', { hasText: 'vector_store' }).first()).toBeVisible()
-
-    // Clicking the first row expands BackendTraceDetail, which renders
-    // the four data fields. Use the first row's "search hit" summary
-    // as the anchor to disambiguate from the miss row below.
-    await page.locator('tr', { hasText: 'search hit' }).first().click()
-
-    // DataFields renders op/outcome/vector_dim/similarity as label/value pairs.
-    // 'hit' appears as the rendered outcome value.
-    await expect(page.locator('text=outcome').first()).toBeVisible()
-    await expect(page.locator('text=hit').first()).toBeVisible()
-
-    // The model_name → /app/backend-logs link is the BackendTraceDetail
-    // affordance for jumping to logs for the store namespace.
-    await expect(page.locator('a', { hasText: 'View backend logs' })).toBeVisible()
-  })
-})
--- a/core/http/react-ui/i18next-parser.config.js
+++ b/core/http/react-ui/i18next-parser.config.js
@@ -1,5 +1,5 @@
 export default {
-  locales: ['en', 'it', 'es', 'de', 'zh-CN', 'id'],
+  locales: ['en', 'it', 'es', 'de', 'zh-CN'],
  defaultNamespace: 'common',
  output: 'public/locales/$LOCALE/$NAMESPACE.json',
  input: ['src/**/*.{js,jsx}'],
--- a/core/http/react-ui/playwright.config.js
+++ b/core/http/react-ui/playwright.config.js
@@ -4,12 +4,6 @@ export default defineConfig({
  testDir: './e2e',
  timeout: 30_000,
  retries: process.env.CI ? 2 : 0,
-  // TEMPORARY: cap parallelism. Playwright's default (cores/2) oversubscribes
-  // high-core dev machines and intermittently starves the page-teardown
-  // coverage harvest past the 30s test timeout (flaky "Tearing down page"
-  // failures, different specs each run). Capped at 8 pending a proper
-  // root-cause fix; override with PW_WORKERS.
-  workers: process.env.PW_WORKERS ? Number(process.env.PW_WORKERS) : 8,
  reporter: process.env.CI ? 'html' : 'list',
  use: {
    baseURL: 'http://127.0.0.1:8089',
--- a/core/http/react-ui/public/locales/id/admin.json
+++ b/core/http/react-ui/public/locales/id/admin.json
@@ -1,85 +0,0 @@
-{
-  "manage": {
-    "title": "Sistem",
-    "subtitle": "Kelola model dan backend yang terinstal"
-  },
-  "settings": {
-    "title": "Pengaturan",
-    "subtitle": "Konfigurasi pengaturan runtime LocalAI",
-    "saved": "Pengaturan berhasil disimpan",
-    "saveFailed": "Gagal menyimpan: {{message}}",
-    "loadFailed": "Gagal memuat pengaturan: {{message}}",
-    "sections": {
-      "branding": "Branding",
-      "watchdog": "Watchdog",
-      "memory": "Memori",
-      "backends": "Backend",
-      "performance": "Performa",
-      "tracing": "Tracing",
-      "api": "API & CORS",
-      "p2p": "P2P",
-      "galleries": "Galeri",
-      "apikeys": "API Key",
-      "agents": "Agent Job",
-      "agentpool": "Agent Pool",
-      "assistant": "Asisten LocalAI",
-      "responses": "Respons"
-    }
-  },
-  "backends": {
-    "title": "Manajemen Backend",
-    "subtitle": "Temukan dan instal backend AI untuk mendukung model Anda"
-  },
-  "backendLogs": {
-    "title": "Log Backend",
-    "subtitle": "Lihat log dari backend yang sedang berjalan",
-    "empty": "Tidak ada log yang tersedia"
-  },
-  "traces": {
-    "title": "Trace",
-    "subtitle": "Lihat log permintaan API, respons, dan operasi backend"
-  },
-  "nodes": {
-    "title": "Node Terdistribusi",
-    "subtitle": "Kelola node backend dan node worker"
-  },
-  "p2p": {
-    "title": "Komputasi AI Terdistribusi",
-    "subtitle": "Skalakan beban kerja AI Anda ke beberapa perangkat dengan distribusi peer-to-peer"
-  },
-  "users": {
-    "title": "Pengguna",
-    "subtitle": "Kelola pengguna terdaftar, peran, dan undangan"
-  },
-  "usage": {
-    "title": "Penggunaan",
-    "subtitle": "Statistik penggunaan token API",
-    "sources": {
-      "tab": "Sumber",
-      "mixTitle": "Campuran sumber",
-      "ribbonAria": "{{apikey}}% API Key, {{web}}% Web UI, {{legacy}}% Legasi",
-      "topSources": "Sumber teratas dari waktu ke waktu",
-      "searchPlaceholder": "Cari berdasarkan nama atau awalan",
-      "sortBy": "Urutkan",
-      "sortTokens": "Token",
-      "sortRequests": "Permintaan",
-      "sortLastUsed": "Terakhir digunakan",
-      "sortName": "Nama",
-      "sortUser": "Pengguna",
-      "webUI": "Web UI",
-      "legacy": "Legasi",
-      "revoked": " dicabut",
-      "filteredTo": "Difilter ke: {{name}}",
-      "clearFilter": "Hapus filter",
-      "other": "Lainnya ({{count}})",
-      "noTrafficShort": "Tidak ada permintaan dalam periode ini.",
-      "noKeysYet": "Setelah permintaan masuk, Anda akan melihat rinciannya di sini.",
-      "createKey": "Buat API Key pertama Anda",
-      "truncatedWarning": "Menampilkan 200 key teratas. Terapkan filter untuk mempersempit pencarian."
-    }
-  },
-  "explorer": {
-    "title": "Penjelajah",
-    "subtitle": "Jelajahi file dan konfigurasi"
-  }
-}
--- a/core/http/react-ui/public/locales/id/agents.json
+++ b/core/http/react-ui/public/locales/id/agents.json
@@ -1,55 +0,0 @@
-{
-  "title": "Agen",
-  "subtitle": "Kelola agen AI otonom",
-  "actions": {
-    "agentHub": "Pusat Agen",
-    "import": "Impor",
-    "createAgent": "Buat Agen",
-    "edit": "Edit",
-    "chat": "Obrolan",
-    "export": "Ekspor",
-    "delete": "Hapus",
-    "pause": "Jeda",
-    "resume": "Lanjutkan"
-  },
-  "table": {
-    "name": "Nama",
-    "status": "Status",
-    "events": "Event",
-    "actions": "Aksi",
-    "eventsTooltip": "{{count}} event - Klik untuk melihat"
-  },
-  "search": {
-    "placeholder": "Cari agen...",
-    "summary_one": "{{shown}} dari {{total}} agen",
-    "summary_other": "{{shown}} dari {{total}} agen"
-  },
-  "empty": {
-    "noConfigured": "Belum ada agen yang dikonfigurasi",
-    "noConfiguredText": "Buat agen untuk memulai alur kerja AI otonom.",
-    "browseHub": "Tidak tahu harus mulai dari mana? Jelajahi <1>Pusat Agen</1> untuk menemukan konfigurasi agen siap pakai yang bisa Anda impor.",
-    "noMatching": "Tidak ada agen yang cocok",
-    "noMatchingText": "Tidak ada agen yang cocok dengan \"{{query}}\""
-  },
-  "sections": {
-    "yourAgents": "Agent Anda",
-    "otherUsersAgents": "Agent Pengguna Lain"
-  },
-  "deleteDialog": {
-    "title": "Hapus Agen",
-    "message": "Hapus agen \"{{name}}\"? Tindakan ini tidak dapat dibatalkan.",
-    "confirm": "Hapus"
-  },
-  "toasts": {
-    "loadFailed": "Gagal memuat agen: {{message}}",
-    "deleted": "Agen \"{{name}}\" berhasil dihapus",
-    "deleteFailed": "Gagal menghapus agen: {{message}}",
-    "paused": "Agen \"{{name}}\" dijeda",
-    "resumed": "Agen \"{{name}}\" dilanjutkan",
-    "pauseFailed": "Gagal menjeda agen: {{message}}",
-    "resumeFailed": "Gagal melanjutkan agen: {{message}}",
-    "exported": "Agen \"{{name}}\" berhasil diekspor",
-    "exportFailed": "Gagal mengekspor agen: {{message}}",
-    "parseFailed": "Gagal melakukan parse file agen: {{message}}"
-  }
-}
--- a/core/http/react-ui/public/locales/id/auth.json
+++ b/core/http/react-ui/public/locales/id/auth.json
@@ -1,112 +0,0 @@
-{
-  "login": {
-    "subtitle": "Masuk untuk melanjutkan",
-    "registerSubtitle": "Buat akun",
-    "createAdminSubtitle": "Buat akun admin Anda",
-    "tokenSubtitle": "Masukkan API key Anda untuk melanjutkan",
-    "email": "Email",
-    "emailPlaceholder": "anda@example.com",
-    "name": "Nama",
-    "namePlaceholder": "Nama Anda (opsional)",
-    "password": "Kata Sandi",
-    "passwordPlaceholder": "Masukkan kata sandi...",
-    "newPasswordPlaceholder": "Minimal 12 karakter",
-    "confirmPassword": "Konfirmasi Kata Sandi",
-    "confirmPasswordPlaceholder": "Ulangi kata sandi",
-    "inviteCodeLabel": "Kode Undangan",
-    "inviteCodeOptional": " (opsional — lewati waktu tunggu persetujuan)",
-    "inviteCodePlaceholder": "Tempel kode undangan Anda...",
-    "tokenPlaceholder": "Masukkan API key...",
-    "tokenAltPlaceholder": "Masukkan token API...",
-    "signIn": "Masuk",
-    "signingIn": "Sedang masuk...",
-    "register": "Daftar",
-    "creatingAccount": "Membuat akun...",
-    "createAdminAccount": "Buat Akun Admin",
-    "signInWithGitHub": "Masuk dengan GitHub",
-    "signInWithSSO": "Masuk dengan SSO",
-    "loginWithToken": "Masuk dengan Token",
-    "showTokenLogin": "Masuk dengan Token API",
-    "hideTokenLogin": "Sembunyikan Token API",
-    "noAccount": "Belum punya akun?",
-    "hasAccount": "Sudah punya akun?",
-    "or": "atau",
-    "errors": {
-      "loginFailed": "Gagal masuk",
-      "registrationFailed": "Gagal mendaftar",
-      "invalidToken": "Token tidak valid",
-      "passwordsDoNotMatch": "Kata sandi tidak cocok",
-      "enterToken": "Silahkan masukkan token",
-      "networkError": "Eror jaringan",
-      "inviteRequired": "Kode undangan yang valid diperlukan untuk mendaftar"
-    },
-    "messages": {
-      "registrationPending": "Pendaftaran berhasil, menunggu persetujuan."
-    }
-  },
-  "account": {
-    "title": "Akun",
-    "subtitle": "Profil, kredensial, dan API key",
-    "unavailable": "Akun tidak tersedia",
-    "unavailableText": "Autentikasi harus diaktifkan untuk mengelola akun Anda.",
-    "tabs": {
-      "profile": "Profil",
-      "security": "Keamanan",
-      "apiKeys": "API Key"
-    },
-    "profile": {
-      "displayName": "Nama tampilan",
-      "displayNameDescription": "Nama tampilan publik Anda",
-      "avatarUrl": "URL Avatar",
-      "avatarUrlDescription": "URL ke gambar profil Anda",
-      "avatarUrlPlaceholder": "https://example.com/avatar.png",
-      "save": "Simpan",
-      "saving": "Menyimpan...",
-      "updated": "Profil berhasil diperbarui",
-      "updateFailed": "Gagal memperbarui profil: {{message}}"
-    },
-    "security": {
-      "currentPassword": "Kata sandi saat ini",
-      "currentPasswordDescription": "Masukkan kata sandi Anda saat ini untuk memverifikasi identitas Anda",
-      "currentPasswordPlaceholder": "Kata sandi saat ini",
-      "newPassword": "Kata sandi baru",
-      "newPasswordDescription": "Minimal harus 12 karakter",
-      "newPasswordPlaceholder": "Kata sandi baru",
-      "confirmPassword": "Konfirmasi kata sandi",
-      "confirmPasswordDescription": "Masukkan kembali kata sandi baru Anda",
-      "confirmPasswordPlaceholder": "Konfirmasi kata sandi baru",
-      "changePassword": "Ubah kata sandi",
-      "changing": "Mengubah...",
-      "changed": "Kata sandi berhasil diubah",
-      "passwordsDoNotMatch": "Kata sandi tidak cocok",
-      "tooShort": "Kata sandi baru minimal harus 12 karakter",
-      "oauthOnly": "Manajemen kata sandi tidak tersedia untuk akun {{provider}}."
-    },
-    "apiKeys": {
-      "create": "Buat API key",
-      "createDescription": "Buat key untuk akses terprogram",
-      "namePlaceholder": "Nama key (misal: my-app)",
-      "createButton": "Buat",
-      "creating": "Membuat...",
-      "createdToast": "API key berhasil dibuat",
-      "createFailed": "Gagal membuat API key: {{message}}",
-      "loadFailed": "Failed to load API keys: {{message}}",
-      "revoke": "Cabut",
-      "revokeKey": "Cabut key",
-      "revokeTitle": "Cabut API Key",
-      "revokeMessage": "Cabut API key \"{{name}}\"? Tindakan ini tidak dapat dibatalkan.",
-      "revoked": "API key dicabut",
-      "revokeFailed": "Gagal mencabut API key: {{message}}",
-      "copyNow": "Salin sekarang — key ini tidak akan ditampilkan lagi",
-      "copiedToast": "Berhasil disalin ke papan klip",
-      "copyFailed": "Gagal menyalin",
-      "empty": "Belum ada API key. Buat satu di atas untuk akses terprogram.",
-      "lastUsed": "terakhir digunakan {{date}}"
-    }
-  },
-  "notFound": {
-    "title": "Halaman Tidak Ditemukan",
-    "text": "Sepertinya halaman yang Anda cari tidak ditemukan. Mari kembalikan ke halaman sebelumnya.",
-    "goHome": "Kembali ke Beranda"
-  }
-}
--- a/core/http/react-ui/public/locales/id/chat.json
+++ b/core/http/react-ui/public/locales/id/chat.json
@@ -1,117 +0,0 @@
-{
-  "activity": {
-    "thought": "Penalaran",
-    "tool": "Alat",
-    "result": "Hasil",
-    "toolResult": "Hasil {{name}}",
-    "thinking": "Berpikir..."
-  },
-  "header": {
-    "manageModeTooltip": "Obrolan ini dapat menginstal model, mengedit konfigurasi, dan mengelola backend dengan berbicara melalui LocalAI.",
-    "modelInfo": "Info model",
-    "chatSettings": "Pengaturan Obrolan",
-    "modelInfoTitle": "Info model: {{model}}",
-    "editConfig": "Edit konfigurasi",
-    "close": "Tutup"
-  },
-  "modelInfo": {
-    "backend": "Backend",
-    "modelFile": "File model",
-    "contextSize": "Ukuran konteks",
-    "threads": "Thread",
-    "mcp": "MCP",
-    "configured": "Dikonfigurasi",
-    "chatTemplate": "Templat Obrolan",
-    "yes": "Ya",
-    "gpuLayers": "Layer GPU"
-  },
-  "context": {
-    "label": "Konteks: {{percent}}%",
-    "labelWithTokens": "Konteks: {{percent}}% ({{tokens}} tokens)"
-  },
-  "settings": {
-    "title": "Pengaturan Obrolan",
-    "manageMode": "Mode Manajemen",
-    "manageModeDesc": "Izinkan obrolan ini menginstal model, mengganti backend, dan mengedit konfigurasi dengan berbicara melalui LocalAI.",
-    "systemPrompt": "System Prompt",
-    "systemPromptPlaceholder": "Anda adalah asisten yang membantu...",
-    "temperature": "Temperatur",
-    "topP": "Top P",
-    "topK": "Top K",
-    "contextSize": "Ukuran Konteks",
-    "contextSizePlaceholder": "2048",
-    "clearHistory": "Hapus riwayat obrolan"
-  },
-  "empty": {
-    "manageTitle": "Kelola LocalAI dengan obrolan",
-    "manageText": "Izinkan untuk menginstal model, mengganti backend, mengedit konfigurasi, atau memeriksa status. Asisten akan merangkum tindakan dan menunggu konfirmasi Anda sebelum mengubah apa pun.",
-    "startTitle": "Mulai percakapan",
-    "readyText": "Siap untuk mengobrol dengan {{model}}",
-    "selectModelText": "Pilih model di atas untuk memulai",
-    "suggestionsManage": [
-      "Apa saja yang terinstal?",
-      "Instal model obrolan",
-      "Tampilkan status sistem",
-      "Perbarui backend"
-    ],
-    "suggestionsChat": [
-      "Jelaskan cara kerjanya",
-      "Bantu saya menulis kode",
-      "Rangkum dokumen",
-      "Gali ide"
-    ],
-    "recent": "Terbaru",
-    "noMessages": "Belum ada pesan",
-    "hintEnter": "Enter untuk mengirim",
-    "hintShiftEnter": "Shift+Enter untuk baris baru",
-    "hintAttach": "Lampirkan file"
-  },
-  "errors": {
-    "viewTraces": "Lihat trace untuk detailnya"
-  },
-  "actions": {
-    "copy": "Salin",
-    "regenerate": "Hasilkan ulang"
-  },
-  "streaming": {
-    "transferring": "Mentransfer model...",
-    "transferringTo": "Mentransfer model ke {{node}}..."
-  },
-  "tokens": {
-    "perSec": "{{count}} tok/s",
-    "peak": "Puncak: {{count}} tok/s",
-    "usage": "{{prompt}}p + {{completion}}c = {{total}}"
-  },
-  "input": {
-    "placeholder": "Pesan...",
-    "attachFile": "Lampirkan file",
-    "stopGenerating": "Hentikan pembuatan",
-    "canvasTitle": "Canvas — ekstrak blok kode dan media ke panel samping untuk pratinjau, salin, dan unduh",
-    "canvasLabel": "Canvas",
-    "openCanvas": "Buka panel canvas"
-  },
-  "deleteAllDialog": {
-    "title": "Hapus Semua Obrolan",
-    "message": "Hapus semua obrolan? Tindakan ini tidak dapat dibatalkan.",
-    "confirm": "Hapus semua"
-  },
-  "toasts": {
-    "selectModel": "Silahkan pilih model",
-    "copied": "Berhasil disalin ke papan klip",
-    "copyFailed": "Gagal menyalin ke papan klip"
-  },
-  "menu": {
-    "trigger": "Obrolan",
-    "triggerTitle": "Percakapan (Ctrl/Cmd+K)",
-    "search": "Cari percakapan...",
-    "clearSearch": "Hapus pencarian",
-    "noMatch": "Tidak ada percakapan yang cocok dengan pencarian Anda",
-    "noConversations": "Belum ada percakapan",
-    "rename": "Ubah nama",
-    "exportMarkdown": "Ekspor sebagai Markdown",
-    "deleteChat": "Hapus obrolan",
-    "newChat": "Obrolan baru",
-    "clearAll": "Hapus semua",
-    "deleteAllTitle": "Hapus semua percakapan"
-  }
-}
Author	SHA1	Message	Date
Ettore Di Giacinto	b40843cf62	feat(dllm): image input through the backend (multimodal C-ABI) Routes PredictOptions.Images (raw base64, the core convention) through dllm.cpp's probed multimodal entry points as data: URIs; the gemma4 renderer appends one engine-side <image> marker per image after the last user message (llama.cpp attachment convention; the template's content-parts branch is unreachable through the flattened pb shape). The engine expands markers to boi + soft*n + eoi and splices the vision-tower embeddings. Older libdllm.so without the mm symbols fails with an actionable error (Dlsym probe). DLLM_VERSION pin bumped to the engine's vision-capable commit. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-12 00:41:04 +00:00
Ettore Di Giacinto	c9c6040fe8	feat(dllm): default gallery entry on Q4_K_M; add Q8_0 variant Q4_K_M (~17 GB, GB10-validated: cosine 0.9862, coherent generation) is the friendlier default download than the 50 GB BF16; Q8_0 (~27 GB) is the higher-fidelity middle ground. Both descriptions carry the measured caveat that BF16 is ~5x faster per denoise step on BF16-native hardware, with a pointer to fetch it manually when it fits. sha256 values are the HF LFS oids. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 20:24:26 +00:00
Ettore Di Giacinto	8134d6db37	docs(dllm): record Q4_K_M validation and quantization guidance Q4_K_M validated on GB10: quality holds (cosine 0.9862, coherent generation, 19/48 stopper exit) but a forward step is ~5x slower than BF16 (27.5s vs 5.6s: native BF16 tensor cores vs K-quant MoE dequant). Guidance: prefer BF16 when it fits; Q4_K_M is the memory-bound option. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 19:22:02 +00:00
Ettore Di Giacinto	ad6d1dbc8b	feat(grpc): request cancellation for Go backends via the Cancellable capability The llama.cpp C++ backend aborts generation when its gRPC context is cancelled (grpc-server.cpp polls context->IsCancelled() in the result loops), but Go backends served by pkg/grpc never observed context cancellation: a disconnected client left the generation running to completion. Add an optional Cancellable capability; the server registers context.AfterFunc on the request/stream context (after the Locking block so queued requests cannot abort the current owner) covering both rich and legacy paths. dllm implements it: measured cancel latency ~10ms vs ~10s of orphaned generation, and follow-up requests no longer queue behind cancelled ones (~220ms vs ~9s in the e2e proof). Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 17:50:04 +00:00
Ettore Di Giacinto	eb61e1d770	chore(dllm): review fixes - file modes and build-matrix doc accuracy Drop the stray executable bit from the Go sources and Makefile (the sibling Go backends commit them 644; only run.sh/package.sh are executable), and correct two documentation claims found in the final branch review: cuda13-dllm is built for amd64 only (arm64 CUDA ships as the l4t flavor), and package.sh is the parakeet-cpp-style stub layout with no ldd walk. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 17:17:54 +00:00
Ettore Di Giacinto	aba9c4794a	docs(dllm): backend documentation and agents topic guide User docs: dllm section in text-generation (setup, eb_* options table, n_predict canvas rounding, enable_thinking metadata, honest GB10 throughput numbers). Agents guide: .agents/dllm-backend.md covering the purego C-ABI contract, serialization rules, template provenance, test layers, and known limitations. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 17:05:18 +00:00
Ettore Di Giacinto	04d6f66a9a	feat(dllm): diffusiongemma gallery entry and e2e coverage Gallery model diffusiongemma-26b-a4b-it (unsloth BF16 GGUF, sha256 verified against the HF LFS oid) with use_tokenizer_template and an honest experimental/throughput description. e2e: BACKEND_BINARY-mode specs boot the real gRPC backend with the tiny fixture model (templated chat + streaming); real-26B specs are separately env-gated. Adds an opt-in BACKEND_TEST_SEED knob so random-weight fixture models run the generic specs deterministically. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 17:05:18 +00:00
Ettore Di Giacinto	52b3b68cea	feat(dllm): backend packaging, gallery index, CI matrix Registers the dllm backend across every surface: backend gallery index (cpu amd64+arm64 with manifest merge, cuda13, l4t-cuda13 for GB10-class hardware; no darwin per engine scope), top-level Makefile targets, bump_deps pin tracking for DLLM_VERSION, and the curated known-backends list for /backends/known (pref-only: auto-detecting on .gguf would shadow llama-cpp). Note: image builds and the nightly bump leg stay red until github.com/mudler/dllm.cpp is published (planned at merge time). Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 17:05:18 +00:00
Ettore Di Giacinto	99184809fa	feat(dllm): rich gRPC backend with ChatDelta streaming Implements PredictRich/PredictStreamRich (legacy methods delegate), TokenizeString, and Load over the purego binding. A single worker goroutine serializes all C calls per the dllm.cpp one-generate-per-ctx contract (cancel is the documented exception); an RWMutex guards Free against in-flight requests. Under use_tokenizer_template the gemma4 renderer and streaming parser own templating and ChatDelta extraction; raw-prompt mode passes through verbatim. enable_thinking is opt-in via request metadata (the gemma4 template treats thinking as opt-in). Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 16:14:37 +00:00
Ettore Di Giacinto	294c04ae2f	feat(dllm): gemma4 streaming parser emitting ChatDeltas Fragment-safe state machine (content / channel header / thought / tool-call / done) classifying model output into content, reasoning_content and tool_calls deltas. Tool-call payload decoder is a non-partial port of vLLM's gemma4 parser grammar; ~25 of its test cases are ported with citations, plus a 2-split invariance property over every byte position. Recursion depth-capped against model-generated deep nesting; marker constants shared with the renderer. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 15:55:27 +00:00
Ettore Di Giacinto	778f85c2a0	feat(dllm): purego backend scaffold over the dllm.cpp C-ABI Binds the 9-symbol flat C-ABI of dllm.cpp (DiffusionGemma engine) via purego: typed wrappers with correct string ownership (malloc'd returns freed via dllm_capi_free_string, borrowed last_error never freed), once-allocated stream-callback trampolines, and a gated Ginkgo binding smoke against the tiny fixture model. Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 14:50:39 +00:00
Ettore Di Giacinto	af0db1419c	test(http): make the suite listen port configurable The core/http specs hardcoded 127.0.0.1:9090 in ~70 call sites, so the pre-commit coverage gate fails on any machine where an unrelated service holds 9090. Centralize the address in the suite file behind LOCALAI_TEST_HTTP_PORT (default unchanged: 9090). Assisted-by: Claude Code (Fable 5) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-11 14:28:39 +00:00