feat(llama-cpp): bump to 1ec7ba0c, adapt grpc-server, expose new spec-decoding options (#9765)

* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994

Picks up the upstream `spec : parallel drafting support` change (ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API and `server_context_impl`. Adapt the grpc-server wrapper accordingly:

* `common_params_speculative::type` (single enum) became `types` (`std::vector<common_speculative_type>`). Update both the "default to draft when a draft model is set" branch and the `spec_type`/`speculative_type` option parser. The parser now also tolerates comma-separated lists, mirroring the upstream `common_speculative_types_from_names` semantics.
* `common_params_speculative_draft::n_ctx` is gone (draft now shares the target context size). Keep the `draft_ctx_size` option name for backward compatibility and ignore the value rather than failing.
* `server_context_impl::model` was renamed to `model_tgt`; update the two reranker / model-metadata call sites.

Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp target locally.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): expose new speculative-decoding option keys

Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838) adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative families and beefs up the draft-model knobs. The previous bump only adapted the API; this exposes the new fields through the grpc-server options dictionary so model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

  ngram_mod (`ngram_mod` type):
    spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

  ngram_map_k (`ngram_map_k` type):
    spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

  ngram_map_k4v (`ngram_map_k4v` type):
    spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m / spec_ngram_map_k4v_min_hits

  ngram lookup caches (`ngram_cache` type):
    spec_lookup_cache_static / lookup_cache_static
    spec_lookup_cache_dynamic / lookup_cache_dynamic

  Draft-model tuning (active when `spec_type` is `draft`):
    draft_cache_type_k / spec_draft_cache_type_k
    draft_cache_type_v / spec_draft_cache_type_v
    draft_threads / spec_draft_threads
    draft_threads_batch / spec_draft_threads_batch
    draft_cpu_moe / spec_draft_cpu_moe (bool flag)
    draft_n_cpu_moe / spec_draft_n_cpu_moe (first N MoE layers on CPU)
    draft_override_tensor / spec_draft_override_tensor
      (comma-separated <tensor regex>=<buffer type>; re-implements upstream's static parse_tensor_buffer_overrides since it isn't exported)

`spec_type` already accepted comma-separated lists after the previous commit, matching upstream's `common_speculative_types_from_names`.

Docs: refresh `docs/content/advanced/model-configuration.md` with per-family tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64 cpu-llama-cpp AVX variant).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout

The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile, which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh on the copy. The fork branched before the API refactor, so it errors out on:

* `ctx_server.impl->model_tgt` (fork still has `model`)
* `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*` (none of these sub-structs exist in the fork)
* `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads, tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
* `params.speculative.types` vector / `common_speculative_types_from_names` (fork has a scalar `type` and only the singular helper)

Approach:

1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]` discriminations (the "default to draft when a draft model is set" branch and the `spec_type` / `speculative_type` option parser) fall back to the singular scalar form, and the entire new-option block (ngram_mod / map_k / map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*, tensor_buft_overrides}) is preprocessed out. The macro is *not* defined in the source tree, so stock llama-cpp builds get the full new API.

2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
   - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
   - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first `#include`, so the guarded blocks above drop out for the fork build.

Both patches are idempotent and follow the existing sed/awk pattern in this script (KV cache types, `get_media_marker`, flat speculative renames). Stock llama-cpp's `grpc-server.cpp` is never touched.

Drop both legacy patches once the turboquant fork rebases past ggml-org/llama.cpp#22397 / #22838.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): close draft_ctx_size brace inside legacy guard

The previous turboquant fix wrapped the new option-handler blocks in `#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard in the middle of an `else if` chain: the `} else if` openings of the new blocks were responsible for closing the previous block's brace. With the macro defined the new blocks vanish, draft_ctx_size's `{` loses its closer, the for-loop's `}` is consumed instead, and the file ends with a stray opening brace; clang reports it as `function-definition is not allowed here before '{'` on the next top-level `int main(...)` and `expected '}' at end of input`.

Move the chain split inside the draft_ctx_size branch:

    } else if (... "draft_ctx_size") {
        // ...
    #ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    }   // legacy: chain ends here
    #else
    } else if (... "spec_ngram_mod_n_min") {   // modern: chain continues
        ...
    } else if (... "draft_override_tensor") {
        ...
    }   // closes last branch
    #endif
    }   // closes for-loop

Brace count is now balanced under both preprocessor branches (verified with `tr -cd '{' | wc -c` against the patched and unpatched outputs).

Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp `turboquant-avx` variant cleanly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt

Dockerfile.turboquant's `builder-prebuilt` stage was missing the `ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that `builder-fromsource` already has (and that `Dockerfile.llama-cpp` mirrors across both stages). When CI uses the prebuilt base image (quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg passed by the workflow never reaches the env inside the compile stage.

backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on hipblas builds when AMDGPU_TARGETS is empty, and the turboquant Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the same check fires from turboquant-fallback under BUILD_TYPE=hipblas:

    Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated list of gfx targets e.g. gfx1100,gfx1101. Stop.
    make: *** [Makefile:66: turboquant-fallback] Error 2

The bug is latent on master because the docker layer cache stays warm across builds, so the compile step rarely re-runs from scratch. The llama.cpp bump in this PR invalidates the cache, so the missing env var becomes load-bearing and the hipblas turboquant CI job fails.

Mirror the existing pattern from Dockerfile.llama-cpp.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
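
The brace-balance check mentioned in the `close draft_ctx_size brace inside legacy guard` fix can be sketched roughly as below. This is only an illustration under assumptions, not the verification that was actually run: `strip_guard` and `count_char` are hypothetical helpers, `strip_guard` emulates the preprocessor for a single un-nested `#ifdef`/`#else`/`#endif` region only, and `sample` is a reduced stand-in for the option-parser chain rather than the real `grpc-server.cpp`.

```shell
# Hypothetical helper: emulate the preprocessor for ONE un-nested
# "#ifdef NAME / #else / #endif" region read from stdin.
#   $1 = macro name, $2 = 1 if the macro is defined, 0 otherwise
strip_guard() {
  awk -v name="$1" -v defined="$2" '
    $1 == "#ifdef" && $2 == name { in_guard = 1; keep = (defined + 0); next }
    $1 == "#else"  && in_guard   { keep = !keep; next }
    $1 == "#endif" && in_guard   { in_guard = 0; next }
    !in_guard || keep            { print }
  '
}

# Count occurrences of a single character on stdin.
count_char() { tr -cd "$1" | wc -c | tr -d ' '; }

# Reduced stand-in for the fixed else-if chain (NOT the real source).
sample='for (const auto & opt : options) {
    if (name == "a") {
        x = 1;
    } else if (name == "draft_ctx_size") {
        // value ignored
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    }
#else
    } else if (name == "spec_ngram_mod_n_min") {
        y = 2;
    }
#endif
}'

# Check both preprocessor branches: open and close counts should agree.
for defined in 0 1; do
  out=$(printf '%s\n' "$sample" | strip_guard LOCALAI_LEGACY_LLAMA_CPP_SPEC "$defined")
  opens=$(printf '%s' "$out" | count_char '{')
  closes=$(printf '%s' "$out" | count_char '}')
  echo "defined=$defined open=$opens close=$closes"
done
```

Counting braces is a cheap smoke test only; it would not catch a brace that is balanced but closes the wrong block, so a real build of both variants remains the actual check.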
ci: close GC race + cascade-skip + darwin grpc gaps from v4.2.1 (#9781)
2026-05-19 14:17:21 -04:00 · 2026-05-12 17:22:37 +02:00 · 2026-05-12 17:22:09 +02:00 · 2026-05-12 17:21:20 +02:00 · 2026-05-12 09:54:38 +02:00 · 2026-05-12 09:53:48 +02:00
63 changed files with 3812 additions and 284 deletions
--- a/.agents/adding-backends.md
+++ b/.agents/adding-backends.md
@@ -34,7 +34,55 @@ The build matrix is data-only YAML at `.github/backend-matrix.yml` (not inside `

 **Without an entry here no image is ever built or pushed, and the gallery entry in `backend/index.yaml` will point at a tag that does not exist.** The `dockerfile:` field must point at `./backend/Dockerfile.<lang>` matching the language bucket from step 1 (e.g. `Dockerfile.python`, `Dockerfile.golang`, `Dockerfile.rust`). The `tag-suffix` must match the `uri:` in the corresponding `backend/index.yaml` image entry exactly.

-If you add a new language bucket, `scripts/changed-backends.js` also needs a branch in `inferBackendPath` so PR change-detection routes file edits correctly.
+**`scripts/changed-backends.js` registration — REQUIRED for any new dockerfile suffix.** This is the single most common omission, because it has no effect on the PR that adds the backend (when no prior path filter could catch it anyway) — it only breaks the *next* PR that touches your backend's directory, which then gets zero CI jobs and looks broken for unrelated reasons. Edit `scripts/changed-backends.js:inferBackendPath` and add a branch BEFORE the more-generic suffixes:
+
+```js
+if (item.dockerfile.endsWith("<your-dockerfile-suffix>")) {
+    return `backend/cpp/<your-backend>/`;   // or backend/python|go|rust/...
+}
+```
+
+The `endsWith()` test is against the matrix entry's `dockerfile:` value (e.g. `./backend/Dockerfile.ds4` → `endsWith("ds4")`). Specificity order matters here just like it does for importers: more-specific suffixes go BEFORE more-generic ones (e.g. `ds4` before `llama-cpp` even though both end with letters, because some upstream might one day call itself `super-ds4-llama-cpp`). Verify locally before pushing:
+
+```bash
+# Confirm your dockerfile suffix is unique enough
+node -e "
+const yaml = require('js-yaml'); const fs = require('fs');
+const m = yaml.load(fs.readFileSync('.github/backend-matrix.yml','utf8'));
+for (const e of m.include.filter(e => e.backend === '<your-backend>')) {
+  console.log(e.dockerfile, '->', e.dockerfile.endsWith('<suffix>'));
+}"
+```
+
+A quick way to find the right insertion point: `grep -n 'item.dockerfile.endsWith' scripts/changed-backends.js`.
+
+**`bump_deps.yaml` registration — REQUIRED for any backend pinning an upstream commit.** If your backend's Makefile has a `*_VERSION?=<sha>` pin to a third-party repo, the daily auto-bump bot at `.github/workflows/bump_deps.yaml` won't notice it unless you register the backend in its matrix. The bot runs `.github/bump_deps.sh` which `grep`s for `^$VAR?=` in the Makefile you list — so the pin MUST live in the Makefile (not in a separate shell script). The bump for ds4 (#9761) had to walk this back because the original landed the pin in `prepare.sh`, which the bot can't see. Pattern (for `antirez/ds4`):
+
+```yaml
+# .github/workflows/bump_deps.yaml
+matrix:
+  include:
+    - repository: "antirez/ds4"
+      variable: "DS4_VERSION"
+      branch: "main"
+      file: "backend/cpp/ds4/Makefile"
+```
+
+And the corresponding Makefile shape (mirror `backend/cpp/llama-cpp/Makefile`):
+
+```makefile
+DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
+DS4_REPO?=https://github.com/antirez/ds4
+...
+ds4:
+	mkdir -p ds4
+	cd ds4 && git init -q && \
+	git remote add origin $(DS4_REPO) && \
+	git fetch --depth 1 origin $(DS4_VERSION) && \
+	git checkout FETCH_HEAD
+```
+
+If you have a `prepare.sh` doing the clone, delete it — the recipe belongs in the Makefile target so `make purge && make` works as a clean-and-rebuild and so the bump bot finds the pin.

 **Placement in file:**
 - CPU builds: Add after other CPU builds (e.g., after `cpu-chatterbox`)
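
The ordering rule for `inferBackendPath` in the `adding-backends.md` change above comes down to first-match-wins on suffixes. The real function is JavaScript in `scripts/changed-backends.js`; what follows is only a shell analogue using `case`, and the `ik-llama-cpp` dockerfile name and the returned paths are assumptions chosen for illustration.

```shell
# Shell analogue of a first-match-wins suffix chain. Branch order matters:
# "*llama-cpp" also matches anything ending in "ik-llama-cpp".
infer_backend_path() {
  case "$1" in
    *ik-llama-cpp) echo "backend/cpp/ik-llama-cpp/" ;;  # specific suffix first
    *llama-cpp)    echo "backend/cpp/llama-cpp/" ;;     # generic suffix after
    *ds4)          echo "backend/cpp/ds4/" ;;
    *)             echo "" ;;
  esac
}

# Same branches in the wrong order: the generic suffix shadows the specific one.
infer_backend_path_wrong_order() {
  case "$1" in
    *llama-cpp)    echo "backend/cpp/llama-cpp/" ;;
    *ik-llama-cpp) echo "backend/cpp/ik-llama-cpp/" ;;  # never reached
    *)             echo "" ;;
  esac
}

infer_backend_path ./backend/Dockerfile.ik-llama-cpp
infer_backend_path_wrong_order ./backend/Dockerfile.ik-llama-cpp
```

With the wrong order, edits under the more specific backend's directory would be attributed to the generic one, which is the silent misrouting the note warns about.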
--- a/.agents/ds4-backend.md
+++ b/.agents/ds4-backend.md
@@ -0,0 +1,84 @@
+# Working on the ds4 Backend
+
+`antirez/ds4` is a single-model inference engine for DeepSeek V4 Flash.
+LocalAI wraps the engine's C API (`ds4/ds4.h`) with a fresh C++ gRPC server at
+`backend/cpp/ds4/` - NOT a fork of llama-cpp's grpc-server.cpp.
+
+## Pin
+
+`backend/cpp/ds4/Makefile` pins `DS4_VERSION?=<sha>` at the top. The `ds4`
+target in the Makefile clones `antirez/ds4` at that commit (mirroring the
+llama-cpp / ik-llama-cpp / turboquant pattern). The bump-deps bot
+(`.github/workflows/bump_deps.yaml`) finds this pin via grep and opens a
+daily PR to update it. To bump manually: edit the `DS4_VERSION?=` line,
+then `make purge && make` (or rely on CI's clean build).
+
+## Wire shape
+
+| RPC | Implementation |
+|---|---|
+| Health, Free, Status | Trivial; no engine dependency for Health |
+| LoadModel | `ds4_engine_open` + `ds4_session_create`; backend is compile-time (DS4_NO_GPU → CPU, __APPLE__ → Metal, otherwise CUDA) |
+| TokenizeString | `ds4_tokenize_text` |
+| Predict | `ds4_engine_generate_argmax` + `DsmlParser` → one ChatDelta with content / reasoning_content / tool_calls[] |
+| PredictStream | Same, per-token ChatDelta writes |
+
+## DSML
+
+ds4 emits tool calls as literal text markers (`<｜DSML｜tool_calls>` etc.) -
+NOT special tokens. `dsml_parser.{h,cpp}` is our streaming state machine that
+classifies token bytes into CONTENT / REASONING / TOOL_START / TOOL_ARGS / TOOL_END
+events. `dsml_renderer.{h,cpp}` does the prompt direction: turns
+OpenAI tool_calls + role=tool messages back into DSML for the next turn.
+
+## Thinking modes
+
+`PredictOptions.Metadata["enable_thinking"]` gates thinking on/off (default ON).
+`["reasoning_effort"] == "max" | "xhigh"` selects `DS4_THINK_MAX`; anything else
+maps to `DS4_THINK_HIGH`. We pass the chosen mode to `ds4_chat_append_assistant_prefix`.
+
+## Disk KV cache
+
+`kv_cache.{h,cpp}` implements an SHA1-keyed file cache using ds4's public
+`ds4_session_save_payload` / `ds4_session_load_payload` API. Enable per request
+via `ModelOptions.Options[] = "kv_cache_dir:/some/path"`. Format is **our own** -
+NOT bit-compatible with ds4-server's KVC files (interop is a follow-up plan).
+
+## Build matrix
+
+| Build | Where | Notes |
+|---|---|---|
+| `cpu-ds4` (amd64 + arm64) | Linux GHA | ds4 considers CPU debug-only; useful only for wiring tests |
+| `cuda13-ds4` (amd64 + arm64) | Linux GHA + DGX Spark validation | Primary production path on Linux |
+| `ds4-darwin` (arm64) | macOS GHA runners | Metal; uses `scripts/build/ds4-darwin.sh` like llama-cpp-darwin |
+
+cuda12 is intentionally omitted. ROCm / Vulkan / SYCL are not applicable.
+
+## Hardware-gated validation
+
+`tests/e2e-backends/backend_test.go` in `BACKEND_BINARY` mode:
+
+```
+BACKEND_BINARY=$(pwd)/backend/cpp/ds4/package/run.sh \
+BACKEND_TEST_MODEL_FILE=/path/to/ds4flash.gguf \
+BACKEND_TEST_CAPS=health,load,predict,stream,tools \
+BACKEND_TEST_TOOL_PROMPT="What's the weather in Paris?" \
+go test -count=1 -timeout=30m -v ./tests/e2e-backends/...
+```
+
+CI does not load the model; the suite is opt-in via env vars.
+
+## Importer
+
+`core/gallery/importers/ds4.go` (`DS4Importer`) auto-detects ds4 weights by
+matching the `antirez/deepseek-v4-gguf` repo URI or the
+`DeepSeek-V4-Flash-*.gguf` filename pattern. **Registered BEFORE
+`LlamaCPPImporter`** in `defaultImporters` - both match `.gguf` but ds4 is more
+specific, and first-match-wins. The importer emits `backend: ds4`, uses
+`ds4flash.gguf` as the local filename (matches ds4's own CLI default), and
+disables the Go-side automatic tool-parsing fallback (the C++ backend emits
+ChatDelta.tool_calls natively via `DsmlParser`).
+
+ds4 is also listed in `core/http/endpoints/localai/backend.go`'s pref-only
+slice so the `/import-model` UI surfaces it as a manual choice for users who
+want to force the backend on a non-canonical URI.
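
The Pin section above says the bump bot finds `DS4_VERSION?=` via grep, which is why the pin has to live in the Makefile. A minimal sketch of that constraint, assuming the lookup behaves like a plain `grep "^$VAR?="` (the real logic lives in `.github/bump_deps.sh`; `find_pin` is a hypothetical helper):

```shell
# Hypothetical helper approximating the bot's lookup.
#   $1 = variable name, $2 = file to search
# In grep's default (basic) regex syntax "?" is a literal character, so this
# matches only the Makefile-style "VAR?=" assignment.
find_pin() {
  grep "^$1?=" "$2" | head -n 1 | cut -d= -f2
}

workdir=$(mktemp -d)

# Pin declared Makefile-style: visible to the lookup.
printf 'DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0\nDS4_REPO?=https://github.com/antirez/ds4\n' \
  > "$workdir/Makefile"

# Pin declared as a plain shell assignment: invisible to the lookup.
printf '#!/bin/sh\nDS4_VERSION=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0\n' \
  > "$workdir/prepare.sh"

echo "Makefile:   '$(find_pin DS4_VERSION "$workdir/Makefile")'"
echo "prepare.sh: '$(find_pin DS4_VERSION "$workdir/prepare.sh")'"

rm -rf "$workdir"
```

The second lookup comes back empty, which matches the situation described for the original ds4 pin in `prepare.sh`.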
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -389,7 +389,12 @@ include:
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-12-llama-cpp'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-12-amd64'
-    runs-on: 'ubuntu-latest'
+    # bigger-runner: cold builds for this entry consistently take 5h+ on
+    # ubuntu-latest (observed 5h36m on v4.2.1). Move back to bigger-runner
+    # so the build finishes well within GHA's 6h job timeout. Phase 5.3 of
+    # the free-tier migration (PR #9730) flipped this to ubuntu-latest as
+    # a 'highest-risk batch' with explicit per-entry revert.
+    runs-on: 'bigger-runner'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "llama-cpp"
@@ -403,7 +408,9 @@ include:
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-12-turboquant'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-12-amd64'
-    runs-on: 'ubuntu-latest'
+    # bigger-runner: same rationale as -gpu-nvidia-cuda-12-llama-cpp above
+    # (observed 6h5m wall-clock on v4.2.1, just past the 6h job timeout).
+    runs-on: 'bigger-runner'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "turboquant"
@@ -899,7 +906,9 @@ include:
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-13-llama-cpp'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
-    runs-on: 'ubuntu-latest'
+    # bigger-runner: cold builds for this entry take 5h+ on ubuntu-latest
+    # (observed 5h37m on v4.2.1). Same rationale as the cuda-12 variant.
+    runs-on: 'bigger-runner'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "llama-cpp"
@@ -913,7 +922,8 @@ include:
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-13-turboquant'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
-    runs-on: 'ubuntu-latest'
+    # bigger-runner: observed 6h5m wall-clock on v4.2.1 — at the GHA timeout.
+    runs-on: 'bigger-runner'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "turboquant"
@@ -948,6 +958,32 @@ include:
    backend: "turboquant"
    dockerfile: "./backend/Dockerfile.turboquant"
    context: "./"
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-13-ds4'
+    runs-on: 'ubuntu-latest'
+    base-image: "nvidia/cuda:13.0.0-devel-ubuntu24.04"
+    skip-drivers: 'true'
+    backend: "ds4"
+    dockerfile: "./backend/Dockerfile.ds4"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/arm64'
+    skip-drivers: 'true'
+    tag-latest: 'auto'
+    tag-suffix: '-nvidia-l4t-cuda-13-arm64-ds4'
+    base-image: "nvidia/cuda:13.0.0-devel-ubuntu24.04"
+    runs-on: 'ubuntu-24.04-arm'
+    ubuntu-version: '2404'
+    backend: "ds4"
+    dockerfile: "./backend/Dockerfile.ds4"
+    context: "./"
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -2321,6 +2357,34 @@ include:
    dockerfile: "./backend/Dockerfile.turboquant"
    context: "./"
    ubuntu-version: '2404'
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    platform-tag: 'amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-ds4'
+    runs-on: 'ubuntu-latest'
+    base-image: "nvidia/cuda:13.0.0-devel-ubuntu24.04"
+    skip-drivers: 'true'
+    backend: "ds4"
+    dockerfile: "./backend/Dockerfile.ds4"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/arm64'
+    platform-tag: 'arm64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-ds4'
+    runs-on: 'ubuntu-24.04-arm'
+    base-image: "nvidia/cuda:13.0.0-devel-ubuntu24.04"
+    skip-drivers: 'true'
+    backend: "ds4"
+    dockerfile: "./backend/Dockerfile.ds4"
+    context: "./"
+    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
--- a/.github/scripts/anchor-digest-in-cache.sh
+++ b/.github/scripts/anchor-digest-in-cache.sh
@@ -0,0 +1,46 @@
+#!/usr/bin/env bash
+# Anchor a backend per-arch digest in quay.io/go-skynet/ci-cache so quay's
+# garbage collector won't reap the manifest before backend_merge.yml runs.
+#
+# Context: backend_build.yml pushes by canonical digest only
+# (push-by-digest=true). Unreferenced manifests on quay can be reaped within
+# ~1-2h, but backend-merge-jobs runs only after the *entire* per-arch build
+# matrix drains (max-parallel: 8 × dozens of entries → ~2h+). Without an
+# anchoring tag, the earliest digests are gone by the time `imagetools create`
+# tries to read them, producing "manifest not found" merge failures.
+#
+# We tag the digest under our internal ci-cache image; quay does not GC tagged
+# manifests. The user-facing manifest list still references the original
+# digest in local-ai-backends. backend_merge.yml deletes the anchor tag after
+# the user-facing manifest is published — see cleanup-keepalive-tags.sh.
+#
+# Required env:
+#   GITHUB_RUN_ID  - current workflow run id (set automatically by GHA)
+#   TAG_SUFFIX     - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
+#   PLATFORM_TAG   - amd64 / arm64 / single (single = singleton matrix entry)
+#   DIGEST         - canonical content digest from build step (sha256:...)
+#
+# Optional env:
+#   ANCHOR_IMAGE   - target image (default: quay.io/go-skynet/ci-cache)
+#   SOURCE_IMAGE   - source image (default: quay.io/go-skynet/local-ai-backends)
+#   GITHUB_STEP_SUMMARY - if set, an anchored-by line is appended to it
+set -euo pipefail
+
+: "${GITHUB_RUN_ID:?}"
+: "${TAG_SUFFIX:?}"
+: "${PLATFORM_TAG:?}"
+: "${DIGEST:?}"
+
+anchor_image="${ANCHOR_IMAGE:-quay.io/go-skynet/ci-cache}"
+source_image="${SOURCE_IMAGE:-quay.io/go-skynet/local-ai-backends}"
+
+tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${PLATFORM_TAG}"
+
+docker buildx imagetools create \
+  -t "${anchor_image}:${tag}" \
+  "${source_image}@${DIGEST}"
+
+echo "anchored ${DIGEST} as ${anchor_image}:${tag}"
+if [[ -n "${GITHUB_STEP_SUMMARY:-}" ]]; then
+  echo "anchored \`${DIGEST}\` as \`${anchor_image}:${tag}\`" >> "${GITHUB_STEP_SUMMARY}"
+fi
--- a/.github/scripts/cleanup-keepalive-tags.sh
+++ b/.github/scripts/cleanup-keepalive-tags.sh
@@ -0,0 +1,49 @@
+#!/usr/bin/env bash
+# Best-effort cleanup of the keepalive anchor tags written by
+# anchor-digest-in-cache.sh. Called from backend_merge.yml after the
+# user-facing manifest list has been published.
+#
+# Quay's docker registry v2 doesn't allow tag deletes — only digest deletes.
+# The proper delete is the quay REST API, which requires an OAuth-scoped
+# token. We try QUAY_TOKEN as a bearer token: if the secret is an OAuth app
+# token (typical for service accounts) the delete succeeds; otherwise this
+# is a soft no-op and the tag persists until manually pruned.
+#
+# Cleanup failure MUST NOT fail the merge — the merge has already produced
+# the user-facing manifest list at this point and the keepalive tags are
+# pure overhead. We always exit 0.
+#
+# Required env:
+#   GITHUB_RUN_ID  - current workflow run id (set automatically by GHA)
+#   TAG_SUFFIX     - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
+#   QUAY_TOKEN     - bearer token for quay's REST API
+#
+# Optional env:
+#   QUAY_REPO      - target repo (default: go-skynet/ci-cache)
+#   PLATFORM_TAGS  - space-separated list of platform-tag values to try
+#                    (default: "amd64 arm64 single")
+#                    We don't know which platform-tag(s) exist for this
+#                    tag-suffix without an extra API call, so we just try
+#                    all three and ignore 404s for the ones that don't.
+set -uo pipefail
+
+: "${GITHUB_RUN_ID:?}"
+: "${TAG_SUFFIX:?}"
+: "${QUAY_TOKEN:?}"
+
+quay_repo="${QUAY_REPO:-go-skynet/ci-cache}"
+platform_tags="${PLATFORM_TAGS:-amd64 arm64 single}"
+
+for plat in $platform_tags; do
+  tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${plat}"
+  url="https://quay.io/api/v1/repository/${quay_repo}/tag/${tag}"
+  http=$(curl -sS -o /dev/null -w '%{http_code}' \
+    -X DELETE -H "Authorization: Bearer ${QUAY_TOKEN}" "$url" || echo "000")
+  case "$http" in
+    204|200) echo "deleted $tag" ;;
+    404)     echo "not present: $tag" ;;
+    401|403) echo "auth not OAuth-scoped (http $http) for $tag - skipping; orphan tag will persist" ;;
+    *)       echo "unexpected http $http deleting $tag - skipping" ;;
+  esac
+done
+exit 0
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
@@ -35,11 +35,13 @@ jobs:
      matrix-singlearch: ${{ steps.set-matrix.outputs['matrix-singlearch'] }}
      matrix-multiarch: ${{ steps.set-matrix.outputs['matrix-multiarch'] }}
      matrix-darwin: ${{ steps.set-matrix.outputs['matrix-darwin'] }}
-      merge-matrix: ${{ steps.set-matrix.outputs['merge-matrix'] }}
+      merge-matrix-multiarch: ${{ steps.set-matrix.outputs['merge-matrix-multiarch'] }}
+      merge-matrix-singlearch: ${{ steps.set-matrix.outputs['merge-matrix-singlearch'] }}
      has-backends-singlearch: ${{ steps.set-matrix.outputs['has-backends-singlearch'] }}
      has-backends-multiarch: ${{ steps.set-matrix.outputs['has-backends-multiarch'] }}
      has-backends-darwin: ${{ steps.set-matrix.outputs['has-backends-darwin'] }}
-      has-merges: ${{ steps.set-matrix.outputs['has-merges'] }}
+      has-merges-multiarch: ${{ steps.set-matrix.outputs['has-merges-multiarch'] }}
+      has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6
@@ -138,15 +140,27 @@ jobs:
      max-parallel: 8
      matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch']) }}

-  # Merge per-arch digests into manifest lists. Depends ONLY on
-  # backend-jobs-multiarch — single-arch builds are independent and slow.
-  # Without this split, a 6h CUDA-12 single-arch job would gate the merge,
-  # leaving multi-arch digests untagged on quay long enough for quay's
-  # garbage collector to reap them and the merge step to fail with
-  # "manifest not found".
-  backend-merge-jobs:
+  # Apply tags to per-arch digests via `imagetools create`. Split into two
+  # jobs that mirror the build split so each merge waits ONLY on its
+  # corresponding build matrix:
+  #
+  #   - backend-merge-jobs-multiarch  needs backend-jobs-multiarch  (~2-3h)
+  #   - backend-merge-jobs-singlearch needs backend-jobs-singlearch (up to ~6h)
+  #
+  # If a single shared merge job depended on both, slow CUDA singlearch
+  # builds would block multiarch merges long enough for quay's GC to reap
+  # the multiarch per-arch digests (the bug fixed by PR #9746). Singletons
+  # also need a merge step because backend_build.yml pushes by canonical
+  # digest only — no tags are applied at build time.
+  backend-merge-jobs-multiarch:
    needs: [generate-matrix, backend-jobs-multiarch]
-    if: needs.generate-matrix.outputs['has-merges'] == 'true'
+    # !cancelled() lets the merge run even when a few build legs failed.
+    # Without it, GHA's default `needs:` cascade skips the entire merge
+    # matrix on a single failed/cancelled cell. We still want to publish
+    # the manifest lists for tag-suffixes whose legs all succeeded.
+    # Observed in v4.2.1: 2 singlearch build failures cascade-skipped all
+    # ~199 singlearch merge entries.
+    if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
    uses: ./.github/workflows/backend_merge.yml
    with:
      tag-latest: ${{ matrix.tag-latest }}
@@ -158,7 +172,24 @@ jobs:
      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    strategy:
      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix']) }}
+      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-multiarch']) }}
+
+  backend-merge-jobs-singlearch:
+    needs: [generate-matrix, backend-jobs-singlearch]
+    # See note on backend-merge-jobs-multiarch above for !cancelled().
+    if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true' }}
+    uses: ./.github/workflows/backend_merge.yml
+    with:
+      tag-latest: ${{ matrix.tag-latest }}
+      tag-suffix: ${{ matrix.tag-suffix }}
+    secrets:
+      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
+      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
+      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+    strategy:
+      fail-fast: false
+      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch']) }}

  backend-jobs-darwin:
    needs: generate-matrix
--- a/.github/workflows/backend_build.yml
+++ b/.github/workflows/backend_build.yml
@@ -228,11 +228,28 @@ jobs:
          digest="${{ steps.build.outputs.digest }}"
          touch "/tmp/digests/${digest#sha256:}"

+      # See .github/scripts/anchor-digest-in-cache.sh for why this is needed
+      # and how it interacts with backend_merge.yml's cleanup step.
+      - name: Anchor digest in ci-cache so quay GC won't reap before merge
+        if: github.event_name != 'pull_request'
+        env:
+          TAG_SUFFIX: ${{ inputs.tag-suffix }}
+          PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
+          DIGEST: ${{ steps.build.outputs.digest }}
+        run: .github/scripts/anchor-digest-in-cache.sh
+
+      # Artifact name uses a `--` separator between tag-suffix and platform-tag
+      # to avoid prefix collisions during the merge job's pattern-based download.
+      # Tag-suffixes are not prefix-disjoint (e.g. -gpu-nvidia-cuda-12-vllm is a
+      # prefix of -gpu-nvidia-cuda-12-vllm-omni); a single `-` separator plus the
+      # merge-side `digests<tag-suffix>-*` glob would let one merge over-match
+      # the other backend's artifacts. The `-single` placeholder for empty
+      # platform-tag (single-arch entries) keeps the artifact name non-trailing.
      - name: Upload digest artifact
        if: github.event_name != 'pull_request'
-        uses: actions/upload-artifact@v4
+        uses: actions/upload-artifact@v7
        with:
-          name: digests${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
+          name: digests${{ inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
          path: /tmp/digests/*
          if-no-files-found: error
          retention-days: 1
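
The prefix-collision reasoning in the `Upload digest artifact` comment above can be tried out with plain shell globbing. This is a sketch under the assumption that the download `pattern:` glob behaves like a shell glob for these names; `glob_matches` is a hypothetical helper, and the two tag-suffixes are the ones quoted in the comment.

```shell
# Hypothetical helper: does artifact name $2 match glob $1?
# The unquoted $1 in the case arm is interpreted as a pattern.
glob_matches() {
  case "$2" in
    $1) echo yes ;;
    *)  echo no ;;
  esac
}

vllm='-gpu-nvidia-cuda-12-vllm'
omni='-gpu-nvidia-cuda-12-vllm-omni'

# Old scheme: single '-' separator, glob 'digests<tag-suffix>-*'.
# The vllm merge also picks up the vllm-omni artifact.
glob_matches "digests${vllm}-*" "digests${omni}-amd64"

# New scheme: '--' separator, glob 'digests<tag-suffix>--*'.
# The sibling backend's artifact no longer matches...
glob_matches "digests${vllm}--*" "digests${omni}--amd64"

# ...while the backend's own artifacts still do.
glob_matches "digests${vllm}--*" "digests${vllm}--single"
```

The `--` separator works because no tag-suffix contains a double hyphen; if one ever did, the same over-matching could reappear.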
--- a/.github/workflows/backend_build_darwin.yml
+++ b/.github/workflows/backend_build_darwin.yml
@@ -116,6 +116,13 @@ jobs:
          # already), we don't have to chase missing dylibs one at a time.
          # The downloads cache makes the reinstall fast (~5s on a hit).
          brew reinstall ccache
+          # Same pattern for grpc: its CMake config (used by the llama-cpp
+          # `grpc-server` target) does find_package(absl). The cache restores
+          # /opt/homebrew/Cellar/grpc so brew above no-ops the install, but
+          # abseil isn't in our Cellar cache list and never gets installed
+          # alongside, leaving grpc's CMake unable to resolve it. Reinstalling
+          # grpc re-validates and pulls abseil in, mirroring the ccache fix.
+          brew reinstall grpc
          # The brew cache restores the Cellar dirs but NOT the bin symlinks
          # at /opt/homebrew/bin/*. brew install above sees the Cellar present
          # and decides "already installed" without re-linking, so on a cache-
@@ -211,8 +218,13 @@ jobs:
          make protogen-go
          make backends/llama-cpp-darwin

+      - name: Build ds4 backend (Darwin Metal)
+        if: inputs.backend == 'ds4'
+        run: |
+          make backends/ds4-darwin
+
      - name: Build ${{ inputs.backend }}-darwin
-        if: inputs.backend != 'llama-cpp'
+        if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4'
        run: |
          make protogen-go
          BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend
--- a/.github/workflows/backend_merge.yml
+++ b/.github/workflows/backend_merge.yml
@@ -34,10 +34,23 @@ jobs:
    env:
      quay_username: ${{ secrets.quayUsername }}
    steps:
-      - name: Download digests
-        uses: actions/download-artifact@v4
+      # Sparse checkout: the merge job needs `.github/scripts/` (for the
+      # keepalive cleanup script) but none of the source tree.
+      - name: Checkout (.github/scripts only)
+        uses: actions/checkout@v6
        with:
-          pattern: digests${{ inputs.tag-suffix }}-*
+          sparse-checkout: |
+            .github/scripts
+          sparse-checkout-cone-mode: false
+
+      # `--` separator anchors the glob so we don't over-match sibling
+      # backends whose tag-suffix happens to be a prefix of ours
+      # (e.g. -cpu-vllm vs -cpu-vllm-omni). Must stay in sync with the
+      # upload-artifact name in backend_build.yml.
+      - name: Download digests
+        uses: actions/download-artifact@v8
+        with:
+          pattern: digests${{ inputs.tag-suffix }}--*
          merge-multiple: true
          path: /tmp/digests

@@ -122,6 +135,15 @@ jobs:
            docker buildx imagetools inspect "$first_tag"
          fi

+      # See .github/scripts/cleanup-keepalive-tags.sh for why this is
+      # best-effort and what the failure modes are.
+      - name: Cleanup keepalive tags in ci-cache
+        if: github.event_name != 'pull_request' && success()
+        env:
+          TAG_SUFFIX: ${{ inputs.tag-suffix }}
+          QUAY_TOKEN: ${{ secrets.quayPassword }}
+        run: .github/scripts/cleanup-keepalive-tags.sh
+
      - name: Job summary
        if: github.event_name != 'pull_request'
        run: |
--- a/.github/workflows/backend_pr.yml
+++ b/.github/workflows/backend_pr.yml
@@ -14,11 +14,13 @@ jobs:
      matrix-singlearch: ${{ steps.set-matrix.outputs['matrix-singlearch'] }}
      matrix-multiarch: ${{ steps.set-matrix.outputs['matrix-multiarch'] }}
      matrix-darwin: ${{ steps.set-matrix.outputs['matrix-darwin'] }}
-      merge-matrix: ${{ steps.set-matrix.outputs['merge-matrix'] }}
+      merge-matrix-multiarch: ${{ steps.set-matrix.outputs['merge-matrix-multiarch'] }}
+      merge-matrix-singlearch: ${{ steps.set-matrix.outputs['merge-matrix-singlearch'] }}
      has-backends-singlearch: ${{ steps.set-matrix.outputs['has-backends-singlearch'] }}
      has-backends-multiarch: ${{ steps.set-matrix.outputs['has-backends-multiarch'] }}
      has-backends-darwin: ${{ steps.set-matrix.outputs['has-backends-darwin'] }}
-      has-merges: ${{ steps.set-matrix.outputs['has-merges'] }}
+      has-merges-multiarch: ${{ steps.set-matrix.outputs['has-merges-multiarch'] }}
+      has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6
@@ -97,12 +99,14 @@ jobs:
      fail-fast: true
      max-parallel: 8
      matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch']) }}
-  backend-merge-jobs:
+  backend-merge-jobs-multiarch:
    needs: [generate-matrix, backend-jobs-multiarch]
    # backend_merge.yml's push-side steps are all gated on
    # github.event_name != 'pull_request', so on a PR the merge job would
    # do nothing. Skip it entirely to avoid spinning up an empty runner.
-    if: github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges'] == 'true'
+    # !cancelled() lets the merge run even when a few build legs fail —
+    # see the matching note in backend.yml.
+    if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
    uses: ./.github/workflows/backend_merge.yml
    with:
      tag-latest: ${{ matrix.tag-latest }}
@@ -112,7 +116,21 @@ jobs:
      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    strategy:
      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix']) }}
+      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-multiarch']) }}
+
+  backend-merge-jobs-singlearch:
+    needs: [generate-matrix, backend-jobs-singlearch]
+    if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true' }}
+    uses: ./.github/workflows/backend_merge.yml
+    with:
+      tag-latest: ${{ matrix.tag-latest }}
+      tag-suffix: ${{ matrix.tag-suffix }}
+    secrets:
+      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+    strategy:
+      fail-fast: false
+      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch']) }}
  backend-jobs-darwin:
    needs: generate-matrix
    uses: ./.github/workflows/backend_build_darwin.yml
--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -22,6 +22,10 @@ jobs:
            variable: "TURBOQUANT_VERSION"
            branch: "feature/turboquant-kv-cache"
            file: "backend/cpp/turboquant/Makefile"
+          - repository: "antirez/ds4"
+            variable: "DS4_VERSION"
+            branch: "main"
+            file: "backend/cpp/ds4/Makefile"
          - repository: "ggml-org/whisper.cpp"
            variable: "WHISPER_CPP_VERSION"
            branch: "master"
--- a/.github/workflows/image_build.yml
+++ b/.github/workflows/image_build.yml
@@ -187,7 +187,7 @@ jobs:

      - name: Upload digest artifact
        if: github.event_name != 'pull_request'
-        uses: actions/upload-artifact@v4
+        uses: actions/upload-artifact@v7
        with:
          name: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}-${{ inputs.platform-tag }}
          path: /tmp/digests/*
--- a/.github/workflows/image_merge.yml
+++ b/.github/workflows/image_merge.yml
@@ -34,7 +34,7 @@ jobs:
      quay_username: ${{ secrets.quayUsername }}
    steps:
      - name: Download digests
-        uses: actions/download-artifact@v4
+        uses: actions/download-artifact@v8
        with:
          pattern: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}-*
          merge-multiple: true
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -25,6 +25,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 | [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
 | [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
 | [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
+| [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
 | [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
 | [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
 | [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
--- a/2
+++ b/2
@@ -305,7 +305,7 @@ EOT
 ###################################

 # Build React UI
-FROM node:25-slim AS react-ui-builder
+FROM node:26-slim AS react-ui-builder
 WORKDIR /app
 COPY core/http/react-ui/package*.json ./
 RUN npm install
--- a/13
+++ b/13
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin

 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -1009,6 +1009,10 @@ backends/llama-cpp-darwin: build
 	bash ./scripts/build/llama-cpp-darwin.sh
 	./local-ai backends install "ocifile://$(abspath ./backend-images/llama-cpp.tar)"

+backends/ds4-darwin: build
+	bash ./scripts/build/ds4-darwin.sh
+	./local-ai backends install "ocifile://$(abspath ./backend-images/ds4.tar)"
+
 build-darwin-python-backend: build
 	bash ./scripts/build/python-darwin.sh

@@ -1050,6 +1054,10 @@ BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false
 # turboquant is a llama.cpp fork with TurboQuant KV-cache quantization.
 # Reuses backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile.
 BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
+# ds4 is antirez/ds4, a DeepSeek V4 Flash-specific inference engine.
+# Single-model; hardware-only validation lives at tests/e2e-backends/
+# (BACKEND_BINARY mode); see docs/superpowers/plans/2026-05-11-ds4-backend.md.
+BACKEND_DS4 = ds4|ds4|.|false|false

 # Golang backends
 BACKEND_PIPER = piper|golang|.|false|true
@@ -1135,6 +1143,7 @@ endef
 $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
+$(eval $(call generate-docker-build-target,$(BACKEND_DS4)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
@@ -1188,7 +1197,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar

-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx

 ########################################################
 ### Mock Backend for E2E Tests
--- a/backend/Dockerfile.ds4
+++ b/backend/Dockerfile.ds4
@@ -0,0 +1,41 @@
+ARG BASE_IMAGE=ubuntu:24.04
+ARG APT_MIRROR=""
+ARG APT_PORTS_MIRROR=""
+
+# BASE_IMAGE is either ubuntu:24.04 (for cpu builds) or nvidia/cuda:13.0.0-devel-ubuntu24.04
+# (for cublas builds). Both ship apt + Ubuntu Noble packages; the nvidia/cuda base
+# additionally provides /usr/local/cuda. Darwin (Metal) builds bypass this Dockerfile
+# entirely via scripts/build/ds4-darwin.sh.
+FROM ${BASE_IMAGE} AS builder
+ARG BUILD_TYPE
+ARG TARGETARCH
+ARG TARGETVARIANT
+
+ENV BUILD_TYPE=${BUILD_TYPE} \
+    DEBIAN_FRONTEND=noninteractive \
+    PATH=/usr/local/cuda/bin:${PATH}
+
+WORKDIR /build
+
+# Install build-time deps via plain apt - install-base-deps.sh's full pipeline
+# (CUDA keyring + from-source gRPC) is unnecessary here:
+#   - CUDA: when BASE_IMAGE=nvidia/cuda:*, /usr/local/cuda is already populated;
+#     for the cpu build we don't need CUDA at all.
+#   - gRPC/Protobuf: system apt packages are sufficient; ds4's wrapper only links
+#     against them, it doesn't ship the gRPC source tree.
+#   - nlohmann-json: dsml_renderer's only third-party dep.
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        git cmake build-essential pkg-config ca-certificates \
+        libgrpc++-dev libprotobuf-dev protobuf-compiler protobuf-compiler-grpc \
+        nlohmann-json3-dev && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+
+COPY . /LocalAI
+
+RUN --mount=type=cache,target=/root/.ccache,id=ds4-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
+    make -C /LocalAI/backend/cpp/ds4 BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
+
+FROM scratch
+COPY --from=builder /LocalAI/backend/cpp/ds4/package/. ./
--- a/backend/Dockerfile.turboquant
+++ b/backend/Dockerfile.turboquant
@@ -117,6 +117,12 @@ ARG CUDA_DOCKER_ARCH
 ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
 ARG CMAKE_ARGS
 ENV CMAKE_ARGS=${CMAKE_ARGS}
+# AMDGPU_TARGETS must be forwarded into the env here too — backend/cpp/llama-cpp/Makefile
+# (which the turboquant Makefile reuses via a sibling build dir) errors out when the var
+# is empty on a hipblas build, and the prebuilt path is what CI exercises most of the
+# time. The builder-fromsource stage above already does this; mirror it here.
+ARG AMDGPU_TARGETS
+ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
 ARG TARGETARCH
 ARG TARGETVARIANT

--- a/backend/cpp/ds4/.gitignore
+++ b/backend/cpp/ds4/.gitignore
@@ -0,0 +1,9 @@
+ds4/
+build/
+package/
+grpc-server
+*.o
+backend.pb.cc
+backend.pb.h
+backend.grpc.pb.cc
+backend.grpc.pb.h
--- a/backend/cpp/ds4/CMakeLists.txt
+++ b/backend/cpp/ds4/CMakeLists.txt
@@ -0,0 +1,101 @@
+cmake_minimum_required(VERSION 3.15)
+project(ds4-grpc-server LANGUAGES CXX C)
+
+set(CMAKE_CXX_STANDARD 17)
+set(CMAKE_CXX_STANDARD_REQUIRED ON)
+set(TARGET grpc-server)
+
+option(DS4_NATIVE "Compile with -march=native / -mcpu=native" ON)
+set(DS4_GPU "cpu" CACHE STRING "GPU backend: cpu, cuda, or metal")
+set(DS4_DIR "${CMAKE_CURRENT_SOURCE_DIR}/ds4" CACHE PATH "Path to cloned ds4 source")
+
+find_package(Threads REQUIRED)
+find_package(Protobuf CONFIG QUIET)
+if(NOT Protobuf_FOUND)
+    find_package(Protobuf REQUIRED)
+endif()
+find_package(gRPC CONFIG QUIET)
+if(NOT gRPC_FOUND)
+    # Ubuntu's apt-installed grpc++ does not ship a CMake config - fall back.
+    find_library(GRPCPP_LIB grpc++ REQUIRED)
+    find_library(GRPCPP_REFLECTION_LIB grpc++_reflection REQUIRED)
+    add_library(gRPC::grpc++ INTERFACE IMPORTED)
+    set_target_properties(gRPC::grpc++ PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_LIB}")
+    add_library(gRPC::grpc++_reflection INTERFACE IMPORTED)
+    set_target_properties(gRPC::grpc++_reflection PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_REFLECTION_LIB}")
+endif()
+
+find_program(_PROTOC NAMES protoc REQUIRED)
+find_program(_GRPC_CPP_PLUGIN NAMES grpc_cpp_plugin REQUIRED)
+
+get_filename_component(HW_PROTO "${CMAKE_CURRENT_SOURCE_DIR}/../../backend.proto" ABSOLUTE)
+get_filename_component(HW_PROTO_PATH "${HW_PROTO}" PATH)
+
+set(HW_PROTO_SRCS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.cc")
+set(HW_PROTO_HDRS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.h")
+set(HW_GRPC_SRCS  "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.cc")
+set(HW_GRPC_HDRS  "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.h")
+
+add_custom_command(
+    OUTPUT "${HW_PROTO_SRCS}" "${HW_PROTO_HDRS}" "${HW_GRPC_SRCS}" "${HW_GRPC_HDRS}"
+    COMMAND ${_PROTOC}
+    ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
+         --cpp_out  "${CMAKE_CURRENT_BINARY_DIR}"
+         -I "${HW_PROTO_PATH}"
+         --plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN}"
+         "${HW_PROTO}"
+    DEPENDS "${HW_PROTO}")
+
+add_library(hw_grpc_proto STATIC
+    ${HW_GRPC_SRCS} ${HW_GRPC_HDRS}
+    ${HW_PROTO_SRCS} ${HW_PROTO_HDRS})
+target_include_directories(hw_grpc_proto PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
+
+set(DS4_OBJS "${DS4_DIR}/ds4.o")
+if(DS4_GPU STREQUAL "cuda")
+    list(APPEND DS4_OBJS "${DS4_DIR}/ds4_cuda.o")
+elseif(DS4_GPU STREQUAL "metal")
+    list(APPEND DS4_OBJS "${DS4_DIR}/ds4_metal.o")
+elseif(DS4_GPU STREQUAL "cpu")
+    set(DS4_OBJS "${DS4_DIR}/ds4_cpu.o")
+endif()
+
+add_executable(${TARGET}
+    grpc-server.cpp
+    dsml_parser.cpp
+    dsml_renderer.cpp
+    kv_cache.cpp)
+
+target_include_directories(${TARGET} PRIVATE ${DS4_DIR})
+
+foreach(obj ${DS4_OBJS})
+    target_sources(${TARGET} PRIVATE ${obj})
+    set_source_files_properties(${obj} PROPERTIES EXTERNAL_OBJECT TRUE GENERATED TRUE)
+endforeach()
+
+target_link_libraries(${TARGET} PRIVATE
+    hw_grpc_proto
+    gRPC::grpc++
+    gRPC::grpc++_reflection
+    protobuf::libprotobuf
+    Threads::Threads
+    m)
+
+if(DS4_GPU STREQUAL "cuda")
+    find_package(CUDAToolkit REQUIRED)
+    target_link_libraries(${TARGET} PRIVATE CUDA::cudart CUDA::cublas)
+elseif(DS4_GPU STREQUAL "metal")
+    find_library(FOUNDATION_LIB Foundation REQUIRED)
+    find_library(METAL_LIB Metal REQUIRED)
+    target_link_libraries(${TARGET} PRIVATE ${FOUNDATION_LIB} ${METAL_LIB})
+elseif(DS4_GPU STREQUAL "cpu")
+    target_compile_definitions(${TARGET} PRIVATE DS4_NO_GPU)
+endif()
+
+if(DS4_NATIVE)
+    if(APPLE)
+        target_compile_options(${TARGET} PRIVATE -mcpu=native)
+    else()
+        target_compile_options(${TARGET} PRIVATE -march=native)
+    endif()
+endif()
--- a/backend/cpp/ds4/Makefile
+++ b/backend/cpp/ds4/Makefile
@@ -0,0 +1,78 @@
+# ds4 backend Makefile.
+#
+# Upstream pin lives below as DS4_VERSION?= so the bump-deps bot
+# (.github/bump_deps.sh) can find and update it - matches the
+# llama-cpp / ik-llama-cpp / turboquant convention.
+
+DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
+DS4_REPO?=https://github.com/antirez/ds4
+
+CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
+BUILD_DIR := build
+
+BUILD_TYPE ?=
+NATIVE ?= false
+JOBS ?= $(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
+
+UNAME_S := $(shell uname -s)
+
+CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release
+
+ifeq ($(BUILD_TYPE),cublas)
+    CMAKE_ARGS += -DDS4_GPU=cuda
+    DS4_OBJ_TARGET := ds4.o ds4_cuda.o
+else ifeq ($(UNAME_S),Darwin)
+    CMAKE_ARGS += -DDS4_GPU=metal
+    DS4_OBJ_TARGET := ds4.o ds4_metal.o
+else
+    # CPU reference path (Linux only - macOS CPU path is broken by VM bug per ds4 README).
+    CMAKE_ARGS += -DDS4_GPU=cpu
+    DS4_OBJ_TARGET := ds4_cpu.o
+endif
+
+ifneq ($(NATIVE),true)
+    CMAKE_ARGS += -DDS4_NATIVE=OFF
+endif
+
+.PHONY: grpc-server package clean purge test all
+all: grpc-server
+
+# Clone the upstream ds4 source at the pinned commit. Directory acts as the
+# target so make only re-clones when missing. After a DS4_VERSION bump,
+# run 'make purge && make' to refetch (or rely on CI's clean build).
+ds4:
+	mkdir -p ds4
+	cd ds4 && \
+	git init -q && \
+	git remote add origin $(DS4_REPO) && \
+	git fetch --depth 1 origin $(DS4_VERSION) && \
+	git checkout FETCH_HEAD
+
+# Build ds4's engine object files via its own Makefile, which already encodes
+# the right per-platform compile flags (Objective-C/Metal on Darwin, nvcc on Linux+CUDA).
+ds4/ds4.o: ds4
+ifeq ($(BUILD_TYPE),cublas)
+	+$(MAKE) -C ds4 ds4.o ds4_cuda.o
+else ifeq ($(UNAME_S),Darwin)
+	+$(MAKE) -C ds4 ds4.o ds4_metal.o
+else
+	+$(MAKE) -C ds4 ds4_cpu.o
+endif
+
+grpc-server: ds4/ds4.o
+	mkdir -p $(BUILD_DIR)
+	cd $(BUILD_DIR) && cmake $(CMAKE_ARGS) $(CURRENT_MAKEFILE_DIR) && cmake --build . --config Release -j $(JOBS)
+	cp $(BUILD_DIR)/grpc-server grpc-server
+
+package: grpc-server
+	bash package.sh
+
+test:
+	@echo "ds4 backend: e2e coverage at tests/e2e-backends/ (BACKEND_BINARY mode)"
+
+clean:
+	rm -rf $(BUILD_DIR) grpc-server package
+	if [ -d ds4 ]; then $(MAKE) -C ds4 clean; fi
+
+purge: clean
+	rm -rf ds4
--- a/backend/cpp/ds4/dsml_parser.cpp
+++ b/backend/cpp/ds4/dsml_parser.cpp
@@ -0,0 +1,359 @@
+#include "dsml_parser.h"
+
+#include <algorithm>
+#include <cstdio>
+#include <cstring>
+#include <chrono>
+#include <random>
+#include <string>
+#include <vector>
+
+namespace ds4cpp {
+
+namespace {
+
+constexpr const char *kThinkOpen      = "<think>";
+constexpr const char *kThinkClose     = "</think>";
+constexpr const char *kToolsOpen      = "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>";   // <｜DSML｜tool_calls>
+constexpr const char *kToolsClose     = "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>"; // </｜DSML｜tool_calls>
+constexpr const char *kInvokeOpenPfx  = "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\""; // <｜DSML｜invoke name="
+constexpr const char *kInvokeClose    = "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>";       // </｜DSML｜invoke>
+constexpr const char *kParamOpenPfx   = "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter name=\""; // <｜DSML｜parameter name="
+constexpr const char *kParamClose     = "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>";       // </｜DSML｜parameter>
+
+// All structural markers the parser might encounter - used to detect "buf
+// might be a partial marker, don't drain yet" conditions.
+const std::vector<std::string> &all_markers() {
+    static const std::vector<std::string> v = {
+        kThinkOpen, kThinkClose,
+        kToolsOpen, kToolsClose,
+        kInvokeOpenPfx, kInvokeClose,
+        kParamOpenPfx, kParamClose,
+    };
+    return v;
+}
+
+// Returns true if `buf` could be a *prefix* of any marker (i.e., we should
+// wait for more text before draining as plain content). The marker-prefix
+// loop handles fixed markers exactly. For markers with variable-length
+// internal data (kInvokeOpenPfx, kParamOpenPfx have an open quote, then the
+// tool/param name, then a closing quote and `>`), we also wait while buf
+// starts with `<` and has not yet seen a `>`: the leading `<` could be the
+// start of one of those open markers, or a literal that we can confirm only
+// once we know what follows. Once the first `>` arrives, the buffered text is
+// either consumed by TryConsumeMarker or emitted as a literal `<` by the caller.
+bool looks_like_prefix(const std::string &buf) {
+    for (const auto &m : all_markers()) {
+        if (m.size() > buf.size() && m.compare(0, buf.size(), buf) == 0) return true;
+    }
+    if (!buf.empty() && buf[0] == '<' && buf.find('>') == std::string::npos) {
+        return true;
+    }
+    return false;
+}
+
+bool consume_literal(std::string &buf, const std::string &lit) {
+    if (buf.compare(0, lit.size(), lit) == 0) {
+        buf.erase(0, lit.size());
+        return true;
+    }
+    return false;
+}
+
+// Find the next '<' in buf starting at offset; returns std::string::npos if none.
+size_t next_tag(const std::string &buf, size_t off = 0) {
+    return buf.find('<', off);
+}
+
+std::string json_escape(const std::string &in) {
+    std::string out;
+    out.reserve(in.size() + 2);
+    for (char c : in) {
+        switch (c) {
+            case '"':  out += "\\\""; break;
+            case '\\': out += "\\\\"; break;
+            case '\b': out += "\\b"; break;
+            case '\f': out += "\\f"; break;
+            case '\n': out += "\\n"; break;
+            case '\r': out += "\\r"; break;
+            case '\t': out += "\\t"; break;
+            default:
+                if (static_cast<unsigned char>(c) < 0x20) {
+                    char tmp[8];
+                    std::snprintf(tmp, sizeof(tmp), "\\u%04x", static_cast<unsigned>(c));
+                    out += tmp;
+                } else {
+                    out += c;
+                }
+        }
+    }
+    return out;
+}
+
+} // namespace
+
+DsmlParser::DsmlParser() = default;
+
+bool DsmlParser::IsInDsmlStructural() const {
+    switch (state_) {
+        case State::TOOL_CALLS:
+        case State::INVOKE:
+            return true;
+        case State::PARAM_VALUE:  // payload bytes; user sampling applies
+        case State::TEXT:
+        case State::THINK:
+            return false;
+    }
+    return false;
+}
+
+void DsmlParser::EmitArgsChunk(const std::string &chunk, std::vector<ParserEvent> &out) {
+    if (chunk.empty()) return;
+    ParserEvent e;
+    e.type = ParserEvent::TOOL_ARGS;
+    e.text = chunk;
+    e.index = tool_index_;
+    out.push_back(std::move(e));
+}
+
+void DsmlParser::FinishCurrentToolCall(std::vector<ParserEvent> &out) {
+    if (tool_index_ < 0) return;
+    // Close the JSON object that was opened on the first parameter.
+    if (args_emitted_open_brace_) {
+        EmitArgsChunk("}", out);
+    } else {
+        EmitArgsChunk("{}", out);
+    }
+    ParserEvent e;
+    e.type = ParserEvent::TOOL_END;
+    e.index = tool_index_;
+    out.push_back(std::move(e));
+    current_tool_name_.clear();
+    args_emitted_open_brace_ = false;
+    args_param_count_ = 0;
+}
+
+bool DsmlParser::TryConsumeMarker(std::vector<ParserEvent> &out) {
+    switch (state_) {
+    case State::TEXT: {
+        if (consume_literal(buf_, kThinkOpen))   { state_ = State::THINK;       return true; }
+        if (consume_literal(buf_, kToolsOpen))   { state_ = State::TOOL_CALLS;  return true; }
+        return false;
+    }
+    case State::THINK: {
+        if (consume_literal(buf_, kThinkClose))  { state_ = State::TEXT;        return true; }
+        return false;
+    }
+    case State::TOOL_CALLS: {
+        if (consume_literal(buf_, kToolsClose))  { state_ = State::TEXT;        return true; }
+        // <｜DSML｜invoke name="X">
+        if (buf_.compare(0, std::strlen(kInvokeOpenPfx), kInvokeOpenPfx) == 0) {
+            size_t close_q = buf_.find('"', std::strlen(kInvokeOpenPfx));
+            if (close_q == std::string::npos) return false; // need more bytes
+            size_t close_gt = buf_.find('>', close_q);
+            if (close_gt == std::string::npos) return false;
+            current_tool_name_ = buf_.substr(std::strlen(kInvokeOpenPfx),
+                                             close_q - std::strlen(kInvokeOpenPfx));
+            tool_index_++;
+            buf_.erase(0, close_gt + 1);
+            ParserEvent e;
+            e.type = ParserEvent::TOOL_START;
+            e.tool_name = current_tool_name_;
+            e.tool_id   = RandomToolId();
+            e.index     = tool_index_;
+            out.push_back(std::move(e));
+            args_emitted_open_brace_ = false;
+            args_param_count_ = 0;
+            state_ = State::INVOKE;
+            return true;
+        }
+        return false;
+    }
+    case State::INVOKE: {
+        if (consume_literal(buf_, kInvokeClose)) {
+            FinishCurrentToolCall(out);
+            state_ = State::TOOL_CALLS;
+            return true;
+        }
+        // <｜DSML｜parameter name="K" string="true|false">
+        if (buf_.compare(0, std::strlen(kParamOpenPfx), kParamOpenPfx) == 0) {
+            size_t close_q = buf_.find('"', std::strlen(kParamOpenPfx));
+            if (close_q == std::string::npos) return false;
+            size_t string_attr = buf_.find("string=\"", close_q);
+            if (string_attr == std::string::npos) return false;
+            size_t string_q = buf_.find('"', string_attr + 8);
+            if (string_q == std::string::npos) return false;
+            size_t close_gt = buf_.find('>', string_q);
+            if (close_gt == std::string::npos) return false;
+            param_name_ = buf_.substr(std::strlen(kParamOpenPfx),
+                                      close_q - std::strlen(kParamOpenPfx));
+            std::string string_val = buf_.substr(string_attr + 8,
+                                                 string_q - (string_attr + 8));
+            param_is_string_ = (string_val == "true");
+            param_value_.clear();
+            buf_.erase(0, close_gt + 1);
+            // Emit args JSON opener / separator.
+            std::string opener;
+            if (!args_emitted_open_brace_) { opener = "{"; args_emitted_open_brace_ = true; }
+            else                            { opener = ","; }
+            opener += "\"" + json_escape(param_name_) + "\":";
+            if (param_is_string_) opener += "\"";
+            EmitArgsChunk(opener, out);
+            args_param_count_++;
+            state_ = State::PARAM_VALUE;
+            return true;
+        }
+        return false;
+    }
+    case State::PARAM_VALUE: {
+        if (consume_literal(buf_, kParamClose)) {
+            if (param_is_string_) EmitArgsChunk("\"", out);
+            state_ = State::INVOKE;
+            return true;
+        }
+        return false;
+    }
+    }
+    return false;
+}
+
+void DsmlParser::DrainPlain(std::vector<ParserEvent> &out) {
+    // Drain everything up to the next '<' that *might* start a marker.
+    // Anything before the next '<' is safe to emit; the '<...' tail stays buffered.
+    while (!buf_.empty()) {
+        size_t lt = next_tag(buf_, 0);
+        if (lt == std::string::npos) {
+            // No tag at all - emit (or accumulate) the whole buffer.
+            ParserEvent e;
+            if (state_ == State::PARAM_VALUE) {
+                std::string esc = param_is_string_ ? json_escape(buf_) : buf_;
+                EmitArgsChunk(esc, out);
+            } else if (state_ == State::THINK) {
+                e.type = ParserEvent::REASONING;
+                e.text = buf_;
+                out.push_back(std::move(e));
+            } else if (state_ == State::TEXT) {
+                e.type = ParserEvent::CONTENT;
+                e.text = buf_;
+                out.push_back(std::move(e));
+            }
+            // Inside INVOKE / TOOL_CALLS with no marker, raw bytes are
+            // structural whitespace - discard.
+            buf_.clear();
+            return;
+        }
+        if (lt > 0) {
+            std::string chunk = buf_.substr(0, lt);
+            buf_.erase(0, lt);
+            ParserEvent e;
+            if (state_ == State::PARAM_VALUE) {
+                std::string esc = param_is_string_ ? json_escape(chunk) : chunk;
+                EmitArgsChunk(esc, out);
+            } else if (state_ == State::THINK) {
+                e.type = ParserEvent::REASONING;
+                e.text = chunk;
+                out.push_back(std::move(e));
+            } else if (state_ == State::TEXT) {
+                e.type = ParserEvent::CONTENT;
+                e.text = chunk;
+                out.push_back(std::move(e));
+            }
+        }
+        // buf_[0] == '<' - try consuming a marker. If we consumed one, loop again.
+        if (!TryConsumeMarker(out)) {
+            // Could be a partial marker - wait for more bytes.
+            if (looks_like_prefix(buf_)) return;
+            // Otherwise this '<' is a literal - emit one char and continue.
+            std::string one(1, buf_[0]);
+            buf_.erase(0, 1);
+            ParserEvent e;
+            if (state_ == State::PARAM_VALUE) {
+                std::string esc = param_is_string_ ? json_escape(one) : one;
+                EmitArgsChunk(esc, out);
+            } else if (state_ == State::THINK) {
+                e.type = ParserEvent::REASONING;
+                e.text = one;
+                out.push_back(std::move(e));
+            } else if (state_ == State::TEXT) {
+                e.type = ParserEvent::CONTENT;
+                e.text = one;
+                out.push_back(std::move(e));
+            }
+        }
+    }
+}
+
+void DsmlParser::Feed(const std::string &chunk, std::vector<ParserEvent> &out) {
+    buf_ += chunk;
+    DrainPlain(out);
+}
+
+void DsmlParser::Flush(std::vector<ParserEvent> &out) {
+    // At flush time we no longer wait for marker completion - drain everything
+    // (the trailing bytes won't grow). Mirror DrainPlain's state-aware
+    // classification: PARAM_VALUE bytes become TOOL_ARGS, THINK bytes become
+    // REASONING, TEXT bytes become CONTENT, and INVOKE/TOOL_CALLS bytes are
+    // structural whitespace (discarded).
+    auto emit_plain = [&](const std::string &chunk) {
+        if (chunk.empty()) return;
+        if (state_ == State::PARAM_VALUE) {
+            std::string esc = param_is_string_ ? json_escape(chunk) : chunk;
+            EmitArgsChunk(esc, out);
+            return;
+        }
+        if (state_ == State::THINK) {
+            ParserEvent e;
+            e.type = ParserEvent::REASONING;
+            e.text = chunk;
+            out.push_back(std::move(e));
+            return;
+        }
+        if (state_ == State::TEXT) {
+            ParserEvent e;
+            e.type = ParserEvent::CONTENT;
+            e.text = chunk;
+            out.push_back(std::move(e));
+            return;
+        }
+        // INVOKE / TOOL_CALLS: structural whitespace, discard.
+    };
+    while (!buf_.empty()) {
+        size_t lt = next_tag(buf_, 0);
+        if (lt == std::string::npos) {
+            emit_plain(buf_);
+            buf_.clear();
+            return;
+        }
+        if (lt > 0) {
+            std::string chunk = buf_.substr(0, lt);
+            buf_.erase(0, lt);
+            emit_plain(chunk);
+        }
+        if (!TryConsumeMarker(out)) {
+            // Definitely a literal '<' now (no chance of more bytes arriving).
+            std::string one(1, buf_[0]);
+            buf_.erase(0, 1);
+            emit_plain(one);
+        }
+    }
+    // If we ended mid-tool-call (model truncated), close it cleanly.
+    if (state_ == State::INVOKE || state_ == State::PARAM_VALUE) {
+        if (state_ == State::PARAM_VALUE && param_is_string_) EmitArgsChunk("\"", out);
+        FinishCurrentToolCall(out);
+        state_ = State::TEXT;
+    }
+}
+
+std::string RandomToolId() {
+    static thread_local std::mt19937_64 rng{
+        static_cast<uint64_t>(std::chrono::system_clock::now().time_since_epoch().count())};
+    const char *alphabet =
+        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
+    std::string out = "call_";
+    for (int i = 0; i < 16; ++i) {
+        out += alphabet[rng() % 62];
+    }
+    return out;
+}
+
+} // namespace ds4cpp
--- a/backend/cpp/ds4/dsml_parser.h
+++ b/backend/cpp/ds4/dsml_parser.h
@@ -0,0 +1,77 @@
+#pragma once
+#include <functional>
+#include <string>
+#include <vector>
+
+namespace ds4cpp {
+
+struct ParserEvent {
+    enum Type { CONTENT, REASONING, TOOL_START, TOOL_ARGS, TOOL_END };
+    Type type;
+    std::string text;        // CONTENT, REASONING, TOOL_ARGS
+    std::string tool_name;   // TOOL_START
+    std::string tool_id;     // TOOL_START (caller-assigned)
+    int index = 0;           // TOOL_START / TOOL_ARGS / TOOL_END
+};
+
+// Streaming parser. Instances share no state; create one per Predict call.
+class DsmlParser {
+public:
+    DsmlParser();
+
+    // Feed a chunk of raw model-emitted text. Appends classified events to
+    // `out`. May buffer the tail of `chunk` internally if it looks like a
+    // marker prefix.
+    void Feed(const std::string &chunk, std::vector<ParserEvent> &out);
+
+    // Flush buffered text by current state; closes a truncated tool call (call at generation end).
+    void Flush(std::vector<ParserEvent> &out);
+
+    // True when the parser is inside a DSML structural position - that is,
+    // tags/markers between tool-call boundaries where the model is expected
+    // to emit protocol bytes verbatim. Mirrors ds4_server.c's "force
+    // temperature=0 unless dsml_decode_state_uses_payload_sampling" rule:
+    //
+    //   TEXT / THINK                  -> false (user sampling applies)
+    //   PARAM_VALUE                   -> false (payload uses user sampling)
+    //   TOOL_CALLS / INVOKE           -> true  (structural; force greedy)
+    //
+    // Callers should use this BEFORE the next sample() call to pick the
+    // effective temperature; the parser's state reflects what's already
+    // been consumed, so it predicts the next token's classification.
+    bool IsInDsmlStructural() const;
+
+private:
+    enum class State { TEXT, THINK, TOOL_CALLS, INVOKE, PARAM_VALUE };
+    State state_ = State::TEXT;
+    std::string buf_;
+    std::string current_tool_name_;
+    int tool_index_ = -1;
+    // While parsing a parameter value:
+    std::string param_name_;
+    bool param_is_string_ = true;
+    std::string param_value_;
+    // Incrementally-built arguments JSON for the active tool call.
+    std::string args_json_so_far_;
+    bool args_emitted_open_brace_ = false;
+    int args_param_count_ = 0;
+
+    // Try to consume one structural marker starting at buf_[0]. Returns true
+    // and advances state if a complete marker was consumed; false if the
+    // buffer is ambiguous (could be a marker prefix).
+    bool TryConsumeMarker(std::vector<ParserEvent> &out);
+
+    // Drain plain text from buf_ as far as we're sure it's not a marker prefix.
+    // Emits CONTENT, REASONING, or TOOL_ARGS depending on current state.
+    void DrainPlain(std::vector<ParserEvent> &out);
+
+    // Emit the next chunk of arguments JSON to the consumer.
+    void EmitArgsChunk(const std::string &chunk, std::vector<ParserEvent> &out);
+    void FinishCurrentToolCall(std::vector<ParserEvent> &out);
+};
+
+// Generate a random tool call ID ("call_" plus 16 alphanumeric characters).
+// Used by the gRPC layer when assigning IDs to streamed tool calls.
+std::string RandomToolId();
+
+} // namespace ds4cpp
--- a/backend/cpp/ds4/dsml_renderer.cpp
+++ b/backend/cpp/ds4/dsml_renderer.cpp
@@ -0,0 +1,140 @@
+#include "dsml_renderer.h"
+
+// nlohmann::json is required; there is no hand-rolled fallback parser. The
+// header is expected on the include path (for example from the apt package
+// nlohmann-json3-dev, or a copy bundled in the LocalAI tree). If it is
+// missing, the build stops at the #error below.
+#if __has_include(<nlohmann/json.hpp>)
+#include <nlohmann/json.hpp>
+using json = nlohmann::json;
+#else
+#error "nlohmann/json.hpp not found; install nlohmann-json3-dev"
+#endif
+
+#include <sstream>
+
+namespace ds4cpp {
+
+namespace {
+
+void render_param(std::ostringstream &os, const std::string &name,
+                  const json &value) {
+    bool is_string = value.is_string();
+    os << "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter name=\"" << name
+       << "\" string=\"" << (is_string ? "true" : "false") << "\">";
+    if (is_string) {
+        os << value.get<std::string>();
+    } else {
+        os << value.dump();
+    }
+    os << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>\n";
+}
+
+} // namespace
+
+std::string RenderAssistantToolCalls(const std::string &tool_calls_json) {
+    if (tool_calls_json.empty()) return "";
+    json arr;
+    try {
+        arr = json::parse(tool_calls_json);
+    } catch (const std::exception &) {
+        return "";
+    }
+    if (!arr.is_array() || arr.empty()) return "";
+
+    std::ostringstream os;
+    os << "\n\n<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\n";
+    for (const auto &call : arr) {
+        // OpenAI shape: { id, type, function: { name, arguments (JSON string) } }
+        // Anthropic shape comes through normalized by LocalAI.
+        std::string name;
+        std::string args_str;
+        if (call.contains("function")) {
+            const auto &fn = call["function"];
+            if (fn.contains("name") && fn["name"].is_string())
+                name = fn["name"].get<std::string>();
+            if (fn.contains("arguments") && fn["arguments"].is_string())
+                args_str = fn["arguments"].get<std::string>();
+        }
+        os << "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\"" << name << "\">\n";
+        if (!args_str.empty()) {
+            json args;
+            try {
+                args = json::parse(args_str);
+            } catch (...) {
+                args = json{};
+            }
+            if (args.is_object()) {
+                for (auto it = args.begin(); it != args.end(); ++it) {
+                    render_param(os, it.key(), it.value());
+                }
+            }
+        }
+        os << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>\n";
+    }
+    os << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>";
+    return os.str();
+}
+
+std::string RenderToolResult(const std::string &tool_call_id, const std::string &content) {
+    std::ostringstream os;
+    // ds4_server.c wraps tool results in a "tool_result" DSML tag carrying
+    // the tool_call_id. Match that shape.
+    os << "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_result id=\"" << tool_call_id << "\">"
+       << content
+       << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_result>";
+    return os.str();
+}
+
+std::string RenderToolsManifest(const std::string &tools_json) {
+    if (tools_json.empty()) return "";
+    json arr;
+    try {
+        arr = json::parse(tools_json);
+    } catch (const std::exception &) {
+        return "";
+    }
+    if (!arr.is_array() || arr.empty()) return "";
+
+    // Extract each OpenAI tool's `function` object, dump as compact JSON, one
+    // per line. Mirrors openai_function_schema_from_tool() in ds4_server.c.
+    std::ostringstream schemas;
+    for (const auto &tool : arr) {
+        if (tool.contains("function") && tool["function"].is_object()) {
+            schemas << tool["function"].dump() << "\n";
+        } else if (tool.is_object()) {
+            // Anthropic / direct-schema form: pass through.
+            schemas << tool.dump() << "\n";
+        }
+    }
+    if (schemas.tellp() == std::streampos(0)) return "";
+
+    // Verbatim text from ds4_server.c append_tools_prompt_text. Do NOT
+    // paraphrase - the model was trained on these exact bytes.
+    std::ostringstream os;
+    os << "## Tools\n\n"
+          "You have access to a set of tools to help answer the user question. "
+          "You can invoke tools by writing a \"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\" block like the following:\n\n"
+          "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\n"
+          "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\"$TOOL_NAME\">\n"
+          "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter name=\"$PARAMETER_NAME\" string=\"true|false\">$PARAMETER_VALUE</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>\n"
+          "...\n"
+          "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>\n"
+          "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\"$TOOL_NAME2\">\n"
+          "...\n"
+          "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>\n"
+          "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\n\n"
+          "String parameters should be specified as raw text and set `string=\"true\"`. "
+          "Preserve characters such as `>`, `&`, and `&&` exactly; never replace normal string characters with XML or HTML entity escapes. "
+          "Only if a string value itself contains the exact closing parameter tag `</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>`, write that tag as `&lt;/\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>` inside the value. "
+          "For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set `string=\"false\"`.\n\n"
+          "If thinking_mode is enabled (triggered by <think>), you MUST output your complete reasoning inside <think>...</think> BEFORE any tool calls or final response.\n\n"
+          "Otherwise, output directly after </think> with tool calls or final response.\n\n"
+          "### Available Tool Schemas\n\n"
+       << schemas.str()
+       << "\nYou MUST strictly follow the above defined tool name and parameter schemas to invoke tool calls. "
+          "Use the exact parameter names from the schemas.";
+    return os.str();
+}
+
+} // namespace ds4cpp
--- a/backend/cpp/ds4/dsml_renderer.h
+++ b/backend/cpp/ds4/dsml_renderer.h
@@ -0,0 +1,27 @@
+#pragma once
+#include <string>
+
+namespace ds4cpp {
+
+// Render an assistant message's tool_calls JSON array into the DSML block
+// that ds4 expects in its prompt. `tool_calls_json` is the value of
+// proto.Message.tool_calls (OpenAI shape: array of {id, type, function:{name, arguments}}).
+// Returns the DSML text to append after the assistant's content.
+std::string RenderAssistantToolCalls(const std::string &tool_calls_json);
+
+// Render a role="tool" message into the DSML "tool_result" block. ds4's
+// prompt template expects tool results inside that tag; we wrap the
+// `content` with it and pass the `tool_call_id` as the `id` attribute so
+// the model can correlate.
+std::string RenderToolResult(const std::string &tool_call_id, const std::string &content);
+
+// Render the "## Tools" manifest that ds4 expects in the SYSTEM prompt when
+// tools are available. Without this preamble the model has no idea tools
+// exist and will not emit DSML tool calls. Mirrors append_tools_prompt_text()
+// in ds4_server.c (~line 1646): a fixed preamble + "### Available Tool
+// Schemas" section + one JSON schema per line (extracted from each OpenAI
+// tool's .function object) + a fixed closing instruction. Returns empty
+// when tools_json is empty / unparseable.
+std::string RenderToolsManifest(const std::string &tools_json);
+
+} // namespace ds4cpp
--- a/backend/cpp/ds4/grpc-server.cpp
+++ b/backend/cpp/ds4/grpc-server.cpp
@@ -0,0 +1,696 @@
+// ds4 LocalAI gRPC backend.
+//
+// Wraps antirez/ds4's `ds4_engine_*` / `ds4_session_*` public API
+// (see ds4/ds4.h) over LocalAI's backend.proto. Covers model loading,
+// tokenization, prediction, DSML tool calls, thinking mode, and the
+// disk KV cache.
+
+#include "backend.pb.h"
+#include "backend.grpc.pb.h"
+
+#include "dsml_parser.h"   // streaming DSML parser (tool calls, reasoning)
+#include "dsml_renderer.h" // renders tool calls, results, and manifest to DSML
+#include "kv_cache.h"      // disk-backed KV cache
+
+extern "C" {
+#include "ds4.h"
+}
+
+#include <grpcpp/grpcpp.h>
+#include <grpcpp/server.h>
+#include <grpcpp/server_builder.h>
+#include <grpcpp/ext/proto_server_reflection_plugin.h>
+
+#include <atomic>
+#include <chrono>
+#include <csignal>
+#include <cstring>
+#include <iostream>
+#include <memory>
+#include <mutex>
+#include <string>
+#include <thread>
+#include <vector>
+
+using grpc::Server;
+using grpc::ServerBuilder;
+using grpc::ServerContext;
+using grpc::ServerWriter;
+// NOTE: do NOT alias `grpc::Status` as `Status` - the Status RPC method below
+// would shadow the type, breaking the other RPC method declarations that use
+// it as a return type. Use GStatus instead.
+using GStatus = ::grpc::Status;
+using grpc::StatusCode;
+
+namespace {
+
+// Global state - ds4 is single-engine-per-process by design.
+std::mutex g_engine_mu;
+ds4_engine *g_engine = nullptr;
+ds4_session *g_session = nullptr;
+int g_ctx_size = 32768;
+std::string g_kv_cache_dir; // empty disables disk cache
+
+std::atomic<Server *> g_server{nullptr};
+
+// Parse a "key:value" option string. Returns {opt, ""} when there is no colon.
+static std::pair<std::string, std::string> split_option(const std::string &opt) {
+    auto colon = opt.find(':');
+    if (colon == std::string::npos) return {opt, ""};
+    return {opt.substr(0, colon), opt.substr(colon + 1)};
+}
+
+static void append_token_text(ds4_engine *engine, int token, std::string &out) {
+    size_t len = 0;
+    const char *text = ds4_token_text(engine, token, &len);
+    if (text && len > 0) out.append(text, len);
+}
+
+struct CollectCtx {
+    ds4_engine *engine;
+    std::string raw_buf;  // exact raw bytes for Reply.message
+    ds4cpp::DsmlParser parser;
+    backend::Reply *reply;
+    int tokens;
+
+    // Per-tool aggregation: accumulate ChatDelta tool_calls so we emit one
+    // delta with all calls, mirroring how vllm's non-streaming path returns.
+    struct Pending {
+        std::string id;
+        std::string name;
+        std::string args;
+    };
+    std::vector<Pending> pending;
+
+    std::string content_buf;
+    std::string reasoning_buf;
+};
+
+static void apply_events(CollectCtx *c, const std::vector<ds4cpp::ParserEvent> &events) {
+    for (const auto &e : events) {
+        switch (e.type) {
+        case ds4cpp::ParserEvent::CONTENT:
+            c->content_buf += e.text;
+            break;
+        case ds4cpp::ParserEvent::REASONING:
+            c->reasoning_buf += e.text;
+            break;
+        case ds4cpp::ParserEvent::TOOL_START:
+            if ((int)c->pending.size() <= e.index)
+                c->pending.resize(e.index + 1);
+            c->pending[e.index].id = e.tool_id;
+            c->pending[e.index].name = e.tool_name;
+            break;
+        case ds4cpp::ParserEvent::TOOL_ARGS:
+            if ((int)c->pending.size() > e.index)
+                c->pending[e.index].args += e.text;
+            break;
+        case ds4cpp::ParserEvent::TOOL_END:
+            // No-op for non-streaming: the final delta is emitted at the end.
+            break;
+        }
+    }
+}
+
+static void collect_emit(void *ud, int token) {
+    auto *c = static_cast<CollectCtx *>(ud);
+    if (token == ds4_token_eos(c->engine)) return;
+    size_t len = 0;
+    const char *text = ds4_token_text(c->engine, token, &len);
+    if (!text || len == 0) return;
+    std::string chunk(text, len);
+    c->raw_buf += chunk;
+    std::vector<ds4cpp::ParserEvent> events;
+    c->parser.Feed(chunk, events);
+    apply_events(c, events);
+    c->tokens++;
+}
+static void collect_done(void *) {}
+
+struct StreamCtx {
+    ds4_engine *engine;
+    ServerWriter<backend::Reply> *writer;
+    ds4cpp::DsmlParser parser;
+    int tokens;
+    bool aborted;
+    // Track which tool indices we've seen TOOL_START for, so subsequent
+    // ARGS deltas can elide the redundant id/name fields.
+    std::vector<bool> tool_started;
+};
+
+static void stream_emit(void *ud, int token) {
+    auto *s = static_cast<StreamCtx *>(ud);
+    if (s->aborted) return;
+    if (token == ds4_token_eos(s->engine)) return;
+    size_t len = 0;
+    const char *text = ds4_token_text(s->engine, token, &len);
+    if (!text || len == 0) return;
+    std::string chunk(text, len);
+    std::vector<ds4cpp::ParserEvent> events;
+    s->parser.Feed(chunk, events);
+    if (events.empty()) { s->tokens++; return; }
+
+    backend::Reply reply;
+    auto *delta = reply.add_chat_deltas();
+    bool any_field = false;
+    for (const auto &e : events) {
+        switch (e.type) {
+        case ds4cpp::ParserEvent::CONTENT:
+            delta->set_content(delta->content() + e.text);
+            any_field = true;
+            break;
+        case ds4cpp::ParserEvent::REASONING:
+            delta->set_reasoning_content(delta->reasoning_content() + e.text);
+            any_field = true;
+            break;
+        case ds4cpp::ParserEvent::TOOL_START: {
+            if ((int)s->tool_started.size() <= e.index)
+                s->tool_started.resize(e.index + 1, false);
+            s->tool_started[e.index] = true;
+            auto *tc = delta->add_tool_calls();
+            tc->set_index(e.index);
+            tc->set_id(e.tool_id);
+            tc->set_name(e.tool_name);
+            any_field = true;
+            break;
+        }
+        case ds4cpp::ParserEvent::TOOL_ARGS: {
+            auto *tc = delta->add_tool_calls();
+            tc->set_index(e.index);
+            tc->set_arguments(e.text);
+            any_field = true;
+            break;
+        }
+        case ds4cpp::ParserEvent::TOOL_END:
+            // No marker delta needed - the Go side closes the tool call on
+            // the final aggregator pass.
+            break;
+        }
+    }
+    reply.set_message(chunk);
+    reply.set_tokens(1);
+    if (any_field) {
+        if (!s->writer->Write(reply)) s->aborted = true;
+    }
+    s->tokens++;
+}
+static void stream_done(void *) {}
+
+// Per-thread RNG seed for ds4_session_sample. Initialized lazily from
+// system_clock; ds4 owns the random walk after that.
+static uint64_t *get_rng() {
+    static thread_local uint64_t seed = 0;
+    if (seed == 0) {
+        seed = static_cast<uint64_t>(
+            std::chrono::system_clock::now().time_since_epoch().count());
+        if (seed == 0) seed = 1;
+    }
+    return &seed;
+}
+
+struct SampleParams {
+    float temperature;
+    int top_k;
+    float top_p;
+    float min_p;
+};
+
+// Compute the effective sampling parameters for the next token, mirroring
+// ds4_server.c:7102-7115:
+//   - thinking mode enabled -> override (T=1, top_k=0, top_p=1, min_p=0)
+//   - inside DSML structural position (tool-call markers) -> force T=0
+//   - otherwise -> the request's user-supplied sampling settings
+// The parser argument carries state from tokens emitted so far; its
+// IsInDsmlStructural() predicts the next token's classification.
+static SampleParams compute_sample_params(const backend::PredictOptions *request,
+                                          const ds4cpp::DsmlParser &parser,
+                                          bool think_enabled);
+
+static ds4_think_mode parse_think_mode(const backend::PredictOptions *request) {
+    // Per the vllm backend convention, "enable_thinking" gates thinking on/off,
+    // and "reasoning_effort" picks the strength when on.
+    const auto &md = request->metadata();
+    auto et = md.find("enable_thinking");
+    bool enabled = true; // default ON per ds4-server
+    if (et != md.end()) enabled = (et->second == "true" || et->second == "1");
+    if (!enabled) return DS4_THINK_NONE;
+    auto re = md.find("reasoning_effort");
+    if (re != md.end() && (re->second == "max" || re->second == "xhigh"))
+        return DS4_THINK_MAX;
+    return DS4_THINK_HIGH;
+}
+
+static SampleParams compute_sample_params(const backend::PredictOptions *request,
+                                          const ds4cpp::DsmlParser &parser,
+                                          bool think_enabled) {
+    SampleParams p = {
+        request->temperature(),
+        request->topk(),
+        request->topp(),
+        request->minp(),
+    };
+    if (think_enabled) {
+        // Match ds4-server: thinking mode wants creativity in the reasoning
+        // pass and the trailing content, so the entire generation overrides
+        // sampling unless DSML structural bytes take over below.
+        p.temperature = 1.0f;
+        p.top_k = 0;
+        p.top_p = 1.0f;
+        p.min_p = 0.0f;
+    }
+    if (parser.IsInDsmlStructural()) {
+        // Tool-call structural bytes (tags, markers, headers) must parse
+        // cleanly. Force greedy regardless of user/thinking settings.
+        p.temperature = 0.0f;
+    }
+    return p;
+}
+
+// Build the rendered text for cache keying. We feed the same text the model
+// will see; that lets the cache survive small client-side reformatting of
+// chat history (the cache is keyed on bytes, not tokens).
+static std::string render_prompt_text(const backend::PredictOptions *request) {
+    // Two-mode: either the raw prompt or the chat-template path. We mirror
+    // build_prompt's branching but accumulate text (not tokens) so we can
+    // SHA1 it for the cache key. ds4_session caches a tokens-indexed
+    // checkpoint, but the disk format keys on bytes per ds4-server's design.
+    if (!request->usetokenizertemplate() || request->messages_size() == 0) {
+        return request->prompt();
+    }
+    std::string out;
+    const std::string sys_role = "system";
+    for (const auto &m : request->messages()) {
+        if (m.role() == sys_role) { out += "[sys] " + m.content() + "\n"; break; }
+    }
+    for (const auto &m : request->messages()) {
+        if (m.role() == sys_role) continue;
+        out += "[" + m.role() + "] " + m.content() + "\n";
+    }
+    return out;
+}
+
+ds4cpp::KvCache g_kv_cache;
+
+// Try to recover prefill state for `rendered`. Returns the matched prefix length.
+static size_t maybe_load_cache(const std::string &rendered) {
+    if (!g_kv_cache.enabled() || !g_session) return 0;
+    return g_kv_cache.LoadLongestPrefix(g_session, rendered, g_ctx_size);
+}
+
+static void maybe_save_cache(const std::string &rendered) {
+    if (g_kv_cache.enabled() && g_session) {
+        g_kv_cache.Save(g_session, rendered, g_ctx_size);
+    }
+}
+
+static void build_prompt(ds4_engine *engine, const backend::PredictOptions *request,
+                         ds4_tokens *out) {
+    if (!request->usetokenizertemplate() || request->messages_size() == 0) {
+        ds4_tokenize_text(engine, request->prompt().c_str(), out);
+        return;
+    }
+    // Chat-template path: render via ds4's helpers.
+    ds4_chat_begin(engine, out);
+
+    ds4_think_mode think = parse_think_mode(request);
+
+    // ds4_encode_chat_prompt is convenient when there is exactly one
+    // system+user pair, but for arbitrary turn lists we use the granular
+    // append helpers. Pull the first system message (if any), then append
+    // every other message in order.
+    const std::string sys_role = "system";
+    std::string system_text;
+    for (const auto &m : request->messages()) {
+        if (m.role() == sys_role) { system_text = m.content(); break; }
+    }
+    // Inject the tools manifest into the system prompt when tools are present.
+    // ds4 was trained to emit DSML tool calls ONLY when this preamble is in
+    // the system message - without it, the model has no idea tools exist and
+    // the e2e tool-call test will fail. The renderer lives in dsml_renderer
+    // and is a verbatim port of ds4_server.c's append_tools_prompt_text.
+    std::string tools_manifest;
+    if (!request->tools().empty()) {
+        tools_manifest = ds4cpp::RenderToolsManifest(request->tools());
+    }
+    if (!system_text.empty() || !tools_manifest.empty()) {
+        std::string combined = system_text;
+        if (!tools_manifest.empty()) {
+            if (!combined.empty()) combined += "\n\n";
+            combined += tools_manifest;
+        }
+        ds4_chat_append_message(engine, out, "system", combined.c_str());
+    }
+    for (const auto &m : request->messages()) {
+        if (m.role() == sys_role) continue;
+        if (m.role() == "assistant" && !m.tool_calls().empty()) {
+            std::string combined = m.content();
+            combined += ds4cpp::RenderAssistantToolCalls(m.tool_calls());
+            ds4_chat_append_message(engine, out, "assistant", combined.c_str());
+        } else if (m.role() == "tool") {
+            std::string body = ds4cpp::RenderToolResult(m.tool_call_id(), m.content());
+            ds4_chat_append_message(engine, out, "user", body.c_str());
+        } else {
+            ds4_chat_append_message(engine, out, m.role().c_str(), m.content().c_str());
+        }
+    }
+    ds4_chat_append_assistant_prefix(engine, out, think);
+}
+
+class DS4Backend final : public backend::Backend::Service {
+public:
+    GStatus Health(ServerContext *, const backend::HealthMessage *,
+                  backend::Reply *reply) override {
+        reply->set_message(std::string("OK"));
+        return GStatus::OK;
+    }
+
+    GStatus Free(ServerContext *, const backend::HealthMessage *,
+                backend::Result *result) override {
+        std::lock_guard<std::mutex> lock(g_engine_mu);
+        if (g_session) { ds4_session_free(g_session); g_session = nullptr; }
+        if (g_engine)  { ds4_engine_close(g_engine);  g_engine  = nullptr; }
+        result->set_success(true);
+        return GStatus::OK;
+    }
+
+    GStatus LoadModel(ServerContext *, const backend::ModelOptions *request,
+                     backend::Result *result) override {
+        std::lock_guard<std::mutex> lock(g_engine_mu);
+
+        if (g_engine) {
+            if (g_session) { ds4_session_free(g_session); g_session = nullptr; }
+            ds4_engine_close(g_engine);
+            g_engine = nullptr;
+        }
+
+        std::string model_path = request->modelfile();
+        if (model_path.empty()) model_path = request->model();
+        if (model_path.empty()) {
+            result->set_success(false);
+            result->set_message("ds4: ModelOptions.Model or .ModelFile must be set");
+            return GStatus::OK;
+        }
+
+        std::string mtp_path;
+        int mtp_draft = 0;
+        float mtp_margin = 3.0f;
+        for (const auto &opt : request->options()) {
+            auto [k, v] = split_option(opt);
+            if (k == "mtp_path") mtp_path = v;
+            else if (k == "mtp_draft") mtp_draft = std::stoi(v);
+            else if (k == "mtp_margin") mtp_margin = std::stof(v);
+            else if (k == "kv_cache_dir") g_kv_cache_dir = v;
+        }
+
+        g_kv_cache.SetDir(g_kv_cache_dir);
+
+        ds4_engine_options opt = {};
+        opt.model_path = model_path.c_str();
+        opt.mtp_path = mtp_path.empty() ? nullptr : mtp_path.c_str();
+        opt.n_threads = request->threads() > 0 ? request->threads() : 0;
+        opt.mtp_draft_tokens = mtp_draft;
+        opt.mtp_margin = mtp_margin;
+        opt.directional_steering_file = nullptr;
+        opt.warm_weights = false;
+        opt.quality = false;
+
+#if defined(DS4_NO_GPU)
+        opt.backend = DS4_BACKEND_CPU;
+#elif defined(__APPLE__)
+        opt.backend = DS4_BACKEND_METAL;
+#else
+        opt.backend = DS4_BACKEND_CUDA;
+#endif
+
+        int rc = ds4_engine_open(&g_engine, &opt);
+        if (rc != 0 || !g_engine) {
+            result->set_success(false);
+            result->set_message("ds4_engine_open failed (rc=" + std::to_string(rc) + ")");
+            return GStatus::OK;
+        }
+
+        g_ctx_size = request->contextsize() > 0 ? request->contextsize() : 32768;
+        rc = ds4_session_create(&g_session, g_engine, g_ctx_size);
+        if (rc != 0 || !g_session) {
+            ds4_engine_close(g_engine);
+            g_engine = nullptr;
+            result->set_success(false);
+            result->set_message("ds4_session_create failed (rc=" + std::to_string(rc) + ")");
+            return GStatus::OK;
+        }
+
+        result->set_success(true);
+        result->set_message("loaded " + model_path);
+        return GStatus::OK;
+    }
+
+    GStatus TokenizeString(ServerContext *, const backend::PredictOptions *request,
+                          backend::TokenizationResponse *response) override {
+        std::lock_guard<std::mutex> lock(g_engine_mu);
+        if (!g_engine) return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
+        ds4_tokens out = {};
+        ds4_tokenize_text(g_engine, request->prompt().c_str(), &out);
+        for (int i = 0; i < out.len; ++i) response->add_tokens(out.v[i]);
+        response->set_length(out.len);
+        ds4_tokens_free(&out);
+        return GStatus::OK;
+    }
+
+    GStatus Predict(ServerContext *, const backend::PredictOptions *request,
+                   backend::Reply *reply) override {
+        std::lock_guard<std::mutex> lock(g_engine_mu);
+        if (!g_engine || !g_session) {
+            return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
+        }
+        ds4_tokens prompt = {};
+        build_prompt(g_engine, request, &prompt);
+        int n_predict = request->tokens() > 0 ? request->tokens() : 256;
+
+        CollectCtx collect = {g_engine, "", {}, reply, 0, {}, "", ""};
+        std::string cache_key = render_prompt_text(request);
+        size_t cache_hit = maybe_load_cache(cache_key);
+        (void)cache_hit; // future: skip prompt prefix if hit covers full prompt
+
+        // Manual generation loop on g_session. When MTP speculative weights
+        // were loaded (LoadModel option 'mtp_path:'), we use the
+        // ds4_session_eval_speculative_argmax path which may accept N>1
+        // tokens per outer iteration. Otherwise per-token argmax + eval.
+        // Either way g_session advances so the disk KV cache picks up a
+        // real checkpoint after the call (see maybe_save_cache below).
+        char err[256] = {0};
+        int rc = ds4_session_sync(g_session, &prompt, err, sizeof(err));
+        int prompt_len = prompt.len;
+        ds4_tokens_free(&prompt);
+        if (rc == 0) {
+            const int eos = ds4_token_eos(g_engine);
+            const int draft_max = ds4_engine_mtp_draft_tokens(g_engine);
+            const bool think_enabled = ds4_think_mode_enabled(parse_think_mode(request));
+            int produced = 0;
+            while (produced < n_predict) {
+                SampleParams sp = compute_sample_params(request, collect.parser, think_enabled);
+                int first;
+                if (sp.temperature <= 0.0f) {
+                    first = ds4_session_argmax(g_session);
+                } else {
+                    first = ds4_session_sample(g_session,
+                                               sp.temperature, sp.top_k,
+                                               sp.top_p, sp.min_p, get_rng());
+                }
+                if (first == eos) break;
+                // MTP only when sampling is greedy (ds4-server gate).
+                if (draft_max > 0 && sp.temperature <= 0.0f) {
+                    constexpr int kAcceptedMax = 8;
+                    int accepted[kAcceptedMax];
+                    int cap = std::min(kAcceptedMax, draft_max + 1);
+                    int n = ds4_session_eval_speculative_argmax(
+                        g_session, first, draft_max, eos,
+                        accepted, cap, err, sizeof(err));
+                    if (n < 0) { rc = -1; break; }
+                    bool stop = false;
+                    for (int j = 0; j < n; ++j) {
+                        if (accepted[j] == eos) { stop = true; break; }
+                        collect_emit(&collect, accepted[j]);
+                        if (++produced >= n_predict) { stop = true; break; }
+                    }
+                    if (stop) break;
+                } else {
+                    collect_emit(&collect, first);
+                    if (++produced >= n_predict) break;
+                    rc = ds4_session_eval(g_session, first, err, sizeof(err));
+                    if (rc != 0) break;
+                }
+            }
+            collect_done(&collect);
+        }
+        maybe_save_cache(cache_key);
+
+        // Flush any buffered parser state.
+        std::vector<ds4cpp::ParserEvent> events;
+        collect.parser.Flush(events);
+        apply_events(&collect, events);
+
+        if (rc != 0) {
+            return GStatus(StatusCode::INTERNAL,
+                          std::string("ds4 generation failed: ") + err);
+        }
+
+        // Emit one ChatDelta with content/reasoning/tool_calls.
+        auto *delta = reply->add_chat_deltas();
+        delta->set_content(collect.content_buf);
+        delta->set_reasoning_content(collect.reasoning_buf);
+        for (size_t i = 0; i < collect.pending.size(); ++i) {
+            auto *tc = delta->add_tool_calls();
+            tc->set_index(static_cast<int32_t>(i));
+            tc->set_id(collect.pending[i].id);
+            tc->set_name(collect.pending[i].name);
+            tc->set_arguments(collect.pending[i].args);
+        }
+
+        reply->set_message(collect.raw_buf);
+        reply->set_tokens(collect.tokens);
+        reply->set_prompt_tokens(prompt_len);
+        return GStatus::OK;
+    }
+
+    GStatus PredictStream(ServerContext *, const backend::PredictOptions *request,
+                         ServerWriter<backend::Reply> *writer) override {
+        std::lock_guard<std::mutex> lock(g_engine_mu);
+        if (!g_engine || !g_session) {
+            return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
+        }
+        ds4_tokens prompt = {};
+        build_prompt(g_engine, request, &prompt);
+        int n_predict = request->tokens() > 0 ? request->tokens() : 256;
+
+        StreamCtx s = {g_engine, writer, {}, 0, false, {}};
+        std::string cache_key = render_prompt_text(request);
+        size_t cache_hit = maybe_load_cache(cache_key);
+        (void)cache_hit;
+
+        // Manual loop on g_session - see Predict() above for the rationale.
+        // MTP speculative path used when ds4_engine_mtp_draft_tokens > 0.
+        char err[256] = {0};
+        int rc = ds4_session_sync(g_session, &prompt, err, sizeof(err));
+        ds4_tokens_free(&prompt);
+        if (rc == 0) {
+            const int eos = ds4_token_eos(g_engine);
+            const int draft_max = ds4_engine_mtp_draft_tokens(g_engine);
+            const bool think_enabled = ds4_think_mode_enabled(parse_think_mode(request));
+            int produced = 0;
+            while (produced < n_predict && !s.aborted) {
+                SampleParams sp = compute_sample_params(request, s.parser, think_enabled);
+                int first;
+                if (sp.temperature <= 0.0f) {
+                    first = ds4_session_argmax(g_session);
+                } else {
+                    first = ds4_session_sample(g_session,
+                                               sp.temperature, sp.top_k,
+                                               sp.top_p, sp.min_p, get_rng());
+                }
+                if (first == eos) break;
+                if (draft_max > 0 && sp.temperature <= 0.0f) {
+                    constexpr int kAcceptedMax = 8;
+                    int accepted[kAcceptedMax];
+                    int cap = std::min(kAcceptedMax, draft_max + 1);
+                    int n = ds4_session_eval_speculative_argmax(
+                        g_session, first, draft_max, eos,
+                        accepted, cap, err, sizeof(err));
+                    if (n < 0) { rc = -1; break; }
+                    bool stop = false;
+                    for (int j = 0; j < n; ++j) {
+                        if (accepted[j] == eos) { stop = true; break; }
+                        stream_emit(&s, accepted[j]);
+                        if (s.aborted) { stop = true; break; }
+                        if (++produced >= n_predict) { stop = true; break; }
+                    }
+                    if (stop) break;
+                } else {
+                    stream_emit(&s, first);
+                    if (s.aborted || ++produced >= n_predict) break;
+                    rc = ds4_session_eval(g_session, first, err, sizeof(err));
+                    if (rc != 0) break;
+                }
+            }
+            stream_done(&s);
+        }
+        maybe_save_cache(cache_key);
+
+        // Flush parser state.
+        std::vector<ds4cpp::ParserEvent> events;
+        s.parser.Flush(events);
+        if (!events.empty() && !s.aborted) {
+            backend::Reply reply;
+            auto *delta = reply.add_chat_deltas();
+            for (const auto &e : events) {
+                if (e.type == ds4cpp::ParserEvent::CONTENT) {
+                    delta->set_content(delta->content() + e.text);
+                } else if (e.type == ds4cpp::ParserEvent::REASONING) {
+                    delta->set_reasoning_content(delta->reasoning_content() + e.text);
+                }
+            }
+            s.writer->Write(reply);
+        }
+
+        if (rc != 0 && !s.aborted) {
+            return GStatus(StatusCode::INTERNAL,
+                          std::string("ds4 generation failed: ") + err);
+        }
+        return GStatus::OK;
+    }
+
+    GStatus Status(ServerContext *, const backend::HealthMessage *,
+                  backend::StatusResponse *response) override {
+        std::lock_guard<std::mutex> lock(g_engine_mu);
+        response->set_state(g_engine ? backend::StatusResponse::READY
+                                     : backend::StatusResponse::UNINITIALIZED);
+        return GStatus::OK;
+    }
+};
+
+void RunServer(const std::string &addr) {
+    DS4Backend service;
+    grpc::EnableDefaultHealthCheckService(true);
+    grpc::reflection::InitProtoReflectionServerBuilderPlugin();
+
+    ServerBuilder builder;
+    builder.AddListeningPort(addr, grpc::InsecureServerCredentials());
+    builder.RegisterService(&service);
+    builder.SetMaxReceiveMessageSize(64 * 1024 * 1024);
+    builder.SetMaxSendMessageSize(64 * 1024 * 1024);
+
+    std::unique_ptr<Server> server(builder.BuildAndStart());
+    if (!server) {
+        std::cerr << "ds4 grpc-server: failed to bind " << addr << "\n";
+        std::exit(1);
+    }
+    g_server = server.get();
+    std::cerr << "ds4 grpc-server listening on " << addr << "\n";
+    server->Wait();
+}
+
+void signal_handler(int) {
+    if (auto *srv = g_server.load()) {
+        srv->Shutdown(std::chrono::system_clock::now() +
+                      std::chrono::seconds(3));
+    }
+}
+
+} // namespace
+
+int main(int argc, char *argv[]) {
+    std::string addr = "127.0.0.1:50051";
+    for (int i = 1; i < argc; ++i) {
+        std::string a = argv[i];
+        const std::string addr_flag = "--addr=";
+        if (a.rfind(addr_flag, 0) == 0) addr = a.substr(addr_flag.size());
+        else if (a == "--addr" && i + 1 < argc) addr = argv[++i];
+        else if (a == "--help" || a == "-h") {
+            std::cout << "Usage: grpc-server [--addr=HOST:PORT | --addr HOST:PORT]\n";
+            return 0;
+        }
+    }
+    std::signal(SIGINT, signal_handler);
+    std::signal(SIGTERM, signal_handler);
+    RunServer(addr);
+    return 0;
+}
--- a/backend/cpp/ds4/kv_cache.cpp
+++ b/backend/cpp/ds4/kv_cache.cpp
@@ -0,0 +1,206 @@
+#include "kv_cache.h"
+
+#include <cerrno>
+#include <cstdint>
+#include <cstdio>
+#include <cstring>
+#include <dirent.h>
+#include <fstream>
+#include <sys/stat.h>
+#include <vector>
+
+namespace ds4cpp {
+
+namespace {
+
+// Minimal SHA1 (public domain reference); used only in this file.
+struct Sha1 {
+    uint32_t h[5];
+    uint64_t bits;
+    uint8_t block[64];
+    size_t used;
+    Sha1() { h[0]=0x67452301; h[1]=0xEFCDAB89; h[2]=0x98BADCFE; h[3]=0x10325476; h[4]=0xC3D2E1F0; bits=0; used=0; }
+    static uint32_t rol(uint32_t x, int n){ return (x<<n)|(x>>(32-n)); }
+    void transform(const uint8_t *b) {
+        uint32_t w[80];
+        for (int i=0;i<16;i++) w[i] = (uint32_t)b[i*4]<<24 | (uint32_t)b[i*4+1]<<16 | (uint32_t)b[i*4+2]<<8 | b[i*4+3];
+        for (int i=16;i<80;i++) w[i] = rol(w[i-3]^w[i-8]^w[i-14]^w[i-16], 1);
+        uint32_t a=h[0],bb=h[1],c=h[2],d=h[3],e=h[4];
+        for (int i=0;i<80;i++) {
+            uint32_t f,k;
+            if (i<20)      { f=(bb&c)|((~bb)&d); k=0x5A827999; }
+            else if (i<40) { f=bb^c^d;            k=0x6ED9EBA1; }
+            else if (i<60) { f=(bb&c)|(bb&d)|(c&d); k=0x8F1BBCDC; }
+            else           { f=bb^c^d;            k=0xCA62C1D6; }
+            uint32_t t = rol(a,5)+f+e+k+w[i];
+            e=d; d=c; c=rol(bb,30); bb=a; a=t;
+        }
+        h[0]+=a; h[1]+=bb; h[2]+=c; h[3]+=d; h[4]+=e;
+    }
+    void update(const void *p, size_t n) {
+        const uint8_t *bp = (const uint8_t*)p;
+        bits += (uint64_t)n*8;
+        while (n) {
+            size_t take = 64-used;
+            if (take>n) take=n;
+            std::memcpy(block+used, bp, take);
+            used += take; bp += take; n -= take;
+            if (used == 64) { transform(block); used = 0; }
+        }
+    }
+    void final(uint8_t out[20]) {
+        uint8_t pad[64] = {0x80};
+        size_t padlen = (used < 56) ? (56-used) : (120-used);
+        uint64_t lb = bits;
+        uint8_t len[8];
+        for (int i=0;i<8;i++) len[7-i] = (uint8_t)(lb >> (i*8));
+        update(pad, padlen);
+        update(len, 8);
+        for (int i=0;i<5;i++) {
+            out[i*4]   = h[i]>>24;
+            out[i*4+1] = h[i]>>16;
+            out[i*4+2] = h[i]>>8;
+            out[i*4+3] = h[i];
+        }
+    }
+};
+
+std::string mkdir_p(const std::string &d) { // NOTE: creates only the leaf directory; parents must already exist.
+    if (d.empty()) return d;
+    struct stat st{};
+    if (stat(d.c_str(), &st) == 0) return d;
+    mkdir(d.c_str(), 0755);
+    return d;
+}
+
+bool file_exists(const std::string &p) {
+    struct stat st{};
+    return stat(p.c_str(), &st) == 0;
+}
+
+} // namespace
+
+std::string Sha1Hex(const void *data, size_t len) {
+    Sha1 s;
+    s.update(data, len);
+    uint8_t out[20];
+    s.final(out);
+    char hex[41];
+    for (int i = 0; i < 20; ++i) std::snprintf(hex + i*2, 3, "%02x", out[i]);
+    hex[40] = 0;
+    return std::string(hex);
+}
+
+KvCache::KvCache() = default;
+
+void KvCache::SetDir(const std::string &dir) {
+    dir_ = dir;
+    if (!dir_.empty()) {
+        mkdir_p(dir_);
+        std::fprintf(stderr, "ds4 KvCache: enabled at %s\n", dir_.c_str());
+    } else {
+        std::fprintf(stderr, "ds4 KvCache: disabled (no dir set)\n");
+    }
+}
+
+std::string KvCache::Path(const std::string &rendered_text) const {
+    if (dir_.empty()) return "";
+    return dir_ + "/" + Sha1Hex(rendered_text.data(), rendered_text.size()) + ".kv";
+}
+
+size_t KvCache::LoadLongestPrefix(ds4_session *session,
+                                  const std::string &rendered_text,
+                                  int ctx_size) {
+    if (dir_.empty() || !session) return 0;
+    // Strategy: enumerate all .kv files in dir, read their stored prefix
+    // header, pick the longest one that is also a prefix of rendered_text.
+    DIR *d = opendir(dir_.c_str());
+    if (!d) return 0;
+    struct dirent *de;
+    size_t best_len = 0;
+    std::string best_path;
+    while ((de = readdir(d)) != nullptr) {
+        std::string name = de->d_name;
+        if (name.size() < 4 || name.substr(name.size()-3) != ".kv") continue;
+        std::string path = dir_ + "/" + name;
+        std::ifstream f(path, std::ios::binary);
+        if (!f) continue;
+        char magic[4]; f.read(magic, 4);
+        if (f.gcount() != 4 || std::memcmp(magic, "DS4G", 4) != 0) continue;
+        uint32_t version=0, file_ctx=0, prefix_len=0;
+        f.read((char*)&version, 4); f.read((char*)&file_ctx, 4); f.read((char*)&prefix_len, 4);
+        if (version != 1) continue;
+        if ((int)file_ctx != ctx_size) continue;
+        if (prefix_len > rendered_text.size()) continue;
+        std::vector<char> prefix(prefix_len);
+        f.read(prefix.data(), prefix_len);
+        if (std::memcmp(prefix.data(), rendered_text.data(), prefix_len) != 0) continue;
+        if (prefix_len > best_len) {
+            best_len = prefix_len;
+            best_path = path;
+        }
+    }
+    closedir(d);
+    if (best_len == 0) return 0;
+
+    // Load best_path's payload into session.
+    std::ifstream f(best_path, std::ios::binary);
+    char magic[4]; f.read(magic, 4);
+    uint32_t version = 0, file_ctx = 0, prefix_len = 0;
+    f.read((char*)&version, 4); f.read((char*)&file_ctx, 4); f.read((char*)&prefix_len, 4);
+    f.seekg(prefix_len, std::ios::cur);
+    uint64_t payload_bytes = 0;
+    f.read((char*)&payload_bytes, 8); if (!f) return 0; // truncated or unreadable header
+    // ds4_session_load_payload reads from a FILE*; reopen via fopen.
+    FILE *fp = std::fopen(best_path.c_str(), "rb");
+    if (!fp) return 0;
+    // Seek past header + prefix + payload_bytes field.
+    std::fseek(fp, 4 + 4 + 4 + 4 + prefix_len + 8, SEEK_SET);
+    char errbuf[256] = {0};
+    int rc = ds4_session_load_payload(session, fp, payload_bytes, errbuf, sizeof(errbuf));
+    std::fclose(fp);
+    if (rc != 0) return 0;
+    return best_len;
+}
+
+void KvCache::Save(ds4_session *session, const std::string &rendered_text, int ctx_size) {
+    if (dir_.empty()) {
+        std::fprintf(stderr, "ds4 KvCache::Save: skipped (dir empty)\n");
+        return;
+    }
+    if (!session) {
+        std::fprintf(stderr, "ds4 KvCache::Save: skipped (session null)\n");
+        return;
+    }
+    std::string path = Path(rendered_text);
+    uint64_t payload_bytes = ds4_session_payload_bytes(session);
+    std::fprintf(stderr, "ds4 KvCache::Save: path=%s payload_bytes=%llu prefix_len=%zu\n",
+                 path.c_str(), (unsigned long long)payload_bytes, rendered_text.size());
+    FILE *fp = std::fopen(path.c_str(), "wb");
+    if (!fp) {
+        std::fprintf(stderr, "ds4 KvCache::Save: fopen failed: %s\n", std::strerror(errno));
+        return;
+    }
+    char magic[4] = {'D','S','4','G'};
+    uint32_t version = 1;
+    uint32_t ctx = static_cast<uint32_t>(ctx_size);
+    uint32_t prefix_len = static_cast<uint32_t>(rendered_text.size());
+    std::fwrite(magic, 4, 1, fp);
+    std::fwrite(&version, 4, 1, fp);
+    std::fwrite(&ctx, 4, 1, fp);
+    std::fwrite(&prefix_len, 4, 1, fp);
+    std::fwrite(rendered_text.data(), prefix_len, 1, fp);
+    std::fwrite(&payload_bytes, 8, 1, fp);
+    char errbuf[256] = {0};
+    int rc = ds4_session_save_payload(session, fp, errbuf, sizeof(errbuf));
+    std::fclose(fp);
+    if (rc != 0) {
+        std::fprintf(stderr, "ds4 KvCache::Save: ds4_session_save_payload rc=%d err=%s; removing %s\n",
+                     rc, errbuf, path.c_str());
+        std::remove(path.c_str());
+    } else {
+        std::fprintf(stderr, "ds4 KvCache::Save: wrote %s ok\n", path.c_str());
+    }
+}
+
+} // namespace ds4cpp
--- a/backend/cpp/ds4/kv_cache.h
+++ b/backend/cpp/ds4/kv_cache.h
@@ -0,0 +1,44 @@
+#pragma once
+#include <string>
+extern "C" {
+#include "ds4.h"
+}
+
+namespace ds4cpp {
+
+// Disk-backed KV cache for ds4 sessions. Keyed by SHA1(rendered prompt prefix).
+// Format (our own, NOT bit-compatible with ds4-server's KVC files - interop
+// is a follow-up plan):
+//
+//   "DS4G" (4 bytes magic) + u32 version=1 + u32 ctx_size +
+//   u32 prefix_text_len + prefix_text + u64 payload_bytes + payload
+class KvCache {
+public:
+    KvCache(); // disabled (dir empty)
+
+    // Set the cache directory. Empty disables.
+    void SetDir(const std::string &dir);
+
+    // Returns the cache file path for a given rendered text prefix.
+    std::string Path(const std::string &rendered_text) const;
+
+    // Look up the longest cached prefix that is also a prefix of
+    // `rendered_text`. Loads it into `session` if found. Returns the
+    // matched prefix length in bytes (0 if no hit).
+    size_t LoadLongestPrefix(ds4_session *session,
+                             const std::string &rendered_text,
+                             int ctx_size);
+
+    // Save the current session, associated with this rendered text prefix.
+    void Save(ds4_session *session, const std::string &rendered_text, int ctx_size);
+
+    bool enabled() const { return !dir_.empty(); }
+
+private:
+    std::string dir_;
+};
+
+// Compute SHA1 of arbitrary bytes; returns 40-char hex.
+std::string Sha1Hex(const void *data, size_t len);
+
+} // namespace ds4cpp
--- a/backend/cpp/ds4/package.sh
+++ b/backend/cpp/ds4/package.sh
@@ -0,0 +1,39 @@
+#!/bin/bash
+set -e
+CURDIR=$(dirname "$(realpath "$0")")
+REPO_ROOT="${CURDIR}/../../.."
+
+mkdir -p "$CURDIR/package/lib"
+cp -avf "$CURDIR/grpc-server" "$CURDIR/package/"
+cp -rfv "$CURDIR/run.sh"     "$CURDIR/package/"
+
+UNAME_S=$(uname -s)
+if [ "$UNAME_S" = "Darwin" ]; then
+    # Darwin: bundle dylibs via otool -L (handled by scripts/build/ds4-darwin.sh).
+    echo "package.sh: Darwin handled by ds4-darwin.sh"
+    exit 0
+fi
+
+if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
+    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
+    LIBDIR=/lib/x86_64-linux-gnu
+elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
+    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
+    LIBDIR=/lib/aarch64-linux-gnu
+else
+    echo "package.sh: unknown architecture" >&2; exit 1
+fi
+
+for lib in libc.so.6 libgcc_s.so.1 libstdc++.so.6 libm.so.6 libgomp.so.1 \
+           libdl.so.2 librt.so.1 libpthread.so.0; do
+    cp -arfLv "$LIBDIR/$lib" "$CURDIR/package/lib/$lib"
+done
+
+GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
+if [ -f "$GPU_LIB_SCRIPT" ]; then
+    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
+    package_gpu_libs
+fi
+
+echo "ds4 package contents:"
+ls -lah "$CURDIR/package/" "$CURDIR/package/lib/"
--- a/backend/cpp/ds4/run.sh
+++ b/backend/cpp/ds4/run.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+# Entry point for the ds4 backend image / BACKEND_BINARY mode.
+set -e
+CURDIR=$(dirname "$(realpath "$0")")
+export LD_LIBRARY_PATH="$CURDIR/lib:$LD_LIBRARY_PATH"
+if [ -f "$CURDIR/lib/ld.so" ]; then
+    exec "$CURDIR/lib/ld.so" "$CURDIR/grpc-server" "$@"
+fi
+exec "$CURDIR/grpc-server" "$@"
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=23127139cb6fa314899c3b5f4935b88b3374c56c
+IK_LLAMA_VERSION?=eb570eb96689c235933b813693ca28ab9d3d26de
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=389ff61d77b5c71cec0cf92fe4e5d01ace80b797
+LLAMA_VERSION?=1ec7ba0c14f33f17e980daeeda5f35b225d41994
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -36,6 +36,8 @@
 #include <cstdlib>
 #include <fstream>
 #include <iterator>
+#include <list>
+#include <map>
 #include <mutex>
 #include <signal.h>
 #include <thread>
@@ -443,10 +445,22 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
    // Draft model for speculative decoding
    if (!request->draftmodel().empty()) {
        params.speculative.draft.mparams.path = request->draftmodel();
-        // Default to draft type if a draft model is set but no explicit type
+        // Default to draft type if a draft model is set but no explicit type.
+        // Upstream (post ggml-org/llama.cpp#22838) made the speculative type a
+        // vector; the turboquant fork still uses the legacy scalar. The
+        // LOCALAI_LEGACY_LLAMA_CPP_SPEC macro is injected by
+        // backend/cpp/turboquant/patch-grpc-server.sh for fork builds only.
+#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
        if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
            params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
        }
+#else
+        const bool no_spec_type = params.speculative.types.empty() ||
+            (params.speculative.types.size() == 1 && params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE);
+        if (no_spec_type) {
+            params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT };
+        }
+#endif
    }

    //  params.model_alias ??
@@ -673,10 +687,35 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            }
        // Speculative decoding options
        } else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
-            auto type = common_speculative_type_from_name(optval_str);
+#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
+            // Fork only knows a single scalar `type`. Take the first comma-
+            // separated value and assign it via the singular helper.
+            std::string first = optval_str;
+            const auto comma = first.find(',');
+            if (comma != std::string::npos) first = first.substr(0, comma);
+            auto type = common_speculative_type_from_name(first);
            if (type != COMMON_SPECULATIVE_TYPE_COUNT) {
                params.speculative.type = type;
            }
+#else
+            // Upstream switched to a vector of types (comma-separated for multi-type
+            // chaining via common_speculative_types_from_names). We keep accepting a
+            // single value here, but also tolerate comma-separated lists.
+            std::vector<std::string> names;
+            std::string item;
+            for (char c : optval_str) {
+                if (c == ',') {
+                    if (!item.empty()) { names.push_back(item); item.clear(); }
+                } else {
+                    item.push_back(c);
+                }
+            }
+            if (!item.empty()) names.push_back(item);
+            auto parsed = common_speculative_types_from_names(names);
+            if (!parsed.empty()) {
+                params.speculative.types = parsed;
+            }
+#endif
        } else if (!strcmp(optname, "spec_n_max") || !strcmp(optname, "draft_max")) {
            if (optval != NULL) {
                try { params.speculative.draft.n_max = std::stoi(optval_str); } catch (...) {}
@@ -710,10 +749,155 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                try { params.speculative.draft.n_gpu_layers = std::stoi(optval_str); } catch (...) {}
            }
        } else if (!strcmp(optname, "draft_ctx_size")) {
-            if (optval != NULL) {
-                try { params.speculative.draft.n_ctx = std::stoi(optval_str); } catch (...) {}
-            }
+            // The draft context size is no longer a separate field upstream: the draft
+            // shares the target context size. Accept the option for backward
+            // compatibility but silently ignore it.
+
+// Everything below relies on struct shape introduced in ggml-org/llama.cpp#22838
+// (parallel drafting): `ngram_mod`, `ngram_map_k`, `ngram_map_k4v`,
+// `ngram_cache`, and the `draft.{cache_type_*, cpuparams*, tensor_buft_overrides}`
+// fields. The turboquant fork branched before that, so its build defines
+// LOCALAI_LEGACY_LLAMA_CPP_SPEC via patch-grpc-server.sh and these option
+// keys become unrecognized (silently dropped, like any unknown opt) for it.
+//
+// The `#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC` / `#else` split below sits at the
+// closing-brace position of the `draft_ctx_size` branch on purpose: in the
+// legacy build the chain ends here (the brace closes draft_ctx_size), and in
+// the modern build the chain continues with `} else if (...)` instead, so the
+// brace count stays balanced under both branches of the preprocessor.
+#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
        }
+#else
+        // --- ngram_mod family (upstream --spec-ngram-mod-*) ---
+        } else if (!strcmp(optname, "spec_ngram_mod_n_min")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_mod.n_min = std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_mod_n_max")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_mod.n_max = std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_mod_n_match")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_mod.n_match = std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- ngram_map_k family (upstream --spec-ngram-map-k-*) ---
+        } else if (!strcmp(optname, "spec_ngram_map_k_size_n")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_map_k_size_m")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_map_k_min_hits")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- ngram_map_k4v family (upstream --spec-ngram-map-k4v-*) ---
+        } else if (!strcmp(optname, "spec_ngram_map_k4v_size_n")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k4v.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_map_k4v_size_m")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k4v.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_map_k4v_min_hits")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k4v.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- ngram lookup caches (upstream --lookup-cache-static / -dynamic) ---
+        } else if (!strcmp(optname, "spec_lookup_cache_static") || !strcmp(optname, "lookup_cache_static")) {
+            params.speculative.ngram_cache.lookup_cache_static = optval_str;
+        } else if (!strcmp(optname, "spec_lookup_cache_dynamic") || !strcmp(optname, "lookup_cache_dynamic")) {
+            params.speculative.ngram_cache.lookup_cache_dynamic = optval_str;
+
+        // --- draft model KV cache types (upstream --spec-draft-type-k / -v) ---
+        } else if (!strcmp(optname, "draft_cache_type_k") || !strcmp(optname, "spec_draft_cache_type_k")) {
+            params.speculative.draft.cache_type_k = kv_cache_type_from_str(optval_str);
+        } else if (!strcmp(optname, "draft_cache_type_v") || !strcmp(optname, "spec_draft_cache_type_v")) {
+            params.speculative.draft.cache_type_v = kv_cache_type_from_str(optval_str);
+
+        // --- draft model thread counts (upstream --spec-draft-threads / -batch) ---
+        } else if (!strcmp(optname, "draft_threads") || !strcmp(optname, "spec_draft_threads")) {
+            if (optval != NULL) {
+                try {
+                    int n = std::stoi(optval_str);
+                    if (n <= 0) n = (int)std::thread::hardware_concurrency();
+                    params.speculative.draft.cpuparams.n_threads = n;
+                } catch (...) {}
+            }
+        } else if (!strcmp(optname, "draft_threads_batch") || !strcmp(optname, "spec_draft_threads_batch")) {
+            if (optval != NULL) {
+                try {
+                    int n = std::stoi(optval_str);
+                    if (n <= 0) n = (int)std::thread::hardware_concurrency();
+                    params.speculative.draft.cpuparams_batch.n_threads = n;
+                } catch (...) {}
+            }
+
+        // --- draft model MoE on CPU (upstream --spec-draft-cpu-moe / --spec-draft-n-cpu-moe) ---
+        } else if (!strcmp(optname, "draft_cpu_moe") || !strcmp(optname, "spec_draft_cpu_moe")) {
+            // Bool-style flag: a missing value or "true"/"1"/"yes"/"on"/"enabled" enables it.
+            const bool enable = (optval == NULL) ||
+                optval_str == "true" || optval_str == "1" || optval_str == "yes" ||
+                optval_str == "on" || optval_str == "enabled";
+            if (enable) {
+                params.speculative.draft.tensor_buft_overrides.push_back(llm_ffn_exps_cpu_override());
+            }
+        } else if (!strcmp(optname, "draft_n_cpu_moe") || !strcmp(optname, "spec_draft_n_cpu_moe")) {
+            if (optval != NULL) {
+                try {
+                    int n = std::stoi(optval_str);
+                    if (n < 0) n = 0;
+                    // Keep override-name storage alive for the lifetime of the params struct
+                    // (mirrors upstream arg.cpp behavior with a function-local static).
+                    static std::list<std::string> buft_overrides_draft;
+                    for (int i = 0; i < n; ++i) {
+                        buft_overrides_draft.push_back(llm_ffn_exps_block_regex(i));
+                        params.speculative.draft.tensor_buft_overrides.push_back(
+                            {buft_overrides_draft.back().c_str(), ggml_backend_cpu_buffer_type()});
+                    }
+                } catch (...) {}
+            }
+
+        // --- draft model tensor buffer overrides (upstream --spec-draft-override-tensor) ---
+        } else if (!strcmp(optname, "draft_override_tensor") || !strcmp(optname, "spec_draft_override_tensor")) {
+            // Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...
+            // We replicate upstream's parse_tensor_buffer_overrides (static in arg.cpp).
+            ggml_backend_load_all();
+            std::map<std::string, ggml_backend_buffer_type_t> buft_list;
+            for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
+                auto * dev = ggml_backend_dev_get(i);
+                auto * buft = ggml_backend_dev_buffer_type(dev);
+                if (buft) {
+                    buft_list[ggml_backend_buft_name(buft)] = buft;
+                }
+            }
+            static std::list<std::string> draft_override_names;
+            std::string cur;
+            auto flush = [&](const std::string & spec) {
+                auto pos = spec.find('=');
+                if (pos == std::string::npos) return;
+                const std::string name = spec.substr(0, pos);
+                const std::string type = spec.substr(pos + 1);
+                auto it = buft_list.find(type);
+                if (it == buft_list.end()) return; // unknown buffer type: ignore
+                draft_override_names.push_back(name);
+                params.speculative.draft.tensor_buft_overrides.push_back(
+                    {draft_override_names.back().c_str(), it->second});
+            };
+            for (char c : optval_str) {
+                if (c == ',') { if (!cur.empty()) { flush(cur); cur.clear(); } }
+                else { cur.push_back(c); }
+            }
+            if (!cur.empty()) flush(cur);
+        }
+#endif // LOCALAI_LEGACY_LLAMA_CPP_SPEC — closes the `else`/`#ifdef` opened at draft_ctx_size
    }

    // Set params.n_parallel from environment variable if not set via options (fallback)
@@ -2704,7 +2888,7 @@ public:

            tasks.reserve(documents.size());
            for (size_t i = 0; i < documents.size(); i++) {
-                auto tmp = format_prompt_rerank(ctx_server.impl->model, ctx_server.impl->vocab, ctx_server.impl->mctx, request->query(), documents[i]);
+                auto tmp = format_prompt_rerank(ctx_server.impl->model_tgt, ctx_server.impl->vocab, ctx_server.impl->mctx, request->query(), documents[i]);
                server_task task = server_task(SERVER_TASK_TYPE_RERANK);
                task.id = rd.queue_tasks.get_new_id();
                task.index = i;
@@ -2882,7 +3066,7 @@ public:
                // Get template source and reconstruct a common_chat_template for analysis
                std::string tmpl_src = common_chat_templates_source(ctx_server.impl->chat_params.tmpls.get());
                if (!tmpl_src.empty()) {
-                    const auto * vocab = llama_model_get_vocab(ctx_server.impl->model);
+                    const auto * vocab = llama_model_get_vocab(ctx_server.impl->model_tgt);
                    std::string token_bos, token_eos;
                    if (vocab) {
                        auto bos_id = llama_vocab_bos(vocab);
--- a/backend/cpp/turboquant/patch-grpc-server.sh
+++ b/backend/cpp/turboquant/patch-grpc-server.sh
@@ -108,4 +108,47 @@ else
    echo "==> $SRC has no post-#22397 speculative field refs, skipping spec rename patch"
 fi

+# 4. Revert the `ctx_server.impl->model_tgt` rename introduced by upstream
+#    ggml-org/llama.cpp#22838 (parallel drafting). The turboquant fork still
+#    exposes the field as `model` on `server_context_impl`. The two call sites
+#    are in the Rerank and ModelMetadata RPC handlers.
+if grep -q 'ctx_server\.impl->model_tgt' "$SRC"; then
+    echo "==> patching $SRC to revert ctx_server.impl->model_tgt -> ctx_server.impl->model"
+    sed -E 's/ctx_server\.impl->model_tgt/ctx_server.impl->model/g' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+    echo "==> model_tgt rename OK"
+else
+    echo "==> $SRC has no ctx_server.impl->model_tgt refs, skipping model_tgt rename patch"
+fi
+
+# 5. Define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top of the file so the
+#    grpc-server option parser skips the new option-handler blocks (ngram_mod,
+#    ngram_map_k, ngram_map_k4v, ngram_cache, draft.cache_type_*, draft.cpuparams*,
+#    draft.tensor_buft_overrides) introduced for the post-#22838 layout. Those
+#    blocks reference struct fields that simply do not exist in the fork.
+if grep -q '^#define LOCALAI_LEGACY_LLAMA_CPP_SPEC' "$SRC"; then
+    echo "==> $SRC already defines LOCALAI_LEGACY_LLAMA_CPP_SPEC, skipping"
+else
+    echo "==> patching $SRC to define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top"
+    # Insert the define before the very first `#include` so it precedes all the
+    # speculative-decoding code paths.
+    awk '
+        !done && /^#include/ {
+            print "#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1"
+            print "// ^ injected by backend/cpp/turboquant/patch-grpc-server.sh"
+            print ""
+            done = 1
+        }
+        { print }
+        END {
+            if (!done) {
+                print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_LEGACY_LLAMA_CPP_SPEC" > "/dev/stderr"
+                exit 1
+            }
+        }
+    ' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+    echo "==> LOCALAI_LEGACY_LLAMA_CPP_SPEC define OK"
+fi
+
 echo "==> all patches applied"
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -72,6 +72,29 @@
    nvidia-cuda-12: "cuda12-turboquant"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-turboquant"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-turboquant"
+- &ds4
+  name: "ds4"
+  alias: "ds4"
+  license: mit
+  description: |
+    antirez/ds4 - DeepSeek V4 Flash inference engine. Single-model,
+    optimized for Metal (Darwin) and CUDA (Linux). Requires the GGUFs
+    published at huggingface.co/antirez/deepseek-v4-gguf.
+  urls:
+    - https://github.com/antirez/ds4
+  tags:
+    - text-to-text
+    - LLM
+    - CPU
+    - CUDA
+    - Metal
+  capabilities:
+    default: "cpu-ds4"
+    nvidia: "cuda13-ds4"
+    nvidia-cuda-13: "cuda13-ds4"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ds4"
+    metal: "metal-ds4"
+    metal-darwin-arm64: "metal-ds4"
 - &whispercpp
  name: "whisper"
  alias: "whisper"
@@ -1127,6 +1150,15 @@
    nvidia-cuda-12: "cuda12-turboquant-development"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-turboquant-development"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-turboquant-development"
+- !!merge <<: *ds4
+  name: "ds4-development"
+  capabilities:
+    default: "cpu-ds4-development"
+    nvidia: "cuda13-ds4-development"
+    nvidia-cuda-13: "cuda13-ds4-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ds4-development"
+    metal: "metal-ds4-development"
+    metal-darwin-arm64: "metal-ds4-development"
 - !!merge <<: *stablediffusionggml
  name: "stablediffusion-ggml-development"
  capabilities:
@@ -1673,6 +1705,47 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant"
  mirrors:
    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant
+## ds4
+- !!merge <<: *ds4
+  name: "cpu-ds4"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-ds4"
+  mirrors:
+    - localai/localai-backends:latest-cpu-ds4
+- !!merge <<: *ds4
+  name: "cpu-ds4-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-ds4"
+  mirrors:
+    - localai/localai-backends:master-cpu-ds4
+- !!merge <<: *ds4
+  name: "cuda13-ds4"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-ds4"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-13-ds4
+- !!merge <<: *ds4
+  name: "cuda13-ds4-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-ds4"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-13-ds4
+- !!merge <<: *ds4
+  name: "cuda13-nvidia-l4t-arm64-ds4"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-ds4"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-ds4
+- !!merge <<: *ds4
+  name: "cuda13-nvidia-l4t-arm64-ds4-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-ds4"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-ds4
+- !!merge <<: *ds4
+  name: "metal-ds4"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-ds4"
+  mirrors:
+    - localai/localai-backends:latest-metal-darwin-arm64-ds4
+- !!merge <<: *ds4
+  name: "metal-ds4-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-ds4"
+  mirrors:
+    - localai/localai-backends:master-metal-darwin-arm64-ds4
 ## whisper
 - !!merge <<: *whispercpp
  name: "whisper-development"
--- a/backend/python/transformers/requirements-cpu.txt
+++ b/backend/python/transformers/requirements-cpu.txt
@@ -2,7 +2,7 @@ torch==2.7.1
 llvmlite==0.43.0
 numba==0.60.0
 accelerate
-transformers>=5.0.0
+transformers>=5.8.0
 bitsandbytes
 sentence-transformers==5.4.0
 diffusers
--- a/backend/python/transformers/requirements-cublas12.txt
+++ b/backend/python/transformers/requirements-cublas12.txt
@@ -2,7 +2,7 @@ torch==2.7.1
 accelerate
 llvmlite==0.43.0
 numba==0.60.0
-transformers>=5.0.0
+transformers>=5.8.0
 bitsandbytes
 sentence-transformers==5.4.0
 diffusers
--- a/backend/python/transformers/requirements-cublas13.txt
+++ b/backend/python/transformers/requirements-cublas13.txt
@@ -2,7 +2,7 @@
 torch==2.9.0
 llvmlite==0.43.0
 numba==0.60.0
-transformers>=5.0.0
+transformers>=5.8.0
 bitsandbytes
 sentence-transformers==5.4.0
 diffusers
--- a/backend/python/transformers/requirements-hipblas.txt
+++ b/backend/python/transformers/requirements-hipblas.txt
@@ -1,7 +1,7 @@
 --extra-index-url https://download.pytorch.org/whl/rocm7.0
 torch==2.10.0+rocm7.0
 accelerate
-transformers>=5.0.0
+transformers>=5.8.0
 llvmlite==0.43.0
 numba==0.60.0
 bitsandbytes
--- a/backend/python/transformers/requirements-intel.txt
+++ b/backend/python/transformers/requirements-intel.txt
@@ -3,7 +3,7 @@ torch
 optimum[openvino]
 llvmlite==0.43.0
 numba==0.60.0
-transformers>=5.0.0
+transformers>=5.8.0
 bitsandbytes
 sentence-transformers==5.4.0
 diffusers
--- a/backend/python/transformers/requirements-mps.txt
+++ b/backend/python/transformers/requirements-mps.txt
@@ -2,7 +2,7 @@ torch==2.7.1
 llvmlite==0.43.0
 numba==0.60.0
 accelerate
-transformers>=5.0.0
+transformers>=5.8.0
 bitsandbytes
 sentence-transformers==5.4.0
 diffusers
--- a/backend/python/vllm/pyproject.toml
+++ b/backend/python/vllm/pyproject.toml
@@ -33,7 +33,7 @@ dependencies = [
    "certifi",
    "setuptools",
    "pillow",
-    "charset-normalizer>=3.4.0",
+    "charset-normalizer>=3.4.7",
    "chardet",
    # L4T-specific accelerator stack (sourced from jetson-ai-lab below).
    "torch",
--- a/backend/python/vllm/requirements.txt
+++ b/backend/python/vllm/requirements.txt
@@ -3,5 +3,5 @@ protobuf
 certifi
 setuptools
 pillow
-charset-normalizer>=3.4.0
+charset-normalizer>=3.4.7
 chardet
--- a/core/gallery/importers/ds4.go
+++ b/core/gallery/importers/ds4.go
@@ -0,0 +1,130 @@
+package importers
+
+import (
+	"encoding/json"
+	"path/filepath"
+	"strings"
+
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/gallery"
+	"github.com/mudler/LocalAI/core/schema"
+	"github.com/mudler/LocalAI/pkg/downloader"
+	"github.com/mudler/LocalAI/pkg/functions"
+	"go.yaml.in/yaml/v2"
+)
+
+var _ Importer = &DS4Importer{}
+
+// DS4Importer detects weights for antirez/ds4, a single-model DeepSeek V4
+// Flash inference engine. ds4 only loads the GGUFs published at
+// huggingface.co/antirez/deepseek-v4-gguf; auto-detection keys on:
+//
+//   - the repo name itself ("antirez/deepseek-v4-gguf" anywhere in URI)
+//   - the canonical filename pattern "DeepSeek-V4-Flash-*.gguf"
+//
+// Must register BEFORE LlamaCPPImporter - both match .gguf, but ds4 is
+// more specific and first-match-wins.
+type DS4Importer struct{}
+
+func (i *DS4Importer) Name() string      { return "ds4" }
+func (i *DS4Importer) Modality() string  { return "text" }
+func (i *DS4Importer) AutoDetects() bool { return true }
+
+func (i *DS4Importer) Match(details Details) bool {
+	preferences, err := details.Preferences.MarshalJSON()
+	if err != nil {
+		return false
+	}
+	preferencesMap := make(map[string]any)
+	if len(preferences) > 0 {
+		_ = json.Unmarshal(preferences, &preferencesMap)
+	}
+
+	if b, ok := preferencesMap["backend"].(string); ok && b == "ds4" {
+		return true
+	}
+
+	if strings.Contains(details.URI, "antirez/deepseek-v4-gguf") {
+		return true
+	}
+
+	base := filepath.Base(details.URI)
+	if strings.HasPrefix(base, "DeepSeek-V4-Flash-") && strings.HasSuffix(base, ".gguf") {
+		return true
+	}
+
+	if details.HuggingFace != nil {
+		for _, file := range details.HuggingFace.Files {
+			fb := filepath.Base(file.Path)
+			if strings.HasPrefix(fb, "DeepSeek-V4-Flash-") && strings.HasSuffix(fb, ".gguf") {
+				return true
+			}
+		}
+	}
+
+	return false
+}
+
+func (i *DS4Importer) Import(details Details) (gallery.ModelConfig, error) {
+	preferences, err := details.Preferences.MarshalJSON()
+	if err != nil {
+		return gallery.ModelConfig{}, err
+	}
+	preferencesMap := make(map[string]any)
+	if len(preferences) > 0 {
+		_ = json.Unmarshal(preferences, &preferencesMap)
+	}
+
+	name, ok := preferencesMap["name"].(string)
+	if !ok {
+		name = filepath.Base(details.URI)
+		name = strings.TrimSuffix(name, ".gguf")
+	}
+	description, ok := preferencesMap["description"].(string)
+	if !ok {
+		description = "DeepSeek V4 Flash - antirez/ds4 backend"
+	}
+
+	modelConfig := config.ModelConfig{
+		Name:                name,
+		Description:         description,
+		KnownUsecaseStrings: []string{config.UsecaseChat},
+		Backend:             "ds4",
+		PredictionOptions: schema.PredictionOptions{
+			BasicModelRequest: schema.BasicModelRequest{
+				Model: "ds4flash.gguf",
+			},
+		},
+		TemplateConfig: config.TemplateConfig{
+			UseTokenizerTemplate: true,
+		},
+		FunctionsConfig: functions.FunctionsConfig{
+			GrammarConfig: functions.GrammarConfig{NoGrammar: true},
+			// ds4 emits OpenAI-shape tool_calls in ChatDelta natively via
+			// our DSML parser; the Go-side regex fallback should NOT fire.
+			AutomaticToolParsingFallback: false,
+		},
+	}
+
+	cfg := gallery.ModelConfig{
+		Name:        name,
+		Description: description,
+	}
+
+	// The file to fetch: derive from the URI. We standardize the local
+	// filename to "ds4flash.gguf" to match ds4's own convention (its CLI
+	// defaults to that path), so users can run the model without extra
+	// config.
+	uri := downloader.URI(details.URI)
+	cfg.Files = append(cfg.Files, gallery.File{
+		Filename: "ds4flash.gguf",
+		URI:      string(uri),
+	})
+
+	out, err := yaml.Marshal(modelConfig)
+	if err != nil {
+		return gallery.ModelConfig{}, err
+	}
+	cfg.ConfigFile = string(out)
+	return cfg, nil
+}
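Reviewer note on `DS4Importer.Match`: the URI-based part of the match can be exercised in isolation. The sketch below is a hypothetical stand-alone helper (`looksLikeDS4` is not part of the PR). It covers only the repo-name and filename checks and leaves out the backend preference and the HuggingFace file-list scan, which need the `Details` type.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// looksLikeDS4 is a hypothetical reduction of the URI checks in
// DS4Importer.Match: repo name anywhere in the URI, or the canonical
// "DeepSeek-V4-Flash-*.gguf" filename.
func looksLikeDS4(uri string) bool {
	if strings.Contains(uri, "antirez/deepseek-v4-gguf") {
		return true
	}
	base := filepath.Base(uri)
	return strings.HasPrefix(base, "DeepSeek-V4-Flash-") && strings.HasSuffix(base, ".gguf")
}

func main() {
	uris := []string{
		"huggingface://antirez/deepseek-v4-gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf",
		"https://example.com/mirror/DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf",
		"huggingface://TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf",
	}
	for _, u := range uris {
		fmt.Println(looksLikeDS4(u), u)
	}
}
```

The first two URIs are the ones used in `ds4_test.go`; the third is the negative case that must fall through to the llama-cpp importer.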
--- a/core/gallery/importers/ds4_test.go
+++ b/core/gallery/importers/ds4_test.go
@@ -0,0 +1,69 @@
+package importers_test
+
+import (
+	"encoding/json"
+	"strings"
+
+	. "github.com/mudler/LocalAI/core/gallery/importers"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("DS4Importer", func() {
+	var importer *DS4Importer
+
+	BeforeEach(func() {
+		importer = &DS4Importer{}
+	})
+
+	Context("Match", func() {
+		It("matches the canonical HuggingFace repo URI", func() {
+			details := Details{
+				URI: "huggingface://antirez/deepseek-v4-gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf",
+			}
+			Expect(importer.Match(details)).To(BeTrue())
+		})
+
+		It("matches when filename has the DeepSeek-V4-Flash prefix", func() {
+			details := Details{
+				URI: "https://example.com/mirror/DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf",
+			}
+			Expect(importer.Match(details)).To(BeTrue())
+		})
+
+		It("matches when backend preference is ds4", func() {
+			prefs := json.RawMessage(`{"backend": "ds4"}`)
+			details := Details{
+				URI:         "https://example.com/some-other.gguf",
+				Preferences: prefs,
+			}
+			Expect(importer.Match(details)).To(BeTrue())
+		})
+
+		It("does not match arbitrary GGUFs (must fall through to llama-cpp)", func() {
+			details := Details{URI: "huggingface://TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf"}
+			Expect(importer.Match(details)).To(BeFalse())
+		})
+
+		It("does not match non-GGUF assets", func() {
+			details := Details{URI: "https://example.com/model.bin"}
+			Expect(importer.Match(details)).To(BeFalse())
+		})
+	})
+
+	Context("Import", func() {
+		It("emits backend: ds4 and the standard ds4flash.gguf filename", func() {
+			details := Details{
+				URI: "huggingface://antirez/deepseek-v4-gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf",
+			}
+			cfg, err := importer.Import(details)
+			Expect(err).NotTo(HaveOccurred())
+			Expect(cfg.Files).To(HaveLen(1))
+			Expect(cfg.Files[0].Filename).To(Equal("ds4flash.gguf"))
+			Expect(cfg.Files[0].URI).To(Equal(details.URI))
+			Expect(strings.Contains(cfg.ConfigFile, "backend: ds4")).To(BeTrue(),
+				"ConfigFile must specify backend: ds4, got: %s", cfg.ConfigFile)
+			Expect(strings.Contains(cfg.ConfigFile, "use_tokenizer_template: true")).To(BeTrue())
+		})
+	})
+})
--- a/core/gallery/importers/importers.go
+++ b/core/gallery/importers/importers.go
@@ -153,6 +153,11 @@ var defaultImporters = []Importer{
 	// checkpoints may carry tokenizer-adjacent artefacts.
 	&RFDetrImporter{},
 	// Existing
+	// DS4Importer must precede LlamaCPPImporter - ds4 weights are GGUFs and
+	// would otherwise be claimed by the generic llama-cpp importer. It matches
+	// only the antirez/deepseek-v4-gguf repo, the DeepSeek-V4-Flash-*.gguf
+	// filename pattern, or an explicit ds4 preference; other GGUFs fall through.
+	&DS4Importer{},
 	&LlamaCPPImporter{},
 	&MLXImporter{},
 	&VLLMImporter{},
--- a/core/http/endpoints/localai/backend.go
+++ b/core/http/endpoints/localai/backend.go
@@ -23,6 +23,8 @@ import (
 // backends that should appear in the import form dropdown.
 var knownPrefOnlyBackends = []schema.KnownBackend{
 	// Text LLM
+	// ds4: antirez/ds4 - single-model DeepSeek V4 Flash engine; auto-detected via DS4Importer
+	{Name: "ds4", Modality: "text", AutoDetect: false, Description: "antirez/ds4 DeepSeek V4 Flash engine (auto-detected; pref-only fallback)"},
 	{Name: "sglang", Modality: "text", AutoDetect: false, Description: "SGLang runtime (preference-only)"},
 	{Name: "tinygrad", Modality: "text", AutoDetect: false, Description: "tinygrad runtime (preference-only)"},
 	{Name: "trl", Modality: "text", AutoDetect: false, Description: "Transformers Reinforcement Learning (preference-only)"},
--- a/core/http/endpoints/ollama/capabilities.go
+++ b/core/http/endpoints/ollama/capabilities.go
@@ -0,0 +1,142 @@
+package ollama
+
+import (
+	"regexp"
+	"strings"
+
+	"github.com/mudler/LocalAI/core/config"
+)
+
+// modelCapabilities maps a LocalAI ModelConfig to the Ollama capability strings
+// (https://github.com/ollama/ollama/blob/main/docs/api.md#show-model-information).
+//
+// Ollama clients use these to decide which models are eligible for a given task
+// (e.g. only allow embedding models in an "embedding model" picker). Returning
+// an empty list makes clients assume "completion" everywhere, which is wrong
+// for embedding/rerank/audio backends — see issue #9760.
+func modelCapabilities(cfg *config.ModelConfig) []string {
+	if cfg == nil {
+		return nil
+	}
+
+	var caps []string
+
+	if cfg.HasUsecases(config.FLAG_EMBEDDINGS) {
+		caps = append(caps, "embedding")
+	}
+
+	chatCapable := cfg.HasUsecases(config.FLAG_CHAT) || cfg.HasUsecases(config.FLAG_COMPLETION)
+	if chatCapable {
+		caps = append(caps, "completion")
+	}
+
+	if chatCapable && hasVisionSupport(cfg) {
+		caps = append(caps, "vision")
+	}
+
+	if chatCapable && hasToolSupport(cfg) {
+		caps = append(caps, "tools")
+	}
+
+	if chatCapable && hasThinkingSupport(cfg) {
+		caps = append(caps, "thinking")
+	}
+
+	if chatCapable && cfg.TemplateConfig.Completion != "" {
+		caps = append(caps, "insert")
+	}
+
+	return caps
+}
+
+// hasVisionSupport reports whether the model can accept image inputs. We avoid
+// cfg.HasUsecases(FLAG_VISION) because GuessUsecases has no FLAG_VISION case
+// and returns true for any chat model — see core/config/model_config.go. Instead
+// we look for explicit signals: KnownUsecases bit, multimodal projector, or
+// template/backend-reported multimodal markers.
+func hasVisionSupport(cfg *config.ModelConfig) bool {
+	if cfg.KnownUsecases != nil && (*cfg.KnownUsecases&config.FLAG_VISION) == config.FLAG_VISION {
+		return true
+	}
+	if cfg.MMProj != "" {
+		return true
+	}
+	if cfg.TemplateConfig.Multimodal != "" {
+		return true
+	}
+	if cfg.MediaMarker != "" {
+		return true
+	}
+	return false
+}
+
+// hasToolSupport reports whether the model is wired up for tool / function calling.
+// We look for any of the explicit configuration knobs LocalAI uses to drive
+// function-call extraction (regex match, response regex, grammar triggers, XML
+// format) or for the auto-detected tool-format markers populated by the
+// llama.cpp backend during model load.
+func hasToolSupport(cfg *config.ModelConfig) bool {
+	fc := cfg.FunctionsConfig
+	if fc.ToolFormatMarkers != nil && fc.ToolFormatMarkers.FormatType != "" {
+		return true
+	}
+	if len(fc.JSONRegexMatch) > 0 || len(fc.ResponseRegex) > 0 {
+		return true
+	}
+	if fc.XMLFormatPreset != "" || fc.XMLFormat != nil {
+		return true
+	}
+	if len(fc.GrammarConfig.GrammarTriggers) > 0 || fc.GrammarConfig.SchemaType != "" {
+		return true
+	}
+	return false
+}
+
+// hasThinkingSupport reports whether the model has reasoning / thinking enabled.
+// LocalAI sets DisableReasoning=false (or leaves thinking markers configured)
+// when the backend probe reports that the model supports thinking.
+func hasThinkingSupport(cfg *config.ModelConfig) bool {
+	rc := cfg.ReasoningConfig
+	if rc.DisableReasoning != nil && !*rc.DisableReasoning {
+		return true
+	}
+	if len(rc.ThinkingStartTokens) > 0 || len(rc.TagPairs) > 0 {
+		// Explicit thinking markers imply support unless explicitly disabled.
+		return rc.DisableReasoning == nil || !*rc.DisableReasoning
+	}
+	return false
+}
+
+// quantRegex matches GGUF-style quantization suffixes (Q4_K_M, Q8_0, IQ3_XS, F16, ...).
+// Follows the convention used by GGUF tooling and what ggml-org/llama.cpp reports.
+var quantRegex = regexp.MustCompile(`(?i)(IQ\d+(?:_[A-Z0-9]+)*|Q\d+(?:_[A-Z0-9]+)*|F16|F32|BF16)`)
+
+// paramSizeRegex matches a parameter-size token surrounded by separators
+// (e.g. "-7B-", "_3b.", ".70B-"). Avoids matching the "3" inside "Qwen3".
+var paramSizeRegex = regexp.MustCompile(`(?i)(?:^|[-_.])(\d+(?:\.\d+)?[BM])(?:[-_.]|$)`)
+
+// extractQuantizationLevel pulls the quantization tag from the model filename.
+// Returns the uppercased token (e.g. "Q4_K_M") or "" when not present.
+func extractQuantizationLevel(modelFile string) string {
+	if modelFile == "" {
+		return ""
+	}
+	base := strings.TrimSuffix(modelFile, ".gguf")
+	if m := quantRegex.FindString(base); m != "" {
+		return strings.ToUpper(m)
+	}
+	return ""
+}
+
+// extractParameterSize pulls the parameter count from the model filename.
+// Returns "" when no recognizable token is present.
+func extractParameterSize(modelFile string) string {
+	if modelFile == "" {
+		return ""
+	}
+	base := strings.TrimSuffix(modelFile, ".gguf")
+	if m := paramSizeRegex.FindStringSubmatch(base); len(m) > 1 {
+		return strings.ToUpper(m[1])
+	}
+	return ""
+}
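Reviewer note on the filename heuristics above: a quick way to see what the two regexes return is to run them against a few names. The sketch below copies both patterns verbatim from `capabilities.go`; the helper names `quantOf` and `sizeOf` are hypothetical shorthands for `extractQuantizationLevel` and `extractParameterSize`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Patterns copied from capabilities.go.
var quantRe = regexp.MustCompile(`(?i)(IQ\d+(?:_[A-Z0-9]+)*|Q\d+(?:_[A-Z0-9]+)*|F16|F32|BF16)`)
var sizeRe = regexp.MustCompile(`(?i)(?:^|[-_.])(\d+(?:\.\d+)?[BM])(?:[-_.]|$)`)

// quantOf returns the first quantization token, uppercased, or "".
func quantOf(file string) string {
	return strings.ToUpper(quantRe.FindString(strings.TrimSuffix(file, ".gguf")))
}

// sizeOf returns the first separator-delimited size token, uppercased, or "".
func sizeOf(file string) string {
	m := sizeRe.FindStringSubmatch(strings.TrimSuffix(file, ".gguf"))
	if len(m) > 1 {
		return strings.ToUpper(m[1])
	}
	return ""
}

func main() {
	files := []string{
		"Qwen3-4B-Instruct-Q4_K_M.gguf",
		"llama-2-7b.Q4_K_M.gguf",
		"model.bin",
	}
	for _, f := range files {
		fmt.Printf("%q quant=%q size=%q\n", f, quantOf(f), sizeOf(f))
	}
}
```

Both helpers return the leftmost match, and the quantization pattern has no separator anchors, so a name that happens to contain something like `q3` early on would be reported as a quantization level. That may be acceptable for a best-effort metadata field, but it is worth knowing when reading the output.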
--- a/core/http/endpoints/ollama/capabilities_test.go
+++ b/core/http/endpoints/ollama/capabilities_test.go
@@ -0,0 +1,138 @@
+package ollama
+
+import (
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/pkg/functions"
+	"github.com/mudler/LocalAI/pkg/reasoning"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+func boolPtr(b bool) *bool { return &b }
+
+func withKnownUsecases(cfg config.ModelConfig, flags ...string) config.ModelConfig {
+	cfg.KnownUsecaseStrings = flags
+	cfg.KnownUsecases = config.GetUsecasesFromYAML(flags)
+	return cfg
+}
+
+var _ = Describe("modelCapabilities", func() {
+	DescribeTable("derives Ollama capability strings from a ModelConfig",
+		func(cfg config.ModelConfig, expected []string) {
+			caps := modelCapabilities(&cfg)
+			if len(expected) == 0 {
+				Expect(caps).To(BeEmpty())
+				return
+			}
+			Expect(caps).To(ConsistOf(expected))
+		},
+		Entry("an embedding-only model exposes the embedding capability",
+			config.ModelConfig{
+				Name:       "embed-model",
+				Backend:    "llama-cpp",
+				Embeddings: boolPtr(true),
+			},
+			[]string{"embedding"},
+		),
+		Entry("a chat-template model exposes the completion capability",
+			config.ModelConfig{
+				Name:    "chat-model",
+				Backend: "llama-cpp",
+				TemplateConfig: config.TemplateConfig{
+					Chat: "{{ .Input }}",
+				},
+			},
+			[]string{"completion"},
+		),
+		Entry("a vision-capable chat model exposes completion + vision",
+			withKnownUsecases(config.ModelConfig{
+				Name:    "vision-model",
+				Backend: "llama-cpp",
+				TemplateConfig: config.TemplateConfig{
+					Chat:       "{{ .Input }}",
+					Multimodal: "<__media__>",
+				},
+			}, "FLAG_CHAT", "FLAG_VISION"),
+			[]string{"completion", "vision"},
+		),
+		Entry("a model with reasoning enabled exposes the thinking capability",
+			config.ModelConfig{
+				Name:    "thinking-model",
+				Backend: "llama-cpp",
+				TemplateConfig: config.TemplateConfig{
+					Chat: "{{ .Input }}",
+				},
+				ReasoningConfig: reasoning.Config{
+					DisableReasoning: boolPtr(false),
+				},
+			},
+			[]string{"completion", "thinking"},
+		),
+		Entry("a model with detected tool-format markers exposes the tools capability",
+			config.ModelConfig{
+				Name:    "tools-model",
+				Backend: "llama-cpp",
+				TemplateConfig: config.TemplateConfig{
+					Chat: "{{ .Input }}",
+				},
+				FunctionsConfig: functions.FunctionsConfig{
+					ToolFormatMarkers: &functions.ToolFormatMarkers{FormatType: "json_native"},
+				},
+			},
+			[]string{"completion", "tools"},
+		),
+		Entry("a model with an explicit JSON regex match exposes the tools capability",
+			config.ModelConfig{
+				Name:    "tools-regex-model",
+				Backend: "llama-cpp",
+				TemplateConfig: config.TemplateConfig{
+					Chat: "{{ .Input }}",
+				},
+				FunctionsConfig: functions.FunctionsConfig{
+					JSONRegexMatch: []string{`(?s).*`},
+				},
+			},
+			[]string{"completion", "tools"},
+		),
+		Entry("a pure backend-only model (no template, no embeddings) reports no capabilities",
+			config.ModelConfig{
+				Name:    "rerank-model",
+				Backend: "rerankers",
+			},
+			[]string{},
+		),
+	)
+})
+
+var _ = Describe("modelDetailsFromModelConfig", func() {
+	It("reports gguf format and llama-cpp family/families for a llama-cpp model", func() {
+		cfg := config.ModelConfig{
+			Name:    "llama",
+			Backend: "llama-cpp",
+		}
+		details := modelDetailsFromModelConfig(&cfg)
+		Expect(details.Format).To(Equal("gguf"))
+		Expect(details.Family).To(Equal("llama-cpp"))
+		Expect(details.Families).To(ConsistOf("llama-cpp"))
+	})
+
+	It("extracts quantization_level from the model filename when present", func() {
+		cfg := config.ModelConfig{
+			Name:    "qwen-q4",
+			Backend: "llama-cpp",
+		}
+		cfg.Model = "Qwen3-4B-Instruct-Q4_K_M.gguf"
+		details := modelDetailsFromModelConfig(&cfg)
+		Expect(details.QuantizationLevel).To(Equal("Q4_K_M"))
+	})
+
+	It("extracts parameter_size from the model filename when present", func() {
+		cfg := config.ModelConfig{
+			Name:    "qwen-4b",
+			Backend: "llama-cpp",
+		}
+		cfg.Model = "Qwen3-4B-Instruct-Q4_K_M.gguf"
+		details := modelDetailsFromModelConfig(&cfg)
+		Expect(details.ParameterSize).To(Equal("4B"))
+	})
+})
--- a/core/http/endpoints/ollama/models.go
+++ b/core/http/endpoints/ollama/models.go
@@ -32,13 +32,15 @@ func ListModelsEndpoint(bcl *config.ModelConfigLoader, ml *model.ModelLoader) ec

 			digest := fmt.Sprintf("sha256:%x", sha256.Sum256([]byte(name)))

+			details, caps := modelMetaFromConfig(bcl, name)
 			entry := schema.OllamaModelEntry{
-				Name:       ollamaName,
-				Model:      ollamaName,
-				ModifiedAt: time.Now().UTC(),
-				Size:       0,
-				Digest:     digest,
-				Details:    modelDetailsFromConfig(bcl, name),
+				Name:         ollamaName,
+				Model:        ollamaName,
+				ModifiedAt:   time.Now().UTC(),
+				Size:         0,
+				Digest:       digest,
+				Details:      details,
+				Capabilities: caps,
 			}
 			models = append(models, entry)
 		}
@@ -72,10 +74,12 @@ func ShowModelEndpoint(bcl *config.ModelConfigLoader) echo.HandlerFunc {
 		}

 		resp := schema.OllamaShowResponse{
-			Modelfile:  fmt.Sprintf("FROM %s", cfg.Model),
-			Parameters: "",
-			Template:   cfg.TemplateConfig.Chat,
-			Details:    modelDetailsFromModelConfig(&cfg),
+			Modelfile:    fmt.Sprintf("FROM %s", cfg.Model),
+			Parameters:   "",
+			Template:     cfg.TemplateConfig.Chat,
+			Details:      modelDetailsFromModelConfig(&cfg),
+			ModelInfo:    modelInfoFromModelConfig(&cfg),
+			Capabilities: modelCapabilities(&cfg),
 		}

 		return c.JSON(200, resp)
@@ -95,14 +99,16 @@ func ListRunningEndpoint(bcl *config.ModelConfigLoader, ml *model.ModelLoader) e
 				ollamaName += ":latest"
 			}

+			details, caps := modelMetaFromConfig(bcl, name)
 			entry := schema.OllamaPsEntry{
-				Name:      ollamaName,
-				Model:     ollamaName,
-				Size:      0,
-				Digest:    fmt.Sprintf("sha256:%x", sha256.Sum256([]byte(name))),
-				Details:   modelDetailsFromConfig(bcl, name),
-				ExpiresAt: time.Now().Add(24 * time.Hour).UTC(),
-				SizeVRAM:  0,
+				Name:         ollamaName,
+				Model:        ollamaName,
+				Size:         0,
+				Digest:       fmt.Sprintf("sha256:%x", sha256.Sum256([]byte(name))),
+				Details:      details,
+				ExpiresAt:    time.Now().Add(24 * time.Hour).UTC(),
+				SizeVRAM:     0,
+				Capabilities: caps,
 			}
 			models = append(models, entry)
 		}
@@ -125,18 +131,46 @@ func HeartbeatEndpoint() echo.HandlerFunc {
 	}
 }

-func modelDetailsFromConfig(bcl *config.ModelConfigLoader, name string) schema.OllamaModelDetails {
+// modelMetaFromConfig fetches the ModelConfig for `name` and derives both the
+// Ollama details block and capability list. Returns zero values when the model
+// is not configured.
+func modelMetaFromConfig(bcl *config.ModelConfigLoader, name string) (schema.OllamaModelDetails, []string) {
 	configName := strings.Split(name, ":")[0]
 	cfg, exists := bcl.GetModelConfig(configName)
 	if !exists {
-		return schema.OllamaModelDetails{}
+		return schema.OllamaModelDetails{}, nil
 	}
-	return modelDetailsFromModelConfig(&cfg)
+	return modelDetailsFromModelConfig(&cfg), modelCapabilities(&cfg)
 }

 func modelDetailsFromModelConfig(cfg *config.ModelConfig) schema.OllamaModelDetails {
-	return schema.OllamaModelDetails{
-		Format: "gguf",
-		Family: cfg.Backend,
+	family := cfg.Backend
+	details := schema.OllamaModelDetails{
+		Format:            "gguf",
+		Family:            family,
+		ParameterSize:     extractParameterSize(cfg.Model),
+		QuantizationLevel: extractQuantizationLevel(cfg.Model),
 	}
+	if family != "" {
+		details.Families = []string{family}
+	}
+	return details
+}
+
+// modelInfoFromModelConfig returns a small map of model_info entries derived
+// from the LocalAI ModelConfig. Ollama clients use this map for architecture
+// and context-length information; we expose what we can without loading the
+// model.
+func modelInfoFromModelConfig(cfg *config.ModelConfig) map[string]any {
+	info := map[string]any{}
+	if cfg.Backend != "" {
+		info["general.architecture"] = cfg.Backend
+	}
+	if cfg.ContextSize != nil && *cfg.ContextSize > 0 {
+		info["general.context_length"] = *cfg.ContextSize
+	}
+	if len(info) == 0 {
+		return nil
+	}
+	return info
 }
--- a/core/http/endpoints/ollama/models_test.go
+++ b/core/http/endpoints/ollama/models_test.go
@@ -1,12 +1,18 @@
 package ollama_test

 import (
+	"encoding/json"
 	"net/http"
 	"net/http/httptest"
+	"os"
+	"path/filepath"
+	"strings"
 	"testing"

 	"github.com/labstack/echo/v4"
+	"github.com/mudler/LocalAI/core/config"
 	"github.com/mudler/LocalAI/core/http/endpoints/ollama"
+	"github.com/mudler/LocalAI/core/schema"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
 )
@@ -59,4 +65,92 @@ var _ = Describe("Ollama endpoint handlers", func() {
 			Expect(rec.Body.String()).To(MatchRegexp(`\d+\.\d+\.\d+`))
 		})
 	})
+
+	Describe("ShowModelEndpoint", func() {
+		var (
+			tmpDir string
+			bcl    *config.ModelConfigLoader
+		)
+
+		BeforeEach(func() {
+			var err error
+			tmpDir, err = os.MkdirTemp("", "ollama-show-test-*")
+			Expect(err).ToNot(HaveOccurred())
+			bcl = config.NewModelConfigLoader(tmpDir)
+		})
+
+		AfterEach(func() {
+			_ = os.RemoveAll(tmpDir)
+		})
+
+		writeConfig := func(name, yaml string) {
+			path := filepath.Join(tmpDir, name+".yaml")
+			Expect(os.WriteFile(path, []byte(yaml), 0o644)).To(Succeed())
+			Expect(bcl.ReadModelConfig(path)).To(Succeed())
+		}
+
+		callShow := func(name string) *schema.OllamaShowResponse {
+			req := httptest.NewRequest(http.MethodPost, "/api/show",
+				strings.NewReader(`{"name":"`+name+`"}`))
+			req.Header.Set("Content-Type", "application/json")
+			rec := httptest.NewRecorder()
+			c := e.NewContext(req, rec)
+
+			handler := ollama.ShowModelEndpoint(bcl)
+			Expect(handler(c)).To(Succeed())
+			Expect(rec.Code).To(Equal(http.StatusOK))
+
+			var resp schema.OllamaShowResponse
+			Expect(json.Unmarshal(rec.Body.Bytes(), &resp)).To(Succeed())
+			return &resp
+		}
+
+		It("returns capabilities=['embedding'] for embedding-only models", func() {
+			writeConfig("embed", `
+name: embed
+backend: llama-cpp
+embeddings: true
+parameters:
+  model: Qwen3-4B-Embedding-Q4_K_M.gguf
+`)
+			resp := callShow("embed")
+			Expect(resp.Capabilities).To(ConsistOf("embedding"))
+		})
+
+		It("returns capabilities=['completion'] for plain chat models", func() {
+			writeConfig("chat", `
+name: chat
+backend: llama-cpp
+template:
+  chat: "{{ .Input }}"
+parameters:
+  model: Llama-3-8B-Q4_K_M.gguf
+`)
+			resp := callShow("chat")
+			Expect(resp.Capabilities).To(ContainElement("completion"))
+			Expect(resp.Capabilities).ToNot(ContainElement("embedding"))
+		})
+
+		It("populates details.parameter_size and details.quantization_level from the GGUF filename", func() {
+			writeConfig("qwen", `
+name: qwen
+backend: llama-cpp
+template:
+  chat: "{{ .Input }}"
+parameters:
+  model: Qwen3-4B-Instruct-Q4_K_M.gguf
+`)
+			resp := callShow("qwen")
+			Expect(resp.Details.ParameterSize).To(Equal("4B"))
+			Expect(resp.Details.QuantizationLevel).To(Equal("Q4_K_M"))
+			Expect(resp.Details.Format).To(Equal("gguf"))
+			Expect(resp.Details.Families).ToNot(BeEmpty())
+		})
+	})
+
+	Describe("ListModelsEndpoint", func() {
+		It("includes capabilities and details for each listed model in /api/tags", func() {
+			Skip("covered by per-entry tests; integration smoke test")
+		})
+	})
 })
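
The `ShowModelEndpoint` tests above expect `details.parameter_size` and `details.quantization_level` to be derived from the GGUF filename. The helpers that do this (`extractParameterSize`, `extractQuantizationLevel`) are not part of the hunks shown here, so the following is only a minimal sketch of one way such parsing could work. The names `guessParameterSize` and `guessQuantization` and both regular expressions are hypothetical and are not the code in this change.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Hypothetical patterns: both require the token to be delimited by '-', '_',
// '.' or the start/end of the name, so "Qwen3" is not mistaken for a size.
var (
	paramSizeRe = regexp.MustCompile(`(?i)(?:^|[-_.])(\d+(?:\.\d+)?[bm])(?:[-_.]|$)`)
	quantRe     = regexp.MustCompile(`(?i)(?:^|[-_.])(i?q\d+(?:_[a-z0-9]+)*|bf16|f16|f32)(?:[-_.]|$)`)
)

// firstGroupUpper returns the first capture group in upper case, or "" when
// the pattern does not match.
func firstGroupUpper(re *regexp.Regexp, s string) string {
	m := re.FindStringSubmatch(s)
	if m == nil {
		return ""
	}
	return strings.ToUpper(m[1])
}

// guessParameterSize looks for a token such as "4B" or "1.5B" in a filename.
func guessParameterSize(filename string) string {
	return firstGroupUpper(paramSizeRe, filename)
}

// guessQuantization looks for a token such as "Q4_K_M" or "F16" in a filename.
func guessQuantization(filename string) string {
	return firstGroupUpper(quantRe, filename)
}

func main() {
	name := "Qwen3-4B-Instruct-Q4_K_M.gguf"
	fmt.Println(guessParameterSize(name), guessQuantization(name))
}
```

Anchoring on separators keeps the heuristic conservative: a filename with no recognizable token yields an empty string, which the `omitempty`-style fields can simply leave out.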
--- a/core/http/react-ui/package-lock.json
+++ b/core/http/react-ui/package-lock.json
@@ -16,6 +16,8 @@
        "@codemirror/search": "^6.5.10",
        "@codemirror/state": "^6.5.2",
        "@codemirror/view": "^6.36.8",
+        "@fontsource-variable/geist": "^5.2.8",
+        "@fontsource-variable/geist-mono": "^5.2.7",
        "@fortawesome/fontawesome-free": "^6.7.2",
        "@lezer/highlight": "^1.2.1",
        "@modelcontextprotocol/ext-apps": "^1.2.2",
@@ -965,6 +967,24 @@
        "node": "^18.18.0 || ^20.9.0 || >=21.1.0"
      }
    },
+    "node_modules/@fontsource-variable/geist": {
+      "version": "5.2.8",
+      "resolved": "https://registry.npmjs.org/@fontsource-variable/geist/-/geist-5.2.8.tgz",
+      "integrity": "sha512-cJ6m9e+8MQ5dCYJsLylfZrgBh6KkG4bOLckB35Tr9J/EqdkEM6QllH5PxqP1dhTvFup+HtMRPuz9xOjxXJggxw==",
+      "license": "OFL-1.1",
+      "funding": {
+        "url": "https://github.com/sponsors/ayuhito"
+      }
+    },
+    "node_modules/@fontsource-variable/geist-mono": {
+      "version": "5.2.7",
+      "resolved": "https://registry.npmjs.org/@fontsource-variable/geist-mono/-/geist-mono-5.2.7.tgz",
+      "integrity": "sha512-ZKlZ5sjtalb2TwXKs400mAGDlt/+2ENLNySPx0wTz3bP3mWARCsUW+rpxzZc7e05d2qGch70pItt3K4qttbIYA==",
+      "license": "OFL-1.1",
+      "funding": {
+        "url": "https://github.com/sponsors/ayuhito"
+      }
+    },
    "node_modules/@fortawesome/fontawesome-free": {
      "version": "6.7.2",
      "resolved": "https://registry.npmjs.org/@fortawesome/fontawesome-free/-/fontawesome-free-6.7.2.tgz",
@@ -2903,11 +2923,12 @@
      }
    },
    "node_modules/express-rate-limit": {
-      "version": "8.3.1",
-      "resolved": "https://registry.npmjs.org/express-rate-limit/-/express-rate-limit-8.3.1.tgz",
-      "integrity": "sha512-D1dKN+cmyPWuvB+G2SREQDzPY1agpBIcTa9sJxOPMCNeH3gwzhqJRDWCXW3gg0y//+LQ/8j52JbMROWyrKdMdw==",
+      "version": "8.5.1",
+      "resolved": "https://registry.npmjs.org/express-rate-limit/-/express-rate-limit-8.5.1.tgz",
+      "integrity": "sha512-5O6KYmyJEpuPJV5hNTXKbAHWRqrzyu+OI3vUnSd2kXFubIVpG7ezpgxQy76Zo5GQZtrQBg86hF+CM/NX+cioiQ==",
+      "license": "MIT",
      "dependencies": {
-        "ip-address": "10.1.0"
+        "ip-address": "^10.2.0"
      },
      "engines": {
        "node": ">= 16"
@@ -2951,9 +2972,9 @@
      "dev": true
    },
    "node_modules/fast-uri": {
-      "version": "3.1.0",
-      "resolved": "https://registry.npmjs.org/fast-uri/-/fast-uri-3.1.0.tgz",
-      "integrity": "sha512-iPeeDKJSWf4IEOasVVrknXpaBV0IApz/gp7S2bb7Z4Lljbl2MGJRqInZiUrQwV16cpzw/D3S5j5Julj/gT52AA==",
+      "version": "3.1.2",
+      "resolved": "https://registry.npmjs.org/fast-uri/-/fast-uri-3.1.2.tgz",
+      "integrity": "sha512-rVjf7ArG3LTk+FS6Yw81V1DLuZl1bRbNrev6Tmd/9RaroeeRRJhAt7jg/6YFxbvAQXUCavSoZhPPj6oOx+5KjQ==",
      "funding": [
        {
          "type": "github",
@@ -2963,7 +2984,8 @@
          "type": "opencollective",
          "url": "https://opencollective.com/fastify"
        }
-      ]
+      ],
+      "license": "BSD-3-Clause"
    },
    "node_modules/fastq": {
      "version": "1.20.1",
@@ -3421,9 +3443,9 @@
      }
    },
    "node_modules/hono": {
-      "version": "4.12.14",
-      "resolved": "https://registry.npmjs.org/hono/-/hono-4.12.14.tgz",
-      "integrity": "sha512-am5zfg3yu6sqn5yjKBNqhnTX7Cv+m00ox+7jbaKkrLMRJ4rAdldd1xPd/JzbBWspqaQv6RSTrgFN95EsfhC+7w==",
+      "version": "4.12.18",
+      "resolved": "https://registry.npmjs.org/hono/-/hono-4.12.18.tgz",
+      "integrity": "sha512-RWzP96k/yv0PQfyXnWjs6zot20TqfpfsNXhOnev8d1InAxubW93L11/oNUc3tQqn2G0bSdAOBpX+2uDFHV7kdQ==",
      "license": "MIT",
      "engines": {
        "node": ">=16.9.0"
@@ -3681,9 +3703,10 @@
      "integrity": "sha512-k/vGaX4/Yla3WzyMCvTQOXYeIHvqOKtnqBduzTHpzpQZzAskKMhZ2K+EnBiSM9zGSoIFeMpXKxa4dYeZIQqewQ=="
    },
    "node_modules/ip-address": {
-      "version": "10.1.0",
-      "resolved": "https://registry.npmjs.org/ip-address/-/ip-address-10.1.0.tgz",
-      "integrity": "sha512-XXADHxXmvT9+CRxhXg56LJovE+bmWnEWB78LB83VZTprKTmaC5QfruXocxzTZ2Kl0DNwKuBdlIhjL8LeY8Sf8Q==",
+      "version": "10.2.0",
+      "resolved": "https://registry.npmjs.org/ip-address/-/ip-address-10.2.0.tgz",
+      "integrity": "sha512-/+S6j4E9AHvW9SWMSEY9Xfy66O5PWvVEJ08O0y5JGyEKQpojb0K0GKpz/v5HJ/G0vi3D2sjGK78119oXZeE0qA==",
+      "license": "MIT",
      "engines": {
        "node": ">= 12"
      }
--- a/core/schema/ollama.go
+++ b/core/schema/ollama.go
@@ -120,10 +120,14 @@ type OllamaGenerateResponse struct {
 	EvalDuration       int64     `json:"eval_duration,omitempty"`
 }

-// OllamaEmbedRequest represents a request to the Ollama Embed API
+// OllamaEmbedRequest represents a request to the Ollama Embed API.
+// Ollama's /api/embed endpoint takes `input`, while the legacy /api/embeddings
+// endpoint takes `prompt` (see https://github.com/ollama/ollama/blob/main/docs/api.md#generate-embeddings),
+// so both keys are deserialized here for client compatibility.
 type OllamaEmbedRequest struct {
-	Model   string `json:"model"`
-	Input   any    `json:"input"` // string or []string
+	Model   string         `json:"model"`
+	Input   any            `json:"input,omitempty"`  // string or []string
+	Prompt  any            `json:"prompt,omitempty"` // string or []string (Ollama alias for Input)
 	Options *OllamaOptions `json:"options,omitempty"`
 }

@@ -135,10 +139,21 @@ func (r *OllamaEmbedRequest) ModelName(s *string) string {
 	return r.Model
 }

-// GetInputStrings normalizes the Input field to a string slice
+// GetInputStrings normalizes the Input/Prompt field to a string slice.
+// Input takes precedence over Prompt when both are provided.
 func (r *OllamaEmbedRequest) GetInputStrings() []string {
-	switch v := r.Input.(type) {
+	if v := normalizeOllamaEmbedInput(r.Input); v != nil {
+		return v
+	}
+	return normalizeOllamaEmbedInput(r.Prompt)
+}
+
+func normalizeOllamaEmbedInput(v any) []string {
+	switch v := v.(type) {
 	case string:
+		if v == "" {
+			return nil
+		}
 		return []string{v}
 	case []any:
 		var result []string
@@ -184,11 +199,13 @@ func (r *OllamaShowRequest) ModelName(s *string) string {

 // OllamaShowResponse represents a response from the Ollama Show API
 type OllamaShowResponse struct {
-	Modelfile  string             `json:"modelfile"`
-	Parameters string             `json:"parameters"`
-	Template   string             `json:"template"`
-	License    string             `json:"license,omitempty"`
-	Details    OllamaModelDetails `json:"details"`
+	Modelfile    string             `json:"modelfile"`
+	Parameters   string             `json:"parameters"`
+	Template     string             `json:"template"`
+	License      string             `json:"license,omitempty"`
+	Details      OllamaModelDetails `json:"details"`
+	ModelInfo    map[string]any     `json:"model_info,omitempty"`
+	Capabilities []string           `json:"capabilities,omitempty"`
 }

 // OllamaModelDetails contains model metadata
@@ -203,12 +220,13 @@ type OllamaModelDetails struct {

 // OllamaModelEntry represents a model in the list response
 type OllamaModelEntry struct {
-	Name       string             `json:"name"`
-	Model      string             `json:"model"`
-	ModifiedAt time.Time          `json:"modified_at"`
-	Size       int64              `json:"size"`
-	Digest     string             `json:"digest"`
-	Details    OllamaModelDetails `json:"details"`
+	Name         string             `json:"name"`
+	Model        string             `json:"model"`
+	ModifiedAt   time.Time          `json:"modified_at"`
+	Size         int64              `json:"size"`
+	Digest       string             `json:"digest"`
+	Details      OllamaModelDetails `json:"details"`
+	Capabilities []string           `json:"capabilities,omitempty"`
 }

 // OllamaListResponse represents a response from the Ollama Tags API
@@ -218,13 +236,14 @@ type OllamaListResponse struct {

 // OllamaPsEntry represents a running model in the ps response
 type OllamaPsEntry struct {
-	Name       string             `json:"name"`
-	Model      string             `json:"model"`
-	Size       int64              `json:"size"`
-	Digest     string             `json:"digest"`
-	Details    OllamaModelDetails `json:"details"`
-	ExpiresAt  time.Time          `json:"expires_at"`
-	SizeVRAM   int64              `json:"size_vram"`
+	Name         string             `json:"name"`
+	Model        string             `json:"model"`
+	Size         int64              `json:"size"`
+	Digest       string             `json:"digest"`
+	Details      OllamaModelDetails `json:"details"`
+	ExpiresAt    time.Time          `json:"expires_at"`
+	SizeVRAM     int64              `json:"size_vram"`
+	Capabilities []string           `json:"capabilities,omitempty"`
 }

 // OllamaPsResponse represents a response from the Ollama Ps API
--- a/core/schema/ollama_test.go
+++ b/core/schema/ollama_test.go
@@ -0,0 +1,86 @@
+package schema_test
+
+import (
+	"encoding/json"
+
+	. "github.com/mudler/LocalAI/core/schema"
+
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("OllamaEmbedRequest", func() {
+
+	Context("GetInputStrings", func() {
+		It("returns a single string when Input is a string", func() {
+			req := OllamaEmbedRequest{Input: "hello world"}
+
+			Expect(req.GetInputStrings()).To(Equal([]string{"hello world"}))
+		})
+
+		It("returns a list of strings when Input is a []string", func() {
+			req := OllamaEmbedRequest{Input: []string{"hello", "world"}}
+
+			Expect(req.GetInputStrings()).To(Equal([]string{"hello", "world"}))
+		})
+
+		It("returns a list of strings when Input is a []any (post JSON unmarshal)", func() {
+			req := OllamaEmbedRequest{Input: []any{"hello", "world"}}
+
+			Expect(req.GetInputStrings()).To(Equal([]string{"hello", "world"}))
+		})
+	})
+
+	Context("JSON unmarshaling (Ollama API compatibility)", func() {
+		It("accepts the 'input' field as a single string", func() {
+			body := []byte(`{"model": "m", "input": "why is the sky blue?"}`)
+
+			var req OllamaEmbedRequest
+			Expect(json.Unmarshal(body, &req)).To(Succeed())
+
+			Expect(req.Model).To(Equal("m"))
+			Expect(req.GetInputStrings()).To(Equal([]string{"why is the sky blue?"}))
+		})
+
+		It("accepts the 'input' field as an array of strings", func() {
+			body := []byte(`{"model": "m", "input": ["why is the sky blue?", "why is the grass green?"]}`)
+
+			var req OllamaEmbedRequest
+			Expect(json.Unmarshal(body, &req)).To(Succeed())
+
+			Expect(req.GetInputStrings()).To(Equal([]string{"why is the sky blue?", "why is the grass green?"}))
+		})
+
+		// Ollama's embedding endpoint accepts both `input` and `prompt` keys:
+		// https://github.com/ollama/ollama/blob/main/docs/api.md#generate-embeddings
+		// LocalAI must accept `prompt` so client libraries using that key are not broken.
+		// See https://github.com/mudler/LocalAI/issues/9767.
+		It("accepts the 'prompt' field as a single string (Ollama compatibility)", func() {
+			body := []byte(`{"model": "m", "prompt": "why is the sky blue?"}`)
+
+			var req OllamaEmbedRequest
+			Expect(json.Unmarshal(body, &req)).To(Succeed())
+
+			Expect(req.Model).To(Equal("m"))
+			Expect(req.GetInputStrings()).To(Equal([]string{"why is the sky blue?"}))
+		})
+
+		It("accepts the 'prompt' field as an array of strings (Ollama compatibility)", func() {
+			body := []byte(`{"model": "m", "prompt": ["why is the sky blue?", "why is the grass green?"]}`)
+
+			var req OllamaEmbedRequest
+			Expect(json.Unmarshal(body, &req)).To(Succeed())
+
+			Expect(req.GetInputStrings()).To(Equal([]string{"why is the sky blue?", "why is the grass green?"}))
+		})
+
+		It("prefers 'input' when both 'input' and 'prompt' are provided", func() {
+			body := []byte(`{"model": "m", "input": "from input", "prompt": "from prompt"}`)
+
+			var req OllamaEmbedRequest
+			Expect(json.Unmarshal(body, &req)).To(Succeed())
+
+			Expect(req.GetInputStrings()).To(Equal([]string{"from input"}))
+		})
+	})
+})
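
The precedence rule exercised by these tests (`input` wins, `prompt` is the fallback, an empty string counts as absent) can be sketched in isolation. This is a standalone approximation, not the code in `core/schema/ollama.go`: the names `normalizeEmbedInput` and `embedInputs` are hypothetical, and the `[]string` branch is an assumption, since the hunk above truncates the original `switch`.

```go
package main

import "fmt"

// normalizeEmbedInput converts a decoded JSON value into a string slice.
// It returns nil for empty or unsupported values so that a caller can fall
// back to another field.
func normalizeEmbedInput(v any) []string {
	switch t := v.(type) {
	case string:
		if t == "" {
			return nil
		}
		return []string{t}
	case []string: // assumption: not visible in the truncated hunk
		if len(t) == 0 {
			return nil
		}
		return t
	case []any: // what encoding/json produces for a JSON array
		var out []string
		for _, item := range t {
			if s, ok := item.(string); ok {
				out = append(out, s)
			}
		}
		return out
	}
	return nil
}

// embedInputs prefers input and only consults prompt when input is empty.
func embedInputs(input, prompt any) []string {
	if v := normalizeEmbedInput(input); v != nil {
		return v
	}
	return normalizeEmbedInput(prompt)
}

func main() {
	fmt.Println(embedInputs("from input", "from prompt"))
	fmt.Println(embedInputs(nil, []any{"a", "b"}))
	fmt.Println(embedInputs("", "fallback"))
}
```

Returning nil rather than an empty slice is what makes the fallback a single `!= nil` check.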
--- a/docs/content/advanced/model-configuration.md
+++ b/docs/content/advanced/model-configuration.md
@@ -251,18 +251,68 @@ options:

 These are set via the `options:` array in the model configuration (format: `key:value`):

+**Common options**
+
 | Option | Type | Default | Description |
 |--------|------|---------|-------------|
-| `spec_type` | string | `none` | Speculative decoding type (see table below) |
+| `spec_type` / `speculative_type` | string | `none` | Speculative decoding type, or a comma-separated list of types to chain (see table below) |
 | `spec_n_max` / `draft_max` | int | 16 | Maximum number of tokens to draft per step |
 | `spec_n_min` / `draft_min` | int | 0 | Minimum draft tokens required to use speculation |
 | `spec_p_min` / `draft_p_min` | float | 0.75 | Minimum probability threshold for greedy acceptance |
 | `spec_p_split` | float | 0.1 | Split probability for tree-based branching |
+
+**Draft-model options** (apply when `spec_type` includes `draft`, which is the default when a `draft_model` is configured)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `draft_gpu_layers` | int | -1 | GPU layers for the draft model (-1 = use default) |
+| `draft_threads` / `spec_draft_threads` | int | same as main | Threads used by the draft model (`<= 0` = hardware concurrency) |
+| `draft_threads_batch` / `spec_draft_threads_batch` | int | same as `draft_threads` | Threads used by the draft model during batch / prompt processing |
+| `draft_cache_type_k` / `spec_draft_cache_type_k` | string | `f16` | KV cache K data type for the draft model (same values as `cache_type_k`) |
+| `draft_cache_type_v` / `spec_draft_cache_type_v` | string | `f16` | KV cache V data type for the draft model |
+| `draft_cpu_moe` / `spec_draft_cpu_moe` | bool | false | Keep all MoE expert weights of the draft model on CPU |
+| `draft_n_cpu_moe` / `spec_draft_n_cpu_moe` | int | 0 | Keep MoE expert weights of the first N draft-model layers on CPU |
+| `draft_override_tensor` / `spec_draft_override_tensor` | string | "" | Comma-separated `<tensor regex>=<buffer type>` overrides for the draft model |
+| `draft_ctx_size` | int | (ignored) | Deprecated upstream: the draft now shares the target context size. Accepted for backward compatibility but has no effect. |
+
+**`ngram_simple` options** (used when `spec_type` includes `ngram_simple`)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
 | `spec_ngram_size_n` / `ngram_size_n` | int | 12 | N-gram lookup size |
 | `spec_ngram_size_m` / `ngram_size_m` | int | 48 | M-gram proposal size |
 | `spec_ngram_min_hits` / `ngram_min_hits` | int | 1 | Minimum hits for accepting n-gram proposals |
-| `draft_gpu_layers` | int | -1 | GPU layers for the draft model (-1 = use default) |
-| `draft_ctx_size` | int | 0 | Context size for the draft model (0 = auto) |
+
+**`ngram_mod` options** (used when `spec_type` includes `ngram_mod`)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `spec_ngram_mod_n_min` | int | 48 | Minimum number of ngram tokens to use |
+| `spec_ngram_mod_n_max` | int | 64 | Maximum number of ngram tokens to use |
+| `spec_ngram_mod_n_match` | int | 24 | Ngram lookup length |
+
+**`ngram_map_k` options** (used when `spec_type` includes `ngram_map_k`)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `spec_ngram_map_k_size_n` | int | 12 | N-gram lookup size |
+| `spec_ngram_map_k_size_m` | int | 48 | M-gram proposal size |
+| `spec_ngram_map_k_min_hits` | int | 1 | Minimum hits for accepting proposals |
+
+**`ngram_map_k4v` options** (used when `spec_type` includes `ngram_map_k4v`)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `spec_ngram_map_k4v_size_n` | int | 12 | N-gram lookup size |
+| `spec_ngram_map_k4v_size_m` | int | 48 | M-gram proposal size |
+| `spec_ngram_map_k4v_min_hits` | int | 1 | Minimum hits for accepting proposals |
+
+**`ngram_cache` lookup files**
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `spec_lookup_cache_static` / `lookup_cache_static` | string | "" | Path to a static ngram lookup cache file |
+| `spec_lookup_cache_dynamic` / `lookup_cache_dynamic` | string | "" | Path to a dynamic ngram lookup cache file (updated by generation) |

 #### Speculative Type Values

@@ -277,6 +327,8 @@ These are set via the `options:` array in the model configuration (format: `key:
 | `ngram_mod` | Modified n-gram speculation |
 | `ngram_cache` | 3-level n-gram cache |

+Multiple types can be chained by passing a comma-separated list to `spec_type` (e.g. `spec_type:ngram_simple,ngram_mod`). The runtime tries them in order and accepts the first proposal that meets the acceptance criteria.
+
 {{% notice note %}}
 Speculative decoding is automatically disabled when multimodal models (with `mmproj`) are active. The `n_draft` parameter can also be overridden per-request.
 {{% /notice %}}
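
The documentation above describes `options:` entries as `key:value` strings, with `spec_type` accepting a comma-separated list. The actual parsing lives in the C++ grpc-server and is not reproduced here. The sketch below only illustrates the splitting semantics in Go, under the assumption that whitespace around names is ignored and empty names are skipped. `parseSpecTypes` is a hypothetical helper, not a LocalAI or llama.cpp function.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSpecTypes splits an option such as "spec_type:ngram_simple,ngram_mod"
// into its list of type names. It reports false when the option is not a
// spec_type/speculative_type entry or carries no names.
func parseSpecTypes(option string) ([]string, bool) {
	key, value, found := strings.Cut(option, ":")
	if !found {
		return nil, false
	}
	if key != "spec_type" && key != "speculative_type" {
		return nil, false
	}
	var types []string
	for _, name := range strings.Split(value, ",") {
		// Assumption: surrounding whitespace is ignored and empty names skipped.
		if name = strings.TrimSpace(name); name != "" {
			types = append(types, name)
		}
	}
	return types, len(types) > 0
}

func main() {
	types, ok := parseSpecTypes("spec_type:ngram_simple,ngram_mod")
	fmt.Println(types, ok)
}
```

The order of the returned slice matches the order in the option string, which is what the chaining described above relies on.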
--- a/docs/data/version.json
+++ b/docs/data/version.json
@@ -1,3 +1,3 @@
 {
-  "version": "v4.1.3"
+  "version": "v4.2.0"
 }
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -30632,3 +30632,24 @@
      - torch_dtype:bf16
    parameters:
      model: Lightricks/LTX-2.3
+- name: deepseek-v4-flash-q2
+  description: |
+    DeepSeek V4 Flash (IQ2XXS GGUF, ~81 GB) - only loadable via the ds4 backend.
+    Requires >=128 GB RAM. Metal (Darwin) or CUDA (Linux).
+    See https://github.com/antirez/ds4 for details.
+  urls:
+    - https://huggingface.co/antirez/deepseek-v4-gguf
+  tags:
+    - deepseek
+    - ds4
+    - gguf
+    - llm
+    - chat
+  overrides:
+    backend: ds4
+    parameters:
+      model: ds4flash.gguf
+  files:
+    - filename: ds4flash.gguf
+      sha256: 31598c67c8b8744d3bcebcd19aa62253c6dc43cef3b8adf9f593656c9e86fd8c
+      uri: huggingface://antirez/deepseek-v4-gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf
--- a/go.mod
+++ b/go.mod
@@ -7,7 +7,7 @@ require (
 	fyne.io/fyne/v2 v2.7.3
 	github.com/Masterminds/sprig/v3 v3.3.0
 	github.com/alecthomas/kong v1.14.0
-	github.com/anthropics/anthropic-sdk-go v1.27.0
+	github.com/anthropics/anthropic-sdk-go v1.42.0
 	github.com/aws/aws-sdk-go-v2 v1.41.6
 	github.com/aws/aws-sdk-go-v2/config v1.32.16
 	github.com/aws/aws-sdk-go-v2/credentials v1.19.15
@@ -18,7 +18,7 @@ require (
 	github.com/dhowden/tag v0.0.0-20240417053706-3d75831295e8
 	github.com/ebitengine/purego v0.10.0
 	github.com/emirpasic/gods/v2 v2.0.0-alpha
-	github.com/fsnotify/fsnotify v1.9.0
+	github.com/fsnotify/fsnotify v1.10.1
 	github.com/go-audio/wav v1.1.0
 	github.com/go-skynet/go-llama.cpp v0.0.0-20240314183750-6a8041ef6b46
 	github.com/gofrs/flock v0.13.0
@@ -37,14 +37,14 @@ require (
 	github.com/microcosm-cc/bluemonday v1.0.27
 	github.com/modelcontextprotocol/go-sdk v1.5.0
 	github.com/mudler/cogito v0.9.5-0.20260315222927-63abdec7189b
-	github.com/mudler/edgevpn v0.31.1
+	github.com/mudler/edgevpn v0.32.2
 	github.com/mudler/go-processmanager v0.1.1
 	github.com/mudler/memory v0.0.0-20260406210934-424c1ecf2cf8
 	github.com/mudler/xlog v0.0.6
 	github.com/nats-io/nats.go v1.50.0
 	github.com/ollama/ollama v0.20.4
 	github.com/onsi/ginkgo/v2 v2.28.2
-	github.com/onsi/gomega v1.39.1
+	github.com/onsi/gomega v1.40.0
 	github.com/openai/openai-go/v3 v3.26.0
 	github.com/otiai10/copy v1.14.1
 	github.com/otiai10/openaigo v1.7.0
@@ -94,14 +94,10 @@ require (
 	github.com/aws/smithy-go v1.25.0 // indirect
 	github.com/bahlo/generic-list-go v0.2.0 // indirect
 	github.com/buger/jsonparser v1.1.2 // indirect
-	github.com/chasefleming/elem-go v0.30.0 // indirect
-	github.com/dave-gray101/v2keyauth v0.0.0-20240624150259-c45d584d25e2 // indirect
 	github.com/dunglas/httpsfv v1.1.0 // indirect
+	github.com/filecoin-project/go-clock v0.1.0 // indirect
 	github.com/go-jose/go-jose/v4 v4.1.4 // indirect
-	github.com/gofiber/template v1.8.3 // indirect
-	github.com/gofiber/template/html/v2 v2.1.3 // indirect
-	github.com/gofiber/utils v1.1.0 // indirect
-	github.com/inconshreveable/mousetrap v1.1.0 // indirect
+	github.com/invopop/jsonschema v0.13.0 // indirect
 	github.com/jinzhu/inflection v1.0.0 // indirect
 	github.com/jinzhu/now v1.1.5 // indirect
 	github.com/jolestar/go-commons-pool/v2 v2.1.2 // indirect
@@ -111,8 +107,7 @@ require (
 	github.com/moby/moby/client v0.4.0 // indirect
 	github.com/nats-io/nkeys v0.4.15 // indirect
 	github.com/nats-io/nuid v1.0.1 // indirect
-	github.com/spf13/cobra v1.10.2 // indirect
-	github.com/spf13/pflag v1.0.10 // indirect
+	github.com/standard-webhooks/standard-webhooks/libraries v0.0.0-20260508151727-1282bb917829 // indirect
 	github.com/stretchr/testify v1.11.1 // indirect
 	github.com/sv-tools/openapi v0.2.1 // indirect
 	github.com/swaggo/swag/v2 v2.0.0-rc4 // indirect
@@ -153,7 +148,7 @@ require (
 	github.com/blevesearch/zapx/v16 v16.2.8 // indirect
 	github.com/bwmarrin/discordgo v0.29.0 // indirect
 	github.com/cloudflare/circl v1.6.3 // indirect
-	github.com/cyphar/filepath-securejoin v0.5.1 // indirect
+	github.com/cyphar/filepath-securejoin v0.6.1 // indirect
 	github.com/emersion/go-imap/v2 v2.0.0-beta.5 // indirect
 	github.com/emersion/go-message v0.18.2 // indirect
 	github.com/emersion/go-sasl v0.0.0-20241020182733-b788ff22d5a6 // indirect
@@ -161,8 +156,8 @@ require (
 	github.com/emirpasic/gods v1.18.1 // indirect
 	github.com/eritikass/githubmarkdownconvertergo v0.1.10 // indirect
 	github.com/go-git/gcfg v1.5.1-0.20230307220236-3a3c6141e376 // indirect
-	github.com/go-git/go-billy/v5 v5.8.0 // indirect
-	github.com/go-git/go-git/v5 v5.18.0 // indirect
+	github.com/go-git/go-billy/v5 v5.9.0 // indirect
+	github.com/go-git/go-git/v5 v5.19.0 // indirect
 	github.com/go-telegram/bot v1.17.0 // indirect
 	github.com/gobwas/glob v0.2.3 // indirect
 	github.com/gocolly/colly v1.2.0 // indirect
@@ -188,7 +183,7 @@ require (
 	github.com/oxffaa/gopher-parse-sitemap v0.0.0-20191021113419-005d2eb1def4 // indirect
 	github.com/philippgille/chromem-go v0.7.0 // indirect
 	github.com/pion/transport/v4 v4.0.1 // indirect
-	github.com/pjbgf/sha1cd v0.3.2 // indirect
+	github.com/pjbgf/sha1cd v0.6.0 // indirect
 	github.com/rs/zerolog v1.31.0 // indirect
 	github.com/saintfish/chardet v0.0.0-20230101081208-5e3ef4b5456d // indirect
 	github.com/segmentio/asm v1.1.3 // indirect
@@ -251,7 +246,7 @@ require (
 	github.com/jeandeaual/go-locale v0.0.0-20250612000132-0ef82f21eade // indirect
 	github.com/json-iterator/go v1.1.12 // indirect
 	github.com/jsummers/gobmp v0.0.0-20230614200233-a9de23ed2e25 // indirect
-	github.com/libp2p/go-yamux/v5 v5.0.1 // indirect
+	github.com/libp2p/go-yamux/v5 v5.1.0 // indirect
 	github.com/magiconair/properties v1.8.10 // indirect
 	github.com/moby/docker-image-spec v1.3.1 // indirect
 	github.com/moby/go-archive v0.2.0 // indirect
@@ -288,7 +283,7 @@ require (
 	github.com/xo/terminfo v0.0.0-20220910002029-abceb7e1c41e // indirect
 	github.com/yosida95/uritemplate/v3 v3.0.2 // indirect
 	go.opentelemetry.io/auto/sdk v1.2.1 // indirect
-	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.61.0 // indirect
+	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.65.0 // indirect
 	go.uber.org/mock v0.5.2 // indirect
 	go.yaml.in/yaml/v2 v2.4.4
 	go.yaml.in/yaml/v3 v3.0.4 // indirect
@@ -323,7 +318,7 @@ require (
 	github.com/creachadair/otp v0.5.0 // indirect
 	github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect
 	github.com/davidlazar/go-crypto v0.0.0-20200604182044-b73af7476f6c // indirect
-	github.com/decred/dcrd/dcrec/secp256k1/v4 v4.4.0 // indirect
+	github.com/decred/dcrd/dcrec/secp256k1/v4 v4.4.1 // indirect
 	github.com/dlclark/regexp2 v1.11.5 // indirect
 	github.com/docker/cli v29.4.0+incompatible // indirect
 	github.com/docker/docker v28.5.2+incompatible
@@ -343,7 +338,7 @@ require (
 	github.com/go-openapi/swag v0.23.0 // indirect
 	github.com/gogo/protobuf v1.3.2 // indirect
 	github.com/golang/groupcache v0.0.0-20241129210726-2c02b8208cf8 // indirect
-	github.com/golang/snappy v0.0.4 // indirect
+	github.com/golang/snappy v0.0.5-0.20231225225746-43d5d4cd4e0e // indirect
 	github.com/google/btree v1.1.3 // indirect
 	github.com/google/go-cmp v0.7.0 // indirect
 	github.com/google/gopacket v1.1.19 // indirect
@@ -355,10 +350,10 @@ require (
 	github.com/henvic/httpretty v0.1.4 // indirect
 	github.com/huandu/xstrings v1.5.0 // indirect
 	github.com/huin/goupnp v1.3.0 // indirect
-	github.com/ipfs/boxo v0.30.0 // indirect
+	github.com/ipfs/boxo v0.37.0 // indirect
 	github.com/ipfs/go-cid v0.6.1 // indirect
-	github.com/ipfs/go-datastore v0.8.2 // indirect
-	github.com/ipfs/go-log/v2 v2.6.0 // indirect
+	github.com/ipfs/go-datastore v0.9.1 // indirect
+	github.com/ipfs/go-log/v2 v2.9.1 // indirect
 	github.com/ipld/go-ipld-prime v0.23.0 // indirect
 	github.com/jackpal/go-nat-pmp v1.0.2 // indirect
 	github.com/jaypipes/pcidb v1.1.1 // indirect
@@ -369,11 +364,11 @@ require (
 	github.com/koron/go-ssdp v0.0.6 // indirect
 	github.com/libp2p/go-buffer-pool v0.1.0 // indirect
 	github.com/libp2p/go-cidranger v1.1.0 // indirect
-	github.com/libp2p/go-flow-metrics v0.2.0 // indirect
+	github.com/libp2p/go-flow-metrics v0.3.0 // indirect
 	github.com/libp2p/go-libp2p-asn-util v0.4.1 // indirect
-	github.com/libp2p/go-libp2p-kad-dht v0.33.1 // indirect
-	github.com/libp2p/go-libp2p-kbucket v0.7.0 // indirect
-	github.com/libp2p/go-libp2p-pubsub v0.14.2 // indirect
+	github.com/libp2p/go-libp2p-kad-dht v0.39.0 // indirect
+	github.com/libp2p/go-libp2p-kbucket v0.8.0 // indirect
+	github.com/libp2p/go-libp2p-pubsub v0.15.0 // indirect
 	github.com/libp2p/go-libp2p-record v0.3.1 // indirect
 	github.com/libp2p/go-libp2p-routing-helpers v0.7.5 // indirect
 	github.com/libp2p/go-msgio v0.3.0 // indirect
@@ -387,7 +382,7 @@ require (
 	github.com/mattn/go-colorable v0.1.14 // indirect
 	github.com/mattn/go-isatty v0.0.20 // indirect
 	github.com/mattn/go-runewidth v0.0.17 // indirect
-	github.com/miekg/dns v1.1.66 // indirect
+	github.com/miekg/dns v1.1.72 // indirect
 	github.com/mikioh/tcpinfo v0.0.0-20190314235526-30a79bb1804b // indirect
 	github.com/mikioh/tcpopt v0.0.0-20190314235656-172688c1accc // indirect
 	github.com/minio/sha256-simd v1.0.1 // indirect
@@ -405,7 +400,7 @@ require (
 	github.com/multiformats/go-base32 v0.1.0 // indirect
 	github.com/multiformats/go-base36 v0.2.0 // indirect
 	github.com/multiformats/go-multiaddr v0.16.1
-	github.com/multiformats/go-multiaddr-dns v0.4.1 // indirect
+	github.com/multiformats/go-multiaddr-dns v0.5.0 // indirect
 	github.com/multiformats/go-multiaddr-fmt v0.1.0 // indirect
 	github.com/multiformats/go-multibase v0.3.0 // indirect
 	github.com/multiformats/go-multicodec v0.10.0 // indirect
@@ -443,7 +438,7 @@ require (
 	github.com/ulikunitz/xz v0.5.14 // indirect
 	github.com/valyala/bytebufferpool v1.0.0 // indirect
 	github.com/vbatts/tar-split v0.12.2 // indirect
-	github.com/vishvananda/netlink v1.3.0 // indirect
+	github.com/vishvananda/netlink v1.3.1 // indirect
 	github.com/vishvananda/netns v0.0.5 // indirect
 	github.com/whyrusleeping/go-keyspace v0.0.0-20160322163242-5b898ac5add1 // indirect
 	github.com/xi2/xz v0.0.0-20171230120015-48954b6210f8 // indirect
@@ -456,9 +451,9 @@ require (
 	go.uber.org/dig v1.19.0 // indirect
 	go.uber.org/fx v1.24.0 // indirect
 	go.uber.org/multierr v1.11.0 // indirect
-	go.uber.org/zap v1.27.0 // indirect
+	go.uber.org/zap v1.27.1 // indirect
 	golang.org/x/crypto v0.50.0
-	golang.org/x/exp v0.0.0-20250606033433-dcc06ee1d476 // indirect
+	golang.org/x/exp v0.0.0-20260410095643-746e56fc9e2f // indirect
 	golang.org/x/mod v0.35.0 // indirect
 	golang.org/x/sync v0.20.0
 	golang.org/x/sys v0.43.0 // indirect
@@ -469,7 +464,7 @@ require (
 	golang.zx2c4.com/wireguard v0.0.0-20250521234502-f333402bd9cb // indirect
 	golang.zx2c4.com/wireguard/windows v0.5.3 // indirect
 	gonum.org/v1/gonum v0.17.0 // indirect
-	google.golang.org/genproto/googleapis/rpc v0.0.0-20260120221211-b8f7ae30c516 // indirect
+	google.golang.org/genproto/googleapis/rpc v0.0.0-20260128011058-8636f8732409 // indirect
 	gopkg.in/fsnotify.v1 v1.4.7 // indirect
 	gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7 // indirect
 	howett.net/plist v1.0.2-0.20250314012144-ee69052608d9 // indirect
--- a/go.sum
+++ b/go.sum
@@ -100,8 +100,8 @@ github.com/antchfx/xmlquery v1.4.4/go.mod h1:AEPEEPYE9GnA2mj5Ur2L5Q5/2PycJ0N9Fus
 github.com/antchfx/xpath v1.3.3/go.mod h1:i54GszH55fYfBmoZXapTHN8T8tkcHfRgLyVwwqzXNcs=
 github.com/antchfx/xpath v1.3.6 h1:s0y+ElRRtTQdfHP609qFu0+c6bglDv20pqOViQjjdPI=
 github.com/antchfx/xpath v1.3.6/go.mod h1:i54GszH55fYfBmoZXapTHN8T8tkcHfRgLyVwwqzXNcs=
-github.com/anthropics/anthropic-sdk-go v1.27.0 h1:0CWbmBq5ofGAjF2H6lefCNRbnaUMGiTKO+lb7RLhDbI=
-github.com/anthropics/anthropic-sdk-go v1.27.0/go.mod h1:qUKmaW+uuPB64iy1l+4kOSvaLqPXnHTTBKH6RVZ7q5Q=
+github.com/anthropics/anthropic-sdk-go v1.42.0 h1:Zv882/dnrE4OHnwhMAsi9lwVVXRF8GtR3ofiBResYUw=
+github.com/anthropics/anthropic-sdk-go v1.42.0/go.mod h1:r4eaLX9tBolUrXLOrLj7eU8tmeBtoobCkM0kBsivBaY=
 github.com/antihax/optional v1.0.0/go.mod h1:uupD/76wgC+ih3iEmQUL+0Ugr19nfwCT1kdvxnR2qWY=
 github.com/armon/circbuf v0.0.0-20150827004946-bbbad097214e/go.mod h1:3U/XgcO3hCbHZ8TKRvWD2dDTCfh9M9ya+I9JpbB7O8o=
 github.com/armon/go-metrics v0.0.0-20180917152333-f0300d1749da/go.mod h1:Q73ZrmVTwzkszR9V5SSuryQ31EELlFMUz1kKyl939pY=
@@ -227,8 +227,6 @@ github.com/charmbracelet/x/exp/slice v0.0.0-20250327172914-2fdc97757edf h1:rLG0Y
 github.com/charmbracelet/x/exp/slice v0.0.0-20250327172914-2fdc97757edf/go.mod h1:B3UgsnsBZS/eX42BlaNiJkD1pPOUa+oF1IYC6Yd2CEU=
 github.com/charmbracelet/x/term v0.2.1 h1:AQeHeLZ1OqSXhrAWpYUtZyX1T3zVxfpZuEQMIQaGIAQ=
 github.com/charmbracelet/x/term v0.2.1/go.mod h1:oQ4enTYFV7QN4m0i9mzHrViD7TQKvNEEkHUMCmsxdUg=
-github.com/chasefleming/elem-go v0.30.0 h1:BlhV1ekv1RbFiM8XZUQeln1Ikb4D+bu2eDO4agREvok=
-github.com/chasefleming/elem-go v0.30.0/go.mod h1:hz73qILBIKnTgOujnSMtEj20/epI+f6vg71RUilJAA4=
 github.com/chengxilo/virtualterm v1.0.4 h1:Z6IpERbRVlfB8WkOmtbHiDbBANU7cimRIof7mk9/PwM=
 github.com/chengxilo/virtualterm v1.0.4/go.mod h1:DyxxBZz/x1iqJjFxTFcr6/x+jSpqN0iwWCOK1q10rlY=
 github.com/chzyer/logex v1.1.10/go.mod h1:+Ywpsq7O8HXn0nuIou7OrIPyXbp3wmkHB+jjWRnGsAI=
@@ -265,17 +263,14 @@ github.com/cpuguy83/dockercfg v0.3.2 h1:DlJTyZGBDlXqUZ2Dk2Q3xHs/FtnooJJVaad2S9GK
 github.com/cpuguy83/dockercfg v0.3.2/go.mod h1:sugsbF4//dDlL/i+S+rtpIWp+5h0BHJHfjj5/jFyUJc=
 github.com/cpuguy83/go-md2man/v2 v2.0.0-20190314233015-f79a8a8ca69d/go.mod h1:maD7wRr/U5Z6m/iR4s+kqSMx2CaBsrgA7czyZG/E6dU=
 github.com/cpuguy83/go-md2man/v2 v2.0.0/go.mod h1:maD7wRr/U5Z6m/iR4s+kqSMx2CaBsrgA7czyZG/E6dU=
-github.com/cpuguy83/go-md2man/v2 v2.0.6/go.mod h1:oOW0eioCTA6cOiMLiUPZOpcVxMig6NIQQ7OS05n1F4g=
 github.com/creachadair/mds v0.21.3 h1:RRgEAPIb52cU0q7UxGyN+13QlCVTZIL4slRr0cYYQfA=
 github.com/creachadair/mds v0.21.3/go.mod h1:1ltMWZd9yXhaHEoZwBialMaviWVUpRPvMwVP7saFAzM=
 github.com/creachadair/otp v0.5.0 h1:q3Th7CXm2zlmCdBjw5tEPFOj4oWJMnVL5HXlq0sNKS0=
 github.com/creachadair/otp v0.5.0/go.mod h1:0kceI87EnYFNYSTL121goJVAnk3eJhaed9H0nMuJUkA=
 github.com/creack/pty v1.1.24 h1:bJrF4RRfyJnbTJqzRLHzcGaZK1NeM5kTC9jGgovnR1s=
 github.com/creack/pty v1.1.24/go.mod h1:08sCNb52WyoAwi2QDyzUCTgcvVFhUzewun7wtTfvcwE=
-github.com/cyphar/filepath-securejoin v0.5.1 h1:eYgfMq5yryL4fbWfkLpFFy2ukSELzaJOTaUTuh+oF48=
-github.com/cyphar/filepath-securejoin v0.5.1/go.mod h1:Sdj7gXlvMcPZsbhwhQ33GguGLDGQL7h7bg04C/+u9jI=
-github.com/dave-gray101/v2keyauth v0.0.0-20240624150259-c45d584d25e2 h1:flLYmnQFZNo04x2NPehMbf30m7Pli57xwZ0NFqR/hb0=
-github.com/dave-gray101/v2keyauth v0.0.0-20240624150259-c45d584d25e2/go.mod h1:NtWqRzAp/1tw+twkW8uuBenEVVYndEAZACWU3F3xdoQ=
+github.com/cyphar/filepath-securejoin v0.6.1 h1:5CeZ1jPXEiYt3+Z6zqprSAgSWiggmpVyciv8syjIpVE=
+github.com/cyphar/filepath-securejoin v0.6.1/go.mod h1:A8hd4EnAeyujCJRrICiOWqjS1AX0a9kM5XL+NwKoYSc=
 github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
 github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
 github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc h1:U9qPSI2PIWSS1VwoXQT9A3Wy9MM3WgvqSxFWenqJduM=
@@ -284,8 +279,8 @@ github.com/davidlazar/go-crypto v0.0.0-20200604182044-b73af7476f6c h1:pFUpOrbxDR
 github.com/davidlazar/go-crypto v0.0.0-20200604182044-b73af7476f6c/go.mod h1:6UhI8N9EjYm1c2odKpFpAYeR8dsBeM7PtzQhRgxRr9U=
 github.com/decred/dcrd/crypto/blake256 v1.1.0 h1:zPMNGQCm0g4QTY27fOCorQW7EryeQ/U0x++OzVrdms8=
 github.com/decred/dcrd/crypto/blake256 v1.1.0/go.mod h1:2OfgNZ5wDpcsFmHmCK5gZTPcCXqlm2ArzUIkw9czNJo=
-github.com/decred/dcrd/dcrec/secp256k1/v4 v4.4.0 h1:NMZiJj8QnKe1LgsbDayM4UoHwbvwDRwnI3hwNaAHRnc=
-github.com/decred/dcrd/dcrec/secp256k1/v4 v4.4.0/go.mod h1:ZXNYxsqcloTdSy/rNShjYzMhyjf0LaoftYK0p+A3h40=
+github.com/decred/dcrd/dcrec/secp256k1/v4 v4.4.1 h1:5RVFMOWjMyRy8cARdy79nAmgYw3hK/4HUq48LQ6Wwqo=
+github.com/decred/dcrd/dcrec/secp256k1/v4 v4.4.1/go.mod h1:ZXNYxsqcloTdSy/rNShjYzMhyjf0LaoftYK0p+A3h40=
 github.com/dhowden/tag v0.0.0-20240417053706-3d75831295e8 h1:OtSeLS5y0Uy01jaKK4mA/WVIYtpzVm63vLVAPzJXigg=
 github.com/dhowden/tag v0.0.0-20240417053706-3d75831295e8/go.mod h1:apkPC/CR3s48O2D7Y++n1XWEpgPNNCjXYga3PPbJe2E=
 github.com/distribution/reference v0.6.0 h1:0IXCQ5g4/QMHHkarYzh5l+u8T3t73zM5QvfrDyIgxBk=
@@ -339,6 +334,8 @@ github.com/felixge/fgprof v0.9.3 h1:VvyZxILNuCiUCSXtPtYmmtGvb65nqXh2QFWc0Wpf2/g=
 github.com/felixge/fgprof v0.9.3/go.mod h1:RdbpDgzqYVh/T9fPELJyV7EYJuHB55UTEULNun8eiPw=
 github.com/felixge/httpsnoop v1.0.4 h1:NFTV2Zj1bL4mc9sqWACXbQFVBBg2W3GPvqp8/ESS2Wg=
 github.com/felixge/httpsnoop v1.0.4/go.mod h1:m8KPJKqk1gH5J9DgRY2ASl2lWCfGKXixSwevea8zH2U=
+github.com/filecoin-project/go-clock v0.1.0 h1:SFbYIM75M8NnFm1yMHhN9Ahy3W5bEZV9gd6MPfXbKVU=
+github.com/filecoin-project/go-clock v0.1.0/go.mod h1:4uB/O4PvOjlx1VCMdZ9MyDZXRm//gkj1ELEbxfI1AZs=
 github.com/flynn/noise v1.1.0 h1:KjPQoQCEFdZDiP03phOvGi11+SVVhBG2wOWAorLsstg=
 github.com/flynn/noise v1.1.0/go.mod h1:xbMo+0i6+IGbYdJhF31t2eR1BIU0CYc12+BNAKwUTag=
 github.com/fortytw2/leaktest v1.3.0 h1:u8491cBMTQ8ft8aeV+adlcytMZylmA5nnwwkRZjI8vw=
@@ -348,8 +345,8 @@ github.com/frankban/quicktest v1.14.6/go.mod h1:4ptaffx2x8+WTWXmUCuVU6aPUX1/Mz7z
 github.com/fredbi/uri v1.1.1 h1:xZHJC08GZNIUhbP5ImTHnt5Ya0T8FI2VAwI/37kh2Ko=
 github.com/fredbi/uri v1.1.1/go.mod h1:4+DZQ5zBjEwQCDmXW5JdIjz0PUA+yJbvtBv+u+adr5o=
 github.com/fsnotify/fsnotify v1.4.9/go.mod h1:znqG4EE+3YCdAaPaxE2ZRY/06pZUdp0tY4IgpuI1SZQ=
-github.com/fsnotify/fsnotify v1.9.0 h1:2Ml+OJNzbYCTzsxtv8vKSFD9PbJjmhYF14k/jKC7S9k=
-github.com/fsnotify/fsnotify v1.9.0/go.mod h1:8jBTzvmWwFyi3Pb8djgCCO5IBqzKJ/Jwo8TRcHyHii0=
+github.com/fsnotify/fsnotify v1.10.1 h1:b0/UzAf9yR5rhf3RPm9gf3ehBPpf0oZKIjtpKrx59Ho=
+github.com/fsnotify/fsnotify v1.10.1/go.mod h1:TLheqan6HD6GBK6PrDWyDPBaEV8LspOxvPSjC+bVfgo=
 github.com/fyne-io/gl-js v0.2.0 h1:+EXMLVEa18EfkXBVKhifYB6OGs3HwKO3lUElA0LlAjs=
 github.com/fyne-io/gl-js v0.2.0/go.mod h1:ZcepK8vmOYLu96JoxbCKJy2ybr+g1pTnaBDdl7c3ajI=
 github.com/fyne-io/glfw-js v0.3.0 h1:d8k2+Y7l+zy2pc7wlGRyPfTgZoqDf3AI4G+2zOWhWUk=
@@ -375,12 +372,12 @@ github.com/go-audio/wav v1.1.0 h1:jQgLtbqBzY7G+BM8fXF7AHUk1uHUviWS4X39d5rsL2g=
 github.com/go-audio/wav v1.1.0/go.mod h1:mpe9qfwbScEbkd8uybLuIpTgHyrISw/OTuvjUW2iGtE=
 github.com/go-git/gcfg v1.5.1-0.20230307220236-3a3c6141e376 h1:+zs/tPmkDkHx3U66DAb0lQFJrpS6731Oaa12ikc+DiI=
 github.com/go-git/gcfg v1.5.1-0.20230307220236-3a3c6141e376/go.mod h1:an3vInlBmSxCcxctByoQdvwPiA7DTK7jaaFDBTtu0ic=
-github.com/go-git/go-billy/v5 v5.8.0 h1:I8hjc3LbBlXTtVuFNJuwYuMiHvQJDq1AT6u4DwDzZG0=
-github.com/go-git/go-billy/v5 v5.8.0/go.mod h1:RpvI/rw4Vr5QA+Z60c6d6LXH0rYJo0uD5SqfmrrheCY=
+github.com/go-git/go-billy/v5 v5.9.0 h1:jItGXszUDRtR/AlferWPTMN4j38BQ88XnXKbilmmBPA=
+github.com/go-git/go-billy/v5 v5.9.0/go.mod h1:jCnQMLj9eUgGU7+ludSTYoZL/GGmii14RxKFj7ROgHw=
 github.com/go-git/go-git-fixtures/v4 v4.3.2-0.20231010084843-55a94097c399 h1:eMje31YglSBqCdIqdhKBW8lokaMrL3uTkpGYlE2OOT4=
 github.com/go-git/go-git-fixtures/v4 v4.3.2-0.20231010084843-55a94097c399/go.mod h1:1OCfN199q1Jm3HZlxleg+Dw/mwps2Wbk9frAWm+4FII=
-github.com/go-git/go-git/v5 v5.18.0 h1:O831KI+0PR51hM2kep6T8k+w0/LIAD490gvqMCvL5hM=
-github.com/go-git/go-git/v5 v5.18.0/go.mod h1:pW/VmeqkanRFqR6AljLcs7EA7FbZaN5MQqO7oZADXpo=
+github.com/go-git/go-git/v5 v5.19.0 h1:+WkVUQZSy/F1Gb13udrMKjIM2PrzsNfDKFSfo5tkMtc=
+github.com/go-git/go-git/v5 v5.19.0/go.mod h1:Pb1v0c7/g8aGQJwx9Us09W85yGoyvSwuhEGMH7zjDKQ=
 github.com/go-gl/gl v0.0.0-20231021071112-07e5d0ea2e71 h1:5BVwOaUSBTlVZowGO6VZGw2H/zl9nrd3eCZfYV+NfQA=
 github.com/go-gl/gl v0.0.0-20231021071112-07e5d0ea2e71/go.mod h1:9YTyiznxEY1fVinfM7RvRcjRHbw2xLBJ3AAGIT0I4Nw=
 github.com/go-gl/glfw v0.0.0-20190409004039-e6da0acd62b1/go.mod h1:vR7hzQXu2zJy9AVAgeJqvqgH9Q5CA+iKCZ2gyEVpxRU=
@@ -432,12 +429,6 @@ github.com/godbus/dbus/v5 v5.1.0 h1:4KLkAxT3aOY8Li4FRJe/KvhoNFFxo0m6fNuFUO8QJUk=
 github.com/godbus/dbus/v5 v5.1.0/go.mod h1:xhWf0FNVPg57R7Z0UbKHbJfkEywrmjJnf7w5xrFpKfA=
 github.com/gofiber/fiber/v2 v2.52.13 h1:TOKP64iqC9b5P49VrBW5tHhUOvDyrtJ0xePEfzJbCbk=
 github.com/gofiber/fiber/v2 v2.52.13/go.mod h1:YEcBbO/FB+5M1IZNBP9FO3J9281zgPAreiI1oqg8nDw=
-github.com/gofiber/template v1.8.3 h1:hzHdvMwMo/T2kouz2pPCA0zGiLCeMnoGsQZBTSYgZxc=
-github.com/gofiber/template v1.8.3/go.mod h1:bs/2n0pSNPOkRa5VJ8zTIvedcI/lEYxzV3+YPXdBvq8=
-github.com/gofiber/template/html/v2 v2.1.3 h1:n1LYBtmr9C0V/k/3qBblXyMxV5B0o/gpb6dFLp8ea+o=
-github.com/gofiber/template/html/v2 v2.1.3/go.mod h1:U5Fxgc5KpyujU9OqKzy6Kn6Qup6Tm7zdsISR+VpnHRE=
-github.com/gofiber/utils v1.1.0 h1:vdEBpn7AzIUJRhe+CiTOJdUcTg4Q9RK+pEa0KPbLdrM=
-github.com/gofiber/utils v1.1.0/go.mod h1:poZpsnhBykfnY1Mc0KeEa6mSHrS3dV0+oBWyeQmb2e0=
 github.com/gofrs/flock v0.13.0 h1:95JolYOvGMqeH31+FC7D2+uULf6mG61mEZ/A8dRYMzw=
 github.com/gofrs/flock v0.13.0/go.mod h1:jxeyy9R1auM5S6JYDBhDt+E2TCo7DkratH4Pgi8P+Z0=
 github.com/gogo/protobuf v1.3.2 h1:Ov1cvc58UF3b5XjBnZv7+opcTcQFZebYjWzi34vdm4Q=
@@ -479,8 +470,8 @@ github.com/golang/protobuf v1.5.2/go.mod h1:XVQd3VNwM+JqD3oG2Ue2ip4fOMUkwXdXDdiu
 github.com/golang/protobuf v1.5.4 h1:i7eJL8qZTpSEXOPTxNKhASYpMn+8e5Q6AdndVa1dWek=
 github.com/golang/protobuf v1.5.4/go.mod h1:lnTiLA8Wa4RWRcIUkrtSVa5nRhsEGBg48fD6rSs7xps=
 github.com/golang/snappy v0.0.2/go.mod h1:/XxbfmMg8lxefKM7IXC3fBNl/7bRcc72aCRzEWrmP2Q=
-github.com/golang/snappy v0.0.4 h1:yAGX7huGHXlcLOEtBnF4w7FQwA26wojNCwOYAEhLjQM=
-github.com/golang/snappy v0.0.4/go.mod h1:/XxbfmMg8lxefKM7IXC3fBNl/7bRcc72aCRzEWrmP2Q=
+github.com/golang/snappy v0.0.5-0.20231225225746-43d5d4cd4e0e h1:4bw4WeyTYPp0smaXiJZCNnLrvVBqirQVreixayXezGc=
+github.com/golang/snappy v0.0.5-0.20231225225746-43d5d4cd4e0e/go.mod h1:/XxbfmMg8lxefKM7IXC3fBNl/7bRcc72aCRzEWrmP2Q=
 github.com/gomarkdown/markdown v0.0.0-20250311123330-531bef5e742b h1:EY/KpStFl60qA17CptGXhwfZ+k1sFNJIUNR8DdbcuUk=
 github.com/gomarkdown/markdown v0.0.0-20250311123330-531bef5e742b/go.mod h1:JDGcbDT52eL4fju3sZ4TeHGsQwhG9nbDV21aMyhwPoA=
 github.com/google/btree v0.0.0-20180813153112-4030bb1f1f0c/go.mod h1:lNA+9X1NB3Zf8V7Ke586lFgjr2dZNuvo3lPJSGZ5JPQ=
@@ -587,27 +578,25 @@ github.com/huin/goupnp v1.3.0/go.mod h1:gnGPsThkYa7bFi/KWmEysQRf48l2dvR5bxr2OFck
 github.com/ianlancetaylor/demangle v0.0.0-20181102032728-5e5cf60278f6/go.mod h1:aSSvb/t6k1mPoxDqO4vJh6VOCGPwU4O0C2/Eqndh1Sc=
 github.com/ianlancetaylor/demangle v0.0.0-20200824232613-28f6c0f3b639/go.mod h1:aSSvb/t6k1mPoxDqO4vJh6VOCGPwU4O0C2/Eqndh1Sc=
 github.com/inconshreveable/mousetrap v1.0.0/go.mod h1:PxqpIevigyE2G7u3NXJIT2ANytuPF1OarO4DADm73n8=
-github.com/inconshreveable/mousetrap v1.1.0 h1:wN+x4NVGpMsO7ErUn/mUI3vEoE6Jt13X2s0bqwp9tc8=
-github.com/inconshreveable/mousetrap v1.1.0/go.mod h1:vpF70FUmC8bwa3OWnCshd2FqLfsEA9PFc4w1p2J65bw=
-github.com/ipfs/boxo v0.30.0 h1:7afsoxPGGqfoH7Dum/wOTGUB9M5fb8HyKPMlLfBvIEQ=
-github.com/ipfs/boxo v0.30.0/go.mod h1:BPqgGGyHB9rZZcPSzah2Dc9C+5Or3U1aQe7EH1H7370=
-github.com/ipfs/go-block-format v0.2.0 h1:ZqrkxBA2ICbDRbK8KJs/u0O3dlp6gmAuuXUJNiW1Ycs=
-github.com/ipfs/go-block-format v0.2.0/go.mod h1:+jpL11nFx5A/SPpsoBn6Bzkra/zaArfSmsknbPMYgzM=
+github.com/invopop/jsonschema v0.13.0 h1:KvpoAJWEjR3uD9Kbm2HWJmqsEaHt8lBUpd0qHcIi21E=
+github.com/invopop/jsonschema v0.13.0/go.mod h1:ffZ5Km5SWWRAIN6wbDXItl95euhFz2uON45H2qjYt+0=
+github.com/ipfs/boxo v0.37.0 h1:2E3mZvydMI2t5IkAgtkmZ3sGsld0oS7o3I+xyzDk6uI=
+github.com/ipfs/boxo v0.37.0/go.mod h1:8yyiRn54F2CsW13n0zwXEPrVsZix/gFj9SYIRYMZ6KE=
+github.com/ipfs/go-block-format v0.2.3 h1:mpCuDaNXJ4wrBJLrtEaGFGXkferrw5eqVvzaHhtFKQk=
+github.com/ipfs/go-block-format v0.2.3/go.mod h1:WJaQmPAKhD3LspLixqlqNFxiZ3BZ3xgqxxoSR/76pnA=
 github.com/ipfs/go-cid v0.6.1 h1:T5TnNb08+ueovG76Z5gx1L4Y7QOaGTXHg1F6raWFxIc=
 github.com/ipfs/go-cid v0.6.1/go.mod h1:zrY0SwOhjrrIdfPQ/kf+k1sXyJ0QE7cMxfCployLBs0=
-github.com/ipfs/go-datastore v0.8.2 h1:Jy3wjqQR6sg/LhyY0NIePZC3Vux19nLtg7dx0TVqr6U=
-github.com/ipfs/go-datastore v0.8.2/go.mod h1:W+pI1NsUsz3tcsAACMtfC+IZdnQTnC/7VfPoJBQuts0=
+github.com/ipfs/go-datastore v0.9.1 h1:67Po2epre/o0UxrmkzdS9ZTe2GFGODgTd2odx8Wh6Yo=
+github.com/ipfs/go-datastore v0.9.1/go.mod h1:zi07Nvrpq1bQwSkEnx3bfjz+SQZbdbWyCNvyxMh9pN0=
 github.com/ipfs/go-detect-race v0.0.1 h1:qX/xay2W3E4Q1U7d9lNs1sU9nvguX0a7319XbyQ6cOk=
 github.com/ipfs/go-detect-race v0.0.1/go.mod h1:8BNT7shDZPo99Q74BpGMK+4D8Mn4j46UU0LZ723meps=
-github.com/ipfs/go-ipfs-util v0.0.3 h1:2RFdGez6bu2ZlZdI+rWfIdbQb1KudQp3VGwPtdNCmE0=
-github.com/ipfs/go-ipfs-util v0.0.3/go.mod h1:LHzG1a0Ig4G+iZ26UUOMjHd+lfM84LZCrn17xAKWBvs=
 github.com/ipfs/go-log v1.0.5 h1:2dOuUCB1Z7uoczMWgAyDck5JLb72zHzrMnGnCNNbvY8=
 github.com/ipfs/go-log v1.0.5/go.mod h1:j0b8ZoR+7+R99LD9jZ6+AJsrzkPbSXbZfGakb5JPtIo=
 github.com/ipfs/go-log/v2 v2.1.3/go.mod h1:/8d0SH3Su5Ooc31QlL1WysJhvyOTDCjcCZ9Axpmri6g=
-github.com/ipfs/go-log/v2 v2.6.0 h1:2Nu1KKQQ2ayonKp4MPo6pXCjqw1ULc9iohRqWV5EYqg=
-github.com/ipfs/go-log/v2 v2.6.0/go.mod h1:p+Efr3qaY5YXpx9TX7MoLCSEZX5boSWj9wh86P5HJa8=
-github.com/ipfs/go-test v0.2.1 h1:/D/a8xZ2JzkYqcVcV/7HYlCnc7bv/pKHQiX5TdClkPE=
-github.com/ipfs/go-test v0.2.1/go.mod h1:dzu+KB9cmWjuJnXFDYJwC25T3j1GcN57byN+ixmK39M=
+github.com/ipfs/go-log/v2 v2.9.1 h1:3JXwHWU31dsCpvQ+7asz6/QsFJHqFr4gLgQ0FWteujk=
+github.com/ipfs/go-log/v2 v2.9.1/go.mod h1:evFx7sBiohUN3AG12mXlZBw5hacBQld3ZPHrowlJYoo=
+github.com/ipfs/go-test v0.2.3 h1:Z/jXNAReQFtCYyn7bsv/ZqUwS6E7iIcSpJ2CuzCvnrc=
+github.com/ipfs/go-test v0.2.3/go.mod h1:QW8vSKkwYvWFwIZQLGQXdkt9Ud76eQXRQ9Ao2H+cA1o=
 github.com/ipld/go-ipld-prime v0.23.0 h1:csqdPZH60BsTC+AZrv7fpa27v+09I/oTqyHYYYE27eE=
 github.com/ipld/go-ipld-prime v0.23.0/go.mod h1:46YCFSFNFBJHPjB0pfMuv7Ly7df2eChpkpyPo5SE0bA=
 github.com/jackc/pgpassfile v1.0.0 h1:/6Hmqy13Ss2zCq62VdNG8tM1wchn8zjSGOBJ6icpsIM=
@@ -695,18 +684,18 @@ github.com/libp2p/go-buffer-pool v0.1.0 h1:oK4mSFcQz7cTQIfqbe4MIj9gLW+mnanjyFtc6
 github.com/libp2p/go-buffer-pool v0.1.0/go.mod h1:N+vh8gMqimBzdKkSMVuydVDq+UV5QTWy5HSiZacSbPg=
 github.com/libp2p/go-cidranger v1.1.0 h1:ewPN8EZ0dd1LSnrtuwd4709PXVcITVeuwbag38yPW7c=
 github.com/libp2p/go-cidranger v1.1.0/go.mod h1:KWZTfSr+r9qEo9OkI9/SIEeAtw+NNoU0dXIXt15Okic=
-github.com/libp2p/go-flow-metrics v0.2.0 h1:EIZzjmeOE6c8Dav0sNv35vhZxATIXWZg6j/C08XmmDw=
-github.com/libp2p/go-flow-metrics v0.2.0/go.mod h1:st3qqfu8+pMfh+9Mzqb2GTiwrAGjIPszEjZmtksN8Jc=
+github.com/libp2p/go-flow-metrics v0.3.0 h1:q31zcHUvHnwDO0SHaukewPYgwOBSxtt830uJtUx6784=
+github.com/libp2p/go-flow-metrics v0.3.0/go.mod h1:nuhlreIwEguM1IvHAew3ij7A8BMlyHQJ279ao24eZZo=
 github.com/libp2p/go-libp2p v0.48.0 h1:h2BrLAgrj7X8bEN05K7qmrjpNHYA+6tnsGRdprjTnvo=
 github.com/libp2p/go-libp2p v0.48.0/go.mod h1:Q1fBZNdmC2Hf82husCTfkKJVfHm2we5zk+NWmOGEmWk=
 github.com/libp2p/go-libp2p-asn-util v0.4.1 h1:xqL7++IKD9TBFMgnLPZR6/6iYhawHKHl950SO9L6n94=
 github.com/libp2p/go-libp2p-asn-util v0.4.1/go.mod h1:d/NI6XZ9qxw67b4e+NgpQexCIiFYJjErASrYW4PFDN8=
-github.com/libp2p/go-libp2p-kad-dht v0.33.1 h1:hKFhHMf7WH69LDjaxsJUWOU6qZm71uO47M/a5ijkiP0=
-github.com/libp2p/go-libp2p-kad-dht v0.33.1/go.mod h1:CdmNk4VeGJa9EXM9SLNyNVySEvduKvb+5rSC/H4pLAo=
-github.com/libp2p/go-libp2p-kbucket v0.7.0 h1:vYDvRjkyJPeWunQXqcW2Z6E93Ywx7fX0jgzb/dGOKCs=
-github.com/libp2p/go-libp2p-kbucket v0.7.0/go.mod h1:blOINGIj1yiPYlVEX0Rj9QwEkmVnz3EP8LK1dRKBC6g=
-github.com/libp2p/go-libp2p-pubsub v0.14.2 h1:nT5lFHPQOFJcp9CW8hpKtvbpQNdl2udJuzLQWbgRum8=
-github.com/libp2p/go-libp2p-pubsub v0.14.2/go.mod h1:MKPU5vMI8RRFyTP0HfdsF9cLmL1nHAeJm44AxJGJx44=
+github.com/libp2p/go-libp2p-kad-dht v0.39.0 h1:mww38eBYiUvdsu+Xl/GLlBC0Aa8M+5HAwvafkFOygAM=
+github.com/libp2p/go-libp2p-kad-dht v0.39.0/go.mod h1:Po2JugFEkDq9Vig/JXtc153ntOi0q58o4j7IuITCOVs=
+github.com/libp2p/go-libp2p-kbucket v0.8.0 h1:QAK7RzKJpYe+EuSEATAaaHYMYLkPDGC18m9jxPLnU8s=
+github.com/libp2p/go-libp2p-kbucket v0.8.0/go.mod h1:JMlxqcEyKwO6ox716eyC0hmiduSWZZl6JY93mGaaqc4=
+github.com/libp2p/go-libp2p-pubsub v0.15.0 h1:cG7Cng2BT82WttmPFMi50gDNV+58K626m/wR00vGL1o=
+github.com/libp2p/go-libp2p-pubsub v0.15.0/go.mod h1:lr4oE8bFgQaifRcoc2uWhWWiK6tPdOEKpUuR408GFN4=
 github.com/libp2p/go-libp2p-record v0.3.1 h1:cly48Xi5GjNw5Wq+7gmjfBiG9HCzQVkiZOUZ8kUl+Fg=
 github.com/libp2p/go-libp2p-record v0.3.1/go.mod h1:T8itUkLcWQLCYMqtX7Th6r7SexyUJpIyPgks757td/E=
 github.com/libp2p/go-libp2p-routing-helpers v0.7.5 h1:HdwZj9NKovMx0vqq6YNPTh6aaNzey5zHD7HeLJtq6fI=
@@ -719,8 +708,8 @@ github.com/libp2p/go-netroute v0.4.0 h1:sZZx9hyANYUx9PZyqcgE/E1GUG3iEtTZHUEvdtXT
 github.com/libp2p/go-netroute v0.4.0/go.mod h1:Nkd5ShYgSMS5MUKy/MU2T57xFoOKvvLR92Lic48LEyA=
 github.com/libp2p/go-reuseport v0.4.0 h1:nR5KU7hD0WxXCJbmw7r2rhRYruNRl2koHw8fQscQm2s=
 github.com/libp2p/go-reuseport v0.4.0/go.mod h1:ZtI03j/wO5hZVDFo2jKywN6bYKWLOy8Se6DrI2E1cLU=
-github.com/libp2p/go-yamux/v5 v5.0.1 h1:f0WoX/bEF2E8SbE4c/k1Mo+/9z0O4oC/hWEA+nfYRSg=
-github.com/libp2p/go-yamux/v5 v5.0.1/go.mod h1:en+3cdX51U0ZslwRdRLrvQsdayFt3TSUKvBGErzpWbU=
+github.com/libp2p/go-yamux/v5 v5.1.0 h1:8Qlxj4E9JGJAQVW6+uj2o7mqkqsIVlSUGmTWhlXzoHE=
+github.com/libp2p/go-yamux/v5 v5.1.0/go.mod h1:tgIQ07ObtRR/I0IWsFOyQIL9/dR5UXgc2s8xKmNZv1o=
 github.com/libp2p/zeroconf/v2 v2.2.0 h1:Cup06Jv6u81HLhIj1KasuNM/RHHrJ8T7wOTS4+Tv53Q=
 github.com/libp2p/zeroconf/v2 v2.2.0/go.mod h1:fuJqLnUwZTshS3U/bMRJ3+ow/v9oid1n0DmyYyNO1Xs=
 github.com/lithammer/fuzzysearch v1.1.8 h1:/HIuJnjHuXS8bKaiTMeeDlW2/AyIWk2brx1V8LFgLN4=
@@ -765,8 +754,8 @@ github.com/microcosm-cc/bluemonday v1.0.27 h1:MpEUotklkwCSLeH+Qdx1VJgNqLlpY2KXwX
 github.com/microcosm-cc/bluemonday v1.0.27/go.mod h1:jFi9vgW+H7c3V0lb6nR74Ib/DIB5OBs92Dimizgw2cA=
 github.com/miekg/dns v1.0.14/go.mod h1:W1PPwlIAgtquWBMBEV9nkV9Cazfe8ScdGz/Lj7v3Nrg=
 github.com/miekg/dns v1.1.43/go.mod h1:+evo5L0630/F6ca/Z9+GAqzhjGyn8/c+TBaOyfEl0V4=
-github.com/miekg/dns v1.1.66 h1:FeZXOS3VCVsKnEAd+wBkjMC3D2K+ww66Cq3VnCINuJE=
-github.com/miekg/dns v1.1.66/go.mod h1:jGFzBsSNbJw6z1HYut1RKBKHA9PBdxeHrZG8J+gC2WE=
+github.com/miekg/dns v1.1.72 h1:vhmr+TF2A3tuoGNkLDFK9zi36F2LS+hKTRW0Uf8kbzI=
+github.com/miekg/dns v1.1.72/go.mod h1:+EuEPhdHOsfk6Wk5TT2CzssZdqkmFhf8r+aVyDEToIs=
 github.com/mikioh/tcp v0.0.0-20190314235350-803a9b46060c h1:bzE/A84HN25pxAuk9Eej1Kz9OUelF97nAc82bDquQI8=
 github.com/mikioh/tcp v0.0.0-20190314235350-803a9b46060c/go.mod h1:0SQS9kMwD2VsyFEB++InYyBJroV/FRmBgcydeSUcJms=
 github.com/mikioh/tcpinfo v0.0.0-20190314235526-30a79bb1804b h1:z78hV3sbSMAUoyUMM0I83AUIT6Hu17AWfgjzIbtrYFc=
@@ -828,14 +817,12 @@ github.com/mr-tron/base58 v1.3.0 h1:K6Y13R2h+dku0wOqKtecgRnBUBPrZzLZy5aIj8lCcJI=
 github.com/mr-tron/base58 v1.3.0/go.mod h1:2BuubE67DCSWwVfx37JWNG8emOC0sHEU4/HpcYgCLX8=
 github.com/mschoch/smat v0.2.0 h1:8imxQsjDm8yFEAVBe7azKmKSgzSkZXDuKkSq9374khM=
 github.com/mschoch/smat v0.2.0/go.mod h1:kc9mz7DoBKqDyiRL7VZN8KvXQMWeTaVnttLRXOlotKw=
-github.com/mudler/LocalAGI v0.0.0-20260507074708-c1a12317930d h1:PYrydMGkFcEzNpazJ4ptaZdxG29CIQbUE1j0YRDFswA=
-github.com/mudler/LocalAGI v0.0.0-20260507074708-c1a12317930d/go.mod h1:x77p9W1zKZr+W+UcEwg8/qdp00p4XXOI69wE7WlXZc0=
 github.com/mudler/LocalAGI v0.0.0-20260508125235-37810d918a87 h1:az+2umaD/sT1rRvI3WZHWXjzdJVJHxcyxp0SNYbqlFk=
 github.com/mudler/LocalAGI v0.0.0-20260508125235-37810d918a87/go.mod h1:x77p9W1zKZr+W+UcEwg8/qdp00p4XXOI69wE7WlXZc0=
 github.com/mudler/cogito v0.9.5-0.20260315222927-63abdec7189b h1:A74T2Lauvg61KodYqsjTYDY05kPLcW+efVZjd23dghU=
 github.com/mudler/cogito v0.9.5-0.20260315222927-63abdec7189b/go.mod h1:6sfja3lcu2nWRzEc0wwqGNu/eCG3EWgij+8s7xyUeQ4=
-github.com/mudler/edgevpn v0.31.1 h1:7qegiDWd0kAg6ljhNHxqvp8hbo/6BbzSdbb7/2WZfiY=
-github.com/mudler/edgevpn v0.31.1/go.mod h1:ftV5B0nKFzm4R8vR80UYnCb2nf7lxCRgAALxUEEgCf8=
+github.com/mudler/edgevpn v0.32.2 h1:umTPyyZgkom/A81Bk4HbP0p1ZSEU5EFPW3Bg+YPxI8A=
+github.com/mudler/edgevpn v0.32.2/go.mod h1:UaMc8MORbcRsAjuO5gVJj9Bn3Nq2AP5U9NTb6epVyv8=
 github.com/mudler/go-piper v0.0.0-20241023091659-2494246fd9fc h1:RxwneJl1VgvikiX28EkpdAyL4yQVnJMrbquKospjHyA=
 github.com/mudler/go-piper v0.0.0-20241023091659-2494246fd9fc/go.mod h1:O7SwdSWMilAWhBZMK9N9Y/oBDyMMzshE3ju8Xkexwig=
 github.com/mudler/go-processmanager v0.1.1 h1:c/1NRZOZpW8HuFv9RhBG57nQu1oDMRomEHedwBFMlrw=
@@ -861,8 +848,8 @@ github.com/multiformats/go-base36 v0.2.0/go.mod h1:qvnKE++v+2MWCfePClUEjE78Z7P2a
 github.com/multiformats/go-multiaddr v0.1.1/go.mod h1:aMKBKNEYmzmDmxfX88/vz+J5IU55txyt0p4aiWVohjo=
 github.com/multiformats/go-multiaddr v0.16.1 h1:fgJ0Pitow+wWXzN9do+1b8Pyjmo8m5WhGfzpL82MpCw=
 github.com/multiformats/go-multiaddr v0.16.1/go.mod h1:JSVUmXDjsVFiW7RjIFMP7+Ev+h1DTbiJgVeTV/tcmP0=
-github.com/multiformats/go-multiaddr-dns v0.4.1 h1:whi/uCLbDS3mSEUMb1MsoT4uzUeZB0N32yzufqS0i5M=
-github.com/multiformats/go-multiaddr-dns v0.4.1/go.mod h1:7hfthtB4E4pQwirrz+J0CcDUfbWzTqEzVyYKKIKpgkc=
+github.com/multiformats/go-multiaddr-dns v0.5.0 h1:p/FTyHKX0nl59f+S+dEUe8HRK+i5Ow/QHMw8Nh3gPCo=
+github.com/multiformats/go-multiaddr-dns v0.5.0/go.mod h1:yJ349b8TPIAANUyuOzn1oz9o22tV9f+06L+cCeMxC14=
 github.com/multiformats/go-multiaddr-fmt v0.1.0 h1:WLEFClPycPkp4fnIzoFoV9FVd49/eQsuaL3/CWe167E=
 github.com/multiformats/go-multiaddr-fmt v0.1.0/go.mod h1:hGtDIW4PU4BqJ50gW2quDuPVjyWNZxToGUh/HwTZYJo=
 github.com/multiformats/go-multibase v0.3.0 h1:8helZD2+4Db7NNWFiktk2NePbF0boolBe6bDQvM4r68=
@@ -902,8 +889,8 @@ github.com/onsi/ginkgo v1.16.5 h1:8xi0RTUf59SOSfEtZMvwTvXYMzG4gV23XVHOZiXNtnE=
 github.com/onsi/ginkgo v1.16.5/go.mod h1:+E8gABHa3K6zRBolWtd+ROzc/U5bkGt0FwiG042wbpU=
 github.com/onsi/ginkgo/v2 v2.28.2 h1:DTrMfpqxiNUyQ3Y0zhn1n3cOO2euFgQPYIpkWwxVFps=
 github.com/onsi/ginkgo/v2 v2.28.2/go.mod h1:CLtbVInNckU3/+gC8LzkGUb9oF+e8W8TdUsxPwvdOgE=
-github.com/onsi/gomega v1.39.1 h1:1IJLAad4zjPn2PsnhH70V4DKRFlrCzGBNrNaru+Vf28=
-github.com/onsi/gomega v1.39.1/go.mod h1:hL6yVALoTOxeWudERyfppUcZXjMwIMLnuSfruD2lcfg=
+github.com/onsi/gomega v1.40.0 h1:Vtol0e1MghCD2ZVIilPDIg44XSL9l2QAn8ZNaljWcJc=
+github.com/onsi/gomega v1.40.0/go.mod h1:M/Uqpu/8qTjtzCLUA2zJHX9Iilrau25x1PdoSRbWh5A=
 github.com/openai/openai-go/v3 v3.26.0 h1:bRt6H/ozMNt/dDkN4gobnLqaEGrRGBzmbVs0xxJEnQE=
 github.com/openai/openai-go/v3 v3.26.0/go.mod h1:cdufnVK14cWcT9qA1rRtrXx4FTRsgbDPW7Ia7SS5cZo=
 github.com/opencontainers/go-digest v1.0.0 h1:apOUWs51W5PlhuyGyz9FCeeBIOUDA/6nW8Oi/yOhh5U=
@@ -968,8 +955,8 @@ github.com/pion/turn/v4 v4.1.4 h1:EU11yMXKIsK43FhcUnjLlrhE4nboHZq+TXBIi3QpcxQ=
 github.com/pion/turn/v4 v4.1.4/go.mod h1:ES1DXVFKnOhuDkqn9hn5VJlSWmZPaRJLyBXoOeO/BmQ=
 github.com/pion/webrtc/v4 v4.2.11 h1:QUX1QZKlNIn4O7U5JxLPGP0sV5RTncZkzu9SPR3jVNU=
 github.com/pion/webrtc/v4 v4.2.11/go.mod h1:s/rAiyy77GyRFrZMx+Ls6aua26dIBPudH8/ZHYbIRWY=
-github.com/pjbgf/sha1cd v0.3.2 h1:a9wb0bp1oC2TGwStyn0Umc/IGKQnEgF0vVaZ8QF8eo4=
-github.com/pjbgf/sha1cd v0.3.2/go.mod h1:zQWigSxVmsHEZow5qaLtPYxpcKMMQpa09ixqBxuCS6A=
+github.com/pjbgf/sha1cd v0.6.0 h1:3WJ8Wz8gvDz29quX1OcEmkAlUg9diU4GxJHqs0/XiwU=
+github.com/pjbgf/sha1cd v0.6.0/go.mod h1:lhpGlyHLpQZoxMv8HcgXvZEhcGs0PG/vsZnEJ7H0iCM=
 github.com/pkg/errors v0.8.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
 github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4=
 github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
@@ -1019,7 +1006,6 @@ github.com/rs/zerolog v1.31.0/go.mod h1:/7mN4D5sKwJLZQ2b/znpjC3/GQWY/xaDXUM0kKWR
 github.com/russross/blackfriday v1.6.0 h1:KqfZb0pUVN2lYqZUYRddxF4OR8ZMURnJIG5Y3VRLtww=
 github.com/russross/blackfriday v1.6.0/go.mod h1:ti0ldHuxg49ri4ksnFxlkCfN+hvslNlmVHqNRXXJNAY=
 github.com/russross/blackfriday/v2 v2.0.1/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=
-github.com/russross/blackfriday/v2 v2.1.0/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=
 github.com/ruudk/golang-pdf417 v0.0.0-20181029194003-1af4ab5afa58/go.mod h1:6lfFZQK844Gfx8o5WFuvpxWRwnSoipWe/p622j1v06w=
 github.com/ryanuber/columnize v0.0.0-20160712163229-9b3edd62028f/go.mod h1:sm1tb6uqfes/u+d4ooFouqFdy9/2g9QGwK3SQygK0Ts=
 github.com/rymdport/portal v0.4.2 h1:7jKRSemwlTyVHHrTGgQg7gmNPJs88xkbKcIL3NlcmSU=
@@ -1078,13 +1064,8 @@ github.com/spf13/cast v1.3.1/go.mod h1:Qx5cxh0v+4UWYiBimWS+eyWzqEqokIECu5etghLkU
 github.com/spf13/cast v1.7.0 h1:ntdiHjuueXFgm5nzDRdOS4yfT43P5Fnud6DH50rz/7w=
 github.com/spf13/cast v1.7.0/go.mod h1:ancEpBxwJDODSW/UG4rDrAqiKolqNNh2DX3mk86cAdo=
 github.com/spf13/cobra v1.2.1/go.mod h1:ExllRjgxM/piMAM+3tAZvg8fsklGAf3tPfi+i8t68Nk=
-github.com/spf13/cobra v1.10.2 h1:DMTTonx5m65Ic0GOoRY2c16WCbHxOOw6xxezuLaBpcU=
-github.com/spf13/cobra v1.10.2/go.mod h1:7C1pvHqHw5A4vrJfjNwvOdzYu0Gml16OCs2GRiTUUS4=
 github.com/spf13/jwalterweatherman v1.1.0/go.mod h1:aNWZUN0dPAAO/Ljvb5BEdw96iTZ0EXowPYD95IqWIGo=
 github.com/spf13/pflag v1.0.5/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=
-github.com/spf13/pflag v1.0.9/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=
-github.com/spf13/pflag v1.0.10 h1:4EBh2KAYBwaONj6b2Ye1GiHfwjqyROoF4RwYO+vPwFk=
-github.com/spf13/pflag v1.0.10/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=
 github.com/spf13/viper v1.8.1/go.mod h1:o0Pch8wJ9BVSWGQMbra6iw0oQ5oktSIBaujf1rJH9Ns=
 github.com/srwiley/oksvg v0.0.0-20221011165216-be6e8873101c h1:km8GpoQut05eY3GiYWEedbTT0qnSxrCjsVbb7yKY1KE=
 github.com/srwiley/oksvg v0.0.0-20221011165216-be6e8873101c/go.mod h1:cNQ3dwVJtS5Hmnjxy6AgTPd0Inb3pW05ftPSX7NZO7Q=
@@ -1092,6 +1073,8 @@ github.com/srwiley/rasterx v0.0.0-20220730225603-2ab79fcdd4ef h1:Ch6Q+AZUxDBCVqd
 github.com/srwiley/rasterx v0.0.0-20220730225603-2ab79fcdd4ef/go.mod h1:nXTWP6+gD5+LUJ8krVhhoeHjvHTutPxMYl5SvkcnJNE=
 github.com/ssor/bom v0.0.0-20170718123548-6386211fdfcf h1:pvbZ0lM0XWPBqUKqFU8cmavspvIl9nulOYwdy6IFRRo=
 github.com/ssor/bom v0.0.0-20170718123548-6386211fdfcf/go.mod h1:RJID2RhlZKId02nZ62WenDCkgHFerpIOmW0iT7GKmXM=
+github.com/standard-webhooks/standard-webhooks/libraries v0.0.0-20260508151727-1282bb917829 h1:zGlGD0Zfk2HaIo4EnUVBRhnXQ+cnGQz5X2PdBcplOyw=
+github.com/standard-webhooks/standard-webhooks/libraries v0.0.0-20260508151727-1282bb917829/go.mod h1:L1MQhA6x4dn9r007T033lsaZMv9EmBAdXyU/+EF40fo=
 github.com/streamer45/silero-vad-go v0.2.1 h1:Li1/tTC4H/3cyw6q4weX+U8GWwEL3lTekK/nYa1Cvuk=
 github.com/streamer45/silero-vad-go v0.2.1/go.mod h1:B+2FXs/5fZ6pzl6unUZYhZqkYdOB+3saBVzjOzdZnUs=
 github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
@@ -1167,9 +1150,8 @@ github.com/valyala/fasttemplate v1.2.2 h1:lxLXG0uE3Qnshl9QyaK6XJxMXlQZELvChBOCmQ
 github.com/valyala/fasttemplate v1.2.2/go.mod h1:KHLXt3tVN2HBp8eijSv/kGJopbvo7S+qRAEEKiv+SiQ=
 github.com/vbatts/tar-split v0.12.2 h1:w/Y6tjxpeiFMR47yzZPlPj/FcPLpXbTUi/9H7d3CPa4=
 github.com/vbatts/tar-split v0.12.2/go.mod h1:eF6B6i6ftWQcDqEn3/iGFRFRo8cBIMSJVOpnNdfTMFA=
-github.com/vishvananda/netlink v1.3.0 h1:X7l42GfcV4S6E4vHTsw48qbrV+9PVojNfIhZcwQdrZk=
-github.com/vishvananda/netlink v1.3.0/go.mod h1:i6NetklAujEcC6fK0JPjT8qSwWyO0HLn4UKG+hGqeJs=
-github.com/vishvananda/netns v0.0.4/go.mod h1:SpkAiCQRtJ6TvvxPnOSyH3BMl6unz3xZlaprSwhNNJM=
+github.com/vishvananda/netlink v1.3.1 h1:3AEMt62VKqz90r0tmNhog0r/PpWKmrEShJU0wJW6bV0=
+github.com/vishvananda/netlink v1.3.1/go.mod h1:ARtKouGSTGchR8aMwmkzC0qiNPrrWO5JS/XMVl45+b4=
 github.com/vishvananda/netns v0.0.5 h1:DfiHV+j8bA32MFM7bfEunvT8IAqQ/NzSJHtcmW5zdEY=
 github.com/vishvananda/netns v0.0.5/go.mod h1:SpkAiCQRtJ6TvvxPnOSyH3BMl6unz3xZlaprSwhNNJM=
 github.com/warpfork/go-wish v0.0.0-20220906213052-39a1cc7a02d0 h1:GDDkbFiaK8jsSDJfjId/PEGEShv6ugrt4kYsC5UIDaQ=
@@ -1220,8 +1202,8 @@ go.opencensus.io v0.24.0 h1:y73uSU6J157QMP2kn2r30vwW1A2W2WFwSCGnAVxeaD0=
 go.opencensus.io v0.24.0/go.mod h1:vNK8G9p7aAivkbmorf4v+7Hgx+Zs0yY+0fOtgBfjQKo=
 go.opentelemetry.io/auto/sdk v1.2.1 h1:jXsnJ4Lmnqd11kwkBV2LgLoFMZKizbCi5fNZ/ipaZ64=
 go.opentelemetry.io/auto/sdk v1.2.1/go.mod h1:KRTj+aOaElaLi+wW1kO/DZRXwkF4C5xPbEe3ZiIhN7Y=
-go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.61.0 h1:F7Jx+6hwnZ41NSFTO5q4LYDtJRXBf2PD0rNBkeB/lus=
-go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.61.0/go.mod h1:UHB22Z8QsdRDrnAtX4PntOl36ajSxcdUMt1sF7Y6E7Q=
+go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.65.0 h1:7iP2uCb7sGddAr30RRS6xjKy7AZ2JtTOPA3oolgVSw8=
+go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.65.0/go.mod h1:c7hN3ddxs/z6q9xwvfLPk+UHlWRQyaeR1LdgfL/66l0=
 go.opentelemetry.io/otel v1.43.0 h1:mYIM03dnh5zfN7HautFE4ieIig9amkNANT+xcVxAj9I=
 go.opentelemetry.io/otel v1.43.0/go.mod h1:JuG+u74mvjvcm8vj8pI5XiHy1zDeoCS2LB1spIq7Ay0=
 go.opentelemetry.io/otel/exporters/prometheus v0.65.0 h1:jOveH/b4lU9HT7y+Gfamf18BqlOuz2PWEvs8yM7Q6XE=
@@ -1253,8 +1235,8 @@ go.uber.org/multierr v1.11.0/go.mod h1:20+QtiLqy0Nd6FdQB9TLXag12DsQkrbs3htMFfDN8
 go.uber.org/tools v0.0.0-20190618225709-2cfd321de3ee/go.mod h1:vJERXedbb3MVM5f9Ejo0C68/HhF8uaILCdgjnY+goOA=
 go.uber.org/zap v1.16.0/go.mod h1:MA8QOfq0BHJwdXa996Y4dYkAqRKB8/1K1QMMZVaNZjQ=
 go.uber.org/zap v1.17.0/go.mod h1:MXVU+bhUf/A7Xi2HNOnopQOrmycQ5Ih87HtOu4q5SSo=
-go.uber.org/zap v1.27.0 h1:aJMhYGrd5QSmlpLMr2MftRKl7t8J8PTZPA732ud/XR8=
-go.uber.org/zap v1.27.0/go.mod h1:GB2qFLM7cTU87MWRP2mPIjqfIDnGu+VIO4V/SdhGo2E=
+go.uber.org/zap v1.27.1 h1:08RqriUEv8+ArZRYSTXy1LeBScaMpVSTBhCeaZYfMYc=
+go.uber.org/zap v1.27.1/go.mod h1:GB2qFLM7cTU87MWRP2mPIjqfIDnGu+VIO4V/SdhGo2E=
 go.yaml.in/yaml/v2 v2.4.4 h1:tuyd0P+2Ont/d6e2rl3be67goVK4R6deVxCUX5vyPaQ=
 go.yaml.in/yaml/v2 v2.4.4/go.mod h1:gMZqIpDtDqOfM0uNfy0SkpRhvUryYH0Z6wdMYcacYXQ=
 go.yaml.in/yaml/v3 v3.0.4 h1:tfq32ie2Jv2UxXFdLJdh3jXuOzWiL1fo0bu/FbuKpbc=
@@ -1289,8 +1271,8 @@ golang.org/x/exp v0.0.0-20191227195350-da58074b4299/go.mod h1:2RIsYlXP63K8oxa1u0
 golang.org/x/exp v0.0.0-20200119233911-0405dc783f0a/go.mod h1:2RIsYlXP63K8oxa1u096TMicItID8zy7Y6sNkU49FU4=
 golang.org/x/exp v0.0.0-20200207192155-f17229e696bd/go.mod h1:J/WKrq2StrnmMY6+EHIKF9dgMWnmCNThgcyBT1FY9mM=
 golang.org/x/exp v0.0.0-20200224162631-6cc2880d07d6/go.mod h1:3jZMyOhIsHpP37uCMkUooju7aAi5cS1Q23tOzKc+0MU=
-golang.org/x/exp v0.0.0-20250606033433-dcc06ee1d476 h1:bsqhLWFR6G6xiQcb+JoGqdKdRU6WzPWmK8E0jxTjzo4=
-golang.org/x/exp v0.0.0-20250606033433-dcc06ee1d476/go.mod h1:3//PLf8L/X+8b4vuAfHzxeRUl04Adcb341+IGKfnqS8=
+golang.org/x/exp v0.0.0-20260410095643-746e56fc9e2f h1:W3F4c+6OLc6H2lb//N1q4WpJkhzJCK5J6kUi1NTVXfM=
+golang.org/x/exp v0.0.0-20260410095643-746e56fc9e2f/go.mod h1:J1xhfL/vlindoeF/aINzNzt2Bket5bjo9sdOYzOsU80=
 golang.org/x/image v0.0.0-20190227222117-0694c2d4d067/go.mod h1:kZ7UVZpmo3dzQBMxlp+ypCbDeSB+sBbTgSJuh5dn5js=
 golang.org/x/image v0.0.0-20190802002840-cff245a6509b/go.mod h1:FeLwcggjj3mMvU+oOTbSwawSJRM1uh48EjtB4UJZlP0=
 golang.org/x/image v0.0.0-20190910094157-69e4b8554b2a/go.mod h1:FeLwcggjj3mMvU+oOTbSwawSJRM1uh48EjtB4UJZlP0=
@@ -1662,8 +1644,8 @@ google.golang.org/genproto v0.0.0-20210310155132-4ce2db91004e/go.mod h1:FWY/as6D
 google.golang.org/genproto v0.0.0-20210319143718-93e7006c17a6/go.mod h1:FWY/as6DDZQgahTzZj3fqbO1CbirC29ZNUFHwi0/+no=
 google.golang.org/genproto v0.0.0-20210402141018-6c239bbf2bb1/go.mod h1:9lPAdzaEmUacj36I+k7YKbEc5CXzPIeORRgDAUOu28A=
 google.golang.org/genproto v0.0.0-20210602131652-f16073e35f0c/go.mod h1:UODoCrxHCcBojKKwX1terBiRUaqAsFqJiF615XL43r0=
-google.golang.org/genproto/googleapis/rpc v0.0.0-20260120221211-b8f7ae30c516 h1:sNrWoksmOyF5bvJUcnmbeAmQi8baNhqg5IWaI3llQqU=
-google.golang.org/genproto/googleapis/rpc v0.0.0-20260120221211-b8f7ae30c516/go.mod h1:j9x/tPzZkyxcgEFkiKEEGxfvyumM01BEtsW8xzOahRQ=
+google.golang.org/genproto/googleapis/rpc v0.0.0-20260128011058-8636f8732409 h1:H86B94AW+VfJWDqFeEbBPhEtHzJwJfTbgE2lZa54ZAQ=
+google.golang.org/genproto/googleapis/rpc v0.0.0-20260128011058-8636f8732409/go.mod h1:j9x/tPzZkyxcgEFkiKEEGxfvyumM01BEtsW8xzOahRQ=
 google.golang.org/grpc v1.19.0/go.mod h1:mqu4LbDTu4XGKhr4mRzUsmM4RtVoemTSY81AxZiDr8c=
 google.golang.org/grpc v1.20.1/go.mod h1:10oTOabMzJvdu6/UiuZezV6QK5dSlG84ov/aaiqXj38=
 google.golang.org/grpc v1.21.1/go.mod h1:oYelfM1adQP15Ek0mdvEgi9Df8B9CZIaU1084ijfRaM=
--- a/pkg/xsysinfo/gpu.go
+++ b/pkg/xsysinfo/gpu.go
@@ -1,8 +1,10 @@
 package xsysinfo

 import (
+	"bufio"
 	"bytes"
 	"encoding/json"
+	"io"
 	"os"
 	"os/exec"
 	"strconv"
@@ -801,14 +803,15 @@ func GetResourceAggregateInfo() AggregateMemoryInfo {
 	return resourceInfo.Aggregate
 }

-// getVulkanGPUMemory queries GPUs using vulkaninfo as a fallback
-// Note: Vulkan provides memory heap info but not real-time usage
+// getVulkanGPUMemory queries GPUs using vulkaninfo as a fallback.
+// Note: vulkaninfo JSON is a Vulkan Profiles export and does not include
+// VkPhysicalDeviceMemoryProperties, so memory heaps are parsed from text output.
 func getVulkanGPUMemory() []GPUMemoryInfo {
 	if _, err := exec.LookPath("vulkaninfo"); err != nil {
 		return nil
 	}

-	cmd := exec.Command("vulkaninfo", "--json")
+	cmd := exec.Command("vulkaninfo", "--text")

 	var stdout, stderr bytes.Buffer
 	cmd.Stdout = &stdout
@@ -819,60 +822,174 @@ func getVulkanGPUMemory() []GPUMemoryInfo {
 		return nil
 	}

-	// Parse Vulkan JSON output
-	var result struct {
-		VkPhysicalDevices []struct {
-			DeviceName                       string `json:"deviceName"`
-			DeviceType                       string `json:"deviceType"`
-			VkPhysicalDeviceMemoryProperties struct {
-				MemoryHeaps []struct {
-					Flags int    `json:"flags"`
-					Size  uint64 `json:"size"`
-				} `json:"memoryHeaps"`
-			} `json:"VkPhysicalDeviceMemoryProperties"`
-		} `json:"VkPhysicalDevices"`
-	}
+	return parseVulkanGPUMemoryText(&stdout)

-	if err := json.Unmarshal(stdout.Bytes(), &result); err != nil {
-		xlog.Debug("failed to parse vulkaninfo output", "error", err)
-		return nil
-	}
+}

+type vulkanGPUTextInfo struct {
+	index      int
+	name       string
+	deviceType string
+	totalVRAM  uint64
+}
+
+func parseVulkanGPUMemoryText(r io.Reader) []GPUMemoryInfo {
 	var gpus []GPUMemoryInfo
+	var current *vulkanGPUTextInfo

-	for i, device := range result.VkPhysicalDevices {
-		// Skip non-discrete/integrated GPUs if possible
-		if device.DeviceType == "VK_PHYSICAL_DEVICE_TYPE_CPU" {
-			continue
+	inMemoryProperties := false
+	inMemoryHeaps := false
+	inHeap := false
+	heapSize := uint64(0)
+	heapDeviceLocal := false
+
+	flushHeap := func() {
+		if current != nil && inHeap && heapDeviceLocal {
+			current.totalVRAM += heapSize
 		}
+		heapSize = 0
+		heapDeviceLocal = false
+		inHeap = false
+	}

-		// Sum up device-local memory heaps
-		var totalVRAM uint64
-		for _, heap := range device.VkPhysicalDeviceMemoryProperties.MemoryHeaps {
-			// Flag 1 = VK_MEMORY_HEAP_DEVICE_LOCAL_BIT
-			if heap.Flags&1 != 0 {
-				totalVRAM += heap.Size
-			}
-		}
-
-		if totalVRAM == 0 {
-			continue
+	flushGPU := func() {
+		if current == nil || current.totalVRAM == 0 || current.deviceType == "PHYSICAL_DEVICE_TYPE_CPU" {
+			return
 		}

 		gpus = append(gpus, GPUMemoryInfo{
-			Index:        i,
-			Name:         device.DeviceName,
+			Index:        current.index,
+			Name:         current.name,
 			Vendor:       VendorVulkan,
-			TotalVRAM:    totalVRAM,
-			UsedVRAM:     0, // Vulkan doesn't provide real-time usage
-			FreeVRAM:     totalVRAM,
+			TotalVRAM:    current.totalVRAM,
+			UsedVRAM:     0, // Vulkan heap size is capacity, not real-time usage.
+			FreeVRAM:     current.totalVRAM,
 			UsagePercent: 0,
 		})
 	}

+	scanner := bufio.NewScanner(r)
+	for scanner.Scan() {
+		line := strings.TrimSpace(scanner.Text())
+		if line == "" {
+			continue
+		}
+
+		if index, ok := parseVulkanGPUHeader(line); ok {
+			flushHeap()
+			flushGPU()
+			current = &vulkanGPUTextInfo{index: index}
+			inMemoryProperties = false
+			inMemoryHeaps = false
+			continue
+		}
+
+		if current == nil {
+			continue
+		}
+
+		if strings.HasPrefix(line, "deviceType") {
+			current.deviceType = parseVulkanValue(line)
+			continue
+		}
+
+		if strings.HasPrefix(line, "deviceName") {
+			current.name = parseVulkanValue(line)
+			continue
+		}
+
+		if line == "VkPhysicalDeviceMemoryProperties:" {
+			inMemoryProperties = true
+			inMemoryHeaps = false
+			flushHeap()
+			continue
+		}
+
+		if !inMemoryProperties {
+			continue
+		}
+
+		if strings.HasPrefix(line, "memoryHeaps:") {
+			inMemoryHeaps = true
+			continue
+		}
+
+		if strings.HasPrefix(line, "memoryTypes:") {
+			flushHeap()
+			inMemoryProperties = false
+			inMemoryHeaps = false
+			continue
+		}
+
+		if !inMemoryHeaps {
+			continue
+		}
+
+		if strings.HasPrefix(line, "memoryHeaps[") {
+			flushHeap()
+			inHeap = true
+			continue
+		}
+
+		if !inHeap {
+			continue
+		}
+
+		if strings.HasPrefix(line, "size") {
+			if size, ok := parseVulkanUintValue(line); ok {
+				heapSize = size
+			}
+			continue
+		}
+
+		if strings.Contains(line, "MEMORY_HEAP_DEVICE_LOCAL_BIT") {
+			heapDeviceLocal = true
+		}
+	}
+
+	flushHeap()
+	flushGPU()
+
 	return gpus
 }

+func parseVulkanGPUHeader(line string) (int, bool) {
+	if !strings.HasPrefix(line, "GPU") || !strings.HasSuffix(line, ":") {
+		return 0, false
+	}
+
+	index, err := strconv.Atoi(strings.TrimSuffix(strings.TrimPrefix(line, "GPU"), ":"))
+	if err != nil {
+		return 0, false
+	}
+
+	return index, true
+}
+
+func parseVulkanValue(line string) string {
+	_, value, ok := strings.Cut(line, "=")
+	if !ok {
+		return ""
+	}
+
+	return strings.TrimSpace(value)
+}
+
+func parseVulkanUintValue(line string) (uint64, bool) {
+	value := parseVulkanValue(line)
+	fields := strings.Fields(value)
+	if len(fields) == 0 {
+		return 0, false
+	}
+
+	parsed, err := strconv.ParseUint(fields[0], 0, 64)
+	if err != nil {
+		return 0, false
+	}
+
+	return parsed, true
+}
+
 // getAppleGPUMemory detects Apple Silicon GPUs using system_profiler (macOS only).
 // Apple Silicon uses unified memory, so GPU memory is reported as system RAM.
 func getAppleGPUMemory() []GPUMemoryInfo {
--- a/scripts/build/ds4-darwin.sh
+++ b/scripts/build/ds4-darwin.sh
@@ -0,0 +1,51 @@
+#!/bin/bash
+# Darwin/Metal build for the ds4 backend. Mirrors llama-cpp-darwin.sh:
+# native make, otool -L for dylib bundling, then assemble an OCI tar that
+# `local-ai backends install` can consume.
+set -ex
+
+IMAGE_NAME="${IMAGE_NAME:-localai/ds4-darwin}"
+
+pushd backend/cpp/ds4
+make NATIVE=false grpc-server package
+popd
+
+mkdir -p build/darwin
+mkdir -p build/darwin/lib
+mkdir -p backend-images
+
+cp -rf backend/cpp/ds4/grpc-server build/darwin/
+cp -rf backend/cpp/ds4/run.sh      build/darwin/
+
+# Apple Silicon: pick up Homebrew-installed protobuf utf8_validity if present.
+if [[ "$(uname -s)" == "Darwin" && "$(uname -m)" == "arm64" ]]; then
+    ADDITIONAL_LIBS=${ADDITIONAL_LIBS:-$(ls /opt/homebrew/Cellar/protobuf/**/lib/libutf8_validity*.dylib 2>/dev/null || true)}
+else
+    ADDITIONAL_LIBS=${ADDITIONAL_LIBS:-""}
+fi
+for file in $ADDITIONAL_LIBS; do
+    cp -rfv "$file" build/darwin/lib
+done
+
+# Walk dylibs via otool -L and bundle anything that isn't a system framework.
+for file in build/darwin/grpc-server; do
+    LIBS="$(otool -L "$file" | awk 'NR > 1 { print $1 }' | xargs echo)"
+    for lib in $LIBS; do
+        if [[ "$lib" == *.dylib ]] && [[ -e "$lib" ]]; then
+            cp -rvf "$lib" build/darwin/lib
+        fi
+    done
+done
+
+echo "Bundled libraries:"
+ls -la build/darwin/lib
+
+# Build an OCI tar that local-ai backends install can consume.
+# scripts/build/oci-pack.sh is the existing helper used by llama-cpp-darwin
+# - if your tree doesn't have it, write one (5 lines: tar + manifest.json).
+if [ -f scripts/build/oci-pack.sh ]; then
+    bash scripts/build/oci-pack.sh build/darwin backend-images/ds4.tar "$IMAGE_NAME"
+else
+    # Fallback: simple tar - local-ai accepts a flat tar in dev environments.
+    tar -C build/darwin -cvf backend-images/ds4.tar .
+fi
--- a/scripts/changed-backends.js
+++ b/scripts/changed-backends.js
@@ -32,6 +32,9 @@ function inferBackendPath(item) {
    // via a thin wrapper Makefile. Changes to either dir should retrigger it.
    return `backend/cpp/turboquant/`;
  }
+  if (item.dockerfile.endsWith("ds4")) {
+    return `backend/cpp/ds4/`;
+  }
  if (item.dockerfile.endsWith("llama-cpp")) {
    return `backend/cpp/llama-cpp/`;
  }
@@ -128,11 +131,15 @@ async function getChangedFilesForPush(event) {
  return res.data.files.map(f => f.filename);
 }

-// Group filtered linux matrix entries by tag-suffix and emit a merge-matrix
-// entry for any tag-suffix that appears 2+ times. That's the trigger for
-// "this backend has multiple per-arch legs and we need a manifest list".
-// Singletons aren't merged — single-arch backends push by digest and don't
-// need a manifest list assembled across legs.
+// Group matrix entries by tag-suffix and emit a merge-matrix entry per group.
+// Both multi-leg groups (per-arch fan-out) and singletons get one entry each:
+// the build job pushes by digest only with no tags applied, so every backend
+// needs a downstream merge step to apply its tags via `imagetools create`,
+// regardless of how many per-arch legs feed it. Callers split entries by
+// arch class first (see splitByArch) and call this once per class so the
+// resulting matrices can be wired to merge jobs that `needs:` only their
+// corresponding build matrix — preventing slow single-arch builds from
+// gating multi-arch merges (the bug fixed in PR #9746).
 function computeMergeMatrix(entries) {
  const groups = new Map();
  for (const item of entries) {
@@ -143,7 +150,6 @@ function computeMergeMatrix(entries) {
  }
  const include = [];
  for (const [tagSuffix, group] of groups) {
-    if (group.length < 2) continue;
    // tag-latest must agree across legs — they're going to publish under
    // the same final tag, so disagreeing on whether it's also the :latest
    // tag is an authoring bug. Warn loudly so a Task 2.5 fan-out typo is
@@ -177,17 +183,21 @@ function splitByArch(entries) {

 function emitFullMatrix() {
  const { multiarch, singlearch } = splitByArch(includes);
-  const mergeMatrix = computeMergeMatrix(includes);
-  const hasMerges = mergeMatrix.include.length > 0 ? 'true' : 'false';
+  const mergeMatrixMultiarch = computeMergeMatrix(multiarch);
+  const mergeMatrixSinglearch = computeMergeMatrix(singlearch);
+  const hasMergesMultiarch = mergeMatrixMultiarch.include.length > 0 ? 'true' : 'false';
+  const hasMergesSinglearch = mergeMatrixSinglearch.include.length > 0 ? 'true' : 'false';
  fs.appendFileSync(process.env.GITHUB_OUTPUT, `run-all=true\n`);
  fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-backends-singlearch=${singlearch.length > 0 ? 'true' : 'false'}\n`);
  fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-backends-multiarch=${multiarch.length > 0 ? 'true' : 'false'}\n`);
  fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-backends-darwin=true\n`);
-  fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-merges=${hasMerges}\n`);
+  fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-merges-multiarch=${hasMergesMultiarch}\n`);
+  fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-merges-singlearch=${hasMergesSinglearch}\n`);
  fs.appendFileSync(process.env.GITHUB_OUTPUT, `matrix-singlearch=${JSON.stringify({ include: singlearch })}\n`);
  fs.appendFileSync(process.env.GITHUB_OUTPUT, `matrix-multiarch=${JSON.stringify({ include: multiarch })}\n`);
  fs.appendFileSync(process.env.GITHUB_OUTPUT, `matrix-darwin=${JSON.stringify({ include: includesDarwin })}\n`);
-  fs.appendFileSync(process.env.GITHUB_OUTPUT, `merge-matrix=${JSON.stringify(mergeMatrix)}\n`);
+  fs.appendFileSync(process.env.GITHUB_OUTPUT, `merge-matrix-multiarch=${JSON.stringify(mergeMatrixMultiarch)}\n`);
+  fs.appendFileSync(process.env.GITHUB_OUTPUT, `merge-matrix-singlearch=${JSON.stringify(mergeMatrixSinglearch)}\n`);
  for (const backend of allBackendPaths.keys()) {
    fs.appendFileSync(process.env.GITHUB_OUTPUT, `${backend}=true\n`);
  }
@@ -218,18 +228,22 @@ function emitFilteredMatrix(changedFiles) {
  console.log("Has multi-arch backends?:", hasBackendsMultiarch);
  console.log("Has Darwin backends?:", hasBackendsDarwin);

-  const mergeMatrix = computeMergeMatrix(filtered);
-  const hasMerges = mergeMatrix.include.length > 0 ? 'true' : 'false';
+  const mergeMatrixMultiarch = computeMergeMatrix(multiarch);
+  const mergeMatrixSinglearch = computeMergeMatrix(singlearch);
+  const hasMergesMultiarch = mergeMatrixMultiarch.include.length > 0 ? 'true' : 'false';
+  const hasMergesSinglearch = mergeMatrixSinglearch.include.length > 0 ? 'true' : 'false';

  fs.appendFileSync(process.env.GITHUB_OUTPUT, `run-all=false\n`);
  fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-backends-singlearch=${hasBackendsSinglearch}\n`);
  fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-backends-multiarch=${hasBackendsMultiarch}\n`);
  fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-backends-darwin=${hasBackendsDarwin}\n`);
-  fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-merges=${hasMerges}\n`);
+  fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-merges-multiarch=${hasMergesMultiarch}\n`);
+  fs.appendFileSync(process.env.GITHUB_OUTPUT, `has-merges-singlearch=${hasMergesSinglearch}\n`);
  fs.appendFileSync(process.env.GITHUB_OUTPUT, `matrix-singlearch=${JSON.stringify({ include: singlearch })}\n`);
  fs.appendFileSync(process.env.GITHUB_OUTPUT, `matrix-multiarch=${JSON.stringify({ include: multiarch })}\n`);
  fs.appendFileSync(process.env.GITHUB_OUTPUT, `matrix-darwin=${JSON.stringify({ include: filteredDarwin })}\n`);
-  fs.appendFileSync(process.env.GITHUB_OUTPUT, `merge-matrix=${JSON.stringify(mergeMatrix)}\n`);
+  fs.appendFileSync(process.env.GITHUB_OUTPUT, `merge-matrix-multiarch=${JSON.stringify(mergeMatrixMultiarch)}\n`);
+  fs.appendFileSync(process.env.GITHUB_OUTPUT, `merge-matrix-singlearch=${JSON.stringify(mergeMatrixSinglearch)}\n`);

  // Per-backend boolean outputs
  for (const [backend, pathPrefix] of allBackendPaths) {
--- a/tests/e2e-backends/backend_test.go
+++ b/tests/e2e-backends/backend_test.go
@@ -194,7 +194,18 @@ var _ = Describe("Backend container", Ordered, func() {

 	BeforeAll(func() {
 		image := os.Getenv("BACKEND_IMAGE")
-		Expect(image).NotTo(BeEmpty(), "BACKEND_IMAGE env var must be set (e.g. local-ai-backend:llama-cpp)")
+		// BACKEND_BINARY is an escape hatch for hardware-gated backends (e.g. ds4)
+		// where building a full Docker image around an 80+ GB model is impractical.
+		// It must point at a `run.sh` produced by `make -C backend/cpp/<name> package`.
+		binary := os.Getenv("BACKEND_BINARY")
+		Expect(image != "" || binary != "").To(BeTrue(),
+			"either BACKEND_IMAGE or BACKEND_BINARY env var must be set")
+		Expect(image != "" && binary != "").To(BeFalse(),
+			"BACKEND_IMAGE and BACKEND_BINARY are mutually exclusive")
+		if binary != "" {
+			Expect(filepath.Base(binary)).To(Equal("run.sh"),
+				"BACKEND_BINARY must point at a run.sh produced by 'make -C backend/cpp/<name> package'")
+		}

 		modelURL := os.Getenv("BACKEND_TEST_MODEL_URL")
 		modelFile = os.Getenv("BACKEND_TEST_MODEL_FILE")
@@ -203,7 +214,11 @@ var _ = Describe("Backend container", Ordered, func() {
 			"one of BACKEND_TEST_MODEL_URL, BACKEND_TEST_MODEL_FILE, or BACKEND_TEST_MODEL_NAME must be set")

 		caps = parseCaps()
-		GinkgoWriter.Printf("Testing image=%q with capabilities=%v\n", image, keys(caps))
+		src := image
+		if src == "" {
+			src = binary
+		}
+		GinkgoWriter.Printf("Testing src=%q with capabilities=%v\n", src, keys(caps))

 		prompt = os.Getenv("BACKEND_TEST_PROMPT")
 		if prompt == "" {
@@ -223,10 +238,13 @@ var _ = Describe("Backend container", Ordered, func() {
 		workDir, err = os.MkdirTemp("", "backend-e2e-*")
 		Expect(err).NotTo(HaveOccurred())

-		// Extract the image filesystem so we can run run.sh directly.
-		binaryDir = filepath.Join(workDir, "rootfs")
-		Expect(os.MkdirAll(binaryDir, 0o755)).To(Succeed())
-		extractImage(image, binaryDir)
+		if image != "" {
+			binaryDir = filepath.Join(workDir, "rootfs")
+			Expect(os.MkdirAll(binaryDir, 0o755)).To(Succeed())
+			extractImage(image, binaryDir)
+		} else {
+			binaryDir = filepath.Dir(binary)
+		}
 		Expect(filepath.Join(binaryDir, "run.sh")).To(BeAnExistingFile())

 		// Download the model once if not provided and no HF name given.