fix(distributed): stop queue loops on agent nodes + dead-letter cap

pending_backend_ops rows targeting agent-type workers looped forever:
the reconciler fan-out hit a NATS subject the worker doesn't subscribe
to, returned ErrNoResponders, we marked the node unhealthy, and the
health monitor flipped it back to healthy on the next heartbeat. Next
tick, same row, same failure.

Three related fixes:

1. enqueueAndDrainBackendOp skips nodes whose NodeType != backend.
   Agent workers handle agent NATS subjects, not backend.install /
   delete / list, so enqueueing for them guarantees an infinite retry
   loop. Silent skip is correct: they aren't consumers of these ops.

2. Reconciler drain mirrors enqueueAndDrainBackendOp's behavior on
   nats.ErrNoResponders: mark the node unhealthy before recording the
   failure, so subsequent ListDuePendingBackendOps (filters by
   status=healthy) stops picking the row until the node actually
   recovers. Matches the synchronous fan-out path.

3. Dead-letter cap at maxPendingBackendOpAttempts (10). After ~1h of
   exponential backoff the row is a poison message; further retries
   just thrash NATS. Row is deleted and logged at ERROR so it stays
   visible without staying infinite.

Plus a one-shot startup cleanup in NewNodeRegistry: drop queue rows that
target agent-type nodes, non-existent nodes, or carry an empty backend
name. Guarded by the same schema-migration advisory lock so only one
instance performs it. The guards above prevent new rows of this shape;
this closes the migration gap for existing ones.

Tests: the prune migration (valid row stays, agent + empty-name rows
drop) on top of existing upsert / backoff coverage.
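
The three guards and the startup prune can be sketched roughly as
follows. This is a minimal illustration, not the code from this commit:
the node and pendingOp types, the outcome values, drainOne, shouldPrune,
and the "backend" / "agent" type strings are all hypothetical, and
errNoResponders is a local stand-in for nats.ErrNoResponders so the
sketch has no external dependency. Only maxPendingBackendOpAttempts (10)
comes from the message above; whether the real cap compares with >= or >
is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoResponders is a local stand-in for nats.ErrNoResponders (assumption:
// the real code compares against the NATS client's sentinel error).
var errNoResponders = errors.New("no responders available for request")

// maxPendingBackendOpAttempts is the dead-letter cap named in the commit message.
const maxPendingBackendOpAttempts = 10

// node and pendingOp are hypothetical, simplified shapes for illustration.
type node struct {
	ID       string
	NodeType string // assumed values: "backend" or "agent"
	Healthy  bool
}

type pendingOp struct {
	NodeID   string
	Backend  string
	Attempts int
}

type outcome string

const (
	outcomeSkipped    outcome = "skipped"
	outcomeDone       outcome = "done"
	outcomeRetry      outcome = "retry"
	outcomeDeadLetter outcome = "dead-letter"
)

// drainOne applies the three guards to a single queue row.
func drainOne(n *node, op *pendingOp, send func() error) outcome {
	// Fix 1: agent workers never consume backend ops, so skip silently.
	if n.NodeType != "backend" {
		return outcomeSkipped
	}
	err := send()
	if err == nil {
		return outcomeDone
	}
	// Fix 2: mark the node unhealthy before recording the failure, so a
	// listing that filters on healthy nodes stops returning this row.
	if errors.Is(err, errNoResponders) {
		n.Healthy = false
	}
	op.Attempts++
	// Fix 3: past the cap the row is treated as a poison message; the
	// caller is expected to delete it and log at ERROR.
	if op.Attempts >= maxPendingBackendOpAttempts {
		return outcomeDeadLetter
	}
	return outcomeRetry
}

// shouldPrune mirrors the one-shot startup cleanup: drop rows that target
// agent-type nodes, non-existent nodes, or carry an empty backend name.
func shouldPrune(op pendingOp, nodes map[string]node) bool {
	n, ok := nodes[op.NodeID]
	return !ok || n.NodeType != "backend" || op.Backend == ""
}

func main() {
	noResponders := func() error { return errNoResponders }

	agent := &node{ID: "a1", NodeType: "agent", Healthy: true}
	fmt.Println(drainOne(agent, &pendingOp{NodeID: "a1", Backend: "x"}, noResponders))

	worker := &node{ID: "b1", NodeType: "backend", Healthy: true}
	op := &pendingOp{NodeID: "b1", Backend: "x", Attempts: 9}
	fmt.Println(drainOne(worker, op, noResponders), worker.Healthy)
}
```

In this sketch the agent row is skipped without touching the node's
health, while the backend row on its tenth failed attempt is dead-lettered
and leaves the node marked unhealthy.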
feat(ui): shared FilterBar across the System page tabs
2026-06-23 08:08:52 -04:00 · 2026-04-19 21:27:05 +00:00 · 2026-04-19 08:46:22 +00:00 · 2026-04-19 08:39:59 +00:00 · 2026-04-19 08:37:45 +00:00 · 2026-04-19 08:34:57 +00:00
1640 changed files with 20797 additions and 210756 deletions
--- a/.agents/adding-backends.md
+++ b/.agents/adding-backends.md
@@ -8,7 +8,6 @@ Create the backend directory under the appropriate location:
 - **Python backends**: `backend/python/<backend-name>/`
 - **Go backends**: `backend/go/<backend-name>/`
 - **C++ backends**: `backend/cpp/<backend-name>/`
-- **Rust backends**: `backend/rust/<backend-name>/`

 For Python backends, you'll typically need:
 - `backend.py` - Main gRPC server implementation
@@ -19,70 +18,9 @@ For Python backends, you'll typically need:
 - `run.sh` - Runtime script
 - `test.py` / `test.sh` - Test files

-For Rust backends, you'll typically need (see `backend/rust/kokoros/` as a reference):
-- `Cargo.toml` - Crate manifest; depend on the upstream project as a submodule under `sources/`
-- `build.rs` - Invokes `tonic_build` to generate gRPC stubs from `backend/backend.proto` (use the `BACKEND_PROTO_PATH` env var so the Makefile can inject the canonical copy)
-- `src/` - The gRPC server implementation (implement `Backend` via `tonic`)
-- `Makefile` - Copies `backend.proto` into the crate, runs `cargo build --release`, then `package.sh`
-- `package.sh` - Uses `ldd` to bundle the binary's dynamic deps and `ld.so` into `package/lib/`
-- `run.sh` - Sets `LD_LIBRARY_PATH`/`SSL_CERT_DIR` and execs the binary via the bundled `lib/ld.so`
-- `sources/<UpstreamProject>/` - Git submodule with the upstream Rust crate
+## 2. Add Build Configurations to `.github/workflows/backend.yml`

-## 2. Add Build Configurations to `.github/backend-matrix.yml`
-
-The build matrix is data-only YAML at `.github/backend-matrix.yml` (not inside `backend.yml` itself). `backend.yml` (master push) and `backend_pr.yml` (PR) load it via `scripts/changed-backends.js`, which also handles per-file path filtering so only touched backends rebuild on PRs and master pushes alike. Add build matrix entries to `.github/backend-matrix.yml` for each platform/GPU type you want to support. Look at similar backends for reference — `chatterbox`/`faster-whisper` for Python, `piper`/`silero-vad` for Go, `kokoros` for Rust.
-
-**Without an entry here no image is ever built or pushed, and the gallery entry in `backend/index.yaml` will point at a tag that does not exist.** The `dockerfile:` field must point at `./backend/Dockerfile.<lang>` matching the language bucket from step 1 (e.g. `Dockerfile.python`, `Dockerfile.golang`, `Dockerfile.rust`). The `tag-suffix` must match the `uri:` in the corresponding `backend/index.yaml` image entry exactly.
-
-**`scripts/changed-backends.js` registration — REQUIRED for any new dockerfile suffix.** This is the single most common omission, because it has no effect on the PR that adds the backend (when no prior path filter could catch it anyway) — it only breaks the *next* PR that touches your backend's directory, which then gets zero CI jobs and looks broken for unrelated reasons. Edit `scripts/changed-backends.js:inferBackendPath` and add a branch BEFORE the more-generic suffixes:
-
-```js
-if (item.dockerfile.endsWith("<your-dockerfile-suffix>")) {
-    return `backend/cpp/<your-backend>/`;   // or backend/python|go|rust/...
-}
-```
-
-The `endsWith()` test is against the matrix entry's `dockerfile:` value (e.g. `./backend/Dockerfile.ds4` → `endsWith("ds4")`). Specificity order matters here just like it does for importers: more-specific suffixes go BEFORE more-generic ones (e.g. `ds4` before `llama-cpp` even though both end with letters, because some upstream might one day call itself `super-ds4-llama-cpp`). Verify locally before pushing:
-
-```bash
-# Confirm your dockerfile suffix is unique enough
-node -e "
-const yaml = require('js-yaml'); const fs = require('fs');
-const m = yaml.load(fs.readFileSync('.github/backend-matrix.yml','utf8'));
-for (const e of m.include.filter(e => e.backend === '<your-backend>')) {
-  console.log(e.dockerfile, '->', e.dockerfile.endsWith('<suffix>'));
-}"
-```
-
-A quick way to find the right insertion point: `grep -n 'item.dockerfile.endsWith' scripts/changed-backends.js`.
-
-**`bump_deps.yaml` registration — REQUIRED for any backend pinning an upstream commit.** If your backend's Makefile has a `*_VERSION?=<sha>` pin to a third-party repo, the daily auto-bump bot at `.github/workflows/bump_deps.yaml` won't notice it unless you register the backend in its matrix. The bot runs `.github/bump_deps.sh` which `grep`s for `^$VAR?=` in the Makefile you list — so the pin MUST live in the Makefile (not in a separate shell script). The bump for ds4 (#9761) had to walk this back because the original landed the pin in `prepare.sh`, which the bot can't see. Pattern (for `antirez/ds4`):
-
-```yaml
-# .github/workflows/bump_deps.yaml
-matrix:
-  include:
-    - repository: "antirez/ds4"
-      variable: "DS4_VERSION"
-      branch: "main"
-      file: "backend/cpp/ds4/Makefile"
-```
-
-And the corresponding Makefile shape (mirror `backend/cpp/llama-cpp/Makefile`):
-
-```makefile
-DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
-DS4_REPO?=https://github.com/antirez/ds4
-...
-ds4:
-	mkdir -p ds4
-	cd ds4 && git init -q && \
-	git remote add origin $(DS4_REPO) && \
-	git fetch --depth 1 origin $(DS4_VERSION) && \
-	git checkout FETCH_HEAD
-```
-
-If you have a `prepare.sh` doing the clone, delete it — the recipe belongs in the Makefile target so `make purge && make` works as a clean-and-rebuild and so the bump bot finds the pin.
+Add build matrix entries for each platform/GPU type you want to support. Look at similar backends (e.g., `chatterbox`, `faster-whisper`) for reference.

 **Placement in file:**
 - CPU builds: Add after other CPU builds (e.g., after `cpu-chatterbox`)
@@ -91,17 +29,9 @@ If you have a `prepare.sh` doing the clone, delete it — the recipe belongs in

 **Additional build types you may need:**
 - ROCm/HIP: Use `build-type: 'hipblas'` with `base-image: "rocm/dev-ubuntu-24.04:7.2.1"`
-- Intel/SYCL: Use `build-type: 'intel'` or `build-type: 'sycl_f16'`/`sycl_f32` with `base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"`
+- Intel/SYCL: Use `build-type: 'intel'` or `build-type: 'sycl_f16'`/`sycl_f32` with `base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"`
 - L4T (ARM): Use `build-type: 'l4t'` with `platforms: 'linux/arm64'` and `runs-on: 'ubuntu-24.04-arm'`

-**Per-arch native builds (`linux/amd64` + `linux/arm64`):**
-
-Multi-arch backends are NOT a single matrix entry with `platforms: 'linux/amd64,linux/arm64'`. Instead, add **two** entries — one with `platforms: 'linux/amd64'` + `platform-tag: 'amd64'` + `runs-on: 'ubuntu-latest'`, one with `platforms: 'linux/arm64'` + `platform-tag: 'arm64'` + `runs-on: 'ubuntu-24.04-arm'` — both sharing the same `tag-suffix`. The script detects the shared `tag-suffix` and emits a `merge-matrix` entry, so `backend-merge-jobs` (in `backend.yml`/`backend_pr.yml`) automatically assembles the manifest list from per-arch digest artifacts. See `-cpu-faster-whisper` in `.github/backend-matrix.yml` for a reference shape.
-
-**llama-cpp / ik-llama-cpp / turboquant variants only — `builder-base-image`:**
-
-Entries whose `dockerfile` is `./backend/Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}` must also set a `builder-base-image` field pointing at a prebuilt base from `quay.io/go-skynet/ci-cache:base-grpc-*` (CI builds these via `.github/workflows/base-images.yml`). The mapping is by `(build-type, platforms)` — see existing entries for the pattern. CI uses these prebuilt bases to skip the gRPC compile (~25–35 min cold). Local `make backends/<name>` ignores `builder-base-image` and uses the from-source path inside the Dockerfile, so you don't need quay access for local builds.
-
 ## 3. Add Backend Metadata to `backend/index.yaml`

 **Step 3a: Add Meta Definition**
@@ -112,8 +42,6 @@ Add a YAML anchor definition in the `## metas` section (around line 2-300). Look

 Add image entries at the end of the file, following the pattern of similar backends such as `diffusers` or `chatterbox`. Include both `latest` (production) and `master` (development) tags.

-**Note on integrity:** OCI backends installed from a gallery whose `verification:` block is set are verified against a keyless-cosign policy before extraction; tarball/HTTP backends use the optional `sha256:` field. New backends do not need any extra YAML — the gallery-level `verification:` block covers every entry. See [.agents/backend-signing.md](backend-signing.md) for the producer-side CI step.
-
 ## 4. Update the Makefile

 The Makefile needs to be updated in several places to support building and testing the new backend:
@@ -128,28 +56,24 @@ Add `backends/<backend-name>` to the `.NOTPARALLEL` line (around line 2) to prev

 **Step 4b: Add to `prepare-test-extra`**

-Add the backend to the `prepare-test-extra` target to prepare it for testing. Use the path matching your language bucket (`backend/python/`, `backend/go/`, `backend/rust/`, …):
+Add the backend to the `prepare-test-extra` target (around line 312) to prepare it for testing:

 ```makefile
 prepare-test-extra: protogen-python
 	...
-	$(MAKE) -C backend/<lang>/<backend-name>
+	$(MAKE) -C backend/python/<backend-name>
 ```

-For Rust backends the target is usually the crate build target itself (e.g. `$(MAKE) -C backend/rust/<backend-name> <backend-name>-grpc`) so the binary is in place before `test` runs.
-
 **Step 4c: Add to `test-extra`**

-Add the backend to the `test-extra` target to run its tests — applies to Go and Rust backends too, not only Python:
+Add the backend to the `test-extra` target (around line 319) to run its tests:

 ```makefile
 test-extra: prepare-test-extra
 	...
-	$(MAKE) -C backend/<lang>/<backend-name> test
+	$(MAKE) -C backend/python/<backend-name> test
 ```

-Each backend's own `Makefile` should define a `test` target so this line works regardless of language. Integration tests that need large model downloads should be gated behind an env var (see `backend/rust/kokoros/`'s `KOKOROS_MODEL_PATH` pattern) so CI only runs unit tests.
-
 **Step 4d: Add Backend Definition**

 Add a backend definition variable in the backend definitions section (around line 428-457). The format depends on the backend type:
@@ -169,13 +93,6 @@ BACKEND_<BACKEND_NAME> = <backend-name>|python|./backend|false|true
 BACKEND_<BACKEND_NAME> = <backend-name>|golang|.|false|true
 ```

-**For Rust backends**:
-```makefile
-BACKEND_<BACKEND_NAME> = <backend-name>|rust|.|false|true
-```
-
-The language field (`python`/`golang`/`rust`/…) must match a `backend/Dockerfile.<lang>` file.
-
 **Step 4e: Generate Docker Build Target**

 Add an eval call to generate the docker-build target (around line 480-501):
@@ -203,7 +120,7 @@ docker-build-backends: ... docker-build-<backend-name>
 After adding a new backend, verify:

 - [ ] Backend directory structure is complete with all necessary files
-- [ ] Build configurations added to `.github/backend-matrix.yml` for all desired platforms (per-arch entries with `platform-tag` for multi-arch; `builder-base-image` for llama-cpp / ik-llama-cpp / turboquant)
+- [ ] Build configurations added to `.github/workflows/backend.yml` for all desired platforms
 - [ ] Meta definition added to `backend/index.yaml` in the `## metas` section
 - [ ] Image entries added to `backend/index.yaml` for all build variants (latest + development)
 - [ ] Tag suffixes match between workflow file and index.yaml
@@ -236,29 +153,6 @@ ls /tmp/check    # expect the bundled .so files + symlinks

 Then boot it inside a fresh `ubuntu:24.04` (which intentionally does *not* have the lib installed) to confirm it actually loads from the backend dir.

-## Importer integration
-
-When you add a new backend, you MUST also make it importable via the model import form (`/import-model`). The import form dropdown is sourced dynamically from `GET /backends/known` — it reads the importer registry at `core/gallery/importers/importers.go`, so the steps below are the ONLY way to make your backend show up.
-
-Required steps:
-
-1. **If your backend has unambiguous detection signals** (unique file extension, HF `pipeline_tag`, unique repo name pattern, unique artefact like `modules.json`):
-   - Create an importer file at `core/gallery/importers/<backend>.go` following the Match/Import pattern in `llama-cpp.go`.
-   - Register it in `importers.go:defaultImporters` in **specificity order** — more specific detectors must appear BEFORE more generic ones (e.g. `sentencetransformers` before `transformers`, `stablediffusion-ggml` before `llama-cpp`, `vllm-omni` before `vllm`). First match wins.
-2. **If your backend is a drop-in replacement** (same artefacts as another backend, e.g. `ik-llama-cpp` and `turboquant` both consume GGUF the same way `llama-cpp` does):
-   - Do NOT create a new importer. Extend the existing importer's `Import()` to swap the emitted `backend:` field when `preferences.backend` matches. See `llama-cpp.go` for the pattern.
-3. **If your backend has no reliable auto-detect signal** (preference-only — e.g. `sglang`, `tinygrad`, `whisperx`):
-   - Do NOT create an importer. Instead add the backend name to the curated pref-only slice in `core/http/endpoints/localai/backend.go` that feeds `/backends/known`. A single line addition.
-4. **Always** add a table-driven test in `core/gallery/importers/importers_test.go` (Ginkgo/Gomega):
-   - Use a real public HuggingFace repo URI as the test fixture (existing tests already hit the live HF API — follow that pattern).
-   - Cover detection (auto-match without preferences), preference-override (explicit `backend:` in preferences wins), and — if the backend's modality has a common `pipeline_tag` but ambiguous artefacts — an ambiguity test asserting `errors.Is(err, importers.ErrAmbiguousImport)`.
-
-Rules of thumb:
-
-- When in doubt, lean pref-only. A wrong auto-detect is worse than a forced preference.
-- Never silently emit a modality mismatch (e.g. emit `llama-cpp` for a TTS repo because `.gguf` is present). Return `ErrAmbiguousImport` instead.
-- Registration order is the single most common source of bugs. Check by running `go test ./core/gallery/importers/...` — the existing suite will fail if you've shadowed a pre-existing detector.
-
 ## 6. Example: Adding a Python Backend

 For reference, when `moonshine` was added:
--- a/.agents/ai-coding-assistants.md
+++ b/.agents/ai-coding-assistants.md
@@ -1,101 +0,0 @@
-# AI Coding Assistants
-
-This document provides guidance for AI tools and developers using AI
-assistance when contributing to LocalAI.
-
-**LocalAI follows the same guidelines as the Linux kernel project for
-AI-assisted contributions.** See the upstream policy here:
-<https://docs.kernel.org/process/coding-assistants.html>
-
-The rules below mirror that policy, adapted to LocalAI's license and
-project layout. If anything is unclear, the kernel document is the
-authoritative reference for intent.
-
-AI tools helping with LocalAI development should follow the standard
-project development process:
-
-- [CONTRIBUTING.md](../CONTRIBUTING.md) — development workflow, commit
-  conventions, and PR guidelines
-- [.agents/coding-style.md](coding-style.md) — code style, editorconfig,
-  logging, and documentation conventions
-- [.agents/building-and-testing.md](building-and-testing.md) — build and
-  test procedures
-
-## Licensing and Legal Requirements
-
-All contributions must comply with LocalAI's licensing requirements:
-
-- LocalAI is licensed under the **MIT License** — see the [LICENSE](../LICENSE)
-  file
-- New source files should use the SPDX license identifier `MIT` where
-  applicable to the file type
-- Contributions must be compatible with the MIT License and must not
-  introduce code under incompatible licenses (e.g., GPL) without an
-  explicit discussion with maintainers
-
-## Signed-off-by and Developer Certificate of Origin
-
-**AI agents MUST NOT add `Signed-off-by` tags.** Only humans can legally
-certify the Developer Certificate of Origin (DCO). The human submitter
-is responsible for:
-
-- Reviewing all AI-generated code
-- Ensuring compliance with licensing requirements
-- Adding their own `Signed-off-by` tag (when the project requires DCO)
-  to certify the contribution
-- Taking full responsibility for the contribution
-
-AI agents MUST NOT add `Co-Authored-By` trailers for themselves either.
-A human reviewer owns the contribution; the AI's involvement is recorded
-via `Assisted-by` (see below).
-
-## Attribution
-
-When AI tools contribute to LocalAI development, proper attribution helps
-track the evolving role of AI in the development process. Contributions
-should include an `Assisted-by` tag in the commit message trailer in the
-following format:
-
-```
-Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]
-```
-
-Where:
-
-- `AGENT_NAME` — name of the AI tool or framework (e.g., `Claude`,
-  `Copilot`, `Cursor`)
-- `MODEL_VERSION` — specific model version used (e.g.,
-  `claude-opus-4-7`, `gpt-5`)
-- `[TOOL1] [TOOL2]` — optional specialized analysis tools invoked by the
-  agent (e.g., `golangci-lint`, `staticcheck`, `go vet`)
-
-Basic development tools (git, go, make, editors) should **not** be listed.
-
-### Example
-
-```
-fix(llama-cpp): handle empty tool call arguments
-
-Previously the parser panicked when the model returned a tool call with
-an empty arguments object. Fall back to an empty JSON object in that
-case so downstream consumers receive a valid payload.
-
-Assisted-by: Claude:claude-opus-4-7 golangci-lint
-Signed-off-by: Jane Developer <jane@example.com>
-```
-
-## Scope and Responsibility
-
-Using an AI assistant does not reduce the contributor's responsibility.
-The human submitter must:
-
-- Understand every line that lands in the PR
-- Verify that generated code compiles, passes tests, and follows the
-  project style
-- Confirm that any referenced APIs, flags, or file paths actually exist
-  in the current tree (AI models may hallucinate identifiers)
-- Not submit AI output verbatim without review
-
-Reviewers may ask for clarification on any change regardless of how it
-was produced. "An AI wrote it" is not an acceptable answer to a design
-question.
--- a/.agents/api-endpoints-and-auth.md
+++ b/.agents/api-endpoints-and-auth.md
@@ -2,8 +2,6 @@

 This guide covers how to add new API endpoints and properly integrate them with the auth/permissions system.

-> **Before you ship a new endpoint or capability surface**, re-read the [checklist at the bottom of this file](#checklist). LocalAI advertises its feature surface in several independent places — miss any one of them and clients/admins/UI won't know the endpoint exists.
-
 ## Architecture overview

 Authentication and authorization flow through three layers:
@@ -236,76 +234,6 @@ Use these HTTP status codes:

 If your endpoint should be tracked for usage (token counts, request counts), add the `usageMiddleware` to its middleware chain. See `core/http/middleware/usage.go` and how it's applied in `routes/openai.go`.

-## Advertising surfaces — where to register a new capability
-
-Beyond routing and auth, LocalAI publishes its capability surface in **four independent places**. When you add an endpoint — especially one introducing a net-new capability like a new media type or a new auth-gated feature — you must update every relevant surface. These aren't optional: missing them means the endpoint works but is invisible to clients, admins, and the UI.
-
-### 1. Swagger `@Tags` annotation (mandatory)
-
-Every handler needs a swagger block so the endpoint appears in `/swagger/index.html` and in the `/api/instructions` output. The `@Tags` value is what groups the endpoint into a capability area:
-
-```go
-// MyEndpoint does X.
-// @Summary Do X.
-// @Tags my-capability
-// @Param request body schema.MyRequest true "payload"
-// @Success 200 {object} schema.MyResponse "Response"
-// @Router /v1/my-endpoint [post]
-func MyEndpoint(...) echo.HandlerFunc { ... }
-```
-
-Use an existing tag when the endpoint extends an existing area (e.g. `audio`, `images`, `face-recognition`). Create a new tag only when the endpoint introduces a genuinely new capability surface — and in that case, also register it in step 2.
-
-After adding endpoints, regenerate the embedded spec so the runtime serves it:
-
-```bash
-make protogen-go         # ensures gRPC codegen is fresh first
-make swagger             # regenerates swagger/swagger.json
-```
-
-### 2. `/api/instructions` registry (for new capability areas)
-
-`core/http/endpoints/localai/api_instructions.go` defines `instructionDefs` — a lightweight, machine-readable index of capability areas that groups swagger endpoints by tag. It's the primary discovery surface for agents and SDKs ("what can this server do?").
-
-**When to update:** only when adding a new capability area (a new swagger tag). Existing-tag additions automatically surface without any change here.
-
-Add an entry to `instructionDefs`:
-
-```go
-{
-    Name:        "my-capability",             // URL segment at /api/instructions/my-capability
-    Description: "Short sentence describing the capability",
-    Tags:        []string{"my-capability"},   // must match swagger @Tags
-    Intro:       "Optional gotcha/context that isn't in the swagger descriptions (caveats, defaults, cross-references to other endpoints).",
-},
-```
-
-Also bump the expected-length count in `api_instructions_test.go` and add the name to the `ContainElements` assertion.
-
-### 3. `capabilities.js` symbol (for new model-config FLAG_* flags)
-
-If your feature needs a new `FLAG_*` usecase flag in `core/config/model_config.go` (so users can filter gallery models by it, and so `/v1/models` surfaces it), you need to update **all** of:
-
-- `Usecase<Name>` string constant in `core/config/backend_capabilities.go`
-- `UsecaseInfoMap` entry mapping the string to its flag + gRPC method
-- `FLAG_<NAME>` bitmask in `core/config/model_config.go`
-- `GetAllModelConfigUsecases()` map entry (otherwise the YAML loader silently ignores the string)
-- `ModalityGroups` membership if the flag should affect `IsMultimodal()` (e.g. realtime_audio is in both speech-input and audio-output groups so a lone flag still reads as multimodal)
-- `GuessUsecases()` branch listing the backends that own this capability
-- `usecaseFilters` in `core/http/routes/ui_api.go` (drives the gallery filter dropdown)
-- `Models.jsx` `FILTERS` array + matching `filters.<camelCase>` i18n key in `core/http/react-ui/public/locales/en/models.json`
-- `core/http/react-ui/src/utils/capabilities.js`:
-
-```js
-export const CAP_MY_CAPABILITY = 'FLAG_MY_CAPABILITY'
-```
-
-React pages that want to filter the ModelSelector by capability import this symbol. Declare it even if you're not building the UI page yet — the declaration keeps the Go/JS vocabularies in sync.
-
-### 4. `docs/content/` (user-facing documentation)
-
-A new capability deserves its own page under `docs/content/features/`, plus cross-links from related features and an entry in `docs/content/whats-new.md`. See the pattern used by `face-recognition.md` / `object-detection.md`.
-
 ## Path protection rules

 The global auth middleware classifies paths as API paths or non-API paths:
@@ -320,36 +248,12 @@ If you add endpoints under a new top-level path prefix, add it to `isAPIPath()`

 When adding a new endpoint:

-**Routing & auth**
 - [ ] Handler in `core/http/endpoints/`
 - [ ] Route registered in appropriate `core/http/routes/` file
 - [ ] Auth level chosen: public / standard / admin / feature-gated
-- [ ] Entry added to `RouteFeatureRegistry` in `core/http/auth/features.go` (one row per route/method — all /v1/* routes gate through this, not per-route middleware)
-- [ ] If new feature: constant in `permissions.go`, added to the right slice (`APIFeatures` default-ON / `AgentFeatures` default-OFF), metadata in `features.go` `*FeatureMetas()`
-- [ ] If feature uses group middleware: wired in `core/http/app.go` and passed to the route registration function
+- [ ] If feature-gated: constant in `permissions.go`, metadata in `features.go`, middleware in `app.go`
 - [ ] If new path prefix: added to `isAPIPath()` in `middleware.go`
+- [ ] If OpenAI-compatible: entry in `RouteFeatureRegistry`
 - [ ] If token-counting: `usageMiddleware` added to middleware chain
-
-**Advertising surfaces (easy to miss — see the [Advertising surfaces](#advertising-surfaces--where-to-register-a-new-capability) section)**
-- [ ] Swagger block on the handler: `@Summary`, `@Tags`, `@Param`, `@Success`, `@Router`
-- [ ] If new capability area (new swagger tag): entry in `instructionDefs` in `core/http/endpoints/localai/api_instructions.go` + test count bumped in `api_instructions_test.go`
-- [ ] If new `FLAG_*` usecase flag: matching `CAP_*` symbol exported from `core/http/react-ui/src/utils/capabilities.js`
-- [ ] `docs/content/features/<feature>.md` created; cross-links from related feature pages; entry in `docs/content/whats-new.md`
-
-**Quality**
-- [ ] Error responses use `schema.ErrorResponse` format (or `echo.NewHTTPError` with a mapped gRPC status — see the `mapBackendError` helper in `core/http/endpoints/localai/images.go`)
+- [ ] Error responses use `schema.ErrorResponse` format
 - [ ] Tests cover both authenticated and unauthenticated access
-- [ ] Swagger regenerated (`make swagger`) if you changed any `@Router`/`@Tags`/`@Param` annotation
-
-## Companion: MCP admin tool surface
-
-**Required for admin endpoints.** Every new admin endpoint MUST be considered for the MCP admin tool surface — the REST API and the MCP tool catalog can drift silently otherwise, and both the LocalAI Assistant chat modality and the standalone `local-ai mcp-server` rely on `pkg/mcp/localaitools/` to mirror REST.
-
-Two outcomes are acceptable; one is not:
-
-- **Tool added.** The new endpoint is something an admin would manage conversationally (install, list, edit, toggle, upgrade). Follow the full checklist in [.agents/localai-assistant-mcp.md](localai-assistant-mcp.md): add a `LocalAIClient` interface method, implement it in both `inproc` and `httpapi`, register the tool with a `Tool*` constant, update the skill prompts, **and add the route to `toolToHTTPRoute` in `pkg/mcp/localaitools/coverage_test.go`**.
-- **Tool deliberately skipped.** The endpoint is internal/diagnostic and adding a chat path would be misleading. Document the decision in the PR description; no code action.
-- **Forgot.** This breaks the contract. The `TestToolHTTPRouteMappingComplete` test in `pkg/mcp/localaitools` is a partial guard (it checks every `Tool*` has a route mapping), but it does NOT detect new REST endpoints without a tool — that's still a process check on the PR author.
-
-**Add to the bottom of the checklist below**:
-- [ ] If admin: decided whether MCP coverage is needed; if yes, tool registered + map updated; if no, skip-reason in PR description.
--- a/.agents/backend-signing.md
+++ b/.agents/backend-signing.md
@@ -1,126 +0,0 @@
-# Backend image signing & verification
-
-LocalAI verifies backend OCI images against a per-gallery keyless-cosign
-policy. This page documents the trust model, the producer side
-(`.github/workflows/backend_merge.yml` in this repo), and the consumer
-side (`pkg/oci/cosignverify` plus the gallery YAML).
-
-## Trust model
-
-- **Producer:** `.github/workflows/backend_merge.yml` signs each pushed
-  manifest list with `cosign sign --recursive` in keyless mode after
-  `docker buildx imagetools create`. The signing cert is issued by
-  Fulcio bound to the workflow's OIDC identity. There is no long-lived
-  signing key. `--recursive` signs both the manifest list and every
-  per-arch entry — needed because our consumer resolves a tag to a
-  per-arch manifest before checking signatures.
-- **Storage:** Signatures are written as OCI 1.1 referrers
-  (`--registry-referrers-mode=oci-1-1`) in the new Sigstore bundle format
-  (current cosign releases do this by default; no `--new-bundle-format`
-  flag). No `:sha256-<hex>.sig` tag clutter.
-- **Consumer:** `pkg/oci/cosignverify` discovers the bundle via the
-  referrers API, hands it to `sigstore-go`, and verifies it against the
-  policy declared in the gallery YAML (`Gallery.Verification`).
-- **Revocation:** Keyless cosign certs are ephemeral (10-minute Fulcio
-  validity), so revocation is policy-side, not CA-side. The gallery's
-  `verification.not_before` (RFC3339) is the kill-switch — advance it to
-  invalidate every signature produced before a known compromise window.
-
-## Producer setup
-
-`backend_merge.yml` is the workflow that joins per-arch digests into the
-multi-arch manifest list users actually pull, so it's also the right place
-to sign. The job needs:
-
-- `permissions: { id-token: write, contents: read }` at the job level so
-  the runner can exchange its GitHub OIDC token for a Fulcio cert.
-- `sigstore/cosign-installer@v3` step (current cosign releases already
-  default to the new bundle format).
-- After each `docker buildx imagetools create`, resolve the resulting
-  list digest with `docker buildx imagetools inspect <tag> --format
-  '{{.Manifest.Digest}}'` and sign:
-
-```sh
-cosign sign --yes --recursive \
-  --registry-referrers-mode=oci-1-1 \
-  "${REGISTRY_REPO}@${DIGEST}"
-```
-
-Sign by digest, never by tag — signing by tag binds the signature to
-whatever the tag points at *now*, and a subsequent tag push orphans it.
-
-`--registry-referrers-mode=oci-1-1` is still gated behind
-`COSIGN_EXPERIMENTAL=1` in cosign v2.4.x (set at the job env level in
-`backend_merge.yml`). Re-evaluate when bumping the pinned cosign release
-— newer versions are expected to graduate this flag and the env var can
-then be dropped.
-
-`backend_build_darwin.yml` builds and pushes single-arch darwin images
-that bypass the manifest-list merge. If/when those entries get a gallery
-`verification:` policy, the equivalent cosign step has to land there
-too.
-
-## Consumer setup (in `mudler/LocalAI` gallery YAML)
-
-Once CI is signing, add a `verification:` block to the backend gallery
-entry (`backend/index.yaml`):
-
-```yaml
-- name: localai
-  url: github:mudler/LocalAI/backend/index.yaml@master
-  verification:
-    issuer: "https://token.actions.githubusercontent.com"
-    identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@refs/heads/master$"
-    # Optional revocation cutoff; advance during incident response.
-    # not_before: "2026-06-01T00:00:00Z"
-```
-
-Identity matching pins the OIDC subject Fulcio issued the signing cert
-to. Without this, any image signed by *anyone* with a Fulcio cert would
-pass — the regex is what makes a signature mean "produced by our CI".
-
-## Strict mode
-
-Default behaviour: OCI backends without a `verification:` block install
-with a warning (logs include `installing OCI backend without signature
-verification`). Tarball/HTTP backends without a `sha256` field log a
-similar warning.
-
-For production, set `LOCALAI_REQUIRE_BACKEND_INTEGRITY=1` (or pass
-`--require-backend-integrity` to `local-ai run` / `local-ai backends
-install` / `local-ai models install`). The warning becomes a hard error
-and unverifiable backends refuse to install.
-
-## Revocation playbook
-
-If `backend_merge.yml` (or any workflow with `id-token: write`) is
-compromised and we've shipped malicious signed images:
-
-1. **Identify the compromise window.** Find the earliest IntegratedTime
-   from the bad signatures (Rekor search by `subject` filter).
-2. **Set `verification.not_before`** in `backend/index.yaml` to a
-   timestamp just *after* that window's start.
-3. **Push the YAML.** Deployed LocalAI instances pick it up on next
-   gallery refresh (1-hour cache in `core/gallery/gallery.go`).
-4. **Fix the underlying compromise** in the workflow and re-sign images
-   with the new build, which will have IntegratedTime > `not_before`.
-5. **Optional:** for absolute decisiveness, also rotate to a new
-   workflow path (`backend_merge_v2.yml`) and update `identity_regex`.
-
-## Where the code lives
-
- `pkg/oci/cosignverify/` — verifier, policy, OCI referrer fetch, NotBefore enforcement.
- `pkg/downloader/uri.go` — `WithImageVerifier` option threaded through `DownloadFileWithContext`.
- `core/gallery/backends.go` — `backendDownloadOptions` builds the verifier from the gallery's policy.
- `core/config/gallery.go` — `Gallery.Verification` YAML schema.
- `core/cli/run.go`, `core/cli/backends.go`, `core/cli/models.go` — `--require-backend-integrity` flag propagation.
- `.github/workflows/backend_merge.yml` — producer-side `cosign sign --recursive` after each multi-arch manifest list push.
-
-## Out of scope (follow-ups)
-
- **Signing the gallery YAML itself.** The index is fetched over HTTPS
-  from GitHub; we trust the host. A cosign blob signature on the YAML
-  would close that gap but adds key-management overhead. Revisit this
-  page if/when added.
- **Tarball/HTTP backend signing.** Cosign can sign arbitrary blobs, but
-  for now non-OCI backends keep using the `sha256:` field in YAML.
--- a/.agents/building-and-testing.md
+++ b/.agents/building-and-testing.md
@@ -8,42 +8,9 @@ Let's say the user wants to build a particular backend for a given platform. For

 - The Makefile has targets like `docker-build-coqui` created with `generate-docker-build-target` at the time of writing. Recently added backends may require a new target.
 - At a minimum we need to set the BUILD_TYPE, BASE_IMAGE build-args
-  - Use `.github/backend-matrix.yml` as a reference — it's the data-only YAML that lists every backend variant's `build-type`, `base-image`, `platforms`, etc. (`backend.yml` and `backend_pr.yml` consume it via `scripts/changed-backends.js`).
-  - l4t and cublas also require the CUDA major and minor version.
-  - For llama-cpp / ik-llama-cpp / turboquant the matrix also sets `builder-base-image` pointing at a prebuilt `quay.io/go-skynet/ci-cache:base-grpc-*` tag. Local `make backends/<name>` defaults to `BUILDER_TARGET=builder-fromsource` and doesn't need it — the Dockerfile's from-source stage installs everything itself.
+  - Use `.github/workflows/backend.yml` as a reference; it lists the needed args in the `include` job strategy matrix
+  - l4t and cublas also require the CUDA major and minor version
 - You can pretty print a command like `DOCKER_MAKEFLAGS=-j$(nproc --ignore=1) BUILD_TYPE=hipblas BASE_IMAGE=rocm/dev-ubuntu-24.04:7.2.1 make docker-build-coqui`
 - Unless the user specifies that they want you to run the command, just print it: not all agent frontends handle long-running jobs well, and the output may overflow your context
 - The user may say they want to build AMD or ROCM instead of hipblas, or Intel instead of SYCL, or NVIDIA instead of l4t or cublas. Ask for confirmation if there is ambiguity.
 - Sometimes the user may need extra parameters to be added to `docker build` (e.g. `--platform` for cross-platform builds or `--progress` to view the full logs), in which case you can generate the `docker build` command directly.
-
-## Test coverage gate
-
-The core Go suites (`./pkg`, `./core`, plus the in-process integration suite `./tests/e2e`) are covered by a **strict, monotonic coverage ratchet**:
-
- `make test-coverage` — runs the suites with `covermode=atomic` instrumentation and writes a merged profile to `coverage/coverage.out`. Uses the same prerequisites as `make test`.
-  - **`--coverpkg` (`COVERAGE_COVERPKG = core/...,pkg/...`):** coverage is attributed to the core+pkg packages, not just the package under test. This is what lets the in-process `tests/e2e` suite (which drives the real HTTP server over loopback via `application.New`) credit the `core/http/endpoints/...` handlers it exercises — folding it in roughly doubled endpoint coverage (e.g. `endpoints/openai` 13.6% → 52%). The denominator is therefore *all* of `core`+`pkg` (minus generated proto, dropped via `COVERAGE_EXCLUDE_RE`), so the number isn't comparable to a plain per-package figure.
-  - **Integration suites (`COVERAGE_E2E_ROOTS = ./tests/e2e`)** run non-recursively (excludes `tests/e2e/distributed`, which needs containers) with `--label-filter=!real-models` (those need a downloaded model) against the mock backend built by `prepare-test`. `tests/integration` is deliberately excluded — it needs `make backends/local-store`, which the coverage CI job doesn't build.
-  - **Flake note:** folding integration tests into a *strict* gate means a hard e2e failure (or a spec that silently stops running) can fail the coverage gate, not just the test. `--flake-attempts` absorbs transient retryable failures; covermode=atomic keeps line coverage deterministic otherwise.
-  - **Why one ginkgo run per root (`scripts/run-coverage.sh`):** passing several recursive roots to a *single* ginkgo invocation (e.g. `ginkgo -r ./pkg ./core`) only merges **one** root's coverprofile into `--output-dir`/`--coverprofile` — the others are silently dropped. Verified with ginkgo 2.29.0: `-r ./pkg ./core` yields only `./pkg` coverage, while `-r ./core` alone yields all 34 core packages. So the script runs each root separately and concatenates the (disjoint) profiles. Don't "simplify" it back to a single multi-root invocation — that's how `core/` (including all of `core/http`, ~7.4k statements) silently vanished from the number before.
-  - **Build tags (`COVERAGE_TAGS`, passed via `GINKGO_TAGS`):** defaults to `debug auth`. The `auth` tag is required to compile the real (sqlite-backed) auth implementation and its ~150 `//go:build auth` tests — without it those files aren't built, the tests don't run, and the gate scores auth against a stub (~3.7% instead of ~38%). If you add new tag-gated tests, extend `COVERAGE_TAGS` or they won't count (and likely won't run in CI at all).
- `make test-coverage-check` — runs `test-coverage`, then `scripts/coverage-check.sh` fails the build if total coverage is **below** the committed baseline in `coverage-baseline.txt`. The Linux job in `.github/workflows/test.yml` runs this instead of `make test`.
- `make test-coverage-baseline` — regenerates and overwrites `coverage-baseline.txt` from the current run.
- `make install-hooks` — sets `core.hooksPath` to the versioned `.githooks/`, whose `pre-commit` runs checks scoped to what's staged: Go changes → `make lint` + `make test-coverage-check`; `core/http/react-ui/` changes → `make test-ui-coverage-check` (Playwright e2e + UI coverage gate). A commit touching neither is skipped; bypass with `git commit --no-verify`. The hook resolves golangci-lint's new-from base to `upstream/master` → `origin/master` → `master`, so it works from a fork clone where `origin/master` is stale (passed to `make lint` via `LINT_NEW_FROM`).
-
-### React UI coverage
-
-The React UI (`core/http/react-ui/`) has **no component/unit tests** — its only tests are the Playwright e2e specs in `e2e/`, which run against the real app served by `tests/e2e-ui/ui-test-server` (the dist is `//go:embed`ed, so the server is rebuilt per coverage run). Those specs do genuinely exercise the UI (clicks, `fill`, `setInputFiles`, `getByRole`/`getByText`, visibility/value assertions).
-
- `make test-ui-coverage` — builds an istanbul-instrumented bundle (`COVERAGE=true`, via `vite-plugin-istanbul` with `forceBuildInstrument: true` — the plugin skips production builds otherwise), re-embeds it into `ui-test-server` (the dist is `//go:embed`ed), runs the Playwright specs, and writes an `nyc` report to `core/http/react-ui/coverage/`. The specs import `{ test, expect }` from `e2e/coverage-fixtures.js` (re-exports Playwright's, plus harvests `window.__coverage__` into `.nyc_output/` after each test). Instrumentation is off unless `COVERAGE=true`, so dev/prod builds and plain `make test-ui-e2e` are unaffected (the fixture no-ops when `window.__coverage__` is absent).
- **Browser:** the flake dev shell ships `chromium` and exports `PLAYWRIGHT_CHROMIUM_PATH`; `playwright.config.js` uses it via `launchOptions.executablePath`, and the Makefile skips `playwright install` when it's set. This avoids Playwright's downloaded browser, which can't resolve system libs (`libglib-2.0`, …) on NixOS. In CI (no `PLAYWRIGHT_CHROMIUM_PATH`) the Makefile falls back to `playwright install --with-deps chromium`.
- The app is a React SPA, so coverage accumulates across in-app navigation within a test; a full `page.goto`/reload resets it.
- `.nycrc.json` uses `all: true`, so **every `src/**` file is in the report**, including 0%-coverage ones — that's how you spot features with no test at all (sort the HTML report or `coverage-summary.json` by line% ascending). 
- **UI coverage gate:** `make test-ui-coverage-check` runs the suite then `scripts/ui-coverage-check.sh`, failing if total line coverage drops more than `UI_COVERAGE_TOLERANCE` below `core/http/react-ui/coverage-baseline.txt`. `make test-ui-coverage-baseline` regenerates the baseline. Runs in CI (`tests-ui-e2e.yml`) and pre-commit on `core/http/react-ui/` changes.
- **Why it has a tolerance (unlike the strict Go gate):** UI e2e coverage is *non-deterministic*. Specs that assert on state and end while async/lazy render work is still in flight collect those lines only when the render beats the coverage teardown — so the total drifts with machine speed/load (a fast local box reads higher than a slow CI runner), diffusely across many specs. The tolerance absorbs that drift, so set the baseline *below* the slow-CI floor, never to a fast-local `make test-ui-coverage-baseline` number, or CI flaps.
- **Raising coverage is cheap:** a *render-smoke* spec (navigate to a route, assert its header renders) mounts a lazy page and runs its full render + initial effects, capturing most of its lines in a few lines of test — see `e2e/page-render-smoke.spec.js`. Auth is disabled in the test server (`isAdmin=true`), so `RequireAdmin`/`RequireFeature` routes render without a mock. The most *deterministic* win is removing a race: make a spec `await` a rendered element before ending (see `e2e/agents.spec.js` → AgentCreate) so its lines count every run.
-
-Rules (both gates):
- **Install the hooks:** `make install-hooks` once per clone so lint + coverage run pre-commit. Don't lean on CI for what the hook catches.
- **Don't work around the gate:** never `git commit --no-verify`, and never hand-lower a baseline or widen a tolerance to turn a red gate green. The ratchet only moves up.
- If a change drops coverage, **add tests** (sort `coverage-summary.json` by line% ascending to find untested code) rather than editing the baseline. When coverage legitimately rises, commit the regenerated baseline (`make test-coverage-baseline` / `test-ui-coverage-baseline`).
- The Go gate is **strict — no tolerance**; `covermode=atomic` keeps it deterministic. The UI gate keeps a small tolerance only because its e2e coverage isn't.
--- a/.agents/ci-caching.md
+++ b/.agents/ci-caching.md
@@ -1,250 +0,0 @@
-# CI Build Caching
-
-Container builds — both the root LocalAI image (`Dockerfile`) and the per-backend images (`backend/Dockerfile.*`) — share a registry-backed BuildKit cache plus a layered set of prebuilt base images. This file explains how the cache is laid out, what invalidates it, and how to bypass it.
-
-## Workflow surfaces
-
-| Workflow | Purpose | Triggers |
-|---|---|---|
-| `.github/workflows/backend.yml` | Backend container images on master | `push` to master + tags, weekly Sunday cron, `workflow_dispatch` |
-| `.github/workflows/backend_pr.yml` | Backend container images on PRs | `pull_request` |
-| `.github/workflows/backend_build.yml` | Reusable: builds one backend (one arch) by digest | `workflow_call` from above |
-| `.github/workflows/backend_merge.yml` | Reusable: assembles per-arch digests into a multi-arch manifest list | `workflow_call` |
-| `.github/workflows/backend_build_darwin.yml` | Reusable: macOS-native backend builds | `workflow_call` |
-| `.github/workflows/image.yml` / `image-pr.yml` | Root LocalAI image (push / PR) | push / PR |
-| `.github/workflows/image_build.yml` / `image_merge.yml` | Reusable: per-arch root-image build + merge | `workflow_call` |
-| `.github/workflows/base-images.yml` | Builds the prebuilt `base-grpc-*` builder bases | Saturdays 05:00 UTC cron, `workflow_dispatch`, master push touching `Dockerfile.base-grpc-builder`, `.docker/install-base-deps.sh`, `.docker/apt-mirror.sh`, or this workflow |
-
-The matrix that drives `backend.yml` / `backend_pr.yml` lives in **`.github/backend-matrix.yml`** (data-only YAML, not embedded in the workflow). `scripts/changed-backends.js` parses it, applies path-filter logic against the PR diff (PR events) or the GitHub Compare API (push events), and emits the filtered matrix plus a `merge-matrix` for backends with multiple per-arch entries.
-
-## Cache layout
-
- **Cache registry**: `quay.io/go-skynet/ci-cache`
- **One tag per matrix entry per arch**, derived from `tag-suffix` and `platform-tag`:
-  - Backend builds (`backend_build.yml`): `cache<tag-suffix>-<platform-tag>`
-    - e.g. `cache-cpu-faster-whisper-amd64`, `cache-cpu-faster-whisper-arm64`, `cache-gpu-nvidia-cuda-13-llama-cpp-amd64`
-  - Root image builds (`image_build.yml`): `cache-localai<tag-suffix>-<platform-tag>` (with a `-core` placeholder when `tag-suffix` is empty, so `cache-localai-core-amd64` for the core image)
-  - Pre-built base images (`base-images.yml`): `cache-base-grpc-<variant>` (one per `(BUILD_TYPE, arch)` permutation)
- Each tag stores a multi-arch BuildKit cache manifest (`mode=max`), so every intermediate stage is re-usable, not just the final image.
-
-The per-arch suffix exists because amd64 and arm64 builds produce different intermediate content; sharing one cache key would thrash on every cross-arch rebuild.
-
-## Read/write semantics
-
-| Trigger | `cache-from` | `cache-to` |
-|---|---|---|
-| `push` to `master` / tag / cron / dispatch | yes | yes (`mode=max,ignore-error=true`) |
-| `pull_request` | yes | **no** |
-
-PR builds read master's warm cache but never write — this prevents PRs from polluting the shared cache with their experimental state. After merge, the master build for that matrix entry refreshes the cache.
-
-`ignore-error=true` on the write side means a transient quay push failure does not fail the build; the next master push retries.
-
-## Pre-built base images (`base-grpc-*`)
-
-The C++ backend Dockerfiles (`Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}`) compile gRPC from source. On a cold build that's ~25–35 min before any LocalAI source compiles. To skip that on CI, `.github/workflows/base-images.yml` builds and pushes a set of pre-prepped builder bases:
-
-| Tag | Contents |
-|---|---|
-| `base-grpc-amd64` / `base-grpc-arm64` | Ubuntu 24.04 + apt build deps + protoc + cmake + gRPC at `/opt/grpc` |
-| `base-grpc-cuda-12-amd64` | the above + CUDA 12.8 toolkit |
-| `base-grpc-cuda-13-amd64` | the above + CUDA 13.0 toolkit (Ubuntu 22.04 base) |
-| `base-grpc-cuda-13-arm64` | the above + CUDA 13.0 sbsa toolkit (Ubuntu 24.04 base) |
-| `base-grpc-l4t-cuda-12-arm64` | JetPack r36.4.0 base (CUDA preinstalled, `SKIP_DRIVERS=true`) + gRPC |
-| `base-grpc-rocm-amd64` | rocm/dev-ubuntu-24.04:7.2.1 base + hipblas/hipblaslt/rocblas + gRPC |
-| `base-grpc-vulkan-amd64` / `base-grpc-vulkan-arm64` | Ubuntu 24.04 + Vulkan SDK 1.4.335 + gRPC |
-| `base-grpc-intel-amd64` | intel/oneapi-basekit:2025.3.2 base + gRPC |
-
-**Single source of truth**: the install logic for all 10 variants lives in `.docker/install-base-deps.sh`. Both `Dockerfile.base-grpc-builder` AND each variant Dockerfile's `builder-fromsource` stage bind-mount and execute the same script — so the prebuilt CI base and the local from-source path are bit-equivalent by construction.
-
-### How variant Dockerfiles consume the base
-
-`Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}` are multi-target. Three stages plus a final aliasing stage:
-
- `builder-fromsource` — `FROM ${BASE_IMAGE}` then runs `install-base-deps.sh` and the per-backend compile script. Used when `BUILDER_TARGET=builder-fromsource` (the default; local `make backends/<name>`).
- `builder-prebuilt` — `FROM ${BUILDER_BASE_IMAGE}` (one of the prebuilt `base-grpc-*` tags) and runs only the per-backend compile script. Used when `BUILDER_TARGET=builder-prebuilt` (CI when the matrix entry sets `builder-base-image`).
- `FROM ${BUILDER_TARGET} AS builder` — alias resolves the ARG-selected stage to a fixed name (BuildKit doesn't allow ARG expansion in `COPY --from=`).
- `FROM scratch` + `COPY --from=builder ...package/. ./` — emits the final scratch image with just the package contents.
-
-BuildKit prunes the unreferenced builder stage, so each build only runs the path it needs. `backend_build.yml` derives `BUILDER_TARGET=builder-prebuilt` automatically when the matrix entry has a non-empty `builder-base-image`; otherwise it defaults to `builder-fromsource`.
-
-The matrix `(build-type, platforms)` → `builder-base-image` mapping for llama-cpp / ik-llama-cpp / turboquant entries:
-
-| `build-type` | `platforms` | tag |
-|---|---|---|
-| `''` | `linux/amd64` | `base-grpc-amd64` |
-| `''` | `linux/arm64` | `base-grpc-arm64` |
-| `cublas` cuda 12 | `linux/amd64` | `base-grpc-cuda-12-amd64` |
-| `cublas` cuda 13 | `linux/amd64` | `base-grpc-cuda-13-amd64` |
-| `cublas` cuda 13 | `linux/arm64` | `base-grpc-cuda-13-arm64` |
-| `cublas` cuda 12 + JetPack base | `linux/arm64` | `base-grpc-l4t-cuda-12-arm64` |
-| `hipblas` | `linux/amd64` | `base-grpc-rocm-amd64` |
-| `vulkan` | `linux/amd64` | `base-grpc-vulkan-amd64` |
-| `vulkan` | `linux/arm64` | `base-grpc-vulkan-arm64` |
-| `sycl_*` | `linux/amd64` | `base-grpc-intel-amd64` |
-
-### Bootstrap order when adding a new variant
-
-If you add a new entry to `base-images.yml`'s matrix, the new tag does not exist on quay until the workflow runs. To consume it from a variant entry safely, dispatch the base-images workflow on the branch first:
-
-```bash
-gh workflow run base-images.yml --ref <feature-branch>
-```
-
-Wait for the new variant to push, then merge the consumer change. Otherwise the consumer's CI fails with "image not found."
-
-## Per-arch native builds + manifest merge
-
-Multi-arch backends (and the core LocalAI image) build natively per arch instead of running both arches under QEMU emulation on a single x86 runner. The pattern:
-
- The matrix has TWO entries per multi-arch backend, sharing the same `tag-suffix` but distinct `platforms` + `platform-tag` + `runs-on`. Example: `-cpu-faster-whisper` has one amd64 entry on `ubuntu-latest` and one arm64 entry on `ubuntu-24.04-arm`.
- Each per-arch build pushes by **canonical digest only** (no tags) via `outputs: type=image,push-by-digest=true,name-canonical=true,push=true`. The digest is uploaded as an artifact named `digests<tag-suffix>-<platform-tag>` (or `digests-localai<...>` for root-image builds).
- `scripts/changed-backends.js` detects shared `tag-suffix` and emits a `merge-matrix` output. `backend.yml` / `backend_pr.yml` have a `backend-merge-jobs` job that consumes it and calls `backend_merge.yml`.
- `backend_merge.yml` downloads all matching digest artifacts and runs `docker buildx imagetools create` to publish the final tagged manifest list pointing at both per-arch digests. Same `docker/metadata-action` config as the original monolithic build, so consumers see no tag-shape change.
- `image_merge.yml` is the equivalent for the root LocalAI image (`-core` placeholder when `tag-suffix` is empty so the artifact-name glob doesn't over-match across `core` and `gpu-vulkan`).
-
-**`provenance: false` is required on multi-registry digest pushes**: with the default `mode=max` provenance attestation, BuildKit bundles a per-registry attestation manifest into each registry's manifest list, making the resulting list digest diverge across registries. `steps.build.outputs.digest` only matches one of them and the merge step's `imagetools create <reg>@sha256:<digest>` lookup fails on the other. Setting `provenance: false` keeps the digest content-only and identical across registries.
-
-## Path filter on master push
-
-Both `backend.yml` (push) and `backend_pr.yml` (PR) generate their matrix dynamically through `scripts/changed-backends.js`:
-
- **PR events**: paginated `pulls/{n}/files` API → filter the matrix to entries whose `dockerfile` path prefix matches the PR diff.
- **Push events**: GitHub Compare API (`/repos/{owner}/{repo}/compare/{before}...{after}`) → same path-filter logic. Falls back to "run everything" on first-branch push (`event.before` zero), API truncation (≥300 changed files), missing API token, or any thrown error.
- **Tag pushes**: `FORCE_ALL=true` is set from the workflow side (`startsWith(github.ref, 'refs/tags/')`) — releases rebuild every backend regardless of diff.
- **Schedule / `workflow_dispatch`**: no `event.before`, falls through to "run everything" automatically.
-
-The Sunday 06:00 UTC cron on `backend.yml` exists specifically because path filtering can leave Python backends frozen on stale wheels. `DEPS_REFRESH` (below) only fires when the build actually runs, so an untouched Python backend would never re-resolve its unpinned deps. The weekly cron is the safety net.
-
-## The `DEPS_REFRESH` cache-buster (Python backends)
-
-Every Python backend goes through the shared `backend/Dockerfile.python`, which ends with:
-
-```dockerfile
-ARG DEPS_REFRESH=initial
-RUN cd /${BACKEND} && PORTABLE_PYTHON=true make
-```
-
-Most Python backends ship `requirements*.txt` files that **do not pin every transitive dep** (`torch`, `transformers`, `vllm`, `diffusers`, etc. are listed without a `==` pin, or with `>=` lower bounds only). With a warm BuildKit cache, the `make` layer hashes only on Dockerfile instructions + COPYed source — not on what `pip install` resolves at runtime. So a warm cache would ship the *first* version of `vllm` ever cached and never pick up upstream releases.
-
-`DEPS_REFRESH` defends against that:
-
- `backend_build.yml` computes `date -u +%Y-W%V` (ISO week, e.g. `2026-W19`) before each build and passes it as a build-arg.
- The `RUN ... make` layer's BuildKit hash now includes that string, so the layer invalidates **at most once per week**, automatically picking up newer wheels.
- Within a week, builds stay warm.
-
-This applies only to `Dockerfile.python` because:
- Go (`Dockerfile.golang`) pins versions in `go.mod` / `go.sum`.
- Rust (`Dockerfile.rust`) pins via `Cargo.lock`.
- C++ backends pin gRPC (`v1.65.0`) and llama.cpp at a specific commit; their inputs don't drift between rebuilds.
-
-### Adjusting the cadence
-
-Bump the format to daily (`+%Y-%m-%d`) or hourly (`+%Y-%m-%d-%H`) for faster refreshes. For one-shot rebuilds without changing the schedule, append a marker to the tag-suffix in the matrix or temporarily delete that backend's cache tag in quay.
-
-## ccache for C++ backend builds
-
-`Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}` declare a BuildKit cache mount on `/root/.ccache`:
-
-```dockerfile
-RUN --mount=type=cache,target=/root/.ccache,id=<backend>-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
-    bash /usr/local/sbin/compile.sh
-```
-
-The compile script exports `CMAKE_C/CXX/CUDA_COMPILER_LAUNCHER=ccache` so CMake threads ccache through gcc/g++/nvcc. `cache-to: type=registry,mode=max` exports the cache mount data into the registry cache, so subsequent builds restore it.
-
-On a `LLAMA_VERSION` bump, most translation units are byte-identical to the previous version's preprocessed source — ccache returns the previous `.o` and skips the real compile. Same for LocalAI source changes that don't actually touch llama.cpp's CMake inputs. Cache scope is per `(TARGETARCH, BUILD_TYPE)` so e.g. cublas-12 doesn't share with cublas-13 (their CUDA headers differ; cross-pollination would just be cache misses anyway).
-
-## Composite actions
-
-Two composite actions handle runner-side prep:
-
- **`.github/actions/free-disk-space/action.yml`** — wraps `jlumbroso/free-disk-space@main` plus an explicit apt purge of dotnet/android/ghc/mono/etc. Reclaims ~6–10 GB on `ubuntu-latest`. No-op on self-hosted runners. Used by `backend_build.yml`, `image_build.yml`, `test.yml`, `tests-aio.yml`, etc.
- **`.github/actions/setup-build-disk/action.yml`** — relocates Docker's data-root to `/mnt` on hosted X64 runners. GHA hosted `ubuntu-latest` ships ~75 GB of unused space at `/mnt`; combined with the free-disk-space cleanup this gives ~100 GB working space — enough for ROCm dev image + vLLM torch install + flash-attn intermediate layers. No-op on self-hosted and on non-X64 hosted runners. Used by `backend_build.yml`, `image_build.yml`, `base-images.yml`.
-
-Both actions run before any docker buildx step.
-
-## Concurrency
-
-All `backend.yml` / `image.yml` / `test.yml` / etc. workflows use:
-
-```yaml
-concurrency:
-  group: ci-<workflow>-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-```
-
- **PR events** group by PR number → newer pushes to the same PR cancel old runs (intended).
- **Push events** group by `github.sha` → each master commit gets its own run; rapid-fire merges don't cancel each other (this was a real issue prior — two master pushes 11 seconds apart would cancel the first's CI).
-
-## Self-warming, no separate populator
-
-There is no cron job that pre-warms the BuildKit cache for individual backends. The production builds *are* the populators. The first master build of a given matrix entry pays the cold cost; subsequent same-entry master builds reuse everything that hasn't changed (apt installs, gRPC compile in the variant `builder-fromsource` stage or skipped entirely when consuming `base-grpc-*`, Python wheel installs, etc.). The base-images workflow's weekly cron is the closest thing to a populator and only refreshes the prebuilt builder bases.
-
-## Manually evicting cache
-
-To force a fully cold build for one backend or the whole image:
-
-```bash
-# Delete a single tag (requires quay credentials with admin on the repo)
-curl -X DELETE \
-  -H "Authorization: Bearer ${QUAY_TOKEN}" \
-  https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/cache-gpu-nvidia-cuda-12-vllm-amd64
-
-# List all tags
-curl -s -H "Authorization: Bearer ${QUAY_TOKEN}" \
-  "https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/?limit=100" | jq '.tags[].name'
-```
-
-Eviction is rarely needed in normal operation — `DEPS_REFRESH` handles weekly drift, source changes invalidate naturally, and `mode=max` keeps the cache scoped per matrix entry per arch so a stale tag never bleeds into a different build.
-
-## What the cache does **not** cover
-
- The `free-disk-space` and `setup-build-disk` composite actions run on every job — these reclaim runner-state, not Docker layers, so BuildKit caches don't apply.
- Intermediate artifacts of `Build (PR)` are not pushed anywhere — PRs only build for verification.
- Darwin builds (see below) — macOS runners have no Docker daemon, so the registry-backed BuildKit cache cannot apply.
-
-## Darwin native caches
-
-`backend_build_darwin.yml` runs natively on `macOS-14` GitHub-hosted runners — there is no Docker, no BuildKit, no cross-job registry cache. Instead, the reusable workflow uses `actions/cache@v4` for four native caches that mirror the spirit of the Linux cache (warm by default, weekly refresh for unpinned Python deps, PRs read-only).
-
-| Cache | Path(s) | Key | Scope |
-|---|---|---|---|
-| Go modules + build | `~/go/pkg/mod`, `~/Library/Caches/go-build` | `go.sum` (managed by `actions/setup-go@v5` `cache: true`) | All darwin jobs |
-| Homebrew | `~/Library/Caches/Homebrew/downloads`, selected `/opt/homebrew/Cellar/*` | hash of `backend_build_darwin.yml` | All darwin jobs |
-| ccache (llama.cpp CMake) | `~/Library/Caches/ccache` | pinned `LLAMA_VERSION` from `backend/cpp/llama-cpp/Makefile` | `inputs.backend == 'llama-cpp'` only |
-| Python wheels (uv + pip) | `~/Library/Caches/pip`, `~/Library/Caches/uv` | `inputs.backend` + ISO week (`+%Y-W%V`) + hash of that backend's `requirements*.txt` | `inputs.lang == 'python'` only |
-
-Read/write semantics match the BuildKit cache: `actions/cache/restore` runs every time, `actions/cache/save` is gated on `github.event_name != 'pull_request'`. PRs read master's warm cache but never write back.
-
-The Python wheel cache uses the same ISO-week cache-buster as the Linux `DEPS_REFRESH` build-arg — same problem (unpinned `torch`/`mlx`/`diffusers`/`transformers` resolve to fresh wheels weekly), same ~one-cold-rebuild-per-week solution.
-
-The brew Cellar cache requires `HOMEBREW_NO_AUTO_UPDATE=1` and `HOMEBREW_NO_INSTALL_CLEANUP=1` (set as job-level env). Without those, `brew install` would mutate the very directories that were just restored, defeating the cache.
-
-**Force-link after cache restore**: `actions/cache` restores `/opt/homebrew/Cellar/*` but NOT the `/opt/homebrew/bin/*` symlinks. After a cache hit, `brew install` sees the Cellar entries and decides "already installed" without re-running its link step, leaving the formulas off PATH. The Dependencies step explicitly runs `brew link --overwrite` for every cached formula afterwards to ensure the symlinks exist.
-
-For ccache, the workflow exports `CMAKE_ARGS=… -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache` via `$GITHUB_ENV` before running `make build-darwin-go-backend`. The Makefile in `backend/cpp/llama-cpp/` already forwards `CMAKE_ARGS` through to each variant build (`fallback`, `grpc`, `rpc-server`), so no script changes are needed. The three variants share most TUs, so ccache dedupes object files across them.
-
-`backend_build_darwin.yml` also has a llama-cpp-specific build-step branch that runs `make backends/llama-cpp-darwin` (the bespoke script that compiles three CMake variants and bundles dylibs via `otool`), distinct from the generic `make build-darwin-${lang}-backend` path. This was consolidated from a previously-bespoke top-level `llama-cpp-darwin` job in `backend.yml` so llama-cpp on Darwin honors the same path filter as the other 34 Darwin backends.
-
-### Cache budget on Darwin
-
-GitHub Actions caches are limited to 10 GB per repo. Steady-state worst case: ~800 MB Go cache + ~2 GB brew Cellar + up to 2 GB ccache + ~1.5 GB × 5 python backends. If the cap is hit, prefer collapsing the per-backend Python keys into a shared `pyenv-darwin-shared-<week>` key (accepts more cross-backend churn for a smaller footprint) before reducing other caches.
-
-## Self-hosted runners
-
-`.github/backend-matrix.yml` has zero references to `arc-runner-set` or `bigger-runner` — all backends run on GHA free-tier hosted runners (`ubuntu-latest` for amd64, `ubuntu-24.04-arm` for arm64 native, `macos-14` for Darwin). The migration off self-hosted relied on the per-arch native split (no QEMU emulation) plus `setup-build-disk`'s `/mnt` relocation (~100 GB working space, enough for ROCm dev image + vLLM/torch installs).
-
-One residual self-hosted reference remains in `test-extra.yml` (`tests-vibevoice-cpp-grpc-transcription` uses `bigger-runner` for the 30s JFK-decode timeout headroom). That's a separate concern.
-
-## Touching the cache pipeline
-
-When changing `image_build.yml`, `backend_build.yml`, any of the `backend/Dockerfile.*` files, `Dockerfile.base-grpc-builder`, `.docker/install-base-deps.sh`, `.docker/<backend>-compile.sh`, or `scripts/changed-backends.js`:
-
-1. **Don't drop `DEPS_REFRESH=...` from the build-args** without a replacement strategy (lockfiles, pinned requirements). Otherwise master will silently freeze on whichever versions were cached at the time.
-2. **Keep `(tag-suffix, platform-tag)` unique per matrix entry** — together they're the cache namespace. Two matrix entries sharing a key would clobber each other's cache.
-3. **Keep `cache-to` gated on `github.event_name != 'pull_request'`** — PRs must not write.
-4. **Keep `ignore-error=true` on `cache-to`** — quay registry hiccups must not fail builds.
-5. **Keep `provenance: false` on push-by-digest steps** — multi-registry digest divergence is the Bug We Already Fixed; reintroducing provenance attestation re-breaks the merge.
-6. **`install-base-deps.sh` is the single source of truth for base contents.** Both `Dockerfile.base-grpc-builder` (CI) and the variant Dockerfiles' `builder-fromsource` (local) bind-mount and execute it. If you add a package to one path, add it to the script — don't fork the logic into a Dockerfile RUN.
-7. **After adding a `base-images.yml` matrix variant, run the workflow on your branch before merging consumer changes** that depend on the new tag — otherwise the consumer's CI fails "image not found."
--- a/.agents/coding-style.md
+++ b/.agents/coding-style.md
@@ -42,25 +42,6 @@ trim_trailing_whitespace = false

 Use `github.com/mudler/xlog` for logging which has the same API as slog.

-## Go tests
-
-All Go tests — including backend tests — must use [Ginkgo](https://onsi.github.io/ginkgo/) (v2) with Gomega matchers, not the stdlib `testing` package with `t.Run` / `t.Errorf`. A test file should register a suite with `RegisterFailHandler(Fail)` in a `TestXxx(t *testing.T)` bootstrap and use `Describe`/`Context`/`It` blocks for the actual cases. Look at any existing `*_test.go` under `core/` or `pkg/` for a template.
-
-Do not mix styles within a package. If you are extending tests in a package that already uses Ginkgo, keep using Ginkgo. If you find stdlib-style Go tests in the tree, treat them as tech debt to be migrated rather than as a pattern to follow.
-
-This is enforced by `golangci-lint` via the `forbidigo` linter (see `.golangci.yml`); calls like `t.Errorf` / `t.Fatalf` / `t.Run` / `t.Skip` / `t.Logf` are flagged. Run `make lint` locally before submitting; the same check runs in CI (`.github/workflows/lint.yml`).
-
-## Outbound HTTP
-
-All outbound HTTP must go through `github.com/mudler/LocalAI/pkg/httpclient` rather than the standard library's default client. Use `httpclient.New(...)` (no body deadline — safe for streaming/SSE) or `httpclient.NewWithTimeout(d, ...)` (simple request/response). Both **refuse redirects by default** and set a TLS 1.2 floor.
-
-The reason is GHSA-3mj3-57v2-4636: the std default client follows redirects, and on a *cross-host* redirect Go forwards custom credential headers (e.g. Anthropic's `x-api-key`) to the redirect target, leaking the secret. `httpclient` fails closed instead.
-
-- Need to follow redirects (download CDNs, registry blobs, GitHub asset URLs)? Pass `httpclient.WithFollowRedirects()` — it still strips credential headers on any cross-host hop.
-- Have a custom transport (IP-pinned dialer, HTTP/2 tuning, a credential-injecting `RoundTripper`)? Pass `httpclient.WithTransport(rt)`, basing the transport on `httpclient.HardenedTransport()` to keep the TLS floor. Handed a `*http.Client` by a library? `httpclient.Harden(c)` applies the policy in place.
-
-This is enforced by `forbidigo` (see `.golangci.yml`): `http.DefaultClient` and `http.Get`/`Post`/`PostForm`/`Head` are flagged. The `&http.Client{}` composite literal can't be matched precisely by forbidigo without also flagging legitimate `*http.Client` type references, so that form is caught by review — don't construct raw clients.
-
 ## Documentation

 The project documentation is located in `docs/content`. When adding new features or changing existing functionality, it is crucial to update the documentation to reflect these changes. This helps users understand how to use the new capabilities and ensures the documentation stays relevant.
--- a/.agents/ds4-backend.md
+++ b/.agents/ds4-backend.md
@@ -1,145 +0,0 @@
-# Working on the ds4 Backend
-
-`antirez/ds4` is a single-model inference engine for DeepSeek V4 Flash.
-LocalAI wraps the engine's C API (`ds4/ds4.h`) with a fresh C++ gRPC server at
-`backend/cpp/ds4/` - NOT a fork of llama-cpp's grpc-server.cpp.
-
-## Pin
-
-`backend/cpp/ds4/Makefile` pins `DS4_VERSION?=<sha>` at the top. The `ds4`
-target in the Makefile clones `antirez/ds4` at that commit (mirroring the
-llama-cpp / ik-llama-cpp / turboquant pattern). The bump-deps bot
-(`.github/workflows/bump_deps.yaml`) finds this pin via grep and opens a
-daily PR to update it. To bump manually: edit the `DS4_VERSION?=` line,
-then `make purge && make` (or rely on CI's clean build).
-
-## Wire shape
-
-| RPC | Implementation |
-|---|---|
-| Health, Free, Status | Trivial; no engine dependency for Health |
-| LoadModel | `ds4_engine_open` + `ds4_session_create`; backend is compile-time (DS4_NO_GPU → CPU, __APPLE__ → Metal, otherwise CUDA) |
-| TokenizeString | `ds4_tokenize_text` |
-| Predict | `ds4_engine_generate_argmax` + `DsmlParser` → one ChatDelta with content / reasoning_content / tool_calls[] |
-| PredictStream | Same, per-token ChatDelta writes |
-
-## DSML
-
-ds4 emits tool calls as literal text markers (`<｜DSML｜tool_calls>` etc.) -
-NOT special tokens. `dsml_parser.{h,cpp}` is our streaming state machine that
-classifies token bytes into CONTENT / REASONING / TOOL_START / TOOL_ARGS / TOOL_END
-events. `dsml_renderer.{h,cpp}` does the prompt direction: turns
-OpenAI tool_calls + role=tool messages back into DSML for the next turn.
-
-## Thinking modes
-
-`PredictOptions.Metadata["enable_thinking"]` gates thinking on/off (default ON).
-`["reasoning_effort"] == "max" | "xhigh"` selects `DS4_THINK_MAX`; anything else
-maps to `DS4_THINK_HIGH`. We pass the chosen mode to `ds4_chat_append_assistant_prefix`.
-
-## Disk KV cache
-
-`kv_cache.{h,cpp}` implements an SHA1-keyed file cache using ds4's public
-`ds4_session_save_payload` / `ds4_session_load_payload` API. Enable per request
-via `ModelOptions.Options[] = "kv_cache_dir:/some/path"`. Format is **our own** -
-NOT bit-compatible with ds4-server's KVC files (interop is a follow-up plan).
-
-## Engine options (LoadModel)
-
-`LoadModel` maps `ModelOptions.Options[]` (`"key:value"`, from model-YAML
-`options:`) onto `ds4_engine_options` through a **declarative table**
-(`kEngineOptSpecs` + `apply_engine_option` in `grpc-server.cpp`). The struct is
-plain C with no reflection, so the field set is enumerated once in the table;
-adding a future engine knob is a one-line table row, not a new branch. Unknown
-keys are ignored (back-compat). A bare flag (`ssd_streaming` with no value)
-means `true`. Path-type values (`mtp_path`, `expert_profile_path`,
-`directional_steering_file`) resolve **relative to the model directory**, so a
-gallery entry can reference a companion file it downloaded by bare filename;
-absolute values pass through. `ds4_role` / `ds4_layers` / `ds4_listen` /
-`ds4_route_timeout` / `kv_cache_dir` keep their dedicated handling (validation
-+ coordinator wiring) and are not in the table.
-
-Wired keys: `mtp_path`, `mtp_draft`, `mtp_margin`, `prefill_chunk`,
-`power_percent`, `warm_weights`, `quality`, `ssd_streaming`,
-`ssd_streaming_cold`, `ssd_streaming_preload_experts`,
-`ssd_streaming_cache_experts` (count or `NGB`, sets both experts+bytes via
-`ds4_parse_streaming_cache_experts_arg`), `simulate_used_memory` (`NGB` via
-`ds4_parse_gib_arg`), `expert_profile_path`, `directional_steering_file`,
-`directional_steering_attn`, `directional_steering_ffn`.
-
-## SSD streaming (running models larger than RAM)
-
-ds4's **SSD streaming** keeps non-routed weights resident and streams routed MoE
-experts from the GGUF on cache misses, turning "does it fit in RAM" into a speed
-spectrum. **Metal (Darwin) only** - it is a no-op on CUDA/CPU. Enable with
-`options: ["ssd_streaming"]`; size the routed-expert cache with
-`ssd_streaming_cache_experts:NGB` (omit for ds4's automatic 80%-of-working-set
-budget). Gallery entries built on this: `deepseek-v4-flash-q4-ssd` (153 GB Flash
-on a 128 GB Mac) and `deepseek-v4-pro-q2-ssd` (433 GB Pro, experimental).
-
-## Build matrix
-
-| Build | Where | Notes |
-|---|---|---|
-| `cpu-ds4` (amd64 + arm64) | Linux GHA | ds4 considers CPU debug-only; useful only for wiring tests |
-| `cuda13-ds4` (amd64 + arm64) | Linux GHA + DGX Spark validation | Primary production path on Linux |
-| `ds4-darwin` (arm64) | macOS GHA runners | Metal; uses `scripts/build/ds4-darwin.sh` like llama-cpp-darwin |
-
-cuda12 is intentionally omitted. ROCm / Vulkan / SYCL are not applicable.
-
-## Hardware-gated validation
-
-`tests/e2e-backends/backend_test.go` in `BACKEND_BINARY` mode:
-
-```
-BACKEND_BINARY=$(pwd)/backend/cpp/ds4/package/run.sh \
-BACKEND_TEST_MODEL_FILE=/path/to/ds4flash.gguf \
-BACKEND_TEST_CAPS=health,load,predict,stream,tools \
-BACKEND_TEST_TOOL_PROMPT="What's the weather in Paris?" \
-go test -count=1 -timeout=30m -v ./tests/e2e-backends/...
-```
-
-CI does not load the model; the suite is opt-in via env vars.
-
-## Distributed mode
-
-ds4 supports **layer-split** distributed inference (a model too big for one host,
-split by transformer layer; the GGUF must be present on every machine, each loads
-only its slice). Topology is **inverted** vs llama.cpp: the coordinator listens,
-workers dial in.
-
-- **`ds4-worker` binary**: built and packaged next to `grpc-server` (`package.sh`
-  copies it into `package/`). Links the same engine objects plus `ds4_distributed.o`;
-  **no gRPC/protobuf dependency** (speaks ds4's own TCP transport), so it builds
-  even where `grpc-server` can't. Runs the worker serving loop (`ds4_dist_run`).
-- **Coordinator wiring**: the ds4 `grpc-server` acts as coordinator when `LoadModel`
-  `ModelOptions.Options` (from model-YAML `options:`) carry:
-  - `ds4_role:coordinator` (enables distributed mode; absent → single-node, back-compat)
-  - `ds4_layers:0:19` (coordinator's own slice, inclusive; `N:output` includes the head)
-  - `ds4_listen:0.0.0.0:1234` (address workers dial into)
-  - `ds4_route_timeout:60` (optional; seconds Predict/PredictStream wait for the route
-    to form before returning gRPC `UNAVAILABLE`; default 60)
-- **Worker CLI**: `local-ai worker ds4-distributed -- <ds4-worker args>` resolves the
-  ds4 backend and execs the packaged `ds4-worker` (raw passthrough), e.g.
-  `--role worker --model /models/ds4flash.gguf --layers 20:output --coordinator <host> 1234`.
-
-Opt-in e2e in `tests/e2e-backends/backend_test.go`, gated by
-`BACKEND_TEST_DS4_DISTRIBUTED=1` (plus `BACKEND_TEST_DS4_WORKER_BINARY`,
-`BACKEND_TEST_DS4_WORKER_LAYERS`, `BACKEND_TEST_DS4_COORDINATOR_LAYERS`,
-`BACKEND_TEST_DS4_LISTEN`). Design spec:
-`docs/superpowers/specs/2026-05-30-ds4-distributed-inference-design.md`.
-
-## Importer
-
-`core/gallery/importers/ds4.go` (`DS4Importer`) auto-detects ds4 weights by
-matching the `antirez/deepseek-v4-gguf` repo URI or the
-`DeepSeek-V4-Flash-*.gguf` filename pattern. **Registered BEFORE
-`LlamaCPPImporter`** in `defaultImporters` - both match `.gguf` but ds4 is more
-specific, and first-match-wins. The importer emits `backend: ds4`, uses
-`ds4flash.gguf` as the local filename (matches ds4's own CLI default), and
-disables the Go-side automatic tool-parsing fallback (the C++ backend emits
-ChatDelta.tool_calls natively via `DsmlParser`).
-
-ds4 is also listed in `core/http/endpoints/localai/backend.go`'s pref-only
-slice so the `/import-model` UI surfaces it as a manual choice for users who
-want to force the backend on a non-canonical URI.
--- a/.agents/llama-cpp-backend.md
+++ b/.agents/llama-cpp-backend.md
@@ -61,12 +61,6 @@ Always check `llama.cpp` for new model configuration options that should be supp
   - `reasoning_format` - Reasoning format options
   - Any new flags or parameters

-### Speculative Decoding Types
-
-The `spec_type` option in `grpc-server.cpp` delegates to upstream's `common_speculative_types_from_names()`, so new speculative types added to the `common_speculative_type_from_name` map in `common/speculative.cpp` are picked up automatically with no code changes - only docs need an entry in `docs/content/advanced/model-configuration.md`. Current values: `none`, `draft-simple`, `draft-eagle3`, `draft-mtp`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, `ngram-cache`.
-
-`draft-mtp` (Multi-Token Prediction, [ggml-org/llama.cpp#22673](https://github.com/ggml-org/llama.cpp/pull/22673)) does not need a separate draft GGUF: when `spec_type` includes `draft-mtp` and `draftmodel` is empty, the upstream server creates an MTP context off the target model itself. LocalAI's gRPC layer needs no changes for this — it works through the existing `params.speculative.types` plumbing and the derived `cparams.n_rs_seq = params.speculative.need_n_rs_seq()` in `common_context_params_to_llama`.
-
 ### Implementation Guidelines

 1. **Feature Parity**: Always aim for feature parity with llama.cpp's implementation
--- a/.agents/localai-assistant-mcp.md
+++ b/.agents/localai-assistant-mcp.md
@@ -1,97 +0,0 @@
-# LocalAI Assistant — admin MCP server
-
-This document is the contract for **anyone** (human or AI agent) touching LocalAI's admin REST surface, the in-process MCP server that wraps it, or the embedded skill prompts that teach the assistant how to use it. Read this before adding/removing/renaming admin endpoints, MCP tools, or skill recipes.
-
-## What this feature is
-
-`pkg/mcp/localaitools/` is a public Go package that exposes LocalAI's admin/management surface as an MCP server. It is used in two ways:
-
-1. **In-process**: when an admin opens a chat with `metadata.localai_assistant=true`, the chat handler injects the in-memory MCP server (paired `net.Pipe()` transport, no HTTP loopback) so the LLM can install models, manage backends and edit configs by chatting.
-2. **Standalone**: the `local-ai mcp-server --target=…` subcommand serves the same MCP server over stdio, talking HTTP to a remote LocalAI instance.
-
-The two modes share **all** tool definitions and skill prompts. They differ only in their `LocalAIClient` implementation (`inproc/` calls services directly; `httpapi/` calls REST).
-
-## The three things you must keep in sync
-
-When you change LocalAI's admin surface, three layers must stay aligned:
-
-1. **REST endpoint** in `core/http/endpoints/localai/*.go`.
-2. **MCP tool registration** in `pkg/mcp/localaitools/tools_*.go`, plus a method on `LocalAIClient` (in `client.go`) and implementations in both `inproc/client.go` **and** `httpapi/client.go`.
-3. **Skill prompt** under `pkg/mcp/localaitools/prompts/skills/*.md` — the markdown that teaches the LLM how to use the new tool. If the new tool fits an existing recipe, update that recipe; otherwise add a new file.
-
-If you ship a REST endpoint without (2) and (3), conversational admins won't see the feature.
-
-## Checklist for adding a new admin endpoint
-
-- [ ] REST endpoint exists in `core/http/endpoints/localai/*.go` and is gated by `auth.RequireAdmin()` in `core/http/routes/localai.go`.
-- [ ] `LocalAIClient` interface in `pkg/mcp/localaitools/client.go` has a method covering the new operation.
-- [ ] DTOs added/updated in `pkg/mcp/localaitools/dto.go` (JSON-tagged; never expose raw service types).
-- [ ] `inproc/client.go` implements the new method by calling the service directly (not via HTTP loopback).
-- [ ] `httpapi/client.go` implements the new method by calling the REST endpoint.
-- [ ] Tool registration added in the appropriate `pkg/mcp/localaitools/tools_*.go`. Mutating tools must reference safety rule 1 in the description.
-- [ ] If the tool is mutating, ensure `Options{DisableMutating: true}` skips it (mirror the pattern in `tools_models.go`).
-- [ ] Skill prompt added or updated under `pkg/mcp/localaitools/prompts/skills/`. The prompt must instruct the LLM when to call the tool, what to ask the user first, and what to do on error.
-- [ ] Tests:
-   - `pkg/mcp/localaitools/server_test.go` adds the tool name to `expectedFullCatalog` and `expectedReadOnlyCatalog` (if read-only).
-   - Tool dispatch is added to `TestEachToolDispatchesToClient`.
-   - `pkg/mcp/localaitools/httpapi/client_test.go` covers the new HTTP path.
-
-## Adding a new skill recipe (no new tool)
-
-Sometimes you want to teach the LLM a new pattern that uses existing tools. Drop a markdown file under `pkg/mcp/localaitools/prompts/skills/<verb>_<noun>.md`. The file is automatically embedded by `//go:embed` and assembled into the system prompt in lexicographic order. No Go changes needed.
-
-Conventions:
-- Filename: `<verb>_<noun>.md` (e.g. `install_chat_model.md`, `upgrade_backend.md`).
-- First line: `# Skill: <Title Case description>`.
-- Number the steps. Reference exact tool names in backticks.
-- If the skill mutates state, remind the LLM to confirm with the user.
-
-## Code conventions
-
-These rules guard against the magic-literal drift that surfaced in the first audit. Do not re-introduce bare strings.
-
-- **Tool names** always come from the `Tool*` constants in `pkg/mcp/localaitools/tools.go`. Tool registrations, the test catalog (`server_test.go`'s `expectedFullCatalog` / `expectedReadOnlyCatalog`), and dispatch tables reference the constants. The embedded skill prompts under `prompts/` keep bare strings — that's the one allowed exception, and `TestPromptsContainSafetyAnchors` enforces alignment.
-- **Toggle/pin actions** use the `modeladmin.Action` type (`pkg/mcp/localaitools` and `core/services/modeladmin`). Use `ActionEnable`/`ActionDisable`/`ActionPin`/`ActionUnpin`; never bare `"enable"`/`"pin"` strings.
-- **Capability tags** for `list_installed_models` use the `localaitools.Capability` type (`capability.go`). The `LocalAIClient.ListInstalledModels` interface takes a typed `Capability`, and the `inproc` switch only accepts canonical values (`"embed"`/`"embedding"` are not aliases — only `CapabilityEmbeddings`).
-- **HTTP error checks** in `httpapi.Client` use `errors.Is(err, ErrHTTPNotFound)`, not substring matches on `err.Error()`. The typed `*HTTPError` carries `StatusCode` and `Body`; add new sentinel errors as needed rather than re-introducing string matching.
-- **Channel sends** to `GalleryService.ModelGalleryChannel` / `BackendGalleryChannel` from inproc clients MUST select on `ctx.Done()` so a cancelled chat completion releases the goroutine. See `inproc.sendModelOp` / `sendBackendOp`.
-- **Disk writes** of model config YAML go through `modeladmin.writeFileAtomic` (temp file + `os.Rename`). `os.WriteFile` truncates on crash and corrupts the model.
-- **MCP server lifecycle**: every initialised holder MUST register `Close()` with `signals.RegisterGracefulTerminationHandler`. The standalone `mcp-server` CLI uses `signal.NotifyContext` to honour SIGINT/SIGTERM.
-
-## File map (where to look)
-
-```
-pkg/mcp/localaitools/
-  client.go              # LocalAIClient interface + DTO registry
-  dto.go                 # JSON-tagged DTOs shared by both client impls
-  server.go              # NewServer(client, opts) — registers tools
-  tools.go               # Tool* name constants (single source of truth)
-  capability.go          # Capability type + constants
-  tools_models.go        # gallery_search, install_model, import_model_uri, ...
-  tools_backends.go
-  tools_config.go
-  tools_system.go
-  tools_state.go
-  prompts.go             # //go:embed loader + SystemPrompt(opts)
-  prompts/00_role.md
-  prompts/10_safety.md   # SAFETY RULES — change with care
-  prompts/20_tools.md    # curated tool catalog with one-liners
-  prompts/skills/*.md
-  inproc/client.go       # in-process LocalAIClient (services-direct)
-  httpapi/client.go      # REST LocalAIClient (for standalone CLI / remote)
-core/http/endpoints/mcp/
-  localai_assistant.go   # process-wide holder + LocalToolExecutor
-core/cli/mcp_server.go   # local-ai mcp-server subcommand
-```
-
-## Why two clients
-
-The in-process MCP server runs inside the same LocalAI binary that serves chat. Going over HTTP loopback would (a) require minting a synthetic admin API key for the server to authenticate against itself, (b) double-marshal every tool dispatch, and (c) lose access to in-process channels (e.g. `GalleryService.ModelGalleryChannel` for streaming install progress). So in-process uses `inproc.Client`. The standalone stdio CLI talks to a *remote* LocalAI; HTTP is the only option, so it uses `httpapi.Client`. Both implement the same `LocalAIClient` interface, and the parity test in `pkg/mcp/localaitools/parity_test.go` (when present) keeps their output equivalent.
-
-## Why prompt-enforced confirmation, not code gates
-
-The user chose KISS. Every mutating tool has a safety rule (`prompts/10_safety.md` rule 1) that requires the LLM to summarise the action and wait for explicit user confirmation before calling it. There is no `plan_*`/`apply_*` two-step in code. If you add a mutating tool, do **not** add per-tool confirmation logic in Go — instead, list the new tool name in `prompts/10_safety.md` so the LLM knows it falls under the confirmation rule.
-
-## Distributed mode
-
-The in-memory MCP server runs only on the head node (where the chat handler runs). `inproc.Client` wraps services that are already distributed-aware (`GalleryService` coordinates with workers; `ListNodes` reads the NATS-populated registry). No NATS routing of MCP tools — the admin surface lives on the head, period.
--- a/.agents/sglang-backend.md
+++ b/.agents/sglang-backend.md
@@ -1,62 +0,0 @@
-# Working on the SGLang Backend
-
-The SGLang backend lives at `backend/python/sglang/backend.py` (async gRPC). It wraps SGLang's `Engine` (`sglang.srt.entrypoints.engine.Engine`) and translates LocalAI's gRPC `PredictOptions` into SGLang sampling params + outputs into `Reply.chat_deltas`. Structurally it mirrors `backend/python/vllm/backend.py` — keep them shaped the same so changes in one have an obvious analog in the other.
-
-## `engine_args` is the universal escape hatch
-
-A small fixed set of fields on `ModelOptions` is mapped to typed SGLang kwargs in `LoadModel` (model, quantization, load_format, gpu_memory_utilization → mem_fraction_static, trust_remote_code, enforce_eager → disable_cuda_graph, tensor_parallel_size → tp_size, max_model_len → context_length, dtype). **Everything else** flows through the `engine_args:` YAML map.
-
-Validation happens in `_apply_engine_args`. Keys are checked against `dataclasses.fields(ServerArgs)` (`sglang.srt.server_args.ServerArgs` is a flat `@dataclass` with ~380 fields). Unknown keys raise `ValueError` at LoadModel time with a `difflib.get_close_matches` suggestion — same shape as the vLLM backend.
-
-**Precedence:** typed `ModelOptions` fields populate `engine_kwargs` first, then `engine_args` overrides them. So a YAML that sets both `gpu_memory_utilization: 0.9` and `engine_args.mem_fraction_static: 0.5` ends up at `0.5`. Document this when answering "why didn't my YAML field stick?".
-
-**ServerArgs is flat.** Unlike vLLM, where speculative decoding is nested under `engine_args.speculative_config: {...}`, SGLang exposes flat top-level fields: `speculative_algorithm`, `speculative_draft_model_path`, `speculative_num_steps`, `speculative_eagle_topk`, `speculative_num_draft_tokens`, `speculative_dflash_block_size`, etc. There is no `speculative_config:` dict. Same goes for compilation, kv-transfer, attention — all flat.
-
-The canonical reference is `python/sglang/srt/server_args.py:ServerArgs` (line ~304). When SGLang adds new flags, no LocalAI code change is needed — they're automatically available via `engine_args:`. The validator picks them up because it introspects the live dataclass.
-
-## Speculative decoding cheatsheet
-
-`--speculative-algorithm` accepts `EAGLE`, `EAGLE3`, `NEXTN`, `STANDALONE`, `NGRAM`, `DFLASH`. `NEXTN` is silently rewritten to `EAGLE` in `ServerArgs.__post_init__` (`server_args.py:3286-3287`). MTP (Multi-Token Prediction) is the same EAGLE path with `num_steps=1, eagle_topk=1, num_draft_tokens=2` against a target whose architecture has multi-token heads (e.g. MiMo-7B-RL, DeepSeek-V3-MTP).
-
-| Algorithm | Drafter requirement | Gallery demo target | Gallery demo drafter |
-|-----------|--------------------|---------------------|----------------------|
-| `NEXTN` / `EAGLE` (MTP) | Assistant drafter or built-in heads | google/gemma-4-E2B-it, google/gemma-4-E4B-it | google/gemma-4-E2B-it-assistant, google/gemma-4-E4B-it-assistant |
-| `EAGLE3` | EAGLE3 draft head | (no gallery entry yet) | e.g. jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B |
-| `DFLASH` | Block-diffusion drafter | (no gallery entry yet) | e.g. z-lab/Qwen3-4B-DFlash-b16 |
-| `STANDALONE` | Smaller LLM as drafter | (no gallery entry yet) | any smaller chat-tuned LLM in the same family |
-| `NGRAM` | None — uses prefix history | (no gallery entry yet) | n/a |
-
-The Gemma 4 demos use `mem_fraction_static: 0.85` (cookbook default) and the cookbook's `num_steps=5, num_draft_tokens=6, eagle_topk=1` parameters. Other algorithms are reachable from any user YAML via `engine_args:` but don't have shipped demos yet — that's a deliberate gallery scope choice, not a backend limitation.
-
-Gemma 4 support requires sglang built from a commit that includes [PR #21952](https://github.com/sgl-project/sglang/pull/21952). LocalAI's pinned release for cublas12 / cublas13 includes it. The `l4t13` (JetPack 7 / sbsa cu130) build floors at `sglang>=0.5.0` because the `pypi.jetson-ai-lab.io` mirror still ships only `0.5.1.post2` as of 2026-05-06 — Gemma 4 / MTP recipes are therefore not available on l4t13 until that mirror catches up. `backend.py` keeps backward compat with the 0.5.x → 0.5.11 `SamplingParams.seed` → `sampling_seed` rename via runtime detection.
-
-Compatibility caveats per the SGLang docs: DFLASH and NGRAM are incompatible with `enable_dp_attention`; DFLASH requires `pp_size == 1`; STANDALONE is incompatible with `enable_dp_attention`; NGRAM is CUDA-only and disables the overlap scheduler.
-
-### `mem_fraction_static` + quantization + MTP on consumer GPUs
-
-When combining online weight quantization (`engine_args.quantization: fp8` / `awq` / etc.) with built-in-head MTP (`speculative_algorithm: EAGLE`/`NEXTN`) on a tight VRAM budget, sglang's default `mem_fraction_static: 0.85` will OOM during draft-worker init. The reason: sglang quantizes the **target** model's transformer blocks but loads the **MTP draft worker's vocab embedding** at the source dtype (typically bf16). For a 7 B-class model with a 150k-token vocab × 4096 hidden, that's another ~1.2 GiB allocated *after* the static pool is reserved. At 0.85 fraction on a 16 GB card there's no room left.
-
-Workaround: drop `mem_fraction_static` to ~0.7 so the post-static heap can absorb the MTP embedding alloc + CUDA graph private pools. Verified end-to-end on MiMo-7B-RL + fp8 + MTP on a 16 GB RTX 5070 Ti (`gallery/sglang-mimo-7b-mtp.yaml`) at ~88 tok/s. Models with larger vocabs or more MTP layers (e.g. DeepSeek-V3-MTP) need an even smaller fraction.
-
-This isn't documented anywhere upstream as of 2026-05-06 — the SGLang Gemma 4 cookbook uses 0.85 because their MTP path doesn't go through `eagle_worker_v2.py` for an embedding-bearing draft module. Don't blanket-apply 0.7 across all sglang YAMLs; only when MTP-with-built-in-heads + quantization combine.
-
-## Tool-call and reasoning parsers stay on `Options[]`
-
-ServerArgs has `tool_call_parser` and `reasoning_parser` fields, and the backend does pass them through to `Engine` so SGLang's own HTTP/OAI surface keeps working. But for the **LocalAI** request path the backend constructs fresh per-request parser instances in `_make_parsers` (`backend.py:286`) because the parsers are stateful — the streaming and non-streaming paths each need their own.
-
-So the user-facing knob stays on `Options[]`:
-
-```yaml
-options:
-  - tool_parser:hermes
-  - reasoning_parser:deepseek_r1
-```
-
-Putting these in `engine_args:` will set them on `ServerArgs` but the LocalAI-level streaming `ChatDelta` will not pick them up. Don't recommend that path.
-
-## What's missing today (out of scope, but worth tracking)
-
-- `core/config/hooks_sglang.go` — there is no SGLang equivalent of `hooks_vllm.go`. The vLLM hook auto-selects parsers for known model families from `parser_defaults.json` and seeds production engine_args defaults. A symmetric hook for SGLang could reuse the same `parser_defaults.json` (the SGLang parser names are different but the family detection is shared) and seed defaults like `enable_metrics: true` or attention-backend choices.
-- `core/gallery/importers/sglang.go` — vLLM has an importer that resolves model architecture → parser defaults at gallery-import time. A matching importer for SGLang would let `local-ai install` populate sensible parsers automatically.
-
-These should be a follow-up PR, not a blocker for the engine_args feature.
--- a/.docker/apt-mirror.sh
+++ b/.docker/apt-mirror.sh
@@ -1,39 +0,0 @@
-#!/bin/sh
-# Reconfigure Ubuntu apt sources to point at an alternate mirror.
-#
-# Used by Dockerfiles via `RUN --mount=type=bind,source=.docker/apt-mirror.sh,...`
-# and by CI workflows on the runner to mitigate outages of the default
-# archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com pool.
-#
-# Inputs (env):
-#   APT_MIRROR        Replacement for archive.ubuntu.com and security.ubuntu.com
-#                     (e.g. "http://azure.archive.ubuntu.com" or
-#                      "https://mirrors.edge.kernel.org").
-#                     Leave empty to keep upstream. The trailing "/ubuntu/..."
-#                     path is preserved by the rewrite.
-#   APT_PORTS_MIRROR  Replacement for ports.ubuntu.com (arm64/ppc64el/...).
-#                     Leave empty to keep upstream.
-#
-# Both default to empty, in which case the script is a no-op.
-
-set -e
-
-if [ -z "${APT_MIRROR}" ] && [ -z "${APT_PORTS_MIRROR}" ]; then
-    exit 0
-fi
-
-# Ubuntu 24.04 (noble) ships DEB822 sources at /etc/apt/sources.list.d/ubuntu.sources;
-# older releases use /etc/apt/sources.list. We rewrite whichever exists.
-for f in /etc/apt/sources.list.d/ubuntu.sources /etc/apt/sources.list; do
-    [ -f "$f" ] || continue
-    if [ -n "${APT_MIRROR}" ]; then
-        # Use a comma delimiter so the alternation pipe in the regex
-        # is not interpreted as the s/// separator.
-        sed -i -E "s,https?://(archive\.ubuntu\.com|security\.ubuntu\.com),${APT_MIRROR},g" "$f"
-    fi
-    if [ -n "${APT_PORTS_MIRROR}" ]; then
-        sed -i -E "s,https?://ports\.ubuntu\.com,${APT_PORTS_MIRROR},g" "$f"
-    fi
-done
-
-echo "apt-mirror: rewrote sources (APT_MIRROR='${APT_MIRROR}', APT_PORTS_MIRROR='${APT_PORTS_MIRROR}')"
--- a/.docker/ik-llama-cpp-compile.sh
+++ b/.docker/ik-llama-cpp-compile.sh
@@ -1,30 +0,0 @@
-#!/usr/bin/env bash
-# Shared compile logic for backend/Dockerfile.ik-llama-cpp.
-# Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.
-
-set -euxo pipefail
-
-export CCACHE_DIR=/root/.ccache
-ccache --max-size=5G || true
-ccache -z || true
-
-export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"
-
-if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
-  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
-  export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
-  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
-  rm -rf /LocalAI/backend/cpp/ik-llama-cpp-*-build
-fi
-
-cd /LocalAI/backend/cpp/ik-llama-cpp
-
-if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
-  # ARM64 / ROCm: build without x86 SIMD
-  make ik-llama-cpp-fallback
-else
-  # ik_llama.cpp's IQK kernels require at least AVX2
-  make ik-llama-cpp-avx2
-fi
-
-ccache -s || true
--- a/.docker/install-base-deps.sh
+++ b/.docker/install-base-deps.sh
@@ -1,244 +0,0 @@
-#!/usr/bin/env bash
-# Single source of truth for builder-base contents.
-#
-# Used by:
-#   - backend/Dockerfile.base-grpc-builder        (CI prebuilt-base source of truth)
-#   - backend/Dockerfile.llama-cpp                (builder-fromsource stage)
-#   - backend/Dockerfile.ik-llama-cpp             (builder-fromsource stage)
-#   - backend/Dockerfile.turboquant               (builder-fromsource stage)
-#
-# All four files invoke this script via
-#   RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
-#       --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
-#       bash /usr/local/sbin/install-base-deps
-#
-# so the prebuilt CI base image and the from-source local-dev path are
-# bit-equivalent by construction.
-#
-# Inputs (env, populated from Dockerfile ARG/ENV):
-#   BUILD_TYPE                ("cublas"|"l4t"|"hipblas"|"vulkan"|"sycl"|"clblas"|"")
-#   CUDA_MAJOR_VERSION        ("12" | "13" | "")
-#   CUDA_MINOR_VERSION        ("8" | "0" | "")
-#   TARGETARCH                ("amd64" | "arm64")
-#   UBUNTU_VERSION            ("2204" | "2404")
-#   SKIP_DRIVERS              ("false" | "true")
-#   CMAKE_FROM_SOURCE         ("false" | "true")
-#   CMAKE_VERSION             ("3.31.10")
-#   GRPC_VERSION              ("v1.65.0")
-#   GRPC_MAKEFLAGS            ("-j4 -Otarget")
-#   APT_MIRROR / APT_PORTS_MIRROR  (optional; consumed by /usr/local/sbin/apt-mirror)
-#   AMDGPU_TARGETS            (optional; only relevant for hipblas downstream)
-#
-# IMPORTANT: install logic is copied verbatim from the prior in-Dockerfile
-# RUN blocks. Do not paraphrase apt invocations / version pins / sed line
-# numbers / deb URLs — the bit-equivalence guarantee depends on it.
-
-set -eux
-
-# --- 0. apt mirror rewrite (no-op when APT_MIRROR / APT_PORTS_MIRROR unset) ---
-if [ -x /usr/local/sbin/apt-mirror ]; then
-    APT_MIRROR="${APT_MIRROR:-}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR:-}" \
-        sh /usr/local/sbin/apt-mirror
-fi
-
-export DEBIAN_FRONTEND=noninteractive
-export MAKEFLAGS="${GRPC_MAKEFLAGS:-}"
-
-# --- 1. Base apt build deps ---
-apt-get update
-apt-get install -y --no-install-recommends \
-    build-essential \
-    ccache git \
-    ca-certificates \
-    make \
-    pkg-config libcurl4-openssl-dev \
-    curl unzip \
-    libssl-dev wget
-apt-get clean
-rm -rf /var/lib/apt/lists/*
-
-# --- 2. Vulkan SDK (BUILD_TYPE=vulkan) ---
-# NB: this block intentionally installs `cmake` via apt as part of the
-# Vulkan tooling — must run before the dedicated CMake step below.
-if [ "${BUILD_TYPE:-}" = "vulkan" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; then
-    apt-get update
-    apt-get install -y  --no-install-recommends \
-        software-properties-common pciutils wget gpg-agent
-    apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
-        libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
-        libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
-        git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
-        ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
-        clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
-    if [ "amd64" = "${TARGETARCH:-}" ]; then
-        wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz"
-        tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz
-        rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz
-        mkdir -p /opt/vulkan-sdk
-        mv 1.4.335.0 /opt/vulkan-sdk/
-        ( cd /opt/vulkan-sdk/1.4.335.0 && \
-          ./vulkansdk --no-deps --maxjobs \
-              vulkan-loader \
-              vulkan-validationlayers \
-              vulkan-extensionlayer \
-              vulkan-tools \
-              shaderc )
-        cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/
-        cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/
-        cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/
-        cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/
-        rm -rf /opt/vulkan-sdk
-    fi
-    if [ "arm64" = "${TARGETARCH:-}" ]; then
-        mkdir vulkan
-        ( cd vulkan && \
-          curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
-          tar -xvf vulkan-sdk.tar.xz && \
-          rm vulkan-sdk.tar.xz && \
-          cd 1.4.335.0 && \
-          cp -rfv aarch64/bin/* /usr/bin/ && \
-          cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
-          cp -rfv aarch64/include/* /usr/include/ && \
-          cp -rfv aarch64/share/* /usr/share/ )
-        rm -rf vulkan
-    fi
-    ldconfig
-    apt-get clean
-    rm -rf /var/lib/apt/lists/*
-fi
-
-# --- 3. CUDA toolkit (BUILD_TYPE=cublas|l4t) ---
-if { [ "${BUILD_TYPE:-}" = "cublas" ] || [ "${BUILD_TYPE:-}" = "l4t" ]; } && [ "${SKIP_DRIVERS:-false}" = "false" ]; then
-    apt-get update
-    apt-get install -y  --no-install-recommends \
-        software-properties-common pciutils
-    if [ "amd64" = "${TARGETARCH:-}" ]; then
-        curl -O "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb"
-    fi
-    if [ "arm64" = "${TARGETARCH:-}" ]; then
-        if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
-            curl -O "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb"
-        else
-            curl -O "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb"
-        fi
-    fi
-    dpkg -i cuda-keyring_1.1-1_all.deb
-    rm -f cuda-keyring_1.1-1_all.deb
-    apt-get update
-    apt-get install -y --no-install-recommends \
-        "cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
-        "libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
-        "libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
-        "libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
-        "libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
-        "libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}"
-    if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "${TARGETARCH:-}" ]; then
-        apt-get install -y --no-install-recommends \
-            "libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
-            "libcudnn9-cuda-${CUDA_MAJOR_VERSION}" \
-            "cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
-            "libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}"
-    fi
-    apt-get clean
-    rm -rf /var/lib/apt/lists/*
-fi
-
-# --- 4. cuDSS / NVPL on arm64 + cublas (legacy JetPack / Tegra) ---
-# https://github.com/NVIDIA/Isaac-GR00T/issues/343
-if [ "${BUILD_TYPE:-}" = "cublas" ] && [ "${TARGETARCH:-}" = "arm64" ]; then
-    wget "https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb"
-    dpkg -i "cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb"
-    cp /var/cudss-local-tegra-repo-ubuntu"${UBUNTU_VERSION}"-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/
-    apt-get update
-    apt-get -y install cudss "cudss-cuda-${CUDA_MAJOR_VERSION}"
-    wget "https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb"
-    dpkg -i "nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb"
-    cp /var/nvpl-local-repo-ubuntu"${UBUNTU_VERSION}"-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/
-    apt-get update
-    apt-get install -y nvpl
-fi
-
-# --- 5. clBLAS (BUILD_TYPE=clblas) ---
-# Present in variant Dockerfiles' from-source path but not in master's
-# Dockerfile.base-grpc-builder. No CI matrix entry currently uses this,
-# but keep parity so a future BUILD_TYPE=clblas build doesn't drift.
-if [ "${BUILD_TYPE:-}" = "clblas" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; then
-    apt-get update
-    apt-get install -y --no-install-recommends \
-        libclblast-dev
-    apt-get clean
-    rm -rf /var/lib/apt/lists/*
-fi
-
-# --- 6. ROCm / HIP build deps (BUILD_TYPE=hipblas) ---
-if [ "${BUILD_TYPE:-}" = "hipblas" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; then
-    apt-get update
-    apt-get install -y --no-install-recommends \
-        hipblas-dev \
-        hipblaslt-dev \
-        rocblas-dev
-    apt-get clean
-    rm -rf /var/lib/apt/lists/*
-    # I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install,
-    # which results in local-ai and others not being able to locate the libraries.
-    # We run ldconfig ourselves to work around this packaging deficiency.
-    ldconfig
-    # Log which GPU architectures have rocBLAS kernel support
-    echo "rocBLAS library data architectures:"
-    (ls /opt/rocm*/lib/rocblas/library/Kernels* 2>/dev/null || ls /opt/rocm*/lib64/rocblas/library/Kernels* 2>/dev/null) | grep -oP 'gfx[0-9a-z+-]+' | sort -u || \
-        echo "WARNING: No rocBLAS kernel data found"
-fi
-
-echo "TARGETARCH: ${TARGETARCH:-}"
-
-# --- 7. protoc (always) ---
-# The version in 22.04 is too old. We will create one as part of installing
-# the GRPC build below but that will also bring in a newer version of absl
-# which stablediffusion cannot compile with. This version of protoc is only
-# here so that we can generate the grpc code for the stablediffusion build.
-if [ "amd64" = "${TARGETARCH:-}" ]; then
-    curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip
-    unzip -j -d /usr/local/bin protoc.zip bin/protoc
-    rm protoc.zip
-fi
-if [ "arm64" = "${TARGETARCH:-}" ]; then
-    curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip
-    unzip -j -d /usr/local/bin protoc.zip bin/protoc
-    rm protoc.zip
-fi
-
-# --- 8. CMake (apt or compiled from source) ---
-# The version in 22.04 is too old. Vulkan path above already pulled cmake
-# via apt; the from-source branch here will install over it which is fine.
-if [ "${CMAKE_FROM_SOURCE:-false}" = "true" ]; then
-    curl -L -s "https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz" -o cmake.tar.gz
-    tar xvf cmake.tar.gz
-    ( cd "cmake-${CMAKE_VERSION}" && ./configure && make && make install )
-else
-    apt-get update
-    apt-get install -y \
-        cmake
-    apt-get clean
-    rm -rf /var/lib/apt/lists/*
-fi
-
-# --- 9. gRPC compile + install at /opt/grpc ---
-# We install GRPC to a different prefix here so that we can copy in only
-# the build artifacts later — saves several hundred MB on the final docker
-# image size vs copying in the entire GRPC source tree and running
-# `make install` in the target container.
-#
-# The TESTONLY abseil sed patch and /opt/grpc prefix are load-bearing —
-# downstream Dockerfiles `COPY` /opt/grpc to /usr/local (or rely on the
-# prebuilt base having it at /opt/grpc).
-mkdir -p /build
-cd /build
-git clone --recurse-submodules --jobs 4 -b "${GRPC_VERSION}" --depth 1 --shallow-submodules https://github.com/grpc/grpc
-mkdir -p /build/grpc/cmake/build
-cd /build/grpc/cmake/build
-sed -i "216i\\  TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt"
-cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../..
-make
-make install
-cd /
-rm -rf /build
--- a/.docker/llama-cpp-compile.sh
+++ b/.docker/llama-cpp-compile.sh
@@ -1,35 +0,0 @@
-#!/usr/bin/env bash
-# Shared compile logic for backend/Dockerfile.llama-cpp.
-# Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.
-
-set -euxo pipefail
-
-export CCACHE_DIR=/root/.ccache
-ccache --max-size=5G || true
-ccache -z || true
-
-export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"
-
-if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
-  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
-  export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
-  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
-  rm -rf /LocalAI/backend/cpp/llama-cpp-*-build
-fi
-
-if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
-  cd /LocalAI/backend/cpp/llama-cpp
-  make llama-cpp-fallback
-  make llama-cpp-grpc
-  make llama-cpp-rpc-server
-else
-  cd /LocalAI/backend/cpp/llama-cpp
-  make llama-cpp-avx
-  make llama-cpp-avx2
-  make llama-cpp-avx512
-  make llama-cpp-fallback
-  make llama-cpp-grpc
-  make llama-cpp-rpc-server
-fi
-
-ccache -s || true
--- a/.docker/turboquant-compile.sh
+++ b/.docker/turboquant-compile.sh
@@ -1,35 +0,0 @@
-#!/usr/bin/env bash
-# Shared compile logic for backend/Dockerfile.turboquant.
-# Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.
-
-set -euxo pipefail
-
-export CCACHE_DIR=/root/.ccache
-ccache --max-size=5G || true
-ccache -z || true
-
-export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"
-
-if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
-  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
-  export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
-  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
-  rm -rf /LocalAI/backend/cpp/turboquant-*-build
-fi
-
-cd /LocalAI/backend/cpp/turboquant
-
-if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
-  make turboquant-fallback
-  make turboquant-grpc
-  make turboquant-rpc-server
-else
-  make turboquant-avx
-  make turboquant-avx2
-  make turboquant-avx512
-  make turboquant-fallback
-  make turboquant-grpc
-  make turboquant-rpc-server
-fi
-
-ccache -s || true
--- a/.dockerignore
+++ b/.dockerignore
@@ -4,7 +4,6 @@
 .devcontainer
 models
 backends
-volumes
 examples/chatbot-ui/models
 backend/go/image/stablediffusion-ggml/build/
 backend/go/*/build
@@ -22,36 +21,3 @@ __pycache__
 # backend virtual environments
 **/venv
 backend/python/**/source
-
-# In-place llama.cpp clone + per-variant build copies. The Makefile
-# clones llama.cpp itself at the pinned LLAMA_VERSION; if a stale
-# local checkout is COPY'd into the image, the `llama.cpp:` target
-# sees the directory and skips re-cloning, so grpc-server.cpp ends
-# up compiled against whatever (likely older) commit the host had.
-backend/cpp/llama-cpp/llama.cpp
-backend/cpp/llama-cpp-*-build
-
-# privacy-filter: same in-place pattern. The Makefile fetches privacy-filter.cpp
-# at the pinned commit (or symlinks a PRIVACY_FILTER_SRC checkout for local dev).
-# A stale dir/symlink COPY'd into the image makes the clone step fail (dangling
-# symlink) or compile against the wrong commit, so keep host build state out.
-backend/cpp/privacy-filter/privacy-filter.cpp
-backend/cpp/privacy-filter/build
-backend/cpp/privacy-filter/grpc-server
-backend/cpp/privacy-filter/package
-
-# Rust backend build output (sources are tracked; target/ is generated)
-backend/rust/*/target
-
-# Local-only artifacts that bloat the build context but the image never needs.
-# Saved image tarballs, locally-installed backends, the host-built binary, and
-# assorted tool/scratch dirs. None of these are git-tracked.
-backend-images
-local-backends
-local-ai
-.crush
-protoc
-tests
-
-# Installed via npm inside the build stage; no need to ship the host copy.
-**/node_modules
--- a/.githooks/pre-commit
+++ b/.githooks/pre-commit
@@ -1,60 +0,0 @@
-#!/usr/bin/env sh
-#
-# LocalAI pre-commit hook. Install it (once per clone) with:
-#
-#     make install-hooks
-#
-# Runs only the checks relevant to what's staged:
-#   - Go files          -> make lint + make test-coverage-check
-#   - core/http/react-ui -> make test-ui-coverage-check (Playwright e2e + gate)
-# A commit touching neither is skipped entirely (docs/YAML/etc. can't change
-# lint findings, Go coverage, or the UI).
-#
-# To bypass for a single commit (e.g. a WIP checkpoint): git commit --no-verify
-set -eu
-
-repo_root="$(git rev-parse --show-toplevel)"
-cd "$repo_root"
-
-staged="$(git diff --cached --name-only --diff-filter=ACMRD)"
-
-go_changed=0
-ui_changed=0
-if echo "$staged" | grep -qE '\.go$'; then go_changed=1; fi
-if echo "$staged" | grep -qE '^core/http/react-ui/'; then ui_changed=1; fi
-
-if [ "$go_changed" -eq 0 ] && [ "$ui_changed" -eq 0 ]; then
-	echo "pre-commit: no Go or React UI changes staged — skipping."
-	exit 0
-fi
-
-if [ "$go_changed" -eq 1 ]; then
-	# Resolve the ref golangci-lint's new-from-merge-base should compare
-	# against. .golangci.yml pins origin/master, which is correct in CI
-	# (origin == the canonical repo) but wrong from a fork clone, where
-	# origin/master lags behind and lint would report the whole upstream
-	# backlog. Prefer upstream/master, then origin/master, then master.
-	lint_base=""
-	for ref in upstream/master origin/master master; do
-		if git rev-parse --verify --quiet "${ref}^{commit}" >/dev/null 2>&1; then
-			lint_base="$ref"
-			break
-		fi
-	done
-
-	echo "pre-commit ▶ golangci-lint (make lint${lint_base:+, new-from $lint_base})"
-	make lint LINT_NEW_FROM="$lint_base"
-
-	echo "pre-commit ▶ coverage gate (make test-coverage-check) — builds and runs the"
-	echo "             pkg/core suites plus tests/e2e; can take a few minutes."
-	make test-coverage-check
-fi
-
-if [ "$ui_changed" -eq 1 ]; then
-	echo "pre-commit ▶ React UI e2e + coverage gate (make test-ui-coverage-check) —"
-	echo "             rebuilds the UI + ui-test-server, runs the Playwright specs, and"
-	echo "             fails if line coverage regressed; can take a couple of minutes."
-	make test-ui-coverage-check
-fi
-
-echo "pre-commit ✓ all relevant checks passed"
--- a/.github/actions/configure-apt-mirror/action.yml
+++ b/.github/actions/configure-apt-mirror/action.yml
@@ -1,100 +0,0 @@
-name: 'Configure apt mirror'
-description: |
-  Reconfigure the GitHub Actions runner's Ubuntu apt sources to use an
-  alternate mirror, and emit the effective URLs as outputs so callers can
-  forward them as Docker build-args.
-
-  Two mirror profiles depending on where the runner lives, because the
-  best mirror differs by network:
-
-    * github-hosted runners run on Azure, so they default to the
-      Azure-hosted Ubuntu mirror (lowest latency, same VPC).
-    * self-hosted runners (arc-runner-set, bigger-runner, ...) typically
-      cannot route to azure.archive.ubuntu.com, so they default to the
-      kernel.org mirror, which is publicly reachable from anywhere.
-
-  Pass an empty string to either input to skip the rewrite for that
-  profile and keep upstream archive.ubuntu.com / ports.ubuntu.com.
-
-inputs:
-  github-hosted-mirror:
-    description: 'archive/security mirror URL for github-hosted runners (empty = upstream)'
-    required: false
-    default: 'http://azure.archive.ubuntu.com'
-  github-hosted-ports-mirror:
-    description: 'ports.ubuntu.com mirror URL for github-hosted runners (empty = upstream)'
-    required: false
-    default: 'http://azure.ports.ubuntu.com'
-  self-hosted-mirror:
-    description: 'archive/security mirror URL for self-hosted runners (empty = upstream)'
-    required: false
-    # HTTP, not HTTPS: the bare ubuntu:24.04 builder image doesn't ship
-    # ca-certificates, so the very first apt-get update over TLS would
-    # fail with "No system certificates available" before it can install
-    # anything. apt validates package integrity via GPG signatures, so
-    # plain HTTP is safe for the archive itself.
-    default: 'http://mirrors.edge.kernel.org'
-  self-hosted-ports-mirror:
-    description: 'ports.ubuntu.com mirror URL for self-hosted runners (empty = upstream)'
-    required: false
-    # mirrors.edge.kernel.org does NOT carry /ubuntu-ports/ — only the
-    # main /ubuntu/ archive — so arm64 builds 404 there. Leave ports
-    # upstream by default. The original DDoS was on archive.ubuntu.com
-    # so ports.ubuntu.com remains the path of least surprise.
-    default: ''
-
-outputs:
-  effective-mirror:
-    description: 'The mirror URL actually applied for this runner (or empty)'
-    value: ${{ steps.pick.outputs.mirror }}
-  effective-ports-mirror:
-    description: 'The ports mirror URL actually applied for this runner (or empty)'
-    value: ${{ steps.pick.outputs.ports-mirror }}
-
-runs:
-  using: 'composite'
-  steps:
-    - name: Pick effective mirror for this runner
-      id: pick
-      shell: bash
-      env:
-        RUNNER_ENV: ${{ runner.environment }}
-        GH_MIRROR: ${{ inputs.github-hosted-mirror }}
-        GH_PORTS_MIRROR: ${{ inputs.github-hosted-ports-mirror }}
-        SH_MIRROR: ${{ inputs.self-hosted-mirror }}
-        SH_PORTS_MIRROR: ${{ inputs.self-hosted-ports-mirror }}
-      run: |
-        if [ "${RUNNER_ENV}" = "github-hosted" ]; then
-          MIRROR="${GH_MIRROR}"
-          PORTS_MIRROR="${GH_PORTS_MIRROR}"
-        else
-          MIRROR="${SH_MIRROR}"
-          PORTS_MIRROR="${SH_PORTS_MIRROR}"
-        fi
-        echo "configure-apt-mirror: runner=${RUNNER_ENV} mirror='${MIRROR}' ports-mirror='${PORTS_MIRROR}'"
-        echo "mirror=${MIRROR}" >> "$GITHUB_OUTPUT"
-        echo "ports-mirror=${PORTS_MIRROR}" >> "$GITHUB_OUTPUT"
-
-    - name: Rewrite apt sources
-      if: steps.pick.outputs.mirror != '' || steps.pick.outputs.ports-mirror != ''
-      shell: bash
-      env:
-        APT_MIRROR: ${{ steps.pick.outputs.mirror }}
-        APT_PORTS_MIRROR: ${{ steps.pick.outputs.ports-mirror }}
-      run: |
-        set -e
-        # Ubuntu 24.04 (noble) ships DEB822 sources at
-        # /etc/apt/sources.list.d/ubuntu.sources; older releases use
-        # /etc/apt/sources.list. Rewrite whichever exists.
-        for f in /etc/apt/sources.list.d/ubuntu.sources /etc/apt/sources.list; do
-          sudo test -f "$f" || continue
-          if [ -n "${APT_MIRROR}" ]; then
-            # Comma delimiter so the alternation pipe in the regex is not
-            # interpreted as the s/// separator.
-            sudo sed -i -E "s,https?://(archive\.ubuntu\.com|security\.ubuntu\.com),${APT_MIRROR},g" "$f"
-          fi
-          if [ -n "${APT_PORTS_MIRROR}" ]; then
-            sudo sed -i -E "s,https?://ports\.ubuntu\.com,${APT_PORTS_MIRROR},g" "$f"
-          fi
-        done
-        echo "Runner apt mirror configured (APT_MIRROR='${APT_MIRROR}', APT_PORTS_MIRROR='${APT_PORTS_MIRROR}')"
--- a/.github/actions/free-disk-space/action.yml
+++ b/.github/actions/free-disk-space/action.yml
@@ -1,65 +0,0 @@
-name: 'Free disk space on hosted runners'
-description: |
-  Aggressively clean GitHub-hosted ubuntu-latest runners to reclaim ~6-10 GB
-  of working space before docker buildx steps. Combines jlumbroso/free-disk-space
-  with explicit apt purges of large packages we never use (dotnet, ghc, mono,
-  android, jdk, ...).
-
-  No-op on self-hosted runners; pass mode=skip to force-disable.
-
-inputs:
-  mode:
-    description: 'hosted (default — clean) or skip (no-op)'
-    required: false
-    default: 'hosted'
-
-runs:
-  using: 'composite'
-  steps:
-    - name: Free Disk Space (Ubuntu)
-      if: inputs.mode == 'hosted' && runner.environment == 'github-hosted'
-      uses: jlumbroso/free-disk-space@main
-      with:
-        tool-cache: true
-        android: true
-        dotnet: true
-        haskell: true
-        large-packages: true
-        docker-images: true
-        swap-storage: true
-
-    - name: Release space from worker
-      if: inputs.mode == 'hosted' && runner.environment == 'github-hosted'
-      shell: bash
-      run: |
-        echo "Listing top largest packages"
-        pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
-        head -n 30 <<< "${pkgs}"
-        df -h
-        sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
-        sudo apt-get remove --auto-remove android-sdk-platform-tools snapd || true
-        sudo apt-get purge --auto-remove android-sdk-platform-tools snapd || true
-        sudo rm -rf /usr/local/lib/android
-        sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
-        sudo rm -rf /usr/share/dotnet
-        sudo apt-get remove -y '^mono-.*' || true
-        sudo apt-get remove -y '^ghc-.*' || true
-        sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
-        sudo apt-get remove -y 'php.*' || true
-        sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
-        sudo apt-get remove -y '^google-.*' || true
-        sudo apt-get remove -y azure-cli || true
-        sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
-        sudo apt-get remove -y '^gfortran-.*' || true
-        sudo apt-get remove -y microsoft-edge-stable || true
-        sudo apt-get remove -y firefox || true
-        sudo apt-get remove -y powershell || true
-        sudo apt-get remove -y r-base-core || true
-        sudo apt-get autoremove -y
-        sudo apt-get clean
-        sudo rm -rfv build || true
-        sudo rm -rf /usr/share/dotnet || true
-        sudo rm -rf /opt/ghc || true
-        sudo rm -rf "/usr/local/share/boost" || true
-        sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
-        df -h
--- a/.github/actions/setup-build-disk/action.yml
+++ b/.github/actions/setup-build-disk/action.yml
@@ -1,59 +0,0 @@
-name: 'Set up build disk on hosted runners'
-description: |
-  Relocate Docker's data-root to /mnt (which has ~75 GB free, vs ~20 GB
-  on / after free-disk-space). Combined with the apt cleanup, gives
-  ~100 GB working space for buildx — enough for ROCm dev image + vLLM
-  torch install + flash-attn build.
-
-  No-op on:
-    - self-hosted runners (no /mnt expectation)
-    - non-X64 runners (verify /mnt shape on ubuntu-24.04-arm separately
-      before enabling there — see Task 3.2 in the migration plan)
-    - mode=skip (force-disable from caller)
-
-  Must run after free-disk-space (which removes large packages — would
-  fail mid-uninstall if Docker were stopped) and before any Docker
-  operation (setup-qemu, setup-buildx, login, build) so the relocated
-  data-root catches all subsequent docker activity.
-
-inputs:
-  mode:
-    description: 'auto (default — relocate on hosted X64 only) or skip'
-    required: false
-    default: 'auto'
-
-runs:
-  using: 'composite'
-  steps:
-    - name: Relocate Docker data-root to /mnt
-      if: inputs.mode == 'auto' && runner.environment == 'github-hosted' && runner.arch == 'X64'
-      shell: bash
-      run: |
-        set -euo pipefail
-        echo "Before relocation:"
-        df -h / /mnt || true
-        sudo systemctl stop docker docker.socket
-        sudo mkdir -p /mnt/docker-data /mnt/docker-tmp
-        # buildx CLI runs as the unprivileged runner user and creates
-        # config dirs under TMPDIR before binding them into the buildkit
-        # container. /mnt is owned by root by default; mirror /tmp's
-        # 1777 (world-writable + sticky) so non-root processes can write.
-        sudo chmod 1777 /mnt/docker-tmp
-        if [ -d /var/lib/docker ] && [ ! -L /var/lib/docker ]; then
-          sudo rsync -a /var/lib/docker/ /mnt/docker-data/
-          sudo rm -rf /var/lib/docker
-          sudo ln -s /mnt/docker-data /var/lib/docker
-        fi
-        # daemon.json may not exist; merge data-root in or create minimal.
-        if [ -f /etc/docker/daemon.json ]; then
-          sudo jq '."data-root" = "/mnt/docker-data"' /etc/docker/daemon.json | sudo tee /etc/docker/daemon.json.new >/dev/null
-          sudo mv /etc/docker/daemon.json.new /etc/docker/daemon.json
-        else
-          echo '{"data-root":"/mnt/docker-data"}' | sudo tee /etc/docker/daemon.json
-        fi
-        sudo systemctl start docker
-        # Make TMPDIR persist for subsequent steps in the same job.
-        echo "TMPDIR=/mnt/docker-tmp" >> "$GITHUB_ENV"
-        echo "After relocation:"
-        df -h / /mnt
-        docker info | grep -i 'docker root dir' || true
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
--- a/.github/bump_vllm_wheel.sh
+++ b/.github/bump_vllm_wheel.sh
@@ -1,45 +0,0 @@
-#!/bin/bash
-# Bump the cublas13 vLLM wheel pin in requirements-cublas13-after.txt.
-#
-# vLLM's PyPI wheel is built against CUDA 12 so the cublas13 build pulls a
-# cu130-flavoured wheel from vLLM's per-tag index at
-# https://wheels.vllm.ai/<TAG>/cu130/. That URL segment is itself version-locked
-# (no /latest/ alias upstream), so bumping vLLM means rewriting both the URL
-# segment and the version constraint atomically. bump_deps.sh handles git-sha
-# vars in Makefiles; this script handles the two-value rewrite specific to the
-# vLLM requirements file.
-set -xe
-REPO=$1   # vllm-project/vllm
-FILE=$2   # backend/python/vllm/requirements-cublas13-after.txt
-VAR=$3    # VLLM_VERSION (used for output file names so the workflow can read them)
-
-if [ -z "$FILE" ] || [ -z "$REPO" ] || [ -z "$VAR" ]; then
-    echo "usage: $0 <repo> <requirements-file> <var-name>" >&2
-    exit 1
-fi
-
-# /releases/latest returns the most recent non-prerelease tag.
-LATEST_TAG=$(curl -sS -H "Accept: application/vnd.github+json" \
-    "https://api.github.com/repos/$REPO/releases/latest" \
-    | python3 -c "import json,sys; print(json.load(sys.stdin)['tag_name'])")
-
-# Strip leading 'v' (vLLM tags are 'v0.20.0', the URL/version use '0.20.0').
-NEW_VERSION="${LATEST_TAG#v}"
-
-set +e
-CURRENT_VERSION=$(grep -oE '^vllm==[0-9]+\.[0-9]+\.[0-9]+' "$FILE" | head -1 | cut -d= -f3)
-set -e
-
-# sed both lines unconditionally — peter-evans/create-pull-request opens no PR
-# when the working tree is clean, so a no-op rewrite is safe.
-sed -i "$FILE" \
-    -e "s|wheels\.vllm\.ai/[^/]*/cu130|wheels.vllm.ai/$NEW_VERSION/cu130|g" \
-    -e "s|^vllm==.*|vllm==$NEW_VERSION|"
-
-if [ -z "$CURRENT_VERSION" ]; then
-    echo "Could not find vllm==X.Y.Z in $FILE."
-    exit 0
-fi
-
-echo "Changes: https://github.com/$REPO/compare/v${CURRENT_VERSION}...${LATEST_TAG}" >> "${VAR}_message.txt"
-echo "${NEW_VERSION}" >> "${VAR}_commit.txt"
--- a/.github/gallery-agent/main.go
+++ b/.github/gallery-agent/main.go
@@ -3,7 +3,6 @@ package main
 import (
 	"context"
 	"encoding/json"
-	"errors"
 	"fmt"
 	"os"
 	"strconv"
@@ -114,17 +113,6 @@ func main() {
 	fmt.Println("Searching for trending models on HuggingFace...")
 	rawModels, err := client.GetTrending(searchTerm, limit)
 	if err != nil {
-		if errors.Is(err, hfapi.ErrRateLimited) {
-			fmt.Printf("HuggingFace API is rate limited after retries, skipping this run: %v\n", err)
-			writeSummary(AddedModelSummary{
-				SearchTerm:     searchTerm,
-				TotalFound:     0,
-				ModelsAdded:    0,
-				Quantization:   quantization,
-				ProcessingTime: time.Since(startTime).String(),
-			})
-			return
-		}
 		fmt.Fprintf(os.Stderr, "Error fetching models: %v\n", err)
 		os.Exit(1)
 	}
@@ -289,3 +277,4 @@ func truncateString(s string, maxLen int) string {
 	}
 	return s[:maxLen] + "..."
 }
+
--- a/.github/scripts/anchor-digest-in-cache.sh
+++ b/.github/scripts/anchor-digest-in-cache.sh
@@ -1,46 +0,0 @@
-#!/usr/bin/env bash
-# Anchor a backend per-arch digest in quay.io/go-skynet/ci-cache so quay's
-# garbage collector won't reap the manifest before backend_merge.yml runs.
-#
-# Context: backend_build.yml pushes by canonical digest only
-# (push-by-digest=true). Unreferenced manifests on quay can be reaped within
-# ~1-2h, but backend-merge-jobs runs only after the *entire* per-arch build
-# matrix drains (max-parallel: 8 × dozens of entries → ~2h+). Without an
-# anchoring tag, the earliest digests are gone by the time `imagetools create`
-# tries to read them, producing "manifest not found" merge failures.
-#
-# We tag the digest under our internal ci-cache image; quay does not GC tagged
-# manifests. The user-facing manifest list still references the original
-# digest in local-ai-backends. backend_merge.yml deletes the anchor tag after
-# the user-facing manifest is published — see cleanup-keepalive-tags.sh.
-#
-# Required env:
-#   GITHUB_RUN_ID  - current workflow run id (set automatically by GHA)
-#   TAG_SUFFIX     - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
-#   PLATFORM_TAG   - amd64 / arm64 / single (single = singleton matrix entry)
-#   DIGEST         - canonical content digest from build step (sha256:...)
-#
-# Optional env:
-#   ANCHOR_IMAGE   - target image (default: quay.io/go-skynet/ci-cache)
-#   SOURCE_IMAGE   - source image (default: quay.io/go-skynet/local-ai-backends)
-#   GITHUB_STEP_SUMMARY - if set, an anchored-by line is appended to it
-set -euo pipefail
-
-: "${GITHUB_RUN_ID:?}"
-: "${TAG_SUFFIX:?}"
-: "${PLATFORM_TAG:?}"
-: "${DIGEST:?}"
-
-anchor_image="${ANCHOR_IMAGE:-quay.io/go-skynet/ci-cache}"
-source_image="${SOURCE_IMAGE:-quay.io/go-skynet/local-ai-backends}"
-
-tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${PLATFORM_TAG}"
-
-docker buildx imagetools create \
-  -t "${anchor_image}:${tag}" \
-  "${source_image}@${DIGEST}"
-
-echo "anchored ${DIGEST} as ${anchor_image}:${tag}"
-if [[ -n "${GITHUB_STEP_SUMMARY:-}" ]]; then
-  echo "anchored \`${DIGEST}\` as \`${anchor_image}:${tag}\`" >> "${GITHUB_STEP_SUMMARY}"
-fi
--- a/.github/scripts/cleanup-keepalive-tags.sh
+++ b/.github/scripts/cleanup-keepalive-tags.sh
@@ -1,49 +0,0 @@
-#!/usr/bin/env bash
-# Best-effort cleanup of the keepalive anchor tags written by
-# anchor-digest-in-cache.sh. Called from backend_merge.yml after the
-# user-facing manifest list has been published.
-#
-# Quay's docker registry v2 doesn't allow tag deletes — only digest deletes.
-# The proper delete is the quay REST API, which requires an OAuth-scoped
-# token. We try QUAY_TOKEN as a bearer token: if the secret is an OAuth app
-# token (typical for service accounts) the delete succeeds; otherwise this
-# is a soft no-op and the tag persists until manually pruned.
-#
-# Cleanup failure MUST NOT fail the merge — the merge has already produced
-# the user-facing manifest list at this point and the keepalive tags are
-# pure overhead. We always exit 0.
-#
-# Required env:
-#   GITHUB_RUN_ID  - current workflow run id (set automatically by GHA)
-#   TAG_SUFFIX     - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
-#   QUAY_TOKEN     - bearer token for quay's REST API
-#
-# Optional env:
-#   QUAY_REPO      - target repo (default: go-skynet/ci-cache)
-#   PLATFORM_TAGS  - space-separated list of platform-tag values to try
-#                    (default: "amd64 arm64 single")
-#                    We don't know which platform-tag(s) exist for this
-#                    tag-suffix without an extra API call, so we just try
-#                    all three and ignore 404s for the ones that don't.
-set -uo pipefail
-
-: "${GITHUB_RUN_ID:?}"
-: "${TAG_SUFFIX:?}"
-: "${QUAY_TOKEN:?}"
-
-quay_repo="${QUAY_REPO:-go-skynet/ci-cache}"
-platform_tags="${PLATFORM_TAGS:-amd64 arm64 single}"
-
-for plat in $platform_tags; do
-  tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${plat}"
-  url="https://quay.io/api/v1/repository/${quay_repo}/tag/${tag}"
-  http=$(curl -sS -o /dev/null -w '%{http_code}' \
-    -X DELETE -H "Authorization: Bearer ${QUAY_TOKEN}" "$url" || echo "000")
-  case "$http" in
-    204|200) echo "deleted $tag" ;;
-    404)     echo "not present: $tag" ;;
-    401|403) echo "auth not OAuth-scoped (http $http) for $tag - skipping; orphan tag will persist" ;;
-    *)       echo "unexpected http $http deleting $tag - skipping" ;;
-  esac
-done
-exit 0
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
--- a/.github/workflows/backend_build.yml
+++ b/.github/workflows/backend_build.yml
@@ -24,17 +24,6 @@ on:
        description: 'Platforms'
        default: ''
        type: string
-      platform-tag:
-        description: |
-          Short tag identifying the platform leg, e.g. "amd64" or "arm64".
-          Used to scope the per-arch registry cache and the digest artifact name.
-          Required for split-and-merge multi-arch builds; pass "amd64" for
-          single-arch amd64 builds too. Optional (default '') during the
-          migration to per-arch matrix expansion; will be flipped to
-          required: true in Phase 6 once all callers pass an explicit value.
-        required: false
-        default: ''
-        type: string
      tag-latest:
        description: 'Tag latest'
        default: ''
@@ -69,20 +58,6 @@ on:
        required: false
        default: '2204'
        type: string
-      amdgpu-targets:
-        description: 'AMD GPU targets for ROCm/HIP builds'
-        required: false
-        default: ''
-        type: string
-      builder-base-image:
-        description: |
-          Pre-built builder base image (e.g. quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64).
-          When set, the variant Dockerfile uses its `builder-prebuilt` stage which FROMs this
-          image directly instead of running its own gRPC stage + apt installs. Empty for
-          backends whose Dockerfile doesn't support a prebuilt base.
-        required: false
-        default: ''
-        type: string
    secrets:
      dockerUsername:
        required: false
@@ -100,22 +75,76 @@ jobs:
        quay_username: ${{ secrets.quayUsername }}
    steps:

+
+      - name: Free Disk Space (Ubuntu)
+        if: inputs.runs-on == 'ubuntu-latest'
+        uses: jlumbroso/free-disk-space@main
+        with:
+          # setting this to "true" frees about 6 GB,
+          # but might remove tools that are actually needed
+          tool-cache: true
+          # all of these default to true, but feel free to set to
+          # "false" if necessary for your workflow
+          android: true
+          dotnet: true
+          haskell: true
+          large-packages: true
+          docker-images: true
+          swap-storage: true
+
+      - name: Force Install GIT latest
+        run: |
+          sudo apt-get update \
+          && sudo apt-get install -y software-properties-common \
+          && sudo apt-get update \
+          && sudo add-apt-repository -y ppa:git-core/ppa \
+          && sudo apt-get update \
+          && sudo apt-get install -y git
+
      - name: Checkout
        uses: actions/checkout@v6
-        with:
-          submodules: true

-      - name: Configure apt mirror on runner
-        id: apt_mirror
-        uses: ./.github/actions/configure-apt-mirror
-
-      - name: Free disk space
-        uses: ./.github/actions/free-disk-space
-        with:
-          mode: ${{ inputs.runs-on == 'ubuntu-latest' && 'hosted' || 'skip' }}
-
-      - name: Set up build disk
-        uses: ./.github/actions/setup-build-disk
+      - name: Release space from worker
+        if: inputs.runs-on == 'ubuntu-latest'
+        run: |
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          df -h
+          echo
+          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
+          sudo apt-get remove --auto-remove android-sdk-platform-tools snapd || true
+          sudo apt-get purge --auto-remove android-sdk-platform-tools snapd || true
+          sudo rm -rf /usr/local/lib/android
+          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
+          sudo rm -rf /usr/share/dotnet
+          sudo apt-get remove -y '^mono-.*' || true
+          sudo apt-get remove -y '^ghc-.*' || true
+          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
+          sudo apt-get remove -y 'php.*' || true
+          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
+          sudo apt-get remove -y '^google-.*' || true
+          sudo apt-get remove -y azure-cli || true
+          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
+          sudo apt-get remove -y '^gfortran-.*' || true
+          sudo apt-get remove -y microsoft-edge-stable || true
+          sudo apt-get remove -y firefox || true
+          sudo apt-get remove -y powershell || true
+          sudo apt-get remove -y r-base-core || true
+          sudo apt-get autoremove -y
+          sudo apt-get clean
+          echo
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          sudo rm -rfv build || true
+          sudo rm -rf /usr/share/dotnet || true
+          sudo rm -rf /opt/ghc || true
+          sudo rm -rf "/usr/local/share/boost" || true
+          sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
+          df -h

      - name: Docker meta
        id: meta
@@ -172,17 +201,7 @@ jobs:
          username: ${{ secrets.quayUsername }}
          password: ${{ secrets.quayPassword }}

-      # Weekly cache-buster for the per-backend `make` step. Most Python
-      # backends list unpinned deps (torch, transformers, vllm, ...), so a
-      # warm cache freezes upstream versions indefinitely. Rolling this
-      # weekly forces a re-resolve of the install layer at most once per
-      # week, picking up newer wheels without a full cold rebuild.
-      - name: Compute deps refresh key
-        id: deps_refresh
-        run: echo "key=$(date -u +%Y-W%V)" >> "$GITHUB_OUTPUT"
-
-      - name: Build and push by digest
-        id: build
+      - name: Build and push
        uses: docker/build-push-action@v7
        if: github.event_name != 'pull_request'
        with:
@@ -195,66 +214,15 @@ jobs:
            BASE_IMAGE=${{ inputs.base-image }}
            BACKEND=${{ inputs.backend }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
-            AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
-            APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
-            APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
-            DEPS_REFRESH=${{ steps.deps_refresh.outputs.key }}
-            BUILDER_BASE_IMAGE=${{ inputs.builder-base-image }}
-            BUILDER_TARGET=${{ inputs.builder-base-image != '' && 'builder-prebuilt' || 'builder-fromsource' }}
          context: ${{ inputs.context }}
          file: ${{ inputs.dockerfile }}
-          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
-          cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}-${{ inputs.platform-tag }},mode=max,ignore-error=true
+          cache-from: type=gha
          platforms: ${{ inputs.platforms }}
-          outputs: |
-            type=image,name=quay.io/go-skynet/local-ai-backends,push-by-digest=true,name-canonical=true,push=true
-            type=image,name=localai/localai-backends,push-by-digest=true,name-canonical=true,push=true
-          # Disable provenance: with mode=max (the default for push:true)
-          # buildx bundles a per-registry attestation manifest into each
-          # registry's manifest list, which makes the resulting list digest
-          # diverge across registries. steps.build.outputs.digest then
-          # only matches one of them, and the merge job's
-          # `imagetools create <reg>@sha256:<digest>` lookup fails on the
-          # other. Disabling provenance keeps the digest content-only and
-          # identical across both registries — required for digest-based
-          # cross-registry merge.
-          provenance: false
+          push: ${{ github.event_name != 'pull_request' }}
+          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

-      - name: Export digest
-        if: github.event_name != 'pull_request'
-        run: |
-          mkdir -p /tmp/digests
-          digest="${{ steps.build.outputs.digest }}"
-          touch "/tmp/digests/${digest#sha256:}"
-
-      # See .github/scripts/anchor-digest-in-cache.sh for why this is needed
-      # and how it interacts with backend_merge.yml's cleanup step.
-      - name: Anchor digest in ci-cache so quay GC won't reap before merge
-        if: github.event_name != 'pull_request'
-        env:
-          TAG_SUFFIX: ${{ inputs.tag-suffix }}
-          PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
-          DIGEST: ${{ steps.build.outputs.digest }}
-        run: .github/scripts/anchor-digest-in-cache.sh
-
-      # Artifact name uses a `--` separator between tag-suffix and platform-tag
-      # to avoid prefix collisions during the merge job's pattern-based download.
-      # Tag-suffixes are not prefix-disjoint (e.g. -gpu-nvidia-cuda-12-vllm is a
-      # prefix of -gpu-nvidia-cuda-12-vllm-omni); a single `-` separator plus the
-      # merge-side `digests<tag-suffix>-*` glob would let one merge over-match
-      # the other backend's artifacts. The `-single` placeholder for empty
-      # platform-tag (single-arch entries) keeps the artifact name non-trailing.
-      - name: Upload digest artifact
-        if: github.event_name != 'pull_request'
-        uses: actions/upload-artifact@v7
-        with:
-          name: digests${{ inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
-          path: /tmp/digests/*
-          if-no-files-found: error
-          retention-days: 1
-
-      - name: Build (PR)
+      - name: Build and push (PR)
        uses: docker/build-push-action@v7
        if: github.event_name == 'pull_request'
        with:
@@ -267,15 +235,9 @@ jobs:
            BASE_IMAGE=${{ inputs.base-image }}
            BACKEND=${{ inputs.backend }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
-            AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
-            APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
-            APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
-            DEPS_REFRESH=${{ steps.deps_refresh.outputs.key }}
-            BUILDER_BASE_IMAGE=${{ inputs.builder-base-image }}
-            BUILDER_TARGET=${{ inputs.builder-base-image != '' && 'builder-prebuilt' || 'builder-fromsource' }}
          context: ${{ inputs.context }}
          file: ${{ inputs.dockerfile }}
-          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
+          cache-from: type=gha
          platforms: ${{ inputs.platforms }}
          push: ${{ env.quay_username != '' }}
          tags: ${{ steps.meta_pull_request.outputs.tags }}
--- a/.github/workflows/backend_build_darwin.yml
+++ b/.github/workflows/backend_build_darwin.yml
@@ -48,13 +48,6 @@ jobs:
    strategy:
      matrix:
        go-version: ['${{ inputs.go-version }}']
-    env:
-      # Keep the brew Cellar stable across cache restores. Without these,
-      # `brew install` would auto-update brew itself and re-link formulas,
-      # mutating the very paths the cache just restored.
-      HOMEBREW_NO_AUTO_UPDATE: '1'
-      HOMEBREW_NO_INSTALL_CLEANUP: '1'
-      HOMEBREW_NO_ANALYTICS: '1'
    steps:
      - name: Clone
        uses: actions/checkout@v6
@@ -65,195 +58,21 @@ jobs:
        uses: actions/setup-go@v5
        with:
          go-version: ${{ matrix.go-version }}
-          # Caches ~/go/pkg/mod and ~/Library/Caches/go-build keyed on go.sum.
-          # Shared across every darwin matrix entry — first job in a run warms
-          # it, the rest hit warm.
-          cache: true
+          cache: false

      # You can test your matrix by printing the current Go version
      - name: Display Go version
        run: go version

-      # ---- Homebrew cache ----
-      # macOS runners have no Docker daemon, so the BuildKit registry cache used
-      # for Linux backend images (see .agents/ci-caching.md) doesn't apply here.
-      # We cache the brew downloads + Cellar entries for the formulas we install
-      # below. Read on every run, write only on master/tag pushes — same policy
-      # as the Linux registry cache.
-      - name: Restore Homebrew cache
-        id: brew-cache
-        uses: actions/cache/restore@v4
-        with:
-          path: |
-            ~/Library/Caches/Homebrew/downloads
-            /opt/homebrew/Cellar/protobuf
-            /opt/homebrew/Cellar/grpc
-            /opt/homebrew/Cellar/protoc-gen-go
-            /opt/homebrew/Cellar/protoc-gen-go-grpc
-            /opt/homebrew/Cellar/libomp
-            /opt/homebrew/Cellar/llvm
-            /opt/homebrew/Cellar/ccache
-            /opt/homebrew/Cellar/blake3
-            /opt/homebrew/Cellar/fmt
-            /opt/homebrew/Cellar/hiredis
-            /opt/homebrew/Cellar/xxhash
-            /opt/homebrew/Cellar/zstd
-            /opt/homebrew/Cellar/nlohmann-json
-          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
-
      - name: Dependencies
        run: |
-          # ccache is always installed (used by the llama-cpp variant build) so
-          # the brew cache content stays stable across every backend in the
-          # matrix — they all share one cache key.
-          # blake3, fmt, hiredis, xxhash, zstd are ccache's runtime dylib deps.
-          # Without explicitly installing them, a brew cache-hit run restores
-          # ccache's Cellar dir but skips installing those transitive deps,
-          # and ccache fails at runtime with `dyld: Library not loaded`.
-          # nlohmann-json is header-only and required by the ds4 backend
-          # (dsml_renderer.cpp includes <nlohmann/json.hpp>); on Linux it comes
-          # from the apt-installed nlohmann-json3-dev in the build image.
-          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd nlohmann-json
-          # Force-reinstall ccache so brew re-validates its full runtime-dep
-          # closure on every run. This is the durable fix: when the upstream
-          # ccache formula gains a new transitive dep (as it has multiple times
-          # already), we don't have to chase missing dylibs one at a time.
-          # The downloads cache makes the reinstall fast (~5s on a hit).
-          brew reinstall ccache
-          # Same pattern for grpc: its CMake config (used by the llama-cpp
-          # `grpc-server` target) does find_package(absl). The cache restores
-          # /opt/homebrew/Cellar/grpc so brew above no-ops the install, but
-          # abseil isn't in our Cellar cache list and never gets installed
-          # alongside, leaving grpc's CMake unable to resolve it. Reinstalling
-          # grpc re-validates and pulls abseil in, mirroring the ccache fix.
-          brew reinstall grpc
-          # The brew cache restores the Cellar dirs but NOT the bin symlinks
-          # at /opt/homebrew/bin/*. brew install above sees the Cellar present
-          # and decides "already installed" without re-linking, so on a cache-
-          # hit run the formulas aren't on PATH. Force-link them; --overwrite
-          # tolerates pre-existing symlinks from earlier installs.
-          brew link --overwrite protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd nlohmann-json 2>/dev/null || true
-
-      - name: Save Homebrew cache
-        if: github.event_name != 'pull_request' && steps.brew-cache.outputs.cache-hit != 'true'
-        uses: actions/cache/save@v4
-        with:
-          path: |
-            ~/Library/Caches/Homebrew/downloads
-            /opt/homebrew/Cellar/protobuf
-            /opt/homebrew/Cellar/grpc
-            /opt/homebrew/Cellar/protoc-gen-go
-            /opt/homebrew/Cellar/protoc-gen-go-grpc
-            /opt/homebrew/Cellar/libomp
-            /opt/homebrew/Cellar/llvm
-            /opt/homebrew/Cellar/ccache
-            /opt/homebrew/Cellar/blake3
-            /opt/homebrew/Cellar/fmt
-            /opt/homebrew/Cellar/hiredis
-            /opt/homebrew/Cellar/xxhash
-            /opt/homebrew/Cellar/zstd
-            /opt/homebrew/Cellar/nlohmann-json
-          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
-
-      # ---- ccache for llama.cpp CMake builds ----
-      # Three CMake variants (fallback, grpc, rpc-server) compile the same
-      # llama.cpp source tree with overlapping flags — ccache dedupes object
-      # files across them. Key on the pinned LLAMA_VERSION so a pin bump
-      # invalidates cleanly; restore-keys fall back to the latest entry for the
-      # same pin so unchanged TUs stay warm even when the cache is fresh.
-      - name: Compute llama.cpp version
-        if: inputs.backend == 'llama-cpp'
-        id: llama-version
-        run: |
-          version=$(grep '^LLAMA_VERSION' backend/cpp/llama-cpp/Makefile | head -1 | cut -d= -f2 | cut -d'?' -f1 | tr -d ' ')
-          echo "version=${version}" >> "$GITHUB_OUTPUT"
-
-      - name: Restore ccache
-        if: inputs.backend == 'llama-cpp'
-        id: ccache-cache
-        uses: actions/cache/restore@v4
-        with:
-          path: ~/Library/Caches/ccache
-          key: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-${{ github.run_id }}
-          restore-keys: |
-            ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-
-
-      - name: Configure ccache
-        if: inputs.backend == 'llama-cpp'
-        run: |
-          mkdir -p "$HOME/Library/Caches/ccache"
-          ccache -M 2G
-          ccache -z
-          # llama-cpp-darwin.sh reads CMAKE_ARGS / CCACHE_DIR from env.
-          {
-            echo "CMAKE_ARGS=${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache"
-            echo "CCACHE_DIR=$HOME/Library/Caches/ccache"
-          } >> "$GITHUB_ENV"
-
-      # ---- Python wheel cache (uv + pip) ----
-      # Mirrors the Linux DEPS_REFRESH cadence (see .agents/ci-caching.md): the
-      # ISO-week segment of the cache key forces at most one cold rebuild per
-      # backend per week, automatically picking up newer wheels for unpinned
-      # deps (torch, mlx, diffusers, …). Restore-keys fall back to the most
-      # recent build of the same backend so off-week PRs still hit warm.
-      - name: Compute weekly cache bucket
-        if: inputs.lang == 'python'
-        id: weekly
-        run: echo "bucket=$(date -u +%Y-W%V)" >> "$GITHUB_OUTPUT"
-
-      - name: Restore Python wheel cache
-        if: inputs.lang == 'python'
-        id: pyenv-cache
-        uses: actions/cache/restore@v4
-        with:
-          path: |
-            ~/Library/Caches/pip
-            ~/Library/Caches/uv
-          key: pyenv-darwin-${{ inputs.backend }}-${{ steps.weekly.outputs.bucket }}-${{ hashFiles(format('backend/python/{0}/requirements*.txt', inputs.backend)) }}
-          restore-keys: |
-            pyenv-darwin-${{ inputs.backend }}-
-
-      # llama-cpp on Darwin uses a bespoke build script (scripts/build/llama-cpp-darwin.sh)
-      # that compiles three CMake variants from backend/cpp/llama-cpp and bundles dylibs
-      # via otool — it doesn't fit the build-darwin-go-backend / build-darwin-python-backend
-      # mold. Drive it via its dedicated `backends/llama-cpp-darwin` make target instead.
-      - name: Build ${{ inputs.backend }}-darwin (llama-cpp)
-        if: inputs.backend == 'llama-cpp'
-        run: |
-          make protogen-go
-          make backends/llama-cpp-darwin
-
-      - name: Build ds4 backend (Darwin Metal)
-        if: inputs.backend == 'ds4'
-        run: |
-          make backends/ds4-darwin
+          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm

      - name: Build ${{ inputs.backend }}-darwin
-        if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4'
        run: |
          make protogen-go
          BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend

-      - name: ccache stats
-        if: inputs.backend == 'llama-cpp'
-        run: ccache -s
-
-      - name: Save ccache
-        if: inputs.backend == 'llama-cpp' && github.event_name != 'pull_request'
-        uses: actions/cache/save@v4
-        with:
-          path: ~/Library/Caches/ccache
-          key: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-${{ github.run_id }}
-
-      - name: Save Python wheel cache
-        if: inputs.lang == 'python' && github.event_name != 'pull_request' && steps.pyenv-cache.outputs.cache-hit != 'true'
-        uses: actions/cache/save@v4
-        with:
-          path: |
-            ~/Library/Caches/pip
-            ~/Library/Caches/uv
-          key: pyenv-darwin-${{ inputs.backend }}-${{ steps.weekly.outputs.bucket }}-${{ hashFiles(format('backend/python/{0}/requirements*.txt', inputs.backend)) }}
-
      - name: Upload ${{ inputs.backend }}.tar
        uses: actions/upload-artifact@v7
        with:
--- a/.github/workflows/backend_merge.yml
+++ b/.github/workflows/backend_merge.yml
@@ -1,217 +0,0 @@
---
-name: 'merge backend manifest list (reusable)'
-
-# Reusable workflow that joins per-arch digest artifacts (uploaded by
-# backend_build.yml when called with platform-tag) into a single tagged
-# multi-arch manifest list. Called once per backend by backend.yml after
-# both per-arch build jobs succeed.
-
-on:
-  workflow_call:
-    inputs:
-      tag-latest:
-        description: 'Whether the manifest list should also be tagged latest (auto/false/true)'
-        required: false
-        type: string
-        default: ''
-      tag-suffix:
-        description: 'Backend tag suffix (e.g. -cpu-faster-whisper). Used to compute the artifact pattern and the final tag suffix.'
-        required: true
-        type: string
-    secrets:
-      dockerUsername:
-        required: false
-      dockerPassword:
-        required: false
-      quayUsername:
-        required: true
-      quayPassword:
-        required: true
-
-jobs:
-  merge:
-    runs-on: ubuntu-latest
-    # id-token: write is required for keyless cosign — the workflow
-    # exchanges the GitHub OIDC token for a short-lived Fulcio cert that
-    # signs each pushed manifest. Without this permission the runner
-    # cannot mint the token, and `cosign sign` fails with "no token".
-    permissions:
-      contents: read
-      id-token: write
-    env:
-      quay_username: ${{ secrets.quayUsername }}
-      # cosign v2.4.x still gates --registry-referrers-mode=oci-1-1 behind
-      # this flag. Without it, signing fails with:
-      #   invalid argument "oci-1-1" for "--registry-referrers-mode" flag:
-      #   in order to use mode "oci-1-1", you must set COSIGN_EXPERIMENTAL=1
-      COSIGN_EXPERIMENTAL: '1'
-    steps:
-      # Sparse checkout: the merge job needs `.github/scripts/` (for the
-      # keepalive cleanup script) but none of the source tree.
-      - name: Checkout (.github/scripts only)
-        uses: actions/checkout@v6
-        with:
-          sparse-checkout: |
-            .github/scripts
-          sparse-checkout-cone-mode: false
-
-      # `--` separator anchors the glob so we don't over-match sibling
-      # backends whose tag-suffix happens to be a prefix of ours
-      # (e.g. -cpu-vllm vs -cpu-vllm-omni). Must stay in sync with the
-      # upload-artifact name in backend_build.yml.
-      - name: Download digests
-        uses: actions/download-artifact@v8
-        with:
-          pattern: digests${{ inputs.tag-suffix }}--*
-          merge-multiple: true
-          path: /tmp/digests
-
-      - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@master
-
-      # cosign signs each pushed manifest list with --recursive so the
-      # index and every per-arch entry get an attached Sigstore bundle.
-      # Recent cosign releases always emit the new bundle format, so
-      # there's no extra CLI flag to opt into it.
-      - name: Install cosign
-        if: github.event_name != 'pull_request'
-        uses: sigstore/cosign-installer@v3
-        with:
-          cosign-release: 'v2.4.1'
-
-      - name: Login to DockerHub
-        if: github.event_name != 'pull_request'
-        uses: docker/login-action@v4
-        with:
-          username: ${{ secrets.dockerUsername }}
-          password: ${{ secrets.dockerPassword }}
-
-      - name: Login to Quay.io
-        if: ${{ env.quay_username != '' }}
-        uses: docker/login-action@v4
-        with:
-          registry: quay.io
-          username: ${{ secrets.quayUsername }}
-          password: ${{ secrets.quayPassword }}
-
-      - name: Docker meta
-        id: meta
-        if: github.event_name != 'pull_request'
-        uses: docker/metadata-action@v6
-        with:
-          images: |
-            quay.io/go-skynet/local-ai-backends
-            localai/localai-backends
-          tags: |
-            type=ref,event=branch
-            type=semver,pattern={{raw}}
-            type=sha
-          flavor: |
-            latest=${{ inputs.tag-latest }}
-            suffix=${{ inputs.tag-suffix }},onlatest=true
-
-      # Source from ci-cache, not local-ai-backends.
-      #
-      # The build job pushes per-arch manifests to local-ai-backends with
-      # push-by-digest=true (no tag), then anchors a tagged copy into
-      # ci-cache so the manifest can be retrieved hours later when this
-      # merge runs. Quay's manifest GC, however, is per-repository: the
-      # anchor tag in ci-cache protects the manifest there, but the same
-      # digest in local-ai-backends has no tag in *that* repo and gets
-      # reaped independently. Sourcing local-ai-backends@<digest> here
-      # then fails with "manifest not found" — exactly the regression
-      # we hit on v4.2.2 (19/37 multiarch merges failed).
-      #
-      # ci-cache@<digest> resolves because we anchored it there. buildx
-      # imagetools create copies the manifest into local-ai-backends
-      # (cross-repo within the same registry, blobs already cross-mounted
-      # from the original push so no transfer needed) and publishes the
-      # manifest list with the user-facing tags. The resulting manifest
-      # list is fully self-contained in local-ai-backends — child digests
-      # only, no embedded references to ci-cache.
-      - name: Create manifest list and push (quay)
-        if: github.event_name != 'pull_request'
-        working-directory: /tmp/digests
-        run: |
-          set -euo pipefail
-          tags=$(jq -cr '
-            .tags
-            | map(select(startswith("quay.io/")))
-            | map("-t " + .)
-            | join(" ")
-          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
-          if [ -z "$tags" ]; then
-            echo "No quay.io tags from docker/metadata-action; skipping quay merge"
-            exit 0
-          fi
-          # shellcheck disable=SC2086
-          docker buildx imagetools create $tags \
-            $(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
-          # Resolve the manifest-list digest (any tag points at it) so
-          # cosign can sign by digest. Signing by tag would leave the
-          # signature orphaned the next time the tag moves.
-          first_tag=$(jq -cr '
-            .tags | map(select(startswith("quay.io/"))) | .[0]
-          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
-          digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
-          # --recursive walks the list and signs every per-arch entry
-          # too — clients that resolve a tag to a platform-specific
-          # manifest before checking signatures need the per-arch
-          # signatures, not just the list-level one.
-          cosign sign --yes --recursive \
-            --registry-referrers-mode=oci-1-1 \
-            "quay.io/go-skynet/local-ai-backends@${digest}"
-
-      - name: Create manifest list and push (dockerhub)
-        if: github.event_name != 'pull_request'
-        working-directory: /tmp/digests
-        run: |
-          set -euo pipefail
-          tags=$(jq -cr '
-            .tags
-            | map(select(startswith("localai/")))
-            | map("-t " + .)
-            | join(" ")
-          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
-          if [ -z "$tags" ]; then
-            echo "No dockerhub tags from docker/metadata-action; skipping dockerhub merge"
-            exit 0
-          fi
-          # shellcheck disable=SC2086
-          docker buildx imagetools create $tags \
-            $(printf 'localai/localai-backends@sha256:%s ' *)
-          first_tag=$(jq -cr '
-            .tags | map(select(startswith("localai/"))) | .[0]
-          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
-          digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
-          cosign sign --yes --recursive \
-            --registry-referrers-mode=oci-1-1 \
-            "localai/localai-backends@${digest}"
-
-      - name: Inspect manifest
-        if: github.event_name != 'pull_request'
-        run: |
-          set -euo pipefail
-          first_tag=$(jq -cr '.tags[0]' <<< "$DOCKER_METADATA_OUTPUT_JSON")
-          if [ -n "$first_tag" ] && [ "$first_tag" != "null" ]; then
-            docker buildx imagetools inspect "$first_tag"
-          fi
-
-      # See .github/scripts/cleanup-keepalive-tags.sh for why this is
-      # best-effort and what the failure modes are.
-      - name: Cleanup keepalive tags in ci-cache
-        if: github.event_name != 'pull_request' && success()
-        env:
-          TAG_SUFFIX: ${{ inputs.tag-suffix }}
-          QUAY_TOKEN: ${{ secrets.quayPassword }}
-        run: .github/scripts/cleanup-keepalive-tags.sh
-
-      - name: Job summary
-        if: github.event_name != 'pull_request'
-        run: |
-          set -euo pipefail
-          echo "Merged manifest tags:" >> "$GITHUB_STEP_SUMMARY"
-          jq -r '.tags[]' <<< "$DOCKER_METADATA_OUTPUT_JSON" | sed 's/^/- /' >> "$GITHUB_STEP_SUMMARY"
-          echo >> "$GITHUB_STEP_SUMMARY"
-          echo "Per-arch digests:" >> "$GITHUB_STEP_SUMMARY"
-          ls -1 /tmp/digests | sed 's/^/- sha256:/' >> "$GITHUB_STEP_SUMMARY"
--- a/.github/workflows/backend_pr.yml
+++ b/.github/workflows/backend_pr.yml
@@ -4,23 +4,17 @@ on:
  pull_request:

 concurrency:
-  group: ci-backends-pr-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+  group: ci-backends-pr-${{ github.head_ref || github.ref }}-${{ github.repository }}
+  cancel-in-progress: true

 jobs:
  generate-matrix:
    runs-on: ubuntu-latest
    outputs:
-      matrix-singlearch: ${{ steps.set-matrix.outputs['matrix-singlearch'] }}
-      matrix-multiarch: ${{ steps.set-matrix.outputs['matrix-multiarch'] }}
-      matrix-darwin: ${{ steps.set-matrix.outputs['matrix-darwin'] }}
-      merge-matrix-multiarch: ${{ steps.set-matrix.outputs['merge-matrix-multiarch'] }}
-      merge-matrix-singlearch: ${{ steps.set-matrix.outputs['merge-matrix-singlearch'] }}
-      has-backends-singlearch: ${{ steps.set-matrix.outputs['has-backends-singlearch'] }}
-      has-backends-multiarch: ${{ steps.set-matrix.outputs['has-backends-multiarch'] }}
-      has-backends-darwin: ${{ steps.set-matrix.outputs['has-backends-darwin'] }}
-      has-merges-multiarch: ${{ steps.set-matrix.outputs['has-merges-multiarch'] }}
-      has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
+      matrix: ${{ steps.set-matrix.outputs.matrix }}
+      matrix-darwin: ${{ steps.set-matrix.outputs.matrix-darwin }}
+      has-backends: ${{ steps.set-matrix.outputs.has-backends }}
+      has-backends-darwin: ${{ steps.set-matrix.outputs.has-backends-darwin }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6
@@ -33,9 +27,7 @@ jobs:
          bun add js-yaml
          bun add @octokit/core

-      # filters the matrix in backend.yml; splits into single-arch and
-      # multi-arch groups so backend-merge-jobs can `needs:` only the latter
-      # (matches backend.yml's structure).
+      # filters the matrix in backend.yml
      - name: Filter matrix for changed backends
        id: set-matrix
        env:
@@ -43,10 +35,10 @@ jobs:
          GITHUB_EVENT_PATH: ${{ github.event_path }}
        run: bun run scripts/changed-backends.js

-  backend-jobs-multiarch:
+  backend-jobs:
    needs: generate-matrix
    uses: ./.github/workflows/backend_build.yml
-    if: needs.generate-matrix.outputs['has-backends-multiarch'] == 'true'
+    if: needs.generate-matrix.outputs.has-backends == 'true'
    with:
      tag-latest: ${{ matrix.tag-latest }}
      tag-suffix: ${{ matrix.tag-suffix }}
@@ -54,83 +46,19 @@ jobs:
      cuda-major-version: ${{ matrix.cuda-major-version }}
      cuda-minor-version: ${{ matrix.cuda-minor-version }}
      platforms: ${{ matrix.platforms }}
-      platform-tag: ${{ matrix.platform-tag || '' }}
      runs-on: ${{ matrix.runs-on }}
-      builder-base-image: ${{ matrix.builder-base-image || '' }}
      base-image: ${{ matrix.base-image }}
      backend: ${{ matrix.backend }}
      dockerfile: ${{ matrix.dockerfile }}
      skip-drivers: ${{ matrix.skip-drivers }}
      context: ${{ matrix.context }}
      ubuntu-version: ${{ matrix.ubuntu-version }}
-      amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
    secrets:
      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    strategy:
      fail-fast: true
-      max-parallel: 8
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-multiarch']) }}
-  backend-jobs-singlearch:
-    needs: generate-matrix
-    uses: ./.github/workflows/backend_build.yml
-    if: needs.generate-matrix.outputs['has-backends-singlearch'] == 'true'
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-      build-type: ${{ matrix.build-type }}
-      cuda-major-version: ${{ matrix.cuda-major-version }}
-      cuda-minor-version: ${{ matrix.cuda-minor-version }}
-      platforms: ${{ matrix.platforms }}
-      platform-tag: ${{ matrix.platform-tag || '' }}
-      runs-on: ${{ matrix.runs-on }}
-      builder-base-image: ${{ matrix.builder-base-image || '' }}
-      base-image: ${{ matrix.base-image }}
-      backend: ${{ matrix.backend }}
-      dockerfile: ${{ matrix.dockerfile }}
-      skip-drivers: ${{ matrix.skip-drivers }}
-      context: ${{ matrix.context }}
-      ubuntu-version: ${{ matrix.ubuntu-version }}
-      amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: true
-      max-parallel: 8
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch']) }}
-  backend-merge-jobs-multiarch:
-    needs: [generate-matrix, backend-jobs-multiarch]
-    # backend_merge.yml's push-side steps are all gated on
-    # github.event_name != 'pull_request', so on a PR the merge job would
-    # do nothing. Skip it entirely to avoid spinning up an empty runner.
-    # !cancelled() lets the merge run even when a few build legs fail —
-    # see the matching note in backend.yml.
-    if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
-    uses: ./.github/workflows/backend_merge.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-multiarch']) }}
-
-  backend-merge-jobs-singlearch:
-    needs: [generate-matrix, backend-jobs-singlearch]
-    if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true' }}
-    uses: ./.github/workflows/backend_merge.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch']) }}
+      matrix: ${{ fromJson(needs.generate-matrix.outputs.matrix) }}
  backend-jobs-darwin:
    needs: generate-matrix
    uses: ./.github/workflows/backend_build_darwin.yml
@@ -138,7 +66,7 @@ jobs:
    with:
      backend: ${{ matrix.backend }}
      build-type: ${{ matrix.build-type }}
-      go-version: "1.25.x"
+      go-version: "1.24.x"
      tag-suffix: ${{ matrix.tag-suffix }}
      lang: ${{ matrix.lang || 'python' }}
      use-pip: ${{ matrix.backend == 'diffusers' }}
--- a/.github/workflows/base-images.yml
+++ b/.github/workflows/base-images.yml
@@ -1,161 +0,0 @@
---
-name: 'build base-grpc images'
-
-# Builds + pushes pre-compiled builder base images that downstream
-# llama-cpp / ik-llama-cpp / turboquant variant Dockerfiles will FROM
-# (PR 2). Each base contains apt deps + protoc + cmake + gRPC at
-# /opt/grpc + (conditionally) CUDA / ROCm / Vulkan toolchains.
-#
-# Triggers:
-#   - schedule (Saturdays 05:00 UTC) - picks up Ubuntu/CUDA/ROCm
-#     security updates and re-runs ahead of the backend.yml weekly
-#     cron (Sundays 06:00 UTC).
-#   - workflow_dispatch - manual one-off rebuild.
-#   - push to master that touches Dockerfile.base-grpc-builder or
-#     this workflow itself - keeps bases in sync with their inputs.
-#
-# Bootstrap (one-time after this PR merges):
-#   gh workflow run base-images.yml --ref master
-# Wait ~30 min for all 9 matrix variants to push to
-# quay.io/go-skynet/ci-cache:base-grpc-* before merging PR 2.
-
-on:
-  schedule:
-    - cron: '0 5 * * 6'
-  workflow_dispatch:
-  push:
-    branches: [master]
-    paths:
-      - 'backend/Dockerfile.base-grpc-builder'
-      - '.github/workflows/base-images.yml'
-      # The install logic and apt-mirror helper are bind-mounted into
-      # Dockerfile.base-grpc-builder at build time — changes to either
-      # affect the produced base images and must trigger a rebuild.
-      - '.docker/install-base-deps.sh'
-      - '.docker/apt-mirror.sh'
-
-concurrency:
-  group: ci-base-images-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-
-jobs:
-  build:
-    if: github.repository == 'mudler/LocalAI'
-    runs-on: ${{ matrix.runs-on }}
-    strategy:
-      fail-fast: false
-      matrix:
-        include:
-          - tag: 'base-grpc-amd64'
-            runs-on: 'ubuntu-latest'
-            base-image: 'ubuntu:24.04'
-            build-type: ''
-            cuda-major-version: ''
-            cuda-minor-version: ''
-            ubuntu-version: '2404'
-          - tag: 'base-grpc-arm64'
-            runs-on: 'ubuntu-24.04-arm'
-            base-image: 'ubuntu:24.04'
-            build-type: ''
-            cuda-major-version: ''
-            cuda-minor-version: ''
-            ubuntu-version: '2404'
-          - tag: 'base-grpc-cuda-12-amd64'
-            runs-on: 'ubuntu-latest'
-            base-image: 'ubuntu:24.04'
-            build-type: 'cublas'
-            cuda-major-version: '12'
-            cuda-minor-version: '8'
-            ubuntu-version: '2404'
-          - tag: 'base-grpc-cuda-13-amd64'
-            runs-on: 'ubuntu-latest'
-            base-image: 'ubuntu:22.04'
-            build-type: 'cublas'
-            cuda-major-version: '13'
-            cuda-minor-version: '0'
-            ubuntu-version: '2204'
-          - tag: 'base-grpc-cuda-13-arm64'
-            runs-on: 'ubuntu-24.04-arm'
-            base-image: 'ubuntu:24.04'
-            build-type: 'cublas'
-            cuda-major-version: '13'
-            cuda-minor-version: '0'
-            ubuntu-version: '2404'
-          - tag: 'base-grpc-rocm-amd64'
-            runs-on: 'ubuntu-latest'
-            base-image: 'rocm/dev-ubuntu-24.04:7.2.1'
-            build-type: 'hipblas'
-            cuda-major-version: ''
-            cuda-minor-version: ''
-            ubuntu-version: '2404'
-          - tag: 'base-grpc-vulkan-amd64'
-            runs-on: 'ubuntu-latest'
-            base-image: 'ubuntu:24.04'
-            build-type: 'vulkan'
-            cuda-major-version: ''
-            cuda-minor-version: ''
-            ubuntu-version: '2404'
-          - tag: 'base-grpc-vulkan-arm64'
-            runs-on: 'ubuntu-24.04-arm'
-            base-image: 'ubuntu:24.04'
-            build-type: 'vulkan'
-            cuda-major-version: ''
-            cuda-minor-version: ''
-            ubuntu-version: '2404'
-          - tag: 'base-grpc-intel-amd64'
-            runs-on: 'ubuntu-latest'
-            base-image: 'intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04'
-            build-type: 'sycl'
-            cuda-major-version: ''
-            cuda-minor-version: ''
-            ubuntu-version: '2404'
-          # Legacy JetPack r36.4.0 base for older Jetson devices (CUDA 12).
-          # Distinct from base-grpc-cuda-13-arm64 (Ubuntu 24.04 + CUDA 13 sbsa)
-          # which targets newer Jetsons. Some matrix entries
-          # (-nvidia-l4t-arm64-llama-cpp / -turboquant) still build against
-          # the JetPack image, so we need a matching base.
-          - tag: 'base-grpc-l4t-cuda-12-arm64'
-            runs-on: 'ubuntu-24.04-arm'
-            base-image: 'nvcr.io/nvidia/l4t-jetpack:r36.4.0'
-            build-type: 'l4t'
-            cuda-major-version: '12'
-            cuda-minor-version: '0'
-            ubuntu-version: '2204'
-            # JetPack r36.4.0 already ships CUDA preinstalled at /usr/local/cuda;
-            # apt-installing cuda-nvcc-12-0 from the public repos fails because
-            # those packages aren't published for the JetPack apt feed. Match
-            # the original l4t matrix entry which set skip-drivers: 'true'.
-            skip-drivers: 'true'
-    steps:
-      - uses: actions/checkout@v6
-        with:
-          submodules: false
-      - name: Free disk space
-        uses: ./.github/actions/free-disk-space
-      - name: Set up build disk
-        uses: ./.github/actions/setup-build-disk
-      - uses: docker/setup-qemu-action@master
-        with:
-          platforms: all
-      - uses: docker/setup-buildx-action@master
-      - uses: docker/login-action@v4
-        with:
-          registry: quay.io
-          username: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-          password: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-      - uses: docker/build-push-action@v7
-        with:
-          context: .
-          file: ./backend/Dockerfile.base-grpc-builder
-          build-args: |
-            BASE_IMAGE=${{ matrix.base-image }}
-            BUILD_TYPE=${{ matrix.build-type }}
-            CUDA_MAJOR_VERSION=${{ matrix.cuda-major-version }}
-            CUDA_MINOR_VERSION=${{ matrix.cuda-minor-version }}
-            UBUNTU_VERSION=${{ matrix.ubuntu-version }}
-            SKIP_DRIVERS=${{ matrix.skip-drivers || 'false' }}
-          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-${{ matrix.tag }}
-          cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache-${{ matrix.tag }},mode=max,ignore-error=true
-          provenance: false
-          tags: quay.io/go-skynet/ci-cache:${{ matrix.tag }}
-          push: true
--- a/.github/workflows/build-test.yaml
+++ b/.github/workflows/build-test.yaml
@@ -50,8 +50,6 @@ jobs:
        uses: actions/checkout@v6
        with:
          fetch-depth: 0
-      - name: Configure apt mirror on runner
-        uses: ./.github/actions/configure-apt-mirror
      - name: Set up Go
        uses: actions/setup-go@v5
        with:
--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -22,30 +22,10 @@ jobs:
            variable: "TURBOQUANT_VERSION"
            branch: "feature/turboquant-kv-cache"
            file: "backend/cpp/turboquant/Makefile"
-          - repository: "antirez/ds4"
-            variable: "DS4_VERSION"
-            branch: "main"
-            file: "backend/cpp/ds4/Makefile"
-          - repository: "localai-org/privacy-filter.cpp"
-            variable: "PRIVACY_FILTER_VERSION"
-            branch: "master"
-            file: "backend/cpp/privacy-filter/Makefile"
          - repository: "ggml-org/whisper.cpp"
            variable: "WHISPER_CPP_VERSION"
            branch: "master"
            file: "backend/go/whisper/Makefile"
-          - repository: "CrispStrobe/CrispASR"
-            variable: "CRISPASR_VERSION"
-            branch: "main"
-            file: "backend/go/crispasr/Makefile"
-          - repository: "mudler/parakeet.cpp"
-            variable: "PARAKEET_VERSION"
-            branch: "master"
-            file: "backend/go/parakeet-cpp/Makefile"
-          - repository: "mudler/depth-anything.cpp"
-            variable: "DEPTHANYTHING_VERSION"
-            branch: "master"
-            file: "backend/go/depth-anything-cpp/Makefile"
          - repository: "leejet/stable-diffusion.cpp"
            variable: "STABLEDIFFUSION_GGML_VERSION"
            branch: "master"
@@ -66,26 +46,10 @@ jobs:
            variable: "SAM3_VERSION"
            branch: "main"
            file: "backend/go/sam3-cpp/Makefile"
-          - repository: "mudler/rf-detr.cpp"
-            variable: "RFDETR_VERSION"
-            branch: "main"
-            file: "backend/go/rfdetr-cpp/Makefile"
-          - repository: "mudler/locate-anything.cpp"
-            variable: "LOCATEANYTHING_VERSION"
-            branch: "master"
-            file: "backend/go/locate-anything-cpp/Makefile"
-          - repository: "ServeurpersoCom/qwentts.cpp"
+          - repository: "predict-woo/qwen3-tts.cpp"
            variable: "QWEN3TTS_CPP_VERSION"
-            branch: "master"
+            branch: "main"
            file: "backend/go/qwen3-tts-cpp/Makefile"
-          - repository: "ServeurpersoCom/omnivoice.cpp"
-            variable: "OMNIVOICE_VERSION"
-            branch: "master"
-            file: "backend/go/omnivoice-cpp/Makefile"
-          - repository: "localai-org/vibevoice.cpp"
-            variable: "VIBEVOICE_CPP_VERSION"
-            branch: "master"
-            file: "backend/go/vibevoice-cpp/Makefile"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
@@ -116,37 +80,5 @@ jobs:
          body: ${{ steps.bump.outputs.message }}
          signoff: true

-  bump-vllm-wheel:
-    # vLLM's cu130 wheel comes from a per-tag index URL (no /latest/ alias),
-    # so the cublas13 requirements file pins both a URL segment and a version
-    # constraint. bump_deps.sh handles git-sha-in-Makefile only — this job
-    # rewrites both values atomically when a new vLLM stable tag ships.
-    if: github.repository == 'mudler/LocalAI'
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v6
-      - name: Bump vLLM cu130 wheel pin 🔧
-        id: bump
-        run: |
-          bash .github/bump_vllm_wheel.sh vllm-project/vllm backend/python/vllm/requirements-cublas13-after.txt VLLM_VERSION
-          {
-            echo 'message<<EOF'
-            cat "VLLM_VERSION_message.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          {
-            echo 'commit<<EOF'
-            cat "VLLM_VERSION_commit.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          rm -rfv VLLM_VERSION_message.txt VLLM_VERSION_commit.txt
-      - name: Create Pull Request
-        uses: peter-evans/create-pull-request@v8
-        with:
-          token: ${{ secrets.UPDATE_BOT_TOKEN }}
-          push-to-fork: ci-forks/LocalAI
-          commit-message: ':arrow_up: Update vllm-project/vllm cu130 wheel'
-          title: 'chore: :arrow_up: Update vllm-project/vllm cu130 wheel to `${{ steps.bump.outputs.commit }}`'
-          branch: "update/VLLM_VERSION"
-          body: ${{ steps.bump.outputs.message }}
-          signoff: true
+
+
--- a/.github/workflows/checksum_checker.yaml
+++ b/.github/workflows/checksum_checker.yaml
@@ -8,9 +8,15 @@ jobs:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
+      - name: Force Install GIT latest
+        run: |
+          sudo apt-get update \
+          && sudo apt-get install -y software-properties-common \
+          && sudo apt-get update \
+          && sudo add-apt-repository -y ppa:git-core/ppa \
+          && sudo apt-get update \
+          && sudo apt-get install -y git
      - uses: actions/checkout@v6
-      - name: Configure apt mirror on runner
-        uses: ./.github/actions/configure-apt-mirror
      - name: Install dependencies
        run: |
          sudo apt-get update
--- a/.github/workflows/gallery-agent.yaml
+++ b/.github/workflows/gallery-agent.yaml
@@ -2,7 +2,7 @@ name: Gallery Agent
 on:

  schedule:
-    - cron: '0 */12 * * *'  # Run every 4 hours
+    - cron: '0 */3 * * *'  # Run every 3 hours
  workflow_dispatch:
    inputs:
      search_term:
@@ -54,41 +54,24 @@ jobs:
          REPO: ${{ github.repository }}
          SEARCH: 'gallery agent in:title'
        run: |
-          # Walk gallery-agent PRs and act on maintainer comments:
+          # Walk open gallery-agent PRs and act on maintainer comments:
          #   /gallery-agent blacklist → label `gallery-agent/blacklisted` + close (never repropose)
          #   /gallery-agent recreate  → close without label (next run may repropose)
          # Only comments from OWNER / MEMBER / COLLABORATOR are honored so
          # random users can't drive the bot.
-          #
-          # We scan both open PRs AND recently-closed PRs that don't already
-          # carry the blacklist label. This covers the common flow where a
-          # maintainer writes /gallery-agent blacklist and immediately clicks
-          # Close — without this, the next scheduled run wouldn't see the
-          # command (PR is already closed) and would repropose the model.
          gh label create gallery-agent/blacklisted \
            --repo "$REPO" --color ededed \
            --description "gallery-agent must not repropose this model" 2>/dev/null || true

-          prs_open=$(gh pr list --repo "$REPO" --state open --search "$SEARCH" \
-            --json number --jq '.[].number')
-          # Closed PRs from the last 14 days that don't yet have the blacklist label.
-          # Bounded window keeps the scan cheap while covering late-applied commands.
-          since=$(date -u -d '14 days ago' +%Y-%m-%d)
-          prs_closed=$(gh pr list --repo "$REPO" --state closed \
-            --search "$SEARCH closed:>=$since -label:gallery-agent/blacklisted" \
-            --json number --jq '.[].number')
-          prs=$(printf '%s\n%s\n' "$prs_open" "$prs_closed" | sort -u | sed '/^$/d')
+          prs=$(gh pr list --repo "$REPO" --state open --search "$SEARCH" --json number --jq '.[].number')
          for pr in $prs; do
-            state=$(gh pr view "$pr" --repo "$REPO" --json state --jq '.state')
            cmds=$(gh pr view "$pr" --repo "$REPO" --json comments \
              --jq '.comments[] | select(.authorAssociation=="OWNER" or .authorAssociation=="MEMBER" or .authorAssociation=="COLLABORATOR") | .body')
            if echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+blacklist([[:space:]]|$)'; then
-              echo "PR #$pr: blacklist command found (state=$state)"
+              echo "PR #$pr: blacklist command found"
              gh pr edit "$pr" --repo "$REPO" --add-label gallery-agent/blacklisted || true
-              if [ "$state" = "OPEN" ]; then
-                gh pr close "$pr" --repo "$REPO" --comment "Blacklisted via \`/gallery-agent blacklist\`. This model will not be reproposed." || true
-              fi
-            elif [ "$state" = "OPEN" ] && echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+recreate([[:space:]]|$)'; then
+              gh pr close "$pr" --repo "$REPO" --comment "Blacklisted via \`/gallery-agent blacklist\`. This model will not be reproposed." || true
+            elif echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+recreate([[:space:]]|$)'; then
              echo "PR #$pr: recreate command found"
              gh pr close "$pr" --repo "$REPO" --comment "Closed via \`/gallery-agent recreate\`. The next scheduled run will propose this model again." || true
            fi
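Note on the command matching in the gallery-agent.yaml hunk above: the two grep patterns only fire when `/gallery-agent` is preceded by start-of-line or whitespace and the verb is followed by whitespace or end-of-line. A minimal sketch of that behavior follows, assuming a hypothetical helper named `classify_command` that is not part of the workflow; it reuses the workflow's regexes unchanged and takes the comment body as its only argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the workflow): report which gallery-agent
# command, if any, appears in a comment body. The regexes are copied
# verbatim from the workflow step; blacklist is checked first, matching
# the if/elif order used there.
classify_command() {
  local body="$1"
  if grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+blacklist([[:space:]]|$)' <<< "$body"; then
    echo "blacklist"
  elif grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+recreate([[:space:]]|$)' <<< "$body"; then
    echo "recreate"
  else
    echo "none"
  fi
}
```

Because the verb must end at whitespace or end-of-line, a body such as `/gallery-agent blacklisted` is not treated as a command, and a token glued to other text (`x/gallery-agent blacklist`) is ignored as well.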
--- a/.github/workflows/generate_grpc_cache.yaml
+++ b/.github/workflows/generate_grpc_cache.yaml
@@ -0,0 +1,96 @@
+name: 'generate and publish GRPC docker caches'
+
+on:
+  workflow_dispatch:
+
+  schedule:
+    # daily at midnight
+    - cron: '0 0 * * *'
+
+concurrency:
+  group: grpc-cache-${{ github.head_ref || github.ref }}-${{ github.repository }}
+  cancel-in-progress: true
+
+jobs:
+  generate_caches:
+    if: github.repository == 'mudler/LocalAI'
+    strategy:
+      matrix:
+        include:
+          - grpc-base-image: ubuntu:24.04
+            runs-on: 'ubuntu-latest'
+            platforms: 'linux/amd64,linux/arm64'
+    runs-on: ${{matrix.runs-on}}
+    steps:
+      - name: Release space from worker
+        if: matrix.runs-on == 'ubuntu-latest'
+        run: |
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          df -h
+          echo
+          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
+          sudo apt-get remove --auto-remove android-sdk-platform-tools || true
+          sudo apt-get purge --auto-remove android-sdk-platform-tools || true
+          sudo rm -rf /usr/local/lib/android
+          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
+          sudo rm -rf /usr/share/dotnet
+          sudo apt-get remove -y '^mono-.*' || true
+          sudo apt-get remove -y '^ghc-.*' || true
+          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
+          sudo apt-get remove -y 'php.*' || true
+          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
+          sudo apt-get remove -y '^google-.*' || true
+          sudo apt-get remove -y azure-cli || true
+          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
+          sudo apt-get remove -y '^gfortran-.*' || true
+          sudo apt-get remove -y microsoft-edge-stable || true
+          sudo apt-get remove -y firefox || true
+          sudo apt-get remove -y powershell || true
+          sudo apt-get remove -y r-base-core || true
+          sudo apt-get autoremove -y
+          sudo apt-get clean
+          echo
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          sudo rm -rfv build || true
+          sudo rm -rf /usr/share/dotnet || true
+          sudo rm -rf /opt/ghc || true
+          sudo rm -rf "/usr/local/share/boost" || true
+          sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
+          df -h
+
+      - name: Set up QEMU
+        uses: docker/setup-qemu-action@master
+        with:
+          platforms: all
+
+      - name: Set up Docker Buildx
+        id: buildx
+        uses: docker/setup-buildx-action@master
+
+      - name: Checkout
+        uses: actions/checkout@v6
+
+      - name: Cache GRPC
+        uses: docker/build-push-action@v7
+        with:
+          builder: ${{ steps.buildx.outputs.name }}
+          # The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
+          # This means that even the MAKEFLAGS have to be an EXACT match.
+          # If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
+          build-args: |
+            GRPC_BASE_IMAGE=${{ matrix.grpc-base-image }}
+            GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
+            GRPC_VERSION=v1.65.0
+          context: .
+          file: ./Dockerfile
+          cache-to: type=gha,ignore-error=true
+          cache-from: type=gha
+          target: grpc
+          platforms: ${{ matrix.platforms }}
+          push: false
--- a/.github/workflows/generate_intel_image.yaml
+++ b/.github/workflows/generate_intel_image.yaml
@@ -7,8 +7,8 @@ on:
      - master

 concurrency:
-  group: intel-cache-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+  group: intel-cache-${{ github.head_ref || github.ref }}-${{ github.repository }}
+  cancel-in-progress: true

 jobs:
  generate_caches:
@@ -16,7 +16,7 @@ jobs:
    strategy:
      matrix:
        include:
-          - base-image: intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04
+          - base-image: intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04
            runs-on: 'arc-runner-set'
            platforms: 'linux/amd64'
    runs-on: ${{matrix.runs-on}}
--- a/.github/workflows/image-pr.yml
+++ b/.github/workflows/image-pr.yml
@@ -5,8 +5,8 @@
    pull_request:
  
  concurrency:
-    group: ci-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-    cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+    group: ci-${{ github.head_ref || github.ref }}-${{ github.repository }}
+    cancel-in-progress: true
  
  jobs:
    image-build:
@@ -18,9 +18,9 @@
        cuda-major-version: ${{ matrix.cuda-major-version }}
        cuda-minor-version: ${{ matrix.cuda-minor-version }}
        platforms: ${{ matrix.platforms }}
-        platform-tag: ${{ matrix.platform-tag || '' }}
        runs-on: ${{ matrix.runs-on }}
        base-image: ${{ matrix.base-image }}
+        grpc-base-image: ${{ matrix.grpc-base-image }}
        makeflags: ${{ matrix.makeflags }}
        ubuntu-version: ${{ matrix.ubuntu-version }}
      secrets:
@@ -60,35 +60,27 @@
              tag-latest: 'false'
              tag-suffix: '-hipblas'
              base-image: "rocm/dev-ubuntu-24.04:7.2.1"
+              grpc-base-image: "ubuntu:24.04"
              runs-on: 'ubuntu-latest'
              makeflags: "--jobs=3 --output-sync=target"
              ubuntu-version: '2404'
            - build-type: 'sycl'
              platforms: 'linux/amd64'
              tag-latest: 'false'
-              base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
+              base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+              grpc-base-image: "ubuntu:24.04"
              tag-suffix: 'sycl'
              runs-on: 'ubuntu-latest'
              makeflags: "--jobs=3 --output-sync=target"
              ubuntu-version: '2404'
            - build-type: 'vulkan'
-              platforms: 'linux/amd64'
-              platform-tag: 'amd64'
+              platforms: 'linux/amd64,linux/arm64'
              tag-latest: 'false'
              tag-suffix: '-vulkan-core'
              runs-on: 'ubuntu-latest'
              base-image: "ubuntu:24.04"
              makeflags: "--jobs=4 --output-sync=target"
              ubuntu-version: '2404'
-            - build-type: 'vulkan'
-              platforms: 'linux/arm64'
-              platform-tag: 'arm64'
-              tag-latest: 'false'
-              tag-suffix: '-vulkan-core'
-              runs-on: 'ubuntu-24.04-arm'
-              base-image: "ubuntu:24.04"
-              makeflags: "--jobs=4 --output-sync=target"
-              ubuntu-version: '2404'
            - build-type: 'cublas'
              cuda-major-version: "13"
              cuda-minor-version: "0"
--- a/.github/workflows/image.yml
+++ b/.github/workflows/image.yml
@@ -9,8 +9,8 @@
        - '*'
  
  concurrency:
-    group: ci-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-    cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+    group: ci-${{ github.head_ref || github.ref }}-${{ github.repository }}
+    cancel-in-progress: true
  
  jobs:
    hipblas-jobs:
@@ -25,6 +25,7 @@
        platforms: ${{ matrix.platforms }}
        runs-on: ${{ matrix.runs-on }}
        base-image: ${{ matrix.base-image }}
+        grpc-base-image: ${{ matrix.grpc-base-image }}
        makeflags: ${{ matrix.makeflags }}
        ubuntu-version: ${{ matrix.ubuntu-version }}
        ubuntu-codename: ${{ matrix.ubuntu-codename }}
@@ -41,11 +42,12 @@
              tag-latest: 'auto'
              tag-suffix: '-gpu-hipblas'
              base-image: "rocm/dev-ubuntu-24.04:7.2.1"
+              grpc-base-image: "ubuntu:24.04"
              runs-on: 'ubuntu-latest'
              makeflags: "--jobs=3 --output-sync=target"
              ubuntu-version: '2404'
              ubuntu-codename: 'noble'
-
+  
    core-image-build:
      if: github.repository == 'mudler/LocalAI'
      uses: ./.github/workflows/image_build.yml
@@ -56,9 +58,9 @@
        cuda-major-version: ${{ matrix.cuda-major-version }}
        cuda-minor-version: ${{ matrix.cuda-minor-version }}
        platforms: ${{ matrix.platforms }}
-        platform-tag: ${{ matrix.platform-tag || '' }}
        runs-on: ${{ matrix.runs-on }}
        base-image: ${{ matrix.base-image }}
+        grpc-base-image: ${{ matrix.grpc-base-image }}
        makeflags: ${{ matrix.makeflags }}
        skip-drivers: ${{ matrix.skip-drivers }}
        ubuntu-version: ${{ matrix.ubuntu-version }}
@@ -73,8 +75,7 @@
        matrix:
          include:
            - build-type: ''
-              platforms: 'linux/amd64'
-              platform-tag: 'amd64'
+              platforms: 'linux/amd64,linux/arm64'
              tag-latest: 'auto'
              tag-suffix: ''
              base-image: "ubuntu:24.04"
@@ -83,17 +84,6 @@
              skip-drivers: 'false'
              ubuntu-version: '2404'
              ubuntu-codename: 'noble'
-            - build-type: ''
-              platforms: 'linux/arm64'
-              platform-tag: 'arm64'
-              tag-latest: 'auto'
-              tag-suffix: ''
-              base-image: "ubuntu:24.04"
-              runs-on: 'ubuntu-24.04-arm'
-              makeflags: "--jobs=4 --output-sync=target"
-              skip-drivers: 'false'
-              ubuntu-version: '2404'
-              ubuntu-codename: 'noble'
            - build-type: 'cublas'
              cuda-major-version: "12"
              cuda-minor-version: "8"
@@ -119,8 +109,7 @@
              ubuntu-version: '2404'
              ubuntu-codename: 'noble'
            - build-type: 'vulkan'
-              platforms: 'linux/amd64'
-              platform-tag: 'amd64'
+              platforms: 'linux/amd64,linux/arm64'
              tag-latest: 'auto'
              tag-suffix: '-gpu-vulkan'
              runs-on: 'ubuntu-latest'
@@ -129,141 +118,17 @@
              makeflags: "--jobs=4 --output-sync=target"
              ubuntu-version: '2404'
              ubuntu-codename: 'noble'
-            - build-type: 'vulkan'
-              platforms: 'linux/arm64'
-              platform-tag: 'arm64'
-              tag-latest: 'auto'
-              tag-suffix: '-gpu-vulkan'
-              runs-on: 'ubuntu-24.04-arm'
-              base-image: "ubuntu:24.04"
-              skip-drivers: 'false'
-              makeflags: "--jobs=4 --output-sync=target"
-              ubuntu-version: '2404'
-              ubuntu-codename: 'noble'
            - build-type: 'intel'
              platforms: 'linux/amd64'
              tag-latest: 'auto'
-              base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
+              base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+              grpc-base-image: "ubuntu:24.04"
              tag-suffix: '-gpu-intel'
              runs-on: 'ubuntu-latest'
              makeflags: "--jobs=3 --output-sync=target"
              ubuntu-version: '2404'
              ubuntu-codename: 'noble'
-
-    core-image-merge:
-      # !cancelled(): without it, GHA's default `needs:` cascade skips the
-      # merge whenever any matrix cell of the parent build fails or is
-      # cancelled. Same fix as backend.yml's merge jobs — we still want to
-      # publish the manifest list for tag-suffixes whose legs all succeeded.
-      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
-      needs: core-image-build
-      uses: ./.github/workflows/image_merge.yml
-      with:
-        tag-latest: 'auto'
-        tag-suffix: ''
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-
-    gpu-vulkan-image-merge:
-      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
-      needs: core-image-build
-      uses: ./.github/workflows/image_merge.yml
-      with:
-        tag-latest: 'auto'
-        tag-suffix: '-gpu-vulkan'
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-
-    # Single-arch server-image merges. Same conceptual fix as the backend
-    # singletons in PR #9781: image_build.yml pushes by canonical digest
-    # only, so without a downstream merge step there's no tag for consumers
-    # (no :latest-gpu-nvidia-cuda-12, no :v<X>-gpu-nvidia-cuda-12, etc.).
-    # Each merge job needs only its parent build matrix and is filtered by
-    # tag-suffix in image_merge.yml's artifact-download pattern.
-    gpu-nvidia-cuda-12-image-merge:
-      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
-      needs: core-image-build
-      uses: ./.github/workflows/image_merge.yml
-      with:
-        tag-latest: 'auto'
-        tag-suffix: '-gpu-nvidia-cuda-12'
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-
-    gpu-nvidia-cuda-13-image-merge:
-      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
-      needs: core-image-build
-      uses: ./.github/workflows/image_merge.yml
-      with:
-        tag-latest: 'auto'
-        tag-suffix: '-gpu-nvidia-cuda-13'
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-
-    gpu-intel-image-merge:
-      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
-      needs: core-image-build
-      uses: ./.github/workflows/image_merge.yml
-      with:
-        tag-latest: 'auto'
-        tag-suffix: '-gpu-intel'
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-
-    gpu-hipblas-image-merge:
-      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
-      needs: hipblas-jobs
-      uses: ./.github/workflows/image_merge.yml
-      with:
-        tag-latest: 'auto'
-        tag-suffix: '-gpu-hipblas'
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-
-    nvidia-l4t-arm64-image-merge:
-      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
-      needs: gh-runner
-      uses: ./.github/workflows/image_merge.yml
-      with:
-        tag-latest: 'auto'
-        tag-suffix: '-nvidia-l4t-arm64'
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-
-    nvidia-l4t-arm64-cuda-13-image-merge:
-      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
-      needs: gh-runner
-      uses: ./.github/workflows/image_merge.yml
-      with:
-        tag-latest: 'auto'
-        tag-suffix: '-nvidia-l4t-arm64-cuda-13'
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-
+  
    gh-runner:
      if: github.repository == 'mudler/LocalAI'
      uses: ./.github/workflows/image_build.yml
@@ -276,6 +141,7 @@
        platforms: ${{ matrix.platforms }}
        runs-on: ${{ matrix.runs-on }}
        base-image: ${{ matrix.base-image }}
+        grpc-base-image: ${{ matrix.grpc-base-image }}
        makeflags: ${{ matrix.makeflags }}
        skip-drivers: ${{ matrix.skip-drivers }}
        ubuntu-version: ${{ matrix.ubuntu-version }}
--- a/.github/workflows/image_build.yml
+++ b/.github/workflows/image_build.yml
@@ -8,6 +8,11 @@ on:
        description: 'Base image'
        required: true
        type: string
+      grpc-base-image:
+        description: 'GRPC Base image, must be a compatible image with base-image'
+        required: false
+        default: ''
+        type: string
      build-type:
        description: 'Build type'
        default: ''
@@ -24,15 +29,6 @@ on:
        description: 'Platforms'
        default: ''
        type: string
-      platform-tag:
-        description: |
-          Short tag identifying the platform leg, e.g. "amd64" or "arm64".
-          Used to scope the per-arch registry cache and the digest artifact name.
-          Optional during the migration; will be flipped to required: true once
-          every caller passes an explicit value.
-        required: false
-        default: ''
-        type: string
      tag-latest:
        description: 'Tag latest'
        default: ''
@@ -79,20 +75,73 @@ jobs:
    runs-on: ${{ inputs.runs-on }}
    steps:

+      - name: Free Disk Space (Ubuntu)
+        if: inputs.runs-on == 'ubuntu-latest'
+        uses: jlumbroso/free-disk-space@main
+        with:
+          # this might remove tools that are actually needed,
+          # if set to "true" but frees about 6 GB
+          tool-cache: true
+          # all of these default to true, but feel free to set to
+          # "false" if necessary for your workflow
+          android: true
+          dotnet: true
+          haskell: true
+          large-packages: true
+          docker-images: true
+          swap-storage: true
+      - name: Force Install GIT latest
+        run: |
+          sudo apt-get update \
+          && sudo apt-get install -y software-properties-common \
+          && sudo apt-get update \
+          && sudo add-apt-repository -y ppa:git-core/ppa \
+          && sudo apt-get update \
+          && sudo apt-get install -y git
      - name: Checkout
        uses: actions/checkout@v6

-      - name: Configure apt mirror on runner
-        id: apt_mirror
-        uses: ./.github/actions/configure-apt-mirror
-
-      - name: Free disk space
-        uses: ./.github/actions/free-disk-space
-        with:
-          mode: ${{ inputs.runs-on == 'ubuntu-latest' && 'hosted' || 'skip' }}
-
-      - name: Set up build disk
-        uses: ./.github/actions/setup-build-disk
+      - name: Release space from worker
+        if: inputs.runs-on == 'ubuntu-latest'
+        run: |
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          df -h
+          echo
+          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
+          sudo apt-get remove --auto-remove android-sdk-platform-tools snapd || true
+          sudo apt-get purge --auto-remove android-sdk-platform-tools snapd || true
+          sudo rm -rf /usr/local/lib/android
+          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
+          sudo rm -rf /usr/share/dotnet
+          sudo apt-get remove -y '^mono-.*' || true
+          sudo apt-get remove -y '^ghc-.*' || true
+          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
+          sudo apt-get remove -y 'php.*' || true
+          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
+          sudo apt-get remove -y '^google-.*' || true
+          sudo apt-get remove -y azure-cli || true
+          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
+          sudo apt-get remove -y '^gfortran-.*' || true
+          sudo apt-get remove -y microsoft-edge-stable || true
+          sudo apt-get remove -y firefox || true
+          sudo apt-get remove -y powershell || true
+          sudo apt-get remove -y r-base-core || true
+          sudo apt-get autoremove -y
+          sudo apt-get clean
+          echo
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          sudo rm -rfv build || true
+          sudo rm -rf /usr/share/dotnet || true
+          sudo rm -rf /opt/ghc || true
+          sudo rm -rf "/usr/local/share/boost" || true
+          sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
+          df -h

      - name: Docker meta
        id: meta
@@ -106,7 +155,6 @@ jobs:
            type=ref,event=branch
            type=semver,pattern={{raw}}
            type=sha
-            type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
          flavor: |
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.tag-suffix }},onlatest=true
@@ -148,89 +196,59 @@ jobs:
          username: ${{ secrets.quayUsername }}
          password: ${{ secrets.quayPassword }}

-      - name: Build and push by digest
-        id: build
+      - name: Build and push
        uses: docker/build-push-action@v7
        if: github.event_name != 'pull_request'
        with:
          builder: ${{ steps.buildx.outputs.name }}
+          # The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
+          # This means that even the MAKEFLAGS have to be an EXACT match.
+          # If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
+          # This is why some build args like GRPC_VERSION and MAKEFLAGS are hardcoded
          build-args: |
            BUILD_TYPE=${{ inputs.build-type }}
            CUDA_MAJOR_VERSION=${{ inputs.cuda-major-version }}
            CUDA_MINOR_VERSION=${{ inputs.cuda-minor-version }}
            BASE_IMAGE=${{ inputs.base-image }}
+            GRPC_BASE_IMAGE=${{ inputs.grpc-base-image || inputs.base-image }}
+            GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
+            GRPC_VERSION=v1.65.0
            MAKEFLAGS=${{ inputs.makeflags }}
            SKIP_DRIVERS=${{ inputs.skip-drivers }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
            UBUNTU_CODENAME=${{ inputs.ubuntu-codename }}
-            APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
-            APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
          context: .
          file: ./Dockerfile
-          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
-          cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}-${{ inputs.platform-tag }},mode=max,ignore-error=true
+          cache-from: type=gha
          platforms: ${{ inputs.platforms }}
-          outputs: |
-            type=image,name=quay.io/go-skynet/local-ai,push-by-digest=true,name-canonical=true,push=true
-            type=image,name=localai/localai,push-by-digest=true,name-canonical=true,push=true
-          # See backend_build.yml for the rationale — provenance=mode=max
-          # diverges the manifest-list digest per registry, breaking the
-          # downstream imagetools create lookup.
-          provenance: false
+          push: ${{ github.event_name != 'pull_request' }}
+          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
-
-      - name: Export digest
-        if: github.event_name != 'pull_request'
-        run: |
-          mkdir -p /tmp/digests
-          digest="${{ steps.build.outputs.digest }}"
-          touch "/tmp/digests/${digest#sha256:}"
-
-      # See .github/scripts/anchor-digest-in-cache.sh for why this is needed
-      # and how it interacts with image_merge.yml's cleanup step. Mirrors the
-      # same anchor in backend_build.yml — quay's per-repo manifest GC reaps
-      # untagged manifests in local-ai before the merge runs.
-      - name: Anchor digest in ci-cache so quay GC won't reap before merge
-        if: github.event_name != 'pull_request'
-        env:
-          TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
-          PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
-          DIGEST: ${{ steps.build.outputs.digest }}
-          SOURCE_IMAGE: quay.io/go-skynet/local-ai
-        run: .github/scripts/anchor-digest-in-cache.sh
-
-      - name: Upload digest artifact
-        if: github.event_name != 'pull_request'
-        uses: actions/upload-artifact@v7
-        with:
-          # `--` separator + 'single' placeholder for empty platform-tag —
-          # same pattern as backend_build.yml. Prevents prefix collisions
-          # in the merge-side glob (e.g. -nvidia-l4t-arm64 is a prefix of
-          # -nvidia-l4t-arm64-cuda-13).
-          name: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
-          path: /tmp/digests/*
-          if-no-files-found: error
-          retention-days: 1
 ### Start testing image
      - name: Build and push
        uses: docker/build-push-action@v7
        if: github.event_name == 'pull_request'
        with:
          builder: ${{ steps.buildx.outputs.name }}
+          # The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
+          # This means that even the MAKEFLAGS have to be an EXACT match.
+          # If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
+          # This is why some build args like GRPC_VERSION and MAKEFLAGS are hardcoded
          build-args: |
            BUILD_TYPE=${{ inputs.build-type }}
            CUDA_MAJOR_VERSION=${{ inputs.cuda-major-version }}
            CUDA_MINOR_VERSION=${{ inputs.cuda-minor-version }}
            BASE_IMAGE=${{ inputs.base-image }}
+            GRPC_BASE_IMAGE=${{ inputs.grpc-base-image || inputs.base-image }}
+            GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
+            GRPC_VERSION=v1.65.0
            MAKEFLAGS=${{ inputs.makeflags }}
            SKIP_DRIVERS=${{ inputs.skip-drivers }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
            UBUNTU_CODENAME=${{ inputs.ubuntu-codename }}
-            APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
-            APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
          context: .
          file: ./Dockerfile
-          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
+          cache-from: type=gha
          platforms: ${{ inputs.platforms }}
          #push: true
          tags: ${{ steps.meta_pull_request.outputs.tags }}
--- a/.github/workflows/image_merge.yml
+++ b/.github/workflows/image_merge.yml
@@ -1,146 +0,0 @@
---
-name: 'merge LocalAI image manifest list (reusable)'
-
-# Reusable workflow that joins per-arch digest artifacts (uploaded by
-# image_build.yml when called with platform-tag) into a single tagged
-# multi-arch manifest list.
-
-on:
-  workflow_call:
-    inputs:
-      tag-latest:
-        description: 'Whether the manifest list should also be tagged latest (auto/false/true)'
-        required: false
-        type: string
-        default: ''
-      tag-suffix:
-        description: 'Image tag suffix (empty for core image). Used in artifact pattern with a -core placeholder for empty.'
-        required: true
-        type: string
-    secrets:
-      dockerUsername:
-        required: false
-      dockerPassword:
-        required: false
-      quayUsername:
-        required: true
-      quayPassword:
-        required: true
-
-jobs:
-  merge:
-    runs-on: ubuntu-latest
-    env:
-      quay_username: ${{ secrets.quayUsername }}
-    steps:
-      # Sparse checkout: needed for .github/scripts/ (the keepalive cleanup
-      # script). Skips the rest of the source tree.
-      - name: Checkout (.github/scripts only)
-        uses: actions/checkout@v6
-        with:
-          sparse-checkout: |
-            .github/scripts
-          sparse-checkout-cone-mode: false
-
-      - name: Download digests
-        uses: actions/download-artifact@v8
-        with:
-          # `--` separator anchors the glob so we don't over-match sibling
-          # tag-suffixes (e.g. -nvidia-l4t-arm64 vs -nvidia-l4t-arm64-cuda-13).
-          # Must stay in sync with image_build.yml's upload-artifact name.
-          pattern: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--*
-          merge-multiple: true
-          path: /tmp/digests
-
-      - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@master
-
-      - name: Login to DockerHub
-        if: github.event_name != 'pull_request'
-        uses: docker/login-action@v4
-        with:
-          username: ${{ secrets.dockerUsername }}
-          password: ${{ secrets.dockerPassword }}
-
-      - name: Login to Quay.io
-        uses: docker/login-action@v4
-        with:
-          registry: quay.io
-          username: ${{ secrets.quayUsername }}
-          password: ${{ secrets.quayPassword }}
-
-      - name: Docker meta
-        id: meta
-        uses: docker/metadata-action@v6
-        with:
-          images: |
-            quay.io/go-skynet/local-ai
-            localai/localai
-          tags: |
-            type=ref,event=branch
-            type=semver,pattern={{raw}}
-            type=sha
-            type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
-          flavor: |
-            latest=${{ inputs.tag-latest }}
-            suffix=${{ inputs.tag-suffix }},onlatest=true
-
-      # Source from ci-cache, not local-ai. See backend_merge.yml for the
-      # detailed rationale — quay's manifest GC is per-repository, so the
-      # untagged digest in local-ai gets reaped while the same content lives
-      # tagged under ci-cache (anchored by image_build.yml). buildx imagetools
-      # create copies the manifest into local-ai (blobs already cross-mounted)
-      # and publishes the manifest list with user-facing tags. End state in
-      # local-ai is self-contained; no embedded reference to ci-cache.
-      - name: Create manifest list and push (quay)
-        working-directory: /tmp/digests
-        run: |
-          set -euo pipefail
-          tags=$(jq -cr '.tags | map(select(startswith("quay.io/"))) | map("-t " + .) | join(" ")' <<< "$DOCKER_METADATA_OUTPUT_JSON")
-          if [ -z "$tags" ]; then
-            echo "No quay.io tags from docker/metadata-action; skipping quay merge"
-          else
-            # shellcheck disable=SC2086
-            docker buildx imagetools create $tags \
-              $(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
-          fi
-
-      - name: Create manifest list and push (dockerhub)
-        if: github.event_name != 'pull_request'
-        working-directory: /tmp/digests
-        run: |
-          set -euo pipefail
-          tags=$(jq -cr '.tags | map(select(startswith("localai/"))) | map("-t " + .) | join(" ")' <<< "$DOCKER_METADATA_OUTPUT_JSON")
-          if [ -z "$tags" ]; then
-            echo "No dockerhub tags from docker/metadata-action; skipping dockerhub merge"
-          else
-            # shellcheck disable=SC2086
-            docker buildx imagetools create $tags \
-              $(printf 'localai/localai@sha256:%s ' *)
-          fi
-
-      - name: Inspect manifest
-        run: |
-          set -euo pipefail
-          first_tag=$(jq -cr '.tags[0]' <<< "$DOCKER_METADATA_OUTPUT_JSON")
-          if [ -n "$first_tag" ] && [ "$first_tag" != "null" ]; then
-            docker buildx imagetools inspect "$first_tag"
-          fi
-
-      # See .github/scripts/cleanup-keepalive-tags.sh for the best-effort
-      # semantics — fails soft when the registry credential isn't OAuth-scoped.
-      - name: Cleanup keepalive tags in ci-cache
-        if: github.event_name != 'pull_request' && success()
-        env:
-          TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
-          QUAY_TOKEN: ${{ secrets.quayPassword }}
-        run: .github/scripts/cleanup-keepalive-tags.sh
-
-      - name: Job summary
-        run: |
-          set -euo pipefail
-          echo "Merged manifest tags:" >> "$GITHUB_STEP_SUMMARY"
-          jq -r '.tags[]' <<< "$DOCKER_METADATA_OUTPUT_JSON" | sed 's/^/- /' >> "$GITHUB_STEP_SUMMARY"
-          echo >> "$GITHUB_STEP_SUMMARY"
-          echo "Per-arch digests:" >> "$GITHUB_STEP_SUMMARY"
-          ls -1 /tmp/digests | sed 's/^/- sha256:/' >> "$GITHUB_STEP_SUMMARY"
--- a/.github/workflows/lint.yml
+++ b/.github/workflows/lint.yml
@@ -1,48 +0,0 @@
---
-name: 'lint'
-
-on:
-  pull_request:
-    paths-ignore:
-      - 'docs/**'
-      - 'examples/**'
-      - 'README.md'
-      - '**/*.md'
-  push:
-    branches:
-      - master
-
-concurrency:
-  group: ci-lint-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-
-jobs:
-  golangci-lint:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v6
-        with:
-          # Full history so golangci-lint's new-from-merge-base can reach
-          # origin/master and compute the diff against it.
-          fetch-depth: 0
-      - uses: actions/setup-go@v5
-        with:
-          go-version: '1.26.x'
-          cache: false
-      - name: install golangci-lint
-        run: |
-          curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh \
-            | sh -s -- -b "$(go env GOPATH)/bin" v2.11.4
-      - name: generate grpc proto sources
-        # pkg/grpc/proto/*.go is generated, not checked in. Several packages
-        # import it, so without this step typecheck fails project-wide.
-        run: make protogen-go
-      - name: stub react-ui dist for go:embed
-        # core/http/app.go has //go:embed react-ui/dist/*; the glob needs at
-        # least one non-hidden entry to satisfy typecheck. We don't run
-        # `make react-ui` here because lint doesn't need the real bundle.
-        run: |
-          mkdir -p core/http/react-ui/dist
-          touch core/http/react-ui/dist/index.html
-      - name: lint
-        run: make lint
--- a/.github/workflows/release.yaml
+++ b/.github/workflows/release.yaml
@@ -49,8 +49,6 @@ jobs:
        uses: actions/checkout@v6
        with:
          fetch-depth: 0
-      - name: Configure apt mirror on runner
-        uses: ./.github/actions/configure-apt-mirror
      - name: Set up Go
        uses: actions/setup-go@v5
        with:
--- a/.github/workflows/secscan.yaml
+++ b/.github/workflows/secscan.yaml
@@ -18,13 +18,10 @@ jobs:
        if: ${{ github.actor != 'dependabot[bot]' }}
      - name: Run Gosec Security Scanner
        if: ${{ github.actor != 'dependabot[bot]' }}
-        uses: securego/gosec@v2.27.1
+        uses: securego/gosec@v2.22.9
        with:
          # we let the report trigger content trigger a failure using the GitHub Security features.
-          # backend/go/supertonic is excluded: it vendors upstream supertone-inc/supertonic
-          # (helper.go), whose findings (G304 model-file loads, G404 math/rand for flow-matching
-          # noise, G104 unhandled errors) are inherent to that upstream code, not ours to rewrite.
-          args: '-no-fail -exclude-dir=backend/go/supertonic -fmt sarif -out results.sarif ./...'
+          args: '-no-fail -fmt sarif -out results.sarif ./...'
      - name: Upload SARIF file
        if: ${{ github.actor != 'dependabot[bot]' }}
        uses: github/codeql-action/upload-sarif@v4
--- a/.github/workflows/stalebot.yml
+++ b/.github/workflows/stalebot.yml
@@ -11,7 +11,7 @@ jobs:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/stale@eb5cf3af3ac0a1aa4c9c45633dd1ae542a27a899 # v9
+      - uses: actions/stale@b5d41d4e1d5dceea10e7104786b73624c18a190f # v9
        with:
          stale-issue-message: 'This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.'
          stale-pr-message: 'This PR is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 10 days.'
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
@@ -10,8 +10,8 @@ on:
      - '*'

 concurrency:
-  group: ci-tests-extra-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+  group: ci-tests-extra-${{ github.head_ref || github.ref }}-${{ github.repository }}
+  cancel-in-progress: true

 jobs:
  detect-changes:
@@ -28,7 +28,6 @@ jobs:
      qwen-asr: ${{ steps.detect.outputs.qwen-asr }}
      nemo: ${{ steps.detect.outputs.nemo }}
      voxcpm: ${{ steps.detect.outputs.voxcpm }}
-      liquid-audio: ${{ steps.detect.outputs.liquid-audio }}
      llama-cpp-quantization: ${{ steps.detect.outputs.llama-cpp-quantization }}
      llama-cpp: ${{ steps.detect.outputs.llama-cpp }}
      ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
@@ -37,17 +36,8 @@ jobs:
      sglang: ${{ steps.detect.outputs.sglang }}
      acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
      qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
-      rfdetr-cpp: ${{ steps.detect.outputs.rfdetr-cpp }}
-      locate-anything-cpp: ${{ steps.detect.outputs.locate-anything-cpp }}
-      vibevoice-cpp: ${{ steps.detect.outputs.vibevoice-cpp }}
-      localvqe: ${{ steps.detect.outputs.localvqe }}
      voxtral: ${{ steps.detect.outputs.voxtral }}
      kokoros: ${{ steps.detect.outputs.kokoros }}
-      insightface: ${{ steps.detect.outputs.insightface }}
-      speaker-recognition: ${{ steps.detect.outputs.speaker-recognition }}
-      sherpa-onnx: ${{ steps.detect.outputs.sherpa-onnx }}
-      whisper: ${{ steps.detect.outputs.whisper }}
-      parakeet-cpp: ${{ steps.detect.outputs.parakeet-cpp }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6
@@ -451,32 +441,6 @@ jobs:
        run: |
          make --jobs=5 --output-sync=target -C backend/python/voxcpm
          make --jobs=5 --output-sync=target -C backend/python/voxcpm test
-  # liquid-audio: LFM2.5-Audio any-to-any backend. The CI smoke test
-  # exercises Health() and LoadModel(mode:finetune) — fine-tune mode
-  # short-circuits before pulling weights (backend.py:192), so no
-  # HuggingFace download or GPU is needed. The full-inference path is
-  # gated on LIQUID_AUDIO_MODEL_ID, which we don't set here.
-  tests-liquid-audio:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.liquid-audio == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y build-essential ffmpeg
-          sudo apt-get install -y ca-certificates cmake curl patch python3-pip
-          # Install UV
-          curl -LsSf https://astral.sh/uv/install.sh | sh
-          pip install --user --no-cache-dir grpcio-tools==1.64.1
-      - name: Test liquid-audio
-        run: |
-          make --jobs=5 --output-sync=target -C backend/python/liquid-audio
-          make --jobs=5 --output-sync=target -C backend/python/liquid-audio test
  tests-llama-cpp-quantization:
    needs: detect-changes
    if: needs.detect-changes.outputs.llama-cpp-quantization == 'true' || needs.detect-changes.outputs.run-all == 'true'
@@ -540,140 +504,6 @@ jobs:
      - name: Build llama-cpp backend image and run audio transcription gRPC e2e tests
        run: |
          make test-extra-backend-llama-cpp-transcription
-  # PR-acceptance smoke gate: always runs on every PR (no detect-changes gate, no
-  # paths filter). Pulls the pre-built master CPU llama-cpp image from quay
-  # instead of building from source, so the cost is a docker pull (~30s) plus the
-  # short Qwen3-0.6B model download. Exercises the full gRPC surface — health,
-  # load, predict, stream — plus the logprobs/logit_bias specs that moved out of
-  # core/http/app_test.go. Anything heavier or per-backend is gated to the
-  # detect-changes path-filter above.
-  tests-llama-cpp-smoke:
-    runs-on: ubuntu-latest
-    timeout-minutes: 20
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: '1.25.4'
-      - name: Pull pre-built llama-cpp backend image
-        run: docker pull quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp
-      - name: Run e2e-backends smoke
-        env:
-          BACKEND_IMAGE: quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp
-          BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias,tokenize
-        run: |
-          make test-extra-backend
-  # Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked LLM.
-  # Builds the sherpa-onnx Docker image, extracts the rootfs so the e2e suite
-  # can discover the backend binary + shared libs, downloads the three model
-  # bundles (silero-vad, omnilingual-asr, vits-ljs) and drives the realtime
-  # websocket spec end-to-end.
-  tests-sherpa-onnx-realtime:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    timeout-minutes: 90
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: '1.25.4'
-      - name: Setup Node.js
-        uses: actions/setup-node@v6
-        with:
-          node-version: '22'
-      - name: Build sherpa-onnx backend image and run realtime e2e tests
-        run: |
-          make test-extra-e2e-realtime-sherpa
-  # Streaming ASR via the sherpa-onnx online recognizer (zipformer
-  # transducer). Exercises both AudioTranscription (buffered) and
-  # AudioTranscriptionStream (real-time deltas) on the e2e-backends
-  # harness.
-  tests-sherpa-onnx-grpc-transcription:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    timeout-minutes: 90
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: '1.25.4'
-      - name: Build sherpa-onnx backend image and run streaming ASR gRPC e2e tests
-        run: |
-          make test-extra-backend-sherpa-onnx-transcription
-  # End-to-end transcription via the e2e-backends gRPC harness against
-  # the whisper.cpp backend. Drives AudioTranscription (offline) and
-  # AudioTranscriptionStream (real, segment-callback-driven deltas) on
-  # ggml-base.en + the JFK 11s clip.
-  tests-whisper-grpc-transcription:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.whisper == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    timeout-minutes: 90
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: '1.25.4'
-      - name: Build whisper backend image and run transcription gRPC e2e tests
-        run: |
-          make test-extra-backend-whisper-transcription
-  # Parakeet ASR via the parakeet-cpp backend (C++/ggml port of NeMo
-  # Parakeet). Drives AudioTranscription (offline, with word timestamps) on
-  # tdt_ctc-110m + the JFK 11s clip.
-  tests-parakeet-cpp-grpc-transcription:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.parakeet-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    timeout-minutes: 90
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: '1.25.4'
-      - name: Build parakeet-cpp backend image and run transcription gRPC e2e tests
-        run: |
-          make test-extra-backend-parakeet-cpp-transcription
-  # VITS TTS via the sherpa-onnx backend. Drives both TTS (file write) and
-  # TTSStream (PCM chunks) on the e2e-backends harness.
-  tests-sherpa-onnx-grpc-tts:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    timeout-minutes: 90
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: '1.25.4'
-      - name: Build sherpa-onnx backend image and run TTS gRPC e2e tests
-        run: |
-          make test-extra-backend-sherpa-onnx-tts
  tests-ik-llama-cpp-grpc:
    needs: detect-changes
    if: needs.detect-changes.outputs.ik-llama-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
@@ -866,192 +696,6 @@ jobs:
      - name: Test qwen3-tts-cpp
        run: |
          make --jobs=5 --output-sync=target -C backend/go/qwen3-tts-cpp test
-  # Per-backend smoke for rfdetr-cpp: builds the .so + Go binary and runs
-  # `make -C backend/go/rfdetr-cpp test`. test.sh fetches the small (~20 MB)
-  # rfdetr-nano-q8_0 GGUF from the published mudler/rfdetr-cpp-nano HF repo
-  # via curl and synthesises a tiny PNG to exercise the wire protocol.
-  tests-rfdetr-cpp:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.rfdetr-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y build-essential cmake curl libopenblas-dev
-      - name: Setup Go
-        uses: actions/setup-go@v5
-      - name: Display Go version
-        run: go version
-      - name: Proto Dependencies
-        run: |
-          # Install protoc
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Build rfdetr-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/rfdetr-cpp
-      - name: Test rfdetr-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/rfdetr-cpp test
-  # Per-backend e2e for locate-anything-cpp: builds the .so + Go binary and
-  # runs `make -C backend/go/locate-anything-cpp test`. test.sh fetches the
-  # locate-anything-q8_0 GGUF (~6.3 GB, NVIDIA LocateAnything-3B) from the
-  # published mudler/locate-anything.cpp-gguf HF repo + a COCO image, then the
-  # Go wire test loads the model and runs an open-vocabulary Detect, asserting
-  # at least one labeled box. Heavier than the other Go backends (it is a 3B),
-  # so it is gated to changes under backend/go/locate-anything-cpp/.
-  tests-locate-anything-cpp:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.locate-anything-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y build-essential cmake curl libopenblas-dev
-      - name: Setup Go
-        uses: actions/setup-go@v5
-      - name: Display Go version
-        run: go version
-      - name: Proto Dependencies
-        run: |
-          # Install protoc
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Build locate-anything-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/locate-anything-cpp
-      - name: Test locate-anything-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/locate-anything-cpp test
-  # Per-backend smoke for vibevoice-cpp: builds the .so + Go binary and
-  # runs `make -C backend/go/vibevoice-cpp test`. test.sh auto-downloads
-  # the published mudler/vibevoice.cpp-models bundle (TTS Q8_0 + ASR Q4_K
-  # + tokenizer + voice) and runs the closed-loop TTS → ASR Go test.
-  tests-vibevoice-cpp:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    timeout-minutes: 90
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y build-essential cmake curl libopenblas-dev ffmpeg
-      - name: Setup Go
-        uses: actions/setup-go@v5
-      - name: Display Go version
-        run: go version
-      - name: Proto Dependencies
-        run: |
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Build vibevoice-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/vibevoice-cpp
-      - name: Test vibevoice-cpp
-        run: |
-          make --jobs=5 --output-sync=target -C backend/go/vibevoice-cpp test
-  # End-to-end TTS via the e2e-backends gRPC harness. Builds the
-  # vibevoice-cpp Docker image and drives Backend/TTS against it with a
-  # real LocalAI gRPC client.
-  tests-vibevoice-cpp-grpc-tts:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    timeout-minutes: 90
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: '1.25.4'
-      - name: Build vibevoice-cpp backend image and run TTS gRPC e2e tests
-        run: |
-          make test-extra-backend-vibevoice-cpp-tts
-  # End-to-end transcription via the e2e-backends gRPC harness. The
-  # vibevoice ASR is a 7B-param model (Q4_K weights ~10 GB on disk)
-  # and the JFK 30 s decode is too heavy for a free 4-core
-  # ubuntu-latest pool runner - two CI attempts got SIGTERM'd during
-  # LoadModel, before the test could even progress. Use the
-  # self-hosted 'bigger-runner' label (same one the GPU image builds
-  # in backend.yml use) and the documented dotnet/ghc/android cache
-  # purge to clear ~10-20 GB of headroom for the model + Docker
-  # image + working dir.
-  tests-vibevoice-cpp-grpc-transcription:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: bigger-runner
-    timeout-minutes: 150
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y --no-install-recommends \
-              make build-essential curl unzip ca-certificates git tar
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: '1.25.4'
-      - name: Free disk space
-        run: |
-          sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
-          df -h
-      - name: Build vibevoice-cpp backend image and run ASR gRPC e2e tests
-        run: |
-          make test-extra-backend-vibevoice-cpp-transcription
-  # End-to-end audio transform via the e2e-backends gRPC harness. The
-  # LocalVQE GGUF is small (~5 MB) and the model is real-time on CPU, so
-  # the default ubuntu-latest pool is plenty.
-  tests-localvqe-grpc-transform:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.localvqe == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    timeout-minutes: 60
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: '1.25.4'
-      - name: Build localvqe backend image and run audio_transform gRPC e2e tests
-        run: |
-          make test-extra-backend-localvqe-transform
  tests-voxtral:
    needs: detect-changes
    if: needs.detect-changes.outputs.voxtral == 'true' || needs.detect-changes.outputs.run-all == 'true'
@@ -1107,55 +751,3 @@ jobs:
      - name: Test kokoros
        run: |
          make -C backend/rust/kokoros test
-  tests-insightface-grpc:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.insightface == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    timeout-minutes: 90
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y --no-install-recommends \
-              make build-essential curl unzip ca-certificates git tar
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: '1.26.0'
-      - name: Free disk space
-        run: |
-          sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
-          df -h
-      - name: Build insightface backend image and run both model configurations
-        run: |
-          make test-extra-backend-insightface-all
-  tests-speaker-recognition-grpc:
-    needs: detect-changes
-    if: needs.detect-changes.outputs.speaker-recognition == 'true' || needs.detect-changes.outputs.run-all == 'true'
-    runs-on: ubuntu-latest
-    timeout-minutes: 90
-    steps:
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y --no-install-recommends \
-              make build-essential curl ca-certificates git tar
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: '1.26.0'
-      - name: Free disk space
-        run: |
-          sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
-          df -h
-      - name: Build speaker-recognition backend image and run the ECAPA-TDNN configuration
-        run: |
-          make test-extra-backend-speaker-recognition-all
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -9,9 +9,12 @@ on:
    tags:
      - '*'

+env:
+  GRPC_VERSION: v1.65.0
+
 concurrency:
-  group: ci-tests-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+  group: ci-tests-${{ github.head_ref || github.ref }}-${{ github.repository }}
+  cancel-in-progress: true

 jobs:
  tests-linux:
@@ -20,12 +23,56 @@ jobs:
      matrix:
        go-version: ['1.26.x']
    steps:
+      - name: Free Disk Space (Ubuntu)
+        uses: jlumbroso/free-disk-space@main
+        with:
+          # this might remove tools that are actually needed,
+          # if set to "true" but frees about 6 GB
+          tool-cache: true
+          # all of these default to true, but feel free to set to
+          # "false" if necessary for your workflow
+          android: true
+          dotnet: true
+          haskell: true
+          large-packages: true
+          docker-images: true
+          swap-storage: true
+      - name: Release space from worker
+        run: |
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          df -h
+          echo
+          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
+          sudo apt-get remove --auto-remove android-sdk-platform-tools || true
+          sudo apt-get purge --auto-remove android-sdk-platform-tools || true
+          sudo rm -rf /usr/local/lib/android
+          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
+          sudo rm -rf /usr/share/dotnet
+          sudo apt-get remove -y '^mono-.*' || true
+          sudo apt-get remove -y '^ghc-.*' || true
+          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
+          sudo apt-get remove -y 'php.*' || true
+          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
+          sudo apt-get remove -y '^google-.*' || true
+          sudo apt-get remove -y azure-cli || true
+          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
+          sudo apt-get remove -y '^gfortran-.*' || true
+          sudo apt-get autoremove -y
+          sudo apt-get clean
+          echo
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          sudo rm -rfv build || true
+          df -h
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
-      - name: Free disk space
-        uses: ./.github/actions/free-disk-space
      - name: Setup Go ${{ matrix.go-version }}
        uses: actions/setup-go@v5
        with:
@@ -53,22 +100,73 @@ jobs:
          node-version: '22'
      - name: Build React UI
        run: make react-ui
-      # Runs the core suite with coverage and fails if total coverage dropped
-      # below the committed baseline (coverage-baseline.txt). The gate is
-      # strict — any decrease fails. Raise the baseline with
-      # `make test-coverage-baseline` and commit it when coverage rises.
-      - name: Test (with coverage gate)
+      - name: Build backends
        run: |
-          PATH="$PATH:/root/go/bin" make --jobs 5 --output-sync=target test-coverage-check
-      - name: Upload coverage report
-        if: ${{ always() }}
-        uses: actions/upload-artifact@v4
+          make backends/transformers
+          mkdir external && mv backends/transformers external/transformers
+          make backends/llama-cpp backends/local-store backends/silero-vad backends/piper backends/whisper backends/stablediffusion-ggml
+      - name: Test
+        run: |
+          TRANSFORMER_BACKEND=$PWD/external/transformers/run.sh PATH="$PATH:/root/go/bin" GO_TAGS="tts" make --jobs 5 --output-sync=target test
+      - name: Setup tmate session if tests fail
+        if: ${{ failure() }}
+        uses: mxschmitt/action-tmate@v3.23
        with:
-          name: coverage-linux
-          path: |
-            coverage/coverage.out
-            coverage/coverage.html
-          if-no-files-found: ignore
+          detached: true
+          connect-timeout-seconds: 180
+          limit-access-to-actor: true
+
+  tests-e2e-container:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Release space from worker
+        run: |
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          df -h
+          echo
+          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
+          sudo apt-get remove --auto-remove android-sdk-platform-tools || true
+          sudo apt-get purge --auto-remove android-sdk-platform-tools || true
+          sudo rm -rf /usr/local/lib/android
+          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
+          sudo rm -rf /usr/share/dotnet
+          sudo apt-get remove -y '^mono-.*' || true
+          sudo apt-get remove -y '^ghc-.*' || true
+          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
+          sudo apt-get remove -y 'php.*' || true
+          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
+          sudo apt-get remove -y '^google-.*' || true
+          sudo apt-get remove -y azure-cli || true
+          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
+          sudo apt-get remove -y '^gfortran-.*' || true
+          sudo apt-get autoremove -y
+          sudo apt-get clean
+          echo
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          sudo rm -rfv build || true
+          df -h
+      - name: Clone
+        uses: actions/checkout@v6
+        with:
+          submodules: true
+      - name: Dependencies
+        run: |
+          # Install protoc
+          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
+          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
+          rm protoc.zip
+          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
+          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
+          PATH="$PATH:$HOME/go/bin" make protogen-go
+      - name: Test
+        run: |
+          PATH="$PATH:$HOME/go/bin" make backends/local-store backends/silero-vad backends/llama-cpp backends/whisper backends/piper backends/stablediffusion-ggml docker-build-e2e e2e-aio
      - name: Setup tmate session if tests fail
        if: ${{ failure() }}
        uses: mxschmitt/action-tmate@v3.23
@@ -97,7 +195,7 @@ jobs:
        run: go version
      - name: Dependencies
        run: |
-          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm opus ffmpeg
+          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm opus
          pip install --user --no-cache-dir grpcio-tools grpcio
      - name: Setup Node.js
        uses: actions/setup-node@v6
@@ -105,6 +203,10 @@ jobs:
          node-version: '22'
      - name: Build React UI
        run: make react-ui
+      - name: Build llama-cpp-darwin
+        run: |
+          make protogen-go
+          make backends/llama-cpp-darwin
      - name: Test
        run: |
          export C_INCLUDE_PATH=/usr/local/include
--- a/.github/workflows/tests-aio.yml
+++ b/.github/workflows/tests-aio.yml
@@ -1,86 +0,0 @@
---
-name: 'tests-aio'
-
-# Runs the all-in-one (AIO) Docker image with real backends + real models.
-# Heavy: builds llama-cpp/whisper/piper/silero-vad/stablediffusion-ggml/local-store
-# and exercises end-to-end inference inside the container. Moved out of test.yml
-# (which used to run on every PR) so PR CI no longer pays this cost.
-#
-# Triggers:
-#   - schedule (nightly @ 04:00 UTC) — catches packaging/image regressions within 24h
-#   - workflow_dispatch — manual run on-demand
-#   - push to master/tags — sanity check after merge / before release
-
-on:
-  schedule:
-    - cron: '0 4 * * *'
-  workflow_dispatch:
-  push:
-    branches:
-      - master
-    tags:
-      - '*'
-
-concurrency:
-  group: ci-tests-aio-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-
-jobs:
-  tests-aio:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Release space from worker
-        run: |
-          echo "Listing top largest packages"
-          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
-          head -n 30 <<< "${pkgs}"
-          echo
-          df -h
-          echo
-          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
-          sudo apt-get remove --auto-remove android-sdk-platform-tools || true
-          sudo apt-get purge --auto-remove android-sdk-platform-tools || true
-          sudo rm -rf /usr/local/lib/android
-          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
-          sudo rm -rf /usr/share/dotnet
-          sudo apt-get remove -y '^mono-.*' || true
-          sudo apt-get remove -y '^ghc-.*' || true
-          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
-          sudo apt-get remove -y 'php.*' || true
-          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
-          sudo apt-get remove -y '^google-.*' || true
-          sudo apt-get remove -y azure-cli || true
-          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
-          sudo apt-get remove -y '^gfortran-.*' || true
-          sudo apt-get autoremove -y
-          sudo apt-get clean
-          echo
-          echo "Listing top largest packages"
-          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
-          head -n 30 <<< "${pkgs}"
-          echo
-          sudo rm -rfv build || true
-          df -h
-      - name: Clone
-        uses: actions/checkout@v6
-        with:
-          submodules: true
-      - name: Dependencies
-        run: |
-          # Install protoc
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Test
-        run: |
-            PATH="$PATH:$HOME/go/bin" make backends/local-store backends/silero-vad backends/llama-cpp backends/whisper backends/piper backends/stablediffusion-ggml docker-build-e2e e2e-aio
-      - name: Setup tmate session if tests fail
-        if: ${{ failure() }}
-        uses: mxschmitt/action-tmate@v3.23
-        with:
-          detached: true
-          connect-timeout-seconds: 180
-          limit-access-to-actor: true
--- a/.github/workflows/tests-e2e.yml
+++ b/.github/workflows/tests-e2e.yml
@@ -10,8 +10,8 @@ on:
      - '*'

 concurrency:
-  group: ci-tests-e2e-backend-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+  group: ci-tests-e2e-backend-${{ github.head_ref || github.ref }}-${{ github.repository }}
+  cancel-in-progress: true

 jobs:
  tests-e2e-backend:
@@ -24,8 +24,6 @@ jobs:
        uses: actions/checkout@v6
        with:
          submodules: true
-      - name: Configure apt mirror on runner
-        uses: ./.github/actions/configure-apt-mirror
      - name: Setup Go ${{ matrix.go-version }}
        uses: actions/setup-go@v5
        with:
--- a/.github/workflows/tests-ui-e2e.yml
+++ b/.github/workflows/tests-ui-e2e.yml
@@ -12,8 +12,8 @@ on:
      - master

 concurrency:
-  group: ci-tests-ui-e2e-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+  group: ci-tests-ui-e2e-${{ github.head_ref || github.ref }}-${{ github.repository }}
+  cancel-in-progress: true

 jobs:
  tests-ui-e2e:
@@ -26,8 +26,6 @@ jobs:
        uses: actions/checkout@v6
        with:
          submodules: true
-      - name: Configure apt mirror on runner
-        uses: ./.github/actions/configure-apt-mirror
      - name: Setup Go ${{ matrix.go-version }}
        uses: actions/setup-go@v5
        with:
@@ -37,10 +35,6 @@ jobs:
        uses: actions/setup-node@v6
        with:
          node-version: '22'
-      - name: Setup Bun
-        uses: oven-sh/setup-bun@v2
-        with:
-          bun-version: '1.3.11'
      - name: Proto Dependencies
        run: |
          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
@@ -52,12 +46,16 @@ jobs:
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential libopus-dev
-      # Builds an instrumented UI bundle, runs the Playwright specs, and fails
-      # if line coverage regressed beyond the jitter tolerance (the gate is
-      # in `make test-ui-coverage-check`). PLAYWRIGHT_CHROMIUM_PATH is unset
-      # here, so scripts/ensure-playwright-browser.sh installs Chromium via apt.
-      - name: Run UI e2e + coverage gate
-        run: PATH="$PATH:$HOME/go/bin" make test-ui-coverage-check
+      - name: Build UI test server
+        run: PATH="$PATH:$HOME/go/bin" make build-ui-test-server
+      - name: Install Playwright
+        working-directory: core/http/react-ui
+        run: |
+          npm install
+          npx playwright install --with-deps chromium
+      - name: Run Playwright tests
+        working-directory: core/http/react-ui
+        run: npx playwright test
      - name: Upload Playwright report
        if: ${{ failure() }}
        uses: actions/upload-artifact@v7
@@ -65,14 +63,6 @@ jobs:
          name: playwright-report
          path: core/http/react-ui/playwright-report/
          retention-days: 7
-      - name: Upload UI coverage report
-        if: ${{ always() }}
-        uses: actions/upload-artifact@v7
-        with:
-          name: ui-coverage
-          path: core/http/react-ui/coverage/
-          if-no-files-found: ignore
-          retention-days: 7
      - name: Setup tmate session if tests fail
        if: ${{ failure() }}
        uses: mxschmitt/action-tmate@v3.23
--- a/.github/workflows/update_swagger.yaml
+++ b/.github/workflows/update_swagger.yaml
@@ -11,8 +11,6 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
-      - name: Configure apt mirror on runner
-        uses: ./.github/actions/configure-apt-mirror
      - uses: actions/setup-go@v5
        with:
          go-version: 'stable'
--- a/.gitignore
+++ b/.gitignore
@@ -26,10 +26,6 @@ go-bert
 LocalAI
 /local-ai
 /local-ai-launcher
-# Root-level build artifacts when running `go build ./...` against
-# Go backend packages whose main lives under backend/go/.
-/cloud-proxy
-/local-store
 # prevent above rules from omitting the helm chart
 !charts/*
 # prevent above rules from omitting the api/localai folder
@@ -70,17 +66,10 @@ docs/static/gallery.html
 # per-developer customization files for the development container
 .devcontainer/customization/*

-# Coverage profiles (the committed baseline is coverage-baseline.txt)
-/coverage/
-
 # React UI build artifacts (keep placeholder dist/index.html)
 core/http/react-ui/node_modules/
 core/http/react-ui/dist

-# React UI coverage (vite-plugin-istanbul + nyc, via `make test-ui-coverage`)
-core/http/react-ui/.nyc_output/
-core/http/react-ui/coverage/
-
 # Extracted backend binaries for container-based testing
 local-backends/

@@ -88,6 +77,3 @@ local-backends/
 tests/e2e-ui/ui-test-server
 core/http/react-ui/playwright-report/
 core/http/react-ui/test-results/
-
-# Local worktrees
-.worktrees/
--- a/.golangci.yml
+++ b/.golangci.yml
@@ -1,128 +0,0 @@
-version: "2"
-
-# Only issues introduced relative to master are reported. Pre-existing issues
-# in the codebase do not fail the lint job; they're treated as a baseline that
-# can be cleaned up incrementally. New code (added lines on a branch) is held
-# to the full linter set. Locally, `make lint-all` overrides this and reports
-# every issue.
-issues:
-  # origin/master because in shallow CI checkouts only the remote-tracking
-  # branch exists; a bare 'master' ref isn't reachable locally.
-  new-from-merge-base: origin/master
-
-linters:
-  default: standard
-  # staticcheck is noisy on this codebase (mostly QF style suggestions like
-  # "could use tagged switch" or "unnecessary fmt.Sprintf"). Re-enable
-  # selectively if a high-signal subset is identified.
-  disable:
-    - staticcheck
-  enable:
-    - forbidigo
-  settings:
-    forbidigo:
-      forbid:
-        - pattern: '^t\.Errorf$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(...) instead of t.Errorf. See .agents/coding-style.md.'
-        - pattern: '^t\.Error$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(...) instead of t.Error. See .agents/coding-style.md.'
-        - pattern: '^t\.Fatalf$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(Succeed()) / Fail(...) instead of t.Fatalf. See .agents/coding-style.md.'
-        - pattern: '^t\.Fatal$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(Succeed()) / Fail(...) instead of t.Fatal. See .agents/coding-style.md.'
-        - pattern: '^t\.Run$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Describe/Context/It instead of t.Run. See .agents/coding-style.md.'
-        - pattern: '^t\.Skip$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Skip(...) instead of t.Skip. See .agents/coding-style.md.'
-        - pattern: '^t\.Skipf$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Skip(...) instead of t.Skipf. See .agents/coding-style.md.'
-        - pattern: '^t\.SkipNow$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Skip(...) instead of t.SkipNow. See .agents/coding-style.md.'
-        - pattern: '^t\.Logf$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use GinkgoWriter / fmt.Fprintf(GinkgoWriter, ...) instead of t.Logf. See .agents/coding-style.md.'
-        - pattern: '^t\.Log$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use GinkgoWriter / fmt.Fprintln(GinkgoWriter, ...) instead of t.Log. See .agents/coding-style.md.'
-        - pattern: '^t\.Fail$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.Fail. See .agents/coding-style.md.'
-        - pattern: '^t\.FailNow$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.FailNow. See .agents/coding-style.md.'
-        # In-process config should flow through ApplicationConfig / kong-bound
-        # CLI flags, not via os.Getenv. The CLI layer is the legitimate
-        # env→struct boundary (kong's `env:"..."` tag); anything deeper that
-        # reads env directly leaks process state into business logic and
-        # makes flags impossible to test or override per-request. Backend
-        # subprocesses, the system/capabilities probe, and a few places that
-        # read non-LocalAI env vars (HOME, PATH, AUTH_TOKEN passed by parent)
-        # are exempt — see linters.exclusions.rules below.
-        - pattern: '^os\.(Getenv|LookupEnv|Environ)$'
-          msg: 'Plumb config through ApplicationConfig (or the relevant CLI struct) instead of reading env directly. CLI entry points (core/cli/) bind env vars via kong''s `env:` tag — that is the only sanctioned env→struct boundary. See .agents/coding-style.md.'
-        # Outbound HTTP must go through pkg/httpclient, which refuses redirects
-        # by default and sets a TLS floor. The std-library default client and
-        # the http.Get/Post/... convenience helpers follow redirects (up to 10)
-        # and, on a cross-host redirect, forward custom credential headers such
-        # as Anthropic's x-api-key to the redirect target — leaking the secret
-        # (GHSA-3mj3-57v2-4636). forbidigo can't precisely match the
-        # `&http.Client{}` composite literal without also flagging legitimate
-        # `*http.Client` type references, so that form is enforced by
-        # convention + review; these two patterns catch the implicit-default
-        # client, which is the common footgun.
-        - pattern: '^http\.DefaultClient$'
-          msg: 'Use pkg/httpclient (httpclient.New / NewWithTimeout) instead of http.DefaultClient — the std client follows redirects and leaks credential headers cross-host (GHSA-3mj3-57v2-4636). See .agents/coding-style.md.'
-        - pattern: '^http\.(Get|Post|PostForm|Head)$'
-          msg: 'Use pkg/httpclient (httpclient.New / NewWithTimeout) instead of http.Get/Post/PostForm/Head — these use http.DefaultClient, which follows redirects and leaks credential headers cross-host (GHSA-3mj3-57v2-4636). See .agents/coding-style.md.'
-  exclusions:
-    paths:
-      # Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
-      - 'backend/go/whisper/sources'
-      # Vendored upstream supertonic pipeline (supertone-inc/supertonic go/helper.go).
-      - 'backend/go/supertonic/helper.go'
-      - 'docs/'
-    rules:
-      # CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
-      # boundary, and a handful of subcommands legitimately propagate values
-      # to spawned subprocesses (LLAMACPP_GRPC_SERVERS, MLX hostfile, ...).
-      - path: ^core/cli/
-        text: 'os\.(Getenv|LookupEnv|Environ)'
-        linters: [forbidigo]
-      # Backend subprocesses are independent binaries with their own env
-      # surface; they're not "in-process config" of the LocalAI server.
-      - path: ^backend/
-        text: 'os\.(Getenv|LookupEnv|Environ)'
-        linters: [forbidigo]
-      # System capability probe reads HOME, PATH-style vars to discover
-      # GPUs, default paths, etc. — not LocalAI config.
-      - path: ^pkg/system/
-        text: 'os\.(Getenv|LookupEnv|Environ)'
-        linters: [forbidigo]
-      # gRPC server reads AUTH_TOKEN passed in by the parent process at spawn
-      # time; model.Loader sets/inherits env to communicate with subprocesses.
-      - path: ^pkg/grpc/
-        text: 'os\.(Getenv|LookupEnv|Environ)'
-        linters: [forbidigo]
-      - path: ^pkg/model/
-        text: 'os\.(Getenv|LookupEnv|Environ)'
-        linters: [forbidigo]
-      # Top-level main binaries (local-ai, launcher) are entry points.
-      - path: ^cmd/
-        text: 'os\.(Getenv|LookupEnv|Environ)'
-        linters: [forbidigo]
-      # Tests legitimately read $HOME, $TMPDIR, and gating env vars
-      # (LOCALAI_COSIGN_LIVE, etc.) to skip live-network specs.
-      - path: _test\.go$
-        text: 'os\.(Getenv|LookupEnv|Environ)'
-        linters: [forbidigo]
-      # pkg/httpclient is the sanctioned home for outbound HTTP clients; it
-      # necessarily references net/http directly.
-      - path: ^pkg/httpclient/
-        text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
-        linters: [forbidigo]
-      # Tests drive local httptest servers where redirect/TLS hardening is
-      # irrelevant; the std client is fine there.
-      - path: _test\.go$
-        text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
-        linters: [forbidigo]
-      # Vendored upstream whisper.cpp Go bindings are a separate module and
-      # cannot import pkg/httpclient.
-      - path: ^backend/go/whisper/sources/
-        text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
-        linters: [forbidigo]
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,46 +1,26 @@
 # LocalAI Agent Instructions

-This file is the entry point for AI coding assistants (Claude Code, Cursor, Copilot, Codex, Aider, etc.) working on LocalAI. It is an index to detailed topic guides in the `.agents/` directory. Read the relevant file(s) for the task at hand — you don't need to load all of them.
-
-Human contributors: see [CONTRIBUTING.md](CONTRIBUTING.md) for the development workflow.
-
-## Policy for AI-Assisted Contributions
-
-LocalAI follows the Linux kernel project's [guidelines for AI coding assistants](https://docs.kernel.org/process/coding-assistants.html). Before submitting AI-assisted code, read [.agents/ai-coding-assistants.md](.agents/ai-coding-assistants.md). Key rules:
-
-- **No `Signed-off-by` from AI.** Only the human submitter may sign off on the Developer Certificate of Origin.
-- **No `Co-Authored-By: <AI>` trailers.** The human contributor owns the change.
-- **Use an `Assisted-by:` trailer** to attribute AI involvement. Format: `Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]`.
-- **The human submitter is responsible** for reviewing, testing, and understanding every line of generated code.
+This file is an index to detailed topic guides in the `.agents/` directory. Read the relevant file(s) for the task at hand — you don't need to load all of them.

 ## Topics

 | File | When to read |
 |------|-------------|
-| [.agents/ai-coding-assistants.md](.agents/ai-coding-assistants.md) | Policy for AI-assisted contributions — licensing, DCO, attribution |
 | [.agents/building-and-testing.md](.agents/building-and-testing.md) | Building the project, running tests, Docker builds for specific platforms |
-| [.agents/ci-caching.md](.agents/ci-caching.md) | CI build cache layout (registry-backed BuildKit cache on quay.io/go-skynet/ci-cache, per-arch keys), `DEPS_REFRESH` weekly cache-buster for unpinned Python deps, prebuilt `base-grpc-*` images for llama.cpp variants, per-arch native + manifest-merge pattern, `setup-build-disk` `/mnt` relocation, path filter on master push, manual eviction |
-| [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist, including importer integration (the `/import-model` dropdown is server-driven from `GET /backends/known`) |
+| [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist |
 | [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
 | [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
 | [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
-| [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
-| [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
 | [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
 | [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
 | [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
 | [.agents/adding-gallery-models.md](.agents/adding-gallery-models.md) | Adding GGUF models from HuggingFace to the model gallery |
-| [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) | LocalAI Assistant chat modality — adding admin tools to the in-process MCP server, editing skill prompts, keeping REST + MCP + skills in sync |
-| [.agents/backend-signing.md](.agents/backend-signing.md) | Backend OCI image signing (keyless cosign + sigstore-go) — producer-side CI setup, consumer-side gallery `verification:` block, strict mode (`LOCALAI_REQUIRE_BACKEND_INTEGRITY`), revocation via `not_before` |

 ## Quick Reference

-- **Git hooks & coverage gates**: Run `make install-hooks` once per clone so the pre-commit lint + coverage gates run. **Never bypass them with `git commit --no-verify`, and never lower a coverage baseline or widen a gate's tolerance to turn a red gate green** — the coverage ratchet only moves up. If a change drops coverage, add tests to raise it (e.g. render-smoke specs). See [.agents/building-and-testing.md](.agents/building-and-testing.md).
 - **Logging**: Use `github.com/mudler/xlog` (same API as slog)
 - **Go style**: Prefer `any` over `interface{}`
 - **Comments**: Explain *why*, not *what*
 - **Docs**: Update `docs/content/` when adding features or changing config
-- **New API endpoints**: LocalAI advertises its capability surface in several independent places — swagger `@Tags`, `/api/instructions` registry, auth `RouteFeatureRegistry`, React UI `capabilities.js`, docs. Read [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) and follow its checklist — missing any surface means clients, admins, and the UI won't know the endpoint exists.
-- **Admin endpoints → MCP tool**: every admin endpoint that an admin would manage conversationally (install/list/edit/toggle/upgrade) MUST also be exposed as an MCP tool in `pkg/mcp/localaitools/`. The LocalAI Assistant chat modality and the standalone `local-ai mcp-server` consume that package; drift between REST and MCP is a real risk. Read [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) — the `TestToolHTTPRouteMappingComplete` test fails until you wire the new tool and update the route map.
 - **Build**: Inspect `Makefile` and `.github/workflows/` — ask the user before running long builds
 - **UI**: The active UI is the React app in `core/http/react-ui/`. The older Alpine.js/HTML UI in `core/http/static/` is pending deprecation — all new UI work goes in the React UI
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -13,7 +13,6 @@ Thank you for your interest in contributing to LocalAI! We appreciate your time
  - [Development Workflow](#development-workflow)
  - [Creating a Pull Request (PR)](#creating-a-pull-request-pr)
 - [Coding Guidelines](#coding-guidelines)
-- [AI Coding Assistants](#ai-coding-assistants)
 - [Testing](#testing)
 - [Documentation](#documentation)
 - [Community and Communication](#community-and-communication)
@@ -186,7 +185,7 @@ Before jumping into a PR for a massive feature or big change, it is preferred to

 This project uses an [`.editorconfig`](.editorconfig) file to define formatting standards (indentation, line endings, charset, etc.). Please configure your editor to respect it.

-For AI-assisted development, see [`AGENTS.md`](AGENTS.md) (or the equivalent [`CLAUDE.md`](CLAUDE.md) symlink) for agent-specific guidelines including build instructions and backend architecture details. Contributions produced with AI assistance must follow the rules in the [AI Coding Assistants](#ai-coding-assistants) section below.
+For AI-assisted development, see [`CLAUDE.md`](CLAUDE.md) for agent-specific guidelines including build instructions and backend architecture details.

 ### General Principles

@@ -198,7 +197,6 @@ For AI-assisted development, see [`AGENTS.md`](AGENTS.md) (or the equivalent [`C

 - Prefer modern Go idioms — for example, use `any` instead of `interface{}`.
 - Use [`golangci-lint`](https://golangci-lint.run) to catch common issues before submitting a PR.
-- Run `make install-hooks` once per clone to enable the pre-commit hook: Go changes run `make lint` + the coverage gate (`make test-coverage-check`); `core/http/react-ui/` changes run the Playwright e2e suite (`make test-ui`). Bypass a single commit with `git commit --no-verify`.
 - Use [`github.com/mudler/xlog`](https://github.com/mudler/xlog) for logging (same API as `slog`). Do not use `fmt.Println` or the standard `log` package for operational logging.
 - Use tab indentation for Go files (as defined in `.editorconfig`).

@@ -213,26 +211,6 @@ For AI-assisted development, see [`AGENTS.md`](AGENTS.md) (or the equivalent [`C
 - Reviewers will check for correctness, test coverage, adherence to these guidelines, and clarity of intent.
 - Be responsive to review feedback and keep discussions constructive.

-## AI Coding Assistants
-
-LocalAI follows the **same guidelines as the Linux kernel project** for AI-assisted contributions: <https://docs.kernel.org/process/coding-assistants.html>.
-
-The full policy for this repository lives in [`.agents/ai-coding-assistants.md`](.agents/ai-coding-assistants.md). Summary:
-
-- **AI agents MUST NOT add `Signed-off-by` tags.** Only humans can certify the Developer Certificate of Origin.
-- **AI agents MUST NOT add `Co-Authored-By` trailers** attributing themselves as co-authors.
-- **Attribute AI involvement with an `Assisted-by` trailer** in the commit message:
-
-  ```
-  Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]
-  ```
-
-  Example: `Assisted-by: Claude:claude-opus-4-7 golangci-lint`
-
-  Basic development tools (git, go, make, editors) should not be listed.
-- **The human submitter is responsible** for reviewing, testing, and fully understanding every line of AI-generated code — including verifying that any referenced APIs, flags, or file paths actually exist in the tree.
-- Contributions must remain compatible with LocalAI's **MIT License**.
-
 ## Testing

 All new features and bug fixes should include test coverage. The project uses [Ginkgo](https://onsi.github.io/ginkgo/) as its test framework.
@@ -266,12 +244,6 @@ The e2e tests run LocalAI in a Docker container and exercise the API:
 make test-e2e
 ```

-### React UI tests and coverage
-
-The React UI (`core/http/react-ui/`) is covered by Playwright e2e specs, gated by a **monotonic line-coverage ratchet** (`make test-ui-coverage-check`, run in CI and pre-commit). The metric is non-deterministic — a fast local box reads higher than a slow CI runner for the same code — so a small tolerance is unavoidable.
-
-**If your change lowers UI coverage, raise it back by adding specs — do not widen the tolerance or hand-lower the baseline.** A *render-smoke* spec (navigate to a page, assert its header is visible) cheaply covers an entire lazy page. See `core/http/react-ui/e2e/page-render-smoke.spec.js` and the full policy in [.agents/building-and-testing.md](.agents/building-and-testing.md#react-ui-coverage).
-
 ### Running E2E container tests

 These tests build a standard LocalAI Docker image and run it with pre-configured model configs to verify that most endpoints work correctly:
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,20 +1,13 @@
 ARG BASE_IMAGE=ubuntu:24.04
+ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
 ARG INTEL_BASE_IMAGE=${BASE_IMAGE}
 ARG UBUNTU_CODENAME=noble
-# Optional alternate Ubuntu apt mirror(s). Empty = use upstream.
-# See .docker/apt-mirror.sh for accepted values.
-ARG APT_MIRROR=""
-ARG APT_PORTS_MIRROR=""

 FROM ${BASE_IMAGE} AS requirements

-ARG APT_MIRROR
-ARG APT_PORTS_MIRROR
 ENV DEBIAN_FRONTEND=noninteractive

-RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
-    APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
-    apt-get update && \
+RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        ca-certificates curl wget espeak-ng libgomp1 \
        ffmpeg libopenblas0 libopenblas-dev libopus0 sox && \
@@ -108,7 +101,6 @@ RUN <<EOT bash
        apt-get update && \
        apt-get install -y --no-install-recommends \
            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
-            cuda-nvrtc-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
@@ -157,7 +149,6 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
-            hipblaslt-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
@@ -249,14 +240,10 @@ WORKDIR /build
 # This is a temporary workaround until Intel fixes their repository
 FROM ${INTEL_BASE_IMAGE} AS intel
 ARG UBUNTU_CODENAME=noble
-ARG APT_MIRROR
-ARG APT_PORTS_MIRROR
 RUN wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
 gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg
 RUN echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu ${UBUNTU_CODENAME}/lts/2350 unified" > /etc/apt/sources.list.d/intel-graphics.list
-RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
-    APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
-    apt-get update && \
+RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        intel-oneapi-runtime-libs && \
    apt-get clean && \
@@ -306,7 +293,7 @@ EOT
 ###################################

 # Build React UI
-FROM node:26-slim AS react-ui-builder
+FROM node:25-slim AS react-ui-builder
 WORKDIR /app
 COPY core/http/react-ui/package*.json ./
 RUN npm install
--- a/Makefile
+++ b/Makefile
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad

 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -10,13 +10,6 @@ LAUNCHER_BINARY_NAME=local-ai-launcher
 UBUNTU_VERSION?=2404
 UBUNTU_CODENAME?=noble

-# Optional Ubuntu apt mirror overrides forwarded to docker builds.
-# Empty = use upstream archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com.
-# Set e.g. APT_MIRROR=http://azure.archive.ubuntu.com to route apt traffic
-# during outages of the default Ubuntu pool.
-APT_MIRROR?=
-APT_PORTS_MIRROR?=
-
 GORELEASER?=

 export BUILD_TYPE?=
@@ -69,41 +62,10 @@ else
 	GORELEASER=$(shell which goreleaser)
 endif

-TEST_PATHS?=./api/... ./pkg/... ./core/... ./backend/go/cloud-proxy/... ./backend/go/local-store/...
-
-## Coverage output and the committed baseline that CI compares against.
-## The gate is strict: total coverage must never decrease (no tolerance).
-## covermode=atomic makes line coverage deterministic regardless of test
-## ordering or flake retries, so there is no run-to-run jitter to absorb.
-COVERAGE_DIR?=$(abspath ./coverage)
-COVERAGE_PROFILE?=$(COVERAGE_DIR)/coverage.out
-COVERAGE_BASELINE?=coverage-baseline.txt
-## Coverage is collected one recursive root at a time and merged (see
-## scripts/run-coverage.sh): passing several recursive roots to a single
-## ginkgo invocation only keeps one root's coverprofile. Mirrors TEST_PATHS
-## minus ./api (which doesn't exist).
-COVERAGE_ROOTS?=./pkg ./core
-## Build tags for the coverage build. `auth` is required to compile the real
-## auth implementation and its ~150 `//go:build auth` tests (otherwise they're
-## invisible and the gate scores auth against a stub). `debug` matches `test`.
-COVERAGE_TAGS?=debug auth
-## Coverage is attributed to these packages via --coverpkg, so the in-process
-## integration suites (COVERAGE_E2E_ROOTS) credit the core/http handlers they
-## drive over HTTP — not just their own test package.
-COVERAGE_COVERPKG?=github.com/mudler/LocalAI/core/...,github.com/mudler/LocalAI/pkg/...
-## In-process integration suites folded into coverage. Run non-recursively
-## (excludes tests/e2e/distributed, which needs containers) with the mock
-## backend built by prepare-test. real-models specs need a downloaded model,
-## so they're filtered out. NOTE: tests/integration is intentionally NOT here —
-## it needs the local-store backend built (`make backends/local-store`), which
-## the coverage CI job doesn't do.
-COVERAGE_E2E_ROOTS?=./tests/e2e
-COVERAGE_E2E_LABELS?=!real-models
-## Drop generated protobuf from the denominator (it has no tests by design).
-COVERAGE_EXCLUDE_RE?=grpc/proto/.*[.]pb[.]go
+TEST_PATHS?=./api/... ./pkg/... ./core/...


-.PHONY: all test test-coverage test-coverage-baseline test-coverage-check test-ui test-ui-coverage-baseline test-ui-coverage-check install-hooks build vendor lint lint-all
+.PHONY: all test build vendor

 all: help

@@ -123,7 +85,6 @@ clean: ## Remove build related file
 clean-tests:
 	rm -rf test-models
 	rm -rf test-dir
-	rm -f tests/e2e/mock-backend/mock-backend

 ## Install Go tools
 install-go-tools:
@@ -180,104 +141,34 @@ osx-signed: build

 ## Run
 run: ## run local-ai
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./cmd/local-ai
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./

-prepare-test: protogen-go build-mock-backend
+test-models/testmodel.ggml:
+	mkdir -p test-models
+	mkdir -p test-dir
+	wget -q https://huggingface.co/mradermacher/gpt2-alpaca-gpt4-GGUF/resolve/main/gpt2-alpaca-gpt4.Q4_K_M.gguf -O test-models/testmodel.ggml
+	wget -q https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin -O test-models/whisper-en
+	wget -q https://cdn.openai.com/whisper/draft-20220913a/micro-machines.wav -O test-dir/audio.wav
+	cp tests/models_fixtures/* test-models
+
+prepare-test: protogen-go
+	cp tests/models_fixtures/* test-models

 ########################################################
 ## Tests
 ########################################################

 ## Test targets
-## After the test-suite reorg (see plans/test-reorg) the default `make test`
-## no longer downloads multi-GB GGUF/whisper fixtures or builds llama-cpp /
-## transformers / piper / whisper / stablediffusion-ggml. core/http/app_test.go
-## now drives the mock-backend binary built by build-mock-backend; real-backend
-## inference moved into tests/e2e-backends/ (per-backend, path-filtered) and
-## tests/e2e-aio/ (nightly).
-test: prepare-test
+test: test-models/testmodel.ggml protogen-go
 	@echo 'Running tests'
 	export GO_TAGS="debug"
+	$(MAKE) prepare-test
 	OPUS_SHIM_LIBRARY=$(abspath ./pkg/opus/shim/libopusshim.so) \
-	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) --fail-fast -v -r $(TEST_PATHS)
-
-## Runs the core suite ($(TEST_PATHS)) with statement-coverage instrumentation
-## and writes a merged profile to $(COVERAGE_PROFILE). Deliberately omits
-## --fail-fast so a single failure doesn't truncate the coverage number, and
-## uses covermode=atomic so the result is deterministic. Prints the total.
-test-coverage: prepare-test
-	@echo 'Running tests with coverage'
-	GINKGO_TAGS="$(COVERAGE_TAGS)" \
-	COVERAGE_COVERPKG="$(COVERAGE_COVERPKG)" \
-	COVERAGE_E2E_ROOTS="$(COVERAGE_E2E_ROOTS)" \
-	COVERAGE_E2E_LABELS="$(COVERAGE_E2E_LABELS)" \
-	COVERAGE_EXCLUDE_RE='$(COVERAGE_EXCLUDE_RE)' \
-	OPUS_SHIM_LIBRARY=$(abspath ./pkg/opus/shim/libopusshim.so) \
-	scripts/run-coverage.sh $(COVERAGE_DIR) $(COVERAGE_PROFILE) $(TEST_FLAKES) $(COVERAGE_ROOTS)
-	@$(GOCMD) tool cover -html=$(COVERAGE_PROFILE) -o $(COVERAGE_DIR)/coverage.html
-	@$(GOCMD) tool cover -func=$(COVERAGE_PROFILE) | tail -n1
-
-## Writes the current total coverage to $(COVERAGE_BASELINE). Run this (and
-## commit the result) whenever a change legitimately raises coverage so the
-## ratchet moves up. Never lower it by hand.
-test-coverage-baseline: test-coverage
-	@$(GOCMD) tool cover -func=$(COVERAGE_PROFILE) | awk '/^total:/{gsub(/%/,"",$$NF); print $$NF}' > $(COVERAGE_BASELINE)
-	@echo "Saved coverage baseline: $$(cat $(COVERAGE_BASELINE))%"
-
-## CI gate: fails if total coverage dropped more than COVERAGE_TOLERANCE
-## (default 0.5pp) below the committed baseline. A small tolerance absorbs the
-## run-to-run jitter from the in-process tests/e2e suite folded in via
-## --coverpkg (timing-dependent which handler lines execute).
-test-coverage-check: test-coverage
-	@scripts/coverage-check.sh $(COVERAGE_PROFILE) $(COVERAGE_BASELINE)
-
-########################################################
-## Lint
-########################################################
-## Runs golangci-lint with config from .golangci.yml. Includes the standard
-## linter set plus forbidigo, which enforces the Ginkgo/Gomega-only test
-## convention documented in .agents/coding-style.md.
-##
-## LINT_EXCLUDE_DIRS_RE matches directories whose Go packages can't typecheck
-## without C/C++ headers we don't install in the lint runner (cgo wrappers
-## around llama.cpp, piper/spdlog, silero-vad/onnxruntime, and Fyne/OpenGL for
-## the launcher). Their compile-time correctness is enforced by their own
-## build pipelines. Keep this as a deny list — `go list ./...` discovers
-## everything else automatically, so new packages are scanned by default.
-LINT_EXCLUDE_DIRS_RE=/(backend/go/(piper|silero-vad|llm)|cmd/launcher)(/|$$)
-
-## Set LINT_NEW_FROM to a git ref to override .golangci.yml's
-## new-from-merge-base (origin/master). Useful from a fork clone where
-## origin/master is stale relative to the canonical repo — the pre-commit
-## hook passes the resolved upstream ref here so local lint matches CI.
-LINT_NEW_FROM?=
-lint:
-	@command -v golangci-lint >/dev/null 2>&1 || { \
-		echo 'golangci-lint not installed. Install: go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@latest'; \
-		exit 1; \
-	}
-	golangci-lint run $(if $(LINT_NEW_FROM),--new-from-merge-base=$(LINT_NEW_FROM),) $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')
-
-## Like `lint` but reports every issue, including the pre-existing baseline
-## that `lint` ignores via .golangci.yml's new-from-merge-base. Use this to
-## see what's available to clean up.
-lint-all:
-	@command -v golangci-lint >/dev/null 2>&1 || { \
-		echo 'golangci-lint not installed. Install: go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@latest'; \
-		exit 1; \
-	}
-	golangci-lint run --new=false --new-from-merge-base= --new-from-rev= $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')
-
-########################################################
-## Git hooks
-########################################################
-## Points git at the versioned .githooks/ directory so the pre-commit hook
-## (lint + coverage gate) runs locally. Run once per clone. Undo with:
-## `git config --unset core.hooksPath`. Skip a single commit with
-## `git commit --no-verify`.
-install-hooks:
-	git config core.hooksPath .githooks
-	@echo 'Installed git hooks: core.hooksPath -> .githooks (pre-commit runs lint + test-coverage-check on Go changes)'
+	HUGGINGFACE_GRPC=$(abspath ./)/backend/python/transformers/run.sh TEST_DIR=$(abspath ./)/test-dir/ FIXTURES=$(abspath ./)/tests/fixtures CONFIG_FILE=$(abspath ./)/test-models/config.yaml MODELS_PATH=$(abspath ./)/test-models BACKENDS_PATH=$(abspath ./)/backends \
+	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="!llama-gguf"  --flake-attempts $(TEST_FLAKES) --fail-fast -v -r $(TEST_PATHS)
+	$(MAKE) test-llama-gguf
+	$(MAKE) test-tts
+	$(MAKE) test-stablediffusion

 ########################################################
 ## E2E AIO tests (uses standard image with pre-configured models)
@@ -293,8 +184,6 @@ docker-build-e2e:
 		--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
 		--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
 		--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
-		--build-arg APT_MIRROR=$(APT_MIRROR) \
-		--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
 		--build-arg GO_TAGS="$(GO_TAGS)" \
 		-t local-ai:tests -f Dockerfile .

@@ -309,27 +198,6 @@ run-e2e-aio: protogen-go
 	@echo 'Running e2e AIO tests'
 	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e-aio

-# Distributed architecture e2e (PostgreSQL + NATS via testcontainers).
-# Includes NatsJWT specs (JWT-enabled NATS). Requires Docker.
-# VLLMMultinode is excluded here; use test-e2e-vllm-multinode for that.
-test-e2e-distributed: protogen-go
-	@echo 'Running distributed e2e tests (label Distributed, incl. NatsJWT)'
-	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter='Distributed && !VLLMMultinode' --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e/distributed
-
-# vLLM multi-node DP smoke (CPU). Builds local-ai:tests and the
-# cpu-vllm backend from the current working tree, then drives a
-# head + headless follower via testcontainers-go and asserts a chat
-# completion. BuildKit caches both images, so re-runs only rebuild
-# what changed. The test lives under tests/e2e/distributed and is
-# selected by the VLLMMultinode label so it doesn't run alongside
-# test-e2e-distributed.
-test-e2e-vllm-multinode: docker-build-e2e extract-backend-vllm protogen-go
-	@echo 'Running e2e vLLM multi-node DP test'
-	LOCALAI_IMAGE=local-ai \
-	LOCALAI_IMAGE_TAG=tests \
-	LOCALAI_VLLM_BACKEND_DIR=$(abspath ./local-backends/vllm) \
-	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter='VLLMMultinode' -v -r ./tests/e2e/distributed
-
 ########################################################
 ## E2E tests
 ########################################################
@@ -343,8 +211,6 @@ prepare-e2e:
 		--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
 		--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
 		--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
-		--build-arg APT_MIRROR=$(APT_MIRROR) \
-		--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
 		--build-arg GO_TAGS="$(GO_TAGS)" \
 		--build-arg MAKEFLAGS="$(DOCKER_MAKEFLAGS)" \
 		-t localai-tests .
@@ -352,13 +218,12 @@ prepare-e2e:
 run-e2e-image:
 	docker run -p 5390:8080 -e MODELS_PATH=/models -e THREADS=1 -e DEBUG=true -d --rm -v $(TEST_DIR):/models --name e2e-tests-$(RANDOM) localai-tests

-test-e2e: build-mock-backend build-cloud-proxy-backend prepare-e2e run-e2e-image
+test-e2e: build-mock-backend prepare-e2e run-e2e-image
 	@echo 'Running e2e tests'
 	BUILD_TYPE=$(BUILD_TYPE) \
 	LOCALAI_API=http://$(E2E_BRIDGE_IP):5390 \
 	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e
 	$(MAKE) clean-mock-backend
-	$(MAKE) clean-cloud-proxy-backend
 	$(MAKE) teardown-e2e
 	docker rmi localai-tests

@@ -370,12 +235,20 @@ teardown-e2e:
 ## Integration and unit tests
 ########################################################

-## Storage / vector-store integration. Requires the local-store backend to
-## be available — we build it on demand and pass its location via
-## BACKENDS_PATH (the model loader looks there for the gRPC binary).
-test-stores: backends/local-store
-	BACKENDS_PATH=$(abspath ./)/backends \
-	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r tests/integration
+test-llama-gguf: prepare-test
+	TEST_DIR=$(abspath ./)/test-dir/ FIXTURES=$(abspath ./)/tests/fixtures CONFIG_FILE=$(abspath ./)/test-models/config.yaml MODELS_PATH=$(abspath ./)/test-models BACKENDS_PATH=$(abspath ./)/backends \
+	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="llama-gguf" --flake-attempts $(TEST_FLAKES) -v -r $(TEST_PATHS)
+
+test-tts: prepare-test
+	TEST_DIR=$(abspath ./)/test-dir/ FIXTURES=$(abspath ./)/tests/fixtures CONFIG_FILE=$(abspath ./)/test-models/config.yaml MODELS_PATH=$(abspath ./)/test-models BACKENDS_PATH=$(abspath ./)/backends \
+	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="tts" --flake-attempts $(TEST_FLAKES) -v -r $(TEST_PATHS)
+
+test-stablediffusion: prepare-test
+	TEST_DIR=$(abspath ./)/test-dir/ FIXTURES=$(abspath ./)/tests/fixtures CONFIG_FILE=$(abspath ./)/test-models/config.yaml MODELS_PATH=$(abspath ./)/test-models BACKENDS_PATH=$(abspath ./)/backends \
+	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="stablediffusion" --flake-attempts $(TEST_FLAKES) -v -r $(TEST_PATHS)
+
+test-stores:
+	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="stores" --flake-attempts $(TEST_FLAKES) -v -r tests/integration

 test-opus:
 	@echo 'Running opus backend tests'
@@ -387,8 +260,6 @@ test-opus-docker:
 	docker build --target builder \
 	  --build-arg BUILD_TYPE=$(or $(BUILD_TYPE),) \
 	  --build-arg BASE_IMAGE=$(or $(BASE_IMAGE),ubuntu:24.04) \
-	  --build-arg APT_MIRROR=$(APT_MIRROR) \
-	  --build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
 	  --build-arg BACKEND=opus \
 	  -t localai-opus-test -f backend/Dockerfile.golang .
 	docker run --rm localai-opus-test \
@@ -398,13 +269,23 @@ test-realtime: build-mock-backend
 	@echo 'Running realtime e2e tests (mock backend)'
 	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="Realtime && !real-models" --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e

-# Container-based real-model realtime testing. Build env vars / pipeline
-# definition kept here so test-realtime-models-docker can drive a fully wired
-# pipeline (VAD + STT + LLM + TTS) from inside a containerised runner.
+# Real-model realtime tests. Set REALTIME_TEST_MODEL to use your own pipeline,
+# or leave unset to auto-build one from the component env vars below.
 REALTIME_VAD?=silero-vad-ggml
 REALTIME_STT?=whisper-1
 REALTIME_LLM?=qwen3-0.6b
 REALTIME_TTS?=tts-1
+REALTIME_BACKENDS_PATH?=$(abspath ./)/backends
+
+test-realtime-models: build-mock-backend
+	@echo 'Running realtime e2e tests (real models)'
+	REALTIME_TEST_MODEL=$${REALTIME_TEST_MODEL:-realtime-test-pipeline} \
+	REALTIME_VAD=$(REALTIME_VAD) \
+	REALTIME_STT=$(REALTIME_STT) \
+	REALTIME_LLM=$(REALTIME_LLM) \
+	REALTIME_TTS=$(REALTIME_TTS) \
+	REALTIME_BACKENDS_PATH=$(REALTIME_BACKENDS_PATH) \
+	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="Realtime" --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e

 # --- Container-based real-model testing ---

@@ -418,7 +299,7 @@ local-backends:

 extract-backend-%: docker-build-% local-backends
 	@echo "Extracting backend $*..."
-	@CID=$$(docker create --entrypoint=/run.sh local-ai-backend:$*) && \
+	@CID=$$(docker create local-ai-backend:$*) && \
 	  rm -rf local-backends/$* && mkdir -p local-backends/$* && \
 	  docker cp $$CID:/ - | tar -xf - -C local-backends/$* && \
 	  docker rm $$CID > /dev/null
@@ -430,8 +311,6 @@ test-realtime-models-docker: build-mock-backend
 	  --build-arg BUILD_TYPE=$(or $(BUILD_TYPE),cublas) \
 	  --build-arg CUDA_MAJOR_VERSION=$(or $(CUDA_MAJOR_VERSION),13) \
 	  --build-arg CUDA_MINOR_VERSION=$(or $(CUDA_MINOR_VERSION),0) \
-	  --build-arg APT_MIRROR=$(APT_MIRROR) \
-	  --build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
 	  -t localai-test-runner .
 	docker run --rm \
 	  $(REALTIME_DOCKER_FLAGS) \
@@ -515,13 +394,7 @@ protoc:
 .PHONY: protogen-go
 protogen-go: protoc install-go-tools
 	mkdir -p pkg/grpc/proto
-	# install-go-tools writes protoc-gen-go and protoc-gen-go-grpc into
-	# $(shell go env GOPATH)/bin, which isn't on every dev's PATH. protoc
-	# resolves its code-gen plugins via PATH, so without this prefix the
-	# generate step fails with "protoc-gen-go: program not found". Prepend
-	# GOPATH/bin so the freshly-installed plugins win without requiring a
-	# shell-profile change.
-	PATH="$$(go env GOPATH)/bin:$$PATH" ./protoc --experimental_allow_proto3_optional -Ibackend/ --go_out=pkg/grpc/proto/ --go_opt=paths=source_relative --go-grpc_out=pkg/grpc/proto/ --go-grpc_opt=paths=source_relative \
+	./protoc --experimental_allow_proto3_optional -Ibackend/ --go_out=pkg/grpc/proto/ --go_opt=paths=source_relative --go-grpc_out=pkg/grpc/proto/ --go-grpc_opt=paths=source_relative \
    backend/backend.proto

 core/config/inference_defaults.json: ## Fetch inference defaults from unsloth (only if missing)
@@ -548,7 +421,6 @@ prepare-test-extra: protogen-python
 	$(MAKE) -C backend/python/vllm-omni
 	$(MAKE) -C backend/python/sglang
 	$(MAKE) -C backend/python/vibevoice
-	$(MAKE) -C backend/python/liquid-audio
 	$(MAKE) -C backend/python/moonshine
 	$(MAKE) -C backend/python/pocket-tts
 	$(MAKE) -C backend/python/qwen-tts
@@ -562,11 +434,7 @@ prepare-test-extra: protogen-python
 	$(MAKE) -C backend/python/ace-step
 	$(MAKE) -C backend/python/trl
 	$(MAKE) -C backend/python/tinygrad
-	$(MAKE) -C backend/python/insightface
-	$(MAKE) -C backend/python/speaker-recognition
 	$(MAKE) -C backend/rust/kokoros kokoros-grpc
-	$(MAKE) -C backend/go/rfdetr-cpp
-	$(MAKE) -C backend/go/locate-anything-cpp

 test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/transformers test
@@ -576,7 +444,6 @@ test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/vllm test
 	$(MAKE) -C backend/python/vllm-omni test
 	$(MAKE) -C backend/python/vibevoice test
-	$(MAKE) -C backend/python/liquid-audio test
 	$(MAKE) -C backend/python/moonshine test
 	$(MAKE) -C backend/python/pocket-tts test
 	$(MAKE) -C backend/python/qwen-tts test
@@ -590,13 +457,7 @@ test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/ace-step test
 	$(MAKE) -C backend/python/trl test
 	$(MAKE) -C backend/python/tinygrad test
-	$(MAKE) -C backend/python/insightface test
-	$(MAKE) -C backend/python/speaker-recognition test
 	$(MAKE) -C backend/rust/kokoros test
-	$(MAKE) -C backend/go/rfdetr-cpp test
-	$(MAKE) -C backend/go/locate-anything-cpp test
-	$(MAKE) -C backend/go/depth-anything-cpp test
-	$(MAKE) -C backend/go/supertonic test

 ##
 ## End-to-end gRPC tests that exercise a built backend container image.
@@ -646,20 +507,11 @@ test-extra-backend: protogen-go
 	BACKEND_TEST_TOOL_NAME="$$BACKEND_TEST_TOOL_NAME" \
 	BACKEND_TEST_CACHE_TYPE_K="$$BACKEND_TEST_CACHE_TYPE_K" \
 	BACKEND_TEST_CACHE_TYPE_V="$$BACKEND_TEST_CACHE_TYPE_V" \
-	BACKEND_TEST_FACE_IMAGE_1_URL="$$BACKEND_TEST_FACE_IMAGE_1_URL" \
-	BACKEND_TEST_FACE_IMAGE_1_FILE="$$BACKEND_TEST_FACE_IMAGE_1_FILE" \
-	BACKEND_TEST_FACE_IMAGE_2_URL="$$BACKEND_TEST_FACE_IMAGE_2_URL" \
-	BACKEND_TEST_FACE_IMAGE_2_FILE="$$BACKEND_TEST_FACE_IMAGE_2_FILE" \
-	BACKEND_TEST_FACE_IMAGE_3_URL="$$BACKEND_TEST_FACE_IMAGE_3_URL" \
-	BACKEND_TEST_FACE_IMAGE_3_FILE="$$BACKEND_TEST_FACE_IMAGE_3_FILE" \
-	BACKEND_TEST_VERIFY_DISTANCE_CEILING="$$BACKEND_TEST_VERIFY_DISTANCE_CEILING" \
 	go test -v -timeout 30m ./tests/e2e-backends/...

 ## Convenience wrappers: build the image, then exercise it.
 test-extra-backend-llama-cpp: docker-build-llama-cpp
-	BACKEND_IMAGE=local-ai-backend:llama-cpp \
-	BACKEND_TEST_CAPS=health,load,predict,stream,logprobs,logit_bias \
-	$(MAKE) test-extra-backend
+	BACKEND_IMAGE=local-ai-backend:llama-cpp $(MAKE) test-extra-backend

 test-extra-backend-ik-llama-cpp: docker-build-ik-llama-cpp
 	BACKEND_IMAGE=local-ai-backend:ik-llama-cpp $(MAKE) test-extra-backend
@@ -687,7 +539,6 @@ test-extra-backend-llama-cpp-transcription: docker-build-llama-cpp
 	BACKEND_TEST_MMPROJ_URL=https://huggingface.co/ggml-org/Qwen3-ASR-0.6B-GGUF/resolve/main/mmproj-Qwen3-ASR-0.6B-Q8_0.gguf \
 	BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
 	BACKEND_TEST_CAPS=health,load,transcription \
-	BACKEND_TEST_CTX_SIZE=2048 \
 	$(MAKE) test-extra-backend

 ## vllm is resolved from a HuggingFace model id (no file download) and
@@ -702,14 +553,6 @@ test-extra-backend-vllm: docker-build-vllm
 	BACKEND_TEST_OPTIONS=tool_parser:hermes \
 	$(MAKE) test-extra-backend

-## vllm multi-node data-parallel smoke test. Runs LocalAI head + a
-## `local-ai p2p-worker vllm` follower in docker compose against
-## Qwen2.5-0.5B with data_parallel_size=2. Requires 2 NVIDIA GPUs and
-## nvidia-container-runtime on the host — vLLM v1's DP coordinator is
-## not viable on CPU so this cannot run in CI without GPU.
-test-extra-backend-vllm-multinode:
-	./tests/e2e/vllm-multinode/smoke.sh
-
 ## tinygrad mirrors the vllm target (same model, same caps, same parser) so
 ## the two backends are directly comparable. The LLM path covers Predict,
 ## streaming and native tool-call extraction. Companion targets below cover
@@ -760,271 +603,6 @@ test-extra-backend-tinygrad-all: \
 	test-extra-backend-tinygrad-sd \
 	test-extra-backend-tinygrad-whisper

-## insightface — face recognition.
-##
-## Face fixtures default to the sample images shipped in the
-## deepinsight/insightface repository (MIT-licensed). For offline/local
-## runs override with BACKEND_TEST_FACE_IMAGE_{1,2,3}_FILE pointing at
-## local paths.
-FACE_IMAGE_1_URL ?= https://github.com/deepinsight/insightface/raw/master/python-package/insightface/data/images/t1.jpg
-FACE_IMAGE_2_URL ?= https://github.com/deepinsight/insightface/raw/master/python-package/insightface/data/images/t1.jpg
-FACE_IMAGE_3_URL ?= https://github.com/deepinsight/insightface/raw/master/python-package/insightface/data/images/mask_white.jpg
-## Known spoof fixture used by the face_antispoof e2e cap. This is
-## upstream's own `image_F2.jpg` (Silent-Face repo, via yakhyo mirror)
-## — verified to classify as is_real=false with score < 0.05 on the
-## MiniFASNetV2 + MiniFASNetV1SE ensemble.
-FACE_SPOOF_IMAGE_URL ?= https://github.com/yakhyo/face-anti-spoofing/raw/main/assets/image_F2.jpg
-
-## Host-side cache for the OpenCV Zoo face ONNX files used by the
-## opencv e2e target. The backend image no longer bakes model weights —
-## gallery installs bring them via `files:` — but the e2e suite drives
-## LoadModel over gRPC directly without going through the gallery. We
-## pre-download the ONNX files to a stable host path and pass absolute
-## paths in BACKEND_TEST_OPTIONS; `make` skips the downloads when the
-## SHA-256 already matches.
-INSIGHTFACE_OPENCV_DIR := /tmp/localai-insightface-opencv-cache
-INSIGHTFACE_OPENCV_YUNET_URL := https://github.com/opencv/opencv_zoo/raw/main/models/face_detection_yunet/face_detection_yunet_2023mar.onnx
-INSIGHTFACE_OPENCV_SFACE_URL := https://github.com/opencv/opencv_zoo/raw/main/models/face_recognition_sface/face_recognition_sface_2021dec.onnx
-INSIGHTFACE_OPENCV_YUNET_SHA := 8f2383e4dd3cfbb4553ea8718107fc0423210dc964f9f4280604804ed2552fa4
-INSIGHTFACE_OPENCV_SFACE_SHA := 0ba9fbfa01b5270c96627c4ef784da859931e02f04419c829e83484087c34e79
-
-## buffalo_sc (insightface) — pack zip + SHA-256 mirrors the gallery
-## entry so the e2e target matches exactly what `local-ai models install
-## insightface-buffalo-sc` would have fetched. Smallest insightface pack
-## (~16MB) — keeps CI fast while still covering the insightface engine
-## code path end-to-end.
-INSIGHTFACE_BUFFALO_SC_DIR := /tmp/localai-insightface-buffalo-sc-cache
-INSIGHTFACE_BUFFALO_SC_URL := https://github.com/deepinsight/insightface/releases/download/v0.7/buffalo_sc.zip
-INSIGHTFACE_BUFFALO_SC_SHA := 57d31b56b6ffa911c8a73cfc1707c73cab76efe7f13b675a05223bf42de47c72
-
-## Silent-Face antispoofing (MiniFASNetV2 + MiniFASNetV1SE) — shared
-## between the buffalo_sc and opencv e2e targets. Both ONNX files are
-## ~1.7MB, Apache 2.0. URLs + SHAs mirror the gallery entries.
-INSIGHTFACE_ANTISPOOF_DIR := /tmp/localai-insightface-antispoof-cache
-INSIGHTFACE_ANTISPOOF_V2_URL := https://github.com/yakhyo/face-anti-spoofing/releases/download/weights/MiniFASNetV2.onnx
-INSIGHTFACE_ANTISPOOF_V2_SHA := b32929adc2d9c34b9486f8c4c7bc97c1b69bc0ea9befefc380e4faae4e463907
-INSIGHTFACE_ANTISPOOF_V1SE_URL := https://github.com/yakhyo/face-anti-spoofing/releases/download/weights/MiniFASNetV1SE.onnx
-INSIGHTFACE_ANTISPOOF_V1SE_SHA := ebab7f90c7833fbccd46d3a555410e78d969db5438e169b6524be444862b3676
-
-.PHONY: insightface-opencv-models
-insightface-opencv-models:
-	@mkdir -p $(INSIGHTFACE_OPENCV_DIR)
-	@if [ "$$(sha256sum $(INSIGHTFACE_OPENCV_DIR)/yunet.onnx 2>/dev/null | awk '{print $$1}')" != "$(INSIGHTFACE_OPENCV_YUNET_SHA)" ]; then \
-		echo "Fetching YuNet..."; \
-		curl -fsSL -o $(INSIGHTFACE_OPENCV_DIR)/yunet.onnx $(INSIGHTFACE_OPENCV_YUNET_URL); \
-		echo "$(INSIGHTFACE_OPENCV_YUNET_SHA)  $(INSIGHTFACE_OPENCV_DIR)/yunet.onnx" | sha256sum -c; \
-	fi
-	@if [ "$$(sha256sum $(INSIGHTFACE_OPENCV_DIR)/sface.onnx 2>/dev/null | awk '{print $$1}')" != "$(INSIGHTFACE_OPENCV_SFACE_SHA)" ]; then \
-		echo "Fetching SFace..."; \
-		curl -fsSL -o $(INSIGHTFACE_OPENCV_DIR)/sface.onnx $(INSIGHTFACE_OPENCV_SFACE_URL); \
-		echo "$(INSIGHTFACE_OPENCV_SFACE_SHA)  $(INSIGHTFACE_OPENCV_DIR)/sface.onnx" | sha256sum -c; \
-	fi
-
-.PHONY: insightface-antispoof-models
-insightface-antispoof-models:
-	@mkdir -p $(INSIGHTFACE_ANTISPOOF_DIR)
-	@if [ "$$(sha256sum $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV2.onnx 2>/dev/null | awk '{print $$1}')" != "$(INSIGHTFACE_ANTISPOOF_V2_SHA)" ]; then \
-		echo "Fetching MiniFASNetV2..."; \
-		curl -fsSL -o $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV2.onnx $(INSIGHTFACE_ANTISPOOF_V2_URL); \
-		echo "$(INSIGHTFACE_ANTISPOOF_V2_SHA)  $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV2.onnx" | sha256sum -c; \
-	fi
-	@if [ "$$(sha256sum $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV1SE.onnx 2>/dev/null | awk '{print $$1}')" != "$(INSIGHTFACE_ANTISPOOF_V1SE_SHA)" ]; then \
-		echo "Fetching MiniFASNetV1SE..."; \
-		curl -fsSL -o $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV1SE.onnx $(INSIGHTFACE_ANTISPOOF_V1SE_URL); \
-		echo "$(INSIGHTFACE_ANTISPOOF_V1SE_SHA)  $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV1SE.onnx" | sha256sum -c; \
-	fi
-
-.PHONY: insightface-buffalo-sc-models
-insightface-buffalo-sc-models:
-	@mkdir -p $(INSIGHTFACE_BUFFALO_SC_DIR)
-	@if [ "$$(sha256sum $(INSIGHTFACE_BUFFALO_SC_DIR)/buffalo_sc.zip 2>/dev/null | awk '{print $$1}')" != "$(INSIGHTFACE_BUFFALO_SC_SHA)" ]; then \
-		echo "Fetching buffalo_sc..."; \
-		curl -fsSL -o $(INSIGHTFACE_BUFFALO_SC_DIR)/buffalo_sc.zip $(INSIGHTFACE_BUFFALO_SC_URL); \
-		echo "$(INSIGHTFACE_BUFFALO_SC_SHA)  $(INSIGHTFACE_BUFFALO_SC_DIR)/buffalo_sc.zip" | sha256sum -c; \
-		rm -f $(INSIGHTFACE_BUFFALO_SC_DIR)/*.onnx; \
-	fi
-	@if [ ! -f "$(INSIGHTFACE_BUFFALO_SC_DIR)/det_500m.onnx" ]; then \
-		echo "Extracting buffalo_sc..."; \
-		unzip -o -q $(INSIGHTFACE_BUFFALO_SC_DIR)/buffalo_sc.zip -d $(INSIGHTFACE_BUFFALO_SC_DIR); \
-	fi
-
-## buffalo_sc — smallest insightface pack (SCRFD-500MF detector + MBF
-## recognizer, ~16MB). Exercises the insightface engine code path
-## (model_zoo-backed inference) without the ~326MB buffalo_l download.
-## No age/gender/landmark heads — face_analyze is dropped from caps.
-## The pack is pre-fetched on the host and passed as `root:<dir>` since
-## the e2e suite drives LoadModel directly without going through
-## LocalAI's gallery flow (which is what would normally populate
-## ModelPath and in turn the engine's `_model_dir` option).
-test-extra-backend-insightface-buffalo-sc: docker-build-insightface insightface-buffalo-sc-models insightface-antispoof-models
-	BACKEND_IMAGE=local-ai-backend:insightface \
-	BACKEND_TEST_MODEL_NAME=insightface-buffalo-sc \
-	BACKEND_TEST_OPTIONS=engine:insightface,model_pack:buffalo_sc,root:$(INSIGHTFACE_BUFFALO_SC_DIR),antispoof_v2_onnx:$(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV2.onnx,antispoof_v1se_onnx:$(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV1SE.onnx \
-	BACKEND_TEST_CAPS=health,load,face_detect,face_embed,face_verify,face_antispoof \
-	BACKEND_TEST_FACE_IMAGE_1_URL=$(FACE_IMAGE_1_URL) \
-	BACKEND_TEST_FACE_IMAGE_2_URL=$(FACE_IMAGE_2_URL) \
-	BACKEND_TEST_FACE_IMAGE_3_URL=$(FACE_IMAGE_3_URL) \
-	BACKEND_TEST_FACE_SPOOF_IMAGE_URL=$(FACE_SPOOF_IMAGE_URL) \
-	BACKEND_TEST_VERIFY_DISTANCE_CEILING=0.55 \
-	$(MAKE) test-extra-backend
-
-## OpenCV Zoo YuNet + SFace — Apache 2.0, commercial-safe. face_analyze
-## cap is dropped (SFace has no demographic head). The ONNX files are
-## pre-fetched on the host via the insightface-opencv-models target and
-## passed as absolute paths, since the e2e suite drives LoadModel
-## directly without going through LocalAI's gallery flow.
-test-extra-backend-insightface-opencv: docker-build-insightface insightface-opencv-models insightface-antispoof-models
-	BACKEND_IMAGE=local-ai-backend:insightface \
-	BACKEND_TEST_MODEL_NAME=insightface-opencv \
-	BACKEND_TEST_OPTIONS=engine:onnx_direct,detector_onnx:$(INSIGHTFACE_OPENCV_DIR)/yunet.onnx,recognizer_onnx:$(INSIGHTFACE_OPENCV_DIR)/sface.onnx,antispoof_v2_onnx:$(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV2.onnx,antispoof_v1se_onnx:$(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV1SE.onnx \
-	BACKEND_TEST_CAPS=health,load,face_detect,face_embed,face_verify,face_antispoof \
-	BACKEND_TEST_FACE_IMAGE_1_URL=$(FACE_IMAGE_1_URL) \
-	BACKEND_TEST_FACE_IMAGE_2_URL=$(FACE_IMAGE_2_URL) \
-	BACKEND_TEST_FACE_IMAGE_3_URL=$(FACE_IMAGE_3_URL) \
-	BACKEND_TEST_FACE_SPOOF_IMAGE_URL=$(FACE_SPOOF_IMAGE_URL) \
-	BACKEND_TEST_VERIFY_DISTANCE_CEILING=0.55 \
-	$(MAKE) test-extra-backend
-
-## Aggregate — runs both face-recognition model configurations so CI
-## catches regressions across engines together.
-test-extra-backend-insightface-all: \
-	test-extra-backend-insightface-buffalo-sc \
-	test-extra-backend-insightface-opencv
-
-## speaker-recognition — voice (speaker) biometrics.
-##
-## Audio fixtures default to the speechbrain test samples served
-## straight from their GitHub repo — public, no auth needed, and they
-## ship as 16kHz mono WAV/FLAC which is exactly what the engine wants.
-## example{1,2,5} are three different speakers; the suite treats
-## example1 as the "same-image twin" probe (verify(clip, clip) must
-## return distance≈0) and the other two as cross-speaker ceilings.
-## Override with BACKEND_TEST_VOICE_AUDIO_{1,2,3}_FILE for offline runs.
-VOICE_AUDIO_1_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example1.wav
-VOICE_AUDIO_2_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example2.flac
-VOICE_AUDIO_3_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example5.wav
-
-## ECAPA-TDNN via SpeechBrain — default CI configuration. Auto-downloads
-## the checkpoint from HuggingFace on first LoadModel (bundled in the
-## backend image pip install). 192-d embeddings, cosine-distance based.
-## The e2e suite drives LoadModel directly so we don't rely on LocalAI's
-## gallery flow here.
-test-extra-backend-speaker-recognition-ecapa: docker-build-speaker-recognition
-	BACKEND_IMAGE=local-ai-backend:speaker-recognition \
-	BACKEND_TEST_MODEL_NAME=speechbrain/spkrec-ecapa-voxceleb \
-	BACKEND_TEST_OPTIONS=engine:speechbrain,source:speechbrain/spkrec-ecapa-voxceleb \
-	BACKEND_TEST_CAPS=health,load,voice_embed,voice_verify \
-	BACKEND_TEST_VOICE_AUDIO_1_URL=$(VOICE_AUDIO_1_URL) \
-	BACKEND_TEST_VOICE_AUDIO_2_URL=$(VOICE_AUDIO_2_URL) \
-	BACKEND_TEST_VOICE_AUDIO_3_URL=$(VOICE_AUDIO_3_URL) \
-	BACKEND_TEST_VOICE_VERIFY_DISTANCE_CEILING=0.4 \
-	$(MAKE) test-extra-backend
-
-## Aggregate — today there's only one voice config; the target exists
-## so the CI workflow matches the insightface-all naming convention and
-## can grow to include WeSpeaker / 3D-Speaker later.
-test-extra-backend-speaker-recognition-all: \
-	test-extra-backend-speaker-recognition-ecapa
-
-## Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked
-## LLM. Extracts the sherpa-onnx Docker image rootfs, downloads the three
-## gallery-referenced model bundles (silero-vad, omnilingual-asr, vits-ljs),
-## writes the corresponding model config YAMLs, and runs the realtime
-## websocket spec in tests/e2e with REALTIME_* env vars wiring the sherpa
-## slots into the pipeline. The LLM slot stays on the in-repo mock-backend
-## registered unconditionally by tests/e2e/e2e_suite_test.go. See
-## tests/e2e/run-realtime-sherpa.sh for the full orchestration.
-test-extra-e2e-realtime-sherpa: build-mock-backend docker-build-sherpa-onnx protogen-go react-ui
-	bash tests/e2e/run-realtime-sherpa.sh
-
-## Streaming ASR via the sherpa-onnx online recognizer. Uses the streaming
-## zipformer English model (encoder/decoder/joiner int8 + tokens) from the
-## sherpa-onnx gallery entry. Drives both AudioTranscription and
-## AudioTranscriptionStream via the e2e-backends gRPC harness; streaming
-## emits real partial deltas during decode. Each file is renamed on download
-## to the shape sherpa-onnx's online loader expects (encoder.int8.onnx etc.).
-test-extra-backend-sherpa-onnx-transcription: docker-build-sherpa-onnx
-	BACKEND_IMAGE=local-ai-backend:sherpa-onnx \
-	BACKEND_TEST_MODEL_URL='https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/encoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx#encoder.int8.onnx' \
-	BACKEND_TEST_EXTRA_FILES='https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/decoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx#decoder.int8.onnx|https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/joiner-epoch-99-avg-1-chunk-16-left-128.int8.onnx#joiner.int8.onnx|https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/tokens.txt' \
-	BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
-	BACKEND_TEST_CAPS=health,load,transcription \
-	BACKEND_TEST_OPTIONS=subtype=online \
-	$(MAKE) test-extra-backend
-
-## VITS TTS via the sherpa-onnx backend. Pulls the individual files from
-## HuggingFace (the vits-ljs release tarball lives on the k2-fsa github
-## but is also mirrored as discrete files on HF). Exercises both
-## TTS (write-to-file) and TTSStream (PCM chunks + WAV header) via the
-## e2e-backends gRPC harness.
-test-extra-backend-sherpa-onnx-tts: docker-build-sherpa-onnx
-	BACKEND_IMAGE=local-ai-backend:sherpa-onnx \
-	BACKEND_TEST_MODEL_URL='https://huggingface.co/csukuangfj/vits-ljs/resolve/main/vits-ljs.onnx#vits-ljs.onnx' \
-	BACKEND_TEST_EXTRA_FILES='https://huggingface.co/csukuangfj/vits-ljs/resolve/main/tokens.txt|https://huggingface.co/csukuangfj/vits-ljs/resolve/main/lexicon.txt' \
-	BACKEND_TEST_CAPS=health,load,tts \
-	$(MAKE) test-extra-backend
-
-## VibeVoice TTS via the vibevoice-cpp backend. ModelFile is the
-## realtime gguf; the supplementary tokenizer + voice prompt land
-## alongside it under the harness's models dir and are wired through
-## via the standard Options[] convention (tokenizer=, voice=).
-test-extra-backend-vibevoice-cpp-tts: docker-build-vibevoice-cpp
-	BACKEND_IMAGE=local-ai-backend:vibevoice-cpp \
-	BACKEND_TEST_MODEL_URL='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/vibevoice-realtime-0.5B-q8_0.gguf#vibevoice-realtime-0.5B-q8_0.gguf' \
-	BACKEND_TEST_EXTRA_FILES='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/tokenizer.gguf#tokenizer.gguf|https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/voice-en-Carter_man.gguf#voice-en-Carter_man.gguf' \
-	BACKEND_TEST_OPTIONS=tokenizer:tokenizer.gguf,voice:voice-en-Carter_man.gguf \
-	BACKEND_TEST_CAPS=health,load,tts \
-	$(MAKE) test-extra-backend
-
-## VibeVoice ASR (long-form, with diarization). type=asr tells the
-## backend's Load() to slot ModelFile into the asr_model role; the
-## tokenizer is supplied via Options[]. Uses the Q4_K quant (~10 GB)
-## rather than Q8_0 (~14 GB) so the bundle fits inside ubuntu-latest's
-## post-image disk budget.
-test-extra-backend-vibevoice-cpp-transcription: docker-build-vibevoice-cpp
-	BACKEND_IMAGE=local-ai-backend:vibevoice-cpp \
-	BACKEND_TEST_MODEL_URL='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/vibevoice-asr-q4_k.gguf#vibevoice-asr-q4_k.gguf' \
-	BACKEND_TEST_EXTRA_FILES='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/tokenizer.gguf#tokenizer.gguf' \
-	BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
-	BACKEND_TEST_OPTIONS=type:asr,tokenizer:tokenizer.gguf \
-	BACKEND_TEST_CAPS=health,load,transcription \
-	$(MAKE) test-extra-backend
-
-## Audio transcription wrapper for the whisper.cpp backend.
-## Drives the AudioTranscription / AudioTranscriptionStream RPCs against
-## ggml-base.en (~145 MB) using the JFK 11s clip. The streaming spec
-## asserts len(deltas) >= 1 and concat(deltas) == final.Text - whisper-
-## specific multi-segment assertions live in backend/go/whisper/gowhisper_test.go.
-test-extra-backend-whisper-transcription: docker-build-whisper
-	BACKEND_IMAGE=local-ai-backend:whisper \
-	BACKEND_TEST_MODEL_URL=https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin \
-	BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
-	BACKEND_TEST_CAPS=health,load,transcription \
-	$(MAKE) test-extra-backend
-
-## Audio transcription wrapper for the parakeet-cpp (parakeet.cpp ggml port)
-## backend. Mirrors test-extra-backend-whisper-transcription: drives the
-## AudioTranscription / AudioTranscriptionStream RPCs against a published
-## Parakeet GGUF using the JFK 11s clip from whisper.cpp's CI samples. Not
-## part of the default test suite - run explicitly once the pinned model URL
-## is reachable.
-test-extra-backend-parakeet-cpp-transcription: docker-build-parakeet-cpp
-	BACKEND_IMAGE=local-ai-backend:parakeet-cpp \
-	BACKEND_TEST_MODEL_URL=https://huggingface.co/mudler/parakeet-cpp-gguf/resolve/main/tdt_ctc-110m-f16.gguf \
-	BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
-	BACKEND_TEST_CAPS=health,load,transcription \
-	$(MAKE) test-extra-backend
-
-## LocalVQE audio transform (joint AEC + noise suppression + dereverb).
-## Exercises the audio_transform capability end-to-end: batch transform
-## of a real WAV fixture and bidi streaming of synthetic silent frames.
-test-extra-backend-localvqe-transform: docker-build-localvqe
-	BACKEND_IMAGE=local-ai-backend:localvqe \
-	BACKEND_TEST_MODEL_URL='https://huggingface.co/LocalAI-io/LocalVQE/resolve/main/localvqe-v1-1.3M-f32.gguf#localvqe-v1-1.3M-f32.gguf' \
-	BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
-	BACKEND_TEST_CAPS=health,load,audio_transform \
-	$(MAKE) test-extra-backend
-
 ## sglang mirrors the vllm setup: HuggingFace model id, same tiny Qwen,
 ## tool-call extraction via sglang's native qwen parser. CPU builds use
 ## sglang's upstream pyproject_cpu.toml recipe (see backend/python/sglang/install.sh).
@@ -1067,8 +645,6 @@ docker:
 		--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
 		--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
 		--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
-		--build-arg APT_MIRROR=$(APT_MIRROR) \
-		--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
 		-t $(DOCKER_IMAGE) .

 docker-cuda12:
@@ -1082,13 +658,11 @@ docker-cuda12:
 		--build-arg BUILD_TYPE=$(BUILD_TYPE) \
 		--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
 		--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
-		--build-arg APT_MIRROR=$(APT_MIRROR) \
-		--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
 		-t $(DOCKER_IMAGE)-cuda-12 .

 docker-image-intel:
 	docker build \
-		--build-arg BASE_IMAGE=intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04 \
+		--build-arg BASE_IMAGE=intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04 \
 		--build-arg IMAGE_TYPE=$(IMAGE_TYPE) \
 		--build-arg GO_TAGS="$(GO_TAGS)" \
 		--build-arg MAKEFLAGS="$(DOCKER_MAKEFLAGS)" \
@@ -1097,8 +671,6 @@ docker-image-intel:
 		--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
 		--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
 		--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
-		--build-arg APT_MIRROR=$(APT_MIRROR) \
-		--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
 		-t $(DOCKER_IMAGE) .

 ########################################################
@@ -1115,10 +687,6 @@ backends/llama-cpp-darwin: build
 	bash ./scripts/build/llama-cpp-darwin.sh
 	./local-ai backends install "ocifile://$(abspath ./backend-images/llama-cpp.tar)"

-backends/ds4-darwin: build
-	bash ./scripts/build/ds4-darwin.sh
-	./local-ai backends install "ocifile://$(abspath ./backend-images/ds4.tar)"
-
 build-darwin-python-backend: build
 	bash ./scripts/build/python-darwin.sh

@@ -1160,35 +728,18 @@ BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false
 # turboquant is a llama.cpp fork with TurboQuant KV-cache quantization.
 # Reuses backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile.
 BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
-# ds4 is antirez/ds4, a DeepSeek V4 Flash-specific inference engine.
-# Single-model; hardware-only validation lives at tests/e2e-backends/
-# (BACKEND_BINARY mode); see docs/superpowers/plans/2026-05-11-ds4-backend.md.
-BACKEND_DS4 = ds4|ds4|.|false|false
-# privacy-filter wraps the standalone privacy-filter.cpp GGML engine (the
-# openai-privacy-filter PII/NER token classifier) — the TokenClassify RPC for
-# the PII redactor tier, on stock ggml with no llama.cpp carry-patches.
-BACKEND_PRIVACY_FILTER = privacy-filter|privacy-filter|.|false|false

 # Golang backends
 BACKEND_PIPER = piper|golang|.|false|true
 BACKEND_LOCAL_STORE = local-store|golang|.|false|true
-BACKEND_CLOUD_PROXY = cloud-proxy|golang|.|false|true
 BACKEND_HUGGINGFACE = huggingface|golang|.|false|true
 BACKEND_SILERO_VAD = silero-vad|golang|.|false|true
 BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|true
 BACKEND_WHISPER = whisper|golang|.|false|true
-BACKEND_CRISPASR = crispasr|golang|.|false|true
-BACKEND_PARAKEET_CPP = parakeet-cpp|golang|.|false|true
-BACKEND_DEPTH_ANYTHING_CPP = depth-anything-cpp|golang|.|false|true
 BACKEND_VOXTRAL = voxtral|golang|.|false|true
 BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
 BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
-BACKEND_OMNIVOICE_CPP = omnivoice-cpp|golang|.|false|true
-BACKEND_VIBEVOICE_CPP = vibevoice-cpp|golang|.|false|true
-BACKEND_LOCALVQE = localvqe|golang|.|false|true
 BACKEND_OPUS = opus|golang|.|false|true
-BACKEND_SHERPA_ONNX = sherpa-onnx|golang|.|false|true
-BACKEND_SUPERTONIC = supertonic|golang|.|false|true

 # Python backends with root context
 BACKEND_RERANKERS = rerankers|python|.|false|true
@@ -1197,8 +748,6 @@ BACKEND_OUTETTS = outetts|python|.|false|true
 BACKEND_FASTER_WHISPER = faster-whisper|python|.|false|true
 BACKEND_COQUI = coqui|python|.|false|true
 BACKEND_RFDETR = rfdetr|python|.|false|true
-BACKEND_INSIGHTFACE = insightface|python|.|false|true
-BACKEND_SPEAKER_RECOGNITION = speaker-recognition|python|.|false|true
 BACKEND_KITTEN_TTS = kitten-tts|python|.|false|true
 BACKEND_NEUTTS = neutts|python|.|false|true
 BACKEND_KOKORO = kokoro|python|.|false|true
@@ -1208,7 +757,6 @@ BACKEND_SGLANG = sglang|python|.|false|true
 BACKEND_DIFFUSERS = diffusers|python|.|--progress=plain|true
 BACKEND_CHATTERBOX = chatterbox|python|.|false|true
 BACKEND_VIBEVOICE = vibevoice|python|.|--progress=plain|true
-BACKEND_LIQUID_AUDIO = liquid-audio|python|.|--progress=plain|true
 BACKEND_MOONSHINE = moonshine|python|.|false|true
 BACKEND_POCKET_TTS = pocket-tts|python|.|false|true
 BACKEND_QWEN_TTS = qwen-tts|python|.|false|true
@@ -1231,7 +779,6 @@ BACKEND_KOKOROS = kokoros|rust|.|false|true

 # C++ backends (Go wrapper with purego)
 BACKEND_SAM3_CPP = sam3-cpp|golang|.|false|true
-BACKEND_RFDETR_CPP = rfdetr-cpp|golang|.|false|true

 # Helper function to build docker image for a backend
 # Usage: $(call docker-build-backend,BACKEND_NAME,DOCKERFILE_TYPE,BUILD_CONTEXT,PROGRESS_FLAG,NEEDS_BACKEND_ARG)
@@ -1243,10 +790,7 @@ define docker-build-backend
 		--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
 		--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
 		--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
-		--build-arg APT_MIRROR=$(APT_MIRROR) \
-		--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
 		$(if $(FROM_SOURCE),--build-arg FROM_SOURCE=$(FROM_SOURCE)) \
-		$(if $(AMDGPU_TARGETS),--build-arg AMDGPU_TARGETS=$(AMDGPU_TARGETS)) \
 		$(if $(filter true,$(5)),--build-arg BACKEND=$(1)) \
 		-t local-ai-backend:$(1) -f backend/Dockerfile.$(2) $(3)
 endef
@@ -1261,18 +805,12 @@ endef
 $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
-$(eval $(call generate-docker-build-target,$(BACKEND_DS4)))
-$(eval $(call generate-docker-build-target,$(BACKEND_PRIVACY_FILTER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
-$(eval $(call generate-docker-build-target,$(BACKEND_CLOUD_PROXY)))
 $(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SILERO_VAD)))
 $(eval $(call generate-docker-build-target,$(BACKEND_STABLEDIFFUSION_GGML)))
 $(eval $(call generate-docker-build-target,$(BACKEND_WHISPER)))
-$(eval $(call generate-docker-build-target,$(BACKEND_CRISPASR)))
-$(eval $(call generate-docker-build-target,$(BACKEND_PARAKEET_CPP)))
-$(eval $(call generate-docker-build-target,$(BACKEND_DEPTH_ANYTHING_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VOXTRAL)))
 $(eval $(call generate-docker-build-target,$(BACKEND_OPUS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_RERANKERS)))
@@ -1281,8 +819,6 @@ $(eval $(call generate-docker-build-target,$(BACKEND_OUTETTS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_FASTER_WHISPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_COQUI)))
 $(eval $(call generate-docker-build-target,$(BACKEND_RFDETR)))
-$(eval $(call generate-docker-build-target,$(BACKEND_INSIGHTFACE)))
-$(eval $(call generate-docker-build-target,$(BACKEND_SPEAKER_RECOGNITION)))
 $(eval $(call generate-docker-build-target,$(BACKEND_KITTEN_TTS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_NEUTTS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_KOKORO)))
@@ -1292,7 +828,6 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SGLANG)))
 $(eval $(call generate-docker-build-target,$(BACKEND_DIFFUSERS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_CHATTERBOX)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE)))
-$(eval $(call generate-docker-build-target,$(BACKEND_LIQUID_AUDIO)))
 $(eval $(call generate-docker-build-target,$(BACKEND_MOONSHINE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_POCKET_TTS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_QWEN_TTS)))
@@ -1305,9 +840,6 @@ $(eval $(call generate-docker-build-target,$(BACKEND_WHISPERX)))
 $(eval $(call generate-docker-build-target,$(BACKEND_ACE_STEP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_ACESTEP_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_QWEN3_TTS_CPP)))
-$(eval $(call generate-docker-build-target,$(BACKEND_OMNIVOICE_CPP)))
-$(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE_CPP)))
-$(eval $(call generate-docker-build-target,$(BACKEND_LOCALVQE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_MLX)))
 $(eval $(call generate-docker-build-target,$(BACKEND_MLX_VLM)))
 $(eval $(call generate-docker-build-target,$(BACKEND_MLX_DISTRIBUTED)))
@@ -1316,15 +848,12 @@ $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_QUANTIZATION)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TINYGRAD)))
 $(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
-$(eval $(call generate-docker-build-target,$(BACKEND_RFDETR_CPP)))
-$(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
-$(eval $(call generate-docker-build-target,$(BACKEND_SUPERTONIC)))

 # Pattern rule for docker-save targets
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar

-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-omnivoice-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy docker-build-supertonic docker-build-depth-anything-cpp docker-build-privacy-filter
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp

 ########################################################
 ### Mock Backend for E2E Tests
@@ -1336,12 +865,6 @@ build-mock-backend: protogen-go
 clean-mock-backend:
 	rm -f tests/e2e/mock-backend/mock-backend

-build-cloud-proxy-backend: protogen-go
-	$(GOCMD) build -o tests/e2e/mock-backend/cloud-proxy ./backend/go/cloud-proxy
-
-clean-cloud-proxy-backend:
-	rm -f tests/e2e/mock-backend/cloud-proxy
-
 ########################################################
 ### UI E2E Test Server
 ########################################################
@@ -1352,50 +875,6 @@ build-ui-test-server: build-mock-backend react-ui protogen-go
 test-ui-e2e: build-ui-test-server
 	cd core/http/react-ui && npm install && npx playwright install --with-deps chromium && npx playwright test

-## Optional Playwright worker count for the UI e2e targets below. Pass
-## UI_TEST_WORKERS=N (e.g. `make test-ui-coverage UI_TEST_WORKERS=20`) to
-## override Playwright's default (cores/2). Empty by default so Playwright
-## picks its own worker count.
-UI_TEST_WORKERS ?=
-PLAYWRIGHT_WORKERS_FLAG = $(if $(UI_TEST_WORKERS),--workers=$(UI_TEST_WORKERS),)
-
-## Fast Playwright e2e run used by the pre-commit hook on React UI changes.
-## Force-rebuilds the (non-instrumented) dist so the suite tests the working
-## tree — not a stale dist the `react-ui` skip-guard would leave — re-embeds
-## it into ui-test-server, and runs the specs. Uses the nix-provided browser
-## when PLAYWRIGHT_CHROMIUM_PATH is set (flake dev shell), else falls back to
-## downloading it as `test-ui-e2e` does.
-test-ui: build-mock-backend protogen-go
-	cd core/http/react-ui && bun install && bun run build
-	$(GOCMD) build -o tests/e2e-ui/ui-test-server ./tests/e2e-ui
-	cd core/http/react-ui && sh $(CURDIR)/scripts/ensure-playwright-browser.sh && bunx playwright test $(PLAYWRIGHT_WORKERS_FLAG)
-
-## React UI code coverage from the Playwright e2e suite. Builds a
-## NON-instrumented bundle with source maps (COVERAGE_V8=true), re-embeds it
-## into the ui-test-server (the dist is //go:embed'ed at compile time), runs the
-## Playwright specs which collect native Chromium V8 coverage (PW_V8_COVERAGE=1)
-## — far cheaper than istanbul's build-time counters (~40% faster end-to-end) —
-## convert it to istanbul via v8-to-istanbul in the coverage fixture, and write
-## an nyc report to core/http/react-ui/coverage/. Removes the dist afterwards so
-## normal builds aren't served source-mapped assets. (The legacy istanbul path
-## still exists: `bun run build:coverage` + unset PW_V8_COVERAGE.)
-test-ui-coverage: build-mock-backend protogen-go
-	trap 'rm -rf "$(CURDIR)/core/http/react-ui/dist"' EXIT; \
-	( cd core/http/react-ui && bun install && bun run build:coverage-v8 ) && \
-	$(GOCMD) build -o tests/e2e-ui/ui-test-server ./tests/e2e-ui && \
-	( cd core/http/react-ui && rm -rf .nyc_output coverage && \
-	    sh $(CURDIR)/scripts/ensure-playwright-browser.sh && \
-	    PW_V8_COVERAGE=1 bunx playwright test $(PLAYWRIGHT_WORKERS_FLAG) && bun run coverage:report )
-
-## UI coverage baseline (committed) and the strict gate that compares against
-## it — the React mirror of test-coverage-baseline / test-coverage-check.
-test-ui-coverage-baseline: test-ui-coverage
-	@node -e 'const fs=require("fs");process.stdout.write(String(JSON.parse(fs.readFileSync("core/http/react-ui/coverage/coverage-summary.json")).total.lines.pct))' > core/http/react-ui/coverage-baseline.txt
-	@echo "Saved UI coverage baseline: $$(cat core/http/react-ui/coverage-baseline.txt)% lines"
-
-test-ui-coverage-check: test-ui-coverage
-	sh $(CURDIR)/scripts/ui-coverage-check.sh core/http/react-ui/coverage/coverage-summary.json core/http/react-ui/coverage-baseline.txt
-
 test-ui-e2e-docker:
 	docker build -t localai-ui-e2e -f tests/e2e-ui/Dockerfile .
 	docker run --rm localai-ui-e2e
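
Note on the Makefile hunks above: each BACKEND_* variable is a pipe-delimited spec whose five fields appear to line up with the documented arguments of docker-build-backend (BACKEND_NAME, DOCKERFILE_TYPE, BUILD_CONTEXT, PROGRESS_FLAG, NEEDS_BACKEND_ARG). The body of generate-docker-build-target is not shown in this diff, so the following is only a minimal sketch of how such a spec could be split in GNU make. The helper names (split-spec, spec-name, spec-dockerfile, spec-context) and the show-spec target are hypothetical and not taken from the repository.

```make
# Hypothetical sketch, not the repository's implementation.
# Example spec copied from the Makefile context lines in this diff.
BACKEND_WHISPER = whisper|golang|.|false|true

# Replace each "|" with a space so $(word N,...) can index the fields.
split-spec = $(subst |, ,$(1))

# Field accessors (assumed field order: name, dockerfile type, build context).
spec-name       = $(word 1,$(call split-spec,$(1)))
spec-dockerfile = $(word 2,$(call split-spec,$(1)))
spec-context    = $(word 3,$(call split-spec,$(1)))

# Print the image tag, Dockerfile, and build context derived from the spec.
.PHONY: show-spec
show-spec:
	@echo "image:      local-ai-backend:$(call spec-name,$(BACKEND_WHISPER))"
	@echo "dockerfile: backend/Dockerfile.$(call spec-dockerfile,$(BACKEND_WHISPER))"
	@echo "context:    $(call spec-context,$(BACKEND_WHISPER))"
```

Under these assumptions, removing a backend as this commit does means deleting three things together: its BACKEND_* definition, its generate-docker-build-target eval line, and its entry in the docker-build-backends prerequisite list.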
--- a/README.md
+++ b/README.md
@@ -29,34 +29,16 @@
 <a href="https://trendshift.io/repositories/5539" target="_blank"><img src="https://trendshift.io/api/badge/repositories/5539" alt="mudler%2FLocalAI | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 </p>

-<!-- Keep these links, translations synced daily. -->
-<p align="center">
-<a href="https://zdoc.app/de/mudler/LocalAI">Deutsch</a> |
-<a href="https://zdoc.app/es/mudler/LocalAI">Español</a> |
-<a href="https://zdoc.app/fr/mudler/LocalAI">français</a> |
-<a href="https://zdoc.app/ja/mudler/LocalAI">日本語</a> |
-<a href="https://zdoc.app/ko/mudler/LocalAI">한국어</a> |
-<a href="https://zdoc.app/pt/mudler/LocalAI">Português</a> |
-<a href="https://zdoc.app/ru/mudler/LocalAI">Русский</a> |
-<a href="https://zdoc.app/zh/mudler/LocalAI">中文</a>
-</p>
-
 **LocalAI** is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.

-**A small core, not a bundle.** Each backend wraps a best-in-class engine (llama.cpp, vLLM, whisper.cpp, stable-diffusion, MLX...) in its own image, pulled only when a model needs it. You install nothing you don't use.
+- **Drop-in API compatibility** — OpenAI, Anthropic, ElevenLabs APIs
+- **36+ backends** — llama.cpp, vLLM, transformers, whisper, diffusers, MLX...
+- **Any hardware** — NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
+- **Multi-user ready** — API key auth, user quotas, role-based access
+- **Built-in AI agents** — autonomous agents with tool use, RAG, MCP, and skills
+- **Privacy-first** — your data never leaves your infrastructure

-- **Composable by design**: backends are separate and pulled on demand, so you install only what your model needs
-- **Open and extensible**: load any model, or build your own backend in any language against an open interface
-- **Drop-in API compatibility**: OpenAI, Anthropic, and ElevenLabs APIs across every backend
-- **Any model, any modality**: LLMs, vision, voice, image, and video behind one API
-- **Any hardware**: NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
-- **Multi-user ready**: API key auth, user quotas, role-based access
-- **Built-in AI agents**: autonomous agents with tool use, RAG, MCP, and skills
-- **Privacy-first**: your data never leaves your infrastructure
-
-![A small LocalAI core with backends (llama.cpp, vLLM, MLX, whisper.cpp, stable-diffusion, kokoro, parakeet.cpp...) plugged in as separate on-demand images](docs/static/images/diagrams/composable-core.png)
-
-Created by [Ettore Di Giacinto](https://github.com/mudler) and maintained by the [LocalAI team](#team).
+Created and maintained by [Ettore Di Giacinto](https://github.com/mudler).

 > [:book: Documentation](https://localai.io/) | [:speech_balloon: Discord](https://discord.gg/uJAeKSAGDy) | [💻 Quickstart](https://localai.io/basics/getting_started/) | [🖼️ Models](https://models.localai.io/) | [❓FAQ](https://localai.io/faq/)

@@ -161,30 +143,13 @@ local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
 local-ai run oci://localai/phi-2:latest
 ```

-To test a running LocalAI server from the terminal, open an interactive chat session from another shell. Inside the prompt, `/models` lists installed models and `/model <name>` switches between them.
-
-```bash
-# Terminal 1
-local-ai run llama-3.2-1b-instruct:q4_k_m
-
-# Terminal 2
-local-ai chat --model llama-3.2-1b-instruct:q4_k_m
-```
-
 > **Automatic Backend Detection**: LocalAI automatically detects your GPU capabilities and downloads the appropriate backend. For advanced options, see [GPU Acceleration](https://localai.io/features/gpu-acceleration/).

 For more details, see the [Getting Started guide](https://localai.io/basics/getting_started/).

 ## Latest News

-- **June 2026**: New [realtime voice assistant demo](https://github.com/localai-org/localai-realtime-demo) (a tiny Go client for the Realtime API with a full talk-back voice loop and tool calling), plus [streaming of the realtime LLM / TTS / transcription pipeline stages](https://github.com/mudler/LocalAI/pull/10176) and [configurable WebRTC ICE candidates](https://github.com/mudler/LocalAI/pull/10231).
-- **June 2026**: Big speech push: the [parakeet.cpp](https://github.com/mudler/parakeet.cpp) ASR engine gains [NeMo-faithful segment timestamps](https://github.com/mudler/LocalAI/pull/10207), a [multilingual streaming Nemotron-3.5 model](https://github.com/mudler/LocalAI/pull/10199), [dynamic batching for concurrent transcription](https://github.com/mudler/LocalAI/pull/10112) and [CUDA graphs](https://github.com/mudler/LocalAI/pull/10273); the new [CrispASR backend](https://github.com/mudler/LocalAI/pull/10099) adds multi-architecture ASR + TTS, and [60 Piper TTS voices across 42 languages](https://github.com/mudler/LocalAI/pull/10296) land in the gallery (plus [per-request TTS instructions and params](https://github.com/mudler/LocalAI/pull/10172)).
-- **June 2026**: New backends and models: [locate-anything.cpp](https://github.com/mudler/LocalAI/pull/10264) for open-vocabulary object detection via ggml, [Ideogram4 image generation](https://github.com/mudler/LocalAI/pull/10201) in stablediffusion-ggml, [llama.cpp video input](https://github.com/mudler/LocalAI/pull/10216), and the [Gemma 4 QAT family with MTP speculative-decoding pairs](https://github.com/mudler/LocalAI/pull/10215). Plus an [interactive CLI chat mode](https://github.com/mudler/LocalAI/pull/10226) and [RAG source citations in agent responses](https://github.com/mudler/LocalAI/pull/10228).
-- **June 2026**: Distributed mode hardening: [prefix-cache-aware routing](https://github.com/mudler/LocalAI/pull/10071), a [production-ready request router with auto-sized embedding/rerank batches](https://github.com/mudler/LocalAI/pull/10104), [ds4 layer-split distributed inference](https://github.com/mudler/LocalAI/pull/10098), [NATS JWT auth + TLS/mTLS](https://github.com/mudler/LocalAI/pull/10159), and [resumable file uploads](https://github.com/mudler/LocalAI/pull/10109).
-- **May 2026**: **LocalAI 4.3.0** - `llama.cpp` [prompt cache on by default](https://github.com/mudler/LocalAI/pull/9925) (repeated system prompts collapse from minutes to seconds), [keyless cosign signing of backend OCI images](https://github.com/mudler/LocalAI/pull/9823), [per-API-key + per-user usage attribution](https://github.com/mudler/LocalAI/pull/9920), Distributed v3 with [per-request replica routing](https://github.com/mudler/LocalAI/pull/9968). [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.3.0)
-- **May 2026**: **LocalAI 4.2.0** - LocalAI sees and hears: [voice recognition](https://github.com/mudler/LocalAI/pull/9500), [face recognition + antispoofing liveness](https://github.com/mudler/LocalAI/pull/9480), speaker diarization. Plus [drop-in Ollama API](https://github.com/mudler/LocalAI/pull/9284), [video generation](https://github.com/mudler/LocalAI/pull/9420), redesigned UI with i18n + admin-configurable branding, vLLM at feature parity with llama.cpp, and 11 new backends. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.2.0)
-- **April 2026**: **LocalAI 4.1.0** - LocalAI becomes a control tower: distributed cluster mode with VRAM-aware smart routing + autoscaling, multi-user platform with OIDC and API keys, per-user quotas with predictive analytics, in-UI fine-tuning with TRL (auto-export to GGUF), on-the-fly quantization backend, visual pipeline editor. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.1.0)
-- **March 2026**: **LocalAI 4.0.0** - native agentic orchestration with the new [Agenthub](https://agenthub.localai.io) community hub, full React UI rewrite with Canvas mode, [MCP Apps + client-side](https://github.com/mudler/LocalAI/pull/8947) with tool streaming, [WebRTC realtime audio](https://github.com/mudler/LocalAI/pull/8790), [MLX-distributed](https://github.com/mudler/LocalAI/pull/8801). [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.0.0)
+- **March 2026**: [Agent management](https://github.com/mudler/LocalAI/pull/8820), [New React UI](https://github.com/mudler/LocalAI/pull/8772), [WebRTC](https://github.com/mudler/LocalAI/pull/8790), [MLX-distributed via P2P and RDMA](https://github.com/mudler/LocalAI/pull/8801), [MCP Apps, MCP Client-side](https://github.com/mudler/LocalAI/pull/8947)
 - **February 2026**: [Realtime API for audio-to-audio with tool calling](https://github.com/mudler/LocalAI/pull/6245), [ACE-Step 1.5 support](https://github.com/mudler/LocalAI/pull/8396)
 - **January 2026**: **LocalAI 3.10.0** — Anthropic API support, Open Responses API, video & image generation (LTX-2), unified GPU backends, tool streaming, Moonshine, Pocket-TTS. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v3.10.0)
 - **December 2025**: [Dynamic Memory Resource reclaimer](https://github.com/mudler/LocalAI/pull/7583), [Automatic multi-GPU model fitting (llama.cpp)](https://github.com/mudler/LocalAI/pull/7584), [Vibevoice backend](https://github.com/mudler/LocalAI/pull/7494)
@@ -220,26 +185,10 @@ For older news and full release notes, see [GitHub Releases](https://github.com/

 ## Supported Backends & Acceleration

-LocalAI supports **60+ backends** including llama.cpp, vLLM, SGLang, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).
+LocalAI supports **36+ backends** including llama.cpp, vLLM, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).

 See the full [Backend & Model Compatibility Table](https://localai.io/model-compatibility/) and [GPU Acceleration guide](https://localai.io/features/gpu-acceleration/).

-### Backends built by us
-
-Most backends wrap a best-in-class upstream engine. A handful of them are native C/C++/GGML engines (no Python at inference) developed and maintained by the LocalAI project itself:
-
-| Backend | What it does |
-|---------|-------------|
-| [parakeet.cpp](https://github.com/mudler/parakeet.cpp) | C++/GGML port of NVIDIA NeMo Parakeet ASR (tdt/ctc/rnnt/hybrid), with cache-aware streaming transcription |
-| [voxtral.c](https://github.com/mudler/voxtral.c) | Voxtral Realtime 4B speech-to-text in pure C |
-| [vibevoice.cpp](https://github.com/mudler/vibevoice.cpp) | Native port of Microsoft VibeVoice for TTS (voice cloning) and long-form ASR with speaker diarization |
-| [rf-detr.cpp](https://github.com/mudler/rf-detr.cpp) | Native RF-DETR object detection and instance segmentation |
-| [locate-anything.cpp](https://github.com/mudler/locate-anything.cpp) | Open-vocabulary object detection and visual grounding (LocateAnything-3B) |
-| [depth-anything.cpp](https://github.com/mudler/depth-anything.cpp) | Depth Anything 3 monocular metric depth + camera pose estimation |
-| [privacy-filter.cpp](https://github.com/localai-org/privacy-filter.cpp) | Standalone GGML PII/NER token-classification engine powering LocalAI's PII redaction tier |
-| [LocalVQE](https://github.com/localai-org/LocalVQE) | Joint acoustic echo cancellation, noise suppression, and dereverberation |
-| [local-store](https://github.com/mudler/LocalAI) | Local-first vector database for embeddings (shipped in-tree) |
-
 ## Resources

 - [Documentation](https://localai.io/)
@@ -249,16 +198,15 @@ Most backends wrap a best-in-class upstream engine. A handful of them are native
 - [Integrations & community projects](https://localai.io/docs/integrations/)
 - [Installation video walkthrough](https://www.youtube.com/watch?v=cMVNnlqwfw4)
 - [Media & blog posts](https://localai.io/basics/news/#media-blogs-social)
-- [Examples](https://github.com/mudler/LocalAI-examples) — including the [realtime voice assistant demo](https://github.com/localai-org/localai-realtime-demo) (Go client for the Realtime API with tool calling)
+- [Examples](https://github.com/mudler/LocalAI-examples)

-## Team
+## Autonomous Development Team

-LocalAI is maintained by a small team of humans, together with the wider community of contributors.
+LocalAI is maintained with the help of a team of autonomous AI agents led by an AI Scrum Master.

-- **[Ettore Di Giacinto](https://github.com/mudler)** — original author and project lead
-- **[Richard Palethorpe](https://github.com/richiejp)** — maintainer
-
-A huge thank you to everyone who contributes code, reviews PRs, files issues, and helps users in [Discord](https://discord.gg/uJAeKSAGDy) — LocalAI is a community-driven project and wouldn't exist without you. See the full [contributors list](https://github.com/mudler/LocalAI/graphs/contributors).
+- **Live Reports**: [reports.localai.io](http://reports.localai.io)
+- **Project Board**: [Agent task tracking](https://github.com/users/mudler/projects/6)
+- **Blog Post**: [Learn about the experiment](https://mudler.pm/posts/2026/02/28/a-call-to-open-source-maintainers-stop-babysitting-ai-how-i-built-a-100-local-autonomous-dev-team-to-maintain-localai-and-why-you-should-too/)

 ## Citation

@@ -286,22 +234,11 @@ A huge thank you to our generous sponsors who support this project covering CI e
  <a href="https://www.spectrocloud.com/" target="blank">
    <img height="200" src="https://github.com/user-attachments/assets/72eab1dd-8b93-4fc0-9ade-84db49f24962">
  </a>
-</p>
-
-<details>
-
-<summary>
-Past sponsors
-</summary>
-
-<p align="center">
  <a href="https://www.premai.io/" target="blank">
    <img height="200" src="https://github.com/mudler/LocalAI/assets/2420543/42e4ca83-661e-4f79-8e46-ae43689683d6"> <br>
  </a>
 </p>

-</details>
-
 ### Individual sponsors

 A special thanks to individual sponsors, a full list is on [GitHub](https://github.com/sponsors/mudler) and [buymeacoffee](https://buymeacoffee.com/mudler). Special shout out to [drikster80](https://github.com/drikster80) for being generous. Thank you everyone!
@@ -312,7 +249,7 @@ A special thanks to individual sponsors, a full list is on [GitHub](https://gith

 ## License

-LocalAI is a community-driven project created by [Ettore Di Giacinto](https://github.com/mudler/) and maintained by the [LocalAI team](#team).
+LocalAI is a community-driven project created by [Ettore Di Giacinto](https://github.com/mudler/).

 MIT - Author Ettore Di Giacinto <mudler@localai.io>

--- a/backend/Dockerfile.base-grpc-builder
+++ b/backend/Dockerfile.base-grpc-builder
@@ -1,98 +0,0 @@
-# syntax=docker/dockerfile:1.7
-#
-# Pre-built builder base image for LocalAI's C++ backends.
-#
-# This Dockerfile is the source of truth for the
-# `quay.io/go-skynet/ci-cache:base-grpc-*` images that
-# `.github/workflows/base-images.yml` builds and pushes. The output of a
-# build is a fully-prepped builder layer containing:
-#
-#   - apt build deps (build-essential, ccache, git, make, pkg-config,
-#     libcurl4-openssl-dev, libssl-dev, curl, unzip, wget, ca-certificates)
-#   - cmake (apt or, when CMAKE_FROM_SOURCE=true, compiled from
-#     ${CMAKE_VERSION})
-#   - protoc v27.1 at /usr/local/bin/protoc
-#   - gRPC ${GRPC_VERSION} compiled and installed at /opt/grpc
-#   - Conditional CUDA toolkit (BUILD_TYPE=cublas|l4t, SKIP_DRIVERS=false)
-#     including the cuda-13 + arm64 cudss/nvpl special case
-#   - Conditional ROCm/HIP build deps (BUILD_TYPE=hipblas)
-#   - Conditional Vulkan SDK 1.4.335.0 (BUILD_TYPE=vulkan)
-#
-# Variants built by the workflow (matrix in base-images.yml):
-#
-#   base-grpc-amd64                 ubuntu:24.04, CPU-only
-#   base-grpc-arm64                 ubuntu:24.04, CPU-only
-#   base-grpc-cuda-12-amd64         ubuntu:24.04 + CUDA 12.8
-#   base-grpc-cuda-13-amd64         ubuntu:22.04 + CUDA 13.0
-#   base-grpc-cuda-13-arm64         ubuntu:24.04 + CUDA 13.0 (sbsa)
-#   base-grpc-l4t-cuda-12-arm64     ubuntu:22.04 + CUDA 12.x (legacy JetPack)
-#   base-grpc-rocm-amd64            rocm/dev-ubuntu-24.04:7.2.1 + hipblas
-#   base-grpc-vulkan-amd64          ubuntu:24.04 + Vulkan SDK 1.4.335
-#   base-grpc-vulkan-arm64          ubuntu:24.04 + Vulkan SDK ARM 1.4.335
-#   base-grpc-intel-amd64           intel/oneapi-basekit:2025.3.2 (sycl)
-#
-# This is a SINGLE-stage Dockerfile by design: the final image IS the
-# builder base. The intermediate gRPC compile happens inside this same
-# stage so consumer Dockerfiles in PR 2 can simply
-# `FROM quay.io/go-skynet/ci-cache:base-grpc-<variant>` without needing a
-# COPY --from=grpc step. /opt/grpc is the canonical install prefix and
-# downstream builds will add it to CMAKE_PREFIX_PATH (or copy to
-# /usr/local) the same way Dockerfile.llama-cpp does today.
-#
-# Install logic lives in .docker/install-base-deps.sh, which is also
-# bind-mounted by the variant Dockerfiles' builder-fromsource stage.
-# This guarantees bit-equivalence between the prebuilt CI base and the
-# from-source local-dev path — both invoke the same script with the
-# same env inputs.
-
-ARG BASE_IMAGE=ubuntu:24.04
-
-FROM ${BASE_IMAGE}
-
-ARG BASE_IMAGE=ubuntu:24.04
-ARG BUILD_TYPE=""
-ARG CUDA_MAJOR_VERSION=""
-ARG CUDA_MINOR_VERSION=""
-ARG CMAKE_FROM_SOURCE=false
-# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain
-# detection / arch table issues.
-ARG CMAKE_VERSION=3.31.10
-ARG GRPC_VERSION=v1.65.0
-ARG GRPC_MAKEFLAGS="-j4 -Otarget"
-ARG SKIP_DRIVERS=false
-ARG TARGETARCH
-ARG UBUNTU_VERSION=2404
-ARG APT_MIRROR=""
-ARG APT_PORTS_MIRROR=""
-ARG AMDGPU_TARGETS=""
-
-ENV BUILD_TYPE=${BUILD_TYPE} \
-    CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
-    CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
-    CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
-    CMAKE_VERSION=${CMAKE_VERSION} \
-    GRPC_VERSION=${GRPC_VERSION} \
-    GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
-    SKIP_DRIVERS=${SKIP_DRIVERS} \
-    TARGETARCH=${TARGETARCH} \
-    UBUNTU_VERSION=${UBUNTU_VERSION} \
-    APT_MIRROR=${APT_MIRROR} \
-    APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
-    AMDGPU_TARGETS=${AMDGPU_TARGETS} \
-    MAKEFLAGS=${GRPC_MAKEFLAGS} \
-    DEBIAN_FRONTEND=noninteractive
-
-# CUDA on PATH (no-op when CUDA isn't installed)
-ENV PATH=/usr/local/cuda/bin:${PATH}
-# HipBLAS / ROCm on PATH (no-op when ROCm isn't installed)
-ENV PATH=/opt/rocm/bin:${PATH}
-
-WORKDIR /build
-
-# Single RUN that delegates to .docker/install-base-deps.sh — the same
-# script the variant Dockerfiles' builder-fromsource stage runs.
-RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
-    --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
-    bash /usr/local/sbin/install-base-deps
-
-WORKDIR /
--- a/backend/Dockerfile.ds4
+++ b/backend/Dockerfile.ds4
@@ -1,41 +0,0 @@
-ARG BASE_IMAGE=ubuntu:24.04
-ARG APT_MIRROR=""
-ARG APT_PORTS_MIRROR=""
-
-# BASE_IMAGE is either ubuntu:24.04 (for cpu builds) or nvidia/cuda:13.0.0-devel-ubuntu24.04
-# (for cublas builds). Both ship apt + Ubuntu Noble packages; the nvidia/cuda base
-# additionally provides /usr/local/cuda. Darwin (Metal) builds bypass this Dockerfile
-# entirely via scripts/build/ds4-darwin.sh.
-FROM ${BASE_IMAGE} AS builder
-ARG BUILD_TYPE
-ARG TARGETARCH
-ARG TARGETVARIANT
-
-ENV BUILD_TYPE=${BUILD_TYPE} \
-    DEBIAN_FRONTEND=noninteractive \
-    PATH=/usr/local/cuda/bin:${PATH}
-
-WORKDIR /build
-
-# Install build-time deps via plain apt - install-base-deps.sh's full pipeline
-# (CUDA keyring + from-source gRPC) is unnecessary here:
-#   - CUDA: when BASE_IMAGE=nvidia/cuda:*, /usr/local/cuda is already populated;
-#     for the cpu build we don't need CUDA at all.
-#   - gRPC/Protobuf: system apt packages are sufficient; ds4's wrapper only links
-#     against them, it doesn't ship the gRPC source tree.
-#   - nlohmann-json: dsml_renderer's only third-party dep.
-RUN apt-get update && \
-    apt-get install -y --no-install-recommends \
-        git cmake build-essential pkg-config ca-certificates \
-        libgrpc++-dev libprotobuf-dev protobuf-compiler protobuf-compiler-grpc \
-        nlohmann-json3-dev && \
-    apt-get clean && \
-    rm -rf /var/lib/apt/lists/*
-
-COPY . /LocalAI
-
-RUN --mount=type=cache,target=/root/.ccache,id=ds4-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
-    make -C /LocalAI/backend/cpp/ds4 BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
-
-FROM scratch
-COPY --from=builder /LocalAI/backend/cpp/ds4/package/. ./
--- a/backend/Dockerfile.golang
+++ b/backend/Dockerfile.golang
@@ -1,6 +1,4 @@
 ARG BASE_IMAGE=ubuntu:24.04
-ARG APT_MIRROR=""
-ARG APT_PORTS_MIRROR=""

 FROM ${BASE_IMAGE} AS builder
 ARG BACKEND=rerankers
@@ -16,20 +14,8 @@ ARG TARGETARCH
 ARG TARGETVARIANT
 ARG GO_VERSION=1.25.4
 ARG UBUNTU_VERSION=2404
-ARG AMDGPU_TARGETS
-ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
-ARG APT_MIRROR
-ARG APT_PORTS_MIRROR

-# gcc-14 is the default on noble (ubuntu:24.04) but absent from jammy
-# (the L4T jetpack r36.4.0 base). LocalVQE specifically needs it; the
-# other Go backends compile fine with the default gcc shipped via
-# build-essential. So: try gcc-14 from the configured repos, fall back
-# gracefully when it's not available so jammy-based builds don't fail
-# at the apt step.
-RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
-    APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
-    apt-get update && \
+RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential \
        git ccache \
@@ -37,12 +23,6 @@ RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mi
        make cmake wget libopenblas-dev \
        curl unzip \
        libssl-dev && \
-    if apt-cache show gcc-14 >/dev/null 2>&1 && apt-cache show g++-14 >/dev/null 2>&1; then \
-        apt-get install -y --no-install-recommends gcc-14 g++-14 && \
-        update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-14 100 \
-            --slave /usr/bin/g++ g++ /usr/bin/g++-14 \
-            --slave /usr/bin/gcov gcov /usr/bin/gcov-14; \
-    fi && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

@@ -167,7 +147,6 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
-            hipblaslt-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
@@ -206,16 +185,6 @@ RUN if [ "${BACKEND}" = "opus" ]; then \
    apt-get clean && rm -rf /var/lib/apt/lists/*; \
 fi

-# CrispASR's piper TTS backend dlopens libespeak-ng at runtime to phonemize
-# non-English text (the MIT-clean path; English uses a built-in G2P). Install
-# the espeak-ng runtime + its libpcaudio/libsonic deps + voice data so
-# package.sh can bundle them into the FROM scratch image.
-RUN if [ "${BACKEND}" = "crispasr" ]; then \
-    apt-get update && apt-get install -y --no-install-recommends \
-        espeak-ng-data libespeak-ng1 libpcaudio0 libsonic0 && \
-    apt-get clean && rm -rf /var/lib/apt/lists/*; \
-fi
-
 COPY . /LocalAI

 RUN git config --global --add safe.directory /LocalAI
--- a/backend/Dockerfile.ik-llama-cpp
+++ b/backend/Dockerfile.ik-llama-cpp
@@ -1,149 +1,279 @@
 ARG BASE_IMAGE=ubuntu:24.04
-# BUILDER_BASE_IMAGE defaults to BASE_IMAGE so the Dockerfile parses even
-# when no prebuilt base is supplied. The builder-prebuilt stage is only
-# entered when BUILDER_TARGET=builder-prebuilt, so a "wrong" fallback
-# content here is harmless — BuildKit prunes the unreferenced builder.
-ARG BUILDER_BASE_IMAGE=${BASE_IMAGE}
-# BUILDER_TARGET selects which builder stage the final scratch image copies
-# package output from. Declared at global scope (before any FROM) so it's
-# usable in `FROM ${BUILDER_TARGET}` below. Default keeps local
-# `make backends/ik-llama-cpp` on the from-source path.
-ARG BUILDER_TARGET=builder-fromsource
-ARG APT_MIRROR=""
-ARG APT_PORTS_MIRROR=""
+ARG GRPC_BASE_IMAGE=${BASE_IMAGE}


-# ============================================================================
-# Stage: builder-fromsource — self-contained build path.
-# Runs .docker/install-base-deps.sh (apt deps + cmake + protoc + gRPC +
-# conditional CUDA/ROCm/Vulkan), copies /opt/grpc to /usr/local, then
-# compiles the variant. Used when BUILDER_TARGET=builder-fromsource (the
-# default; local `make backends/ik-llama-cpp`).
-#
-# The install script is the same one that backend/Dockerfile.base-grpc-builder
-# runs, so the result is bit-equivalent to the prebuilt-base path
-# (builder-prebuilt below).
-# ============================================================================
-FROM ${BASE_IMAGE} AS builder-fromsource
-ARG BUILD_TYPE
-ARG CUDA_MAJOR_VERSION
-ARG CUDA_MINOR_VERSION
+# The grpc target does one thing: it builds and installs gRPC. This is in its own layer so that it can be effectively cached by CI.
+# You probably don't need to change anything here, and if you do, make sure that CI is adjusted so that the cache continues to work.
+FROM ${GRPC_BASE_IMAGE} AS grpc
+
+# This is a bit of a hack, but it's required in order to be able to effectively cache this layer in CI
+ARG GRPC_MAKEFLAGS="-j4 -Otarget"
+ARG GRPC_VERSION=v1.65.0
 ARG CMAKE_FROM_SOURCE=false
 # CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
 ARG CMAKE_VERSION=3.31.10
-ARG GRPC_VERSION=v1.65.0
-ARG GRPC_MAKEFLAGS="-j4 -Otarget"
+
+ENV MAKEFLAGS=${GRPC_MAKEFLAGS}
+
+WORKDIR /build
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        ca-certificates \
+        build-essential curl libssl-dev \
+        git wget && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+
+# Install CMake (the version in 22.04 is too old)
+RUN <<EOT bash
+    if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
+        curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
+    else
+        apt-get update && \
+        apt-get install -y \
+            cmake && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+# We install gRPC to a separate prefix here so that only the build artifacts are copied in later.
+# This saves several hundred MB on the final Docker image size compared with copying in the entire gRPC source tree
+# and running make install in the target container.
+RUN git clone --recurse-submodules --jobs 4 -b ${GRPC_VERSION} --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
+    mkdir -p /build/grpc/cmake/build && \
+    cd /build/grpc/cmake/build && \
+    sed -i "216i\  TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt" && \
+    cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../.. && \
+    make && \
+    make install && \
+    rm -rf /build
+
+FROM ${BASE_IMAGE} AS builder
+ARG CMAKE_FROM_SOURCE=false
+ARG CMAKE_VERSION=3.31.10
+# We can target specific CUDA ARCHITECTURES like --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
+ARG CUDA_DOCKER_ARCH
+ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
+ARG CMAKE_ARGS
+ENV CMAKE_ARGS=${CMAKE_ARGS}
+ARG BACKEND=rerankers
+ARG BUILD_TYPE
+ENV BUILD_TYPE=${BUILD_TYPE}
+ARG CUDA_MAJOR_VERSION
+ARG CUDA_MINOR_VERSION
 ARG SKIP_DRIVERS=false
+ENV CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION}
+ENV CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION}
+ENV DEBIAN_FRONTEND=noninteractive
 ARG TARGETARCH
 ARG TARGETVARIANT
 ARG GO_VERSION=1.25.4
 ARG UBUNTU_VERSION=2404
-ARG APT_MIRROR
-ARG APT_PORTS_MIRROR
-ARG AMDGPU_TARGETS=""
-ARG BACKEND=rerankers
-# CUDA target archs, e.g. --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
-ARG CUDA_DOCKER_ARCH
-ARG CMAKE_ARGS

-ENV BUILD_TYPE=${BUILD_TYPE} \
-    CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
-    CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
-    CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
-    CMAKE_VERSION=${CMAKE_VERSION} \
-    GRPC_VERSION=${GRPC_VERSION} \
-    GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
-    SKIP_DRIVERS=${SKIP_DRIVERS} \
-    TARGETARCH=${TARGETARCH} \
-    UBUNTU_VERSION=${UBUNTU_VERSION} \
-    APT_MIRROR=${APT_MIRROR} \
-    APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
-    AMDGPU_TARGETS=${AMDGPU_TARGETS} \
-    CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH} \
-    CMAKE_ARGS=${CMAKE_ARGS} \
-    DEBIAN_FRONTEND=noninteractive
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        build-essential \
+        ccache git \
+        ca-certificates \
+        make \
+        pkg-config libcurl4-openssl-dev \
+        curl unzip \
+        libssl-dev wget && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*

-# CUDA on PATH (no-op when CUDA isn't installed)
+# Cuda
 ENV PATH=/usr/local/cuda/bin:${PATH}
-# HipBLAS / ROCm on PATH (no-op when ROCm isn't installed)
+
+# HipBLAS requirements
 ENV PATH=/opt/rocm/bin:${PATH}

-WORKDIR /build

-# Install everything via the shared script — the same one that
-# backend/Dockerfile.base-grpc-builder runs, so the prebuilt CI base and
-# this from-source path are bit-equivalent.
-RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
-    --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
-    bash /usr/local/sbin/install-base-deps
+# Vulkan requirements
+RUN <<EOT bash
+    if [ "${BUILD_TYPE}" = "vulkan" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
+        apt-get update && \
+        apt-get install -y  --no-install-recommends \
+            software-properties-common pciutils wget gpg-agent && \
+        apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
+            libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
+            libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
+            git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
+            ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
+            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
+        if [ "amd64" = "$TARGETARCH" ]; then
+            wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
+            tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
+            rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
+            mkdir -p /opt/vulkan-sdk && \
+            mv 1.4.335.0 /opt/vulkan-sdk/ && \
+            cd /opt/vulkan-sdk/1.4.335.0 && \
+            ./vulkansdk --no-deps --maxjobs \
+                vulkan-loader \
+                vulkan-validationlayers \
+                vulkan-extensionlayer \
+                vulkan-tools \
+                shaderc && \
+            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/ && \
+            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/ && \
+            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/ && \
+            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/ && \
+            rm -rf /opt/vulkan-sdk
+        fi
+        if [ "arm64" = "$TARGETARCH" ]; then
+            mkdir vulkan && cd vulkan && \
+            curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
+            tar -xvf vulkan-sdk.tar.xz && \
+            rm vulkan-sdk.tar.xz && \
+            cd 1.4.335.0 && \
+            cp -rfv aarch64/bin/* /usr/bin/ && \
+            cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
+            cp -rfv aarch64/include/* /usr/include/ && \
+            cp -rfv aarch64/share/* /usr/share/ && \
+            cd ../.. && \
+            rm -rf vulkan
+        fi
+        ldconfig && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+# CuBLAS requirements
+RUN <<EOT bash
+    if ( [ "${BUILD_TYPE}" = "cublas" ] || [ "${BUILD_TYPE}" = "l4t" ] ) && [ "${SKIP_DRIVERS}" = "false" ]; then
+        apt-get update && \
+        apt-get install -y  --no-install-recommends \
+            software-properties-common pciutils
+        if [ "amd64" = "$TARGETARCH" ]; then
+            curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb
+        fi
+        if [ "arm64" = "$TARGETARCH" ]; then
+            if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
+                curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb
+            else
+                curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb
+            fi
+        fi
+        dpkg -i cuda-keyring_1.1-1_all.deb && \
+        rm -f cuda-keyring_1.1-1_all.deb && \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
+        if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "$TARGETARCH" ]; then
+            apt-get install -y --no-install-recommends \
+            libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
+        fi
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+
+# https://github.com/NVIDIA/Isaac-GR00T/issues/343
+RUN <<EOT bash
+    if [ "${BUILD_TYPE}" = "cublas" ] && [ "${TARGETARCH}" = "arm64" ]; then
+        wget https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
+        dpkg -i cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
+        cp /var/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/ && \
+        apt-get update && apt-get -y install cudss cudss-cuda-${CUDA_MAJOR_VERSION} && \
+        wget https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
+        dpkg -i nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
+        cp /var/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/ && \
+        apt-get update && apt-get install -y nvpl
+    fi
+EOT
+
+# If we are building with clblas support, we need the libraries for the builds
+RUN if [ "${BUILD_TYPE}" = "clblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            libclblast-dev && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/* \
+    ; fi
+
+RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            hipblas-dev \
+            rocblas-dev && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/* && \
+        # I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install, which results in local-ai and others not being able
+        # to locate the libraries. We run ldconfig ourselves to work around this packaging deficiency
+        ldconfig \
+    ; fi
+
+RUN echo "TARGETARCH: $TARGETARCH"
+
+# We need protoc installed, and the version in 22.04 is too old. We will create one as part of installing the gRPC build below,
+# but that will also bring in a newer version of absl, which stablediffusion cannot compile with. This version of protoc is only
+# here so that we can generate the gRPC code for the stablediffusion build.
+RUN <<EOT bash
+    if [ "amd64" = "$TARGETARCH" ]; then
+        curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip && \
+        unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
+        rm protoc.zip
+    fi
+    if [ "arm64" = "$TARGETARCH" ]; then
+        curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip && \
+        unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
+        rm protoc.zip
+    fi
+EOT
+
+# Install CMake (the version in 22.04 is too old)
+RUN <<EOT bash
+    if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
+        curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
+    else
+        apt-get update && \
+        apt-get install -y \
+            cmake && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+COPY --from=grpc /opt/grpc /usr/local

-# Mirror builder-prebuilt: copy gRPC from /opt/grpc to /usr/local so
-# CMake's find_package finds it at the canonical prefix the Makefile expects.
-RUN cp -a /opt/grpc/. /usr/local/

 COPY . /LocalAI

-# BuildKit cache mount for ccache. See Dockerfile.llama-cpp (commit 9228e5b4)
-# for the rationale. Distinct mount id so ik-llama-cpp's cache doesn't
-# overlap with llama-cpp's — ik_llama.cpp is a different fork with
-# different source.
-#
-# The compile body is shared with builder-prebuilt via .docker/ik-llama-cpp-compile.sh.
-RUN --mount=type=bind,source=.docker/ik-llama-cpp-compile.sh,target=/usr/local/sbin/compile.sh \
-    --mount=type=cache,target=/root/.ccache,id=ik-llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
-    bash /usr/local/sbin/compile.sh
+RUN <<'EOT' bash
+set -euxo pipefail
+
+if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
+  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
+  export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
+  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
+  rm -rf /LocalAI/backend/cpp/ik-llama-cpp-*-build
+fi
+
+cd /LocalAI/backend/cpp/ik-llama-cpp
+
+if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
+  # ARM64 / ROCm: build without x86 SIMD
+  make ik-llama-cpp-fallback
+else
+  # ik_llama.cpp's IQK kernels require at least AVX2
+  make ik-llama-cpp-avx2
+fi
+EOT


 # Copy libraries using a script to handle architecture differences
 RUN make -BC /LocalAI/backend/cpp/ik-llama-cpp package


-# ============================================================================
-# Stage: builder-prebuilt — uses the pre-built base from
-# quay.io/go-skynet/ci-cache:base-grpc-* (built by .github/workflows/base-images.yml).
-# That image already has gRPC at /opt/grpc + apt deps + CUDA/ROCm/Vulkan
-# pre-installed, so we just copy gRPC to /usr/local and compile. Used when
-# BUILDER_TARGET=builder-prebuilt (CI when the matrix entry sets
-# builder-base-image).
-# ============================================================================
-FROM ${BUILDER_BASE_IMAGE} AS builder-prebuilt
-
-ARG BUILD_TYPE
-ENV BUILD_TYPE=${BUILD_TYPE}
-ARG CUDA_DOCKER_ARCH
-ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
-ARG CMAKE_ARGS
-ENV CMAKE_ARGS=${CMAKE_ARGS}
-ARG TARGETARCH
-ARG TARGETVARIANT
-
-# The base-grpc-* image installs gRPC to /opt/grpc but doesn't copy it to
-# /usr/local. Mirror what the from-source path does so the compile step
-# can find gRPC at the canonical prefix the Makefile expects.
-RUN cp -a /opt/grpc/. /usr/local/
-
-COPY . /LocalAI
-
-RUN --mount=type=bind,source=.docker/ik-llama-cpp-compile.sh,target=/usr/local/sbin/compile.sh \
-    --mount=type=cache,target=/root/.ccache,id=ik-llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
-    bash /usr/local/sbin/compile.sh
-
-RUN make -BC /LocalAI/backend/cpp/ik-llama-cpp package
-
-
-# ============================================================================
-# Final stage — copies package output from one of the two builders.
-# BUILDER_TARGET selects which one. BuildKit prunes the unreferenced builder.
-#
-# BuildKit doesn't support variable expansion in `COPY --from=` directly,
-# so we resolve the ARG by aliasing the chosen builder to a fixed stage
-# name via `FROM ${BUILDER_TARGET} AS builder` and then COPY --from=builder.
-# BUILDER_TARGET itself is declared as a global ARG at the top of this
-# file (required for use in FROM), so we just re-import it into this
-# stage's scope before the FROM directive.
-# ============================================================================
-FROM ${BUILDER_TARGET} AS builder
-
 FROM scratch


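The inline compile step added to backend/Dockerfile.ik-llama-cpp does two things before invoking make: it escapes the semicolons in CUDA_DOCKER_ARCH (presumably so CMake does not split the value as a list when it arrives via CMAKE_ARGS), and it picks a make target from TARGETARCH and BUILD_TYPE. A minimal sketch of that logic as standalone bash functions follows. The function names `escape_cuda_archs` and `select_make_target` are hypothetical and do not exist in the repository; only the parameter expansion and the two target names are taken from the heredoc above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: escape every ';' in a CUDA arch list as '\;',
# using the same parameter expansion as the Dockerfile heredoc.
escape_cuda_archs() {
  local archs="$1"
  printf '%s\n' "${archs//;/\\;}"
}

# Hypothetical helper: choose the make target the way the heredoc does.
# arm64 and hipblas builds skip x86 SIMD; everything else gets AVX2.
select_make_target() {
  local targetarch="$1" build_type="$2"
  if [ "$targetarch" = "arm64" ] || [ "$build_type" = "hipblas" ]; then
    printf '%s\n' "ik-llama-cpp-fallback"
  else
    printf '%s\n' "ik-llama-cpp-avx2"
  fi
}

# Example usage with the arch list from the Dockerfile comment.
echo "archs:  $(escape_cuda_archs '75;86;89;120')"
echo "target: $(select_make_target amd64 cublas)"
```

Keeping the two decisions in small functions makes them easy to check outside a Docker build; in the Dockerfile itself the same expressions run inline inside the quoted heredoc.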
--- a/backend/Dockerfile.llama-cpp
+++ b/backend/Dockerfile.llama-cpp
@@ -1,155 +1,290 @@
 ARG BASE_IMAGE=ubuntu:24.04
-# BUILDER_BASE_IMAGE defaults to BASE_IMAGE so the Dockerfile parses even
-# when no prebuilt base is supplied. The builder-prebuilt stage is only
-# entered when BUILDER_TARGET=builder-prebuilt, so a "wrong" fallback
-# content here is harmless — BuildKit prunes the unreferenced builder.
-ARG BUILDER_BASE_IMAGE=${BASE_IMAGE}
-# BUILDER_TARGET selects which builder stage the final scratch image copies
-# package output from. Declared at global scope (before any FROM) so it's
-# usable in `FROM ${BUILDER_TARGET}` below. Default keeps local
-# `make backends/llama-cpp` on the from-source path.
-ARG BUILDER_TARGET=builder-fromsource
-ARG APT_MIRROR=""
-ARG APT_PORTS_MIRROR=""
+ARG GRPC_BASE_IMAGE=${BASE_IMAGE}


-# ============================================================================
-# Stage: builder-fromsource — self-contained build path.
-# Runs .docker/install-base-deps.sh (apt deps + cmake + protoc + gRPC +
-# conditional CUDA/ROCm/Vulkan), copies /opt/grpc to /usr/local, then
-# compiles the variant. Used when BUILDER_TARGET=builder-fromsource (the
-# default; local `make backends/llama-cpp`).
-#
-# The install script is the same one that backend/Dockerfile.base-grpc-builder
-# runs, so the result is bit-equivalent to the prebuilt-base path
-# (builder-prebuilt below).
-# ============================================================================
-FROM ${BASE_IMAGE} AS builder-fromsource
-ARG BUILD_TYPE
-ARG CUDA_MAJOR_VERSION
-ARG CUDA_MINOR_VERSION
+# The grpc target does one thing: it builds and installs gRPC. This is in its own layer so that it can be effectively cached by CI.
+# You probably don't need to change anything here, and if you do, make sure that CI is adjusted so that the cache continues to work.
+FROM ${GRPC_BASE_IMAGE} AS grpc
+
+# This is a bit of a hack, but it's required in order to be able to effectively cache this layer in CI
+ARG GRPC_MAKEFLAGS="-j4 -Otarget"
+ARG GRPC_VERSION=v1.65.0
 ARG CMAKE_FROM_SOURCE=false
 # CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
 ARG CMAKE_VERSION=3.31.10
-ARG GRPC_VERSION=v1.65.0
-ARG GRPC_MAKEFLAGS="-j4 -Otarget"
-ARG SKIP_DRIVERS=false
-ARG TARGETARCH
-ARG TARGETVARIANT
-ARG GO_VERSION=1.25.4
-ARG UBUNTU_VERSION=2404
-ARG APT_MIRROR
-ARG APT_PORTS_MIRROR
-ARG AMDGPU_TARGETS
-# CUDA target archs, e.g. --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
-ARG CUDA_DOCKER_ARCH
-ARG CMAKE_ARGS

-ENV BUILD_TYPE=${BUILD_TYPE} \
-    CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
-    CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
-    CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
-    CMAKE_VERSION=${CMAKE_VERSION} \
-    GRPC_VERSION=${GRPC_VERSION} \
-    GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
-    SKIP_DRIVERS=${SKIP_DRIVERS} \
-    TARGETARCH=${TARGETARCH} \
-    UBUNTU_VERSION=${UBUNTU_VERSION} \
-    APT_MIRROR=${APT_MIRROR} \
-    APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
-    AMDGPU_TARGETS=${AMDGPU_TARGETS} \
-    CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH} \
-    CMAKE_ARGS=${CMAKE_ARGS} \
-    DEBIAN_FRONTEND=noninteractive
-
-# CUDA on PATH (no-op when CUDA isn't installed)
-ENV PATH=/usr/local/cuda/bin:${PATH}
-# HipBLAS / ROCm on PATH (no-op when ROCm isn't installed)
-ENV PATH=/opt/rocm/bin:${PATH}
+ENV MAKEFLAGS=${GRPC_MAKEFLAGS}

 WORKDIR /build

-# Install everything via the shared script — the same one that
-# backend/Dockerfile.base-grpc-builder runs, so the prebuilt CI base and
-# this from-source path are bit-equivalent.
-RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
-    --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
-    bash /usr/local/sbin/install-base-deps
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        ca-certificates \
+        build-essential curl libssl-dev \
+        git wget && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*

-# Mirror builder-prebuilt: copy gRPC from /opt/grpc to /usr/local so
-# CMake's find_package finds it at the canonical prefix the Makefile expects.
-RUN cp -a /opt/grpc/. /usr/local/
+# Install CMake (the version in 22.04 is too old)
+RUN <<EOT bash
+    if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
+        curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
+    else
+        apt-get update && \
+        apt-get install -y \
+            cmake && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT

-COPY . /LocalAI
+# We install GRPC to a different prefix here so that we can copy in only the build artifacts later.
+# This saves several hundred MB on the final docker image size vs copying in the entire GRPC source tree
+# and running make install in the target container.
+RUN git clone --recurse-submodules --jobs 4 -b ${GRPC_VERSION} --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
+    mkdir -p /build/grpc/cmake/build && \
+    cd /build/grpc/cmake/build && \
+    sed -i "216i\  TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt" && \
+    cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../.. && \
+    make && \
+    make install && \
+    rm -rf /build

-# BuildKit cache mount for ccache. Persists compiler outputs across builds
-# via the registry cache (cache-to: type=registry,mode=max in CI). On a
-# LLAMA_VERSION bump most TUs are byte-identical to the previous version's
-# preprocessed source — ccache returns the previous .o file and skips the
-# real compile. Same for LocalAI source changes that don't touch llama.cpp.
-# CMAKE_*_COMPILER_LAUNCHER threads ccache through CMake to wrap gcc/g++/nvcc.
-# sharing=locked serializes concurrent writes if multiple matrix variants
-# share the same cache mount id.
-#
-# The compile body is shared with builder-prebuilt via .docker/llama-cpp-compile.sh.
-RUN --mount=type=bind,source=.docker/llama-cpp-compile.sh,target=/usr/local/sbin/compile.sh \
-    --mount=type=cache,target=/root/.ccache,id=llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
-    bash /usr/local/sbin/compile.sh
-
-
-# Copy libraries using a script to handle architecture differences
-RUN make -BC /LocalAI/backend/cpp/llama-cpp package
-
-
-# ============================================================================
-# Stage: builder-prebuilt — uses the pre-built base from
-# quay.io/go-skynet/ci-cache:base-grpc-* (built by .github/workflows/base-images.yml).
-# That image already has gRPC at /opt/grpc + apt deps + CUDA/ROCm/Vulkan
-# pre-installed, so we just copy gRPC to /usr/local and compile. Used when
-# BUILDER_TARGET=builder-prebuilt (CI when the matrix entry sets
-# builder-base-image).
-# ============================================================================
-FROM ${BUILDER_BASE_IMAGE} AS builder-prebuilt
-
-ARG BUILD_TYPE
-ENV BUILD_TYPE=${BUILD_TYPE}
+FROM ${BASE_IMAGE} AS builder
+ARG CMAKE_FROM_SOURCE=false
+ARG CMAKE_VERSION=3.31.10
+# We can target specific CUDA ARCHITECTURES like --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
 ARG CUDA_DOCKER_ARCH
 ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
 ARG CMAKE_ARGS
 ENV CMAKE_ARGS=${CMAKE_ARGS}
 ARG AMDGPU_TARGETS
 ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
+ARG BACKEND=rerankers
+ARG BUILD_TYPE
+ENV BUILD_TYPE=${BUILD_TYPE}
+ARG CUDA_MAJOR_VERSION
+ARG CUDA_MINOR_VERSION
+ARG SKIP_DRIVERS=false
+ENV CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION}
+ENV CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION}
+ENV DEBIAN_FRONTEND=noninteractive
 ARG TARGETARCH
 ARG TARGETVARIANT
+ARG GO_VERSION=1.25.4
+ARG UBUNTU_VERSION=2404
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        build-essential \
+        ccache git \
+        ca-certificates \
+        make \
+        pkg-config libcurl4-openssl-dev \
+        curl unzip \
+        libssl-dev wget && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+
+# Cuda
+ENV PATH=/usr/local/cuda/bin:${PATH}
+
+# HipBLAS requirements
+ENV PATH=/opt/rocm/bin:${PATH}
+
+
+# Vulkan requirements
+RUN <<EOT bash
+    if [ "${BUILD_TYPE}" = "vulkan" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
+        apt-get update && \
+        apt-get install -y  --no-install-recommends \
+            software-properties-common pciutils wget gpg-agent && \
+        apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
+            libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
+            libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
+            git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
+            ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
+            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
+        if [ "amd64" = "$TARGETARCH" ]; then
+            wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
+            tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
+            rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
+            mkdir -p /opt/vulkan-sdk && \
+            mv 1.4.335.0 /opt/vulkan-sdk/ && \
+            cd /opt/vulkan-sdk/1.4.335.0 && \
+            ./vulkansdk --no-deps --maxjobs \
+                vulkan-loader \
+                vulkan-validationlayers \
+                vulkan-extensionlayer \
+                vulkan-tools \
+                shaderc && \
+            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/ && \
+            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/ && \
+            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/ && \
+            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/ && \
+            rm -rf /opt/vulkan-sdk
+        fi
+        if [ "arm64" = "$TARGETARCH" ]; then
+            mkdir vulkan && cd vulkan && \
+            curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
+            tar -xvf vulkan-sdk.tar.xz && \
+            rm vulkan-sdk.tar.xz && \
+            cd 1.4.335.0 && \
+            cp -rfv aarch64/bin/* /usr/bin/ && \
+            cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
+            cp -rfv aarch64/include/* /usr/include/ && \
+            cp -rfv aarch64/share/* /usr/share/ && \
+            cd ../.. && \
+            rm -rf vulkan
+        fi
+        ldconfig && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+# CuBLAS requirements
+RUN <<EOT bash
+    if ( [ "${BUILD_TYPE}" = "cublas" ] || [ "${BUILD_TYPE}" = "l4t" ] ) && [ "${SKIP_DRIVERS}" = "false" ]; then
+        apt-get update && \
+        apt-get install -y  --no-install-recommends \
+            software-properties-common pciutils
+        if [ "amd64" = "$TARGETARCH" ]; then
+            curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb
+        fi
+        if [ "arm64" = "$TARGETARCH" ]; then
+            if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
+                curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb
+            else
+                curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb
+            fi
+        fi
+        dpkg -i cuda-keyring_1.1-1_all.deb && \
+        rm -f cuda-keyring_1.1-1_all.deb && \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
+        if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "$TARGETARCH" ]; then
+            apt-get install -y --no-install-recommends \
+            libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
+        fi
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+
+# https://github.com/NVIDIA/Isaac-GR00T/issues/343
+RUN <<EOT bash
+    if [ "${BUILD_TYPE}" = "cublas" ] && [ "${TARGETARCH}" = "arm64" ]; then
+        wget https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
+        dpkg -i cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
+        cp /var/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/ && \
+        apt-get update && apt-get -y install cudss cudss-cuda-${CUDA_MAJOR_VERSION} && \
+        wget https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
+        dpkg -i nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
+        cp /var/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/ && \
+        apt-get update && apt-get install -y nvpl
+    fi
+EOT
+
+# If we are building with clblas support, we need the libraries for the builds
+RUN if [ "${BUILD_TYPE}" = "clblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            libclblast-dev && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/* \
+    ; fi
+
+RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            hipblas-dev \
+            rocblas-dev && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/* && \
+        # I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install, which results in local-ai and others not being able
+        # to locate the libraries. We run ldconfig ourselves to work around this packaging deficiency
+        ldconfig && \
+        # Log which GPU architectures have rocBLAS kernel support
+        echo "rocBLAS library data architectures:" && \
+        (ls /opt/rocm*/lib/rocblas/library/Kernels* 2>/dev/null || ls /opt/rocm*/lib64/rocblas/library/Kernels* 2>/dev/null) | grep -oP 'gfx[0-9a-z+-]+' | sort -u || \
+        echo "WARNING: No rocBLAS kernel data found" \
+    ; fi
+
+RUN echo "TARGETARCH: $TARGETARCH"
+
+# We need protoc installed, and the version in 22.04 is too old.  We will create one as part of installing the GRPC build below,
+# but that will also bring in a newer version of absl which stablediffusion cannot compile with.  This version of protoc is only
+# here so that we can generate the grpc code for the stablediffusion build
+RUN <<EOT bash
+    if [ "amd64" = "$TARGETARCH" ]; then
+        curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip && \
+        unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
+        rm protoc.zip
+    fi
+    if [ "arm64" = "$TARGETARCH" ]; then
+        curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip && \
+        unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
+        rm protoc.zip
+    fi
+EOT
+
+# Install CMake (the version in 22.04 is too old)
+RUN <<EOT bash
+    if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
+        curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
+    else
+        apt-get update && \
+        apt-get install -y \
+            cmake && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+COPY --from=grpc /opt/grpc /usr/local

-# The base-grpc-* image installs gRPC to /opt/grpc but doesn't copy it to
-# /usr/local. The variant Dockerfile's from-source path does that too;
-# mirror it here so the compile step can find gRPC at the canonical
-# prefix the Makefile expects.
-RUN cp -a /opt/grpc/. /usr/local/

 COPY . /LocalAI

-RUN --mount=type=bind,source=.docker/llama-cpp-compile.sh,target=/usr/local/sbin/compile.sh \
-    --mount=type=cache,target=/root/.ccache,id=llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
-    bash /usr/local/sbin/compile.sh
+RUN <<'EOT' bash
+set -euxo pipefail

+if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
+  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
+  export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
+  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
+  rm -rf /LocalAI/backend/cpp/llama-cpp-*-build
+fi
+
+if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
+  cd /LocalAI/backend/cpp/llama-cpp
+  make llama-cpp-fallback
+  make llama-cpp-grpc
+  make llama-cpp-rpc-server
+else
+  cd /LocalAI/backend/cpp/llama-cpp
+  make llama-cpp-avx
+  make llama-cpp-avx2
+  make llama-cpp-avx512
+  make llama-cpp-fallback
+  make llama-cpp-grpc
+  make llama-cpp-rpc-server
+fi
+EOT
+
+
+# Copy libraries using a script to handle architecture differences
 RUN make -BC /LocalAI/backend/cpp/llama-cpp package


-# ============================================================================
-# Final stage — copies package output from one of the two builders.
-# BUILDER_TARGET selects which one. BuildKit prunes the unreferenced builder.
-#
-# BuildKit doesn't support variable expansion in `COPY --from=` directly,
-# so we resolve the ARG by aliasing the chosen builder to a fixed stage
-# name via `FROM ${BUILDER_TARGET} AS builder` and then COPY --from=builder.
-# BUILDER_TARGET itself is declared as a global ARG at the top of this
-# file (required for use in FROM), so we just re-import it into this
-# stage's scope before the FROM directive.
-# ============================================================================
-FROM ${BUILDER_TARGET} AS builder
-
 FROM scratch


--- a/backend/Dockerfile.privacy-filter
+++ b/backend/Dockerfile.privacy-filter
@@ -1,109 +0,0 @@
-ARG BASE_IMAGE=ubuntu:24.04
-# BUILDER_BASE_IMAGE defaults to BASE_IMAGE so the Dockerfile parses when no
-# prebuilt base is supplied; the builder-prebuilt stage is only entered when
-# BUILDER_TARGET=builder-prebuilt, so the fallback content is harmless
-# (BuildKit prunes the unreferenced builder).
-ARG BUILDER_BASE_IMAGE=${BASE_IMAGE}
-# BUILDER_TARGET selects which builder stage the scratch image copies from.
-# Declared before any FROM so it is usable in `FROM ${BUILDER_TARGET}`. The
-# backend_build workflow sets it to builder-prebuilt when the matrix entry
-# provides builder-base-image, else builder-fromsource (the local default).
-ARG BUILDER_TARGET=builder-fromsource
-ARG APT_MIRROR=""
-ARG APT_PORTS_MIRROR=""
-
-# privacy-filter: standalone GGML engine for the openai-privacy-filter PII/NER
-# token classifier, wrapped as a LocalAI gRPC backend.
-#
-# Mirrors backend/Dockerfile.llama-cpp: the build toolchain (gRPC + cmake +
-# protoc + conditional CUDA/Vulkan) comes from the shared
-# .docker/install-base-deps.sh (from-source path) or a prebuilt
-# quay.io/go-skynet/ci-cache:base-grpc-* image (CI path) — nothing GPU-specific
-# is hand-rolled here. BUILD_TYPE selects the engine backend in the Makefile:
-# "" = cpu, "cublas" -> -DPF_CUDA=ON, "vulkan" -> -DPF_VULKAN=ON.
-
-# ============================================================================
-# Stage: builder-fromsource — self-contained build. Runs the same install
-# script backend/Dockerfile.base-grpc-builder runs, so this path is
-# bit-equivalent to the prebuilt base. Used when BUILDER_TARGET=builder-fromsource
-# (the default; local `make backends/privacy-filter`).
-# ============================================================================
-FROM ${BASE_IMAGE} AS builder-fromsource
-ARG BUILD_TYPE
-ARG CUDA_MAJOR_VERSION
-ARG CUDA_MINOR_VERSION
-ARG CMAKE_FROM_SOURCE=false
-# CUDA Toolkit 13.x needs CMake 3.31.9+ for correct toolchain/arch detection.
-ARG CMAKE_VERSION=3.31.10
-ARG GRPC_VERSION=v1.65.0
-ARG GRPC_MAKEFLAGS="-j4 -Otarget"
-ARG SKIP_DRIVERS=false
-ARG TARGETARCH
-ARG UBUNTU_VERSION=2404
-ARG APT_MIRROR
-ARG APT_PORTS_MIRROR
-
-ENV BUILD_TYPE=${BUILD_TYPE} \
-    CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
-    CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
-    CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
-    CMAKE_VERSION=${CMAKE_VERSION} \
-    GRPC_VERSION=${GRPC_VERSION} \
-    GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
-    SKIP_DRIVERS=${SKIP_DRIVERS} \
-    TARGETARCH=${TARGETARCH} \
-    UBUNTU_VERSION=${UBUNTU_VERSION} \
-    APT_MIRROR=${APT_MIRROR} \
-    APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
-    DEBIAN_FRONTEND=noninteractive
-# CUDA on PATH (a no-op when CUDA is not installed, e.g. cpu/vulkan builds).
-ENV PATH=/usr/local/cuda/bin:${PATH}
-
-WORKDIR /build
-
-# apt deps + cmake + protoc + gRPC + conditional CUDA/Vulkan, all from the
-# shared script (the source of truth that base-grpc-builder also runs).
-RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
-    --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
-    bash /usr/local/sbin/install-base-deps
-
-# install-base-deps installs gRPC under /opt/grpc; copy it to /usr/local so the
-# backend's find_package(gRPC CONFIG) resolves it at the canonical prefix.
-RUN cp -a /opt/grpc/. /usr/local/
-
-COPY . /LocalAI
-
-RUN --mount=type=cache,target=/root/.ccache,id=privacy-filter-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
-    make -C /LocalAI/backend/cpp/privacy-filter BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
-
-# ============================================================================
-# Stage: builder-prebuilt — FROM a prebuilt
-# quay.io/go-skynet/ci-cache:base-grpc-* image (gRPC at /opt/grpc + apt deps +
-# CUDA/Vulkan already installed). Used in CI when the matrix entry sets
-# builder-base-image.
-# ============================================================================
-FROM ${BUILDER_BASE_IMAGE} AS builder-prebuilt
-ARG BUILD_TYPE
-ARG TARGETARCH
-ENV BUILD_TYPE=${BUILD_TYPE}
-# CUDA on PATH (a no-op for the cpu/vulkan base images).
-ENV PATH=/usr/local/cuda/bin:${PATH}
-
-# Mirror builder-fromsource: the base-grpc image installs gRPC to /opt/grpc but
-# does not copy it to /usr/local.
-RUN cp -a /opt/grpc/. /usr/local/
-
-COPY . /LocalAI
-
-RUN --mount=type=cache,target=/root/.ccache,id=privacy-filter-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
-    make -C /LocalAI/backend/cpp/privacy-filter BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
-
-# ============================================================================
-# Final stage — copy the package output from the selected builder. BuildKit
-# does not expand variables in `COPY --from=`, so alias the chosen builder to a
-# fixed stage name first.
-# ============================================================================
-FROM ${BUILDER_TARGET} AS builder
-
-FROM scratch
-COPY --from=builder /LocalAI/backend/cpp/privacy-filter/package/. ./
--- a/backend/Dockerfile.python
+++ b/backend/Dockerfile.python
@@ -1,6 +1,4 @@
 ARG BASE_IMAGE=ubuntu:24.04
-ARG APT_MIRROR=""
-ARG APT_PORTS_MIRROR=""

 FROM ${BASE_IMAGE} AS builder
 ARG BACKEND=rerankers
@@ -15,12 +13,8 @@ ENV DEBIAN_FRONTEND=noninteractive
 ARG TARGETARCH
 ARG TARGETVARIANT
 ARG UBUNTU_VERSION=2404
-ARG APT_MIRROR
-ARG APT_PORTS_MIRROR

-RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
-    APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
-    apt-get update && \
+RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential \
        ccache \
@@ -126,7 +120,6 @@ RUN <<EOT bash
        apt-get update && \
        apt-get install -y --no-install-recommends \
            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
-            cuda-nvrtc-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
@@ -169,7 +162,6 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
-            hipblaslt-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
@@ -210,13 +202,6 @@ COPY scripts/build/package-gpu-libs.sh /package-gpu-libs.sh
 ARG FROM_SOURCE=""
 ENV FROM_SOURCE=${FROM_SOURCE}

-# Cache-buster for the per-backend `make` step. Most Python backends list
-# unpinned deps (torch, transformers, vllm, ...), so a warm registry cache
-# would otherwise freeze upstream versions indefinitely. CI passes a value
-# that rolls weekly so the install layer is rebuilt at most once per week
-# and picks up newer wheels from PyPI / nightly indexes.
-ARG DEPS_REFRESH=initial
-
 RUN cd /${BACKEND} && PORTABLE_PYTHON=true make

 # Package GPU libraries into the backend's lib directory
@@ -231,4 +216,4 @@ RUN if [ -f "/${BACKEND}/package.sh" ]; then \

 FROM scratch
 ARG BACKEND=rerankers
-COPY --from=builder /${BACKEND}/ /
+COPY --from=builder /${BACKEND}/ /
--- a/backend/Dockerfile.rust
+++ b/backend/Dockerfile.rust
@@ -1,18 +1,12 @@
 ARG BASE_IMAGE=ubuntu:24.04
-ARG APT_MIRROR=""
-ARG APT_PORTS_MIRROR=""

 FROM ${BASE_IMAGE} AS builder
 ARG BACKEND=kokoros
 ENV DEBIAN_FRONTEND=noninteractive
 ARG TARGETARCH
 ARG TARGETVARIANT
-ARG APT_MIRROR
-ARG APT_PORTS_MIRROR

-RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
-    APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
-    apt-get update && \
+RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential \
        git ccache \
--- a/backend/Dockerfile.turboquant
+++ b/backend/Dockerfile.turboquant
@@ -1,158 +1,288 @@
 ARG BASE_IMAGE=ubuntu:24.04
-# BUILDER_BASE_IMAGE defaults to BASE_IMAGE so the Dockerfile parses even
-# when no prebuilt base is supplied. The builder-prebuilt stage is only
-# entered when BUILDER_TARGET=builder-prebuilt, so a "wrong" fallback
-# content here is harmless — BuildKit prunes the unreferenced builder.
-ARG BUILDER_BASE_IMAGE=${BASE_IMAGE}
-# BUILDER_TARGET selects which builder stage the final scratch image copies
-# package output from. Declared at global scope (before any FROM) so it's
-# usable in `FROM ${BUILDER_TARGET}` below. Default keeps local
-# `make backends/turboquant` on the from-source path.
-ARG BUILDER_TARGET=builder-fromsource
-ARG APT_MIRROR=""
-ARG APT_PORTS_MIRROR=""
+ARG GRPC_BASE_IMAGE=${BASE_IMAGE}


-# ============================================================================
-# Stage: builder-fromsource — self-contained build path.
-# Runs .docker/install-base-deps.sh (apt deps + cmake + protoc + gRPC +
-# conditional CUDA/ROCm/Vulkan), copies /opt/grpc to /usr/local, then
-# compiles the variant. Used when BUILDER_TARGET=builder-fromsource (the
-# default; local `make backends/turboquant`).
-#
-# The install script is the same one that backend/Dockerfile.base-grpc-builder
-# runs, so the result is bit-equivalent to the prebuilt-base path
-# (builder-prebuilt below).
-# ============================================================================
-FROM ${BASE_IMAGE} AS builder-fromsource
-ARG BUILD_TYPE
-ARG CUDA_MAJOR_VERSION
-ARG CUDA_MINOR_VERSION
+# The grpc target does one thing: it builds and installs GRPC.  This is in its own layer so that it can be effectively cached by CI.
+# You probably don't need to change anything here, and if you do, make sure that CI is adjusted so that the cache continues to work.
+FROM ${GRPC_BASE_IMAGE} AS grpc
+
+# This is a bit of a hack, but it's required in order to be able to effectively cache this layer in CI
+ARG GRPC_MAKEFLAGS="-j4 -Otarget"
+ARG GRPC_VERSION=v1.65.0
 ARG CMAKE_FROM_SOURCE=false
 # CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
 ARG CMAKE_VERSION=3.31.10
-ARG GRPC_VERSION=v1.65.0
-ARG GRPC_MAKEFLAGS="-j4 -Otarget"
+
+ENV MAKEFLAGS=${GRPC_MAKEFLAGS}
+
+WORKDIR /build
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        ca-certificates \
+        build-essential curl libssl-dev \
+        git wget && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+
+# Install CMake (the version in 22.04 is too old)
+RUN <<EOT bash
+    if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
+        curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
+    else
+        apt-get update && \
+        apt-get install -y \
+            cmake && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+# We install GRPC to a different prefix here so that we can copy in only the build artifacts later.
+# This saves several hundred MB on the final docker image size vs copying in the entire GRPC source tree
+# and running make install in the target container.
+RUN git clone --recurse-submodules --jobs 4 -b ${GRPC_VERSION} --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
+    mkdir -p /build/grpc/cmake/build && \
+    cd /build/grpc/cmake/build && \
+    sed -i "216i\  TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt" && \
+    cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../.. && \
+    make && \
+    make install && \
+    rm -rf /build
+
+FROM ${BASE_IMAGE} AS builder
+ARG CMAKE_FROM_SOURCE=false
+ARG CMAKE_VERSION=3.31.10
+# We can target specific CUDA ARCHITECTURES like --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
+ARG CUDA_DOCKER_ARCH
+ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
+ARG CMAKE_ARGS
+ENV CMAKE_ARGS=${CMAKE_ARGS}
+ARG BACKEND=rerankers
+ARG BUILD_TYPE
+ENV BUILD_TYPE=${BUILD_TYPE}
+ARG CUDA_MAJOR_VERSION
+ARG CUDA_MINOR_VERSION
 ARG SKIP_DRIVERS=false
+ENV CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION}
+ENV CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION}
+ENV DEBIAN_FRONTEND=noninteractive
 ARG TARGETARCH
 ARG TARGETVARIANT
 ARG GO_VERSION=1.25.4
 ARG UBUNTU_VERSION=2404
-ARG APT_MIRROR
-ARG APT_PORTS_MIRROR
-ARG AMDGPU_TARGETS=""
-ARG BACKEND=rerankers
-# CUDA target archs, e.g. --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
-ARG CUDA_DOCKER_ARCH
-ARG CMAKE_ARGS

-ENV BUILD_TYPE=${BUILD_TYPE} \
-    CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
-    CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
-    CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
-    CMAKE_VERSION=${CMAKE_VERSION} \
-    GRPC_VERSION=${GRPC_VERSION} \
-    GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
-    SKIP_DRIVERS=${SKIP_DRIVERS} \
-    TARGETARCH=${TARGETARCH} \
-    UBUNTU_VERSION=${UBUNTU_VERSION} \
-    APT_MIRROR=${APT_MIRROR} \
-    APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
-    AMDGPU_TARGETS=${AMDGPU_TARGETS} \
-    CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH} \
-    CMAKE_ARGS=${CMAKE_ARGS} \
-    DEBIAN_FRONTEND=noninteractive
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        build-essential \
+        ccache git \
+        ca-certificates \
+        make \
+        pkg-config libcurl4-openssl-dev \
+        curl unzip \
+        libssl-dev wget && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*

-# CUDA on PATH (no-op when CUDA isn't installed)
+# Cuda
 ENV PATH=/usr/local/cuda/bin:${PATH}
-# HipBLAS / ROCm on PATH (no-op when ROCm isn't installed)
+
+# HipBLAS requirements
 ENV PATH=/opt/rocm/bin:${PATH}

-WORKDIR /build

-# Install everything via the shared script — the same one that
-# backend/Dockerfile.base-grpc-builder runs, so the prebuilt CI base and
-# this from-source path are bit-equivalent.
-RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
-    --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
-    bash /usr/local/sbin/install-base-deps
+# Vulkan requirements
+RUN <<EOT bash
+    if [ "${BUILD_TYPE}" = "vulkan" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
+        apt-get update && \
+        apt-get install -y  --no-install-recommends \
+            software-properties-common pciutils wget gpg-agent && \
+        apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
+            libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
+            libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
+            git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
+            ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
+            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
+        if [ "amd64" = "$TARGETARCH" ]; then
+            wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
+            tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
+            rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
+            mkdir -p /opt/vulkan-sdk && \
+            mv 1.4.335.0 /opt/vulkan-sdk/ && \
+            cd /opt/vulkan-sdk/1.4.335.0 && \
+            ./vulkansdk --no-deps --maxjobs \
+                vulkan-loader \
+                vulkan-validationlayers \
+                vulkan-extensionlayer \
+                vulkan-tools \
+                shaderc && \
+            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/ && \
+            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/ && \
+            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/ && \
+            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/ && \
+            rm -rf /opt/vulkan-sdk
+        fi
+        if [ "arm64" = "$TARGETARCH" ]; then
+            mkdir vulkan && cd vulkan && \
+            curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
+            tar -xvf vulkan-sdk.tar.xz && \
+            rm vulkan-sdk.tar.xz && \
+            cd 1.4.335.0 && \
+            cp -rfv aarch64/bin/* /usr/bin/ && \
+            cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
+            cp -rfv aarch64/include/* /usr/include/ && \
+            cp -rfv aarch64/share/* /usr/share/ && \
+            cd ../.. && \
+            rm -rf vulkan
+        fi
+        ldconfig && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+# CuBLAS requirements
+RUN <<EOT bash
+    if ( [ "${BUILD_TYPE}" = "cublas" ] || [ "${BUILD_TYPE}" = "l4t" ] ) && [ "${SKIP_DRIVERS}" = "false" ]; then
+        apt-get update && \
+        apt-get install -y  --no-install-recommends \
+            software-properties-common pciutils
+        if [ "amd64" = "$TARGETARCH" ]; then
+            curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb
+        fi
+        if [ "arm64" = "$TARGETARCH" ]; then
+            if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
+                curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb
+            else
+                curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb
+            fi
+        fi
+        dpkg -i cuda-keyring_1.1-1_all.deb && \
+        rm -f cuda-keyring_1.1-1_all.deb && \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
+        if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "$TARGETARCH" ]; then
+            apt-get install -y --no-install-recommends \
+            libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
+        fi
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+
+# https://github.com/NVIDIA/Isaac-GR00T/issues/343
+RUN <<EOT bash
+    if [ "${BUILD_TYPE}" = "cublas" ] && [ "${TARGETARCH}" = "arm64" ]; then
+        wget https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
+        dpkg -i cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
+        cp /var/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/ && \
+        apt-get update && apt-get -y install cudss cudss-cuda-${CUDA_MAJOR_VERSION} && \
+        wget https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
+        dpkg -i nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
+        cp /var/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/ && \
+        apt-get update && apt-get install -y nvpl
+    fi
+EOT
+
+# If we are building with clblas support, we need the libraries for the builds
+RUN if [ "${BUILD_TYPE}" = "clblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            libclblast-dev && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/* \
+    ; fi
+
+RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
+        apt-get update && \
+        apt-get install -y --no-install-recommends \
+            hipblas-dev \
+            rocblas-dev && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/* && \
+        # For reasons unknown, the ROCm lib packages don't trigger ldconfig after they install, so local-ai and others are unable
+        # to locate the libraries. We run ldconfig ourselves to work around this packaging deficiency
+        ldconfig && \
+        # Log which GPU architectures have rocBLAS kernel support
+        echo "rocBLAS library data architectures:" && \
+        (ls /opt/rocm*/lib/rocblas/library/Kernels* 2>/dev/null || ls /opt/rocm*/lib64/rocblas/library/Kernels* 2>/dev/null) | grep -oP 'gfx[0-9a-z+-]+' | sort -u || \
+        echo "WARNING: No rocBLAS kernel data found" \
+    ; fi
+
+RUN echo "TARGETARCH: $TARGETARCH"
+
+# We need protoc installed, and the version in 22.04 is too old.  We will create one as part of installing the gRPC build below,
+# but that will also bring in a newer version of absl which stablediffusion cannot compile with.  This version of protoc is only
+# here so that we can generate the gRPC code for the stablediffusion build
+RUN <<EOT bash
+    if [ "amd64" = "$TARGETARCH" ]; then
+        curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip && \
+        unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
+        rm protoc.zip
+    fi
+    if [ "arm64" = "$TARGETARCH" ]; then
+        curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip && \
+        unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
+        rm protoc.zip
+    fi
+EOT
+
+# Install CMake (the version in 22.04 is too old)
+RUN <<EOT bash
+    if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
+        curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
+    else
+        apt-get update && \
+        apt-get install -y \
+            cmake && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+    fi
+EOT
+
+COPY --from=grpc /opt/grpc /usr/local

-# Mirror builder-prebuilt: copy gRPC from /opt/grpc to /usr/local so
-# CMake's find_package finds it at the canonical prefix the Makefile expects.
-RUN cp -a /opt/grpc/. /usr/local/

 COPY . /LocalAI

-# BuildKit cache mount for ccache. See Dockerfile.llama-cpp (commit 9228e5b4)
-# for rationale. turboquant is a llama.cpp fork that reuses
-# backend/cpp/llama-cpp source via a thin wrapper Makefile, so MOST TUs
-# are content-identical to the upstream llama-cpp build. Sharing a cache
-# id with llama-cpp could give cross-fork hits — but for now keep them
-# separate so a regression in one doesn't poison the other. Revisit
-# sharing after measuring the actual hit rate.
-#
-# The compile body is shared with builder-prebuilt via .docker/turboquant-compile.sh.
-RUN --mount=type=bind,source=.docker/turboquant-compile.sh,target=/usr/local/sbin/compile.sh \
-    --mount=type=cache,target=/root/.ccache,id=turboquant-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
-    bash /usr/local/sbin/compile.sh
+RUN <<'EOT' bash
+set -euxo pipefail
+
+if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
+  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
+  export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
+  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
+  rm -rf /LocalAI/backend/cpp/turboquant-*-build
+fi
+
+cd /LocalAI/backend/cpp/turboquant
+
+if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
+  make turboquant-fallback
+  make turboquant-grpc
+  make turboquant-rpc-server
+else
+  make turboquant-avx
+  make turboquant-avx2
+  make turboquant-avx512
+  make turboquant-fallback
+  make turboquant-grpc
+  make turboquant-rpc-server
+fi
+EOT


 # Copy libraries using a script to handle architecture differences
 RUN make -BC /LocalAI/backend/cpp/turboquant package


-# ============================================================================
-# Stage: builder-prebuilt — uses the pre-built base from
-# quay.io/go-skynet/ci-cache:base-grpc-* (built by .github/workflows/base-images.yml).
-# That image already has gRPC at /opt/grpc + apt deps + CUDA/ROCm/Vulkan
-# pre-installed, so we just copy gRPC to /usr/local and compile. Used when
-# BUILDER_TARGET=builder-prebuilt (CI when the matrix entry sets
-# builder-base-image).
-# ============================================================================
-FROM ${BUILDER_BASE_IMAGE} AS builder-prebuilt
-
-ARG BUILD_TYPE
-ENV BUILD_TYPE=${BUILD_TYPE}
-ARG CUDA_DOCKER_ARCH
-ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
-ARG CMAKE_ARGS
-ENV CMAKE_ARGS=${CMAKE_ARGS}
-# AMDGPU_TARGETS must be forwarded into the env here too — backend/cpp/llama-cpp/Makefile
-# (which the turboquant Makefile reuses via a sibling build dir) errors out when the var
-# is empty on a hipblas build, and the prebuilt path is what CI exercises most of the
-# time. The builder-fromsource stage above already does this; mirror it here.
-ARG AMDGPU_TARGETS
-ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
-ARG TARGETARCH
-ARG TARGETVARIANT
-
-# The base-grpc-* image installs gRPC to /opt/grpc but doesn't copy it to
-# /usr/local. Mirror what the from-source path does so the compile step
-# can find gRPC at the canonical prefix the Makefile expects.
-RUN cp -a /opt/grpc/. /usr/local/
-
-COPY . /LocalAI
-
-RUN --mount=type=bind,source=.docker/turboquant-compile.sh,target=/usr/local/sbin/compile.sh \
-    --mount=type=cache,target=/root/.ccache,id=turboquant-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
-    bash /usr/local/sbin/compile.sh
-
-RUN make -BC /LocalAI/backend/cpp/turboquant package
-
-
-# ============================================================================
-# Final stage — copies package output from one of the two builders.
-# BUILDER_TARGET selects which one. BuildKit prunes the unreferenced builder.
-#
-# BuildKit doesn't support variable expansion in `COPY --from=` directly,
-# so we resolve the ARG by aliasing the chosen builder to a fixed stage
-# name via `FROM ${BUILDER_TARGET} AS builder` and then COPY --from=builder.
-# BUILDER_TARGET itself is declared as a global ARG at the top of this
-# file (required for use in FROM), so we just re-import it into this
-# stage's scope before the FROM directive.
-# ============================================================================
-FROM ${BUILDER_TARGET} AS builder
-
 FROM scratch


--- a/backend/backend.proto
+++ b/backend/backend.proto
@@ -24,12 +24,6 @@ service Backend {
  rpc TokenizeString(PredictOptions) returns (TokenizationResponse) {}
  rpc Status(HealthMessage) returns (StatusResponse) {}
  rpc Detect(DetectOptions) returns (DetectResponse) {}
-  rpc Depth(DepthRequest) returns (DepthResponse) {}
-  rpc FaceVerify(FaceVerifyRequest) returns (FaceVerifyResponse) {}
-  rpc FaceAnalyze(FaceAnalyzeRequest) returns (FaceAnalyzeResponse) {}
-  rpc VoiceVerify(VoiceVerifyRequest) returns (VoiceVerifyResponse) {}
-  rpc VoiceAnalyze(VoiceAnalyzeRequest) returns (VoiceAnalyzeResponse) {}
-  rpc VoiceEmbed(VoiceEmbedRequest) returns (VoiceEmbedResponse) {}

  rpc StoresSet(StoresSetOptions) returns (Result) {}
  rpc StoresDelete(StoresDeleteOptions) returns (Result) {}
@@ -38,39 +32,13 @@ service Backend {

  rpc Rerank(RerankRequest) returns (RerankResult) {}

-  // TokenClassify runs a token-classification (NER) model on the
-  // supplied text and returns each detected entity span. Used by the
-  // PII redactor's optional NER tier — the regex tier still handles
-  // formatted hits cheaply, while this catches names, locations, and
-  // other unformatted PII that regex misses.
-  rpc TokenClassify(TokenClassifyRequest) returns (TokenClassifyResponse) {}
-
-  // Score evaluates the model's joint log-probability of each
-  // supplied candidate continuation given a shared prompt. The
-  // prompt's KV cache is computed once and reused across candidates.
-  // Used for routing-policy multi-label classification, reranking,
-  // calibrated confidence, and reward-model scoring — any task where
-  // the consumer wants the model's confidence in a pre-specified
-  // continuation rather than a generated one.
-  rpc Score(ScoreRequest) returns (ScoreResponse) {}
-
  rpc GetMetrics(MetricsRequest) returns (MetricsResponse);

  rpc VAD(VADRequest) returns (VADResponse) {}

-  rpc Diarize(DiarizeRequest) returns (DiarizeResponse) {}
-
  rpc AudioEncode(AudioEncodeRequest) returns (AudioEncodeResult) {}
  rpc AudioDecode(AudioDecodeRequest) returns (AudioDecodeResult) {}

-  rpc AudioTransform(AudioTransformRequest) returns (AudioTransformResult) {}
-  rpc AudioTransformStream(stream AudioTransformFrameRequest) returns (stream AudioTransformFrameResponse) {}
-  // AudioToAudioStream is the bidirectional any-to-any S2S RPC. Backends
-  // that load a speech-to-speech model consume input audio frames and emit
-  // interleaved audio + transcript + tool-call deltas as typed events.
-  // Backends without S2S support return UNIMPLEMENTED.
-  rpc AudioToAudioStream(stream AudioToAudioRequest) returns (stream AudioToAudioResponse) {}
-
  rpc ModelMetadata(ModelOptions) returns (ModelMetadataResponse) {}

  // Fine-tuning RPCs
@@ -85,23 +53,6 @@ service Backend {
  rpc QuantizationProgress(QuantizationProgressRequest) returns (stream QuantizationProgressUpdate) {}
  rpc StopQuantization(QuantizationStopRequest) returns (Result) {}

-  // Forward proxies a raw HTTP request to an upstream provider. The
-  // cloud-proxy backend implements this for passthrough-mode model
-  // configs: the client wire format is preserved end-to-end (no
-  // translation through internal proto), which means new provider
-  // fields work the day they ship. Translation-mode proxies use the
-  // standard Predict/PredictStream RPCs instead. Backends that don't
-  // support this return UNIMPLEMENTED.
-  //
-  // The request is bidirectionally streamed so large bodies can flow
-  // without buffering. In practice the first ForwardRequest carries
-  // path, method, headers, and the initial body chunk; subsequent
-  // messages append body chunks. The first ForwardReply carries the
-  // upstream status and response headers; subsequent messages stream
-  // body chunks (SSE frames or chunked transfer). Cancellation of the
-  // gRPC context closes the upstream connection.
-  rpc Forward(stream ForwardRequest) returns (stream ForwardReply) {}
-
 }

 // Define the empty request
@@ -115,76 +66,6 @@ message MetricsResponse {
  int32 prompt_tokens_processed = 5;
 }

-// TokenClassifyRequest carries the text to classify plus an optional
-// score threshold. The transformers backend interprets threshold as
-// the minimum confidence to include in the response; 0 = include all.
-message TokenClassifyRequest {
-  string text = 1;
-  float threshold = 2;
-}
-
-// TokenClassifyEntity is one detected entity span. Byte offsets are
-// into the original UTF-8 text — start..end is a half-open range that
-// addresses the substring corresponding to entity_group.
-//
-// entity_group follows HuggingFace's aggregated-tag convention (e.g.
-// "PER", "LOC", "ORG", or a PII-specific label like "EMAIL" /
-// "SSN" depending on the model). The redactor's per-pattern action
-// map keys off this string.
-message TokenClassifyEntity {
-  string entity_group = 1;
-  int32 start = 2;
-  int32 end = 3;
-  float score = 4;
-  string text = 5;
-}
-
-message TokenClassifyResponse {
-  repeated TokenClassifyEntity entities = 1;
-}
-
-// ScoreRequest carries one shared prompt and one or more continuations
-// to score against it. The backend tokenises the prompt once and reuses
-// the resulting KV cache across all candidates in this request.
-message ScoreRequest {
-  string prompt = 1;
-  repeated string candidates = 2;
-  // Return per-token logprobs for each candidate when true. Default
-  // false to keep the wire response small; the joint log_prob field
-  // covers the common ranking case.
-  bool include_token_logprobs = 3;
-  // When true, the response also populates length_normalized_log_prob
-  // (joint log-prob divided by candidate token count). Useful when
-  // candidates differ in length and the consumer wants a per-token
-  // measure comparable across them (PMI-style scoring).
-  bool length_normalize = 4;
-}
-
-// CandidateScore is one row in the ScoreResponse, matching by index
-// the candidate in ScoreRequest.candidates.
-message CandidateScore {
-  // Sum of log P(token_i | prompt, candidate_token_<i) across the
-  // candidate's tokens. The primary ranking signal.
-  double log_prob = 1;
-  // log_prob / num_tokens — populated when length_normalize=true on
-  // the request.
-  double length_normalized_log_prob = 2;
-  // Per-token detail — populated when include_token_logprobs=true.
-  repeated TokenLogProb tokens = 3;
-  // Number of tokens the backend tokenised this candidate into, after
-  // any backend-specific normalisation (e.g. leading-space handling).
-  int32 num_tokens = 4;
-}
-
-message TokenLogProb {
-  string token = 1;
-  double log_prob = 2;
-}
-
-message ScoreResponse {
-  repeated CandidateScore candidates = 1;
-}
-
 message RerankRequest {
  string query = 1;
  repeated string documents = 2;
@@ -424,30 +305,6 @@ message ModelOptions {
  bool Reranking = 71;

  repeated string Overrides = 72;
-
-  // EngineArgs carries a JSON-encoded map of backend-native engine arguments
-  // applied verbatim to the backend's engine constructor (e.g. vLLM AsyncEngineArgs).
-  // Unknown keys produce an error at LoadModel time.
-  string EngineArgs = 73;
-
-  // Proxy carries the cloud-proxy backend's per-model configuration.
-  // Empty for non-proxy backends.
-  ProxyOptions Proxy = 74;
-}
-
-// ProxyOptions configures the cloud-proxy backend. UpstreamURL and
-// Mode are always meaningful; Provider only matters in translate mode.
-// The two api_key_* fields are mutually exclusive and resolved by the
-// backend at LoadModel — core forwards the references rather than the
-// plaintext key.
-message ProxyOptions {
-  string upstream_url = 1;
-  string mode = 2;
-  string provider = 3;
-  string api_key_env = 4;
-  string api_key_file = 5;
-  string upstream_model = 6;
-  int32 request_timeout_seconds = 7;
 }

 message Result {
@@ -483,12 +340,6 @@ message TranscriptStreamResponse {
  TranscriptResult final_result = 2;
 }

-message TranscriptWord {
-  int64 start = 1;
-  int64 end = 2;
-  string text = 3;
-}
-
 message TranscriptSegment {
  int32 id = 1;
  int64 start = 2;
@@ -496,7 +347,6 @@ message TranscriptSegment {
  string text = 4;
  repeated int32 tokens = 5;
  string speaker = 6;
-  repeated TranscriptWord words = 7;
 }

 message GenerateImageRequest {
@@ -538,15 +388,6 @@ message TTSRequest {
  string dst = 3;
  string voice = 4;
  optional string language = 5;
-  // instructions is a free-form, per-request style/voice description (maps to
-  // the OpenAI `instructions` field). Backends that support expressive synthesis
-  // (e.g. Qwen3-TTS CustomVoice/VoiceDesign) prefer this over the static YAML
-  // option when set; backends that don't simply ignore it.
-  optional string instructions = 6;
-  // params carries optional, backend-specific per-request generation parameters
-  // (e.g. Chatterbox exaggeration/cfg_weight/temperature). Values are strings and
-  // coerced by the backend; unset leaves the backend's configured defaults.
-  map<string, string> params = 7;
 }

 message VADRequest {
@@ -562,43 +403,6 @@ message VADResponse {
  repeated VADSegment segments = 1;
 }

-// --- Speaker diarization messages ---
-//
-// Pure speaker diarization: "who spoke when". Returns time-stamped segments
-// labelled with cluster IDs (the same string for the same speaker across
-// segments). Some backends (e.g. vibevoice.cpp) produce diarization as a
-// by-product of ASR and may also fill in `text` per segment; backends with a
-// dedicated diarization pipeline (e.g. sherpa-onnx pyannote) leave `text`
-// empty and emit only the segmentation.
-
-message DiarizeRequest {
-  string dst = 1;                      // path to audio file (HTTP layer materialises uploads to a temp file)
-  uint32 threads = 2;
-  string language = 3;                 // optional; only meaningful for transcription-bundling backends
-  int32  num_speakers = 4;             // exact speaker count if known (>0 forces); 0 = auto
-  int32  min_speakers = 5;             // hint when auto-detecting; 0 = unset
-  int32  max_speakers = 6;             // hint when auto-detecting; 0 = unset
-  float  clustering_threshold = 7;     // distance threshold when num_speakers unknown; 0 = backend default
-  float  min_duration_on = 8;          // discard segments shorter than this (seconds); 0 = backend default
-  float  min_duration_off = 9;         // merge gaps shorter than this (seconds); 0 = backend default
-  bool   include_text = 10;            // when the backend can emit per-segment transcript for free, ask it to populate `text`
-}
-
-message DiarizeSegment {
-  int32  id = 1;
-  float  start = 2;                    // seconds
-  float  end = 3;                      // seconds
-  string speaker = 4;                  // backend-emitted speaker label (e.g. "0", "SPEAKER_00")
-  string text = 5;                     // optional per-segment transcript (empty unless include_text and supported)
-}
-
-message DiarizeResponse {
-  repeated DiarizeSegment segments = 1;
-  int32  num_speakers = 2;             // count of distinct speaker labels in `segments`
-  float  duration = 3;                 // total audio duration in seconds (0 if unknown)
-  string language = 4;                 // optional, when the backend bundles transcription
-}
-
 message SoundGenerationRequest {
  string text = 1;
  string model = 2;
@@ -671,141 +475,6 @@ message DetectResponse {
  repeated Detection Detections = 1;
 }

-// --- Depth estimation messages (Depth Anything 3) ---
-
-message DepthRequest {
-  string src = 1;                  // input image (filesystem path or base64-encoded payload)
-  string dst = 2;                  // optional output directory for exports (glb/colmap)
-  bool include_depth = 3;          // return the per-pixel metric depth map
-  bool include_confidence = 4;     // return the per-pixel confidence map (DualDPT)
-  bool include_pose = 5;           // return camera extrinsics/intrinsics (DualDPT)
-  bool include_sky = 6;            // return the per-pixel sky map (mono models)
-  bool include_points = 7;         // back-project to a 3D point cloud (DualDPT)
-  float points_conf_thresh = 8;    // keep points with confidence >= this threshold
-  repeated string exports = 9;     // requested exports: "glb", "colmap"
-}
-
-message DepthResponse {
-  int32 width = 1;                 // processed depth-map width
-  int32 height = 2;                // processed depth-map height
-  repeated float depth = 3;        // width*height row-major metric depth
-  repeated float confidence = 4;   // width*height row-major confidence (DualDPT)
-  repeated float sky = 5;          // width*height row-major sky map (mono)
-  repeated float extrinsics = 6;   // 12 floats, 3x4 row-major (world-to-camera)
-  repeated float intrinsics = 7;   // 9 floats, 3x3 row-major
-  int32 num_points = 8;            // number of 3D points
-  repeated float points = 9;       // num_points*3 xyz, world space
-  bytes point_colors = 10;         // num_points*3 uint8 rgb
-  repeated string export_paths = 11; // paths written for the requested exports
-  bool is_metric = 12;             // depth is in metric units
-}
-
-// --- Face recognition messages ---
-
-message FacialArea {
-  float x = 1;
-  float y = 2;
-  float w = 3;
-  float h = 4;
-}
-
-message FaceVerifyRequest {
-  string img1 = 1;              // base64-encoded image
-  string img2 = 2;              // base64-encoded image
-  float  threshold = 3;         // cosine-distance threshold; 0 = use backend default
-  bool   anti_spoofing = 4;     // run MiniFASNet liveness on each image; failed liveness forces verified=false
-}
-
-message FaceVerifyResponse {
-  bool       verified = 1;
-  float      distance = 2;      // 1 - cosine_similarity
-  float      threshold = 3;
-  float      confidence = 4;    // 0-100
-  string     model = 5;         // e.g. "buffalo_l"
-  FacialArea img1_area = 6;
-  FacialArea img2_area = 7;
-  float      processing_time_ms = 8;
-  bool       img1_is_real = 9;          // anti-spoofing result when enabled
-  float      img1_antispoof_score = 10;
-  bool       img2_is_real = 11;
-  float      img2_antispoof_score = 12;
-}
-
-message FaceAnalyzeRequest {
-  string          img = 1;          // base64-encoded image
-  repeated string actions = 2;      // subset of ["age","gender","emotion","race"]; empty = all-supported
-  bool            anti_spoofing = 3;
-}
-
-message FaceAnalysis {
-  FacialArea         region = 1;
-  float              face_confidence = 2;
-  float              age = 3;
-  string             dominant_gender = 4;   // "Man" | "Woman"
-  map<string, float> gender = 5;
-  string             dominant_emotion = 6;  // reserved; empty in MVP
-  map<string, float> emotion = 7;
-  string             dominant_race = 8;     // not populated
-  map<string, float> race = 9;
-  bool               is_real = 10;          // anti-spoofing result when enabled
-  float              antispoof_score = 11;
-}
-
-message FaceAnalyzeResponse {
-  repeated FaceAnalysis faces = 1;
-}
-
-// --- Voice (speaker) recognition messages ---
-//
-// Analogous to the Face* messages above, but for speaker biometrics.
-// Audio fields accept a filesystem path (same convention as
-// TranscriptRequest.dst). The HTTP layer materialises base64 / URL /
-// data-URI inputs to a temp file before calling the gRPC backend.
-
-message VoiceVerifyRequest {
-  string audio1 = 1;            // path to first audio clip
-  string audio2 = 2;            // path to second audio clip
-  float  threshold = 3;         // cosine-distance threshold; 0 = use backend default
-  bool   anti_spoofing = 4;     // reserved for future AASIST bolt-on
-}
-
-message VoiceVerifyResponse {
-  bool   verified = 1;
-  float  distance = 2;          // 1 - cosine_similarity
-  float  threshold = 3;
-  float  confidence = 4;        // 0-100
-  string model = 5;             // e.g. "speechbrain/spkrec-ecapa-voxceleb"
-  float  processing_time_ms = 6;
-}
-
-message VoiceAnalyzeRequest {
-  string          audio = 1;        // path to audio clip
-  repeated string actions = 2;      // subset of ["age","gender","emotion"]; empty = all-supported
-}
-
-message VoiceAnalysis {
-  float              start = 1;          // segment start time in seconds (0 if single-utterance)
-  float              end = 2;            // segment end time in seconds
-  float              age = 3;
-  string             dominant_gender = 4;
-  map<string, float> gender = 5;
-  string             dominant_emotion = 6;
-  map<string, float> emotion = 7;
-}
-
-message VoiceAnalyzeResponse {
-  repeated VoiceAnalysis segments = 1;
-}
-
-message VoiceEmbedRequest {
-  string audio = 1;              // path to audio clip
-}
-
-message VoiceEmbedResponse {
-  repeated float embedding = 1;
-  string         model = 2;
-}
-
 message ToolFormatMarkers {
  string format_type = 1;           // "json_native", "tag_with_json", "tag_with_tagged"

@@ -884,143 +553,6 @@ message AudioDecodeResult {
  int32 samples_per_frame = 3;
 }

-// Generic audio transform: an audio-in, audio-out operation, optionally
-// conditioned on a second reference signal. Concrete transforms include
-// AEC + noise suppression + dereverberation (LocalVQE), voice conversion
-// (reference = target speaker), pitch shifting, etc.
-message AudioTransformRequest {
-  string audio_path = 1;             // required, primary input file path
-  string reference_path = 2;         // optional auxiliary; empty => zero-fill
-  string dst = 3;                    // required, output file path
-  map<string, string> params = 4;    // backend-specific tuning
-}
-
-message AudioTransformResult {
-  string dst = 1;
-  int32  sample_rate = 2;
-  int32  samples = 3;
-  bool   reference_provided = 4;
-}
-
-// Bidirectional streaming audio transform. The first message MUST carry a
-// Config; subsequent messages carry Frames. A second Config mid-stream
-// resets streaming state before the next frame.
-message AudioTransformFrameRequest {
-  oneof payload {
-    AudioTransformStreamConfig config = 1;
-    AudioTransformFrame        frame  = 2;
-  }
-}
-
-message AudioTransformStreamConfig {
-  enum SampleFormat {
-    F32_LE = 0;
-    S16_LE = 1;
-  }
-  SampleFormat sample_format = 1;
-  int32 sample_rate = 2;             // 0 => backend default
-  int32 frame_samples = 3;           // 0 => backend default
-  map<string, string> params = 4;
-  bool reset = 5;                    // reset streaming state before next frame
-}
-
-message AudioTransformFrame {
-  bytes audio_pcm = 1;               // frame_samples samples in stream's format
-  bytes reference_pcm = 2;           // empty => zero-fill (silent reference)
-}
-
-message AudioTransformFrameResponse {
-  bytes pcm = 1;
-  int64 frame_index = 2;
-}
-
-// === AudioToAudioStream messages =========================================
-//
-// Bidirectional stream between the LocalAI core and an any-to-any audio
-// model. The client opens the stream with a Config payload, then alternates
-// Frame (input audio) and Control (turn boundaries, function-call results,
-// session updates) payloads. The server streams back typed events: audio
-// frames carry PCM in `pcm`; transcript / tool-call deltas carry JSON in
-// `meta`; the stream ends with a `response.done` (success) or `error` event.
-
-message AudioToAudioRequest {
-  oneof payload {
-    AudioToAudioConfig  config  = 1;
-    AudioToAudioFrame   frame   = 2;
-    AudioToAudioControl control = 3;
-  }
-}
-
-message AudioToAudioConfig {
-  // PCM format for client→server audio. 0 => backend default
-  // (16 kHz for the LFM2-Audio Conformer encoder).
-  int32 input_sample_rate = 1;
-  // Preferred server→client audio rate. 0 => backend default
-  // (24 kHz for the LFM2-Audio vocoder).
-  int32 output_sample_rate = 2;
-  // Optional system prompt override. Empty => backend chooses based on
-  // mode (e.g. "Respond with interleaved text and audio.").
-  string system_prompt = 3;
-  // Optional baked-voice id. Models that only ship a fixed set of
-  // voices (e.g. LFM2-Audio: us_male/us_female/uk_male/uk_female) match
-  // this against their voice table; an empty string keeps the default.
-  string voice = 4;
-  // JSON-encoded array of tool definitions in OpenAI Chat Completions
-  // format. Empty => no tools.
-  string tools = 5;
-  // Free-form sampling / decoding parameters (temperature, top_k,
-  // max_new_tokens, audio_top_k, etc).
-  map<string, string> params = 6;
-  // True => reset any session-scoped state before processing further
-  // frames on this stream. The first Config implicitly resets.
-  bool reset = 7;
-}
-
-message AudioToAudioFrame {
-  // Raw PCM s16le mono at config.input_sample_rate. Empty pcm + end_of_input
-  // is a valid "user finished speaking" marker without trailing audio.
-  bytes pcm = 1;
-  // Marks the last frame of a user turn. The backend may begin emitting
-  // a response immediately after seeing this.
-  bool end_of_input = 2;
-}
-
-message AudioToAudioControl {
-  // Free-form control event names. Initial set:
-  //   "input_audio_buffer.commit"     — user finished speaking
-  //   "response.cancel"               — abort in-flight generation
-  //   "conversation.item.create"      — inject a non-audio item (e.g.
-  //                                     function_call_output as JSON in
-  //                                     `payload`)
-  //   "session.update"                — re-configure mid-stream
-  string event = 1;
-  // Event-specific JSON payload.
-  bytes payload = 2;
-}
-
-message AudioToAudioResponse {
-  // Event identifies what this frame carries. Mirrors the OpenAI Realtime
-  // API server-event names where applicable. Initial set:
-  //   "response.audio.delta"
-  //   "response.audio_transcript.delta"
-  //   "response.function_call_arguments.delta"
-  //   "response.function_call_arguments.done"
-  //   "response.done"
-  //   "error"
-  string event = 1;
-  // Populated when event = response.audio.delta.
-  bytes pcm = 2;
-  // Populated alongside pcm to identify its rate. 0 => same as the
-  // session's negotiated output_sample_rate.
-  int32 sample_rate = 3;
-  // JSON payload for non-PCM events (transcript chunk, tool args, error
-  // body).
-  bytes meta = 4;
-  // Monotonic per-stream counter, useful for client reordering and
-  // debugging.
-  int64 sequence = 5;
-}
-
 message ModelMetadataResponse {
  bool supports_thinking = 1;
  string rendered_template = 2;  // The rendered chat template with enable_thinking=true (empty if not applicable)
@@ -1163,32 +695,3 @@ message QuantizationStopRequest {
  string job_id = 1;
 }

-// ForwardHeader is one HTTP header on the request or response. Headers
-// like Authorization are typically injected by the backend (from the
-// resolved API key) rather than passed through from the client.
-message ForwardHeader {
-  string name = 1;
-  string value = 2;
-}
-
-// ForwardRequest is a streamed HTTP request to the upstream. First
-// message carries path/method/headers; subsequent messages carry
-// body_chunk only. All fields except body_chunk are honoured on the
-// first message and ignored thereafter.
-message ForwardRequest {
-  string path = 1;                          // e.g. "/v1/chat/completions" — appended to the model's upstream_url
-  string method = 2;                        // usually "POST"
-  repeated ForwardHeader headers = 3;
-  bytes body_chunk = 4;
-}
-
-// ForwardReply is a streamed HTTP response from the upstream. First
-// message carries status/headers; subsequent messages carry body_chunk
-// only. SSE responses arrive as a sequence of body_chunk frames; the
-// caller is responsible for any parsing.
-message ForwardReply {
-  int32 status = 1;
-  repeated ForwardHeader headers = 2;
-  bytes body_chunk = 3;
-}
-
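The ForwardRequest framing rule above (metadata on the first message, body_chunk only afterwards) can be sketched as follows. This is a minimal illustration that assumes a plain struct in place of the generated protobuf class; `DemoForwardRequest` and `frame_request` are hypothetical names that do not appear in the removed code, and headers are omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the generated ForwardRequest message.
struct DemoForwardRequest {
    std::string path;        // set on the first frame only
    std::string method;      // set on the first frame only
    std::string body_chunk;  // may be set on every frame
};

// Split one HTTP request into stream frames: the first frame carries the
// path and method plus the first body chunk, later frames carry body only.
std::vector<DemoForwardRequest> frame_request(const std::string &path,
                                              const std::string &method,
                                              const std::string &body,
                                              std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<DemoForwardRequest> frames;

    DemoForwardRequest first;
    first.path = path;
    first.method = method;
    first.body_chunk = body.substr(0, chunk_size);
    frames.push_back(first);

    for (std::size_t off = chunk_size; off < body.size(); off += chunk_size) {
        DemoForwardRequest next;  // path and method stay empty
        next.body_chunk = body.substr(off, chunk_size);
        frames.push_back(next);
    }
    return frames;
}
```

A receiver following the same rule would read path and method from the first frame and concatenate body_chunk across all frames.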
--- a/backend/cpp/ds4/.gitignore
+++ b/backend/cpp/ds4/.gitignore
@@ -1,10 +0,0 @@
-ds4/
-build/
-package/
-grpc-server
-ds4-worker
-*.o
-backend.pb.cc
-backend.pb.h
-backend.grpc.pb.cc
-backend.grpc.pb.h
--- a/backend/cpp/ds4/CMakeLists.txt
+++ b/backend/cpp/ds4/CMakeLists.txt
@@ -1,157 +0,0 @@
-cmake_minimum_required(VERSION 3.15)
-project(ds4-grpc-server LANGUAGES CXX C)
-
-set(CMAKE_CXX_STANDARD 17)
-set(CMAKE_CXX_STANDARD_REQUIRED ON)
-set(TARGET grpc-server)
-
-option(DS4_NATIVE "Compile with -march=native / -mcpu=native" ON)
-set(DS4_GPU "cpu" CACHE STRING "GPU backend: cpu, cuda, or metal")
-set(DS4_DIR "${CMAKE_CURRENT_SOURCE_DIR}/ds4" CACHE PATH "Path to cloned ds4 source")
-
-if(${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
-    # Homebrew installs protobuf/grpc under a non-default prefix. The generated
-    # backend.pb.cc / backend.grpc.pb.cc pull in google/protobuf and grpcpp
-    # headers, but the hw_grpc_proto library links neither target, so on macOS
-    # the headers (e.g. google/protobuf/runtime_version.h) are never on the
-    # compiler's include path. Add the Homebrew prefix globally, matching the
-    # llama-cpp backend which builds on Darwin CI.
-    if(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "arm64")
-        set(HOMEBREW_DEFAULT_PREFIX "/opt/homebrew")
-    else()
-        set(HOMEBREW_DEFAULT_PREFIX "/usr/local")
-    endif()
-    link_directories("${HOMEBREW_DEFAULT_PREFIX}/lib")
-    include_directories("${HOMEBREW_DEFAULT_PREFIX}/include")
-endif()
-
-find_package(Threads REQUIRED)
-find_package(Protobuf CONFIG QUIET)
-if(NOT Protobuf_FOUND)
-    find_package(Protobuf REQUIRED)
-endif()
-find_package(gRPC CONFIG QUIET)
-if(NOT gRPC_FOUND)
-    # Ubuntu's apt-installed grpc++ does not ship a CMake config - fall back.
-    find_library(GRPCPP_LIB grpc++ REQUIRED)
-    find_library(GRPCPP_REFLECTION_LIB grpc++_reflection REQUIRED)
-    add_library(gRPC::grpc++ INTERFACE IMPORTED)
-    set_target_properties(gRPC::grpc++ PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_LIB}")
-    add_library(gRPC::grpc++_reflection INTERFACE IMPORTED)
-    set_target_properties(gRPC::grpc++_reflection PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_REFLECTION_LIB}")
-endif()
-
-find_program(_PROTOC NAMES protoc REQUIRED)
-find_program(_GRPC_CPP_PLUGIN NAMES grpc_cpp_plugin REQUIRED)
-
-get_filename_component(HW_PROTO "${CMAKE_CURRENT_SOURCE_DIR}/../../backend.proto" ABSOLUTE)
-get_filename_component(HW_PROTO_PATH "${HW_PROTO}" PATH)
-
-set(HW_PROTO_SRCS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.cc")
-set(HW_PROTO_HDRS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.h")
-set(HW_GRPC_SRCS  "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.cc")
-set(HW_GRPC_HDRS  "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.h")
-
-add_custom_command(
-    OUTPUT "${HW_PROTO_SRCS}" "${HW_PROTO_HDRS}" "${HW_GRPC_SRCS}" "${HW_GRPC_HDRS}"
-    COMMAND ${_PROTOC}
-    ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
-         --cpp_out  "${CMAKE_CURRENT_BINARY_DIR}"
-         -I "${HW_PROTO_PATH}"
-         --plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN}"
-         "${HW_PROTO}"
-    DEPENDS "${HW_PROTO}")
-
-add_library(hw_grpc_proto STATIC
-    ${HW_GRPC_SRCS} ${HW_GRPC_HDRS}
-    ${HW_PROTO_SRCS} ${HW_PROTO_HDRS})
-target_include_directories(hw_grpc_proto PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
-
-set(DS4_OBJS "${DS4_DIR}/ds4.o")
-if(DS4_GPU STREQUAL "cuda")
-    list(APPEND DS4_OBJS "${DS4_DIR}/ds4_cuda.o")
-elseif(DS4_GPU STREQUAL "metal")
-    list(APPEND DS4_OBJS "${DS4_DIR}/ds4_metal.o")
-elseif(DS4_GPU STREQUAL "cpu")
-    set(DS4_OBJS "${DS4_DIR}/ds4_cpu.o")
-endif()
-
-# ds4.c now references ds4_distributed.c (distributed inference) and ds4_ssd.c
-# (SSD expert-cache), each split into its own translation unit upstream. Both
-# are GPU-agnostic objects shared by every GPU mode, so link them in regardless
-# of DS4_GPU.
-list(APPEND DS4_OBJS "${DS4_DIR}/ds4_distributed.o")
-list(APPEND DS4_OBJS "${DS4_DIR}/ds4_ssd.o")
-
-add_executable(${TARGET}
-    grpc-server.cpp
-    dsml_parser.cpp
-    dsml_renderer.cpp
-    kv_cache.cpp)
-
-target_include_directories(${TARGET} PRIVATE ${DS4_DIR})
-
-foreach(obj ${DS4_OBJS})
-    target_sources(${TARGET} PRIVATE ${obj})
-    set_source_files_properties(${obj} PROPERTIES EXTERNAL_OBJECT TRUE GENERATED TRUE)
-endforeach()
-
-target_link_libraries(${TARGET} PRIVATE
-    hw_grpc_proto
-    gRPC::grpc++
-    gRPC::grpc++_reflection
-    protobuf::libprotobuf
-    Threads::Threads
-    m)
-
-if(DS4_GPU STREQUAL "cuda")
-    find_package(CUDAToolkit REQUIRED)
-    target_link_libraries(${TARGET} PRIVATE CUDA::cudart CUDA::cublas)
-elseif(DS4_GPU STREQUAL "metal")
-    find_library(FOUNDATION_LIB Foundation REQUIRED)
-    find_library(METAL_LIB Metal REQUIRED)
-    target_link_libraries(${TARGET} PRIVATE ${FOUNDATION_LIB} ${METAL_LIB})
-elseif(DS4_GPU STREQUAL "cpu")
-    target_compile_definitions(${TARGET} PRIVATE DS4_NO_GPU)
-endif()
-
-if(DS4_NATIVE)
-    if(APPLE)
-        target_compile_options(${TARGET} PRIVATE -mcpu=native)
-    else()
-        target_compile_options(${TARGET} PRIVATE -march=native)
-    endif()
-endif()
-
-# ds4-worker: standalone distributed worker. Links the same ds4 engine objects
-# (including ds4_distributed.o) but has NO gRPC/protobuf dependency - it speaks
-# ds4's own TCP transport via ds4_dist_run(). Buildable wherever the engine
-# objects build, even on hosts without protobuf/grpc dev headers.
-add_executable(ds4-worker worker_main.c)
-target_include_directories(ds4-worker PRIVATE ${DS4_DIR})
-foreach(obj ${DS4_OBJS})
-    target_sources(ds4-worker PRIVATE ${obj})
-    set_source_files_properties(${obj} PROPERTIES EXTERNAL_OBJECT TRUE GENERATED TRUE)
-endforeach()
-# worker_main.c is C, but the engine objects built by nvcc (ds4_cuda.o) and the
-# Metal path (ds4_metal.o, Obj-C++) reference the C++ runtime (libstdc++). Force
-# the C++ linker driver so those symbols resolve; the C driver would not link
-# libstdc++ and the CUDA/Metal builds fail with undefined std:: references.
-set_target_properties(ds4-worker PROPERTIES LINKER_LANGUAGE CXX)
-target_link_libraries(ds4-worker PRIVATE Threads::Threads m)
-
-if(DS4_GPU STREQUAL "cuda")
-    target_link_libraries(ds4-worker PRIVATE CUDA::cudart CUDA::cublas)
-elseif(DS4_GPU STREQUAL "metal")
-    target_link_libraries(ds4-worker PRIVATE ${FOUNDATION_LIB} ${METAL_LIB})
-elseif(DS4_GPU STREQUAL "cpu")
-    target_compile_definitions(ds4-worker PRIVATE DS4_NO_GPU)
-endif()
-
-if(DS4_NATIVE)
-    if(APPLE)
-        target_compile_options(ds4-worker PRIVATE -mcpu=native)
-    else()
-        target_compile_options(ds4-worker PRIVATE -march=native)
-    endif()
-endif()
--- a/backend/cpp/ds4/Makefile
+++ b/backend/cpp/ds4/Makefile
@@ -1,83 +0,0 @@
-# ds4 backend Makefile.
-#
-# Upstream pin lives below as DS4_VERSION?=80ebbc396aee40eedc1d829222f3362d10fa4c6c
-# so that the bump script (.github/bump_deps.sh) can find and update it. This
-# matches the llama-cpp / ik-llama-cpp / turboquant convention.
-
-DS4_VERSION?=80ebbc396aee40eedc1d829222f3362d10fa4c6c
-DS4_REPO?=https://github.com/antirez/ds4
-
-CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
-BUILD_DIR := build
-
-BUILD_TYPE ?=
-NATIVE ?= false
-JOBS ?= $(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
-
-UNAME_S := $(shell uname -s)
-
-CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release
-
-# ds4_distributed.o and ds4_ssd.o are GPU-agnostic translation units that
-# ds4.c/ds4_cpu.o now reference (upstream split distributed inference and the
-# SSD expert-cache into their own .c files). Both objects are shared by every
-# GPU mode, so they are appended unconditionally below.
-ifeq ($(BUILD_TYPE),cublas)
-    CMAKE_ARGS += -DDS4_GPU=cuda
-    DS4_OBJ_TARGET := ds4.o ds4_cuda.o ds4_distributed.o ds4_ssd.o
-else ifeq ($(UNAME_S),Darwin)
-    CMAKE_ARGS += -DDS4_GPU=metal
-    DS4_OBJ_TARGET := ds4.o ds4_metal.o ds4_distributed.o ds4_ssd.o
-else
-    # CPU reference path (Linux only - macOS CPU path is broken by VM bug per ds4 README).
-    CMAKE_ARGS += -DDS4_GPU=cpu
-    DS4_OBJ_TARGET := ds4_cpu.o ds4_distributed.o ds4_ssd.o
-endif
-
-ifneq ($(NATIVE),true)
-    CMAKE_ARGS += -DDS4_NATIVE=OFF
-endif
-
-.PHONY: grpc-server package clean purge test all
-all: grpc-server
-
-# Clone the upstream ds4 source at the pinned commit. Directory acts as the
-# target so make only re-clones when missing. After a DS4_VERSION bump,
-# run 'make purge && make' to refetch (or rely on CI's clean build).
-ds4:
-	mkdir -p ds4
-	cd ds4 && \
-	git init -q && \
-	git remote add origin $(DS4_REPO) && \
-	git fetch --depth 1 origin $(DS4_VERSION) && \
-	git checkout FETCH_HEAD
-
-# Build ds4's engine object files via its own Makefile, which already encodes
-# the right per-platform compile flags (Objective-C/Metal on Darwin, nvcc on Linux+CUDA).
-ds4/ds4.o: ds4
-ifeq ($(BUILD_TYPE),cublas)
-	+$(MAKE) -C ds4 ds4.o ds4_cuda.o ds4_distributed.o ds4_ssd.o
-else ifeq ($(UNAME_S),Darwin)
-	+$(MAKE) -C ds4 ds4.o ds4_metal.o ds4_distributed.o ds4_ssd.o
-else
-	+$(MAKE) -C ds4 ds4_cpu.o ds4_distributed.o ds4_ssd.o
-endif
-
-grpc-server: ds4/ds4.o
-	mkdir -p $(BUILD_DIR)
-	cd $(BUILD_DIR) && cmake $(CMAKE_ARGS) $(CURRENT_MAKEFILE_DIR) && cmake --build . --config Release -j $(JOBS)
-	cp $(BUILD_DIR)/grpc-server grpc-server
-	cp $(BUILD_DIR)/ds4-worker ds4-worker
-
-package: grpc-server
-	bash package.sh
-
-test:
-	@echo "ds4 backend: e2e coverage at tests/e2e-backends/ (BACKEND_BINARY mode)"
-
-clean:
-	rm -rf $(BUILD_DIR) grpc-server ds4-worker package
-	if [ -d ds4 ]; then $(MAKE) -C ds4 clean; fi
-
-purge: clean
-	rm -rf ds4
--- a/backend/cpp/ds4/dsml_parser.cpp
+++ b/backend/cpp/ds4/dsml_parser.cpp
@@ -1,359 +0,0 @@
-#include "dsml_parser.h"
-
-#include <algorithm>
-#include <cstdio>
-#include <cstring>
-#include <chrono>
-#include <random>
-#include <string>
-#include <vector>
-
-namespace ds4cpp {
-
-namespace {
-
-constexpr const char *kThinkOpen      = "<think>";
-constexpr const char *kThinkClose     = "</think>";
-constexpr const char *kToolsOpen      = "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>";   // <｜DSML｜tool_calls>
-constexpr const char *kToolsClose     = "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>"; // </｜DSML｜tool_calls>
-constexpr const char *kInvokeOpenPfx  = "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\""; // <｜DSML｜invoke name="
-constexpr const char *kInvokeClose    = "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>";       // </｜DSML｜invoke>
-constexpr const char *kParamOpenPfx   = "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter name=\""; // <｜DSML｜parameter name="
-constexpr const char *kParamClose     = "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>";       // </｜DSML｜parameter>
-
-// All structural markers the parser might encounter - used to detect "buf
-// might be a partial marker, don't drain yet" conditions.
-const std::vector<std::string> &all_markers() {
-    static const std::vector<std::string> v = {
-        kThinkOpen, kThinkClose,
-        kToolsOpen, kToolsClose,
-        kInvokeOpenPfx, kInvokeClose,
-        kParamOpenPfx, kParamClose,
-    };
-    return v;
-}
-
-// Returns true if `buf` could be a *prefix* of any marker (i.e., we should
-// wait for more text before draining as plain content). The marker-prefix
-// loop handles fixed markers exactly. For markers with variable-length
-// internal data (kInvokeOpenPfx, kParamOpenPfx have an open quote, then the
-// tool/param name, then a closing quote and `>`), we also wait while buf
-// starts with `<` and has not yet seen a `>`: the leading `<` could be the
-// start of one of those open markers, or a literal that we can confirm only
-// once we know what follows. Anything after the first `>` arrives is either
-// consumed by TryConsumeMarker or emitted as a literal `<` by the caller.
-bool looks_like_prefix(const std::string &buf) {
-    for (const auto &m : all_markers()) {
-        if (m.size() > buf.size() && m.compare(0, buf.size(), buf) == 0) return true;
-    }
-    if (!buf.empty() && buf[0] == '<' && buf.find('>') == std::string::npos) {
-        return true;
-    }
-    return false;
-}
-
-bool consume_literal(std::string &buf, const std::string &lit) {
-    if (buf.compare(0, lit.size(), lit) == 0) {
-        buf.erase(0, lit.size());
-        return true;
-    }
-    return false;
-}
-
-// Find the next '<' in buf starting at offset; returns std::string::npos if none.
-size_t next_tag(const std::string &buf, size_t off = 0) {
-    return buf.find('<', off);
-}
-
-std::string json_escape(const std::string &in) {
-    std::string out;
-    out.reserve(in.size() + 2);
-    for (char c : in) {
-        switch (c) {
-            case '"':  out += "\\\""; break;
-            case '\\': out += "\\\\"; break;
-            case '\b': out += "\\b"; break;
-            case '\f': out += "\\f"; break;
-            case '\n': out += "\\n"; break;
-            case '\r': out += "\\r"; break;
-            case '\t': out += "\\t"; break;
-            default:
-                if (static_cast<unsigned char>(c) < 0x20) {
-                    char tmp[8];
-                    std::snprintf(tmp, sizeof(tmp), "\\u%04x", static_cast<unsigned>(static_cast<unsigned char>(c)));
-                    out += tmp;
-                } else {
-                    out += c;
-                }
-        }
-    }
-    return out;
-}
-
-} // namespace
-
-DsmlParser::DsmlParser() = default;
-
-bool DsmlParser::IsInDsmlStructural() const {
-    switch (state_) {
-        case State::TOOL_CALLS:
-        case State::INVOKE:
-            return true;
-        case State::PARAM_VALUE:  // payload bytes; user sampling applies
-        case State::TEXT:
-        case State::THINK:
-            return false;
-    }
-    return false;
-}
-
-void DsmlParser::EmitArgsChunk(const std::string &chunk, std::vector<ParserEvent> &out) {
-    if (chunk.empty()) return;
-    ParserEvent e;
-    e.type = ParserEvent::TOOL_ARGS;
-    e.text = chunk;
-    e.index = tool_index_;
-    out.push_back(std::move(e));
-}
-
-void DsmlParser::FinishCurrentToolCall(std::vector<ParserEvent> &out) {
-    if (tool_index_ < 0) return;
-    // Close the JSON object that was opened on the first parameter.
-    if (args_emitted_open_brace_) {
-        EmitArgsChunk("}", out);
-    } else {
-        EmitArgsChunk("{}", out);
-    }
-    ParserEvent e;
-    e.type = ParserEvent::TOOL_END;
-    e.index = tool_index_;
-    out.push_back(std::move(e));
-    current_tool_name_.clear();
-    args_emitted_open_brace_ = false;
-    args_param_count_ = 0;
-}
-
-bool DsmlParser::TryConsumeMarker(std::vector<ParserEvent> &out) {
-    switch (state_) {
-    case State::TEXT: {
-        if (consume_literal(buf_, kThinkOpen))   { state_ = State::THINK;       return true; }
-        if (consume_literal(buf_, kToolsOpen))   { state_ = State::TOOL_CALLS;  return true; }
-        return false;
-    }
-    case State::THINK: {
-        if (consume_literal(buf_, kThinkClose))  { state_ = State::TEXT;        return true; }
-        return false;
-    }
-    case State::TOOL_CALLS: {
-        if (consume_literal(buf_, kToolsClose))  { state_ = State::TEXT;        return true; }
-        // <｜DSML｜invoke name="X">
-        if (buf_.compare(0, std::strlen(kInvokeOpenPfx), kInvokeOpenPfx) == 0) {
-            size_t close_q = buf_.find('"', std::strlen(kInvokeOpenPfx));
-            if (close_q == std::string::npos) return false; // need more bytes
-            size_t close_gt = buf_.find('>', close_q);
-            if (close_gt == std::string::npos) return false;
-            current_tool_name_ = buf_.substr(std::strlen(kInvokeOpenPfx),
-                                             close_q - std::strlen(kInvokeOpenPfx));
-            tool_index_++;
-            buf_.erase(0, close_gt + 1);
-            ParserEvent e;
-            e.type = ParserEvent::TOOL_START;
-            e.tool_name = current_tool_name_;
-            e.tool_id   = RandomToolId();
-            e.index     = tool_index_;
-            out.push_back(std::move(e));
-            args_emitted_open_brace_ = false;
-            args_param_count_ = 0;
-            state_ = State::INVOKE;
-            return true;
-        }
-        return false;
-    }
-    case State::INVOKE: {
-        if (consume_literal(buf_, kInvokeClose)) {
-            FinishCurrentToolCall(out);
-            state_ = State::TOOL_CALLS;
-            return true;
-        }
-        // <｜DSML｜parameter name="K" string="true|false">
-        if (buf_.compare(0, std::strlen(kParamOpenPfx), kParamOpenPfx) == 0) {
-            size_t close_q = buf_.find('"', std::strlen(kParamOpenPfx));
-            if (close_q == std::string::npos) return false;
-            size_t string_attr = buf_.find("string=\"", close_q);
-            if (string_attr == std::string::npos) return false;
-            size_t string_q = buf_.find('"', string_attr + 8);
-            if (string_q == std::string::npos) return false;
-            size_t close_gt = buf_.find('>', string_q);
-            if (close_gt == std::string::npos) return false;
-            param_name_ = buf_.substr(std::strlen(kParamOpenPfx),
-                                      close_q - std::strlen(kParamOpenPfx));
-            std::string string_val = buf_.substr(string_attr + 8,
-                                                 string_q - (string_attr + 8));
-            param_is_string_ = (string_val == "true");
-            param_value_.clear();
-            buf_.erase(0, close_gt + 1);
-            // Emit args JSON opener / separator.
-            std::string opener;
-            if (!args_emitted_open_brace_) { opener = "{"; args_emitted_open_brace_ = true; }
-            else                            { opener = ","; }
-            opener += "\"" + json_escape(param_name_) + "\":";
-            if (param_is_string_) opener += "\"";
-            EmitArgsChunk(opener, out);
-            args_param_count_++;
-            state_ = State::PARAM_VALUE;
-            return true;
-        }
-        return false;
-    }
-    case State::PARAM_VALUE: {
-        if (consume_literal(buf_, kParamClose)) {
-            if (param_is_string_) EmitArgsChunk("\"", out);
-            state_ = State::INVOKE;
-            return true;
-        }
-        return false;
-    }
-    }
-    return false;
-}
-
-void DsmlParser::DrainPlain(std::vector<ParserEvent> &out) {
-    // Drain everything up to the next '<' that *might* start a marker.
-    // Anything before the next '<' is safe to emit; the '<...' tail stays buffered.
-    while (!buf_.empty()) {
-        size_t lt = next_tag(buf_, 0);
-        if (lt == std::string::npos) {
-            // No tag at all - emit (or accumulate) the whole buffer.
-            ParserEvent e;
-            if (state_ == State::PARAM_VALUE) {
-                std::string esc = param_is_string_ ? json_escape(buf_) : buf_;
-                EmitArgsChunk(esc, out);
-            } else if (state_ == State::THINK) {
-                e.type = ParserEvent::REASONING;
-                e.text = buf_;
-                out.push_back(std::move(e));
-            } else if (state_ == State::TEXT) {
-                e.type = ParserEvent::CONTENT;
-                e.text = buf_;
-                out.push_back(std::move(e));
-            }
-            // Inside INVOKE / TOOL_CALLS with no marker, raw bytes are
-            // structural whitespace - discard.
-            buf_.clear();
-            return;
-        }
-        if (lt > 0) {
-            std::string chunk = buf_.substr(0, lt);
-            buf_.erase(0, lt);
-            ParserEvent e;
-            if (state_ == State::PARAM_VALUE) {
-                std::string esc = param_is_string_ ? json_escape(chunk) : chunk;
-                EmitArgsChunk(esc, out);
-            } else if (state_ == State::THINK) {
-                e.type = ParserEvent::REASONING;
-                e.text = chunk;
-                out.push_back(std::move(e));
-            } else if (state_ == State::TEXT) {
-                e.type = ParserEvent::CONTENT;
-                e.text = chunk;
-                out.push_back(std::move(e));
-            }
-        }
-        // buf_[0] == '<' - try consuming a marker. If we consumed one, loop again.
-        if (!TryConsumeMarker(out)) {
-            // Could be a partial marker - wait for more bytes.
-            if (looks_like_prefix(buf_)) return;
-            // Otherwise this '<' is a literal - emit one char and continue.
-            std::string one(1, buf_[0]);
-            buf_.erase(0, 1);
-            ParserEvent e;
-            if (state_ == State::PARAM_VALUE) {
-                std::string esc = param_is_string_ ? json_escape(one) : one;
-                EmitArgsChunk(esc, out);
-            } else if (state_ == State::THINK) {
-                e.type = ParserEvent::REASONING;
-                e.text = one;
-                out.push_back(std::move(e));
-            } else if (state_ == State::TEXT) {
-                e.type = ParserEvent::CONTENT;
-                e.text = one;
-                out.push_back(std::move(e));
-            }
-        }
-    }
-}
-
-void DsmlParser::Feed(const std::string &chunk, std::vector<ParserEvent> &out) {
-    buf_ += chunk;
-    DrainPlain(out);
-}
-
-void DsmlParser::Flush(std::vector<ParserEvent> &out) {
-    // At flush time we no longer wait for marker completion - drain everything
-    // (the trailing bytes won't grow). Mirror DrainPlain's state-aware
-    // classification: PARAM_VALUE bytes become TOOL_ARGS, THINK bytes become
-    // REASONING, TEXT bytes become CONTENT, and INVOKE/TOOL_CALLS bytes are
-    // structural whitespace (discarded).
-    auto emit_plain = [&](const std::string &chunk) {
-        if (chunk.empty()) return;
-        if (state_ == State::PARAM_VALUE) {
-            std::string esc = param_is_string_ ? json_escape(chunk) : chunk;
-            EmitArgsChunk(esc, out);
-            return;
-        }
-        if (state_ == State::THINK) {
-            ParserEvent e;
-            e.type = ParserEvent::REASONING;
-            e.text = chunk;
-            out.push_back(std::move(e));
-            return;
-        }
-        if (state_ == State::TEXT) {
-            ParserEvent e;
-            e.type = ParserEvent::CONTENT;
-            e.text = chunk;
-            out.push_back(std::move(e));
-            return;
-        }
-        // INVOKE / TOOL_CALLS: structural whitespace, discard.
-    };
-    while (!buf_.empty()) {
-        size_t lt = next_tag(buf_, 0);
-        if (lt == std::string::npos) {
-            emit_plain(buf_);
-            buf_.clear();
-            return;
-        }
-        if (lt > 0) {
-            std::string chunk = buf_.substr(0, lt);
-            buf_.erase(0, lt);
-            emit_plain(chunk);
-        }
-        if (!TryConsumeMarker(out)) {
-            // Definitely a literal '<' now (no chance of more bytes arriving).
-            std::string one(1, buf_[0]);
-            buf_.erase(0, 1);
-            emit_plain(one);
-        }
-    }
-    // If we ended mid-tool-call (model truncated), close it cleanly.
-    if (state_ == State::INVOKE || state_ == State::PARAM_VALUE) {
-        if (state_ == State::PARAM_VALUE && param_is_string_) EmitArgsChunk("\"", out);
-        FinishCurrentToolCall(out);
-        state_ = State::TEXT;
-    }
-}
-
-std::string RandomToolId() {
-    static thread_local std::mt19937_64 rng{
-        static_cast<uint64_t>(std::chrono::system_clock::now().time_since_epoch().count())};
-    const char *alphabet =
-        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
-    std::string out = "call_";
-    for (int i = 0; i < 16; ++i) {
-        out += alphabet[rng() % 62];
-    }
-    return out;
-}
-
-} // namespace ds4cpp
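The partial-marker buffering used by the removed parser can be sketched in isolation. This is a simplified illustration, assuming an ASCII-only marker set; `kDemoMarkers`, `is_partial_marker`, and `take_safe_text` are hypothetical names, and the sketch leaves out the extra "starts with `<` and has no `>` yet" rule that the removed looks_like_prefix also applies.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified marker set for illustration only.
static const std::vector<std::string> kDemoMarkers = {"<think>", "</think>"};

// True when `buf` is a proper prefix of some marker, meaning a streaming
// parser should wait for more bytes before treating `buf` as plain text.
bool is_partial_marker(const std::string &buf) {
    if (buf.empty()) return false;
    for (const auto &m : kDemoMarkers) {
        if (m.size() > buf.size() && m.compare(0, buf.size(), buf) == 0) {
            return true;
        }
    }
    return false;
}

// Remove and return the text before the first '<'. Whatever remains in
// `buf` starts at a possible marker and still needs a decision.
std::string take_safe_text(std::string &buf) {
    const std::size_t lt = buf.find('<');
    const std::size_t n = (lt == std::string::npos) ? buf.size() : lt;
    assert(n <= buf.size());
    std::string safe = buf.substr(0, n);
    buf.erase(0, n);
    return safe;
}
```

With a chunk such as `hello <thi`, the text before `<` can be emitted right away while `<thi` stays buffered until the next chunk shows whether it completes a marker.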
--- a/backend/cpp/ds4/dsml_parser.h
+++ b/backend/cpp/ds4/dsml_parser.h
@@ -1,77 +0,0 @@
-#pragma once
-#include <functional>
-#include <string>
-#include <vector>
-
-namespace ds4cpp {
-
-struct ParserEvent {
-    enum Type { CONTENT, REASONING, TOOL_START, TOOL_ARGS, TOOL_END };
-    Type type;
-    std::string text;        // CONTENT, REASONING, TOOL_ARGS
-    std::string tool_name;   // TOOL_START
-    std::string tool_id;     // TOOL_START (generated by the parser via RandomToolId())
-    int index = 0;           // TOOL_START / TOOL_ARGS / TOOL_END
-};
-
-// Streaming parser. Each instance keeps its own state and shares none with other instances; use one per Predict call.
-class DsmlParser {
-public:
-    DsmlParser();
-
-    // Feed a chunk of raw model-emitted text. Appends classified events to
-    // `out`. May buffer the tail of `chunk` internally if it looks like a
-    // marker prefix.
-    void Feed(const std::string &chunk, std::vector<ParserEvent> &out);
-
-    // Flush any remaining buffered text, classified by the current state, and close a truncated tool call (called at generation end).
-    void Flush(std::vector<ParserEvent> &out);
-
-    // True when the parser is inside a DSML structural position - that is,
-    // tags/markers between tool-call boundaries where the model is expected
-    // to emit protocol bytes verbatim. Mirrors ds4_server.c's "force
-    // temperature=0 unless dsml_decode_state_uses_payload_sampling" rule:
-    //
-    //   TEXT / THINK                  -> false (user sampling applies)
-    //   PARAM_VALUE                   -> false (payload uses user sampling)
-    //   TOOL_CALLS / INVOKE           -> true  (structural; force greedy)
-    //
-    // Callers should use this BEFORE the next sample() call to pick the
-    // effective temperature; the parser's state reflects what's already
-    // been consumed, so it predicts the next token's classification.
-    bool IsInDsmlStructural() const;
-
-private:
-    enum class State { TEXT, THINK, TOOL_CALLS, INVOKE, PARAM_VALUE };
-    State state_ = State::TEXT;
-    std::string buf_;
-    std::string current_tool_name_;
-    int tool_index_ = -1;
-    // While parsing a parameter value:
-    std::string param_name_;
-    bool param_is_string_ = true;
-    std::string param_value_;
-    // Incrementally-built arguments JSON for the active tool call.
-    std::string args_json_so_far_;
-    bool args_emitted_open_brace_ = false;
-    int args_param_count_ = 0;
-
-    // Try to consume one structural marker starting at buf_[0]. Returns true
-    // and advances state if a complete marker was consumed; false if no
-    // complete marker starts there (not a marker, or only a prefix so far).
-    bool TryConsumeMarker(std::vector<ParserEvent> &out);
-
-    // Drain plain text from buf_ as far as we're sure it's not a marker prefix.
-    // Emits CONTENT, REASONING, or TOOL_ARGS depending on current state.
-    void DrainPlain(std::vector<ParserEvent> &out);
-
-    // Emit the next chunk of arguments JSON to the consumer.
-    void EmitArgsChunk(const std::string &chunk, std::vector<ParserEvent> &out);
-    void FinishCurrentToolCall(std::vector<ParserEvent> &out);
-};
-
-// Generate a random tool call ID ("call_" plus 16 alphanumeric characters). Used by the gRPC layer
-// when assigning IDs to streamed tool calls.
-std::string RandomToolId();
-
-} // namespace ds4cpp
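The incremental arguments-JSON emission that the removed header describes (an opening brace on the first parameter, a comma on later ones, a closing brace at the end of the tool call) can be sketched as a small helper. This is an illustration under stated assumptions: `ArgsJsonStream` is a hypothetical class, and parameter names and values are assumed to need no JSON escaping, whereas the removed parser escapes them.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper that returns the JSON fragments to stream for one
// tool call. Names and values are assumed to be already JSON-safe.
class ArgsJsonStream {
public:
    // Fragment to emit when a parameter opens: "{" or "," then "name":
    // and, for string parameters, the opening quote of the value.
    std::string OpenParam(const std::string &name, bool is_string) {
        assert(!in_param_);
        std::string chunk = opened_ ? "," : "{";
        opened_ = true;
        in_param_ = true;
        is_string_ = is_string;
        chunk += "\"" + name + "\":";
        if (is_string) chunk += "\"";
        return chunk;
    }

    // Fragment to emit when the parameter closes: the closing quote for
    // string parameters, nothing for raw JSON values.
    std::string CloseParam() {
        assert(in_param_);
        in_param_ = false;
        return is_string_ ? "\"" : "";
    }

    // Fragment to emit when the tool call ends. A call with no parameters
    // yields an empty object.
    std::string Finish() {
        std::string chunk = opened_ ? "}" : "{}";
        opened_ = false;
        return chunk;
    }

private:
    bool opened_ = false;
    bool in_param_ = false;
    bool is_string_ = false;
};
```

Concatenating the fragments in emission order, with the parameter values in between, gives one complete JSON object per tool call, which is why each fragment can be forwarded to the client as soon as it is produced.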
--- a/backend/cpp/ds4/dsml_renderer.cpp
+++ b/backend/cpp/ds4/dsml_renderer.cpp
@@ -1,140 +0,0 @@
-#include "dsml_renderer.h"
-
-// nlohmann::json is required; there is no hand-rolled fallback parser.
-// Builds use the apt-installed nlohmann-json3-dev package when present,
-// otherwise the copy bundled in the LocalAI tree. If the header cannot be
-// found on the include path, the #error below stops the build.
-#if __has_include(<nlohmann/json.hpp>)
-#include <nlohmann/json.hpp>
-using json = nlohmann::json;
-#else
-#error "nlohmann/json.hpp not found; install nlohmann-json3-dev"
-#endif
-
-#include <sstream>
-
-namespace ds4cpp {
-
-namespace {
-
-void render_param(std::ostringstream &os, const std::string &name,
-                  const json &value) {
-    bool is_string = value.is_string();
-    os << "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter name=\"" << name
-       << "\" string=\"" << (is_string ? "true" : "false") << "\">";
-    if (is_string) {
-        os << value.get<std::string>();
-    } else {
-        os << value.dump();
-    }
-    os << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>\n";
-}
-
-} // namespace
-
-std::string RenderAssistantToolCalls(const std::string &tool_calls_json) {
-    if (tool_calls_json.empty()) return "";
-    json arr;
-    try {
-        arr = json::parse(tool_calls_json);
-    } catch (const std::exception &) {
-        return "";
-    }
-    if (!arr.is_array() || arr.empty()) return "";
-
-    std::ostringstream os;
-    os << "\n\n<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\n";
-    for (const auto &call : arr) {
-        // OpenAI shape: { id, type, function: { name, arguments (JSON string) } }
-        // Anthropic shape comes through normalized by LocalAI.
-        std::string name;
-        std::string args_str;
-        if (call.contains("function")) {
-            const auto &fn = call["function"];
-            if (fn.contains("name") && fn["name"].is_string())
-                name = fn["name"].get<std::string>();
-            if (fn.contains("arguments") && fn["arguments"].is_string())
-                args_str = fn["arguments"].get<std::string>();
-        }
-        os << "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\"" << name << "\">\n";
-        if (!args_str.empty()) {
-            json args;
-            try {
-                args = json::parse(args_str);
-            } catch (...) {
-                args = json{};
-            }
-            if (args.is_object()) {
-                for (auto it = args.begin(); it != args.end(); ++it) {
-                    render_param(os, it.key(), it.value());
-                }
-            }
-        }
-        os << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>\n";
-    }
-    os << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>";
-    return os.str();
-}
-
-std::string RenderToolResult(const std::string &tool_call_id, const std::string &content) {
-    std::ostringstream os;
-    // ds4_server.c wraps tool results in a "tool_result" DSML tag carrying
-    // the tool_call_id. Match that shape.
-    os << "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_result id=\"" << tool_call_id << "\">"
-       << content
-       << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_result>";
-    return os.str();
-}
-
-std::string RenderToolsManifest(const std::string &tools_json) {
-    if (tools_json.empty()) return "";
-    json arr;
-    try {
-        arr = json::parse(tools_json);
-    } catch (const std::exception &) {
-        return "";
-    }
-    if (!arr.is_array() || arr.empty()) return "";
-
-    // Extract each OpenAI tool's `function` object, dump as compact JSON, one
-    // per line. Mirrors openai_function_schema_from_tool() in ds4_server.c.
-    std::ostringstream schemas;
-    for (const auto &tool : arr) {
-        if (tool.contains("function") && tool["function"].is_object()) {
-            schemas << tool["function"].dump() << "\n";
-        } else if (tool.is_object()) {
-            // Anthropic / direct-schema form: pass through.
-            schemas << tool.dump() << "\n";
-        }
-    }
-    if (schemas.tellp() == std::streampos(0)) return "";
-
-    // Verbatim text from ds4_server.c append_tools_prompt_text. Do NOT
-    // paraphrase - the model was trained on these exact bytes.
-    std::ostringstream os;
-    os << "## Tools\n\n"
-          "You have access to a set of tools to help answer the user question. "
-          "You can invoke tools by writing a \"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\" block like the following:\n\n"
-          "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\n"
-          "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\"$TOOL_NAME\">\n"
-          "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter name=\"$PARAMETER_NAME\" string=\"true|false\">$PARAMETER_VALUE</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>\n"
-          "...\n"
-          "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>\n"
-          "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\"$TOOL_NAME2\">\n"
-          "...\n"
-          "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>\n"
-          "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\n\n"
-          "String parameters should be specified as raw text and set `string=\"true\"`. "
-          "Preserve characters such as `>`, `&`, and `&&` exactly; never replace normal string characters with XML or HTML entity escapes. "
-          "Only if a string value itself contains the exact closing parameter tag `</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>`, write that tag as `&lt;/\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>` inside the value. "
-          "For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set `string=\"false\"`.\n\n"
-          "If thinking_mode is enabled (triggered by <think>), you MUST output your complete reasoning inside <think>...</think> BEFORE any tool calls or final response.\n\n"
-          "Otherwise, output directly after </think> with tool calls or final response.\n\n"
-          "### Available Tool Schemas\n\n"
-       << schemas.str()
-       << "\nYou MUST strictly follow the above defined tool name and parameter schemas to invoke tool calls. "
-          "Use the exact parameter names from the schemas.";
-    return os.str();
-}
-
-} // namespace ds4cpp
--- a/backend/cpp/ds4/dsml_renderer.h
+++ b/backend/cpp/ds4/dsml_renderer.h
@@ -1,27 +0,0 @@
-#pragma once
-#include <string>
-
-namespace ds4cpp {
-
-// Render an assistant message's tool_calls JSON array into the DSML block
-// that ds4 expects in its prompt. `tool_calls_json` is the value of
-// proto.Message.tool_calls (OpenAI shape: array of {id, type, function:{name, arguments}}).
-// Returns the DSML text to append after the assistant's content.
-std::string RenderAssistantToolCalls(const std::string &tool_calls_json);
-
-// Render a role="tool" message into the DSML "tool result" block. ds4's
-// prompt template expects tool results inside a specific tag; we wrap the
-// `content` with that tag and include the `tool_call_id` so the model can
-// correlate.
-std::string RenderToolResult(const std::string &tool_call_id, const std::string &content);
-
-// Render the "## Tools" manifest that ds4 expects in the SYSTEM prompt when
-// tools are available. Without this preamble the model has no idea tools
-// exist and will not emit DSML tool calls. Mirrors append_tools_prompt_text()
-// in ds4_server.c (~line 1646): a fixed preamble + "### Available Tool
-// Schemas" section + one JSON schema per line (extracted from each OpenAI
-// tool's .function object) + a fixed closing instruction. Returns empty
-// when tools_json is empty / unparseable.
-std::string RenderToolsManifest(const std::string &tools_json);
-
-} // namespace ds4cpp
--- a/backend/cpp/ds4/grpc-server.cpp
+++ b/backend/cpp/ds4/grpc-server.cpp
@@ -1,978 +0,0 @@
-// ds4 LocalAI gRPC backend.
-//
-// Wraps antirez/ds4's `ds4_engine_*` / `ds4_session_*` public API
-// (see ds4/ds4.h) over LocalAI's backend.proto, including DSML tool
-// calls, thinking mode, the disk KV cache, and the distributed
-// coordinator role.
-
-#include "backend.pb.h"
-#include "backend.grpc.pb.h"
-
-#include "dsml_parser.h"   // streaming DSML parser (DsmlParser, ParserEvent)
-#include "dsml_renderer.h" // tool-call, tool-result, and tools-manifest rendering
-#include "kv_cache.h"      // disk KV cache (KvCache)
-
-extern "C" {
-#include "ds4.h"
-}
-
-#include <grpcpp/grpcpp.h>
-#include <grpcpp/server.h>
-#include <grpcpp/server_builder.h>
-#include <grpcpp/ext/proto_server_reflection_plugin.h>
-
-#include <atomic>
-#include <chrono>
-#include <climits>
-#include <csignal>
-#include <cstddef>
-#include <cstdint>
-#include <cstdlib>
-#include <cstring>
-#include <ctime>
-#include <iostream>
-#include <memory>
-#include <mutex>
-#include <string>
-#include <thread>
-#include <vector>
-
-using grpc::Server;
-using grpc::ServerBuilder;
-using grpc::ServerContext;
-using grpc::ServerWriter;
-// NOTE: do NOT alias `grpc::Status` as `Status` - the Status RPC method below
-// would shadow the type, breaking the other RPC method declarations that use
-// it as a return type. Use GStatus instead.
-using GStatus = ::grpc::Status;
-using grpc::StatusCode;
-
-namespace {
-
-// Global state - ds4 is single-engine-per-process by design.
-std::mutex g_engine_mu;
-ds4_engine *g_engine = nullptr;
-ds4_session *g_session = nullptr;
-int g_ctx_size = 32768;
-std::string g_kv_cache_dir; // empty disables disk cache
-
-// Distributed coordinator state. g_distributed is set true when LoadModel is
-// given 'ds4_role:coordinator'; generation then waits for the worker route to
-// form before running. Single-node behavior is unchanged when unset.
-bool g_distributed = false;
-int g_route_timeout_sec = 60;
-
-std::atomic<Server *> g_server{nullptr};
-
-// Split "key:value" at the first colon; with no colon, returns {opt, ""}.
-static std::pair<std::string, std::string> split_option(const std::string &opt) {
-    auto colon = opt.find(':');
-    if (colon == std::string::npos) return {opt, ""};
-    return {opt.substr(0, colon), opt.substr(colon + 1)};
-}
-
-// Parse a positive base-10 integer. Returns false (without throwing) on empty,
-// trailing garbage, non-positive, or overflow - unlike std::stoi.
-static bool parse_positive_int(const std::string &s, int *out) {
-    if (s.empty()) return false;
-    char *end = nullptr;
-    long v = std::strtol(s.c_str(), &end, 10);
-    if (!end || *end != '\0' || v <= 0 || v > INT_MAX) return false;
-    *out = static_cast<int>(v);
-    return true;
-}
-
-// Parse a ds4 layer spec "START:END" or "START:output" into the engine's
-// distributed layer fields. Returns false on malformed input.
-static bool parse_layers_spec(const std::string &spec, ds4_distributed_layers *out) {
-    auto colon = spec.find(':');
-    if (colon == std::string::npos) return false;
-    std::string lhs = spec.substr(0, colon);
-    std::string rhs = spec.substr(colon + 1);
-    if (lhs.empty() || rhs.empty()) return false;
-    char *end = nullptr;
-    long start = std::strtol(lhs.c_str(), &end, 10);
-    if (!end || *end != '\0' || start < 0) return false;
-    out->start = static_cast<uint32_t>(start);
-    out->has_output = false;
-    if (rhs == "output") {
-        out->has_output = true;
-        out->end = out->start; // engine treats has_output as "through final layer"
-    } else {
-        long e = std::strtol(rhs.c_str(), &end, 10);
-        if (!end || *end != '\0' || e < start) return false;
-        out->end = static_cast<uint32_t>(e);
-    }
-    out->set = true;
-    return true;
-}
-
-// Parse a boolean LoadModel option. An empty value (a bare flag-style option
-// like "ssd_streaming" with no colon) means true so model YAMLs can write
-// options: ["ssd_streaming"] to enable a switch.
-static bool parse_bool_option(const std::string &s, bool *out) {
-    if (s.empty() || s == "true" || s == "1" || s == "yes" || s == "on") { *out = true; return true; }
-    if (s == "false" || s == "0" || s == "no" || s == "off") { *out = false; return true; }
-    return false;
-}
-
-// Table-driven mapping from LoadModel option keys to ds4_engine_options fields.
-// ds4_engine_options is a fixed C struct with no reflection, so the field set
-// is enumerated once here; adding a future engine knob is a one-line table
-// entry rather than a new branch in LoadModel. Two fields need ds4's own typed
-// parsers (Gib, CacheExperts) so a plain string passthrough can't cover them.
-enum class DsOptType { Bool, Int, Uint, Float, Str, Gib, CacheExperts };
-
-struct DsOptSpec {
-    const char *key;
-    DsOptType   type;
-    size_t      off;      // byte offset into ds4_engine_options
-    size_t      off2;     // second offset (CacheExperts writes experts + bytes)
-    bool        is_path;  // Str values: resolve a relative value against the model dir
-};
-
-static const DsOptSpec kEngineOptSpecs[] = {
-    {"mtp_path",                      DsOptType::Str,          offsetof(ds4_engine_options, mtp_path),                      0, true},
-    {"mtp_draft",                     DsOptType::Int,          offsetof(ds4_engine_options, mtp_draft_tokens),              0},
-    {"mtp_margin",                    DsOptType::Float,        offsetof(ds4_engine_options, mtp_margin),                    0},
-    {"prefill_chunk",                 DsOptType::Uint,         offsetof(ds4_engine_options, prefill_chunk),                 0},
-    {"power_percent",                 DsOptType::Int,          offsetof(ds4_engine_options, power_percent),                 0},
-    {"warm_weights",                  DsOptType::Bool,         offsetof(ds4_engine_options, warm_weights),                  0},
-    {"quality",                       DsOptType::Bool,         offsetof(ds4_engine_options, quality),                       0},
-    {"ssd_streaming",                 DsOptType::Bool,         offsetof(ds4_engine_options, ssd_streaming),                 0},
-    {"ssd_streaming_cold",            DsOptType::Bool,         offsetof(ds4_engine_options, ssd_streaming_cold),            0},
-    {"ssd_streaming_preload_experts", DsOptType::Uint,         offsetof(ds4_engine_options, ssd_streaming_preload_experts), 0},
-    {"ssd_streaming_cache_experts",   DsOptType::CacheExperts, offsetof(ds4_engine_options, ssd_streaming_cache_experts),
-                                                               offsetof(ds4_engine_options, ssd_streaming_cache_bytes)},
-    {"simulate_used_memory",          DsOptType::Gib,          offsetof(ds4_engine_options, simulate_used_memory_bytes),    0},
-    {"expert_profile_path",           DsOptType::Str,          offsetof(ds4_engine_options, expert_profile_path),           0, true},
-    {"directional_steering_file",     DsOptType::Str,          offsetof(ds4_engine_options, directional_steering_file),     0, true},
-    {"directional_steering_attn",     DsOptType::Float,        offsetof(ds4_engine_options, directional_steering_attn),     0},
-    {"directional_steering_ffn",      DsOptType::Float,        offsetof(ds4_engine_options, directional_steering_ffn),      0},
-};
-
-// Apply a single key:value LoadModel option to the engine options struct.
-// Unknown keys are ignored (back-compat: callers pass mixed option sets).
-// String values are copied into `storage`, whose elements the engine reads by
-// pointer during ds4_engine_open; `storage` MUST have reserved capacity so
-// push_back never reallocates and dangles an earlier c_str(). Returns false
-// with `err` set when a recognized key has an invalid value.
-static bool apply_engine_option(ds4_engine_options *opt, const std::string &key,
-                                const std::string &val, const std::string &model_dir,
-                                std::vector<std::string> &storage, std::string &err) {
-    const DsOptSpec *spec = nullptr;
-    for (const auto &s : kEngineOptSpecs) {
-        if (key == s.key) { spec = &s; break; }
-    }
-    if (!spec) return true; // unknown key: ignore
-
-    char *base = reinterpret_cast<char *>(opt);
-    switch (spec->type) {
-    case DsOptType::Bool: {
-        bool b = false;
-        if (!parse_bool_option(val, &b)) { err = key + " must be true/false"; return false; }
-        *reinterpret_cast<bool *>(base + spec->off) = b;
-        return true;
-    }
-    case DsOptType::Int: {
-        char *end = nullptr;
-        long v = std::strtol(val.c_str(), &end, 10);
-        if (val.empty() || !end || *end != '\0' || v < INT_MIN || v > INT_MAX) { err = key + " must be an integer"; return false; }
-        *reinterpret_cast<int *>(base + spec->off) = static_cast<int>(v);
-        return true;
-    }
-    case DsOptType::Uint: {
-        char *end = nullptr;
-        long long v = std::strtoll(val.c_str(), &end, 10);
-        if (val.empty() || !end || *end != '\0' || v < 0 || v > static_cast<long long>(UINT32_MAX)) {
-            err = key + " must be a non-negative integer"; return false;
-        }
-        *reinterpret_cast<uint32_t *>(base + spec->off) = static_cast<uint32_t>(v);
-        return true;
-    }
-    case DsOptType::Float: {
-        char *end = nullptr;
-        float f = std::strtof(val.c_str(), &end);
-        if (val.empty() || !end || *end != '\0') { err = key + " must be a number"; return false; }
-        *reinterpret_cast<float *>(base + spec->off) = f;
-        return true;
-    }
-    case DsOptType::Str: {
-        // Resolve a relative path option (e.g. mtp_path: a sibling GGUF the
-        // gallery downloaded next to the model) against the model directory, so
-        // YAMLs reference companion files by name. Absolute values pass through.
-        if (spec->is_path && !model_dir.empty() && !val.empty() && val.front() != '/') {
-            storage.push_back(model_dir + "/" + val);
-        } else {
-            storage.push_back(val);
-        }
-        *reinterpret_cast<const char **>(base + spec->off) = storage.back().c_str();
-        return true;
-    }
-    case DsOptType::Gib: {
-        uint64_t bytes = 0;
-        if (!ds4_parse_gib_arg(val.c_str(), &bytes)) {
-            err = key + " must be a GiB value, e.g. 64GB"; return false;
-        }
-        *reinterpret_cast<uint64_t *>(base + spec->off) = bytes;
-        return true;
-    }
-    case DsOptType::CacheExperts: {
-        uint32_t experts = 0;
-        uint64_t bytes = 0;
-        if (!ds4_parse_streaming_cache_experts_arg(val.c_str(), &experts, &bytes)) {
-            err = key + " must be a positive expert count or a <number>GB budget"; return false;
-        }
-        *reinterpret_cast<uint32_t *>(base + spec->off)  = experts;
-        *reinterpret_cast<uint64_t *>(base + spec->off2) = bytes;
-        return true;
-    }
-    }
-    return true;
-}
-
-// When acting as a distributed coordinator, block until the worker route
-// covers all layers (ds4_session_distributed_route_ready == 1) or the timeout
-// elapses. Returns an empty string on success, or an error message to return
-// to the client. No-op when not distributed.
-//
-// Takes the g_engine_mu lock by reference and RELEASES it during each poll
-// sleep. The wait can span up to g_route_timeout_sec seconds while workers
-// connect; holding g_engine_mu the whole time would block the Status/Health
-// readiness probes (they also lock g_engine_mu), making LocalAI's loader treat
-// a still-starting worker as hung.
-static std::string wait_route_ready(std::unique_lock<std::mutex> &lock) {
-    if (!g_distributed) return "";
-    char err[256] = {0};
-    const int deadline_polls = g_route_timeout_sec * 10; // 100ms per poll
-    for (int i = 0; i <= deadline_polls; ++i) {
-        int ready = ds4_session_distributed_route_ready(g_session, err, sizeof(err));
-        if (ready == 1) return "";
-        if (ready < 0) {
-            return std::string("ds4 distributed route error: ") +
-                   (err[0] ? err : "unknown");
-        }
-        // Release the lock while sleeping so Status/Health and other RPCs can
-        // interleave during worker startup.
-        lock.unlock();
-        struct timespec ts = {0, 100L * 1000L * 1000L}; // 100ms
-        nanosleep(&ts, nullptr);
-        lock.lock();
-        // A concurrent Free() may have torn down the engine while we slept.
-        if (!g_engine || !g_session) {
-            return "ds4: model unloaded while waiting for distributed route";
-        }
-    }
-    return "ds4 distributed route incomplete: workers not connected (layers uncovered)";
-}
-
-static void append_token_text(ds4_engine *engine, int token, std::string &out) {
-    size_t len = 0;
-    const char *text = ds4_token_text(engine, token, &len);
-    if (text && len > 0) out.append(text, len);
-}
-
-struct CollectCtx {
-    ds4_engine *engine;
-    std::string raw_buf;  // exact raw bytes for Reply.message
-    ds4cpp::DsmlParser parser;
-    backend::Reply *reply;
-    int tokens;
-
-    // Per-tool aggregation: accumulate ChatDelta tool_calls so we emit one
-    // delta with all calls, mirroring how vllm's non-streaming path returns.
-    struct Pending {
-        std::string id;
-        std::string name;
-        std::string args;
-    };
-    std::vector<Pending> pending;
-
-    std::string content_buf;
-    std::string reasoning_buf;
-};
-
-static void apply_events(CollectCtx *c, const std::vector<ds4cpp::ParserEvent> &events) {
-    for (const auto &e : events) {
-        switch (e.type) {
-        case ds4cpp::ParserEvent::CONTENT:
-            c->content_buf += e.text;
-            break;
-        case ds4cpp::ParserEvent::REASONING:
-            c->reasoning_buf += e.text;
-            break;
-        case ds4cpp::ParserEvent::TOOL_START:
-            if ((int)c->pending.size() <= e.index)
-                c->pending.resize(e.index + 1);
-            c->pending[e.index].id = e.tool_id;
-            c->pending[e.index].name = e.tool_name;
-            break;
-        case ds4cpp::ParserEvent::TOOL_ARGS:
-            if ((int)c->pending.size() > e.index)
-                c->pending[e.index].args += e.text;
-            break;
-        case ds4cpp::ParserEvent::TOOL_END:
-            // No-op for non-streaming: the final delta is emitted at the end.
-            break;
-        }
-    }
-}
-
-static void collect_emit(void *ud, int token) {
-    auto *c = static_cast<CollectCtx *>(ud);
-    if (token == ds4_token_eos(c->engine)) return;
-    size_t len = 0;
-    const char *text = ds4_token_text(c->engine, token, &len);
-    if (!text || len == 0) return;
-    std::string chunk(text, len);
-    c->raw_buf += chunk;
-    std::vector<ds4cpp::ParserEvent> events;
-    c->parser.Feed(chunk, events);
-    apply_events(c, events);
-    c->tokens++;
-}
-static void collect_done(void *) {}
-
-struct StreamCtx {
-    ds4_engine *engine;
-    ServerWriter<backend::Reply> *writer;
-    ds4cpp::DsmlParser parser;
-    int tokens;
-    bool aborted;
-    // Track which tool indices we've seen TOOL_START for, so subsequent
-    // ARGS deltas can elide the redundant id/name fields.
-    std::vector<bool> tool_started;
-};
-
-static void stream_emit(void *ud, int token) {
-    auto *s = static_cast<StreamCtx *>(ud);
-    if (s->aborted) return;
-    if (token == ds4_token_eos(s->engine)) return;
-    size_t len = 0;
-    const char *text = ds4_token_text(s->engine, token, &len);
-    if (!text || len == 0) return;
-    std::string chunk(text, len);
-    std::vector<ds4cpp::ParserEvent> events;
-    s->parser.Feed(chunk, events);
-    if (events.empty()) { s->tokens++; return; }
-
-    backend::Reply reply;
-    auto *delta = reply.add_chat_deltas();
-    bool any_field = false;
-    for (const auto &e : events) {
-        switch (e.type) {
-        case ds4cpp::ParserEvent::CONTENT:
-            delta->set_content(delta->content() + e.text);
-            any_field = true;
-            break;
-        case ds4cpp::ParserEvent::REASONING:
-            delta->set_reasoning_content(delta->reasoning_content() + e.text);
-            any_field = true;
-            break;
-        case ds4cpp::ParserEvent::TOOL_START: {
-            if ((int)s->tool_started.size() <= e.index)
-                s->tool_started.resize(e.index + 1, false);
-            s->tool_started[e.index] = true;
-            auto *tc = delta->add_tool_calls();
-            tc->set_index(e.index);
-            tc->set_id(e.tool_id);
-            tc->set_name(e.tool_name);
-            any_field = true;
-            break;
-        }
-        case ds4cpp::ParserEvent::TOOL_ARGS: {
-            auto *tc = delta->add_tool_calls();
-            tc->set_index(e.index);
-            tc->set_arguments(e.text);
-            any_field = true;
-            break;
-        }
-        case ds4cpp::ParserEvent::TOOL_END:
-            // No marker delta needed - the Go side closes the tool call on
-            // the final aggregator pass.
-            break;
-        }
-    }
-    reply.set_message(chunk);
-    reply.set_tokens(1);
-    if (any_field) {
-        if (!s->writer->Write(reply)) s->aborted = true;
-    }
-    s->tokens++;
-}
-static void stream_done(void *) {}
-
-// Per-thread RNG seed for ds4_session_sample. Initialized lazily from
-// system_clock; ds4 owns the random walk after that.
-static uint64_t *get_rng() {
-    static thread_local uint64_t seed = 0;
-    if (seed == 0) {
-        seed = static_cast<uint64_t>(
-            std::chrono::system_clock::now().time_since_epoch().count());
-        if (seed == 0) seed = 1;
-    }
-    return &seed;
-}
-
-struct SampleParams {
-    float temperature;
-    int top_k;
-    float top_p;
-    float min_p;
-};
-
-// Compute the effective sampling parameters for the next token, mirroring
-// ds4_server.c:7102-7115:
-//   - thinking mode enabled -> override (T=1, top_k=0, top_p=1, min_p=0)
-//   - inside DSML structural position (tool-call markers) -> force T=0
-//   - otherwise -> the request's user-supplied sampling settings
-// The parser argument carries state from tokens emitted so far; its
-// IsInDsmlStructural() predicts the next token's classification.
-static SampleParams compute_sample_params(const backend::PredictOptions *request,
-                                          const ds4cpp::DsmlParser &parser,
-                                          bool think_enabled);
-
-static ds4_think_mode parse_think_mode(const backend::PredictOptions *request) {
-    // Per the vllm backend convention, "enable_thinking" gates thinking on/off,
-    // and "reasoning_effort" picks the strength when on.
-    const auto &md = request->metadata();
-    auto et = md.find("enable_thinking");
-    bool enabled = true; // default ON per ds4-server
-    if (et != md.end()) enabled = (et->second == "true" || et->second == "1");
-    if (!enabled) return DS4_THINK_NONE;
-    auto re = md.find("reasoning_effort");
-    if (re != md.end() && (re->second == "max" || re->second == "xhigh"))
-        return DS4_THINK_MAX;
-    return DS4_THINK_HIGH;
-}
-
-static SampleParams compute_sample_params(const backend::PredictOptions *request,
-                                          const ds4cpp::DsmlParser &parser,
-                                          bool think_enabled) {
-    SampleParams p = {
-        request->temperature(),
-        request->topk(),
-        request->topp(),
-        request->minp(),
-    };
-    if (think_enabled) {
-        // Match ds4-server: thinking mode wants creativity in the reasoning
-        // pass and the trailing content, so the entire generation overrides
-        // sampling unless DSML structural bytes take over below.
-        p.temperature = 1.0f;
-        p.top_k = 0;
-        p.top_p = 1.0f;
-        p.min_p = 0.0f;
-    }
-    if (parser.IsInDsmlStructural()) {
-        // Tool-call structural bytes (tags, markers, headers) must parse
-        // cleanly. Force greedy regardless of user/thinking settings.
-        p.temperature = 0.0f;
-    }
-    return p;
-}
-
-// Build the text used to key the disk KV cache. This is a simplified
-// rendering (role-tagged message contents), not the exact prompt the model
-// sees: tools and tool_calls are not included. The cache keys on bytes, not tokens.
-static std::string render_prompt_text(const backend::PredictOptions *request) {
-    // Two-mode: either the raw prompt or the chat-template path. We mirror
-    // build_prompt's branching but accumulate text (not tokens) so we can
-    // SHA1 it for the cache key. ds4_session caches a tokens-indexed
-    // checkpoint, but the disk format keys on bytes per ds4-server's design.
-    if (!request->usetokenizertemplate() || request->messages_size() == 0) {
-        return request->prompt();
-    }
-    std::string out;
-    const std::string sys_role = "system";
-    for (const auto &m : request->messages()) {
-        if (m.role() == sys_role) { out += "[sys] " + m.content() + "\n"; break; }
-    }
-    for (const auto &m : request->messages()) {
-        if (m.role() == sys_role) continue;
-        out += "[" + m.role() + "] " + m.content() + "\n";
-    }
-    return out;
-}
-
-ds4cpp::KvCache g_kv_cache;
-
-// Try to recover prefill state for `rendered`. Returns the matched prefix length.
-static size_t maybe_load_cache(const std::string &rendered) {
-    if (!g_kv_cache.enabled() || !g_session) return 0;
-    return g_kv_cache.LoadLongestPrefix(g_session, rendered, g_ctx_size);
-}
-
-static void maybe_save_cache(const std::string &rendered) {
-    if (g_kv_cache.enabled() && g_session) {
-        g_kv_cache.Save(g_session, rendered, g_ctx_size);
-    }
-}
-
-static void build_prompt(ds4_engine *engine, const backend::PredictOptions *request,
-                         ds4_tokens *out) {
-    if (!request->usetokenizertemplate() || request->messages_size() == 0) {
-        ds4_tokenize_text(engine, request->prompt().c_str(), out);
-        return;
-    }
-    // Chat-template path: render via ds4's helpers.
-    ds4_chat_begin(engine, out);
-
-    ds4_think_mode think = parse_think_mode(request);
-
-    // ds4_encode_chat_prompt is convenient when there is exactly one
-    // system+user pair, but for arbitrary turn lists we use the granular
-    // append helpers. Pull the first system message (if any), then append
-    // every other message in order.
-    const std::string sys_role = "system";
-    std::string system_text;
-    for (const auto &m : request->messages()) {
-        if (m.role() == sys_role) { system_text = m.content(); break; }
-    }
-    // Inject the tools manifest into the system prompt when tools are present.
-    // ds4 was trained to emit DSML tool calls ONLY when this preamble is in
-    // the system message - without it, the model has no idea tools exist and
-    // the e2e tool-call test will fail. The renderer lives in dsml_renderer
-    // and is a verbatim port of ds4_server.c's append_tools_prompt_text.
-    std::string tools_manifest;
-    if (!request->tools().empty()) {
-        tools_manifest = ds4cpp::RenderToolsManifest(request->tools());
-    }
-    if (!system_text.empty() || !tools_manifest.empty()) {
-        std::string combined = system_text;
-        if (!tools_manifest.empty()) {
-            if (!combined.empty()) combined += "\n\n";
-            combined += tools_manifest;
-        }
-        ds4_chat_append_message(engine, out, "system", combined.c_str());
-    }
-    for (const auto &m : request->messages()) {
-        if (m.role() == sys_role) continue;
-        if (m.role() == "assistant" && !m.tool_calls().empty()) {
-            std::string combined = m.content();
-            combined += ds4cpp::RenderAssistantToolCalls(m.tool_calls());
-            ds4_chat_append_message(engine, out, "assistant", combined.c_str());
-        } else if (m.role() == "tool") {
-            std::string body = ds4cpp::RenderToolResult(m.tool_call_id(), m.content());
-            ds4_chat_append_message(engine, out, "user", body.c_str());
-        } else {
-            ds4_chat_append_message(engine, out, m.role().c_str(), m.content().c_str());
-        }
-    }
-    ds4_chat_append_assistant_prefix(engine, out, think);
-}
-
-class DS4Backend final : public backend::Backend::Service {
-public:
-    GStatus Health(ServerContext *, const backend::HealthMessage *,
-                  backend::Reply *reply) override {
-        reply->set_message(std::string("OK"));
-        return GStatus::OK;
-    }
-
-    GStatus Free(ServerContext *, const backend::HealthMessage *,
-                backend::Result *result) override {
-        std::lock_guard<std::mutex> lock(g_engine_mu);
-        if (g_session) { ds4_session_free(g_session); g_session = nullptr; }
-        if (g_engine)  { ds4_engine_close(g_engine);  g_engine  = nullptr; }
-        result->set_success(true);
-        return GStatus::OK;
-    }
-
-    GStatus LoadModel(ServerContext *, const backend::ModelOptions *request,
-                     backend::Result *result) override {
-        std::lock_guard<std::mutex> lock(g_engine_mu);
-
-        // Reset distributed state so a model swap (a second LoadModel without
-        // ds4_role) doesn't inherit a stale coordinator configuration.
-        g_distributed = false;
-        g_route_timeout_sec = 60;
-
-        if (g_engine) {
-            if (g_session) { ds4_session_free(g_session); g_session = nullptr; }
-            ds4_engine_close(g_engine);
-            g_engine = nullptr;
-        }
-
-        std::string model_path = request->modelfile();
-        if (model_path.empty()) model_path = request->model();
-        if (model_path.empty()) {
-            result->set_success(false);
-            result->set_message("ds4: ModelOptions.Model or .ModelFile must be set");
-            return GStatus::OK;
-        }
-
-        ds4_engine_options opt = {};
-        opt.model_path = model_path.c_str();
-        opt.n_threads = request->threads() > 0 ? request->threads() : 0;
-        opt.mtp_margin = 3.0f; // ds4 default; overridable via the mtp_margin option
-
-#if defined(DS4_NO_GPU)
-        opt.backend = DS4_BACKEND_CPU;
-#elif defined(__APPLE__)
-        opt.backend = DS4_BACKEND_METAL;
-#else
-        opt.backend = DS4_BACKEND_CUDA;
-#endif
-
-        // Stable storage for string-valued engine options. The engine reads
-        // these by pointer during ds4_engine_open, so the std::string backing
-        // store must outlive the call and not reallocate; reserve up front so
-        // push_back keeps every prior c_str() valid. Static + clear() reuses
-        // the buffer across LoadModel calls (the old engine is closed above).
-        static std::vector<std::string> s_opt_strings;
-        s_opt_strings.clear();
-        s_opt_strings.reserve(sizeof(kEngineOptSpecs) / sizeof(kEngineOptSpecs[0]));
-
-        // Directory of the main model, used to resolve relative path options.
-        std::string model_dir;
-        if (auto slash = model_path.find_last_of('/'); slash != std::string::npos) {
-            model_dir = model_path.substr(0, slash);
-        }
-
-        std::string ds4_role, ds4_layers, ds4_listen;
-        for (const auto &o : request->options()) {
-            auto [k, v] = split_option(o);
-            if (k == "kv_cache_dir") { g_kv_cache_dir = v; continue; }
-            else if (k == "ds4_role") { ds4_role = v; continue; }
-            else if (k == "ds4_layers") { ds4_layers = v; continue; }
-            else if (k == "ds4_listen") { ds4_listen = v; continue; }
-            else if (k == "ds4_route_timeout") {
-                if (!parse_positive_int(v, &g_route_timeout_sec)) {
-                    result->set_success(false);
-                    result->set_message("ds4: ds4_route_timeout must be a positive integer");
-                    return GStatus::OK;
-                }
-                continue;
-            }
-            std::string err;
-            if (!apply_engine_option(&opt, k, v, model_dir, s_opt_strings, err)) {
-                result->set_success(false);
-                result->set_message("ds4: " + err);
-                return GStatus::OK;
-            }
-        }
-
-        g_kv_cache.SetDir(g_kv_cache_dir);
-
-        // Coordinator wiring. 'ds4_role:coordinator' enables layer-split
-        // distributed inference: this process listens on ds4_listen and owns
-        // the ds4_layers slice; workers dial in (see `local-ai worker
-        // ds4-distributed`). Absent ds4_role => unchanged single-node path.
-        // Must be static: opt.distributed.listen_host is a const char* the
-        // engine retains past this call, so it cannot point at a local that
-        // goes out of scope (otherwise a future "simplify to local" refactor
-        // reintroduces a dangling pointer).
-        static std::string s_listen_host;
-        if (ds4_role == "coordinator") {
-            if (ds4_layers.empty() || ds4_listen.empty()) {
-                result->set_success(false);
-                result->set_message("ds4: ds4_role:coordinator requires ds4_layers and ds4_listen");
-                return GStatus::OK;
-            }
-            // host:port for IPv4/hostname; IPv6 literals are unsupported (the
-            // first colon would split inside the address).
-            auto host_port = split_option(ds4_listen); // "host:port" -> {host, port}
-            if (host_port.second.empty()) {
-                result->set_success(false);
-                result->set_message("ds4: ds4_listen must be host:port");
-                return GStatus::OK;
-            }
-            int listen_port = 0;
-            if (!parse_positive_int(host_port.second, &listen_port)) {
-                result->set_success(false);
-                result->set_message("ds4: ds4_listen port must be a positive integer");
-                return GStatus::OK;
-            }
-            ds4_distributed_layers layers = {};
-            if (!parse_layers_spec(ds4_layers, &layers)) {
-                result->set_success(false);
-                result->set_message("ds4: invalid ds4_layers (want START:END or START:output)");
-                return GStatus::OK;
-            }
-            s_listen_host = host_port.first;
-            opt.distributed.role = DS4_DISTRIBUTED_COORDINATOR;
-            opt.distributed.layers = layers;
-            opt.distributed.listen_host = s_listen_host.c_str();
-            opt.distributed.listen_port = listen_port;
-            g_distributed = true;
-        }
-
-        int rc = ds4_engine_open(&g_engine, &opt);
-        if (rc != 0 || !g_engine) {
-            result->set_success(false);
-            result->set_message("ds4_engine_open failed (rc=" + std::to_string(rc) + ")");
-            return GStatus::OK;
-        }
-
-        g_ctx_size = request->contextsize() > 0 ? request->contextsize() : 32768;
-        rc = ds4_session_create(&g_session, g_engine, g_ctx_size);
-        if (rc != 0 || !g_session) {
-            ds4_engine_close(g_engine);
-            g_engine = nullptr;
-            result->set_success(false);
-            result->set_message("ds4_session_create failed (rc=" + std::to_string(rc) + ")");
-            return GStatus::OK;
-        }
-
-        result->set_success(true);
-        result->set_message("loaded " + model_path);
-        return GStatus::OK;
-    }
-
-    GStatus TokenizeString(ServerContext *, const backend::PredictOptions *request,
-                          backend::TokenizationResponse *response) override {
-        std::lock_guard<std::mutex> lock(g_engine_mu);
-        if (!g_engine) return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
-        ds4_tokens out = {};
-        ds4_tokenize_text(g_engine, request->prompt().c_str(), &out);
-        for (int i = 0; i < out.len; ++i) response->add_tokens(out.v[i]);
-        response->set_length(out.len);
-        ds4_tokens_free(&out);
-        return GStatus::OK;
-    }
-
-    GStatus Predict(ServerContext *, const backend::PredictOptions *request,
-                   backend::Reply *reply) override {
-        std::unique_lock<std::mutex> lock(g_engine_mu);
-        if (!g_engine || !g_session) {
-            return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
-        }
-        if (std::string route_err = wait_route_ready(lock); !route_err.empty()) {
-            return GStatus(StatusCode::UNAVAILABLE, route_err);
-        }
-        ds4_tokens prompt = {};
-        build_prompt(g_engine, request, &prompt);
-        int n_predict = request->tokens() > 0 ? request->tokens() : 256;
-
-        CollectCtx collect = {g_engine, "", {}, reply, 0, {}, "", ""};
-        std::string cache_key = render_prompt_text(request);
-        size_t cache_hit = maybe_load_cache(cache_key);
-        (void)cache_hit; // future: skip prompt prefix if hit covers full prompt
-
-        // Manual generation loop on g_session. When MTP speculative weights
-        // were loaded (LoadModel option 'mtp_path:') and sampling is greedy,
-        // ds4_session_eval_speculative_argmax may accept N>1 tokens per outer
-        // iteration. Otherwise each token is picked (argmax or sampled) and
-        // evaluated singly. Either way g_session advances so the disk KV cache
-        // picks up a real checkpoint after the call (see maybe_save_cache below).
-        char err[256] = {0};
-        int rc = ds4_session_sync(g_session, &prompt, err, sizeof(err));
-        int prompt_len = prompt.len;
-        ds4_tokens_free(&prompt);
-        if (rc == 0) {
-            const int eos = ds4_token_eos(g_engine);
-            const int draft_max = ds4_engine_mtp_draft_tokens(g_engine);
-            const bool think_enabled = ds4_think_mode_enabled(parse_think_mode(request));
-            int produced = 0;
-            while (produced < n_predict) {
-                SampleParams sp = compute_sample_params(request, collect.parser, think_enabled);
-                int first;
-                if (sp.temperature <= 0.0f) {
-                    first = ds4_session_argmax(g_session);
-                } else {
-                    first = ds4_session_sample(g_session,
-                                               sp.temperature, sp.top_k,
-                                               sp.top_p, sp.min_p, get_rng());
-                }
-                if (first == eos) break;
-                // MTP only when sampling is greedy (ds4-server gate).
-                if (draft_max > 0 && sp.temperature <= 0.0f) {
-                    constexpr int kAcceptedMax = 8;
-                    int accepted[kAcceptedMax];
-                    int cap = std::min(kAcceptedMax, draft_max + 1);
-                    int n = ds4_session_eval_speculative_argmax(
-                        g_session, first, draft_max, eos,
-                        accepted, cap, err, sizeof(err));
-                    if (n < 0) { rc = -1; break; }
-                    bool stop = false;
-                    for (int j = 0; j < n; ++j) {
-                        if (accepted[j] == eos) { stop = true; break; }
-                        collect_emit(&collect, accepted[j]);
-                        if (++produced >= n_predict) { stop = true; break; }
-                    }
-                    if (stop) break;
-                } else {
-                    collect_emit(&collect, first);
-                    if (++produced >= n_predict) break;
-                    rc = ds4_session_eval(g_session, first, err, sizeof(err));
-                    if (rc != 0) break;
-                }
-            }
-            collect_done(&collect);
-        }
-        maybe_save_cache(cache_key);
-
-        // Flush any buffered parser state.
-        std::vector<ds4cpp::ParserEvent> events;
-        collect.parser.Flush(events);
-        apply_events(&collect, events);
-
-        if (rc != 0) {
-            return GStatus(StatusCode::INTERNAL,
-                          std::string("ds4 generation failed: ") + err);
-        }
-
-        // Emit one ChatDelta with content/reasoning/tool_calls.
-        auto *delta = reply->add_chat_deltas();
-        delta->set_content(collect.content_buf);
-        delta->set_reasoning_content(collect.reasoning_buf);
-        for (size_t i = 0; i < collect.pending.size(); ++i) {
-            auto *tc = delta->add_tool_calls();
-            tc->set_index(static_cast<int32_t>(i));
-            tc->set_id(collect.pending[i].id);
-            tc->set_name(collect.pending[i].name);
-            tc->set_arguments(collect.pending[i].args);
-        }
-
-        reply->set_message(collect.raw_buf);
-        reply->set_tokens(collect.tokens);
-        reply->set_prompt_tokens(prompt_len);
-        return GStatus::OK;
-    }
-
-    GStatus PredictStream(ServerContext *, const backend::PredictOptions *request,
-                         ServerWriter<backend::Reply> *writer) override {
-        std::unique_lock<std::mutex> lock(g_engine_mu);
-        if (!g_engine || !g_session) {
-            return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
-        }
-        if (std::string route_err = wait_route_ready(lock); !route_err.empty()) {
-            return GStatus(StatusCode::UNAVAILABLE, route_err);
-        }
-        ds4_tokens prompt = {};
-        build_prompt(g_engine, request, &prompt);
-        int n_predict = request->tokens() > 0 ? request->tokens() : 256;
-
-        StreamCtx s = {g_engine, writer, {}, 0, false, {}};
-        std::string cache_key = render_prompt_text(request);
-        size_t cache_hit = maybe_load_cache(cache_key);
-        (void)cache_hit;
-
-        // Manual loop on g_session - see Predict() above for the rationale.
-        // MTP speculative path used when ds4_engine_mtp_draft_tokens > 0 and sampling is greedy.
-        char err[256] = {0};
-        int rc = ds4_session_sync(g_session, &prompt, err, sizeof(err));
-        ds4_tokens_free(&prompt);
-        if (rc == 0) {
-            const int eos = ds4_token_eos(g_engine);
-            const int draft_max = ds4_engine_mtp_draft_tokens(g_engine);
-            const bool think_enabled = ds4_think_mode_enabled(parse_think_mode(request));
-            int produced = 0;
-            while (produced < n_predict && !s.aborted) {
-                SampleParams sp = compute_sample_params(request, s.parser, think_enabled);
-                int first;
-                if (sp.temperature <= 0.0f) {
-                    first = ds4_session_argmax(g_session);
-                } else {
-                    first = ds4_session_sample(g_session,
-                                               sp.temperature, sp.top_k,
-                                               sp.top_p, sp.min_p, get_rng());
-                }
-                if (first == eos) break;
-                if (draft_max > 0 && sp.temperature <= 0.0f) {
-                    constexpr int kAcceptedMax = 8;
-                    int accepted[kAcceptedMax];
-                    int cap = std::min(kAcceptedMax, draft_max + 1);
-                    int n = ds4_session_eval_speculative_argmax(
-                        g_session, first, draft_max, eos,
-                        accepted, cap, err, sizeof(err));
-                    if (n < 0) { rc = -1; break; }
-                    bool stop = false;
-                    for (int j = 0; j < n; ++j) {
-                        if (accepted[j] == eos) { stop = true; break; }
-                        stream_emit(&s, accepted[j]);
-                        if (s.aborted) { stop = true; break; }
-                        if (++produced >= n_predict) { stop = true; break; }
-                    }
-                    if (stop) break;
-                } else {
-                    stream_emit(&s, first);
-                    if (s.aborted || ++produced >= n_predict) break;
-                    rc = ds4_session_eval(g_session, first, err, sizeof(err));
-                    if (rc != 0) break;
-                }
-            }
-            stream_done(&s);
-        }
-        maybe_save_cache(cache_key);
-
-        // Flush parser state.
-        std::vector<ds4cpp::ParserEvent> events;
-        s.parser.Flush(events);
-        if (!events.empty() && !s.aborted) {
-            backend::Reply reply;
-            auto *delta = reply.add_chat_deltas();
-            for (const auto &e : events) {
-                if (e.type == ds4cpp::ParserEvent::CONTENT) {
-                    delta->set_content(delta->content() + e.text);
-                } else if (e.type == ds4cpp::ParserEvent::REASONING) {
-                    delta->set_reasoning_content(delta->reasoning_content() + e.text);
-                }
-            }
-            s.writer->Write(reply);
-        }
-
-        if (rc != 0 && !s.aborted) {
-            return GStatus(StatusCode::INTERNAL,
-                          std::string("ds4 generation failed: ") + err);
-        }
-        return GStatus::OK;
-    }
-
-    GStatus Status(ServerContext *, const backend::HealthMessage *,
-                  backend::StatusResponse *response) override {
-        std::lock_guard<std::mutex> lock(g_engine_mu);
-        response->set_state(g_engine ? backend::StatusResponse::READY
-                                     : backend::StatusResponse::UNINITIALIZED);
-        return GStatus::OK;
-    }
-};
-
-void RunServer(const std::string &addr) {
-    DS4Backend service;
-    grpc::EnableDefaultHealthCheckService(true);
-    grpc::reflection::InitProtoReflectionServerBuilderPlugin();
-
-    ServerBuilder builder;
-    builder.AddListeningPort(addr, grpc::InsecureServerCredentials());
-    builder.RegisterService(&service);
-    builder.SetMaxReceiveMessageSize(64 * 1024 * 1024);
-    builder.SetMaxSendMessageSize(64 * 1024 * 1024);
-
-    std::unique_ptr<Server> server(builder.BuildAndStart());
-    if (!server) {
-        std::cerr << "ds4 grpc-server: failed to bind " << addr << "\n";
-        std::exit(1);
-    }
-    g_server = server.get();
-    std::cerr << "ds4 grpc-server listening on " << addr << "\n";
-    server->Wait();
-}
-
-void signal_handler(int) {
-    if (auto *srv = g_server.load()) {
-        srv->Shutdown(std::chrono::system_clock::now() +
-                      std::chrono::seconds(3));
-    }
-}
-
-} // namespace
-
-int main(int argc, char *argv[]) {
-    std::string addr = "127.0.0.1:50051";
-    for (int i = 1; i < argc; ++i) {
-        std::string a = argv[i];
-        const std::string addr_flag = "--addr=";
-        if (a.rfind(addr_flag, 0) == 0) addr = a.substr(addr_flag.size());
-        else if (a == "--addr" && i + 1 < argc) addr = argv[++i];
-        else if (a == "--help" || a == "-h") {
-            std::cout << "Usage: grpc-server --addr=HOST:PORT\n";
-            return 0;
-        }
-    }
-    std::signal(SIGINT, signal_handler);
-    std::signal(SIGTERM, signal_handler);
-    RunServer(addr);
-    return 0;
-}
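Note on the "stable storage" comment in LoadModel above: the engine options hold borrowed const char* values, so the backing std::string objects must not move while the engine reads them. The following is a minimal sketch of that pattern, not the backend's actual code. DemoOptions, stash_string and fill_demo_options are hypothetical names used only for illustration; the one fact relied on is the standard guarantee that push_back does not reallocate while size() < capacity().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a C options struct that keeps borrowed pointers.
struct DemoOptions {
    const char *mtp_path = nullptr;
    const char *extra_path = nullptr;
};

// Stores `value` in `storage` and returns a pointer that stays valid as long
// as `storage` is neither reallocated nor cleared. Returns nullptr when the
// reserved capacity is exhausted, instead of risking a reallocation that
// would invalidate earlier c_str() pointers (short strings live inside the
// string object itself, so even a move would change their address).
inline const char *stash_string(std::vector<std::string> &storage,
                                const std::string &value) {
    if (storage.size() == storage.capacity()) return nullptr;
    storage.push_back(value);
    return storage.back().c_str();
}

// Fills the demo struct from two strings using pre-reserved storage.
inline bool fill_demo_options(std::vector<std::string> &storage,
                              DemoOptions &opt,
                              const std::string &mtp,
                              const std::string &extra) {
    storage.clear();
    storage.reserve(2); // one slot per string-valued option
    opt.mtp_path = stash_string(storage, mtp);
    opt.extra_path = stash_string(storage, extra);
    return opt.mtp_path != nullptr && opt.extra_path != nullptr;
}

// Usage: both pointers still refer to the vector's elements afterwards.
inline void example_stable_storage() {
    std::vector<std::string> storage;
    DemoOptions opt;
    bool ok = fill_demo_options(storage, opt, "a", "/models/mtp.gguf");
    assert(ok && opt.mtp_path == storage[0].c_str());
}
```

Refusing to store past the reserved capacity is a design choice of this sketch; the original instead sizes the reservation from its option table so that the limit is never reached.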
--- a/backend/cpp/ds4/kv_cache.cpp
+++ b/backend/cpp/ds4/kv_cache.cpp
@@ -1,205 +0,0 @@
-#include "kv_cache.h"
-
-#include <cerrno>
-#include <cstdio>
-#include <cstring>
-#include <dirent.h>
-#include <fstream>
-#include <sys/stat.h>
-#include <vector>
-
-namespace ds4cpp {
-
-namespace {
-
-// Minimal SHA-1 (public-domain reference implementation); used only in this file.
-struct Sha1 {
-    uint32_t h[5];
-    uint64_t bits;
-    uint8_t block[64];
-    size_t used;
-    Sha1() { h[0]=0x67452301; h[1]=0xEFCDAB89; h[2]=0x98BADCFE; h[3]=0x10325476; h[4]=0xC3D2E1F0; bits=0; used=0; }
-    static uint32_t rol(uint32_t x, int n){ return (x<<n)|(x>>(32-n)); }
-    void transform(const uint8_t *b) {
-        uint32_t w[80];
-        for (int i=0;i<16;i++) w[i] = (uint32_t)b[i*4]<<24 | (uint32_t)b[i*4+1]<<16 | (uint32_t)b[i*4+2]<<8 | b[i*4+3];
-        for (int i=16;i<80;i++) w[i] = rol(w[i-3]^w[i-8]^w[i-14]^w[i-16], 1);
-        uint32_t a=h[0],bb=h[1],c=h[2],d=h[3],e=h[4];
-        for (int i=0;i<80;i++) {
-            uint32_t f,k;
-            if (i<20)      { f=(bb&c)|((~bb)&d); k=0x5A827999; }
-            else if (i<40) { f=bb^c^d;            k=0x6ED9EBA1; }
-            else if (i<60) { f=(bb&c)|(bb&d)|(c&d); k=0x8F1BBCDC; }
-            else           { f=bb^c^d;            k=0xCA62C1D6; }
-            uint32_t t = rol(a,5)+f+e+k+w[i];
-            e=d; d=c; c=rol(bb,30); bb=a; a=t;
-        }
-        h[0]+=a; h[1]+=bb; h[2]+=c; h[3]+=d; h[4]+=e;
-    }
-    void update(const void *p, size_t n) {
-        const uint8_t *bp = (const uint8_t*)p;
-        bits += (uint64_t)n*8;
-        while (n) {
-            size_t take = 64-used;
-            if (take>n) take=n;
-            std::memcpy(block+used, bp, take);
-            used += take; bp += take; n -= take;
-            if (used == 64) { transform(block); used = 0; }
-        }
-    }
-    void final(uint8_t out[20]) {
-        uint8_t pad[64] = {0x80};
-        size_t padlen = (used < 56) ? (56-used) : (120-used);
-        uint64_t lb = bits;
-        uint8_t len[8];
-        for (int i=0;i<8;i++) len[7-i] = (uint8_t)(lb >> (i*8));
-        update(pad, padlen);
-        update(len, 8);
-        for (int i=0;i<5;i++) {
-            out[i*4]   = h[i]>>24;
-            out[i*4+1] = h[i]>>16;
-            out[i*4+2] = h[i]>>8;
-            out[i*4+3] = h[i];
-        }
-    }
-};
-
-std::string mkdir_p(const std::string &d) {
-    if (d.empty()) return d;
-    struct stat st{};
-    if (stat(d.c_str(), &st) == 0) return d;
-    mkdir(d.c_str(), 0755); // single level only: parent directories must already exist
-    return d;
-}
-
-bool file_exists(const std::string &p) {
-    struct stat st{};
-    return stat(p.c_str(), &st) == 0;
-}
-
-} // namespace
-
-std::string Sha1Hex(const void *data, size_t len) {
-    Sha1 s;
-    s.update(data, len);
-    uint8_t out[20];
-    s.final(out);
-    char hex[41];
-    for (int i = 0; i < 20; ++i) std::snprintf(hex + i*2, 3, "%02x", out[i]);
-    hex[40] = 0;
-    return std::string(hex);
-}
-
-KvCache::KvCache() = default;
-
-void KvCache::SetDir(const std::string &dir) {
-    dir_ = dir;
-    if (!dir_.empty()) {
-        mkdir_p(dir_);
-        std::fprintf(stderr, "ds4 KvCache: enabled at %s\n", dir_.c_str());
-    } else {
-        std::fprintf(stderr, "ds4 KvCache: disabled (no dir set)\n");
-    }
-}
-
-std::string KvCache::Path(const std::string &rendered_text) const {
-    if (dir_.empty()) return "";
-    return dir_ + "/" + Sha1Hex(rendered_text.data(), rendered_text.size()) + ".kv";
-}
-
-size_t KvCache::LoadLongestPrefix(ds4_session *session,
-                                  const std::string &rendered_text,
-                                  int ctx_size) {
-    if (dir_.empty() || !session) return 0;
-    // Strategy: enumerate all .kv files in dir, read their stored prefix
-    // header, pick the longest one that is also a prefix of rendered_text.
-    DIR *d = opendir(dir_.c_str());
-    if (!d) return 0;
-    struct dirent *de;
-    size_t best_len = 0;
-    std::string best_path;
-    while ((de = readdir(d)) != nullptr) {
-        std::string name = de->d_name;
-        if (name.size() < 4 || name.substr(name.size()-3) != ".kv") continue;
-        std::string path = dir_ + "/" + name;
-        std::ifstream f(path, std::ios::binary);
-        if (!f) continue;
-        char magic[4]; f.read(magic, 4);
-        if (f.gcount() != 4 || std::memcmp(magic, "DS4G", 4) != 0) continue;
-        uint32_t version=0, file_ctx=0, prefix_len=0;
-        f.read((char*)&version, 4); f.read((char*)&file_ctx, 4); f.read((char*)&prefix_len, 4);
-        if (version != 1) continue;
-        if ((int)file_ctx != ctx_size) continue;
-        if (prefix_len > rendered_text.size()) continue;
-        std::vector<char> prefix(prefix_len);
-        f.read(prefix.data(), prefix_len);
-        if (std::memcmp(prefix.data(), rendered_text.data(), prefix_len) != 0) continue;
-        if (prefix_len > best_len) {
-            best_len = prefix_len;
-            best_path = path;
-        }
-    }
-    closedir(d);
-    if (best_len == 0) return 0;
-
-    // Load best_path's payload into session.
-    std::ifstream f(best_path, std::ios::binary);
-    char magic[4]; f.read(magic, 4);
-    uint32_t version, file_ctx, prefix_len;
-    f.read((char*)&version, 4); f.read((char*)&file_ctx, 4); f.read((char*)&prefix_len, 4);
-    f.seekg(prefix_len, std::ios::cur);
-    uint64_t payload_bytes = 0;
-    f.read((char*)&payload_bytes, 8);
-    // ds4_session_load_payload reads from a FILE*; reopen via fopen.
-    FILE *fp = std::fopen(best_path.c_str(), "rb");
-    if (!fp) return 0;
-    // Seek past header + prefix + payload_bytes field.
-    std::fseek(fp, 4 + 4 + 4 + 4 + prefix_len + 8, SEEK_SET);
-    char errbuf[256] = {0};
-    int rc = ds4_session_load_payload(session, fp, payload_bytes, errbuf, sizeof(errbuf));
-    std::fclose(fp);
-    if (rc != 0) return 0;
-    return best_len;
-}
-
-void KvCache::Save(ds4_session *session, const std::string &rendered_text, int ctx_size) {
-    if (dir_.empty()) {
-        std::fprintf(stderr, "ds4 KvCache::Save: skipped (dir empty)\n");
-        return;
-    }
-    if (!session) {
-        std::fprintf(stderr, "ds4 KvCache::Save: skipped (session null)\n");
-        return;
-    }
-    std::string path = Path(rendered_text);
-    uint64_t payload_bytes = ds4_session_payload_bytes(session);
-    std::fprintf(stderr, "ds4 KvCache::Save: path=%s payload_bytes=%llu prefix_len=%zu\n",
-                 path.c_str(), (unsigned long long)payload_bytes, rendered_text.size());
-    FILE *fp = std::fopen(path.c_str(), "wb");
-    if (!fp) {
-        std::fprintf(stderr, "ds4 KvCache::Save: fopen failed: %s\n", std::strerror(errno));
-        return;
-    }
-    char magic[4] = {'D','S','4','G'};
-    uint32_t version = 1;
-    uint32_t ctx = static_cast<uint32_t>(ctx_size);
-    uint32_t prefix_len = static_cast<uint32_t>(rendered_text.size());
-    std::fwrite(magic, 4, 1, fp);
-    std::fwrite(&version, 4, 1, fp);
-    std::fwrite(&ctx, 4, 1, fp);
-    std::fwrite(&prefix_len, 4, 1, fp);
-    std::fwrite(rendered_text.data(), prefix_len, 1, fp);
-    std::fwrite(&payload_bytes, 8, 1, fp);
-    char errbuf[256] = {0};
-    int rc = ds4_session_save_payload(session, fp, errbuf, sizeof(errbuf));
-    std::fclose(fp);
-    if (rc != 0) {
-        std::fprintf(stderr, "ds4 KvCache::Save: ds4_session_save_payload rc=%d err=%s; removing %s\n",
-                     rc, errbuf, path.c_str());
-        std::remove(path.c_str());
-    } else {
-        std::fprintf(stderr, "ds4 KvCache::Save: wrote %s ok\n", path.c_str());
-    }
-}
-
-} // namespace ds4cpp
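Note on LoadLongestPrefix above: the selection rule is "among all stored prefixes, take the longest one that is a prefix of the rendered text". The sketch below isolates that rule on in-memory strings, assuming the prefixes have already been read from disk. longest_cached_prefix is a hypothetical helper, not part of the removed file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the index of the longest entry in `cached` that is a prefix of
// `text`, or -1 when none matches. Empty entries never count as a hit,
// mirroring the "best_len == 0 means miss" convention in the code above.
inline int longest_cached_prefix(const std::vector<std::string> &cached,
                                 const std::string &text) {
    int best = -1;
    std::size_t best_len = 0;
    for (std::size_t i = 0; i < cached.size(); ++i) {
        const std::string &p = cached[i];
        if (p.size() > text.size()) continue;           // cannot be a prefix
        if (text.compare(0, p.size(), p) != 0) continue; // bytes differ
        if (p.size() > best_len) {
            best_len = p.size();
            best = static_cast<int>(i);
        }
    }
    return best;
}

// Usage: "hello wor" beats "hello" because it is the longer matching prefix.
inline void example_longest_prefix() {
    std::vector<std::string> cached = {"hello", "hello wor", "help"};
    assert(longest_cached_prefix(cached, "hello world") == 1);
}
```

The scan is linear in the number of cached entries, like the directory walk it models; an index keyed by prefix would only matter for much larger caches.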
--- a/backend/cpp/ds4/kv_cache.h
+++ b/backend/cpp/ds4/kv_cache.h
@@ -1,44 +0,0 @@
-#pragma once
-#include <string>
-extern "C" {
-#include "ds4.h"
-}
-
-namespace ds4cpp {
-
-// Disk-backed KV cache for ds4 sessions. Keyed by SHA1(rendered prompt prefix).
-// Format (our own, NOT bit-compatible with ds4-server's KVC files - interop
-// is a follow-up plan):
-//
-//   "DS4G" (4 bytes magic) + u32 version=1 + u32 ctx_size +
-//   u32 prefix_text_len + prefix_text + u64 payload_bytes + payload
-class KvCache {
-public:
-    KvCache(); // disabled (dir empty)
-
-    // Set the cache directory. Empty disables.
-    void SetDir(const std::string &dir);
-
-    // Returns the cache file path for a given rendered text prefix.
-    std::string Path(const std::string &rendered_text) const;
-
-    // Look up the longest cached prefix that is also a prefix of
-    // `rendered_text`. Loads it into `session` if found. Returns the
-    // matched prefix length in bytes (0 if no hit).
-    size_t LoadLongestPrefix(ds4_session *session,
-                             const std::string &rendered_text,
-                             int ctx_size);
-
-    // Save the current session, associated with this rendered text prefix.
-    void Save(ds4_session *session, const std::string &rendered_text, int ctx_size);
-
-    bool enabled() const { return !dir_.empty(); }
-
-private:
-    std::string dir_;
-};
-
-// Compute SHA1 of arbitrary bytes; returns 40-char hex.
-std::string Sha1Hex(const void *data, size_t len);
-
-} // namespace ds4cpp
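Note on the file format documented in kv_cache.h above ("DS4G" + u32 version + u32 ctx_size + u32 prefix_text_len + prefix_text + u64 payload_bytes + payload): the following sketch encodes and decodes just the header part into a std::string buffer. It assumes native byte order, as the raw fwrite calls in kv_cache.cpp do; CacheHeader, encode_header and decode_header are hypothetical names, and the bounds checks are an addition of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

struct CacheHeader {
    uint32_t version = 0;
    uint32_t ctx_size = 0;
    std::string prefix;
    uint64_t payload_bytes = 0;
};

// Appends the raw (native-endian) bytes of `v` to `out`.
template <typename T>
inline void put_raw(std::string &out, T v) {
    char buf[sizeof(T)];
    std::memcpy(buf, &v, sizeof(T));
    out.append(buf, sizeof(T));
}

// Reads a raw value at `pos`; fails instead of reading past the end.
template <typename T>
inline bool get_raw(const std::string &in, std::size_t &pos, T &v) {
    if (in.size() - pos < sizeof(T)) return false; // pos <= in.size() holds
    std::memcpy(&v, in.data() + pos, sizeof(T));
    pos += sizeof(T);
    return true;
}

inline std::string encode_header(const CacheHeader &h) {
    std::string out("DS4G", 4);
    put_raw<uint32_t>(out, h.version);
    put_raw<uint32_t>(out, h.ctx_size);
    put_raw<uint32_t>(out, static_cast<uint32_t>(h.prefix.size()));
    out += h.prefix;
    put_raw<uint64_t>(out, h.payload_bytes);
    return out;
}

// On success, `payload_offset` is the byte offset where the payload starts.
inline bool decode_header(const std::string &in, CacheHeader &h,
                          std::size_t &payload_offset) {
    if (in.size() < 4 || std::memcmp(in.data(), "DS4G", 4) != 0) return false;
    std::size_t pos = 4;
    uint32_t prefix_len = 0;
    if (!get_raw(in, pos, h.version) || !get_raw(in, pos, h.ctx_size) ||
        !get_raw(in, pos, prefix_len)) return false;
    if (in.size() - pos < prefix_len) return false; // truncated prefix
    h.prefix.assign(in, pos, prefix_len);
    pos += prefix_len;
    if (!get_raw(in, pos, h.payload_bytes)) return false;
    payload_offset = pos;
    return true;
}
```

For a prefix of n bytes the payload starts at offset 4 + 4 + 4 + 4 + n + 8, which is the same arithmetic as the fseek in LoadLongestPrefix.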
--- a/backend/cpp/ds4/package.sh
+++ b/backend/cpp/ds4/package.sh
@@ -1,40 +0,0 @@
-#!/bin/bash
-set -e
-CURDIR=$(dirname "$(realpath "$0")")
-REPO_ROOT="${CURDIR}/../../.."
-
-mkdir -p "$CURDIR/package/lib"
-cp -avf "$CURDIR/grpc-server" "$CURDIR/package/"
-cp -avf "$CURDIR/ds4-worker"  "$CURDIR/package/"
-cp -rfv "$CURDIR/run.sh"      "$CURDIR/package/"
-
-UNAME_S=$(uname -s)
-if [ "$UNAME_S" = "Darwin" ]; then
-    # Darwin: bundle dylibs via otool -L (handled by scripts/build/ds4-darwin.sh).
-    echo "package.sh: Darwin handled by ds4-darwin.sh"
-    exit 0
-fi
-
-if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
-    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
-    LIBDIR=/lib/x86_64-linux-gnu
-elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
-    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
-    LIBDIR=/lib/aarch64-linux-gnu
-else
-    echo "package.sh: unknown architecture" >&2; exit 1
-fi
-
-for lib in libc.so.6 libgcc_s.so.1 libstdc++.so.6 libm.so.6 libgomp.so.1 \
-           libdl.so.2 librt.so.1 libpthread.so.0; do
-    cp -arfLv "$LIBDIR/$lib" "$CURDIR/package/lib/$lib"
-done
-
-GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
-if [ -f "$GPU_LIB_SCRIPT" ]; then
-    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
-    package_gpu_libs
-fi
-
-echo "ds4 package contents:"
-ls -lah "$CURDIR/package/" "$CURDIR/package/lib/"
--- a/backend/cpp/ds4/run.sh
+++ b/backend/cpp/ds4/run.sh
@@ -1,9 +0,0 @@
-#!/bin/bash
-# Entry point for the ds4 backend image / BACKEND_BINARY mode.
-set -e
-CURDIR=$(dirname "$(realpath "$0")")
-export LD_LIBRARY_PATH="$CURDIR/lib:$LD_LIBRARY_PATH"
-if [ -f "$CURDIR/lib/ld.so" ]; then
-    exec "$CURDIR/lib/ld.so" "$CURDIR/grpc-server" "$@"
-fi
-exec "$CURDIR/grpc-server" "$@"
--- a/backend/cpp/ds4/worker_main.c
+++ b/backend/cpp/ds4/worker_main.c
@@ -1,126 +0,0 @@
-// ds4-worker: standalone distributed worker for the LocalAI ds4 backend.
-//
-// A ds4 distributed worker owns a slice of the model's transformer layers,
-// dials the coordinator, and serves activations for its slice. It does NOT
-// speak backend.proto - it speaks ds4's own TCP transport via ds4_dist_run().
-// This binary is intentionally minimal (no HTTP/web/kvstore/linenoise): it
-// only needs the engine objects + ds4_distributed.o, which the backend already
-// builds. It is launched by `local-ai worker ds4-distributed`.
-//
-// Usage:
-//   ds4-worker --role worker --model <gguf> --layers 20:output \
-//              --coordinator <host> <port> [--cpu|--cuda|--metal] [-c CTX] [-t N]
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <string.h>
-#include <signal.h>
-#include <limits.h>
-
-#include "ds4.h"
-#include "ds4_distributed.h"
-
-static const char *need_arg(int *i, int argc, char **argv, const char *flag) {
-    if (*i + 1 >= argc) {
-        fprintf(stderr, "ds4-worker: missing value for %s\n", flag);
-        exit(2);
-    }
-    return argv[++(*i)];
-}
-
-static int parse_int_arg(const char *s, const char *flag) {
-    char *end = NULL;
-    long v = strtol(s, &end, 10);
-    if (!s[0] || *end || v <= 0 || v > INT_MAX) {
-        fprintf(stderr, "ds4-worker: invalid value for %s: %s\n", flag, s);
-        exit(2);
-    }
-    return (int)v;
-}
-
-static ds4_backend default_backend(void) {
-#if defined(DS4_NO_GPU)
-    return DS4_BACKEND_CPU;
-#elif defined(__APPLE__)
-    return DS4_BACKEND_METAL;
-#else
-    return DS4_BACKEND_CUDA;
-#endif
-}
-
-int main(int argc, char **argv) {
-    signal(SIGPIPE, SIG_IGN);
-
-    ds4_engine_options opt = {0};
-    opt.backend = default_backend();
-    int ctx_size = 32768;
-
-    for (int i = 1; i < argc; i++) {
-        const char *arg = argv[i];
-        if (!strcmp(arg, "-h") || !strcmp(arg, "--help")) {
-            fprintf(stdout, "ds4-worker: standalone ds4 distributed worker\n");
-            ds4_dist_usage(stdout);
-            fprintf(stdout, "  -m, --model PATH   model GGUF (the worker loads only its --layers slice)\n");
-            fprintf(stdout, "  -c, --ctx N        context size (default 32768)\n");
-            fprintf(stdout, "  -t, --threads N    CPU threads\n");
-            fprintf(stdout, "  --cpu|--cuda|--metal  backend override\n");
-            return 0;
-        }
-
-        char dist_err[256] = {0};
-        ds4_dist_cli_parse_result dist_parse =
-            ds4_dist_parse_cli_arg(arg, &i, argc, argv, &opt.distributed,
-                                   dist_err, sizeof(dist_err));
-        if (dist_parse == DS4_DIST_CLI_ERROR) {
-            fprintf(stderr, "ds4-worker: %s\n",
-                    dist_err[0] ? dist_err : "invalid distributed option");
-            return 2;
-        }
-        if (dist_parse == DS4_DIST_CLI_MATCHED) continue;
-
-        if (!strcmp(arg, "-m") || !strcmp(arg, "--model")) {
-            opt.model_path = need_arg(&i, argc, argv, arg);
-        } else if (!strcmp(arg, "-c") || !strcmp(arg, "--ctx")) {
-            ctx_size = parse_int_arg(need_arg(&i, argc, argv, arg), arg);
-        } else if (!strcmp(arg, "-t") || !strcmp(arg, "--threads")) {
-            opt.n_threads = parse_int_arg(need_arg(&i, argc, argv, arg), arg);
-        } else if (!strcmp(arg, "--cpu")) {
-            opt.backend = DS4_BACKEND_CPU;
-        } else if (!strcmp(arg, "--cuda")) {
-            opt.backend = DS4_BACKEND_CUDA;
-        } else if (!strcmp(arg, "--metal")) {
-            opt.backend = DS4_BACKEND_METAL;
-        } else {
-            fprintf(stderr, "ds4-worker: unknown option: %s\n", arg);
-            return 2;
-        }
-    }
-
-    if (opt.distributed.role != DS4_DISTRIBUTED_WORKER) {
-        fprintf(stderr, "ds4-worker: --role worker is required\n");
-        return 2;
-    }
-    if (!opt.model_path) {
-        fprintf(stderr, "ds4-worker: --model is required\n");
-        return 2;
-    }
-
-    char prep_err[256] = {0};
-    if (ds4_dist_prepare_engine_options(&opt.distributed, &opt,
-                                        prep_err, sizeof(prep_err)) != 0) {
-        fprintf(stderr, "ds4-worker: %s\n", prep_err);
-        return 2;
-    }
-
-    ds4_engine *engine = NULL;
-    if (ds4_engine_open(&engine, &opt) != 0 || !engine) {
-        fprintf(stderr, "ds4-worker: failed to open engine\n");
-        return 1;
-    }
-
-    ds4_dist_generation_options gen = {0};
-    gen.ctx_size = ctx_size;
-    int rc = ds4_dist_run(engine, &opt.distributed, &gen);
-    ds4_engine_close(engine);
-    return rc;
-}
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=b3dfb7858cfcb9166e92f366e5af87f19ebc94be
+IK_LLAMA_VERSION?=8befd92ea5f702494ea9813fe42a52fb015db5fe
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/ik-llama-cpp/grpc-server.cpp
+++ b/backend/cpp/ik-llama-cpp/grpc-server.cpp
@@ -326,7 +326,7 @@ struct llama_client_slot
       char buffer[512];
        double t_token = t_prompt_processing / num_prompt_tokens_processed;
        double n_tokens_second = 1e3 / t_prompt_processing * num_prompt_tokens_processed;
-        snprintf(buffer, sizeof(buffer), "prompt eval time     = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second)",
+        sprintf(buffer, "prompt eval time     = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second)",
                t_prompt_processing, num_prompt_tokens_processed,
                t_token, n_tokens_second);
        LOG_INFO(buffer, {
@@ -340,7 +340,7 @@ struct llama_client_slot

        t_token = t_token_generation / n_decoded;
        n_tokens_second = 1e3 / t_token_generation * n_decoded;
-        snprintf(buffer, sizeof(buffer), "generation eval time = %10.2f ms / %5d runs   (%8.2f ms per token, %8.2f tokens per second)",
+        sprintf(buffer, "generation eval time = %10.2f ms / %5d runs   (%8.2f ms per token, %8.2f tokens per second)",
                t_token_generation, n_decoded,
                t_token, n_tokens_second);
        LOG_INFO(buffer, {
@@ -352,7 +352,7 @@ struct llama_client_slot
            {"n_tokens_second",    n_tokens_second},
        });

-        snprintf(buffer, sizeof(buffer), "          total time = %10.2f ms", t_prompt_processing + t_token_generation);
+        sprintf(buffer, "          total time = %10.2f ms", t_prompt_processing + t_token_generation);
        LOG_INFO(buffer, {
            {"slot_id",             id},
            {"task_id",             task_id},
@@ -686,16 +686,7 @@ struct llama_server_context
        slot->sparams.mirostat_eta      = json_value(data, "mirostat_eta",      default_sparams.mirostat_eta);
        slot->params.n_keep             = json_value(data, "n_keep",            slot->params.n_keep);
        slot->sparams.seed               = json_value(data, "seed",              default_sparams.seed);
-        {
-            // upstream changed common_params_sampling::grammar from std::string to
-            // the common_grammar struct (type + grammar). The incoming JSON still
-            // carries a plain string, so build the user-provided grammar here and
-            // fall back to the server default when the request omits it.
-            std::string grammar_str = json_value(data, "grammar", std::string());
-            slot->sparams.grammar = grammar_str.empty()
-                ? default_sparams.grammar
-                : common_grammar{COMMON_GRAMMAR_TYPE_USER, std::move(grammar_str)};
-        }
+        slot->sparams.grammar           = json_value(data, "grammar",           default_sparams.grammar);
        slot->sparams.n_probs           = json_value(data, "n_probs",           default_sparams.n_probs);
        slot->sparams.min_keep          = json_value(data, "min_keep",          default_sparams.min_keep);
        slot->sparams.grammar_triggers = grammar_triggers;
@@ -1241,7 +1232,7 @@ struct llama_server_context
             //      {"logit_bias",        slot.sparams.logit_bias},
            {"n_probs",           slot.sparams.n_probs},
            {"min_keep",          slot.sparams.min_keep},
-            {"grammar",           slot.sparams.grammar.grammar},
+            {"grammar",           slot.sparams.grammar},
            {"samplers",          samplers}
        };
    }
--- a/backend/cpp/ik-llama-cpp/patches/0002-clip-ggml-quantize-chunk-user-data.patch
+++ b/backend/cpp/ik-llama-cpp/patches/0002-clip-ggml-quantize-chunk-user-data.patch
@@ -1,11 +0,0 @@
---- a/examples/llava/clip.cpp
-+++ b/examples/llava/clip.cpp
-@@ -2494,7 +2494,7 @@
-             }
-             new_data = work.data();
-
--            new_size = ggml_quantize_chunk(new_type, f32_data, new_data, 0, n_elms/cur->ne[0], cur->ne[0], nullptr);
-+            new_size = ggml_quantize_chunk(new_type, f32_data, new_data, 0, n_elms/cur->ne[0], cur->ne[0], nullptr, nullptr);
-         } else {
-             new_type = cur->type;
-             new_data = cur->data;
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,14 +1,6 @@

-LLAMA_VERSION?=f3e182816421c648188b5eab269853bf1531d950
+LLAMA_VERSION?=4f02d4733934179386cbc15b3454be26237940bb
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
-# LLAMA_PAGED controls whether the vendored paged-attention patch series
-# (patches/paged/) is applied on top of the pinned llama.cpp. Default on; set
-# LLAMA_PAGED=off to build a clean-against-upstream backend (e.g. to unblock a
-# dep-bump if an upstream change breaks a paged hook - the paged carry is then
-# fixed independently). Runtime behaviour stays gated by the LLAMA_KV_PAGED env
-# regardless, so an LLAMA_PAGED=on build is byte-identical to stock until that
-# env is set.
-LLAMA_PAGED?=on

 CMAKE_ARGS?=
 BUILD_TYPE?=
@@ -42,9 +34,6 @@ else ifeq ($(BUILD_TYPE),hipblas)
 	export CXX=$(ROCM_HOME)/llvm/bin/clang++
 	export CC=$(ROCM_HOME)/llvm/bin/clang
 	AMDGPU_TARGETS?=gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201
-ifeq ($(strip $(AMDGPU_TARGETS)),)
-$(error AMDGPU_TARGETS is empty — set it to a comma-separated list of gfx targets e.g. gfx1100,gfx1101)
-endif
 	CMAKE_ARGS+=-DGGML_HIP=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
 else ifeq ($(BUILD_TYPE),vulkan)
 	CMAKE_ARGS+=-DGGML_VULKAN=1
@@ -145,28 +134,14 @@ llama.cpp:
 	git remote add origin $(LLAMA_REPO)  && \
 	git fetch --all --tags && \
 	git checkout -b build $(LLAMA_VERSION) && \
-	git submodule update --init --recursive --depth 1 --single-branch && \
-	for p in $(CURRENT_MAKEFILE_DIR)patches/0*.patch; do \
-		[ -e "$$p" ] || continue; \
-		echo "applying llama.cpp patch: $$p"; \
-		git apply --verbose "$$p" || { echo "patch failed: $$p"; exit 1; }; \
-	done && \
-	if [ "$(LLAMA_PAGED)" = "off" ]; then \
-		echo "LLAMA_PAGED=off: skipping paged-attention patch series"; \
-	else \
-		for p in $(CURRENT_MAKEFILE_DIR)patches/paged/0*.patch; do \
-			[ -e "$$p" ] || continue; \
-			echo "applying llama.cpp PAGED patch: $$p"; \
-			git apply --verbose "$$p" || { echo "paged patch failed: $$p"; exit 1; }; \
-		done; \
-	fi
+	git submodule update --init --recursive --depth 1 --single-branch

 llama.cpp/tools/grpc-server: llama.cpp
 	mkdir -p llama.cpp/tools/grpc-server
-	LLAMA_PAGED=$(LLAMA_PAGED) bash prepare.sh
+	bash prepare.sh

 rebuild:
-	LLAMA_PAGED=$(LLAMA_PAGED) bash prepare.sh
+	bash prepare.sh
 	rm -rf grpc-server
 	$(MAKE) grpc-server

--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
--- a/backend/cpp/llama-cpp/paged/.gitignore
+++ b/backend/cpp/llama-cpp/paged/.gitignore
@@ -1,7 +0,0 @@
-tests/test_free_block_queue
-tests/test_block_pool
-tests/test_paged_kv_manager
-tests/test_prefix_cache
-tests/test_ggml_paged_rw
-tests/test_ggml_paged_attn
-paged-bench
--- a/backend/cpp/llama-cpp/paged/BLACKWELL_KERNEL_GAPS.md
+++ b/backend/cpp/llama-cpp/paged/BLACKWELL_KERNEL_GAPS.md
@@ -1,105 +0,0 @@
-# Blackwell (GB10 / sm_121) kernel gaps — measured + the corrected strategy
-
-Supersedes the "greenfield tcgen05 FP4 grouped GEMM" framing in `FP4_GROUPED_MOE_KERNEL.md`. Research +
-profiling reframed the problem: the kernels we need **already exist in ggml**; they're just **untuned for
-Blackwell**. And the parity target is far lower than the headline vLLM number implied.
-
-## 1. The parity target was wrong — it's ~3,300 t/s single-stream, not 24,444
-
-vLLM's dense "24,444 t/s" is **aggregate concurrent-batch** throughput, not single-sequence. The GB10
-compute roofline caps **single-stream** Qwen3-32B prefill at **~3,300 t/s (BF16/INT8 ceiling)** / **~6,600
-(FP4 ceiling)**. So: don't chase 24,444 with one kernel. Aggregate parity = (a kernel at the ceiling) +
-(batched-prefill scheduling). The *kernel* job is to reach ~3,300 (matches vLLM, which on GB10 also runs at
-the BF16 ceiling) or ~6,600 (beats it, via FP4).
-
-## 2. GB10 per-precision DENSE peaks (measured, not spec)
-
-| precision | dense peak | vs BF16 |
-|---|---|---|
-| BF16 / FP16 | ~213 TFLOP/s | 1.0× |
-| INT8 | ~215 TOPS | **1.0×** |
-| FP4 (MXFP4/NVFP4) | ~427–500 TFLOP/s | **2.0×** |
-
-Memory: ~273 GB/s LPDDR5X (the bottleneck for *decode*; prefill is compute-bound). **Critical:** GB10 is
-**1:1:2** (BF16:INT8:FP4), NOT datacenter Blackwell's 1:2:4 — **INT8 gives ZERO speedup over BF16 here.** So
-int8-MMQ has no precision advantage; only FP4 does. (NVIDIA spec sheets still claim 1:2:4 — contradicted by
-direct GB10 measurement; on-the-record discrepancy.)
-
-## 3. Measured gaps (nsys, GB10)
-
-| path | kernel | % of prefill | achieved | % of ceiling |
-|---|---|---|---|---|
-| **Dense** Q4_K_M | `mul_mat_q<Q4_K/Q6_K>` (int8 MMQ) | 80% | ~46 TFLOP/s | **~21% of 215** |
-| **MoE** MXFP4 | `mul_mat_q<MXFP4>` (FP4 MMA) | 37% | ~22 TFLOP/s | **~4–5% of 500** (or ~10% of BF16) |
-
-Both kernels are **engaged correctly but untuned for Blackwell** — llama.cpp's MMQ was "tuned primarily for
-RTX 3000/4000" (Ampere/Ada). The headroom (4–5×) is recoverable; it's not an architectural ceiling.
-
-## 4. ggml's current quantized-matmul paths (what exists)
-
-- **MMQ** (int8): quantizes activations to Q8_1, int8 `mma.sync`/`dp4a`. Prefill path. **Untuned for sm_12x.**
-- **FP4 MMA** (#17906, merged): native MXFP4/NVFP4 `m16n8k64` block-scaled FP4 mma for cc≥12.0. Works on GB10
-  for MoE (we measured 3441 t/s MXFP4 prefill) — but underutilized (~5% of FP4 peak). On **sm_121** it's hit
-  by build-flag (`120f`) + nvcc `-O3` miscompile (#18331) + capability-gating issues.
-- **dequant→cuBLAS-FP16**: unfused fallback (materializes FP16 weights, round-trips memory). Not a fused
-  Marlin. (Our `GGML_CUDA_FORCE_CUBLAS` no-op = this didn't even engage for Q4_K.)
-- **NO fused Marlin-style W4A16 kernel** (dequant 4-bit→BF16 in-shared-mem → BF16 tensor cores). Real gap.
-
-## 5. Strategy — match vs beat (this replaces the tcgen05-greenfield plan)
-
-**To MATCH vLLM (~3,300 single-stream): FP4 is NOT required.** Because INT8 == BF16 on GB10, a tuned MMQ and
-a BF16 Marlin kernel share the *same* ceiling — and vLLM hits parity via W4A16 Marlin (BF16), since its FP4
-is also broken on sm_121.
-
-Ranked, by effort:
-1. **Probe: tune the existing int8 MMQ for Blackwell** (dense). Cheapest. We're at 21% of the ceiling —
-   recover via tile sizes, async copy (`cp.async`), double-buffered shared-mem pipeline, occupancy. Caveat:
-   the `nwarps*tile_C::I==mmq_y` static_assert (found earlier) couples the constants; and the Q8_1
-   activation-quant overhead caps pure-MMQ tuning. Bounded upside, but a fast experiment.
-2. **Build a Marlin-style W4A16 BF16 GEMM** (dense) — the robust path to ~3,300 (4.3× over today's 765).
-   Dequant 4-bit→BF16 in shared memory, MMA on BF16 tensor cores, `cp.async` multi-buffer, offline weight
-   reshuffle. Mirrors vLLM's actual GB10 path; keeps activations BF16 (better quality than int8 MMQ); fills a
-   genuine ggml gap. **This is the recommended kernel to MATCH.**
-
-**To BEAT vLLM (~6,600, 2×): fix — don't rewrite — the FP4 path on sm_121.**
-3. **Get the existing FP4 MMA (#17906/#20644) fully working + tuned on sm_121.** It already works on sm_120
-   (RTX 5090: +43–68% prefill) and on GB10 for MoE. The blockers are the `120f` arch flag, the `-O3`
-   miscompile (#18331), capability gating — **build/compiler fixes, not a new kernel.** Then tune the FP4 MMQ
-   (it's at ~5% of FP4 peak). This is where upstream momentum already is, and the only route past vLLM.
-
-**Dropped:** the from-scratch tcgen05/CUTLASS grouped GEMM (the old scaffold). It aimed past the matchable
-ceiling, duplicates work the FP4-MMA path already does, and FP4 on sm_121 is a *fix* problem not a *write*
-problem. The `fp4-grouped-moe.cu` scaffold/hook stays as a useful dispatch seam, but the kernel behind it
-should be one of (1)/(2)/(3), not a greenfield CUTLASS collective.
-
-## 6. Cheap experiment — RESULT: MXFP4 dense = free 1.44×, but not parity (kernel still untuned)
-
-Requantized Qwen3-32B dense → MXFP4 (forced attn+ffn to mxfp4 via `--tensor-type`, `--allow-requantize`,
-speed-only test) and benched prefill:
-
-| quant | kernel | pp512 | pp2048 | vs Q4_K |
-|---|---|---|---|---|
-| Q4_K_M | int8-MMQ | 765 | 763 | 1.0× |
-| **MXFP4** | **FP4-MMA** | **1099** | **1153** | **1.44×** |
-
-**Findings:**
-- **MXFP4 dense is a real, free 1.44× over Q4_K**: just a requantize, the existing FP4-MMA path engages for
-  dense weights on GB10. Worth shipping as a **Blackwell dense-quant recommendation** in the gallery (no kernel).
-- **But it is NOT parity.** 1153 t/s = **~17% of the FP4 ceiling (~6,600)** / ~35% of the BF16 ceiling. So the
-  **FP4-MMA kernel is itself untuned** (consistent with the MoE measurement, ~5% of FP4 peak). MXFP4 moves dense
-  from the int8 path (765) onto the FP4 path (1153), but the FP4 kernel leaves ~4–6× on the table.
-- **So the kernel work is confirmed and now precise: tune the FP4-MMA kernel** (it's the highest-value, since it
-  serves both dense-MXFP4 and MoE, and FP4 is the only path that can *beat* vLLM). Strategy item (3) — fix +
-  tune the existing FP4-MMA on sm_121 — is the priority; a Marlin-style W4A16 BF16 kernel (2) is the alternative
-  to *match* on the BF16 ceiling if FP4 tuning stalls.
-
-Conclusion: the cheap test did NOT collapse the kernel problem (the kernels are untuned, not just the quant), but
-it (a) gives a free 1.44× to ship now, and (b) sharpens the target to **tuning the FP4-MMA kernel**.
-
-## Sources
-GB10 peaks (measured): forums.developer.nvidia.com/t/351993, /360142, /373618. Marlin: github.com/IST-DASLab/marlin,
-arxiv 2408.11743, developers.redhat.com Marlin/Machete. MMQ untuned: llama.cpp docs/build.md, discussions/16578,
-DandinPower/llama.cpp_bench. FP4 landing/sm121: llama.cpp PR #17906/#20644, issues #19662/#18331. Roofline:
-vllm.ai/blog/2026-06-01-vllm-dgx-spark, lmsys.org DGX Spark.
-
-> **Correction (measured):** the earlier `GGML_CUDA_FORCE_CUBLAS` env test was a no-op because it's a *compile-time* `#ifdef`, not a runtime flag — cuBLAS never engaged. A real rebuild with `-DGGML_CUDA_FORCE_CUBLAS=ON` shows cuBLAS is **slower** than MMQ for dense Q4 (pp2048 690 vs 750) and runs an **Ampere `cutlass_80_tensorop` FP16 kernel** — cuBLAS-13.0 has no sm_121-tuned GEMM and falls back to sm_80. So *both* MMQ and cuBLAS sit at ~46 TFLOP/s (~21% of the 213 BF16 peak); there is **no library shortcut** to the ceiling on GB10 — a hand-tuned sm_120a kernel (Marlin-style) is required.
--- a/backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md
@@ -1,334 +0,0 @@
-# Chunked prefill + n_batch/n_ubatch decouple — implementation plan
-
-Scope: LocalAI's llama.cpp backend (`backend/cpp/llama-cpp/`). Companion to
-`PHASED_VLLM_PARITY_PLAN.md` Phase 3. This document is the concrete, file-cited
-plan for what the brief called "chunked prefill".
-
-Line numbers below are from two trees:
-- LocalAI: `backend/cpp/llama-cpp/grpc-server.cpp`, `core/backend/options.go`,
-  `backend/backend.proto`, `core/backend/hardware_defaults.go` — exact.
-- Vendored upstream scheduler: `llama.cpp/tools/server/server-context.cpp`. The
-  build copies `llama.cpp/tools/server/*` into `tools/grpc-server/` (`prepare.sh`
-  lines 15-17) and only overrides `grpc-server.cpp` + `CMakeLists.txt`. So
-  `update_slots()` is **inherited upstream code, not LocalAI code**. Line numbers
-  cited for it are from a same-era checkout (`d12cc3d`, 2026-04-09); the pin is
-  `f3e1828` (Makefile line 2). The structure is identical; exact lines may drift
-  a few rows at the pin — match on the quoted comment strings, not the integers.
-
----
-
-## TL;DR — the headline finding
-
-**Chunked prefill with prefill/decode interleaving is ALREADY implemented** in the
-llama.cpp server scheduler that LocalAI vendors. It is not a missing feature on
-this version. `update_slots()` in `server-context.cpp`:
-
-1. **Adds ongoing decode tokens first** — "first, add sampled tokens from any
-   ongoing sequences" (≈ line 2088). Every `SLOT_STATE_GENERATING` slot gets its
-   one sampled token into the shared `llama_batch` before any prefill is added.
-2. **Then fills the remaining `n_batch` budget with prompt (prefill) tokens** —
-   "next, batch any pending prompts without exceeding n_batch" (≈ line 2166),
-   gated by `params_base.cont_batching` (LocalAI sets `cont_batching = true` by
-   default, `grpc-server.cpp:547`). The per-slot prefill fill loop
-   (≈ line 2552) is `while (slot.prompt.n_tokens() < slot.task->n_tokens() &&
-   batch.n_tokens < n_batch)` — i.e. it caps each slot's prefill contribution to
-   the **remaining** budget and defers the rest to the next iteration.
-3. **Decodes the combined batch in one pass** (≈ line 2728-2741): decode tokens
-   and prefill-chunk tokens go through the **same `llama_decode`**, which then
-   splits internally into `n_ubatch` physical sub-batches.
-
-This is exactly the behavior the abandoned-looking draft **upstream PR #10718**
-("server : chunked prefill support") asked for — "the first task is no longer
-blocked by the second long prompt processing task." That PR is still marked OPEN
-but its goal was absorbed into the natural evolution of `update_slots()`; we do
-**not** need to port it. A long prefill no longer stalls the decode batch: decode
-slots are serviced first every iteration, prefill consumes only the leftover
-budget.
-
-**Therefore: do not re-implement chunked prefill.** The real LocalAI gap is
-narrow and is the rest of this plan:
-
-- **Phase A (the actual gap): the `n_batch`/`n_ubatch` decouple.** LocalAI ties
-  the scheduler token budget (`n_batch`) to the physical forward width
-  (`n_ubatch`) at `grpc-server.cpp:515` + `:519`. This forces
-  `n_batch == n_ubatch`, so the logical scheduling window can never be wider than
-  one physical ubatch. You cannot keep `n_ubatch` at the Blackwell GEMM sweet
-  spot (2048) while widening `n_batch` so concurrent prefills + decodes co-batch
-  into a larger logical window. There is no first-class `batch:`/`ubatch:` split
-  on the Go side, and there is only a one-directional `ubatch` override on the C++
-  side (you can shrink ubatch below the coupled value, never grow n_batch above
-  it).
-- **Phase B (optional policy lever): a decode-headroom prefill cap.** Upstream
-  caps prefill at the full `n_batch` shared with decode. Under heavy mixed load
-  one fat prefill chunk per iteration still adds inter-token latency (ITL) jitter
-  to the decoders sharing that forward. vLLM exposes
-  `long_prefill_token_threshold` / `max_num_partial_prefills` for this. A
-  LocalAI-specific per-iteration prefill cap (a patch to vendored `update_slots`)
-  bounds that jitter. This is genuinely not in upstream and is the only place a
-  scheduler-policy change is warranted.
-
----
-
-## 1. Current behavior — precise citations
-
-### 1.1 The scheduler is upstream, inherited verbatim
-- `prepare.sh:15-17` copies all of `llama.cpp/tools/server/*` into the
-  `grpc-server` build dir; `grpc-server.cpp` (LocalAI) replaces only the HTTP/gRPC
-  service + `params_parse` + `parse_options`. `update_slots()`, the slot state
-  machine, and the batch builder are **upstream `server-context.cpp`**, untouched
-  by LocalAI today.
-- Slot states: `server-context.cpp:36-42`:
-  `SLOT_STATE_IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT /
-  GENERATING`.
-
-### 1.2 Decode-first, then prefill-fill, one shared batch
-- `common_batch_clear(batch)` (≈ 2078): one batch per `update_slots` iteration.
-- Decode phase (≈ 2088-2156): for each `SLOT_STATE_GENERATING` slot,
-  `common_batch_add(batch, slot.sampled, …, /*logits=*/true)` adds exactly one
-  token. Decode is guaranteed a seat before prefill runs.
-- Budget fetch (≈ 2158-2160): `n_batch = llama_n_batch(ctx)`,
-  `n_ubatch = llama_n_ubatch(ctx)`.
-- Prefill phase (≈ 2166): `if (params_base.cont_batching || batch.n_tokens == 0)`
-  → with cont_batching ON, prefill is added to the **same** batch as decode.
-- Per-slot prefill fill (≈ 2552-2597):
-  `while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch)`
-  — adds prompt tokens until the slot is done **or** the shared budget is hit.
-  Whatever does not fit stays for the next iteration (the slot remains
-  `SLOT_STATE_PROCESSING_PROMPT`).
-- Whole-prompt completion (≈ 2603-2615): when the slot's prompt is fully consumed
-  it flips to `SLOT_STATE_DONE_PROMPT`, sets `batch.logits[last] = true`, inits
-  the sampler. Next iteration it becomes `GENERATING`.
-- Budget break (≈ 2693-2695): `if (batch.n_tokens >= n_batch) break;`.
-- Decode (≈ 2728-2741): loops `batch_view` slices of `min(n_batch, remaining)` and
-  calls `llama_decode`; the physical `n_ubatch` split happens inside
-  `llama_decode`.
-
-### 1.3 The chunking is gated by `can_split()`
-- `server-context.cpp:225-231`: `can_split()` returns true unless the task needs
-  embeddings with non-LAST pooling. So **completion/generation tasks always
-  chunk-and-interleave**; only embeddings/rerank force the whole prompt into one
-  ubatch (≈ 2234-2244 raises "input is too large… increase the physical batch
-  size" — this is exactly why LocalAI bumped `n_ubatch` for rerank, see below).
-
-### 1.4 LocalAI ties n_batch to n_ubatch (the gap)
-- `grpc-server.cpp:515`: `params.n_batch  = request->nbatch();`
-- `grpc-server.cpp:519`: `params.n_ubatch = request->nbatch();` with the comment
-  that this fixes reranking being capped at the 512 default `n_ubatch`.
-- `grpc-server.cpp:781-784` is the **only** decouple knob today: an `n_ubatch` /
-  `ubatch` option that overrides `n_ubatch` alone (added for embeddings/rerank).
-  There is **no** `batch` / `n_batch` option parse, so `n_batch` cannot be raised
-  above the coupled value from a model config. Confirmed: `grep '"n_batch"|"batch"'`
-  in `grpc-server.cpp` returns nothing.
-- Options arrive via `request->options(i)` parsed as `optname:optval`
-  (`grpc-server.cpp:584-585`); these come from `ModelOptions.Options` ⟵
-  `c.Options` (`core/backend/options.go:221`).
-
-### 1.5 Go side sends a single batch number
-- `backend/backend.proto:341`: `int32 NBatch = 4;` is the only batch field; there
-  is **no** `NUBatch`.
-- `core/backend/options.go:108-129` `EffectiveBatchSize`: returns `c.Batch` if set,
-  else context size for single-pass (score/embed/rerank), else
-  `hardwareDefaultBatchSize(512)`.
-- `core/backend/options.go:228`: `NBatch: int32(b)` (single value to the
-  backend; becomes both `n_batch` and `n_ubatch` via 1.4).
-- `core/backend/hardware_defaults.go:28,37-40`: `BlackwellBatchSize = 2048`;
-  on Blackwell an unset batch defaults to 2048, so today
-  `n_batch == n_ubatch == 2048` there.
-
----
-
-## 2. Why the decouple matters for serving (not just rerank)
-
-Invariant: `n_ubatch <= n_batch`. `n_ubatch` is the physical forward-pass GEMM
-width (compute efficiency; GB10 sweet spot ≈ 2048). `n_batch` is the per-iteration
-**scheduler token budget** — the logical window shared by decode + prefill chunks,
-analogous to vLLM's `max_num_batched_tokens`.
-
-With `n_batch == n_ubatch` (today), the scheduling window cannot exceed one
-physical ubatch. Consequences:
-- Under concurrency, the combined (decode + multiple prefill chunks) logical batch
-  is capped at the physical ubatch, so aggregate prefill cannot grow past one
-  ubatch worth of tokens per iteration even when more slots have prompts queued.
-- A user who shrinks `batch:` for memory also shrinks the physical ubatch,
-  degrading prefill GEMM efficiency — and vice versa.
-
-Decoupling lets us hold `n_ubatch = 2048` (efficient GEMM) while setting a larger
-`n_batch` (e.g. 4096) so more concurrent prefill+decode tokens co-schedule into one
-logical window, lifting aggregate prefill under mixed load — `llama_decode` still
-tiles the physical work at 2048.
-
----
-
-## 3. Phased implementation
-
-### Phase 0 — Verification harness (do first; TDD red)
-Bite-sized, no code change to the scheduler.
-- **0.1 Token-identical greedy under mixed load.** Script: start the backend with
-  `n_parallel >= 4`, greedy sampling (temp 0, fixed seed). Fire (a) several short
-  decode streams and (b) one ~8k-token prompt concurrently (the exact repro from
-  PR #10718's body works). Capture each stream's full token id sequence. Re-run
-  with the prefill request absent. **Assert the short streams' token ids are
-  byte-identical** in both runs — proves interleaving does not perturb decode
-  numerics (KV/position correctness across chunk boundaries). Wire as a Ginkgo
-  spec under the backend e2e suite.
-- **0.2 Mixed-workload throughput baseline.** Use `llama-batched-bench` (built from
-  the same tree) or a small driver hitting `/v1/chat/completions`: measure
-  aggregate prefill tok/s and decode tok/s, and p50/p99 ITL of the decode streams,
-  under the mixed workload. Record numbers for the current `n_batch==n_ubatch`
-  config. This is the before of Phase A/B.
-
-Expected result of Phase 0: 0.1 already passes (interleave is correct today);
-0.2 gives the baseline the decouple must beat.
-
-### Phase A — Decouple n_batch from n_ubatch
-Goal: let model config set the physical ubatch independently of the logical batch,
-defaulting to today's behavior (no regression).
-
-- **A.1 C++: accept a `batch`/`n_batch` option (and keep `ubatch`).**
-  In `grpc-server.cpp`, after the existing `ubatch` branch (`:781-784`), add a
-  sibling branch:
-  ```cpp
-  } else if (!strcmp(optname, "n_batch") || !strcmp(optname, "batch")) {
-      if (optval != NULL) {
-          try { params.n_batch = std::stoi(optval_str); } catch (...) {}
-      }
-  ```
-  This is the missing direction (raise `n_batch` above the coupled value). Order
-  matters: both `:515/:519` run first (coupling as default), then option parsing
-  overrides either independently. Add a clamp note: if a user sets
-  `n_ubatch > n_batch`, llama.cpp will clamp/upbatch; log a warning. Keep the
-  `:519` aliasing for backward compat (rerank still works with no options).
-
-- **A.2 Proto: add an explicit physical ubatch field.**
-  `backend/backend.proto:341` add `int32 NUBatch = <next free tag>;` (do not reuse
-  4). Regenerate with `make protogen-go` + the C++ proto build.
-
-- **A.3 C++: honor `NUBatch` when present.**
-  In `grpc-server.cpp` `params_parse`, after `:519`, add:
-  ```cpp
-  if (request->nubatch() > 0) {
-      params.n_ubatch = request->nubatch();
-  }
-  ```
-  so an explicit physical ubatch wins over the `n_batch` alias, with the `ubatch`
-  string-option as a third path for users who only edit `options:`.
-
-- **A.4 Go: config surface + plumbing.**
-  - Add `UBatch *int` (yaml `ubatch`) to the llama config struct alongside `Batch`
-    (search `core/config` for the `Batch` field; mirror it).
-  - In `core/backend/options.go`: add `EffectiveUBatchSize(c)` mirroring
-    `EffectiveBatchSize` (return `c.UBatch` if set, else
-    `min(EffectiveBatchSize(c), BlackwellBatchSize-or-512)` so the physical ubatch
-    stays at the hardware sweet spot while `n_batch` may be larger). Set
-    `NUBatch: int32(EffectiveUBatchSize(c))` next to `NBatch:` (`:228`).
-  - Keep the default such that when neither is set, `NUBatch == NBatch` ⇒
-    byte-identical to today.
-
-- **A.5 Serving default (the lever).**
-  In `hardware_defaults.go`, introduce `BlackwellLogicalBatch = 4096` (or a
-  measured value) and let `EffectiveBatchSize` return it for **multi-slot serving**
-  configs (when `n_parallel > 1` and the model is a completion model), while
-  `EffectiveUBatchSize` stays at `BlackwellBatchSize = 2048`. Gate behind the same
-  Blackwell detection already used at `:37-40`. Single-stream/embedding/rerank
-  paths keep `n_batch == n_ubatch`. This is the only behavioral change shipped by
-  Phase A; Phase 0.2 must show it is net-positive before defaulting it on.
-
-- **A.6 Tests.** Extend `hardware_defaults_internal_test.go` with
-  `EffectiveUBatchSize` cases; add a `grpcModelOpts` test asserting
-  `NUBatch <= NBatch` and that unset config yields `NUBatch == NBatch`. Re-run
-  0.1 (must still be token-identical) and 0.2 (must show aggregate-prefill gain or
-  neutral ITL) at `n_batch=4096, n_ubatch=2048`.
-
-### Phase B — Decode-headroom prefill cap (optional policy, vendored patch)
-Only if Phase 0.2 / A shows decode ITL jitter from fat prefill chunks. This is the
-one change that touches the inherited scheduler, so it lives as a patch in
-`backend/cpp/llama-cpp/patches/` (applied by `prepare.sh:6-11` / Makefile
-`:141-145`), never as an edit to a checked-in upstream file.
-
-Policy (pseudocode; insert into `update_slots()` prefill fill loop, the
-`while (… && batch.n_tokens < n_batch)` at ≈ `server-context.cpp:2552`):
-
-```
-# token budget for THIS iteration, decode already seated:
-n_decode_in_batch = batch.n_tokens            # set after the decode phase
-prefill_budget    = n_batch                    # default == today
-
-if serving_mode and n_decode_in_batch > 0:
-    # leave room so decoders are not starved/jittered by one giant prefill chunk
-    # max_prefill_per_iter defaults to n_ubatch (one physical tile) when decode active
-    prefill_budget = min(n_batch, n_decode_in_batch + max_prefill_per_iter)
-
-# fill loop guard becomes:
-while slot.prompt.n_tokens() < slot.task->n_tokens()
-      and batch.n_tokens < prefill_budget:
-      ...
-```
-
-- `max_prefill_per_iter` is a new `common_params` field surfaced as an
-  `options:` knob (`max_prefill_tokens` / `mpt`) parsed in `grpc-server.cpp`
-  exactly like A.1, default `0` = disabled = today's behavior.
-- Semantics mirror vLLM `long_prefill_token_threshold`: cap the prefill share so
-  ongoing decodes keep a steady cadence; the remaining prompt rides the next
-  iteration (already supported by the state machine — slot stays
-  `PROCESSING_PROMPT`).
-- **Correctness:** unchanged KV/position path. Chunk boundaries already advance
-  `slot.prompt.tokens.pos_next()` per added token (≈ 2570) and the slot resumes
-  from `slot.prompt.n_tokens()` next iteration. Capping the budget only changes
-  *how many* tokens are added this iteration, not *which* positions, so 0.1 must
-  remain token-identical.
-
-### Phase C — Docs + defaults rollout
-- Document `batch` / `ubatch` (and `max_prefill_tokens` if B ships) in
-  `docs/content/` model-config reference, with the serving recipe
-  (`n_parallel>1`, `n_batch=4096`, `ubatch=2048`).
-- Note the orthogonality to paged KV (below) in
-  `PHASED_VLLM_PARITY_PLAN.md` Phase 3.
-
----
-
-## 4. Risk / correctness
-
-- **KV-cache & positions across chunks:** already handled upstream. Each prefill
-  token added advances `pos_next()` (≈ 2570) and is pushed to `slot.prompt.tokens`
-  (≈ 2573); the next iteration resumes from `slot.prompt.n_tokens()`. Chunk
-  boundaries are transparent to the KV cache because positions are absolute, not
-  per-chunk. Phase A changes only budgets, not positions; Phase B changes only the
-  per-iteration count. The 0.1 token-identical test is the guardrail.
- **Unified KV cache (LocalAI default, `n_parallel` slots share one cache):**
-  unaffected — co-batching prefill+decode across slots is what the unified cache is
-  for; positions are per-`seq_id` (`{ slot.id }` in `common_batch_add`).
- **`n_ubatch > n_batch`:** invalid; A.4 clamps `EffectiveUBatchSize <=
-  EffectiveBatchSize` and A.1 logs a warning if options violate it.
- **Embeddings / rerank:** must keep `n_ubatch >= prompt length` (single pass,
-  `can_split()==false`). The existing `:519` alias + `EffectiveBatchSize`
-  context-sizing for single-pass usecases (`options.go:119-124`) must be preserved
-  — do not let the serving `BlackwellLogicalBatch` default leak into single-pass
-  configs (A.5 gates on completion + `n_parallel>1`).
- **Turboquant fork:** the fork lacks some `common_params` fields (see
-  `LOCALAI_LEGACY_LLAMA_CPP_SPEC` precedent at `grpc-server.cpp:755`). `n_batch` /
-  `n_ubatch` are ancient fields and safe; if Phase B adds `max_prefill_per_iter`,
-  guard the new field behind a `#ifndef` like the checkpoint block does.
-
-## 5. Orthogonality to paged KV (Phase 2)
-
-Keep them independent. Paged KV (the `-kvp` / block-manager effort, draft #22569,
-and `paged/`) changes **where** KV blocks live (allocation/utilization). Chunked
-prefill / this decouple changes **how many tokens per iteration** the scheduler
-batches (the `n_batch` budget and decode/prefill interleave). They compose: paged
-KV raises the concurrency ceiling (more slots), the decouple widens the per-iter
-scheduling window to feed those slots; neither touches the other's data structures.
-The only contact point is `update_slots()` — if both ship a vendored patch to it,
-land them as separate, ordered patches in `patches/` and keep the hunks disjoint
-(paged touches allocation/seq_rm; chunked-prefill Phase B touches the prefill fill
-budget).
-
---
-
-## 6. Bottom line
-
- Chunked prefill + decode interleave: **already present and correct** on the
-  pinned llama.cpp — verify (Phase 0.1), do not rebuild.
- Real work: the **n_batch/n_ubatch decouple** (Phase A) — small, additive,
-  default-preserving — plus an **optional decode-headroom prefill cap** (Phase B)
-  if measurements show ITL jitter. Both are LocalAI-side: A in `grpc-server.cpp`
-  + proto + `options.go`; B as a vendored `patches/` hunk.
--- a/backend/cpp/llama-cpp/paged/DECODE_OVERHEAD.md
+++ b/backend/cpp/llama-cpp/paged/DECODE_OVERHEAD.md
@@ -1,215 +0,0 @@
-# llama.cpp multi-user decode overhead on DGX Spark (GB10, sm_121)
-
-Investigation of the Qwen3-32B concurrent-decode throughput gap (llama.cpp ~547 t/s
-vs vLLM ~667 t/s) on the GB10 box, build `~/llama.cpp-pr24423/build` (Release,
-sm_121, `LLAMA_MAX_SEQ=256`, flash-attn on), model
-`~/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf`.
-
-## TL;DR (the result overturns the brief's premise)
-
-On **this** build the prime suspect is wrong and the host-overhead premise does not
-hold:
-
-1. **CUDA graphs are NOT disabled at high concurrency.** At npl=128, 94 of 98
-   decode `graph_compute` calls **replay a captured CUDA graph** (0 resets, stable
-   key, no property churn post-warmup). The keyed-warmup gate works.
-2. **There is no ~170ms/step host hotspot here.** The GPU is **~96% active during
-   decode with graphs ON and ~96% active with graphs OFF**. Decode at npl=128 is
-   **GPU-compute-bound**, not host-bound.
-3. The brief's "20% GPU util / 66ms GPU / 170ms host per step" was measured on a
-   different/earlier build (mainline without these graph fixes). It is not
-   reproducible on `llama.cpp-pr24423`.
-4. Because the GPU is the bottleneck, re-enabling graphs cannot lift the number:
-   the clean A/B shows graphs ON vs OFF = **+1.5% at npl=128** (and +2.9% at
-   npl=32 - the benefit shrinks as concurrency rises and the GPU saturates).
-5. The real gap to vLLM is the **quantized decode GEMM kernel**: `mul_mat_q`
-   (Q4_K + Q6_K) is ~68% of decode GPU time and runs ~2.1x above the GB10
-   memory-bandwidth floor. Closing the gap requires Marlin/Machete-style int4
-   GEMM kernels, not host-side work. This is a kernel project (the direction the
-   prior session's uncommitted `marlin-w4a16.cu` / `fp4-grouped-moe.cu` already
-   started, though those target w4a16/GPTQ-int4, not the K-quants this GGUF uses).
-
-## 1. Why CUDA graphs are (not) disabled - exact code + measurement
-
-### The gate (code)
-
-PR24423 refactored the CUDA-graph path into a keyed, warmup-based scheme in
-`~/llama.cpp-pr24423/ggml/src/ggml-cuda/ggml-cuda.cu`:
-
- `ggml_cuda_graph_get_key(cgraph)` (~L3343) keys the cached CUDA graph by
-  `cgraph->nodes[0]` (first-node pointer).
- `ggml_cuda_graph_check_compability(cgraph)` (~L3301) disables graphs only for:
-  - **split buffers** (`ggml_backend_buft_is_cuda_split`), and
-  - **`GGML_OP_MUL_MAT_ID`** when `src0` is non-quantized **or**
-    `ne[2] > get_mmvq_mmid_max(...)` (MoE expert routing needs a stream sync).
-  Qwen3-32B is **dense** -> no `MUL_MAT_ID` -> this condition never fires.
- `ggml_backend_cuda_graph_compute` (~L4514) warmup gate: a graph is used only
-  after **2 consecutive calls with no property change** (`warmup_complete`); any
-  property change resets warmup. `ggml_cuda_graph_update_required` (~L3347)
-  detects change by `memcmp` of the full `ggml_tensor` struct + per-src
-  data-ptr/ne/nb, with a fast path when `cgraph->uid` is unchanged.
-
-### Why it stays enabled across decode steps
-
-The graph stays stable because llama.cpp's host-side graph reuse holds during
-decode, so node pointers/props (and `cgraph->uid`) do not churn:
-
- `llama_kv_cache::get_n_kv` (`src/llama-kv-cache.cpp` L1223-1233) **pads n_kv to
-  a multiple of 256** ("so that the graph remains constant across batches and can
-  be reused"). For ntg<=256 within the first KV block, n_kv is constant.
- `can_reuse_kq_mask` (`src/llama-graph.cpp` L43) keeps the KQ-mask dims stable:
-  `ne=[n_kv, n_tokens/n_stream, 1, n_stream]` = `[256,1,1,128]` every decode step
-  at npl=128.
- `can_reuse` (`src/llama-context.cpp` L1283) therefore returns true, so the
-  scheduler is **not** reset/re-split. `graph->uid` is only reassigned inside
-  `ggml_backend_sched_split_graph` (`ggml/src/ggml-backend.cpp` L1033, L1485),
-  which is skipped on the reuse path -> stable uid -> CUDA graph replays.
-
-### Measurement (instrumented build, npl=128, ntg=96)
-
-Env-gated counters added to `ggml_backend_cuda_graph_compute` /
-`ggml_cuda_graph_update_required` (since `GGML_LOG_DEBUG` is compiled out in
-Release / NDEBUG). End-of-run summary:
-
-```
-[GTRACE-SUMMARY] calls=98 notenab=0 warming=3 warmdone=1 RESET=0 USED=94 incompat=0 distinct_keys=1
-```
-
-94/98 decode `graph_compute` calls **replayed** a captured CUDA graph; **0**
-warmup resets; a **single** distinct graph key for the whole decode; no node
-property churn after warmup. Graphs are fully engaged at npl=128.
-
-(The instrumentation was reverted afterwards; the checkout is back to its
-pre-task state and the `.so` rebuilt clean.)
-
-## 2. The per-step CPU "hotspot" - there isn't one on this build
-
-GPU utilization during npl=128 decode (ntg=256):
-
- **Graphs ON** - `nvidia-smi` sampled every 0.7s through the decode phase:
-  steady **96% GPU util**, SM clock **2184 MHz** (not throttled), 45-47 W.
- **Graphs OFF** (`GGML_CUDA_DISABLE_GRAPHS=1`) - nsys CUDA trace, 8s window:
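-  total GPU kernel time = `3,983,292,128 ns / 0.516` (one kernel's summed time
-  divided by its 51.6% share, which matches the Q4_K `mul_mat_q` row in the
-  breakdown below) = **~7.72s of the 8s window = ~96% GPU-active**. Even with
-  every kernel launched individually from the host, the GPU is still ~96% busy.
-  There are essentially **no host gaps**.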
-
-Per-step wall = 60.6s / 256 steps = **~237 ms/step**, and the sum of one decode
-graph's kernel times (nsys, graphs-on capture) is ~244 ms -> GPU kernel time per
-step ~= wall time per step. The host work between steps is in the low single-digit
-ms (the ~4% idle), consistent with graphs ON giving only +1.5% at npl=128.
-
-This directly contradicts the brief's 66ms-GPU / 170ms-host split, which must have
-come from a pre-graphs build.
-
-### Per-step GPU breakdown (nsys, npl=128 decode, graphs off, 8s window)
-
-| Kernel | % GPU time | ~ms/step |
-|--------|-----------:|---------:|
-| `mul_mat_q` Q4_K (type 12) | 51.6 | ~118 |
-| `flash_attn_ext_f16` | 19.3 | ~44 |
-| `mul_mat_q` Q6_K (type 14) | 16.2 | ~37 |
-| `unary_gated` silu | 4.1 | ~9 |
-| mmq stream-k fixup + quantize_q8_1 | ~5 | ~12 |
-| rms_norm / rope / set_rows / add | ~4 | ~10 |
-
-Quantized matmul = **~68%** of decode GPU time (~155 ms/step). Attention ~19%.
-
-`perf` could not profile the host (kernel `perf_event_paranoid=4`), but it is moot:
-the host is ~4% of the wall, so there is no ~170ms host hotspot to chase.
-
-## 3. Fix attempt + measured result
-
-### The requested fix (re-enable graphs / pad the decode batch) is a no-op here
-
-Graphs are already enabled and the batch is already stable (n_kv padded to 256,
-kq_mask dims constant). The clean cold A/B (cooldowns between every run):
-
-| npl | graphs ON (t/s) | graphs OFF (t/s) | delta |
-|----:|----------------:|-----------------:|------:|
-| 32  | 242.60 | 235.75 | +2.9% |
-| 64  | 398.59 | 389.06 | +2.5% |
-| 128 | 543.95 | 535.71 | +1.5% |
-
-Baseline (separate cold runs, original non-instrumented build):
-npl=32 243.9, npl=64 397.1, **npl=128 544.95** (matches the ~546 baseline).
-
-Graphs help, but the benefit **monotonically shrinks** as concurrency rises and
-the GPU saturates. At npl=128 there is only ~1.5% of host launch overhead left to
-remove, and GPU util is ~96% in both columns. **You cannot lift npl=128 decode
-toward 667 by working on graphs/host overhead - the GPU is the bottleneck.**
-
-### Where the number actually is, and the real lever
-
- vLLM 667 t/s at this concurrency = **192 ms/step**; llama.cpp 547 = **237
-  ms/step**. The ~45 ms/step gap maps almost entirely onto the quantized matmul.
- GB10 memory-bandwidth floor for a 32B Q4_K_M (~19.8 GB of weights, read once
-  per step and shared across the 128 sequences) at ~273 GB/s is **~72 ms/step**.
-  llama.cpp's `mul_mat_q` spends ~155 ms/step on matmul = **~2.1x the bandwidth
-  floor**. vLLM's Marlin/Machete int4 GEMMs run much closer to the floor; that
-  efficiency difference is the ~547 -> 667 gap.
- The Q6_K matmul (`mul_mat_q` type 14) also shows pathological tail latency
-  (median 0.89 ms, max 5.5 ms) - the MMQ kernel is not well-tuned for the skinny
-  n=128 decode shape.
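-
-The per-step and bandwidth-floor figures above can be re-derived with a few lines
-of Python. This is only a sketch of the arithmetic; the function names are
-hypothetical, and the 237 ms/step and 155 ms/step inputs are the measured values
-quoted in this document, not something the code computes.
-
-```python
-def step_ms(n_seqs, tokens_per_s):
-    """Wall time per decode step when n_seqs sequences each emit one token."""
-    return 1000.0 * n_seqs / tokens_per_s
-
-
-def bandwidth_floor_ms(weight_gb, bandwidth_gb_s):
-    """Time to stream the weights once per step at the given memory bandwidth."""
-    return 1000.0 * weight_gb / bandwidth_gb_s
-
-
-vllm_ms = step_ms(128, 667)              # vLLM at npl=128
-llama_ms = 237.0                         # measured wall per step (section 2)
-floor_ms = bandwidth_floor_ms(19.8, 273)
-matmul_ms = 155.0                        # measured mul_mat_q time per step
-
-print(f"vLLM step      : {vllm_ms:.0f} ms")
-print(f"gap to llama   : {llama_ms - vllm_ms:.0f} ms")
-print(f"bandwidth floor: {floor_ms:.1f} ms")
-print(f"matmul / floor : {matmul_ms / floor_ms:.2f}x")
-```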
-
-**The lever to beat 547 is a faster quantized decode GEMM**, i.e. a Marlin-style
-int4 kernel for the decode shapes. This is exactly the direction of the prior
-session's uncommitted `ggml/src/ggml-cuda/marlin-w4a16.cu` and
-`fp4-grouped-moe.cu` (already wired via
-`if (!split && ggml_cuda_w4a16_mul_mat(...)) return;` in `ggml_cuda_mul_mat`).
-Note those target **w4a16 / GPTQ-int4**, while this GGUF is **K-quant (Q4_K/Q6_K)**,
-so they are inert for this model - a Marlin path for K-quants (or shipping the
-model in a Marlin-friendly int4 format) would be required. That is a multi-day
-kernel effort, out of scope for this session, but it is the only lever that can
-move the number.
-
-### Why the "bump LLAMA_MAX_SEQ to 1024 -> 377" data point is consistent
-
-`llama_batch_allocr` keeps `seq_cpl` as an `LLAMA_MAX_SEQ x LLAMA_MAX_SEQ` table
-(`src/llama-batch.cpp`), so per-batch seq bookkeeping scales ~O(MAX_SEQ^2). At
-MAX_SEQ=1024 that host cost becomes large enough (~70 ms/step) to dominate and
-drop decode to 377. At MAX_SEQ=256 the same term is ~4.4 ms/step (the ~1.5% that
-graphs reclaim); lowering to 128 would save ~3 ms/step (~1%). So MAX_SEQ tuning
-confirms the host term is real but tiny at 256 - not a path to 667.
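-
-A quick check of the quadratic scaling claim, as a sketch: assuming the host term
-scales exactly with `LLAMA_MAX_SEQ^2` and taking the ~4.4 ms/step at 256 as the
-reference point (the function name and the pure-quadratic model are assumptions),
-the other two data points follow.
-
-```python
-def seq_table_cost_ms(max_seq, ref_max_seq=256, ref_cost_ms=4.4):
-    """Host cost per step under a pure O(MAX_SEQ^2) model, scaled from a reference."""
-    return ref_cost_ms * (max_seq / ref_max_seq) ** 2
-
-
-for max_seq in (128, 256, 1024):
-    print(max_seq, round(seq_table_cost_ms(max_seq), 1), "ms/step")
-
-saving = seq_table_cost_ms(256) - seq_table_cost_ms(128)
-print("saving from 256 -> 128:", round(saving, 1), "ms/step")
-```
-
-The model gives about 70 ms/step at 1024 and a saving of about 3 ms/step when
-going from 256 to 128, in line with the numbers quoted above.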
-
-## How this would land in LocalAI
-
- **No host/graph patch is warranted** for this build: graphs already engage and
-  the decode is GPU-bound. A "pad the decode batch / force graph capture" patch
-  would change nothing measurable at high concurrency.
- The actionable upstream/vendored work is a **Marlin-style int4 decode GEMM**
-  (extend the prior `marlin-w4a16.cu` to cover K-quants, or quantize the served
-  model into a Marlin-friendly int4 layout). That is where the ~547 -> 667+ lives.
- If a small host win is still wanted, keep `LLAMA_MAX_SEQ` no larger than the max
-  concurrency actually used (the per-batch `seq_cpl` table is O(MAX_SEQ^2)).
-
-## Reproduction
-
-```
-# baseline / A/B (cold, 30s cooldowns)
-llama-batched-bench -m Qwen3-32B-Q4_K_M.gguf -npp 16 -ntg 128 -npl 32,64,128 \
-  -ngl 99 -b 2048 -ub 2048 -fa on            # graphs on
-GGML_CUDA_DISABLE_GRAPHS=1 ...same...        # graphs off
-
-# GPU util (graphs on): sample nvidia-smi during decode -> ~96%, 2184 MHz
-# GPU active (graphs off): nsys profile -t cuda --delay=6 --duration=8 ...
-#   nsys stats --report cuda_gpu_kern_sum  -> sum/0.516 ~= 7.72s of 8s = ~96%
-```
-
-## UPDATE: NVFP4 closes most of the decode gap (no Marlin-for-K-quants needed)
-
-The diagnosis above said the lever is "a more bandwidth-efficient int4 decode GEMM"
-and feared a multi-day Marlin-for-K-quants kernel. But the FP4-MMA path is already
-that kernel. Measured (npl=128, cold A/B, npp=16 ntg=128):
-
-| quant | decode S_TG (t/s) | vs Q4_K | vs vLLM 667 |
-|---|---|---|---|
-| Q4_K_M | 547 (548/546) | - | 82% |
-| **NVFP4** | **619 (617/622)** | **+13%** | **93%** |
-
-NVFP4's `mul_mat_q<NVFP4>` runs closer to the GB10 bandwidth floor at the thin n=128
-decode shape than Q4_K's int8-MMQ (which ran ~2.1x above it). So shipping the model
-as NVFP4 closes the decode gap from ~22% to ~7% AND wins prefill (1209 vs Q4 767 /
-vLLM 800). Net on GB10: llama.cpp+NVFP4 is ahead on prefill (1.5x) and within ~7% on
-decode. The remaining ~7% would be incremental FP4-MMA decode-kernel tuning, NOT a
-from-scratch Marlin kernel - a much smaller, optional effort. NVFP4 is the answer to
-both the prefill and the decode gap.
--- a/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
@@ -1,253 +0,0 @@
-# Closing the vLLM Gap on Blackwell (GB10 / DGX Spark) — Living Plan & Results
-
-Target hardware: NVIDIA **GB10** (Grace-Blackwell, `sm_121a`, 119 GiB unified LPDDR5X), `dgx.casa`.
-Model under test: **Qwen3-Coder-30B-A3B-Instruct** (MoE, 128 experts, top-8, ~3B active).
-Engines: llama.cpp (CUDA, `~/llama.cpp-pr24423`, build `7a6ddc5`, `CMAKE_CUDA_ARCHITECTURES=121`) vs vLLM 0.23.0 (`~/vllm-bench`, torch 2.11.0+cu130).
-
-> This is a working document. Each phase appends measured numbers, what was learned, and what's next.
-> Methodology: `llama-bench` (single-stream pp/tg, built-in reps) and `llama-batched-bench` (`-npl` sweep,
-> decode-phase aggregate `S_TG`, prefill aggregate `S_PP`); vLLM via `~/bench/vllm_conc.py` (decode-phase
-> aggregate matched to `S_TG`). Same model/prompt/seed. Precision matched where possible.
-
---
-
-## Baseline results (established)
-
-### Single-stream (B=1), matched ~8-bit
-| Engine / precision | prefill pp512 (t/s) | decode tg128 (t/s) |
-|---|---|---|
-| llama.cpp **Q8_0** | 2215 ± 15 | **54.8 / 62.2** * |
-| llama.cpp **F16** | 700 ± 24 | 32.9 ± 0.05 |
-| vLLM **FP8** | 9155 ± 308 | 52.45 ± 0.05 |
-
-\* two sessions; ~55 right after worker-stop (clocks settling), ~62 steady state. Both ≥ vLLM → **single-stream parity holds**.
-
-### Concurrency sweep (decode-phase aggregate `S_TG`, prefill aggregate)
-| B | llama Q8 prefill | vLLM FP8 prefill | llama Q8 decode | vLLM FP8 decode |
-|---|---|---|---|---|
-| 1 | 1080 | 9644 | 60.1 | 48.0 |
-| 8 | 2189 | 33373 | 160.8 | 312.4 |
-| 32 | 2198 | 99398 | 357.1 | 1171 |
-| 64 | 2194 | 151990 | 519.2 | 2064 |
-
-llama F16 prefill also flat: B=1 452 → B=8 723 → B=32 778. **Prefill flat at both precisions = kernel-throughput ceiling.**
-
-### Our paged patch (LLAMA_KV_PAGED) — concurrency effect: NONE
-Same Q8 binary, paged branch confirmed firing (137 placements at B=8), throughput identical within noise:
-| | B=1 | B=8 | B=32 |
-|---|---|---|---|
-| stock decode | 61.2 | 171.7 | 377.0 |
-| paged decode | 62.7 | 170.8 | 376.8 |
-
-The patch is a placement-only correctness prototype; it does not implement the concurrency mechanics. It is single-stream-neutral and concurrency-neutral.
-
---
-
-## Root-cause diagnosis (nsys + code audit)
-
- **74.5% of GPU compute = `mul_mat_q`** (Q8_0 int8 MMQ GEMM, the MoE experts). Only cutlass kernel seen is `cutlass_80_tensorop` = **Ampere (sm_80)**, not Blackwell.
- ggml-cuda has **NO FP8 path** (no e4m3/e5m2 GEMM, no cuBLASLt FP8). Q8_0 runs the **Ampere-class int8 `mma.sync s8.s8.s32`** even on GB10 (`mma.cuh:924`, dispatched unconditionally `mmq.cu:307`).
- ggml-cuda **DOES** have a **native Blackwell FP4 path** (MXFP4 + NVFP4, `mma...kind::mxf4...e2m1`, `mma.cuh:1126`, gated `BLACKWELL_MMA_AVAILABLE`). Merged via #17906/#20644/#21074.
- **No fused MoE grouped GEMM**, no tcgen05/wgmma (warp-level `mma.sync` only).
- **Small per-expert GEMMs**: 512-tok ubatch → ~32 tok/expert (128 exp, top-8) → thin GEMMs, memory-bound, can't fill tensor-core tiles. vLLM processes 8192 tok/step → ~512 tok/expert → compute-bound + FP8.
- **The 45–69× gap is partly apples-to-oranges**: we compared llama Q8 (Ampere int8) vs vLLM FP8 (Blackwell). Upstream/NVIDIA benches put the *real* FP4-vs-FP8 prefill gap at **~25–50% long-context**, not 45–69×.
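-
-The tokens-per-expert estimate and the 45–69× ratio in the list above are simple
-arithmetic; a short Python sketch follows. It assumes perfectly uniform routing
-across experts, which real traffic will not match exactly, and the function name
-is hypothetical.
-
-```python
-def tokens_per_expert(tokens_per_step, n_experts=128, top_k=8):
-    """Average tokens routed to each expert, assuming uniform routing."""
-    return tokens_per_step * top_k / n_experts
-
-
-print(tokens_per_expert(512))    # llama.cpp 512-token ubatch
-print(tokens_per_expert(8192))   # vLLM 8192 tokens per step
-
-# Prefill ratio vLLM FP8 / llama Q8 from the concurrency sweep table.
-for b, llama_pp, vllm_pp in ((32, 2198, 99398), (64, 2194, 151990)):
-    print(f"B={b}: {vllm_pp / llama_pp:.1f}x")
-```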
-
-Key upstream refs: discussion #22042 (FP8 design: `ggml_mul_mat_ext` + scale tensors), #17906 (native MXFP4), #18250 (NVFP4-MoE closed not-planned).
-
---
-
-## The levers (cheap → expensive) — execution log
-
-### Lever 1 — NVFP4/MXFP4 model (use existing Blackwell FP4 path) + ubatch bump
-Status: **IN PROGRESS** — single-stream done, concurrency next.
-Quant: `llama-quantize F16 -> MXFP4_MOE` (type 38), 15.9 GiB / 4.47 BPW. (No NVFP4 in llama-quantize; MXFP4_MOE puts experts in MXFP4 = Blackwell FP4 MMA.)
-
-Single-stream (llama-bench), MXFP4 vs Q8 vs vLLM-FP8:
-| metric | llama Q8 | **llama MXFP4** | vLLM FP8 |
-|---|---|---|---|
-| prefill pp512 (ub512) | 2215 | **3061 ± 22** | 9155 |
-| prefill pp2048 (ub512) | ~2200 | 3137 ± 7 | — |
-| prefill pp2048 (**ub2048**) | — | **3441 ± 14** | — |
-| decode tg128 | 62.2 | **86.4 ± 0.3** | 52.45 |
-
-Findings:
- **MXFP4 decode 86.4 beats vLLM FP8 52.45 by 1.65×** (4-bit = less memory traffic; decode is memory-bound). llama wins decode outright.
- MXFP4 prefill +38% over Q8; **ub2048 lifts prefill +10%** (3137→3441). Single-stream prefill gap to vLLM: 4.1× (Q8) → **2.7× (MXFP4)**.
- Caveat: MXFP4 is 4-bit vs vLLM FP8 8-bit — not precision-matched. Fair match = vLLM NVFP4 (4-bit); pending.
-
-Concurrency (decode-phase aggregate `S_TG`, ub2048), MXFP4 vs Q8 vs vLLM-FP8:
-| B | Q8 dec | **MXFP4 dec** | vLLM dec | Q8 pp | **MXFP4 pp** | vLLM pp |
-|---|---|---|---|---|---|---|
-| 1 | 60.1 | **83.4** | 48.0 | 1080 | 1625 | 9644 |
-| 8 | 160.8 | **267.4** | 312.4 | 2189 | 3634 | 33373 |
-| 32 | 357.1 | **551.2** | 1171 | 2198 | 3651 | 99398 |
-| 64 | 519.2 | **770.2** | 2064 | 2194 | 3648 | 151990 |
-
-**Lever-1 verdict:** MXFP4 is a large, free win: decode +39–66% over Q8 (B=1 +39%, B=8 +66%, B=32 +54%, B=64 +48%), prefill plateau +66% (2200→3650). MXFP4 decode **wins at B=1, near-parity at B=8** vs vLLM; it only falls behind at high concurrency. **Prefill still plateaus (~3650)**: the MoE prefill GEMM doesn't scale with batch (no fused grouped GEMM; ubatch-limited). That plateau is the real remaining structural gap → Levers 2–3. Quality caveat unchanged (MXFP4 4-bit vs vLLM FP8 8-bit; quality not yet evaluated).
-
-### Lever 2 — `n_ubatch` / `n_batch` tuning (standalone)
-Status: **DONE + SHIPPED (auto-default implemented)**
-MXFP4 pp4096 vs ubatch: ub512=2994, **ub2048=3316**, ub4096=2820 (noisy), ub8192=3180.
-**Verdict:** prefill saturates at ub=2048; larger ubatch gives nothing. The ~3300–3650 ceiling is the **MoE GEMM kernel**, not batch size. → No more free config wins; the rest is kernel work (Levers 3–5).
-**Implemented:** `core/backend/hardware_defaults.go` — `EffectiveBatchSize` now defaults the physical batch
-(n_batch→n_ubatch alias) to **2048 on Blackwell** (`xsysinfo.IsNVIDIABlackwell`, cc≥12 / sm_120/121) when the
-config leaves `batch:` unset; explicit `batch:` always wins. Detection is a shared Go helper; placed at the
-common ModelOptions builder so it covers the C++ llama.cpp backend too. Tests: `hardware_defaults_internal_test.go`.
-
-### Lever 1b — Standard Q4 vs MXFP4 (what's actually MXFP4-specific)
-**Q4_K_M** (17.3 GiB) vs **MXFP4** (15.9 GiB), ub2048:
-| metric | Q4_K_M | MXFP4 | Q8 |
-|---|---|---|---|
-| decode tg128 | **93.5** | 86.4 | 62.2 |
-| prefill pp512 | 2164 | **3061** | 2215 |
-| prefill pp2048 | 2953 | **3441** | ~2200 |
-
-**Verdict:** the **decode win is just "4-bit"**: plain Q4_K_M matches/beats MXFP4 on decode (both memory-bound).
-MXFP4's *only* real edge is **prefill (+41% over Q4_K_M at pp512, +17% at pp2048)** via Blackwell FP4 tensor cores. So for shipping,
-**"4-bit quant + ubatch=2048" captures most of the win portably**; MXFP4 is a Blackwell-only prefill extra.
-
-### Lever 3 — Fused FP4/FP8 MoE grouped GEMM (+ activation-quant fusion)
-Status: **DESIGNED + PROFILED, not built** (multi-week kernel R&D). The single biggest remaining prefill win.
-
-**Decisive measurements:**
- Prefill does NOT scale with bigger single prompts (attention O(N²) confounds): MXFP4 pp2048=3295, pp8192=1524,
-  pp16384=2051. So the plateau is not a batch-size fix.
- Real gap is batched many-sequence prefill: B=32 llama 3651 vs vLLM 99398 = **27×**. llama.cpp MoE prefill runs
-  at only **~22 effective TFLOP/s** on the GB10, far below what the GPU can deliver, so there is large headroom.
- **nsys (MXFP4 pp2048):** `mul_mat_q<type39>` (MoE FP4 GEMM) = **37.2%**, `quantize_mmq_mxfp4` (act-quant) = 8.0%,
-  `mul_mat_q<type8>` (dense/attn, still Q8) = 10.1%, flash_attn = 8.8%. The native FP4 MMA *is* engaged — the
-  inefficiency is the **per-expert thin-tile MMQ scheduler** + **un-fused activation quant**.
-
-**Target (precise):** the ~45% in `mmq.cu`'s grouped MoE path (`ggml_cuda_mul_mat_q` + `ids`, `mmid.cu`). Replace
-the per-expert thin-tile scheduler with a CUTLASS-style grouped GEMM (full tiles regardless of tokens/expert) and
-fuse `quantize_mmq_mxfp4` into the permute/gather. Dense Q8 matmuls (10%) are the separate Lever-4 (FP8) target.
-Problem (measured): the prefill ceiling is the MoE expert GEMM. Today `ggml_cuda_mul_mat_q` with `ids`
-(`mmq.cu:127`) launches one grouped MMQ over a 3D grid (z = expert), but each expert's tile is thin
-(~tokens/expert columns) so int8/FP4 tensor cores run underfilled; throughput is memory-bound on weight
-streaming and flat vs batch.
-Approach:
- Replace the per-expert thin-tile scheduler with a **CUTLASS-style grouped GEMM** that concatenates all
-  experts' token-blocks into one problem with per-group offsets, so tiles are always full (m16n8k64 FP4 /
-  m16n8k32 FP8) regardless of per-expert token count. Mirrors vLLM's `fused_moe` + cutlass grouped GEMM.
- **Fuse activation quantization into the permute/gather** (the `quantize_mmq_q8_1`/FP4 quantize is currently a
-  separate 3.3% kernel) so the routed activations are quantized as they're scattered into expert order.
- Files: new kernel under `ggml/src/ggml-cuda/` (e.g. `moe-grouped-gemm.cu`) + dispatch hook in
-  `ggml_cuda_mul_mat_id` (`ggml-cuda.cu:2622`); reuse `mmid.cu` routing/`expert_bounds`.
- Effort: high (2–4 wks expert CUDA). Risk: numerics + sm_121 tile tuning. Expected payoff: the bulk of the
-  prefill gap (vLLM's MoE prefill advantage is mostly this). Upstream: #18250 (NVFP4-MoE) was closed
-  not-planned, so this would be a LocalAI patch or a fresh upstream proposal.
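-
-To make "one problem with per-group offsets" concrete, here is a host-side Python
-sketch of the bookkeeping only, with no GEMM and no CUDA. The function name
-`group_by_expert` is hypothetical and this is not the `mmid.cu` implementation;
-the `offsets` array merely plays the role that per-expert bounds play in such a
-scheme.
-
-```python
-def group_by_expert(ids, n_experts):
-    """ids[t] lists the experts chosen for token t.
-
-    Returns (order, offsets): `order` holds token indices sorted into expert
-    order, and rows offsets[e]:offsets[e+1] of `order` belong to expert e.
-    """
-    counts = [0] * n_experts
-    for experts in ids:
-        for e in experts:
-            counts[e] += 1
-
-    # Exclusive prefix sum of the counts gives each group's start offset.
-    offsets = [0] * (n_experts + 1)
-    for e in range(n_experts):
-        offsets[e + 1] = offsets[e] + counts[e]
-
-    cursor = offsets[:-1].copy()
-    order = [None] * offsets[-1]
-    for t, experts in enumerate(ids):
-        for e in experts:
-            order[cursor[e]] = t  # scatter token t into expert e's segment
-            cursor[e] += 1
-    return order, offsets
-
-
-# 3 tokens, top-2 routing over 3 experts.
-order, offsets = group_by_expert([[0, 2], [1, 2], [0, 1]], n_experts=3)
-print(order, offsets)
-```
-
-A grouped GEMM would then walk the concatenated rows once, using `offsets` to pick
-the weight matrix for each segment, and the activation quantization could happen
-during the scatter step rather than in a separate pass.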
-
-### Lever 4 — FP8 (e4m3) GEMM for dense layers
-Status: **DESIGNED, not built** (blocked on a core ggml API change).
-Problem: ggml-cuda has no FP8 matmul (only int8/FP4). vLLM runs qkv/o_proj/lm_head in FP8 on Blackwell
-tensor cores. Our dense layers run int8-MMQ or f16-cuBLAS.
-Approach (two options):
- (a) **cuBLASLt FP8**: route dense `mul_mat` through `cublasLtMatmul` with `CUDA_R_8F_E4M3` A/B and FP32
-  compute + scale pointers. Lowest kernel effort; gets library-tuned Blackwell FP8 immediately. Needs the
-  scale-tensor plumbing below.
- (b) **Hand-written sm_121 `mma.sync ...e4m3.e4m3.f32`** kernels in `mma.cuh`/`mmf.cu`. More control, more work.
- Prerequisite (both): the **`ggml_mul_mat_ext` / scale-tensor API** from upstream discussion #22042 —
-  per-tensor FP8 scales don't fit the block-scaled quant struct; `MUL_MAT`/`MUL_MAT_ID` must accept optional
-  scale tensors. This is a cross-cutting ggml change (graph + ops + all backends' fallbacks).
- Effort: high (API change is the hard part; cuBLASLt path is then moderate). Payoff: closes dense-layer
-  prefill/compute gap; complements Lever 3. Note: for *this* MoE model the experts dominate, so Lever 3 > 4.
-
-### Lever 5 — tcgen05 / wgmma-class kernels for large-prefill tiles
-Status: **DESIGNED, not built** (very high effort; last increment).
-Problem: ggml's tensor-core path is warp-level `mma.sync` only (no `wgmma`/`tcgen05`). Blackwell's
-tensor-memory `tcgen05` MMA (what CUTLASS uses) extracts substantially more throughput at large prefill tiles.
-Approach: introduce warpgroup/tcgen05 GEMM main-loops for the FP4/FP8 paths (effectively adopting CUTLASS
-3.x collective mainloops for sm_120/121), used when tile size is large enough (prefill). Decode (thin) keeps
-`mma.sync`.
- Effort: very high (CUTLASS-class engineering). Payoff: the final slice of large-prefill throughput; only
-  worth it after Levers 3–4 land. Realistically: depend on/upstream CUTLASS kernels rather than hand-roll.
-
---
-
-## Paged attention — complete implementation (after kernels are fair)
-The placement prototype is insufficient (measured: zero concurrency benefit). A real implementation needs all
-four gaps. CPU foundation already built & verified (`PagedKVManager` P0–P3, `README.md`); the in-model parts
-are unbuilt. **Build order and concrete design:**
-
-1. **On-demand block allocation from a shared pool** (capacity win — more concurrent seqs before OOM).
-   - Replace `find_slot`'s ring-buffer (`llama-kv-cache.cpp:818`) with `PagedKVManager` block allocation; the
-     KV tensor becomes a shared block pool `[n_embd, block_size*num_blocks]`, sequences draw blocks on demand
-     (already prototyped on CPU: `paged_kv_manager.{h,cpp}`, `test_ggml_paged_rw.cpp`).
-   - The win is to be measured where it counts: max concurrent sequences before OOM (not yet benchmarked; the benchmark needs this step built first).
-2. **Gather-read** so each seq attends only its own blocks (`get_k`/`get_v` `:1145/1165` → `ggml_get_rows`
-   gather into scratch, then existing attention). Numerically proven on CPU (`test_ggml_paged_attn.cpp`,
-   7.5e-08 vs reference). Needs `build_attn_paged` branch in `llama-graph.cpp` + Gate 0 in a real model.
-3. **Continuous batching / scheduler** (no head-of-line blocking on mixed-length traffic). New scheduler in
-   the server slot path; admit/evict at block granularity; the dimension where paging beats llama.cpp's
-   current static batching. This is where the *real* concurrency win lives (vs our synthetic uniform test).
-4. **Automatic prefix sharing** (block-hash dedup; `PagedKVManager::{compute_block_hashes,get_computed_blocks}`
-   already implemented & tested). Cross-tenant shared system prompts reuse physical blocks.
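-
-Block-hash prefix sharing (item 4) can be illustrated with a small Python sketch.
-This is an assumption-laden illustration of the general technique, chaining each
-full block's hash with its parent's, and not the `PagedKVManager` code; the names
-`block_hashes` and `shared_prefix_blocks` and the use of SHA-256 are hypothetical.
-
-```python
-import hashlib
-
-
-def block_hashes(tokens, block_size):
-    """Hash every full block, chaining in the previous block's hash."""
-    hashes, parent = [], b""
-    n_full = len(tokens) - len(tokens) % block_size  # ignore the partial tail
-    for start in range(0, n_full, block_size):
-        block = tokens[start:start + block_size]
-        payload = parent + b"|" + ",".join(map(str, block)).encode()
-        parent = hashlib.sha256(payload).digest()
-        hashes.append(parent)
-    return hashes
-
-
-def shared_prefix_blocks(a, b):
-    """Number of leading blocks two sequences can share physically."""
-    n = 0
-    for x, y in zip(a, b):
-        if x != y:
-            break
-        n += 1
-    return n
-
-
-system_prompt = list(range(32))
-req_a = system_prompt + [100, 101, 102]
-req_b = system_prompt + [200, 201]
-print(shared_prefix_blocks(block_hashes(req_a, 16), block_hashes(req_b, 16)))
-```
-
-Because each hash depends on its parent, two requests match on block `i` only if
-their entire prefixes up to that block are identical, which is the property that
-makes reusing the physical blocks safe.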
-
-Status: design in `2026-06-19-paged-attention-llamacpp-design.md`; CPU P0–P3 done; in-model #1–#4 unbuilt.
-**Then** measure concurrency in paging's real scenarios — **memory-pressured (max seqs before OOM)** and
-**mixed-length continuous batching** — on the MXFP4 (fair-quant) footing, not the uniform/over-provisioned
-test that (correctly) showed no benefit.
-
-> Reality check from this session's data: paged attention is a **capacity + scheduling** win, not a per-token
-> speed win. On GB10 with 119 GB unified memory and uniform requests we are not memory-bound at B≤64, so the
-> placement prototype showed nothing. Paging's value appears under memory pressure (many/long sequences) and
-> bursty mixed-length traffic. The per-token throughput gap is a **kernel** problem (Levers 1–3), separate
-> from paging.
-
---
-
-## Implementation plan A — Lever 3: FP4 MoE GEMM to vLLM parity
-
-Goal: lift batched MoE prefill from ~3.65k t/s (B=32) toward vLLM's ~99k. Root cause (profiled):
-`mul_mat_q<MXFP4>` runs at ~22 effective TFLOP/s — warp-level `mma.sync`, not Blackwell tcgen05.
-Cheap knobs are exhausted (ubatch saturates at 2048; `GGML_CUDA_FORCE_CUBLAS` is a no-op 3419↔3423;
-tile width already full at mmq_x=128). So parity needs kernel work, done iteratively on the DGX
-(`~/llama.cpp-pr24423`, editable + rebuildable; diffs captured as `patches/`).
-
-Phases (each: hypothesis → edit `ggml/src/ggml-cuda/` → `cmake --build build --target llama-bench` →
-`llama-bench` MXFP4 pp/concurrency → record):
-1. **Cheap kernel tweaks (low confidence, fast).** nwarps (occupancy), `mmq_y` tile, stream-k on/off,
-   FP4 load-tile path. Measure each. Likely small (<1.3x) — these don't change the warp-MMA ceiling.
-   - **Result (nwarps):** DEAD END. `nwarps` is locked by `static_assert(nwarps*tile_C::I == mmq_y)`
-     (mmq.cuh:3234) → nwarps=8 for mmq_y=128. Can't raise occupancy without co-scaling mmq_y to 256
-     (nwarps=16), which blows Blackwell shared-memory limits. The MMQ constants are tightly coupled;
-     they are not freely tunable. This confirms parity needs the kernel rewrite (phase 3), not knobs.
-2. **Fuse activation quant** (`quantize_mmq_mxfp4`, 8%) into the permute/gather. Removes a kernel +
-   a global round-trip. Tractable, ~1.1x.
-   - **Result:** NOT AVAILABLE as a cheap patch. `quantize_mmq_fp4_cuda` (mmq.cu:200) *already* takes
-     `ids_src1` — the gather is already fused into the quant. The only remaining fusion is quantize-on-load
-     *inside* the GEMM hot loop (intricate, ~8% ceiling, risky). ORippler's #24481 fuses the decode (MMVQ)
-     post-scale and intends a "BS>1" (prefill) follow-up — unwritten. Marginal; skip.
-
-**Upstream survey (2026-06):** there is NO tcgen05/CUTLASS grouped-GEMM MoE kernel in ggml — not merged,
-not in-flight, not a draft (Discussion #18369 is talk, no PR; #18250 closed not-planned). CUTLASS is not a
-dependency (the profile's `cutlass_80_tensorop` is cuBLAS-internal). No fork has a portable MoE kernel
-(croll83/llama.cpp-dgx is GatedDeltaNet-focused). Maintainer signal (woachk on #17906): "the path forward
-is to wait for cuTile C++." So **nothing to cherry-pick; phase 3 is genuinely from-scratch.**
-3. **The real lever — tcgen05 / CUTLASS FP4 grouped GEMM.** Replace the per-expert MMQ scheduler with a
-   CUTLASS 3.x collective-mainloop grouped GEMM (sm_120a, `e2m1` block-scaled, tcgen05 tensor-memory MMA),
-   one problem over all experts with per-group offsets, fused act-quant. This is what vLLM/FlashInfer use.
-   Multi-week; the honest path to parity. Prefer **upstream ggml** (issue drafted) over a private patch.
-4. **Full-model low precision.** Quantize dense layers (qkv/o_proj/lm_head, the 10% Q8) to FP4/FP8 too so
-   the whole prefill runs on FP4 tensor cores, not int8-MMQ.
-Exit per phase: measured t/s recorded here; stop a phase when it's a dead end (recorded as such).
-Matching vLLM realistically requires phase 3; phases 1–2 are the warm-up + de-risking.
-
-## Implementation plan B — Complete paged attention (the pivot)
-
-CPU foundation done (P0–P3, `README.md`): vLLM-parity block manager + ggml write/gather + attention
-numerics + placement Gate 0 (token-identical in-model). Remaining = make it deliver the multi-tenant wins.
-Phases:
-1. **On-demand shared-block pool** — replace `find_slot` ring buffer (`llama-kv-cache.cpp:818`) with
-   `PagedKVManager` block allocation; KV tensor = `[n_embd, block_size*num_blocks]` shared pool. Win:
-   fit more concurrent seqs before OOM. Test: max concurrent seqs at fixed budget vs contiguous.
-2. **Gather-read** (`get_k/get_v` `:1145/1165` → `ggml_get_rows` into scratch) + `build_attn_paged` branch
-   in `llama-graph.cpp`. Numerically proven on CPU (7.5e-08). Gate 0: token-identical multi-seq.
-3. **Continuous batching / scheduler** — admit/evict at block granularity in the server slot path. The
-   real concurrency win on mixed-length traffic (where the placement prototype showed nothing).
-4. **Automatic prefix sharing** — block-hash dedup (`PagedKVManager::{compute_block_hashes,get_computed_blocks}`
-   already implemented + tested). Cross-tenant shared system prompts reuse physical blocks.
-Then benchmark in paging's real regimes — **memory-pressured** + **mixed-length continuous batching** — on
-the MXFP4 (fair-quant) footing. Note: GB10's 119 GB unified memory means win-1 needs genuine pressure
-(long/many seqs) to show; the win is capacity + scheduling, not per-token speed.
-
-## Honest scope note
-Levers 3–5 and the complete paged implementation are each substantial (weeks of expert CUDA/systems work). This doc tracks what is **measured** vs **designed** vs **not-yet-built**, and never claims a number that wasn't run on the box.
--- a/backend/cpp/llama-cpp/paged/FP4_GROUPED_MOE_KERNEL.md
+++ b/backend/cpp/llama-cpp/paged/FP4_GROUPED_MOE_KERNEL.md
@@ -1,59 +0,0 @@
-# FP4 grouped-GEMM MoE kernel (Lever 3) — scaffold + implementation plan
-
-The one piece of work that actually closes the vLLM gap on Blackwell (GB10/sm_121). Both phases are
-bottlenecked by the same kernel: `mul_mat_q<MXFP4>` (warp-level `mma.sync` grouped MMQ, ~22 TFLOP/s) is
-**37%** of prefill and **54.6%** of decode-at-B=64 GPU time (`BENCHMARKS.md`). Paged attention can't touch
-it (proven). The fix is a CUTLASS-3.x collective-mainloop grouped GEMM with block-scaled `e2m1` operands via
-tcgen05 tensor-memory MMA — what vLLM/FlashInfer/TRT-LLM use.
-
-## Scaffold (DONE — builds clean, default byte-identical)
-
-Lives in the DGX checkout `~/llama.cpp-pr24423/ggml/src/ggml-cuda/` (to be rebased onto the pin as a patch /
-upstreamed). Captured diff: `patches/kernel/0001-fp4-grouped-moe-scaffold.patch`.
-
-- `fp4-grouped-moe.{cuh,cu}`: entry `ggml_cuda_fp4_grouped_moe(ctx, src0, src1, ids, dst) -> bool`
-  (true = handled, false = fall back to MMQ). Gated behind env `GGML_CUDA_FP4_GROUPED`. Currently always
-  returns false → **default build unchanged**.
-- Hook in `ggml_cuda_mul_mat_id` (the MoE dispatch), before the `ggml_cuda_mul_mat_q(...ids...)` call:
-  `if (ggml_cuda_fp4_grouped_moe(...)) return;`. Builds via the `file(GLOB "*.cu")` (re-run cmake configure
-  after adding the file; GLOB is evaluated at configure time).
-
-This is the integration seam. The kernel fills the stub.
-
-## Implementation phases (each: build on GB10 → numerical parity vs `mul_mat_q<MXFP4>` → bench)
-
-1. **Reference grouped GEMM (correctness first, slow OK).** Per-expert problem sizes + offsets from `ids`;
-   dequant `e2m1`+scales → BF16; loop CUTLASS (or cuBLAS) per group. Gate: output matches MMQ within fp tol
-   on a 2-expert toy + the real model (token-identical greedy). Establishes the harness + the data plumbing.
-2. **CUTLASS GemmGrouped, sm_120a, BF16 operands.** Replace the loop with one
-   `cutlass::gemm::device::GemmGrouped` launch over all experts (per-group offsets). Measures the grouping win alone.
-3. **Block-scaled FP4 operands (the real lever).** `e2m1` A/B with `e8m0`(MX)/`e4m3`(NV) block scales via the
-   Blackwell scaled-MMA collective (tcgen05 tensor-memory). This is where the TFLOP/s jumps. Needs CUTLASS
-   3.x + sm_120a; verify the block-scale layout matches ggml's MXFP4/NVFP4 packing.
-4. **Fuse activation quant** (the F32→FP4 of src1) into the gather/permute prologue.
-5. **Enable by default** on sm_120/121 when parity holds + faster; keep the env as an escape hatch.
-
-## Dependencies / decisions
-
-- **CUTLASS is not currently a ggml dependency** (the profile's `cutlass_80_tensorop` is cuBLAS-internal).
-  Adding it = submodule/fetch + include dir, gated to CUDA sm_120+. Float the approach with ggml maintainers
-  early (Discussion #18369 is the home; JohannesGaessler asked to discuss arch before big kernel work).
-- Target sm_120a/121a (consumer Blackwell). Datacenter Blackwell (sm_100) is a separate tile config.
-- Risk: needs ncu-driven iteration on the GB10; this is multi-week, expert-CUDA. No upstream base to fork
-  (exhaustive search confirmed). Net-new value upstream.
-
-## DENSE scope — RESOLVED (TODO #28, benchmarked): dense needs an FP4 GEMM too
-
-Benchmarked Qwen3-32B dense, vLLM W4A16 vs llama.cpp Q4_K_M (`BENCHMARKS.md`). **Dense prefill is 7.6–32×
-behind** (llama int8-MMQ plateaus ~765 t/s; vLLM FP4 scales to 24.4k); decode ~parity at B=1, 2.2× at B=64.
-So the kernel track is **two kernels, not one**:
-
-- **(a) Dense FP4 GEMM**: a plain non-grouped CUTLASS/tcgen05 block-scaled FP4 GEMM. **Simpler than grouped;
-  land this FIRST**: it's the easier first kernel, benefits every dense model, and de-risks the FP4 collective
-  before the grouped variant. Hook: the non-MoE `ggml_cuda_mul_mat_q` (no `ids`) path.
-- **(b) MoE grouped FP4 GEMM**: the scaffold above (`ggml_cuda_fp4_grouped_moe`), per-expert offsets.
-
-Both share the same block-scaled `e2m1` collective; (a) is (b) with one group. Suggested order: build (a),
-prove the FP4 collective + parity harness, then generalize to (b). (Aside: full W4A4 NVFP4 doesn't run on
-GB10 today — FlashInfer ships no FP4 cubins for sm_121, so the dense `mm_fp4` kernel hangs/returns zeros; the
-W4A16 Marlin path is the fast, correct one and is the fair comparison. See `BENCHMARKS.md` for the root cause.)
--- a/backend/cpp/llama-cpp/paged/MXFP4_QUALITY.md
+++ b/backend/cpp/llama-cpp/paged/MXFP4_QUALITY.md
@@ -1,140 +0,0 @@
-# MXFP4-dense vs Q4_K_M quality check (Qwen3, GB10 / DGX Spark)
-
-## Question
-
-MXFP4-quantized **dense** Qwen3-32B is measurably faster on GB10 (Blackwell) than
-Q4_K_M: ~1.58x concurrent prefill, ~1.2x decode, for free (just a requantize that
-routes onto the FP4-MMA kernel). Before LocalAI recommends MXFP4-dense as a Blackwell
-default, we must confirm its **quality is acceptable versus Q4_K** (Q4_K is normally the
-stronger 4-bit format).
-
-Critical caveat going in: the pre-existing `~/bench/q3-32b-mxfp4-dense.gguf` was built
-with `--allow-requantize`, so it was suspected to be **double-quantized** (Q4_K_M ->
-MXFP4), which would unfairly penalize MXFP4. The goal here was a *fair* answer.
-
-## Verdict
-
-**Do NOT recommend MXFP4-dense as a quality-equivalent replacement for Q4_K on
-Blackwell.** A clean apples-to-apples test (same BF16 source, both 4-bit, no imatrix)
-shows MXFP4-dense carries a **large** quality penalty that Q4_K does not:
-
-- Q4_K_M costs **+2.6%** perplexity vs the BF16 baseline.
-- MXFP4-dense costs **+30.8%** perplexity vs the BF16 baseline (i.e. **+27.5% worse
-  than Q4_K**).
-
-The double-quant suspicion was correct but it was **not** the main culprit: even a clean
-MXFP4-from-BF16 is dramatically worse than Q4_K. The ~1.58x prefill / ~1.2x decode
-speedup is real, but it is not free on quality. MXFP4-dense output is still coherent (not
-gibberish), so it is usable where raw throughput dominates and a quality hit is
-acceptable, but it must not be presented as a drop-in, quality-neutral Q4_K replacement.
-
-## Evidence
-
-### 1. Provenance of the existing 32B MXFP4 (it is double-quant)
-
-`~/dense_mxfp4.sh` (mtime matches the `q3-32b-mxfp4-dense.gguf` mtime, Jun 20 09:47)
-created it:
-
-```
-SRC=$HOME/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf      # <-- source is Q4_K_M, not F16/BF16
-OUT=$HOME/bench/q3-32b-mxfp4-dense.gguf
-$QB --allow-requantize --tensor-type "attn=mxfp4" --tensor-type "ffn=mxfp4" \
-    "$SRC" "$OUT" MXFP4_MOE
-```
-
-Confirmed **double-quantized** (Q4_K_M -> MXFP4). Any PPL measured on this file
-overstates MXFP4's true penalty, so the 32B number below is a loose upper bound, not the
-fair answer.
-
-### 2. 32B quick read (wikitext-2-raw test, 50 chunks, ctx 512, ngl 99)
-
-`llama-perplexity`, PR build `~/llama.cpp-pr24423/build` (sm_121):
-
-| 32B model | PPL | vs Q4_K |
-|---|---|---|
-| Qwen3-32B-Q4_K_M | **7.3865** +/- 0.177 | - |
-| q3-32b-mxfp4-dense (double-quant) | **8.4638** +/- 0.206 | +14.6% |
-
-MXFP4 is much worse than Q4_K here, **and** it is double-quant, so the quick read is
-unfair -> escalated to a clean small-model comparison.
-
-### 3. Fair comparison: clean small dense model (Qwen3-4B BF16)
-
-The MXFP4-vs-Q4_K delta is a *format* property and roughly model-size-independent, so a
-small model gives a fast, clean answer. Downloaded `Qwen3-4B-BF16.gguf` (unsloth, ~7.7
-GiB) and quantized it **from that same BF16 source** to both formats with the identical
-recipe used for the 32B (no `--allow-requantize` needed, no imatrix on either side):
-
-```
-llama-quantize  q3-4b-bf16.gguf  q3-4b-q4km.gguf   Q4_K_M
-llama-quantize --tensor-type attn=mxfp4 --tensor-type ffn=mxfp4 \
-               q3-4b-bf16.gguf  q3-4b-mxfp4.gguf  MXFP4_MOE
-```
-
-Perplexity (wikitext-2-raw test, 50 chunks, ctx 512, ngl 99):
-
-| Qwen3-4B | size | PPL | vs BF16 | vs Q4_K |
-|---|---|---|---|---|
-| BF16 (baseline) | 7672 MiB | **13.3188** +/- 0.416 | - | - |
-| Q4_K_M | 2497 MiB | **13.6605** +/- 0.426 | **+2.57%** | - |
-| MXFP4 (clean) | 2236 MiB (4.66 BPW) | **17.4183** +/- 0.561 | **+30.78%** | **+27.5%** |
-
-This is the apples-to-apples quality answer: **clean MXFP4-from-BF16 is ~12x more lossy
-than Q4_K relative to the BF16 baseline** (30.8% vs 2.6%). Notably the clean-4B MXFP4-vs-
-Q4_K gap (+27.5%) is *wider* than the 32B double-quant gap (+14.6%), consistent with
-smaller models being more quantization-sensitive - the double-quant did not invent the
-problem, it is intrinsic to the format as quantized by `llama-quantize`.
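-
-The percentages in the table can be re-derived from the reported PPL values as a sanity check (relative
-increase = PPL of the quant / PPL of the reference - 1). This is only arithmetic on the numbers above, not
-new data:
-
-```math
-\begin{aligned}
-\text{Q4\_K\_M vs BF16:} &\quad 13.6605 / 13.3188 - 1 \approx 0.0257 \ (+2.57\%) \\
-\text{MXFP4 vs BF16:} &\quad 17.4183 / 13.3188 - 1 \approx 0.3078 \ (+30.78\%) \\
-\text{MXFP4 vs Q4\_K\_M:} &\quad 17.4183 / 13.6605 - 1 \approx 0.2751 \ (+27.5\%) \\
-\text{loss ratio:} &\quad 30.78 / 2.57 \approx 12
-\end{aligned}
-```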
-
-### 4. Coherence spot-check (32B, llama-simple, n=60)
-
-MXFP4-dense 32B is fully coherent, not degraded gibberish:
-
-- "The capital of France is" -> MXFP4: "...Paris, is located near the Seine River..."
-  (correct); Q4_K similar.
-- "Q: What is 17 multiplied by 23? A:" -> MXFP4 reasons via the distributive property
-  (sound); Q4_K answers 391 directly (correct).
-- "def fibonacci(n):" -> both emit valid Python.
-
-So the quality cost shows up as measurably higher perplexity (and would surface on harder
-/ longer tasks), not as obviously broken text at short generation lengths.
-
-## Why
-
-`MXFP4_MOE` is a 4-bit float format (E2M1 values, shared E8M0 scale per block of 32,
-round-to-nearest) designed for MoE expert tensors (gpt-oss et al.) with a coarse
-per-block scale. Q4_K uses 6-bit superblock scales plus per-sub-block mins - materially
-better for dense attention/FFN weights. Forcing MXFP4 onto dense layers to reach the FP4
-kernel trades ~1.58x prefill for a large accuracy loss. The FP4-MMA speed path is real,
-but the weights it accepts (MXFP4 here) are lossy for dense.
-
-## Caveat, stated precisely
-
-This measures **llama.cpp's `llama-quantize` MXFP4** (OCP MX FP4, RTN, **no imatrix**)
-against **llama.cpp's Q4_K_M** (k-quant superblocks, also no imatrix here). It is a fair
-format-vs-format comparison of exactly what LocalAI would ship if it routed a requantize
-through this path. It does **not** claim FP4 is fundamentally unviable on Blackwell:
-
-- An imatrix-aware MXFP4, or a better FP4 format with two-level scaling
-  (**NVFP4** - there are already `q3-32b-nvfp4` / `q3-32b-nvfp4a16` dirs on the box),
-  may close much of this gap and is the more promising Blackwell FP4 path to evaluate.
-- The result is for Qwen3 dense; other families may differ in magnitude but the
-  format-level disadvantage of plain MXFP4 RTN vs Q4_K is expected to hold.
-
-## Recommendation
-
-- **Do not** ship a blanket "use MXFP4-dense on Blackwell" recommendation as a Q4_K
-  quality equivalent. The ~1.58x prefill / ~1.2x decode win comes with a real ~30% PPL
-  inflation (vs ~2.6% for Q4_K). Q4_K_M stays the right dense default on Blackwell.
-- If exposing MXFP4-dense at all, gate it as an explicit **throughput-over-quality**
-  option with the perplexity caveat surfaced, not a default.
-- MXFP4/FP4 remains correct where the model is trained for it (MoE / gpt-oss-style).
-  Pursue **NVFP4** (and/or imatrix-aware FP4) as the quality-competitive Blackwell FP4
-  format before making any FP4-dense recommendation.
-
-## Reproduction (DGX Spark, GB10, build `~/llama.cpp-pr24423/build`, sm_121)
-
-- Dataset: `~/wikitext-2-raw/wiki.test.raw` (wikitext-2-raw-v1 test).
-- 32B: `~/ppl32b.sh` -> `~/ppl32b.out`; coherence `~/coh32b.sh` -> `~/coh32b.out`.
-- Clean 4B: `~/fair4b.sh` -> `~/fair4b.out` (quantize + 3x perplexity).
-- All runs `-ngl 99`, `--chunks 50`, `-c 512`. GB10 thermal-throttles but PPL is a
-  correctness metric, so thermal state does not affect these numbers.
--- a/backend/cpp/llama-cpp/paged/Makefile
+++ b/backend/cpp/llama-cpp/paged/Makefile
@@ -1,41 +0,0 @@
-CXX ?= g++
-CXXFLAGS ?= -std=c++17 -O2 -Wall -Wextra -I.
-
-TESTS = test_free_block_queue test_block_pool test_paged_kv_manager test_prefix_cache
-BINS  = $(addprefix tests/,$(TESTS))
-
-all: $(BINS)
-
-tests/%: tests/%.cpp paged_kv_manager.cpp paged_kv_manager.h
-	$(CXX) $(CXXFLAGS) -o $@ $< paged_kv_manager.cpp
-
-check: all
-	@for t in $(BINS); do echo "== $$t =="; ./$$t || exit 1; done
-
-paged-bench: paged-bench.cpp paged_kv_manager.cpp paged_kv_manager.h
-	$(CXX) $(CXXFLAGS) -o $@ paged-bench.cpp paged_kv_manager.cpp
-
-bench: paged-bench
-	./paged-bench
-
-# --- Optional ggml integration test (Phase 1: paged write/gather mechanism) ---
-# Requires a built ggml. Override these to point at your checkout / build:
-#   make ggml-check GGML_SRC=<llama.cpp>/ggml GGML_BUILD=<ggml-build>
-GGML_SRC   ?= ../../llama-cpp-fallback-build/llama.cpp/ggml
-GGML_BUILD ?= /tmp/ggml-build
-GGML_LIBDIR = $(GGML_BUILD)/src
-
-GGML_TESTS = test_ggml_paged_rw test_ggml_paged_attn
-GGML_BINS  = $(addprefix tests/,$(GGML_TESTS))
-
-tests/test_ggml_%: tests/test_ggml_%.cpp paged_kv_manager.cpp paged_kv_manager.h
-	$(CXX) $(CXXFLAGS) -I$(GGML_SRC)/include -o $@ $< paged_kv_manager.cpp \
-		-L$(GGML_LIBDIR) -lggml -lggml-base -lggml-cpu -Wl,-rpath,$(GGML_LIBDIR)
-
-ggml-check: $(GGML_BINS)
-	@for t in $(GGML_BINS); do echo "== $$t =="; ./$$t || exit 1; done
-
-clean:
-	rm -f $(BINS) $(GGML_BINS) paged-bench
-
-.PHONY: all check ggml-check clean
--- a/backend/cpp/llama-cpp/paged/NVFP4_TEST.md
+++ b/backend/cpp/llama-cpp/paged/NVFP4_TEST.md
@@ -1,114 +0,0 @@
-# NVFP4-dense on DGX Spark (GB10, sm_121): is it the quality-preserving FP4 win MXFP4 wasn't?
-
-Test rig: DGX Spark GB10 (sm_121), `~/llama.cpp-pr24423/build` (PR #24423, FP4 MMA + NVFP4
-kernel), wikitext-2-raw, clean BF16 source `q3-4b-bf16.gguf` (the same source used for the
-established MXFP4 / Q4_K fair test). NVFP4 and all comparison quants were produced clean from
-BF16, no imatrix.
-
-## Verdict (short)
-
-YES on all the load-bearing questions, with one honest caveat:
-
-1. llama.cpp CAN produce an NVFP4 GGUF.
-2. NVFP4 quality is Q4_K-class, NOT MXFP4-class: +7.4% PPL vs BF16 (MXFP4 was +30.8%). It is
-   slightly behind Q4_K (+4.8% relative) but in the same ballpark, not on the quality cliff.
-3. NVFP4 routes onto the FP4 MMA kernel and gets the FP4 prefill speedup: ~1.29x Q4_K on the
-   4B, tracking MXFP4 to within 5% (MXFP4 hit 1.58x on the 32B; NVFP4 should track it there too).
-4. Output is coherent.
-
-Bottom line: NVFP4-dense IS the quality-preserving FP4 win MXFP4 wasn't. It delivers
-essentially the full FP4 prefill speedup at roughly Q4_K quality, where MXFP4 paid a ~27.5% PPL tax
-relative to Q4_K for the same speed. LocalAI can support/recommend NVFP4-dense on Blackwell for prefill-bound
-workloads, with the caveat that it is marginally (~5%) behind Q4_K on perplexity; an imatrix-guided
-NVFP4 quant would likely close most of that remaining gap.
-
-## 1. Feasibility: can llama-quantize produce an NVFP4 GGUF? YES
-
-- The type exists with a full quantize path, not just a kernel:
-  - `GGML_TYPE_NVFP4 = 40` (`ggml.h`), `GGML_FTYPE_MOSTLY_NVFP4 = 26`
-  - `quantize_nvfp4` / `quantize_row_nvfp4_ref` / `dequantize_row_nvfp4` registered in `ggml.c`
-  - type_name is `"nvfp4"`, block `QK_NVFP4` (per-16 FP8/E4M3 block scale + global scale)
-- NVFP4 is NOT a top-level `llama-quantize` ftype (no `NVFP4` entry in the allowed-types list,
-  no reference in `tools/quantize/quantize.cpp` or `src/llama-quant.cpp`), BUT
-  `--tensor-type name=nvfp4` resolves it: `parse_ggml_type` matches the arg against
-  `ggml_type_name(...)`, which returns `"nvfp4"`. This is the exact same mechanism that produced
-  MXFP4-dense.
-- Recipe used (mirrors the MXFP4-dense GGUF byte-for-byte in structure: token_embd Q8_0, all
-  norms F32, all 2D attn+ffn weights to FP4):
-
-  ```
-  llama-quantize --tensor-type "attn=nvfp4" --tensor-type "ffn=nvfp4" \
-                 q3-4b-bf16.gguf q3-4b-nvfp4.gguf Q8_0
-  ```
-
-  Result: `q3-4b-nvfp4.gguf`, 2343.93 MiB, 4.89 BPW, ~5 s. (MXFP4-dense was 2350 MiB; same shape.)
-  Every `blk.N.attn_*` and `blk.N.ffn_*` reported `converting to nvfp4`; token_embd Q8_0; norms F32.
-
-The on-box `~/bench/q3-32b-nvfp4*` dirs are vLLM HF safetensors (already 4-bit), not GGUF, and
-do not feed llama.cpp - confirmed and irrelevant.
-
-## 2. Quality (decisive): NVFP4 is Q4_K-class, not MXFP4-class
-
-`llama-perplexity -f wiki.test.raw --chunks 50 -c 512 -ngl 99`, all clean from the same BF16 4B:
-
-| Quant   | PPL    | vs BF16  | vs Q4_K  |
-|---------|--------|----------|----------|
-| BF16    | 13.32  | -        | -        |
-| Q4_K_M  | 13.66  | +2.6%    | -        |
-| NVFP4   | 14.31  | +7.4%    | +4.8%    |
-| MXFP4   | 17.42  | +30.8%   | +27.5%   |
-
-(NVFP4 measured this run: Final PPL = 14.3097 +/- 0.4457.)
-
-NVFP4 lands much closer to Q4_K (gap 0.65 PPL) than to MXFP4 (gap 3.11 PPL). MXFP4's finer
-sibling delivers: the two-level scaling (per-16 FP8 block scale + global scale) recovers almost
-all of the quality MXFP4's coarse per-32 E8M0 scale threw away. It is not quite Q4_K, but it is
-firmly in the "acceptable 4-bit" regime, not the lossy one.
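-
-For reference, the NVFP4 row and the two gaps quoted above follow from the rounded PPL values in the table
-(relative increase = PPL of the quant / PPL of the reference - 1); this restates the measurements and adds
-no new ones:
-
-```math
-\begin{aligned}
-\text{NVFP4 vs BF16:} &\quad 14.31 / 13.32 - 1 \approx 0.074 \ (+7.4\%) \\
-\text{NVFP4 vs Q4\_K\_M:} &\quad 14.31 / 13.66 - 1 \approx 0.048 \ (+4.8\%) \\
-\text{gap to Q4\_K\_M:} &\quad 14.31 - 13.66 = 0.65 \\
-\text{gap to MXFP4:} &\quad 17.42 - 14.31 = 3.11
-\end{aligned}
-```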
-
-## 3. Speed: NVFP4 routes onto the FP4 MMA kernel
-
-No clean BF16 32B was on the box (only the vLLM NVFP4 safetensors and the Q4_K/MXFP4 32B GGUFs),
-so per the brief this is the 4B speed signal - a 3-way cold A/B on the SAME 4B model, 45 s
-cooldowns between runs (`-npp 512 -ntg 128 -npl 8,32,64 -b 2048 -ub 2048 -ngl 99`):
-
-Prefill S_PP (t/s):
-
-| B   | Q4_K   | NVFP4  | MXFP4  | NVFP4 / Q4_K | NVFP4 / MXFP4 |
-|-----|--------|--------|--------|--------------|---------------|
-| 8   | 4862   | 6313   | 6602   | 1.30x        | 0.96x         |
-| 32  | 5020   | 6497   | 6836   | 1.29x        | 0.95x         |
-| 64  | 5031   | 6490   | 6831   | 1.29x        | 0.95x         |
-
-- NVFP4 prefill is within ~5% of MXFP4 at every batch size -> both land on the same FP4 MMA
-  kernel. NVFP4 does NOT fall back to a slow path.
-- NVFP4 beats Q4_K's int8-MMQ prefill by ~1.29x on the 4B. The established 32B figures were
-  Q4_K S_PP ~767 and MXFP4 ~1209 (1.58x); since NVFP4 tracks MXFP4 to within 5%, NVFP4 on the
-  32B should likewise approach ~1.5x. (The 4B shows a smaller multiplier than the 32B because a
-  smaller model spends proportionally less time in the matmul the FP4 kernel accelerates.)
-- Token-gen (S_TG) is comparable across all three (memory-bound), as expected.
-
-## 4. Coherence
-
-`llama-simple` (llama-cli hangs - avoided), NVFP4 4B:
-- "The capital of France is" -> "...Paris. ...Germany is in Berlin. ...Italy is in Rome.
-  ...Spain is in Madrid. ...Netherlands is in Amsterdam." (all correct)
-- "Q: What is 17 plus 25? A:" -> "42." (correct)
-
-Coherent and factually accurate.
-
-## Recommendation for LocalAI on Blackwell
-
-Support and recommend NVFP4-dense as the FP4 prefill option on Blackwell (sm_120/121), produced
-via `--tensor-type attn=nvfp4 --tensor-type ffn=nvfp4` over a BF16 source (token_embd Q8_0,
-norms F32). It gives ~the full FP4 prefill speedup (FP4 MMA kernel, ~1.3x Q4_K on 4B and
-expected ~1.5x on larger models) at roughly Q4_K quality (+7.4% PPL vs BF16). This is the win
-MXFP4 failed to deliver: MXFP4 paid a +30.8% quality tax for the same speed and was rejected.
-
-Caveats / follow-ups:
-- NVFP4 is still ~4.8% behind Q4_K on PPL. For quality-first deployments where the prefill win
-  does not matter, Q4_K_M remains the better pick.
-- These NVFP4/Q4_K numbers are clean (no imatrix). An imatrix-guided NVFP4 quant is the obvious
-  next step and would likely close most of the remaining gap to Q4_K - worth measuring before a
-  blanket recommendation.
-- A direct 32B NVFP4-vs-Q4_K speed run (needs a clean BF16 32B GGUF, not on the box) would
-  confirm the projected ~1.5x; the 4B signal plus the MXFP4-tracking already make this very likely.
--- a/backend/cpp/llama-cpp/paged/PAGED_KV_HIGH_CONCURRENCY.md
+++ b/backend/cpp/llama-cpp/paged/PAGED_KV_HIGH_CONCURRENCY.md
@@ -1,115 +0,0 @@
-# Paged KV at high concurrency on a single GB10 - the datacenter-scale test
-
-Closes the open question left by `PR22569_EVAL.md`: that eval could not test the
-"paged KV unlocks thousands of sequences" thesis because **both** KV paths hit the
-`LLAMA_MAX_SEQ=256` compile cap, and the 32B-dense model it used is compute-bound
-(plateaus by npl=128 for an unrelated reason). This run removes both confounders:
-**recompiled `LLAMA_MAX_SEQ=2048`** and used a **bandwidth-bound model (Qwen3-1.7B-Q8_0)**
-where decode aggregate is free to keep climbing with concurrency.
-
-Hardware: NVIDIA GB10 (sm_121, 119 GiB unified LPDDR5X, ~273 GB/s). Build:
-`~/llama.cpp-pr22569` (PR #22569 paged path + the reshape fix), `LLAMA_MAX_SEQ=2048`,
-sm_121 Release. Contiguous = `llama-batched-bench` (unified KV) `S_TG`. Paged =
-`llama-paged -kvp --fit off` `aggregate tps`. `npp=16, ntg/n_predict=128, b=ub=2048,
-ngl 99`. Cold runs, 12 s cooldowns.
-
-## TL;DR for the decision
-
-**On a single GB10, paged KV does NOT deliver a throughput or concurrency win - the
-aggregate-decode ceiling is set by the hardware, not the KV layout, and contiguous KV
-already reaches it.** Measured across two model regimes and concurrency up to 2048
-sequences:
-
-- Aggregate decode **plateaus** once the GPU saturates - for both KV layouts:
-  - 32B-dense (compute-bound): ~540 t/s, flat from npl=128 (prior eval).
-  - 1.7B (bandwidth-bound): ~3,200-3,700 t/s, flat from npl=512 (this run).
-- Paged and contiguous land at the **same ceiling**; PR #22569's paged op was 12-13%
-  *slower* than the mature contiguous flash-attention path at equal concurrency on 32B.
-- Pushing concurrency past the plateau is **actively harmful to UX**: per-sequence
-  throughput collapses (23 -> 1.9 tok/s) and TTFT explodes (0.6 s -> 4.3 s avg, **64 s
-  max**) while aggregate stays flat.
-
-**vLLM's ~24k aggregate headline is unreachable on a single GB10 with these models
-regardless of KV layout** - it needs aggregate memory bandwidth / compute that one GB10
-does not have (i.e. many GPUs). Paged KV is a **memory-capacity / anti-fragmentation /
-prefix-sharing** feature, not a single-node throughput-ceiling feature. The static
-single-model benchmark deliberately does not create the memory-pressure regime where
-paging pays off, which is exactly why no win appears.
-
-## The numbers
-
-### Aggregate decode vs concurrency, Qwen3-1.7B-Q8_0 (bandwidth-bound), `LLAMA_MAX_SEQ=2048`
-
-| npl | contiguous `S_TG` (t/s) | paged `aggregate tps` (t/s) | paged per-seq tps | paged TTFT avg / max |
-|----:|------------------------:|----------------------------:|------------------:|---------------------:|
-| 128 | 2,643 | 2,887 | 23-25 | - |
-| 256 | 2,925 | - | - | - |
-| 512 | 3,215 | 3,637 | 7.2-7.8 | 0.57 s / 0.90 s |
-| 1024 | 3,118 | 3,695 | 3.7-4.2 | 1.17 s / 2.37 s |
-| 2048 | (not run) | 3,608 | 1.9-14.6 | 4.28 s / **63.8 s** |
-
-Both paths flatten by npl~512. 8x more concurrency (128->1024) buys contiguous only
-**+18%** and paged **+28%**, then both stop. (The two tools meter slightly differently -
-`llama-paged` aggregate vs `batched-bench` decode-only `S_TG` - so the small paged-vs-
-contiguous offset is not a real paged advantage; the prior apples-to-apples 32B eval had
-paged 12-13% *behind*.)
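-
-The two gains quoted for the 128 -> 1024 step come straight from the table rows (gain = throughput at
-npl=1024 / throughput at npl=128 - 1); shown here only as a worked check:
-
-```math
-\begin{aligned}
-\text{contiguous:} &\quad 3118 / 2643 - 1 \approx 0.18 \ (+18\%) \\
-\text{paged:} &\quad 3695 / 2887 - 1 \approx 0.28 \ (+28\%)
-\end{aligned}
-```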
-
-### Why it plateaus (the hardware ceiling, not the KV layout)
-
-Decode is memory-bandwidth-bound: each step reads the model weights once and shares that
-read across the whole batch. Once concurrency is high enough that the shared weight-read
-is amortized, the per-step cost is dominated by KV reads + attention + host work, none of
-which paging makes cheaper. The GB10's ~273 GB/s sets the floor; at the plateau the GPU
-is ~saturated. Adding sequences past that point cannot raise aggregate - it only divides
-the same throughput across more users (per-seq tps falls, TTFT rises). The 32B-dense case
-plateaus even earlier (npl=128) because it saturates on **compute** (weight matmuls), not
-bandwidth - the kernel decomposition is in `VLLM_DECOMPOSITION.md`.
-
-## What paged KV is actually for (the honest, deliverable value)
-
-Paging never helps a static, uniform-length, single-model benchmark on a GPU with memory
-to spare - there is no fragmentation and no over-reservation to reclaim. Its real wins,
-which require the regime this hardware+benchmark does not exercise, are:
-
-1. **Concurrent-tenant capacity under memory pressure.** Block KV fits more *diverse*
-   in-flight sequences (variable, dynamically arriving/leaving contexts) without the
-   contiguous path's per-slot reservation/fragmentation. Pays off when KV memory, not
-   compute/bandwidth, is the binding constraint - i.e. at multi-GPU datacenter scale or
-   with very long/variable contexts.
-2. **Cross-request prefix sharing.** A chained-hash block cache shares identical system
-   prompts / RAG preambles across requests (vLLM's `block_pool.py` + block-hash map). A
-   real token-budget win for shared-prefix workloads; PR #22569 defers this to a
-   non-existent Phase 2 (our from-scratch P0 has the machinery).
-
-These are measured as **max concurrent distinct tenants** and **KV memory saved**, not as
-aggregate tok/s on one model. They do not move the single-GB10 throughput ceiling.
-
-## Recommendation
-
-- **Do not pitch paged KV as a single-GB10 throughput lever** - it is measured flat to
-  the contiguous ceiling (and PR #22569 is slower). Doing so would not survive a
-  benchmark.
-- **The single-GB10 throughput story is already strong without paging:** llama.cpp is
-  ahead of vLLM single-stream (MXFP4 1153 > 800) and at ~70-81% of vLLM aggregate at
-  npl<=128 with a near-identical batching multiplier (`VLLM_DECOMPOSITION.md`). Ship the
-  MXFP4/NVFP4-dense prefill win (`NVFP4_TEST.md`) - that is the cheap, real, defensible
-  Blackwell number.
-- **If datacenter-scale (thousands of concurrent tenants) is the genuine target,** the
-  lever is **multiple GPUs** plus paged KV's **capacity + prefix-sharing** features -
-  framed and measured as concurrent-tenant capacity and KV memory saved, on a
-  variable-context / shared-prefix workload. A single GB10 cannot produce the ~24k
-  aggregate regardless of KV layout; that is a fleet-level result.
-
-## Reproduction (DGX, `~/llama.cpp-pr22569`, `LLAMA_MAX_SEQ=2048`)
-
-```sh
-M=~/bench/draft17/Qwen3-1.7B-Q8_0.gguf
-# contiguous
-for NPL in 128 256 512 1024; do
-  ./build/bin/llama-batched-bench -m $M -npp 16 -ntg 128 -npl $NPL -ngl 99 \
-    -b 2048 -ub 2048 -fa on -c $((NPL*160)); done
-# paged
-for NPL in 512 1024 2048; do
-  ./build/bin/llama-paged -m $M -kvp --fit off -ngpub 32768 -ncpub 128 \
-    -np $NPL -ns $NPL -n 128 -b 2048 -ub 2048 -ngl 99; done
-```
--- a/backend/cpp/llama-cpp/paged/PAGED_KV_TARGET_READINESS.md
+++ b/backend/cpp/llama-cpp/paged/PAGED_KV_TARGET_READINESS.md
@@ -1,170 +0,0 @@
-# Paged KV: target-readiness (correctness, dynamic benchmark, 2xH200 projection)
-
-Target hardware: **~2x H200** (281 GB HBM3e total, ~4.8 TB/s per GPU). The GB10 box is
-the *test* rig, not the target - and several earlier "no win" findings are GB10-specific
-artifacts (low bandwidth caps throughput before KV memory ever binds). This document
-delivers the three things needed to push paged KV toward the real target:
-
-1. **Correctness** of the paged path - verified (and a blocking bug found + fixed).
-2. **A dynamic-load benchmark** that actually exercises where paging wins (`paged-loadgen.cpp`).
-3. **A projection** of the paged-KV payoff on 2x H200, grounded in measured GB10 numbers.
-
----
-
-## 1. Correctness: PASS (after fixing the auto-fit OOM)
-
-`test-paged-kv-e2e` checks the paged decode path against the contiguous reference
-(greedy argmax + top-5 set overlap >= 4). On the box it was previously **unverified** -
-it aborted at context creation. Root cause found:
-
-- `common_fit_paged_kv_blocks` (`common/common.cpp:1144`) **unconditionally overrides**
-  `n_gpu_blocks` from `ggml_backend_dev_memory`, which **over-reports free VRAM on the
-  GB10 integrated/unified device** (it sized **~245 GB of KV on a 119 GB box** ->
-  `cudaMalloc` OOM -> `GGML_ASSERT` abort in `llama-kv-cache-paged.cpp:74`). The test's
-  explicit `n_gpu_blocks=64` was being clobbered because `params.fit_params` defaults on.
-
-**Fix (item-1 patch, applied on the box):**
-
-```diff
---- a/tests/test-paged-kv-e2e.cpp
-+++ b/tests/test-paged-kv-e2e.cpp
-@@ run_paged()
-     params.kv_paged      = true;
-+    params.fit_params    = false;  // honor explicit n_gpu_blocks; GB10 dev_memory over-reports free VRAM
-     params.n_gpu_blocks  = 64;
-```
-
-**Result (Qwen3-0.6B-Q8_0, GB10):**
-
-```
-test-paged-kv-e2e: top-5 argmax match: ref=3743 paged=3743
-test-paged-kv-e2e: top-5 set overlap: 5/5 (require >= 4)
-test-paged-kv-e2e: PASSED
-```
-
-The paged op is **numerically greedy-equivalent to the contiguous path**. The fix for the reshape
-bug from `PR22569_EVAL.md` (decoupled head_dim) is already applied in the checkout.
-
-**Target-readiness caveat (the durable fix, not just the test):** the auto-fit itself is
-brittle and must be hardened before it runs on a real serving box - even though
-`ggml_backend_dev_memory` reports correctly on a discrete H200, the function should still
-(a) early-return when `!params.fit_params`, (b) **clamp** the computed `n_gpu_blocks` so
-`n_gpu_blocks * block_bytes <= free_vram - margin` using the *actual* KV element size, and
-(c) not override an explicitly-set value. One-screen change in `common_fit_paged_kv_blocks`.
-
----
-
-## 2. Dynamic-load benchmark - `paged-loadgen.cpp`
-
-**Why the existing tools show no paged win:** `llama-batched-bench` and the stock
-`examples/paged/paged.cpp` both run **fixed-length, all-arrive-at-once, single-prompt**
-load. That has no over-reservation and no fragmentation, so contiguous KV is already
-memory-optimal and paging has nothing to reclaim (`PAGED_KV_HIGH_CONCURRENCY.md`). The
-paged win only exists under **variable lengths + continuous arrival + shared prefixes** -
-the real serving regime. No tool in the tree creates it.
-
-`paged-loadgen.cpp` (committed here) does, via the confirmed `llama_paged_scheduler_*`
-API:
-
-- **shared system prefix** (`LG_PREFIX` tokens) prepended to every request -> exercises
-  cross-request prefix sharing,
-- **variable prompt length** (`LG_SUFMIN..LG_SUFMAX` unique suffix),
-- **bimodal generation length** (`LG_GENLONG` for `LG_LONGPCT`% of requests, else
-  `LG_GENSHORT`) - the over-reservation driver,
-- **continuous arrival**: keeps `LG_INFLIGHT` requests live, admitting a new one each time
-  one finishes.
-
-It reports the load-bearing number for the buy decision - the **capacity ratio**:
-
-```
-paged peak KV      = sum over live seqs of ceil(used/block)*block * kv_bytes_per_token
-contiguous reserve = peak_inflight * max_ctx * kv_bytes_per_token   (worst-case per slot)
-CAPACITY RATIO     = contiguous_reserve / paged_peak   (+ prefix sharing on top)
-```
-
-`kv_bytes_per_token = 2 * n_layer * n_head_kv * head_dim * sizeof(f16)` - confirmed against
-`llama-kv-cache-paged.cpp` (e.g. Qwen3-32B: 2*64*8*128*2 = **256 KiB/token**).
-
-**How to run (on the target):** drop into PR #22569's `examples/paged/`, add to its
-CMakeLists next to `llama-paged`, build, then e.g.
-`LG_INFLIGHT=2048 LG_LONGPCT=15 paged-loadgen -m <model> -kvp --fit off -ngpub <N> -ncpub <M> -ngl 99`.
-Sweep `LG_INFLIGHT` to the throughput plateau and read the capacity ratio at that point.
-It is written to run on the target (2x H200) where the regime exists; on GB10 it runs but
-the ratio is uninteresting because throughput plateaus before memory binds (see below).
-
---
-
-## 3. Projection to 2x H200 (grounded in measured GB10 numbers)
-
-### Measured on GB10 (this work)
-
-| model | decode plateau (aggregate) | plateau concurrency | bound by |
-|---|---|---|---|
-| Qwen3-32B-Q4_K_M (dense) | ~540 t/s | npl ~128 | compute |
-| Qwen3-1.7B-Q8_0 | ~3,200 t/s | npl ~512 | bandwidth |
-
-### Hardware ratios (per GPU, then 2x TP at ~85% scaling)
-
-| | GB10 | H200 | per-GPU x | 2x H200 (TP) x |
-|---|---|---|---|---|
-| mem bandwidth | 273 GB/s | ~4.8 TB/s | 17.6 | ~30 |
-| BF16 compute | ~213 TFLOP | ~989 TFLOP | 4.6 | ~8 |
-| HBM | 119 GB | 141 GB | 1.18 | 2.4 (282 GB) |
-
-Decode is bandwidth-bound, so **both the aggregate ceiling and the concurrency at which it
-is reached scale with bandwidth (~30x on 2x H200)**:
-
- **32B-dense aggregate decode ceiling:** 540 x 30 ~= **16,000 t/s**, reached at
-  ~128 x 30 ~= **3,800 concurrent sequences**.
-
-### Why paged KV becomes the binding lever on 2x H200 (and didn't on GB10)
-
-To reach that ~16k t/s ceiling you must hold **~3,800 sequences** of KV. The memory math:
-
- 32B weights (FP8) ~= 32 GB, sharded over 2 GPUs -> ~250 GB HBM free for KV.
- 32B KV = 256 KiB/token. At an avg held context of 2,000 tokens, **per seq ~= 500 MiB**.
- Contiguous unified KV (reserve for the live set) fits ~250 GB / 500 MiB ~= **~500
-  sequences** - **roughly 8x short of the 3,800 needed to reach the throughput ceiling.**
-
-So on 2x H200 **KV memory is the binding constraint at the throughput-optimal concurrency**,
-and contiguous KV strands most of the bandwidth (you'd run at a fraction of 16k t/s). This
-is the gap paged KV closes. On GB10 it never appeared because GB10's 30x-lower bandwidth
-caps decode at npl ~128, whose KV fits in memory trivially - the constraint order is
-inverted on the real target.
-
-### Magnitude of the paged win
-
-Paging recovers concurrency two ways, both multiplicative on achievable throughput:
-
-1. **No over-reservation.** Contiguous must back `max_ctx` per slot; paging uses
-   `ceil(actual/block)`. For a realistic bimodal workload (most generations short, ~15%
-   long, prompts ~512) the average held context is several-fold below `max_ctx` ->
-   `paged-loadgen` capacity ratio typically **~4-10x** (it measures the exact number for
-   your workload's length distribution).
-2. **Cross-request prefix sharing** of shared system prompts / RAG preambles - additional,
-   workload-dependent (chained-hash block cache; vLLM's `block_pool.py`).
-
-Net: on 2x H200, paged KV is plausibly the difference between serving **~500 and ~3,800**
-concurrent 32B sequences in HBM, i.e. between a fraction of and ~all of the **~16k t/s**
-decode ceiling. **That is the datacenter payoff, and it is real on the target even though
-GB10 cannot exhibit it.**
-
-### Honest caveats for the buy case
-
- These are **projections** from GB10 + spec ratios; the capacity multiplier depends on the
-  workload's context-length distribution (more variable -> bigger paged win) and TP
-  efficiency. `paged-loadgen` measures it directly once you have target-GPU time.
- The **paged op itself still needs work**: PR #22569's `ggml_paged_attn` was 12-13%
-  *slower* than the mature contiguous flash-attention path at equal concurrency
-  (`PR22569_EVAL.md`), lacks prefix sharing (deferred to a non-existent Phase 2), and has
-  the fit-robustness bug above. Adopting paged KV for the target means either hardening
-  #22569 or finishing the from-scratch P4 - the capacity win above assumes a *correct,
-  competitive* op, which is the remaining engineering.
- Prefill on either KV layout is compute-capped, not a paged concern.
-
-**Bottom line for the decision:** paged KV **is** the right lever for the 2x H200 target -
-the GB10 "no win" result is a bandwidth artifact, not a verdict. The paged path is now
-**correctness-verified**, the **benchmark to size the win exists**, and the projection
-says the payoff is **~4-10x concurrent-tenant capacity -> several-fold higher aggregate
-decode** on the target. The remaining work is hardening/finishing the paged op, not
-proving the thesis.
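
The sizing arithmetic in the projection section can be reproduced with a short script. This is a minimal sketch and not part of the original notes: the function names are hypothetical, the Qwen3-32B shape (64 layers, 8 KV heads, head_dim 128, f16) and the 250 GB / 2,000-token / 3,800-sequence figures are taken from the text, GB is assumed to mean 10^9 bytes, and the block size of 16 tokens in the last call is an illustrative assumption.

```python
def kv_bytes_per_token(n_layer, n_head_kv, head_dim, elem_bytes=2):
    # K and V (the leading factor 2) for every layer and every KV head.
    return 2 * n_layer * n_head_kv * head_dim * elem_bytes


def contiguous_capacity(free_bytes, avg_ctx_tokens, bytes_per_token):
    # Whole sequences that fit when each one holds avg_ctx_tokens of KV.
    return free_bytes // (avg_ctx_tokens * bytes_per_token)


def capacity_ratio(used_tokens, block, max_ctx):
    # contiguous reserve / paged peak, following the formula in the
    # benchmark section; kv_bytes_per_token cancels out of the ratio.
    paged = sum(-(-used // block) * block for used in used_tokens)
    contiguous = len(used_tokens) * max_ctx
    return contiguous / paged


bpt = kv_bytes_per_token(64, 8, 128)                 # Qwen3-32B shape from the text
per_seq_mib = 2000 * bpt / 2**20                     # 2,000 held tokens per sequence
fit = contiguous_capacity(250 * 10**9, 2000, bpt)    # ~250 GB free for KV

print(bpt // 1024, "KiB/token")
print(per_seq_mib, "MiB per sequence")
print(fit, "sequences fit,", round(3800 / fit, 1), "x short of 3,800")
print(capacity_ratio([600, 700, 2500], block=16, max_ctx=4096))
```

With exactly 2,000 held tokens the per-sequence figure comes out at 500 MiB, so the contiguous capacity lands slightly under 500 sequences. The ratio printed by the last line depends entirely on the assumed length mix, which is the quantity `paged-loadgen` is meant to measure on real traffic.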
--- a/backend/cpp/llama-cpp/paged/PHASED_VLLM_PARITY_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/PHASED_VLLM_PARITY_PLAN.md
@@ -1,55 +0,0 @@
-# Making llama.cpp/LocalAI a viable vLLM alternative — phased plan
-
-Goal: close the practical gap to vLLM for both single-user *speed* and multi-user *throughput*, while keeping
-quality (no lossy quant). Grounded in measured benchmarks + research (`BENCHMARKS.md`, `BLACKWELL_KERNEL_GAPS.md`,
-`VLLM_THROUGHPUT_GAP.md`). The gap is NOT one thing — each phase targets a distinct, independent lever.
-
-## Where vLLM actually leads (measured, GB10 / Qwen3-32B)
-
- **Single-user decode:** ~parity (10.2 vs 11.7 t/s) — bandwidth-bound. vLLM's edge is **spec-dec** (lossless).
- **Multi-user decode:** gap grows to ~2.2× at B=64 (kernel + scheduler).
- **Prefill aggregate:** llama plateaus ~765 t/s, vLLM scales to 24k t/s — **paged KV + chunked prefill + kernel**.
- Note: on GB10 vLLM's FP4 trump card is *broken* (falls back to Marlin); llama.cpp runs reliably — a real
-  viability point. vLLM is structurally ahead mainly via **paged KV, chunked prefill, cross-request prefix cache**.
-
-## Phases
-
-### Phase 1 — Hardware-tuned config (PR #10411) — DONE
-Folded into the hardware-defaults path (`core/config/hardware_defaults.go`):
- Blackwell physical batch (n_ubatch) = 2048.
- **VRAM-scaled `n_parallel` default** (>=32GiB→8, >=8→4, >=4→2): turns on concurrency + continuous batching,
-  which the backend leaves OFF at its `n_parallel=1` default. Unified KV → slots share the budget (no extra
-  KV memory). Single-host (local GPU) + distributed router (per node). Already-good defaults confirmed:
-  flash-attn=auto, context=4096.
-
-### Phase 2 — Paged / block KV cache  ← biggest structural multi-user lever
-vLLM's PagedAttention lifts KV utilization ~20-38% → ~96%. llama.cpp's own A10G data (draft PR #22569):
-contiguous OOMs at 26 seqs / 496 t/s → paged 247 seqs / 1256 t/s (**~9.5× concurrency, 2.5× aggregate**).
- Build on / complete **upstream draft PR #22569** (`-kvp`, block manager + paged-attn ggml op, FCFS scheduler)
-  rather than the from-scratch series we prototyped (`paged/`). Our CPU-verified block manager + gather-read
-  design informs the review/port; the upstream momentum is the place to land it.
- Phase 2b: cross-request prefix sharing (block-hash dedup) — our `PagedKVManager` already implements it.
-
-### Phase 3 — Prefill amortization (chunked prefill + n_batch/n_ubatch split)
-llama aggregate prefill plateaus because (a) one prompt saturates compute, (b) the per-forward GEMM M-dim is
-capped at `n_ubatch`=512, (c) no scheduler chunked prefill (draft #10718 abandoned).
- Split logical `n_batch` from physical `n_ubatch` (LocalAI ties them today) so concurrent prefills batch into
-  a larger logical batch while keeping ubatch at the Blackwell sweet spot (2048).
- Chunked prefill + prefill/decode co-batching in the server slot scheduler.
-
-### Phase 4 — Batched-GEMM kernel tuning (the decode 2.2× + prefill height)
-Per `BLACKWELL_KERNEL_GAPS.md`: dense int8-MMQ at ~21% of ceiling, MoE FP4-MMA at ~5%. Both untuned for
-Blackwell. To MATCH: tune MMQ or a Marlin-style W4A16 BF16 GEMM (FP4 not required — GB10 is INT8==BF16). To
-BEAT (2×): fix+tune the existing FP4-MMA on sm_121 (build-flag/`-O3`-miscompile, not greenfield).
-
-### Phase 5 — Backend GPU sampling
-CPU per-sequence sampling caps GPU util ~60% beyond n_parallel ~8-16 (upstream PR #17004). Track/adopt.
-
-### Cross-cutting — Speculative decoding (single-user speed, quality-preserving)
-Dense ≥14B: lossless ~1.8-3×. llama.cpp has `-md`/`--spec-draft-*`. Wire a draft-model field in the model
-config + ship Qwen3 target+draft (1.7B) pairs in the gallery. NOT for MoE-A3B (nothing to amortize).
-
-## Sequencing rationale
-Phase 1 (config) ships now — biggest immediate multi-user win for zero kernel work (concurrency was OFF).
-Phase 2 (paged KV) is the highest-leverage structural build and has upstream momentum. Phases 3-4 are deeper
-(scheduler + kernel). Spec-dec is independent and can land any time for single-user speed.
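
The VRAM-scaled `n_parallel` default from Phase 1 amounts to a small threshold table. The sketch below is illustrative only: the real logic lives in `core/config/hardware_defaults.go` (Go) and is not shown in these notes, the function name is hypothetical, and returning 1 below 4 GiB is an assumption based on the backend's stated `n_parallel=1` default.

```python
GIB = 1024 ** 3


def default_n_parallel(vram_bytes):
    # Thresholds from the Phase 1 description: >=32 GiB -> 8, >=8 GiB -> 4, >=4 GiB -> 2.
    for min_gib, slots in ((32, 8), (8, 4), (4, 2)):
        if vram_bytes >= min_gib * GIB:
            return slots
    # Assumed fallback: keep the backend default, i.e. no concurrency.
    return 1


for gib in (2, 4, 8, 24, 32, 119):
    print(gib, "GiB ->", default_n_parallel(gib * GIB))
```

Checking thresholds from largest to smallest keeps the table readable and makes it hard to introduce overlapping ranges when a new tier is added.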
--- a/backend/cpp/llama-cpp/paged/PR17004_EVAL.md
+++ b/backend/cpp/llama-cpp/paged/PR17004_EVAL.md
@@ -1,90 +0,0 @@
-# PR #17004 (backend / GPU sampling) evaluation on DGX Spark (GB10, sm_121)
-
-Date: 2026-06-21. Hardware: NVIDIA DGX Spark (GB10, sm_121), CUDA 13.0, cmake 3.28.
-Model: `Qwen3-32B-Q4_K_M.gguf`. LocalAI pin: `LLAMA_VERSION=f3e182816421c648188b5eab269853bf1531d950` (2026-06-17).
-
-## TL;DR (clean negative)
-
-1. **PR #17004 is MERGED and is ALREADY present in our pinned llama.cpp `f3e1828`.** There is nothing to apply / cherry-pick / patch. The `-bs/--backend-sampling` CLI arg, the `llama_set_sampler` / `llama_get_sampled_*` API, and the GPU argsort/top-k/cumsum/softmax kernels are all in the pin.
-2. **The prescribed benchmark cannot test the fix.** `llama-batched-bench` does ZERO sampling - it feeds random tokens (`std::rand() % n_vocab`). Its ~540 t/s plateau is therefore **not** sampling-bound, and enabling backend sampling does nothing to it. The valid tool is `llama-batched` (examples/batched), which the PR updated to drive per-sequence sampler chains and which actually exercises `-bs`.
-3. **In a controlled real-sampling A/B (same `llama-batched` harness, CPU vs GPU sampler), GPU sampling gave only +25% at np=32, +3% at np=64, and CRASHED (`GGML_ASSERT(obj_new)`, graph-context alloc) at np=128 and np=256** - exactly the multi-user regime the investigation cares about.
-4. **nsys at np=64: GPU kernel profile and GPU-busy time are essentially identical with and without the fix** (CPU 392.5 t/s / GPU 404.2 t/s; total GPU kernel+memop time ~4.05 s in both). Sampling kernels do not even appear among the top GPU contributors. GPU utilization did **not** rise.
-5. **Conclusion: PR #17004, in the state shipped by our pin, does NOT break the ~540 plateau and does not move decode aggregate toward the ~2700 GPU-bound ceiling or past vLLM's 667.** It is modest at low parallelism and unusable (crash) at the high parallelism in question. The PR's own guidance ("recommended `--parallel 1`", "will take time to mature") matches what we measured.
-
-## 1. What PR #17004 does + state
-
- Title: "sampling : add support for backend sampling". **State: MERGED** into `master` (PR head branch `gpu-sampling`). 44 files, +4133/-296.
- `libllama`: new `llama_context_params.samplers` / `n_samplers`, `llama_set_sampler`, `llama_get_sampled_*`, `llama_sampler_seq_config`, updated `llama_sampler_i`. Sampler chain can now run inside the compute graph on the backend (GPU) instead of on the CPU after `llama_decode`.
- CUDA: optimized/new `argsort`, `top-k`, `cumsum`, `softmax` kernels; CMake option `-DGGML_CUDA_CUB_3DOT2=ON` (builds a CCCL v3.2 prerelease for faster top-k).
- Tools: new `-bs, --backend-sampling` arg in `common/arg.cpp` (line 1921); server (`server-context.cpp`) per-slot wiring; `examples/batched/batched.cpp` updated.
- Supported backend samplers: `top-k`, `top-p`, `min-p`, `temp` (+ dist). **Limitations (from the PR): not compatible with grammar sampling; single output per sequence per batch; no save/load of sampling state; recommended only with `--parallel 1` and CUB_3DOT2.** Open follow-ups: #18547 (avoid graph reallocations), #18550 (skip inactive samplers in parallel decode).
- It DOES target the CPU-side per-sequence sampling stall we hypothesised - the mechanism is correct. Maturity is the problem.
-
-Note: the GitHub API reports `mergedAt: 2026-01-04`, but the PR contains June 2026 upstream-merge commits and the feature is verified present in our 2026-06-17 pin, so treat the date field as a metadata quirk. What matters: the code is in `f3e1828`.
-
-## 2/3. Apply + build
-
-No apply needed (already in pin). Built from a clean `git worktree` at `f3e1828` (`~/llama-pr17004`), to avoid disturbing the existing diffusion build:
-
-```
-cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON \
-  -DCMAKE_CUDA_ARCHITECTURES=121 -DLLAMA_MAX_SEQ=256 \
-  -DGGML_CUDA_CUB_3DOT2=ON -DLLAMA_CURL=OFF
-cmake --build build --target llama-batched llama-batched-bench -j20
-```
-
-**Build: SUCCESS** (CUB_3DOT2=ON FetchContent fetched and compiled despite flaky net; sm_121; LLAMA_MAX_SEQ=256). `-bs/--backend-sampling` confirmed present in `llama-batched --help`.
-
-## 4. Decode aggregate: fix vs baseline vs vLLM
-
-### 4a. `llama-batched-bench` (NO sampling - reconfirms the plateau, unaffected by the fix)
-`-npp 16 -ntg 128 -npl 32,64,128,256 -c 40960 -b 2048 -ub 2048`
-
-| npl | S_TG t/s |
-|-----|----------|
-| 32  | 241.8 |
-| 64  | 395.1 |
-| 128 | 542.6 |
-| 256 | 567.2 |
-
-Reproduces the ~540 plateau. Because this tool never samples, `-bs` is irrelevant here - the plateau is decode/host-overhead-bound, not sampling-bound.
-
-### 4b. `llama-batched` real-sampling A/B (CPU sampler vs `-bs` GPU sampler, identical harness)
-`-kvu -n 128 -np {32,64,128,256} -c 40960 --seed 1` (samplers: top-k 40 / top-p 0.95 / temp 0.8)
-
-| np  | CPU sampling t/s | GPU `-bs` sampling t/s | delta |
-|-----|------------------|------------------------|-------|
-| 32  | 174.1 | 217.5 | +25% |
-| 64  | 390.5 | 403.4 | +3.3% |
-| 128 | 497.9 | **CRASH** `GGML_ASSERT(obj_new) ggml.c:1768` | - |
-| 256 | 396.7 | **CRASH** `GGML_ASSERT(obj_new) ggml.c:1768` | - |
-
-(`llama-batched` absolute t/s is lower than `batched-bench` because it does real sampling plus per-token detokenize/string/stream work; the A/B *within* this harness isolates the sampler cost.)
-
-**Does the fix break the plateau? No.** GPU sampling helps only at low parallelism and the gain shrinks as np rises (+25% -> +3%), then the path crashes at np>=128 - i.e. it fails in exactly the multi-user regime where the plateau matters. It does not approach the ~2700 ceiling and does not pass vLLM's 667. The CPU-sampling curve itself peaks at np=128 (498) and *drops* at np=256 (397), confirming CPU sampling is a scaling wall - but PR #17004 as shipped does not lift it because the GPU path is unstable there.
-
-## 5. GPU-utilization mechanism (nsys, np=64, the highest np where `-bs` survives)
-
-`nsys profile -t cuda ... -n 96 -np 64`
-
-| mode | decode t/s | total GPU kernel+memop time | top GPU contributors |
-|------|-----------|------------------------------|----------------------|
-| CPU sampling | 392.5 | ~4.07 s | mul_mat_q (55%+17%), flash_attn (5.7%), mul_mat_vec (2%) |
-| GPU `-bs`    | 404.2 | ~4.04 s | identical set; sampling kernels not in top contributors |
-
-GPU-busy time and the kernel mix are **essentially unchanged** between modes. The argsort/top-k/cumsum/softmax sampling kernels are negligible in the timeline; the only visible difference is H2D memcpy *instances* rising 1,495 -> 7,076 (pinned-memory sampler transfers) at ~unchanged total memcpy time. **GPU utilization did not rise.** This directly refutes the idea that, at this workload, the GPU idle is dominated by CPU sampler arithmetic - moving the sampler onto the GPU barely changed throughput (+3%) and did not raise GPU occupancy. The ~80% idle measured elsewhere is dominated by something other than the sampler math (host-side batch construction / synchronization / detokenize), which PR #17004 does not address.
-
-(np=256 nsys "with fix" could not be captured: `-bs` aborts there. Fixing the crash needs the unmerged follow-ups #18547/#18550, not in our pin.)
-
-## LocalAI adoption path
-
-**The code arrives transparently with a version bump; enabling it is not transparent.**
-
- `backend/cpp/llama-cpp/prepare.sh` copies all of upstream `llama.cpp/tools/server/*` (including the #17004-modified `server-context.cpp` / `server-task.cpp` / `server-common.cpp`) into `tools/grpc-server/`, and `grpc-server.cpp` `#include`s them. So once `LLAMA_VERSION` points at a commit containing #17004 (our pin `f3e1828` already does), the backend-sampling machinery compiles into `grpc-server` automatically. **No vendored patch in `patches/` is required for the code.**
- The vendored `server-context.cpp` already does the per-slot wiring (around line 1615): `backend_sampling &= task.params.sampling.backend_sampling`, also disabled for speculative decode and for pre-sampling logits (`n_probs>0`), then `llama_set_sampler(ctx_tgt, slot.id, common_sampler_get(slot.smpl))`.
- **But it is OFF unless `task.params.sampling.backend_sampling == true`.** LocalAI's `grpc-server` builds `params` itself from the gRPC request and never sets this flag (and does not pass the upstream `--backend-sampling` CLI arg). So as-is, LocalAI compiles the feature but never uses it. **A small grpc-server change is needed**: read a LocalAI model option / env and set `params.sampling.backend_sampling = true` (global or per-request).
- For performant CUDA top-k, add `-DGGML_CUDA_CUB_3DOT2=ON` to the llama-cpp CUDA `CMAKE_ARGS` in the Makefile (optional; a non-CUB fallback exists).
- **Caveats that blunt the benefit for LocalAI specifically:** grammar-constrained requests (JSON-schema / tool calls - a large share of LocalAI traffic), `logprobs`/`n_probs>0`, and speculative decoding all fall back to CPU sampling by the gating above; and the GPU path crashes at np>=128 in this pin. So even after wiring the flag, the multi-user throughput case would not benefit (and would crash) until the follow-up PRs (#18547/#18550) land and stabilise high-parallelism backend sampling.
-
-### Recommendation
-Do **not** adopt PR #17004 as the multi-user throughput fix yet. It is already in the tree but is immature at the parallelism that matters (crashes at np>=128, modest gains below). The measured bottleneck at this workload is not the sampler arithmetic (nsys shows GPU-busy unchanged when sampling moves to GPU). Re-evaluate after #18547/#18550 merge into a future pin; revisit the host-side decode/batch-construction overhead as the more likely real lever.
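
The gating described in the adoption path can be summarised as a single predicate. This is a hedged Python model of the behaviour the notes describe, not the upstream C++ code: `backend_sampling` and `n_probs` are names from the text, while the class, the other field names, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SamplingRequest:
    backend_sampling: bool = False  # the flag LocalAI's grpc-server would have to set
    has_grammar: bool = False       # JSON-schema / tool-call constrained output
    n_probs: int = 0                # logprobs requested when > 0
    speculative: bool = False       # speculative decoding active


def uses_gpu_sampler(req):
    # GPU sampling is used only when explicitly requested and none of the
    # documented fallbacks to CPU sampling apply.
    return (
        req.backend_sampling
        and not req.has_grammar
        and req.n_probs == 0
        and not req.speculative
    )


print(uses_gpu_sampler(SamplingRequest()))                                         # flag never set
print(uses_gpu_sampler(SamplingRequest(backend_sampling=True)))                    # plain request
print(uses_gpu_sampler(SamplingRequest(backend_sampling=True, has_grammar=True)))  # tool call
```

Written this way it is easy to see why wiring the flag alone would leave a large share of LocalAI traffic on the CPU sampler: every grammar-constrained or logprobs request fails the predicate regardless of the flag.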
Author	SHA1	Message	Date
Ettore Di Giacinto	44e7d9806b	fix(distributed): stop queue loops on agent nodes + dead-letter cap pending_backend_ops rows targeting agent-type workers looped forever: the reconciler fan-out hit a NATS subject the worker doesn't subscribe to, returned ErrNoResponders, we marked the node unhealthy, and the health monitor flipped it back to healthy on the next heartbeat. Next tick, same row, same failure. Three related fixes: 1. enqueueAndDrainBackendOp skips nodes whose NodeType != backend. Agent workers handle agent NATS subjects, not backend.install / delete / list, so enqueueing for them guarantees an infinite retry loop. Silent skip is correct — they aren't consumers of these ops. 2. Reconciler drain mirrors enqueueAndDrainBackendOp's behavior on nats.ErrNoResponders: mark the node unhealthy before recording the failure, so subsequent ListDuePendingBackendOps (filters by status=healthy) stops picking the row until the node actually recovers. Matches the synchronous fan-out path. 3. Dead-letter cap at maxPendingBackendOpAttempts (10). After ~1h of exponential backoff the row is a poison message; further retries just thrash NATS. Row is deleted and logged at ERROR so it stays visible without staying infinite. Plus a one-shot startup cleanup in NewNodeRegistry: drop queue rows that target agent-type nodes, non-existent nodes, or carry an empty backend name. Guarded by the same schema-migration advisory lock so only one instance performs it. The guards above prevent new rows of this shape; this closes the migration gap for existing ones. Tests: the prune migration (valid row stays, agent + empty-name rows drop) on top of existing upsert / backoff coverage.	2026-04-19 21:27:05 +00:00
Ettore Di Giacinto	7a9d89fa54	feat(ui): shared FilterBar across the System page tabs The Backends gallery had a nice search + chip + toggle strip; the System page had nothing, so the two surfaces felt like different apps. Lift the pattern into a reusable FilterBar and wire both System tabs through it. New component core/http/react-ui/src/components/FilterBar.jsx renders a search input, a role="tablist" chip row (aria-selected for a11y), and optional toggles / right slot. Chips support an optional `count` which the System page uses to show "User 3", "Updates 1" etc. System Models tab: search by id or backend; chips for All/Running/Idle/Disabled/Pinned plus a conditional Distributed chip in distributed mode. "Last synced" + Update button live in the right slot. System Backends tab: search by name/alias/meta-backend-for; chips for All/User/System/Meta plus conditional Updates / Offline-nodes chips when relevant. The old ad-hoc "Updates only" toggle from the upgrade banner folded into the Updates chip — one source of truth for that filter. Offline chip only appears in distributed mode when at least one backend has an unhealthy node, so the chip row stays quiet on healthy clusters. Filter state persists in URL query params (mq/mf/bq/bf) so deep links and tab switches keep the operator's filter context instead of resetting every time. Also adds an "Adopted" distribution path: when a model in /api/models/capabilities carries source="registry-only" (discovered on a worker but not configured locally), the Models tab shows a ghost chip labelled "Adopted" with hover copy explaining how to persist it — this is what closes the loop on the ghost-model story end-to-end.	2026-04-19 08:46:22 +00:00
Ettore Di Giacinto	ee34a52c5d	feat(ui): NodeDistributionChip — shared per-node attribution component Large clusters were going to break the Manage → Backends Nodes column: the old inline logic rendered every node as a badge and would shred the layout at >10 workers, plus the Manage → Models distribution cell had copy-pasted its own slightly-different version. NodeDistributionChip handles any cluster size with two render modes: - small (≤3 nodes): inline chips of node names, colored by health. - large: a single "on N nodes · M offline · K drift" summary chip; clicking opens a Popover with a per-node table (name, status, version, digest for backends; name, status, state for models). Drift counting mirrors the backend's summarizeNodeDrift so the UI number matches UpgradeInfo.NodeDrift. Digests are truncated to the docker-style 12-char form with the full value preserved in the title. Popover is a new general-purpose primitive: fixed positioning anchored to the trigger, flips above when there's no room below, closes on outside-click or Escape, returns focus to the trigger. Uses .card as its surface so theming is inherited. Also useful for a future labels-editor popup and the user menu. Manage.jsx drops its duplicated inline Nodes-column + loaded_on cell and uses the shared chip with context="backends" / "models" respectively. Delete code removes ~40 lines of ad-hoc logic.	2026-04-19 08:39:59 +00:00
Ettore Di Giacinto	92b9e22dc9	feat(ui): show cluster distribution of models in the System page When a frontend restarted in distributed mode, models that workers had already loaded weren't visible until the operator clicked into each node manually — the /api/models/capabilities endpoint only knew about configs on the frontend's filesystem, not the registry-backed truth. /api/models/capabilities now joins in ListAllLoadedModels() when the registry is active, returning loaded_on[] with node id/name/state/status for each model. Models that live in the registry but lack a local config (the actual ghosts, not recovered from the frontend's file cache) still surface with source="registry-only" so operators can see and persist them; without that emission they'd be invisible to this frontend. Manage → Models replaces the old Running/Idle pill with a distribution cell that lists the first three nodes the model is loaded on as chips colored by state (green loaded, blue loading, amber anything else). On wider clusters the remaining count collapses into a +N chip with a title-attribute breakdown. Disabled / single-node behavior unchanged. Adopted models get an extra "Adopted" ghost-icon chip with hover copy explaining what it means and how to make it permanent. Distributed mode also enables a 10s auto-refresh and a "Last synced Xs ago" indicator next to the Update button so ghost rows drop off within one reconcile tick after their owning process dies. Non-distributed mode is untouched — no polling, no cell-stack, same old Running/Idle.	2026-04-19 08:37:45 +00:00
Ettore Di Giacinto	f0ab68e352	feat(distributed): durable backend fan-out + state reconciliation Two connected problems handled together: 1) Backend delete/install/upgrade used to silently skip non-healthy nodes, so a delete during an outage left a zombie on the offline node once it returned. The fan-out now records intent in a new pending_backend_ops table before attempting the NATS round-trip. Currently-healthy nodes get an immediate attempt; everyone else is queued. Unique index on (node_id, backend, op) means reissuing the same operation refreshes next_retry_at instead of stacking duplicates. 2) Loaded-model state could drift from reality: a worker OOM'd, got killed, or restarted a backend process would leave a node_models row claiming the model was still loaded, feeding ghost entries into the /api/nodes/models listing and the router's scheduling decisions. The existing ReplicaReconciler gains two new passes that run under a fresh KeyStateReconciler advisory lock (non-blocking, so one wedged frontend doesn't freeze the cluster): - drainPendingBackendOps: retries queued ops whose next_retry_at has passed on currently-healthy nodes. Success deletes the row; failure bumps attempts and pushes next_retry_at out with exponential backoff (30s → 15m cap). ErrNoResponders also marks the node unhealthy. - probeLoadedModels: gRPC-HealthChecks addresses the DB thinks are loaded but hasn't seen touched in the last probeStaleAfter (2m). Unreachable addresses are removed from the registry. A pluggable ModelProber lets tests substitute a fake without standing up gRPC. DistributedBackendManager exposes DeleteBackendDetailed so the HTTP handler can surface per-node outcomes ("2 succeeded, 1 queued") to the UI in a follow-up commit; the existing DeleteBackend still returns error-only for callers that don't care about node breakdown. 
Multi-frontend safety: the state pass uses advisorylock.TryWithLockCtx on a new key so N frontends coordinate — the same pattern the health monitor and replica reconciler already rely on. Single-node mode runs both passes inline (adapter is nil, state drain is a no-op). Tests cover the upsert semantics, backoff math, the probe removing an unreachable model but keeping a reachable one, and filtering by probeStaleAfter.	2026-04-19 08:34:57 +00:00
Ettore Di Giacinto	9373de9f9b	feat(ui): polish the Nodes page so it reads like a product The Nodes page was the biggest visual liability in distributed mode. Rework the main dashboard surfaces in place without changing behavior: StatCards: uniform height (96px min), left accent bar colored by the metric's semantic (success/warning/error/primary), icon lives in a 36x36 soft-tinted chip top-right, value is left-aligned and large. Grid auto-fills so the row doesn't collapse on narrow viewports. This replaces the previous thin-bordered boxes with inconsistent heights. Table rows: expandable rows now show a chevron cue on the left (rotates on expand) so users know rows open. Status cell became a dedicated chip with an LED-style halo dot instead of a bare bullet. Action buttons gained labels — "Approve", "Resume", "Drain" — so the icons aren't doing all the semantic work; the destructive remove action uses the softer btn-danger-ghost variant so rows don't scream red, with the ConfirmDialog still owning the real "are you sure". Applied cell-mono/cell-muted utility classes so label chips and addresses share one spacing/font grammar instead of re-declaring inline styles everywhere. Expanded drawer: empty states for Loaded Models and Installed Backends now render as a proper drawer-empty card (dashed border, icon, one-line hint) instead of a plain muted string that read like broken formatting. Tabs: three inline-styled buttons became the shared .tab class so they inherit focus ring, hover state, and the rest of the design system — matches the System page. "Add more workers" toggle turned into a .nodes-add-worker dashed-border button labelled "Register a new worker" (action voice) instead of a chevron + muted link that operators kept mistaking for broken text. New shared CSS primitives carry over to other pages: .stat-grid + .stat-card, .row-chevron, .node-status, .drawer-empty, .nodes-add-worker.	2026-04-19 08:20:52 +00:00
Ettore Di Giacinto	1b3c951c85	feat(ui): surface backend upgrades in the System page The System page (Manage.jsx) only showed updates as a tiny inline arrow, so operators routinely missed them. Port the Backend Gallery's upgrade UX so System speaks the same visual language: - Yellow banner at the top of the Backends tab when upgrades are pending, with an "Upgrade all" button (serial fan-out, matches the gallery) and a "Updates only" filter toggle. - Warning pill (↑ N) next to the tab label so the count is glanceable even when the banner is scrolled out of view. - Per-row labeled "Upgrade to vX.Y" button (replaces the icon-only button that silently flipped semantics between Reinstall and Upgrade), plus an "Update available" badge in the new Version column. - New columns: Version (with upgrade + drift chips), Nodes (per-node attribution badges for distributed mode, degrading to a compact "on N nodes · M offline" chip above three nodes), Installed (relative time). - System backends render a "Protected" chip instead of a bare "—" so rows still align and the reason is obvious. - Delete uses the softer btn-danger-ghost so rows don't scream red; the ConfirmDialog still owns the "are you sure". The upgrade checker also needed the same per-worker fix as the previous commit: NewUpgradeChecker now takes a BackendManager getter so its periodic runs call the distributed CheckUpgrades (which asks workers) instead of the empty frontend filesystem. Without this the /api/backends/ upgrades endpoint stayed empty in distributed mode even with the protocol change in place. New CSS primitives — .upgrade-banner, .tab-pill, .badge-row, .cell-stack, .cell-mono, .cell-muted, .row-actions, .btn-danger-ghost — all live in App.css so other pages can adopt them without duplicating styles.	2026-04-19 08:14:49 +00:00
Ettore Di Giacinto	1f43762655	fix(distributed): detect backend upgrades across worker nodes Before this change `DistributedBackendManager.CheckUpgrades` delegated to the local manager, which read backends from the frontend filesystem. In distributed deployments the frontend has no backends installed locally — they live on workers — so the upgrade-detection loop never ran and the UI silently never surfaced upgrades even when the gallery advertised newer versions or digests. Worker-side: NATS backend.list reply now carries Version, URI and Digest for each installed backend (read from metadata.json). Frontend-side: DistributedBackendManager.ListBackends aggregates per-node refs (name, status, version, digest) instead of deduping, and CheckUpgrades feeds that aggregation into gallery.CheckUpgradesAgainst — a new entrypoint factored out of CheckBackendUpgrades so both paths share the same core logic. Cluster drift policy: when per-node version/digest tuples disagree, the backend is flagged upgradeable regardless of whether any single node matches the gallery, and UpgradeInfo.NodeDrift enumerates the outliers so operators can see why it is out of sync. The next upgrade-all realigns the cluster. Tests cover: drift detection, unanimous-match (no upgrade), and the empty-installed-version path that the old distributed code silently missed.	2026-04-19 08:03:20 +00:00
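
The retry behaviour described in the queue commits (exponential backoff from 30s to a 15m cap, dead-lettered after `maxPendingBackendOpAttempts` = 10) can be sketched as follows. This is an illustration under stated assumptions, not the Go implementation: the commit messages give the base, the cap, and the attempt limit, but the doubling factor and the function name are assumed here.

```python
BASE_SECONDS = 30
CAP_SECONDS = 15 * 60
MAX_ATTEMPTS = 10  # maxPendingBackendOpAttempts in the commit message


def backoff_seconds(attempt):
    # attempt is 1-based; the delay doubles each time (assumed) up to the cap.
    return min(BASE_SECONDS * 2 ** (attempt - 1), CAP_SECONDS)


def should_dead_letter(attempts):
    # After the cap is reached the row is deleted and logged instead of retried.
    return attempts >= MAX_ATTEMPTS


schedule = [backoff_seconds(a) for a in range(1, MAX_ATTEMPTS + 1)]
print(schedule)
print(sum(schedule) / 60, "minutes of cumulative backoff")
```

Under the doubling assumption the cap is hit on the sixth attempt and the cumulative delay lands between one and two hours, the same order of magnitude as the "~1h" quoted in the commit message; a different multiplier would shift that total.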