fix: newline in virtual.yaml

Stupid one line fix, but it will fix CI

Signed-off-by: Dave <dave@gray101.com>
2026-07-07 06:49:49 -04:00 · 2024-04-25 10:39:07 -04:00
3290 changed files with 62265 additions and 779911 deletions
--- a/.agents/adding-backends.md
+++ b/.agents/adding-backends.md
@@ -1,315 +0,0 @@
-# Adding a New Backend
-
-When adding a new backend to LocalAI, you need to update several files to ensure the backend is properly built, tested, and registered. Here's a step-by-step guide based on the pattern used for adding backends like `moonshine`:
-
-## 1. Create Backend Directory Structure
-
-Create the backend directory under the appropriate location:
-- **Python backends**: `backend/python/<backend-name>/`
-- **Go backends**: `backend/go/<backend-name>/`
-- **C++ backends**: `backend/cpp/<backend-name>/`
-- **Rust backends**: `backend/rust/<backend-name>/`
-
-For Python backends, you'll typically need:
-- `backend.py` - Main gRPC server implementation
-- `Makefile` - Build configuration
-- `install.sh` - Installation script for dependencies
-- `protogen.sh` - Protocol buffer generation script
-- `requirements.txt` - Python dependencies
-- `run.sh` - Runtime script
-- `test.py` / `test.sh` - Test files
-
-For Rust backends, you'll typically need (see `backend/rust/kokoros/` as a reference):
-- `Cargo.toml` - Crate manifest; depend on the upstream project as a submodule under `sources/`
-- `build.rs` - Invokes `tonic_build` to generate gRPC stubs from `backend/backend.proto` (use the `BACKEND_PROTO_PATH` env var so the Makefile can inject the canonical copy)
-- `src/` - The gRPC server implementation (implement `Backend` via `tonic`)
-- `Makefile` - Copies `backend.proto` into the crate, runs `cargo build --release`, then `package.sh`
-- `package.sh` - Uses `ldd` to bundle the binary's dynamic deps and `ld.so` into `package/lib/`
-- `run.sh` - Sets `LD_LIBRARY_PATH`/`SSL_CERT_DIR` and execs the binary via the bundled `lib/ld.so`
-- `sources/<UpstreamProject>/` - Git submodule with the upstream Rust crate
-
-## 2. Add Build Configurations to `.github/backend-matrix.yml`
-
-The build matrix is data-only YAML at `.github/backend-matrix.yml` (not inside `backend.yml` itself). `backend.yml` (master push) and `backend_pr.yml` (PR) load it via `scripts/changed-backends.js`, which also handles per-file path filtering so only touched backends rebuild on PRs and master pushes alike. Add build matrix entries to `.github/backend-matrix.yml` for each platform/GPU type you want to support. Look at similar backends for reference — `chatterbox`/`faster-whisper` for Python, `piper`/`silero-vad` for Go, `kokoros` for Rust.
-
-**Without an entry here no image is ever built or pushed, and the gallery entry in `backend/index.yaml` will point at a tag that does not exist.** The `dockerfile:` field must point at `./backend/Dockerfile.<lang>` matching the language bucket from step 1 (e.g. `Dockerfile.python`, `Dockerfile.golang`, `Dockerfile.rust`). The `tag-suffix` must match the `uri:` in the corresponding `backend/index.yaml` image entry exactly.
-
-**`scripts/changed-backends.js` registration — REQUIRED for any new dockerfile suffix.** This is the single most common omission, because it has no effect on the PR that adds the backend (when no prior path filter could catch it anyway) — it only breaks the *next* PR that touches your backend's directory, which then gets zero CI jobs and looks broken for unrelated reasons. Edit `scripts/changed-backends.js:inferBackendPath` and add a branch BEFORE the more-generic suffixes:
-
-```js
-if (item.dockerfile.endsWith("<your-dockerfile-suffix>")) {
-    return `backend/cpp/<your-backend>/`;   // or backend/python|go|rust/...
-}
-```
-
-The `endsWith()` test is against the matrix entry's `dockerfile:` value (e.g. `./backend/Dockerfile.ds4` → `endsWith("ds4")`). Specificity order matters here just like it does for importers: more-specific suffixes go BEFORE more-generic ones (e.g. `ds4` before `llama-cpp` even though both end with letters, because some upstream might one day call itself `super-ds4-llama-cpp`). Verify locally before pushing:
-
-```bash
-# Confirm your dockerfile suffix is unique enough
-node -e "
-const yaml = require('js-yaml'); const fs = require('fs');
-const m = yaml.load(fs.readFileSync('.github/backend-matrix.yml','utf8'));
-for (const e of m.include.filter(e => e.backend === '<your-backend>')) {
-  console.log(e.dockerfile, '->', e.dockerfile.endsWith('<suffix>'));
-}"
-```
-
-A quick way to find the right insertion point: `grep -n 'item.dockerfile.endsWith' scripts/changed-backends.js`.
-
-**`bump_deps.yaml` registration — REQUIRED for any backend pinning an upstream commit.** If your backend's Makefile has a `*_VERSION?=<sha>` pin to a third-party repo, the daily auto-bump bot at `.github/workflows/bump_deps.yaml` won't notice it unless you register the backend in its matrix. The bot runs `.github/bump_deps.sh` which `grep`s for `^$VAR?=` in the Makefile you list — so the pin MUST live in the Makefile (not in a separate shell script). The bump for ds4 (#9761) had to walk this back because the original landed the pin in `prepare.sh`, which the bot can't see. Pattern (for `antirez/ds4`):
-
-```yaml
-# .github/workflows/bump_deps.yaml
-matrix:
-  include:
-    - repository: "antirez/ds4"
-      variable: "DS4_VERSION"
-      branch: "main"
-      file: "backend/cpp/ds4/Makefile"
-```
-
-And the corresponding Makefile shape (mirror `backend/cpp/llama-cpp/Makefile`):
-
-```makefile
-DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
-DS4_REPO?=https://github.com/antirez/ds4
-...
-ds4:
-	mkdir -p ds4
-	cd ds4 && git init -q && \
-	git remote add origin $(DS4_REPO) && \
-	git fetch --depth 1 origin $(DS4_VERSION) && \
-	git checkout FETCH_HEAD
-```
-
-If you have a `prepare.sh` doing the clone, delete it — the recipe belongs in the Makefile target so `make purge && make` works as a clean-and-rebuild and so the bump bot finds the pin.
-
-**Placement in file:**
-- CPU builds: Add after other CPU builds (e.g., after `cpu-chatterbox`)
-- CUDA 12 builds: Add after other CUDA 12 builds (e.g., after `gpu-nvidia-cuda-12-chatterbox`)
-- CUDA 13 builds: Add after other CUDA 13 builds (e.g., after `gpu-nvidia-cuda-13-chatterbox`)
-
-**Additional build types you may need:**
-- ROCm/HIP: Use `build-type: 'hipblas'` with `base-image: "rocm/dev-ubuntu-24.04:7.2.1"`
-- Intel/SYCL: Use `build-type: 'intel'` or `build-type: 'sycl_f16'`/`sycl_f32` with `base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"`
-- L4T (ARM): Use `build-type: 'l4t'` with `platforms: 'linux/arm64'` and `runs-on: 'ubuntu-24.04-arm'`
-
-**Per-arch native builds (`linux/amd64` + `linux/arm64`):**
-
-Multi-arch backends are NOT a single matrix entry with `platforms: 'linux/amd64,linux/arm64'`. Instead, add **two** entries — one with `platforms: 'linux/amd64'` + `platform-tag: 'amd64'` + `runs-on: 'ubuntu-latest'`, one with `platforms: 'linux/arm64'` + `platform-tag: 'arm64'` + `runs-on: 'ubuntu-24.04-arm'` — both sharing the same `tag-suffix`. The script detects the shared `tag-suffix` and emits a `merge-matrix` entry, so `backend-merge-jobs` (in `backend.yml`/`backend_pr.yml`) automatically assembles the manifest list from per-arch digest artifacts. See `-cpu-faster-whisper` in `.github/backend-matrix.yml` for a reference shape.
-
-**llama-cpp / ik-llama-cpp / turboquant variants only — `builder-base-image`:**
-
-Entries whose `dockerfile` is `./backend/Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}` must also set a `builder-base-image` field pointing at a prebuilt base from `quay.io/go-skynet/ci-cache:base-grpc-*` (CI builds these via `.github/workflows/base-images.yml`). The mapping is by `(build-type, platforms)` — see existing entries for the pattern. CI uses these prebuilt bases to skip the gRPC compile (~25–35 min cold). Local `make backends/<name>` ignores `builder-base-image` and uses the from-source path inside the Dockerfile, so you don't need quay access for local builds.
-
-### Cover every OS the project supports (Linux **and** Darwin)
-
-`.github/backend-matrix.yml` has two matrices, and they are the source of truth for which OS a backend ships on:
-
-- `include:` — the **Linux** matrix (x86_64 + arm64; CPU and CUDA / ROCm / SYCL / Vulkan).
-- `includeDarwin:` — the **macOS / Apple Silicon** matrix (arm64; Metal where the engine supports it, otherwise a native arm64 CPU build).
-
-**A new backend must target every OS it can build for — do not ship Linux-only by default.** A backend that appears only under `include:` is silently unavailable on macOS even when its code would run there. Most C/C++/GGML engines build on Darwin out of the box (ggml defaults `GGML_METAL=ON` on Apple, so a plain build is Metal-enabled), and many Python backends do too (CPU / MPS wheels). If a backend genuinely cannot support an OS (e.g. CUDA-only, no CPU variant), state that in the PR description instead of omitting it silently.
-
-Wiring a backend into `includeDarwin:` is more than the matrix entry:
-
-1. **`includeDarwin:` entry** — `tag-suffix: "-metal-darwin-arm64-<backend>"`, `build-type: "metal"`, `lang: "go"` for go+ggml backends; omit `build-type` for the bespoke C++ ones (llama-cpp / ds4 / privacy-filter). Match an existing entry of the same shape.
-2. **`backend/index.yaml`** — add `metal:` to the backend's `capabilities` map (main and `-development`) and concrete `metal-<backend>` / `metal-<backend>-development` image entries pointing at the `-metal-darwin-arm64-<backend>` images.
-3. **C/C++ backends only** — add an `inferBackendPathDarwin` case in `scripts/changed-backends.js` returning `backend/cpp/<backend>/` (the generic fallthrough assumes `backend/<lang>/`, which is wrong for a C++ source tree driven with `lang: go`), and give `run.sh` a Darwin branch that exports `DYLD_LIBRARY_PATH` instead of `LD_LIBRARY_PATH`. If the build is bespoke (single `grpc-server` + dylib bundling), model it on `scripts/build/ds4-darwin.sh` and add a `backends/<backend>-darwin` make target plus a gated step in `.github/workflows/backend_build_darwin.yml`.
-4. **C++ proto gotcha** — if the backend compiles the generated gRPC/protobuf in a separate CMake target (e.g. `hw_grpc_proto`), that target must link `protobuf::libprotobuf` + `gRPC::grpc++` so the Homebrew include dirs propagate; otherwise macOS fails with `google/protobuf/runtime_version.h not found` (Linux hides this because apt headers sit in `/usr/include`).
-
-The CI path filter only builds a backend on a PR when a file under its directory changes, so a darwin-only YAML edit builds nothing — touch a file under `backend/<lang>/<backend>/` (a one-line comment is enough) in the same PR.
-
-## 3. Add Backend Metadata to `backend/index.yaml`
-
-**Step 3a: Add Meta Definition**
-
-Add a YAML anchor definition in the `## metas` section (around lines 2-300). Use a similar backend, such as `diffusers` or `chatterbox`, as a template.
-
-**Step 3b: Add Image Entries**
-
-Add image entries at the end of the file, following the pattern of similar backends such as `diffusers` or `chatterbox`. Include both `latest` (production) and `master` (development) tags.
-
-**Note on integrity:** OCI backends installed from a gallery whose `verification:` block is set are verified against a keyless-cosign policy before extraction; tarball/HTTP backends use the optional `sha256:` field. New backends do not need any extra YAML — the gallery-level `verification:` block covers every entry. See [.agents/backend-signing.md](backend-signing.md) for the producer-side CI step.
-
-## 4. Update the Makefile
-
-The Makefile needs to be updated in several places to support building and testing the new backend:
-
-**Step 4a: Add to `.NOTPARALLEL`**
-
-Add `backends/<backend-name>` to the `.NOTPARALLEL` line (around line 2) to prevent parallel execution conflicts:
-
-```makefile
-.NOTPARALLEL: ... backends/<backend-name>
-```
-
-**Step 4b: Add to `prepare-test-extra`**
-
-Add the backend to the `prepare-test-extra` target to prepare it for testing. Use the path matching your language bucket (`backend/python/`, `backend/go/`, `backend/rust/`, …):
-
-```makefile
-prepare-test-extra: protogen-python
-	...
-	$(MAKE) -C backend/<lang>/<backend-name>
-```
-
-For Rust backends the target is usually the crate build target itself (e.g. `$(MAKE) -C backend/rust/<backend-name> <backend-name>-grpc`) so the binary is in place before `test` runs.
-
-**Step 4c: Add to `test-extra`**
-
-Add the backend to the `test-extra` target to run its tests — applies to Go and Rust backends too, not only Python:
-
-```makefile
-test-extra: prepare-test-extra
-	...
-	$(MAKE) -C backend/<lang>/<backend-name> test
-```
-
-Each backend's own `Makefile` should define a `test` target so this line works regardless of language. Integration tests that need large model downloads should be gated behind an env var (see `backend/rust/kokoros/`'s `KOKOROS_MODEL_PATH` pattern) so CI only runs unit tests.
-
-**Step 4d: Add Backend Definition**
-
-Add a backend definition variable in the backend definitions section (around lines 428-457). The format depends on the backend type:
-
-**For Python backends with root context** (like `faster-whisper`, `coqui`):
-```makefile
-BACKEND_<BACKEND_NAME> = <backend-name>|python|.|false|true
-```
-
-**For Python backends with `./backend` context** (like `chatterbox`, `moonshine`):
-```makefile
-BACKEND_<BACKEND_NAME> = <backend-name>|python|./backend|false|true
-```
-
-**For Go backends**:
-```makefile
-BACKEND_<BACKEND_NAME> = <backend-name>|golang|.|false|true
-```
-
-**For Rust backends**:
-```makefile
-BACKEND_<BACKEND_NAME> = <backend-name>|rust|.|false|true
-```
-
-The language field (`python`/`golang`/`rust`/…) must match a `backend/Dockerfile.<lang>` file.
-
-**Step 4e: Generate Docker Build Target**
-
-Add an eval call to generate the docker-build target (around lines 480-501):
-
-```makefile
-$(eval $(call generate-docker-build-target,$(BACKEND_<BACKEND_NAME>)))
-```
-
-**Step 4f: Add to `docker-build-backends`**
-
-Add `docker-build-<backend-name>` to the `docker-build-backends` target (around line 507):
-
-```makefile
-docker-build-backends: ... docker-build-<backend-name>
-```
-
-**Determining the Context:**
-
-- If the backend is in `backend/python/<backend-name>/` and uses `./backend` as context in the workflow file, use `./backend` context
-- If the backend is in `backend/python/<backend-name>/` but uses `.` as context in the workflow file, use `.` context
-- Check similar backends to determine the correct context
-
-## Documenting the backend (README + docs)
-
-A backend is not "added" until it is discoverable. Update the user-facing docs:
-
-- **`docs/content/features/backends.md`** - add the backend to the right
-  category in the "LocalAI supports various types of backends" list (and add a
-  new category if it introduces a new modality, e.g. sound classification).
-- If the backend introduces a **new API surface** (a new endpoint or a realtime
-  capability), document it under `docs/content/` where its area lives (audio,
-  vision, etc.) and follow the api-endpoints checklist in
-  [api-endpoints-and-auth.md](api-endpoints-and-auth.md).
-
-**If the backend is a native C/C++/GGML engine created and maintained by the
-LocalAI team** (a from-scratch port like `parakeet.cpp`, `ced.cpp`,
-`vibevoice.cpp`, `rf-detr.cpp`, not a wrapper around a third-party runtime), it
-ALSO belongs in the top-level **`README.md`** table under "native C/C++/GGML
-engines ... developed and maintained by the LocalAI project itself". Add a row
-linking the upstream engine repo with a one-line description. This is the
-project's showcase of its own engines; a new in-house backend that is missing
-from it is a documentation bug.
-
-## 5. Verification Checklist
-
-After adding a new backend, verify:
-
-- [ ] Backend directory structure is complete with all necessary files
-- [ ] Build configurations added to `.github/backend-matrix.yml` for all desired platforms (per-arch entries with `platform-tag` for multi-arch; `builder-base-image` for llama-cpp / ik-llama-cpp / turboquant)
-- [ ] **OS coverage considered**: added to `includeDarwin:` (macOS/Apple Silicon) if the backend can build there — with the `backend/index.yaml` `metal:` capability + `metal-<backend>` image entries, a `run.sh` Darwin/DYLD branch and `inferBackendPathDarwin` case for C++ backends — or the PR explains why an OS is unsupported. Do not ship Linux-only by default.
-- [ ] Meta definition added to `backend/index.yaml` in the `## metas` section
-- [ ] Image entries added to `backend/index.yaml` for all build variants (latest + development)
-- [ ] Tag suffixes match between `.github/backend-matrix.yml` (`tag-suffix`) and `backend/index.yaml` (`uri:`)
-- [ ] Makefile updated with all 6 required changes (`.NOTPARALLEL`, `prepare-test-extra`, `test-extra`, backend definition, docker-build target eval, `docker-build-backends`)
-- [ ] No YAML syntax errors (check with linter)
-- [ ] No Makefile syntax errors (check with linter)
-- [ ] Follows the same pattern as similar backends (e.g., if it's a transcription backend, follow `faster-whisper` pattern)
-- [ ] Documented: added to the category list in `docs/content/features/backends.md` (and any new endpoint/realtime capability documented under `docs/content/`)
-- [ ] If it is an in-house native C/C++/GGML engine, added to the maintained-engines table in the top-level `README.md`
-
-## Bundling runtime shared libraries (`package.sh`)
-
-The final `Dockerfile.python` stage is `FROM scratch` — there is no system `libc`, no `apt`, no fallback library path. Only files explicitly copied from the builder stage end up in the backend image. That means any runtime `dlopen` your backend (or its Python deps) needs **must** be packaged into `${BACKEND}/lib/`.
-
-Pattern:
-
-1. Make sure the library is installed in the builder stage of `backend/Dockerfile.python` (add it to the top-level `apt-get install`).
-2. Drop a `package.sh` in your backend directory that copies the library — and its soname symlinks — into `$(dirname $0)/lib`. See `backend/python/vllm/package.sh` for a reference implementation that walks `/usr/lib/x86_64-linux-gnu`, `/usr/lib/aarch64-linux-gnu`, etc.
-3. `Dockerfile.python` already runs `package.sh` automatically if it exists, after `package-gpu-libs.sh`.
-4. `libbackend.sh` automatically prepends `${EDIR}/lib` to `LD_LIBRARY_PATH` at run time, so anything packaged this way is found by `dlopen`.
-
-How to find missing libs: when a Python module silently fails to register torch ops or you see `AttributeError: '_OpNamespace' '...' object has no attribute '...'`, run the backend image's Python with `LD_DEBUG=libs` to see which `dlopen` failed. The filename in the error message (e.g. `libnuma.so.1`) is what you need to package.
-
-To verify packaging works without trusting the host:
-
-```bash
-make docker-build-<backend>
-CID=$(docker create --entrypoint=/run.sh local-ai-backend:<backend>)
-docker cp $CID:/lib /tmp/check && docker rm $CID
-ls /tmp/check    # expect the bundled .so files + symlinks
-```
-
-Then boot it inside a fresh `ubuntu:24.04` (which intentionally does *not* have the lib installed) to confirm it actually loads from the backend dir.
-
-## Importer integration
-
-When you add a new backend, you MUST also make it importable via the model import form (`/import-model`). The import form dropdown is sourced dynamically from `GET /backends/known` — it reads the importer registry at `core/gallery/importers/importers.go`, so the steps below are the ONLY way to make your backend show up.
-
-Required steps:
-
-1. **If your backend has unambiguous detection signals** (unique file extension, HF `pipeline_tag`, unique repo name pattern, unique artefact like `modules.json`):
-   - Create an importer file at `core/gallery/importers/<backend>.go` following the Match/Import pattern in `llama-cpp.go`.
-   - Register it in `importers.go:defaultImporters` in **specificity order** — more specific detectors must appear BEFORE more generic ones (e.g. `sentencetransformers` before `transformers`, `stablediffusion-ggml` before `llama-cpp`, `vllm-omni` before `vllm`). First match wins.
-2. **If your backend is a drop-in replacement** (same artefacts as another backend, e.g. `ik-llama-cpp` and `turboquant` both consume GGUF the same way `llama-cpp` does):
-   - Do NOT create a new importer. Extend the existing importer's `Import()` to swap the emitted `backend:` field when `preferences.backend` matches. See `llama-cpp.go` for the pattern.
-3. **If your backend has no reliable auto-detect signal** (preference-only — e.g. `sglang`, `tinygrad`, `whisperx`):
-   - Do NOT create an importer. Instead add the backend name to the curated pref-only slice in `core/http/endpoints/localai/backend.go` that feeds `/backends/known`. A single line addition.
-4. **Always** add a table-driven test in `core/gallery/importers/importers_test.go` (Ginkgo/Gomega):
-   - Use a real public HuggingFace repo URI as the test fixture (existing tests already hit the live HF API — follow that pattern).
-   - Cover detection (auto-match without preferences), preference-override (explicit `backend:` in preferences wins), and — if the backend's modality has a common `pipeline_tag` but ambiguous artefacts — an ambiguity test asserting `errors.Is(err, importers.ErrAmbiguousImport)`.
-
-Rules of thumb:
-
-- When in doubt, lean pref-only. A wrong auto-detect is worse than a forced preference.
-- Never silently emit a modality mismatch (e.g. emit `llama-cpp` for a TTS repo because `.gguf` is present). Return `ErrAmbiguousImport` instead.
-- Registration order is the single most common source of bugs. Check by running `go test ./core/gallery/importers/...` — the existing suite will fail if you've shadowed a pre-existing detector.
-
-## 6. Example: Adding a Python Backend
-
-For reference, when `moonshine` was added:
-- **Files created**: `backend/python/moonshine/{backend.py, Makefile, install.sh, protogen.sh, requirements.txt, run.sh, test.py, test.sh}`
-- **Workflow entries**: 3 build configurations (CPU, CUDA 12, CUDA 13)
-- **Index entries**: 1 meta definition + 6 image entries (cpu, cuda12, cuda13 x latest/development)
-- **Makefile updates**:
-  - Added to `.NOTPARALLEL` line
-  - Added to `prepare-test-extra` and `test-extra` targets
-  - Added `BACKEND_MOONSHINE = moonshine|python|./backend|false|true`
-  - Added eval for docker-build target generation
-  - Added `docker-build-moonshine` to `docker-build-backends`
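The specificity-order rule for `inferBackendPath` in step 2 can be illustrated with a small standalone sketch. It is not the code in `scripts/changed-backends.js`: `SUFFIX_RULES` and `inferBackendPathSketch` are hypothetical names, and the directory used for `ik-llama-cpp` is assumed for illustration. It only shows why a suffix such as `ik-llama-cpp` has to be tested before `llama-cpp`, since both match the same `dockerfile:` value under `endsWith()`.

```javascript
// Hypothetical, simplified stand-in for inferBackendPath (not the real script).
// Rules are checked top to bottom, so more-specific suffixes must come first.
const SUFFIX_RULES = [
  ["ik-llama-cpp", "backend/cpp/ik-llama-cpp/"], // assumed path, for illustration
  ["llama-cpp", "backend/cpp/llama-cpp/"],
];

function inferBackendPathSketch(item, rules = SUFFIX_RULES) {
  for (const [suffix, path] of rules) {
    if (item.dockerfile.endsWith(suffix)) {
      return path; // first match wins
    }
  }
  return null; // no rule matched
}

const entry = { dockerfile: "./backend/Dockerfile.ik-llama-cpp" };
console.log(inferBackendPathSketch(entry));
// Reversing the rules shows the shadowing bug: the generic suffix matches first.
console.log(inferBackendPathSketch(entry, [...SUFFIX_RULES].reverse()));
```

With the rules in the order shown, the entry resolves to the more specific directory. With the order reversed, the generic `llama-cpp` rule claims it, which is the failure mode the guide warns about.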
--- a/.agents/adding-gallery-models.md
+++ b/.agents/adding-gallery-models.md
@@ -1,111 +0,0 @@
-# Adding GGUF Models from HuggingFace to the Gallery
-
-When adding a GGUF model from HuggingFace to the LocalAI model gallery, follow this guide.
-
-## Gallery file
-
-All models are defined in `gallery/index.yaml`. Find the appropriate section (embedding models near other embeddings, chat models near similar chat models) and add a new entry.
-
-## Getting the SHA256
-
-GGUF files on HuggingFace expose their SHA256 via the `x-linked-etag` HTTP header. Fetch it with:
-
-```bash
-curl -sI "https://huggingface.co/<org>/<repo>/resolve/main/<filename>.gguf" | grep -i x-linked-etag
-```
-
-The value (without quotes) is the SHA256 hash. Example:
-
-```bash
-curl -sI "https://huggingface.co/ggml-org/embeddinggemma-300m-qat-q8_0-GGUF/resolve/main/embeddinggemma-300m-qat-Q8_0.gguf" | grep -i x-linked-etag
-# x-linked-etag: "6fa0c02a9c302be6f977521d399b4de3a46310a4f2621ee0063747881b673f67"
-```
-
-**Important**: Pay attention to exact filename casing — HuggingFace filenames are case-sensitive (e.g., `Q8_0` vs `q8_0`). Check the repo's file listing to get the exact name.
-
-## Entry format — Embedding models
-
-Embedding models use `gallery/virtual.yaml` as the base config and set `embeddings: true`:
-
-```yaml
-- name: "model-name"
-  url: github:mudler/LocalAI/gallery/virtual.yaml@master
-  urls:
-    - https://huggingface.co/<original-model-org>/<original-model-name>
-    - https://huggingface.co/<gguf-org>/<gguf-repo-name>
-  description: |
-    Short description of the model, its size, and capabilities.
-  tags:
-    - embeddings
-  overrides:
-    backend: llama-cpp
-    embeddings: true
-    parameters:
-      model: <filename>.gguf
-  files:
-    - filename: <filename>.gguf
-      uri: huggingface://<gguf-org>/<gguf-repo-name>/<filename>.gguf
-      sha256: <sha256-hash>
-```
-
-## Entry format — Chat/LLM models
-
-Chat models typically reference a template config (e.g., `gallery/gemma.yaml`, `gallery/chatml.yaml`) that defines the prompt format. Use YAML anchors (`&name` / `*name`) if adding multiple quantization variants of the same model:
-
-```yaml
-- &model-anchor
-  url: "github:mudler/LocalAI/gallery/<template>.yaml@master"
-  name: "model-name"
-  icon: https://example.com/icon.png
-  license: <license>
-  urls:
-    - https://huggingface.co/<org>/<model>
-    - https://huggingface.co/<gguf-org>/<gguf-repo>
-  description: |
-    Model description.
-  tags:
-    - llm
-    - gguf
-    - gpu
-    - cpu
-  overrides:
-    parameters:
-      model: <filename>-Q4_K_M.gguf
-  files:
-    - filename: <filename>-Q4_K_M.gguf
-      sha256: <sha256>
-      uri: huggingface://<gguf-org>/<gguf-repo>/<filename>-Q4_K_M.gguf
-```
-
-To add a variant (e.g., different quantization), use YAML merge:
-
-```yaml
-- !!merge <<: *model-anchor
-  name: "model-name-q8"
-  overrides:
-    parameters:
-      model: <filename>-Q8_0.gguf
-  files:
-    - filename: <filename>-Q8_0.gguf
-      sha256: <sha256>
-      uri: huggingface://<gguf-org>/<gguf-repo>/<filename>-Q8_0.gguf
-```
-
-## Available template configs
-
-Look at existing `.yaml` files in `gallery/` to find the right prompt template for your model architecture:
-
-- `gemma.yaml` — Gemma-family models (gemma, embeddinggemma, etc.)
-- `chatml.yaml` — ChatML format (many Mistral/OpenHermes models)
-- `deepseek.yaml` — DeepSeek models
-- `virtual.yaml` — Minimal base (good for embedding models that don't need chat templates)
-
-## Checklist
-
-1. **Find the GGUF file** on HuggingFace — note exact filename (case-sensitive)
-2. **Get the SHA256** using the `curl -sI` + `x-linked-etag` method above
-3. **Choose the right template** config from `gallery/` based on model architecture
-4. **Add the entry** to `gallery/index.yaml` near similar models
-5. **Set `embeddings: true`** if it's an embedding model
-6. **Include both URLs** — the original model page and the GGUF repo
-7. **Write a description** — mention model size, capabilities, and quantization type
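Step 2 of the checklist can be sketched as a small parser for the `x-linked-etag` header line that `curl -sI` prints. This is a minimal sketch, not part of LocalAI: `extractSha256` is a hypothetical helper, and the 64-hex-character check is an added sanity check that assumes the header carries a SHA256 digest as described in the guide.

```javascript
// Hypothetical helper: pull the SHA256 out of an `x-linked-etag` header line.
function extractSha256(headerLine) {
  const idx = headerLine.indexOf(":");
  if (idx === -1) return null;
  const name = headerLine.slice(0, idx).trim().toLowerCase();
  if (name !== "x-linked-etag") return null;
  // Drop surrounding whitespace and the double quotes around the value.
  const value = headerLine.slice(idx + 1).trim().replace(/^"|"$/g, "");
  // Added sanity check: a SHA256 digest is 64 hexadecimal characters.
  return /^[0-9a-f]{64}$/i.test(value) ? value.toLowerCase() : null;
}

const line =
  'x-linked-etag: "6fa0c02a9c302be6f977521d399b4de3a46310a4f2621ee0063747881b673f67"';
console.log(extractSha256(line));
```

The returned string is what goes into the `sha256:` field of the gallery entry. A `null` result suggests the header was missing or the value was not a plain digest, in which case the filename casing is worth re-checking.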
--- a/.agents/ai-coding-assistants.md
+++ b/.agents/ai-coding-assistants.md
@@ -1,101 +0,0 @@
-# AI Coding Assistants
-
-This document provides guidance for AI tools and developers using AI
-assistance when contributing to LocalAI.
-
-**LocalAI follows the same guidelines as the Linux kernel project for
-AI-assisted contributions.** See the upstream policy here:
-<https://docs.kernel.org/process/coding-assistants.html>
-
-The rules below mirror that policy, adapted to LocalAI's license and
-project layout. If anything is unclear, the kernel document is the
-authoritative reference for intent.
-
-AI tools helping with LocalAI development should follow the standard
-project development process:
-
-- [CONTRIBUTING.md](../CONTRIBUTING.md) — development workflow, commit
-  conventions, and PR guidelines
-- [.agents/coding-style.md](coding-style.md) — code style, editorconfig,
-  logging, and documentation conventions
-- [.agents/building-and-testing.md](building-and-testing.md) — build and
-  test procedures
-
-## Licensing and Legal Requirements
-
-All contributions must comply with LocalAI's licensing requirements:
-
-- LocalAI is licensed under the **MIT License** — see the [LICENSE](../LICENSE)
-  file
-- New source files should use the SPDX license identifier `MIT` where
-  applicable to the file type
-- Contributions must be compatible with the MIT License and must not
-  introduce code under incompatible licenses (e.g., GPL) without an
-  explicit discussion with maintainers
-
-## Signed-off-by and Developer Certificate of Origin
-
-**AI agents MUST NOT add `Signed-off-by` tags.** Only humans can legally
-certify the Developer Certificate of Origin (DCO). The human submitter
-is responsible for:
-
-- Reviewing all AI-generated code
-- Ensuring compliance with licensing requirements
-- Adding their own `Signed-off-by` tag (when the project requires DCO)
-  to certify the contribution
-- Taking full responsibility for the contribution
-
-AI agents MUST NOT add `Co-Authored-By` trailers for themselves either.
-A human reviewer owns the contribution; the AI's involvement is recorded
-via `Assisted-by` (see below).
-
-## Attribution
-
-When AI tools contribute to LocalAI development, proper attribution helps
-track the evolving role of AI in the development process. Contributions
-should include an `Assisted-by` tag in the commit message trailer in the
-following format:
-
-```
-Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]
-```
-
-Where:
-
-- `AGENT_NAME` — name of the AI tool or framework (e.g., `Claude`,
-  `Copilot`, `Cursor`)
-- `MODEL_VERSION` — specific model version used (e.g.,
-  `claude-opus-4-7`, `gpt-5`)
-- `[TOOL1] [TOOL2]` — optional specialized analysis tools invoked by the
-  agent (e.g., `golangci-lint`, `staticcheck`, `go vet`)
-
-Basic development tools (git, go, make, editors) should **not** be listed.
-
-### Example
-
-```
-fix(llama-cpp): handle empty tool call arguments
-
-Previously the parser panicked when the model returned a tool call with
-an empty arguments object. Fall back to an empty JSON object in that
-case so downstream consumers receive a valid payload.
-
-Assisted-by: Claude:claude-opus-4-7 golangci-lint
-Signed-off-by: Jane Developer <jane@example.com>
-```
-
-## Scope and Responsibility
-
-Using an AI assistant does not reduce the contributor's responsibility.
-The human submitter must:
-
-- Understand every line that lands in the PR
-- Verify that generated code compiles, passes tests, and follows the
-  project style
-- Confirm that any referenced APIs, flags, or file paths actually exist
-  in the current tree (AI models may hallucinate identifiers)
-- Not submit AI output verbatim without review
-
-Reviewers may ask for clarification on any change regardless of how it
-was produced. "An AI wrote it" is not an acceptable answer to a design
-question.
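The `Assisted-by` format from the Attribution section can be checked mechanically. The following is a minimal sketch, not an existing LocalAI tool: `parseAssistedBy` and its regular expression are assumptions made for illustration, and the optional tool list is returned as one raw string because a tool name such as `go vet` may itself contain a space.

```javascript
// Hypothetical helper: find the first `Assisted-by` trailer in a commit message.
// Assumed shape: "Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]".
function parseAssistedBy(message) {
  for (const rawLine of message.split("\n")) {
    const m = /^Assisted-by:\s*([^\s:]+):(\S+)(?:\s+(.*))?$/.exec(rawLine.trim());
    if (m) {
      return { agent: m[1], model: m[2], tools: (m[3] || "").trim() };
    }
  }
  return null; // no trailer found
}

const message = [
  "fix(llama-cpp): handle empty tool call arguments",
  "",
  "Assisted-by: Claude:claude-opus-4-7 golangci-lint",
  "Signed-off-by: Jane Developer <jane@example.com>",
].join("\n");

console.log(parseAssistedBy(message));
```

A check like this could run in a commit hook or CI step to flag a malformed trailer. It cannot tell who added the `Signed-off-by` line, so the human-only rule for that tag still depends on the submitter.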
--- a/.agents/api-endpoints-and-auth.md
+++ b/.agents/api-endpoints-and-auth.md
@@ -1,355 +0,0 @@
-# API Endpoints and Authentication
-
-This guide covers how to add new API endpoints and properly integrate them with the auth/permissions system.
-
-> **Before you ship a new endpoint or capability surface**, re-read the [checklist at the bottom of this file](#checklist). LocalAI advertises its feature surface in several independent places — miss any one of them and clients/admins/UI won't know the endpoint exists.
-
-## Architecture overview
-
-Authentication and authorization flow through three layers:
-
-1. **Global auth middleware** (`core/http/auth/middleware.go` → `auth.Middleware`) — applied to every request in `core/http/app.go`. Handles session cookies, Bearer tokens, API keys, and legacy API keys. Populates `auth_user` and `auth_role` in the Echo context.
-2. **Feature middleware** (`auth.RequireFeature`) — per-feature access control applied to route groups or individual routes. Checks if the authenticated user has the specific feature enabled.
-3. **Admin middleware** (`auth.RequireAdmin`) — restricts endpoints to admin users only.
-
-When auth is disabled (no auth DB, no legacy API keys), all middleware becomes pass-through (`auth.NoopMiddleware`).
-
-## Adding a new API endpoint
-
-### Step 1: Create the handler
-
-Write the endpoint handler in the appropriate package under `core/http/endpoints/`. Follow existing patterns:
-
-```go
-// core/http/endpoints/localai/my_feature.go
-func MyFeatureEndpoint(app *application.Application) echo.HandlerFunc {
-    return func(c echo.Context) error {
-        // Use auth.GetUser(c) to get the authenticated user (may be nil if auth is disabled)
-        user := auth.GetUser(c)
-
-        // Your logic here
-        return c.JSON(http.StatusOK, result)
-    }
-}
-```
-
-### Step 2: Register routes
-
-Add routes in the appropriate file under `core/http/routes/`. The file you use depends on the endpoint category:
-
-| File | Category |
-|------|----------|
-| `routes/openai.go` | OpenAI-compatible API endpoints (`/v1/...`) |
-| `routes/localai.go` | LocalAI-specific endpoints (`/api/...`, `/models/...`, `/backends/...`) |
-| `routes/agents.go` | Agent pool endpoints (`/api/agents/...`) |
-| `routes/auth.go` | Auth endpoints (`/api/auth/...`) |
-| `routes/ui_api.go` | UI backend API endpoints |
-
-### Step 3: Apply the right middleware
-
-Choose the appropriate protection level:
-
-#### No auth required (public)
-Exempt paths bypass auth entirely. Add to `isExemptPath()` in `middleware.go` or use the `/api/auth/` prefix (always exempt). Use sparingly — most endpoints should require auth.
-
-#### Standard auth (any authenticated user)
-The global middleware already handles this. API paths (`/api/`, `/v1/`, etc.) automatically require authentication when auth is enabled. You don't need to add any extra middleware.
-
-```go
-router.GET("/v1/my-endpoint", myHandler)  // auth enforced by global middleware
-```
-
-#### Admin only
-Pass `adminMiddleware` to the route. This is set up in `app.go` and passed to `Register*Routes` functions:
-
-```go
-// In the Register function signature, accept the middleware:
-func RegisterMyRoutes(router *echo.Echo, app *application.Application, adminMiddleware echo.MiddlewareFunc) {
-    router.POST("/models/apply", myHandler, adminMiddleware)
-}
-```
-
-#### Feature-gated
-For endpoints that should be toggleable per-user, use feature middleware. There are two approaches:
-
-**Approach A: Route-level middleware** (preferred for groups of related endpoints)
-
-```go
-// In app.go, create the feature middleware:
-myFeatureMw := auth.RequireFeature(application.AuthDB(), auth.FeatureMyFeature)
-
-// Pass it to the route registration function:
-routes.RegisterMyRoutes(e, app, myFeatureMw)
-
-// In the routes file, apply to a group:
-g := e.Group("/api/my-feature", myFeatureMw)
-g.GET("", listHandler)
-g.POST("", createHandler)
-```
-
-**Approach B: RouteFeatureRegistry** (preferred for individual OpenAI-compatible endpoints)
-
-Add an entry to `RouteFeatureRegistry` in `core/http/auth/features.go`. The `RequireRouteFeature` global middleware will automatically enforce it:
-
-```go
-var RouteFeatureRegistry = []RouteFeature{
-    // ... existing entries ...
-    {"POST", "/v1/my-endpoint", FeatureMyFeature},
-}
-```
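The registry lookup can be pictured as a table scan keyed on method and path. The following is a minimal sketch, not the actual `RequireRouteFeature` implementation: the `routeFeature` struct, its field names, the `featureFor` helper, and the exact-match comparison are all assumptions made for illustration (the real middleware may match Echo route patterns instead).

```go
package main

import "fmt"

// routeFeature is a hypothetical stand-in for the RouteFeature entries
// shown above: one row per method/path pair, naming the gating feature.
type routeFeature struct {
	Method  string
	Path    string
	Feature string
}

// registry plays the role of RouteFeatureRegistry in this sketch.
var registry = []routeFeature{
	{"POST", "/v1/my-endpoint", "my_feature"},
}

// featureFor returns the feature that gates a route, or "" when the route
// has no registry entry (meaning no feature check applies in this sketch).
func featureFor(method, path string) string {
	for _, rf := range registry {
		if rf.Method == method && rf.Path == path {
			return rf.Feature
		}
	}
	return ""
}

func main() {
	fmt.Println(featureFor("POST", "/v1/my-endpoint"))
	fmt.Println(featureFor("GET", "/v1/my-endpoint") == "")
}
```

Because the match is per method, a route that accepts both `GET` and `POST` would need two rows, which is consistent with the "one row per route/method" wording in the checklist.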
-
-## Adding a new feature
-
-When you need a new toggleable feature (not just a new endpoint under an existing feature):
-
-### 1. Define the feature constant
-
-Add to `core/http/auth/permissions.go`:
-
-```go
-const (
-    // Add the constant to exactly one group.
-    // Agent features (default OFF for new users):
-    FeatureMyFeature = "my_feature"
-
-    // OR API features (default ON for new users); same declaration, placed in that group instead:
-    // FeatureMyFeature = "my_feature"
-)
-```
-
-Then add it to the appropriate slice:
-
-```go
-// Default OFF — user must be explicitly granted access:
-var AgentFeatures = []string{..., FeatureMyFeature}
-
-// Default ON — user has access unless explicitly revoked:
-var APIFeatures = []string{..., FeatureMyFeature}
-```
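The difference between the two slices is only the default that applies when a user has no explicit setting. A rough sketch of that resolution follows, assuming per-user overrides are stored as a simple map; the feature names, the `overrides` map, and `hasFeature` are hypothetical and do not describe the real `auth.HasFeatureAccess` logic.

```go
package main

import "fmt"

// Hypothetical feature lists standing in for AgentFeatures and APIFeatures.
var (
	agentFeatures = []string{"my_agent_feature"} // default OFF
	apiFeatures   = []string{"my_api_feature"}   // default ON
)

func contains(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

// hasFeature applies an explicit per-user override when one exists and
// otherwise falls back to the default of the group the feature belongs to.
// The overrides map is an assumed storage model, used only for illustration.
func hasFeature(overrides map[string]bool, feature string) bool {
	if allowed, ok := overrides[feature]; ok {
		return allowed
	}
	switch {
	case contains(apiFeatures, feature):
		return true // default ON unless revoked
	case contains(agentFeatures, feature):
		return false // default OFF unless granted
	default:
		return false // unknown features are denied in this sketch
	}
}

func main() {
	fmt.Println(hasFeature(nil, "my_api_feature"))
	fmt.Println(hasFeature(nil, "my_agent_feature"))
	fmt.Println(hasFeature(map[string]bool{"my_agent_feature": true}, "my_agent_feature"))
}
```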
-
-### 2. Add feature metadata
-
-In `core/http/auth/features.go`, add to the appropriate `FeatureMetas` function so the admin UI can display it:
-
-```go
-func AgentFeatureMetas() []FeatureMeta {
-    return []FeatureMeta{
-        // ... existing ...
-        {FeatureMyFeature, "My Feature", false},  // false = default OFF
-    }
-}
-```
-
-### 3. Wire up the middleware
-
-In `core/http/app.go`:
-
-```go
-myFeatureMw := auth.RequireFeature(application.AuthDB(), auth.FeatureMyFeature)
-```
-
-Then pass it to the route registration function.
-
-### 4. Register route-feature mappings (if applicable)
-
-If your feature gates standard API endpoints (like `/v1/...`), add entries to `RouteFeatureRegistry` in `features.go` instead of using per-route middleware.
-
-## Accessing the authenticated user in handlers
-
-```go
-import "github.com/mudler/LocalAI/core/http/auth"
-
-func MyHandler(c echo.Context) error {
-    // Get the user (nil when auth is disabled or unauthenticated)
-    user := auth.GetUser(c)
-    if user == nil {
-        // Handle unauthenticated — or let middleware handle it
-    }
-
-    // Check role (user is nil when auth is disabled, so guard before dereferencing)
-    if user != nil && user.Role == auth.RoleAdmin {
-        // admin-specific logic
-    }
-
-    // Check feature access programmatically (when you need conditional behavior, not full blocking)
-    if auth.HasFeatureAccess(db, user, auth.FeatureMyFeature) {
-        // feature-specific logic
-    }
-
-    // Check model access
-    if !auth.IsModelAllowed(db, user, modelName) {
-        return c.JSON(http.StatusForbidden, ...)
-    }
-
-    return c.NoContent(http.StatusOK)
-}
-```
-
-## Middleware composition patterns
-
-Middleware can be composed at different levels. Here are the patterns used in the codebase:
-
-### Group-level middleware (agents pattern)
-```go
-// All routes in the group share the middleware
-g := e.Group("/api/agents", poolReadyMw, agentsMw)
-g.GET("", listHandler)
-g.POST("", createHandler)
-```
-
-### Per-route middleware (localai pattern)
-```go
-// Individual routes get middleware as extra arguments
-router.POST("/models/apply", applyHandler, adminMiddleware)
-router.GET("/metrics", metricsHandler, adminMiddleware)
-```
-
-### Middleware slice (openai pattern)
-```go
-// Build a middleware chain for a handler
-chatMiddleware := []echo.MiddlewareFunc{
-    usageMiddleware,
-    traceMiddleware,
-    modelFilterMiddleware,
-}
-app.POST("/v1/chat/completions", chatHandler, chatMiddleware...)
-```
-
-## Error response format
-
-Always use `schema.ErrorResponse` for auth/permission errors to stay consistent with the OpenAI-compatible API:
-
-```go
-return c.JSON(http.StatusForbidden, schema.ErrorResponse{
-    Error: &schema.APIError{
-        Message: "feature not enabled for your account",
-        Code:    http.StatusForbidden,
-        Type:    "authorization_error",
-    },
-})
-```
-
-Use these HTTP status codes:
- `401 Unauthorized` — no valid credentials provided
- `403 Forbidden` — authenticated but lacking permission
- `429 Too Many Requests` — rate limited (auth endpoints)
-
-## Usage tracking
-
-If your endpoint should be tracked for usage (token counts, request counts), add the `usageMiddleware` to its middleware chain. See `core/http/middleware/usage.go` and how it's applied in `routes/openai.go`.
-
-## Advertising surfaces — where to register a new capability
-
-Beyond routing and auth, LocalAI publishes its capability surface in **four independent places**. When you add an endpoint — especially one introducing a net-new capability like a new media type or a new auth-gated feature — you must update every relevant surface. These aren't optional: missing them means the endpoint works but is invisible to clients, admins, and the UI.
-
-### 1. Swagger `@Tags` annotation (mandatory)
-
-Every handler needs a swagger block so the endpoint appears in `/swagger/index.html` and in the `/api/instructions` output. The `@Tags` value is what groups the endpoint into a capability area:
-
-```go
-// MyEndpoint does X.
-// @Summary Do X.
-// @Tags my-capability
-// @Param request body schema.MyRequest true "payload"
-// @Success 200 {object} schema.MyResponse "Response"
-// @Router /v1/my-endpoint [post]
-func MyEndpoint(...) echo.HandlerFunc { ... }
-```
-
-Use an existing tag when the endpoint extends an existing area (e.g. `audio`, `images`, `face-recognition`). Create a new tag only when the endpoint introduces a genuinely new capability surface — and in that case, also register it in step 2.
-
-After adding endpoints, regenerate the embedded spec so the runtime serves it:
-
-```bash
-make protogen-go         # ensures gRPC codegen is fresh first
-make swagger             # regenerates swagger/swagger.json
-```
-
-### 2. `/api/instructions` registry (for new capability areas)
-
-`core/http/endpoints/localai/api_instructions.go` defines `instructionDefs` — a lightweight, machine-readable index of capability areas that groups swagger endpoints by tag. It's the primary discovery surface for agents and SDKs ("what can this server do?").
-
-**When to update:** only when adding a new capability area (a new swagger tag). Existing-tag additions automatically surface without any change here.
-
-Add an entry to `instructionDefs`:
-
-```go
-{
-    Name:        "my-capability",             // URL segment at /api/instructions/my-capability
-    Description: "Short sentence describing the capability",
-    Tags:        []string{"my-capability"},   // must match swagger @Tags
-    Intro:       "Optional gotcha/context that isn't in the swagger descriptions (caveats, defaults, cross-references to other endpoints).",
-},
-```
-
-Also bump the expected-length count in `api_instructions_test.go` and add the name to the `ContainElements` assertion.
-
-### 3. `capabilities.js` symbol (for new model-config FLAG_* flags)
-
-If your feature needs a new `FLAG_*` usecase flag in `core/config/model_config.go` (so users can filter gallery models by it, and so `/v1/models` surfaces it), you need to update **all** of:
-
- `Usecase<Name>` string constant in `core/config/backend_capabilities.go`
- `UsecaseInfoMap` entry mapping the string to its flag + gRPC method
- `FLAG_<NAME>` bitmask in `core/config/model_config.go`
- `GetAllModelConfigUsecases()` map entry (otherwise the YAML loader silently ignores the string)
- `ModalityGroups` membership if the flag should affect `IsMultimodal()` (e.g. realtime_audio is in both speech-input and audio-output groups so a lone flag still reads as multimodal)
- `GuessUsecases()` branch listing the backends that own this capability
- `usecaseFilters` in `core/http/routes/ui_api.go` (drives the gallery filter dropdown)
- `Models.jsx` `FILTERS` array + matching `filters.<camelCase>` i18n key in `core/http/react-ui/public/locales/en/models.json`
- `core/http/react-ui/src/utils/capabilities.js`:
-
-```js
-export const CAP_MY_CAPABILITY = 'FLAG_MY_CAPABILITY'
-```
-
-React pages that want to filter the ModelSelector by capability import this symbol. Declare it even if you're not building the UI page yet — the declaration keeps the Go/JS vocabularies in sync.
-
-### 4. `docs/content/` (user-facing documentation)
-
-A new capability deserves its own page under `docs/content/features/`, plus cross-links from related features and an entry in `docs/content/whats-new.md`. See the pattern used by `face-recognition.md` / `object-detection.md`.
-
-## Path protection rules
-
-The global auth middleware classifies paths as API paths or non-API paths:
-
- **API paths** (always require auth when auth is enabled): `/api/`, `/v1/`, `/models/`, `/backends/`, `/backend/`, `/tts`, `/vad`, `/video`, `/stores/`, `/system`, `/ws/`, `/metrics`
- **Exempt paths** (never require auth): `/api/auth/` prefix, anything in `appConfig.PathWithoutAuth`
- **Non-API paths** (UI, static assets): pass through without auth — the React UI handles login redirects client-side
-
-If you add endpoints under a new top-level path prefix, add it to `isAPIPath()` in `middleware.go` to ensure it requires authentication.
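The classification above amounts to two prefix checks, with exemptions evaluated first. This is a minimal sketch under stated assumptions: `apiPrefixes`, `hasAnyPrefix`, and `requiresAuth` are hypothetical names, the prefix list is copied from the bullets above, and treating `PathWithoutAuth` entries as prefixes is a guess. The real `isAPIPath()` and `isExemptPath()` in `middleware.go` may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// apiPrefixes mirrors the API path list in the section above.
var apiPrefixes = []string{
	"/api/", "/v1/", "/models/", "/backends/", "/backend/",
	"/tts", "/vad", "/video", "/stores/", "/system", "/ws/", "/metrics",
}

func hasAnyPrefix(path string, prefixes []string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(path, p) {
			return true
		}
	}
	return false
}

// requiresAuth reports whether a path needs credentials when auth is enabled.
// Exemptions are checked first so that /api/auth/ wins over the broader /api/.
func requiresAuth(path string, pathsWithoutAuth []string) bool {
	if strings.HasPrefix(path, "/api/auth/") || hasAnyPrefix(path, pathsWithoutAuth) {
		return false
	}
	return hasAnyPrefix(path, apiPrefixes)
}

func main() {
	for _, p := range []string{"/v1/chat/completions", "/api/auth/login", "/app/index.html"} {
		fmt.Println(p, requiresAuth(p, nil))
	}
}
```

The sketch also shows why a new top-level prefix needs registering: a path such as `/my-new-prefix/x` matches nothing in the list and would be treated as a non-API path.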
-
-## Checklist
-
-When adding a new endpoint:
-
-**Routing & auth**
- [ ] Handler in `core/http/endpoints/`
- [ ] Route registered in appropriate `core/http/routes/` file
- [ ] Auth level chosen: public / standard / admin / feature-gated
- [ ] Entry added to `RouteFeatureRegistry` in `core/http/auth/features.go` (one row per route/method — all /v1/* routes gate through this, not per-route middleware)
- [ ] If new feature: constant in `permissions.go`, added to the right slice (`APIFeatures` default-ON / `AgentFeatures` default-OFF), metadata in `features.go` `*FeatureMetas()`
- [ ] If feature uses group middleware: wired in `core/http/app.go` and passed to the route registration function
- [ ] If new path prefix: added to `isAPIPath()` in `middleware.go`
- [ ] If token-counting: `usageMiddleware` added to middleware chain
-
-**Advertising surfaces (easy to miss — see the [Advertising surfaces](#advertising-surfaces--where-to-register-a-new-capability) section)**
- [ ] Swagger block on the handler: `@Summary`, `@Tags`, `@Param`, `@Success`, `@Router`
- [ ] If new capability area (new swagger tag): entry in `instructionDefs` in `core/http/endpoints/localai/api_instructions.go` + test count bumped in `api_instructions_test.go`
- [ ] If new `FLAG_*` usecase flag: matching `CAP_*` symbol exported from `core/http/react-ui/src/utils/capabilities.js`
- [ ] `docs/content/features/<feature>.md` created; cross-links from related feature pages; entry in `docs/content/whats-new.md`
-
-**Quality**
- [ ] Error responses use `schema.ErrorResponse` format (or `echo.NewHTTPError` with a mapped gRPC status — see the `mapBackendError` helper in `core/http/endpoints/localai/images.go`)
- [ ] Tests cover both authenticated and unauthenticated access
- [ ] Swagger regenerated (`make swagger`) if you changed any `@Router`/`@Tags`/`@Param` annotation
-
-## Companion: MCP admin tool surface
-
-**Required for admin endpoints.** Every new admin endpoint MUST be considered for the MCP admin tool surface — the REST API and the MCP tool catalog can drift silently otherwise, and both the LocalAI Assistant chat modality and the standalone `local-ai mcp-server` rely on `pkg/mcp/localaitools/` to mirror REST.
-
-Two outcomes are acceptable; one is not:
-
- **Tool added.** The new endpoint is something an admin would manage conversationally (install, list, edit, toggle, upgrade). Follow the full checklist in [.agents/localai-assistant-mcp.md](localai-assistant-mcp.md): add a `LocalAIClient` interface method, implement it in both `inproc` and `httpapi`, register the tool with a `Tool*` constant, update the skill prompts, **and add the route to `toolToHTTPRoute` in `pkg/mcp/localaitools/coverage_test.go`**.
- **Tool deliberately skipped.** The endpoint is internal/diagnostic and adding a chat path would be misleading. Document the decision in the PR description; no code action.
- **Forgot.** This breaks the contract. The `TestToolHTTPRouteMappingComplete` test in `pkg/mcp/localaitools` is a partial guard (it checks every `Tool*` has a route mapping), but it does NOT detect new REST endpoints without a tool — that's still a process check on the PR author.
-
-**Add to the bottom of the checklist above**:
- [ ] If admin: decided whether MCP coverage is needed; if yes, tool registered + map updated; if no, skip-reason in PR description.
--- a/.agents/backend-signing.md
+++ b/.agents/backend-signing.md
@@ -1,126 +0,0 @@
-# Backend image signing & verification
-
-LocalAI verifies backend OCI images against a per-gallery keyless-cosign
-policy. This page documents the trust model, the producer side
-(`.github/workflows/backend_merge.yml` in this repo), and the consumer
-side (`pkg/oci/cosignverify` plus the gallery YAML).
-
-## Trust model
-
- **Producer:** `.github/workflows/backend_merge.yml` signs each pushed
-  manifest list with `cosign sign --recursive` in keyless mode after
-  `docker buildx imagetools create`. The signing cert is issued by
-  Fulcio bound to the workflow's OIDC identity. There is no long-lived
-  signing key. `--recursive` signs both the manifest list and every
-  per-arch entry — needed because our consumer resolves a tag to a
-  per-arch manifest before checking signatures.
- **Storage:** Signatures are written as OCI 1.1 referrers
-  (`--registry-referrers-mode=oci-1-1`) in the new Sigstore bundle format
-  (current cosign releases do this by default; no `--new-bundle-format`
-  flag). No `:sha256-<hex>.sig` tag clutter.
- **Consumer:** `pkg/oci/cosignverify` discovers the bundle via the
-  referrers API, hands it to `sigstore-go`, and verifies it against the
-  policy declared in the gallery YAML (`Gallery.Verification`).
- **Revocation:** Keyless cosign certs are ephemeral (10-minute Fulcio
-  validity), so revocation is policy-side, not CA-side. The gallery's
-  `verification.not_before` (RFC3339) is the kill-switch — advance it to
-  invalidate every signature produced before a known compromise window.
-
-## Producer setup
-
-`backend_merge.yml` is the workflow that joins per-arch digests into the
-multi-arch manifest list users actually pull, so it's also the right place
-to sign. The job needs:
-
- `permissions: { id-token: write, contents: read }` at the job level so
-  the runner can exchange its GitHub OIDC token for a Fulcio cert.
- `sigstore/cosign-installer@v3` step (current cosign releases already
-  default to the new bundle format).
- After each `docker buildx imagetools create`, resolve the resulting
-  list digest with `docker buildx imagetools inspect <tag> --format
-  '{{.Manifest.Digest}}'` and sign:
-
-```sh
-cosign sign --yes --recursive \
-  --registry-referrers-mode=oci-1-1 \
-  "${REGISTRY_REPO}@${DIGEST}"
-```
-
-Sign by digest, never by tag — signing by tag binds the signature to
-whatever the tag points at *now*, and a subsequent tag push orphans it.
-
-`--registry-referrers-mode=oci-1-1` is still gated behind
-`COSIGN_EXPERIMENTAL=1` in cosign v2.4.x (set at the job env level in
-`backend_merge.yml`). Re-evaluate when bumping the pinned cosign release
-— newer versions are expected to graduate this flag and the env var can
-then be dropped.
-
-`backend_build_darwin.yml` builds and pushes single-arch darwin images
-that bypass the manifest-list merge. If/when those entries get a gallery
-`verification:` policy, the equivalent cosign step has to land there
-too.
-
-## Consumer setup (in `mudler/LocalAI` gallery YAML)
-
-Once CI is signing, add a `verification:` block to the backend gallery
-entry (`backend/index.yaml`):
-
-```yaml
- name: localai
-  url: github:mudler/LocalAI/backend/index.yaml@master
-  verification:
-    issuer: "https://token.actions.githubusercontent.com"
-    identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@refs/heads/master$"
-    # Optional revocation cutoff; advance during incident response.
-    # not_before: "2026-06-01T00:00:00Z"
-```
-
-Identity matching pins the OIDC subject Fulcio issued the signing cert
-to. Without this, any image signed by *anyone* with a Fulcio cert would
-pass — the regex is what makes a signature mean "produced by our CI".
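To see what the identity pin accepts and rejects, the pattern from the YAML example can be exercised directly. This is only an illustration using Go's standard `regexp` package: `identityAllowed` is a hypothetical helper, not part of `pkg/oci/cosignverify`, and the fork identity is made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// identityRegex is the identity_regex value from the YAML example above,
// with the YAML double-quoted escaping (\\.) reduced to a plain regex escape (\.).
const identityRegex = `^https://github\.com/mudler/LocalAI/\.github/workflows/backend_merge\.yml@refs/heads/master$`

// identityAllowed reports whether a certificate identity matches the pattern.
func identityAllowed(pattern, certIdentity string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	return re.MatchString(certIdentity), nil
}

func main() {
	identities := []string{
		"https://github.com/mudler/LocalAI/.github/workflows/backend_merge.yml@refs/heads/master",
		"https://github.com/someone/fork/.github/workflows/backend_merge.yml@refs/heads/master",
	}
	for _, id := range identities {
		ok, err := identityAllowed(identityRegex, id)
		fmt.Println(ok, err)
	}
}
```

The `^` and `$` anchors matter here: without them, an identity that merely contains the expected workflow path as a substring would also match.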
-
-## Strict mode
-
-Default behaviour: OCI backends without a `verification:` block install
-with a warning (logs include `installing OCI backend without signature
-verification`). Tarball/HTTP backends without a `sha256` field log a
-similar warning.
-
-For production, set `LOCALAI_REQUIRE_BACKEND_INTEGRITY=1` (or pass
-`--require-backend-integrity` to `local-ai run` / `local-ai backends
-install` / `local-ai models install`). The warning becomes a hard error
-and unverifiable backends refuse to install.
-
-## Revocation playbook
-
-If `backend_merge.yml` (or any workflow with `id-token: write`) is
-compromised and we've shipped malicious signed images:
-
-1. **Identify the compromise window.** Find the earliest IntegratedTime
-   from the bad signatures (Rekor search by `subject` filter).
-2. **Set `verification.not_before`** in `backend/index.yaml` to a
-   timestamp just *after* that window's start.
-3. **Push the YAML.** Deployed LocalAI instances pick it up on next
-   gallery refresh (1-hour cache in `core/gallery/gallery.go`).
-4. **Fix the underlying compromise** in the workflow and re-sign images
-   with the new build, which will have IntegratedTime > `not_before`.
-5. **Optional:** for a clean break, also rotate to a new
-   workflow path (`backend_merge_v2.yml`) and update `identity_regex`.
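The cutoff in steps 2 and 4 reduces to a timestamp comparison. The sketch below assumes both values arrive as RFC3339 strings and that a signature integrated exactly at the cutoff still passes; `signatureStillTrusted` is a hypothetical helper, and the actual NotBefore enforcement in `pkg/oci/cosignverify` may differ on both points.

```go
package main

import (
	"fmt"
	"time"
)

// signatureStillTrusted reports whether a signature's integrated time is
// at or after the policy's not_before cutoff. An empty cutoff means no
// revocation window is configured, so every signature passes this check.
func signatureStillTrusted(notBefore, integratedTime string) (bool, error) {
	integrated, err := time.Parse(time.RFC3339, integratedTime)
	if err != nil {
		return false, fmt.Errorf("invalid integrated time %q: %w", integratedTime, err)
	}
	if notBefore == "" {
		return true, nil
	}
	cutoff, err := time.Parse(time.RFC3339, notBefore)
	if err != nil {
		return false, fmt.Errorf("invalid not_before %q: %w", notBefore, err)
	}
	return !integrated.Before(cutoff), nil
}

func main() {
	ok, err := signatureStillTrusted("2026-06-01T00:00:00Z", "2026-05-31T23:59:59Z")
	fmt.Println(ok, err)
	ok, err = signatureStillTrusted("2026-06-01T00:00:00Z", "2026-06-02T08:00:00Z")
	fmt.Println(ok, err)
}
```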
-
-## Where the code lives
-
- `pkg/oci/cosignverify/` — verifier, policy, OCI referrer fetch, NotBefore enforcement.
- `pkg/downloader/uri.go` — `WithImageVerifier` option threaded through `DownloadFileWithContext`.
- `core/gallery/backends.go` — `backendDownloadOptions` builds the verifier from the gallery's policy.
- `core/config/gallery.go` — `Gallery.Verification` YAML schema.
- `core/cli/run.go`, `core/cli/backends.go`, `core/cli/models.go` — `--require-backend-integrity` flag propagation.
- `.github/workflows/backend_merge.yml` — producer-side `cosign sign --recursive` after each multi-arch manifest list push.
-
-## Out of scope (follow-ups)
-
- **Signing the gallery YAML itself.** The index is fetched over HTTPS
-  from GitHub; we trust the host. A cosign blob signature on the YAML
-  would close that gap but adds key-management overhead. Revisit this
-  page if/when added.
- **Tarball/HTTP backend signing.** Cosign can sign arbitrary blobs, but
-  for now non-OCI backends keep using the `sha256:` field in YAML.
--- a/.agents/building-and-testing.md
+++ b/.agents/building-and-testing.md
@@ -1,49 +0,0 @@
-# Build and Testing
-
-Building and testing the project depends on the components involved and the platform where development is taking place. Because of the amount of context required, it's usually best not to build or test the project unless the user requests it. If you must build the project, inspect the Makefile in the project root and the Makefiles of any backends that are affected by your changes. The workflows in `.github/workflows` can also serve as a reference when it is unclear how to build or test a component. The primary Makefile contains targets for building inside or outside Docker; if the user has not already specified a preference, ask which they would like to use.
-
-## Building a specified backend
-
-Suppose the user wants to build a particular backend for a given platform, for example coqui for ROCm/hipblas:
-
- The Makefile has targets like `docker-build-coqui` created with `generate-docker-build-target` at the time of writing. Recently added backends may require a new target.
 At a minimum, set the `BUILD_TYPE` and `BASE_IMAGE` build-args:
-  - Use `.github/backend-matrix.yml` as a reference — it's the data-only YAML that lists every backend variant's `build-type`, `base-image`, `platforms`, etc. (`backend.yml` and `backend_pr.yml` consume it via `scripts/changed-backends.js`).
-  - l4t and cublas also require the CUDA major and minor version.
-  - For llama-cpp / ik-llama-cpp / turboquant the matrix also sets `builder-base-image` pointing at a prebuilt `quay.io/go-skynet/ci-cache:base-grpc-*` tag. Local `make backends/<name>` defaults to `BUILDER_TARGET=builder-fromsource` and doesn't need it — the Dockerfile's from-source stage installs everything itself.
- You can pretty print a command like `DOCKER_MAKEFLAGS=-j$(nproc --ignore=1) BUILD_TYPE=hipblas BASE_IMAGE=rocm/dev-ubuntu-24.04:7.2.1 make docker-build-coqui`
 Unless the user explicitly asks you to run the command, just print it: not all agent frontends handle long-running jobs well, and the output may overflow your context.
 The user may say AMD or ROCM instead of hipblas, Intel instead of SYCL, or NVIDIA instead of l4t or cublas. Ask for confirmation if there is ambiguity.
- Sometimes the user may need extra parameters to be added to `docker build` (e.g. `--platform` for cross-platform builds or `--progress` to view the full logs), in which case you can generate the `docker build` command directly.
-
-## Test coverage gate
-
-The core Go suites (`./pkg`, `./core`, plus the in-process integration suite `./tests/e2e`) are covered by a **strict, monotonic coverage ratchet**:
-
- `make test-coverage` — runs the suites with `covermode=atomic` instrumentation and writes a merged profile to `coverage/coverage.out`. Uses the same prerequisites as `make test`.
-  - **`--coverpkg` (`COVERAGE_COVERPKG = core/...,pkg/...`):** coverage is attributed to the core+pkg packages, not just the package under test. This is what lets the in-process `tests/e2e` suite (which drives the real HTTP server over loopback via `application.New`) credit the `core/http/endpoints/...` handlers it exercises — folding it in roughly doubled endpoint coverage (e.g. `endpoints/openai` 13.6% → 52%). The denominator is therefore *all* of `core`+`pkg` (minus generated proto, dropped via `COVERAGE_EXCLUDE_RE`), so the number isn't comparable to a plain per-package figure.
-  - **Integration suites (`COVERAGE_E2E_ROOTS = ./tests/e2e`)** run non-recursively (excludes `tests/e2e/distributed`, which needs containers) with `--label-filter=!real-models` (those need a downloaded model) against the mock backend built by `prepare-test`. `tests/integration` is deliberately excluded — it needs `make backends/local-store`, which the coverage CI job doesn't build.
-  - **Flake note:** folding integration tests into a *strict* gate means a hard e2e failure (or a spec that silently stops running) can fail the coverage gate, not just the test. `--flake-attempts` absorbs transient retryable failures; covermode=atomic keeps line coverage deterministic otherwise.
-  - **Why one ginkgo run per root (`scripts/run-coverage.sh`):** passing several recursive roots to a *single* ginkgo invocation (e.g. `ginkgo -r ./pkg ./core`) only merges **one** root's coverprofile into `--output-dir`/`--coverprofile` — the others are silently dropped. Verified with ginkgo 2.29.0: `-r ./pkg ./core` yields only `./pkg` coverage, while `-r ./core` alone yields all 34 core packages. So the script runs each root separately and concatenates the (disjoint) profiles. Don't "simplify" it back to a single multi-root invocation — that's how `core/` (including all of `core/http`, ~7.4k statements) silently vanished from the number before.
-  - **Build tags (`COVERAGE_TAGS`, passed via `GINKGO_TAGS`):** defaults to `debug auth`. The `auth` tag is required to compile the real (sqlite-backed) auth implementation and its ~150 `//go:build auth` tests — without it those files aren't built, the tests don't run, and the gate scores auth against a stub (~3.7% instead of ~38%). If you add new tag-gated tests, extend `COVERAGE_TAGS` or they won't count (and likely won't run in CI at all).
- `make test-coverage-check` — runs `test-coverage`, then `scripts/coverage-check.sh` fails the build if total coverage is **below** the committed baseline in `coverage-baseline.txt`. The Linux job in `.github/workflows/test.yml` runs this instead of `make test`.
- `make test-coverage-baseline` — regenerates and overwrites `coverage-baseline.txt` from the current run.
- `make install-hooks` — sets `core.hooksPath` to the versioned `.githooks/`, whose `pre-commit` runs checks scoped to what's staged: Go changes → `make lint` + `make test-coverage-check`; `core/http/react-ui/` changes → `make test-ui-coverage-check` (Playwright e2e + UI coverage gate). A commit touching neither is skipped; bypass with `git commit --no-verify`. The hook resolves golangci-lint's new-from base to `upstream/master` → `origin/master` → `master`, so it works from a fork clone where `origin/master` is stale (passed to `make lint` via `LINT_NEW_FROM`).
-
-### React UI coverage
-
-The React UI (`core/http/react-ui/`) has **no component/unit tests** — its only tests are the Playwright e2e specs in `e2e/`, which run against the real app served by `tests/e2e-ui/ui-test-server` (the dist is `//go:embed`ed, so the server is rebuilt per coverage run). Those specs do genuinely exercise the UI (clicks, `fill`, `setInputFiles`, `getByRole`/`getByText`, visibility/value assertions).
-
- `make test-ui-coverage` — builds an istanbul-instrumented bundle (`COVERAGE=true`, via `vite-plugin-istanbul` with `forceBuildInstrument: true` — the plugin skips production builds otherwise), re-embeds it into `ui-test-server` (the dist is `//go:embed`ed), runs the Playwright specs, and writes an `nyc` report to `core/http/react-ui/coverage/`. The specs import `{ test, expect }` from `e2e/coverage-fixtures.js` (re-exports Playwright's, plus harvests `window.__coverage__` into `.nyc_output/` after each test). Instrumentation is off unless `COVERAGE=true`, so dev/prod builds and plain `make test-ui-e2e` are unaffected (the fixture no-ops when `window.__coverage__` is absent).
- **Browser:** the flake dev shell ships `chromium` and exports `PLAYWRIGHT_CHROMIUM_PATH`; `playwright.config.js` uses it via `launchOptions.executablePath`, and the Makefile skips `playwright install` when it's set. This avoids Playwright's downloaded browser, which can't resolve system libs (`libglib-2.0`, …) on NixOS. In CI (no `PLAYWRIGHT_CHROMIUM_PATH`) the Makefile falls back to `playwright install --with-deps chromium`.
- The app is a React SPA, so coverage accumulates across in-app navigation within a test; a full `page.goto`/reload resets it.
- `.nycrc.json` uses `all: true`, so **every `src/**` file is in the report**, including 0%-coverage ones — that's how you spot features with no test at all (sort the HTML report or `coverage-summary.json` by line% ascending). 
- **UI coverage gate:** `make test-ui-coverage-check` runs the suite then `scripts/ui-coverage-check.sh`, failing if total line coverage drops more than `UI_COVERAGE_TOLERANCE` below `core/http/react-ui/coverage-baseline.txt`. `make test-ui-coverage-baseline` regenerates the baseline. Runs in CI (`tests-ui-e2e.yml`) and pre-commit on `core/http/react-ui/` changes.
- **Why it has a tolerance (unlike the strict Go gate):** UI e2e coverage is *non-deterministic*. Specs that assert on state and end while async/lazy render work is still in flight collect those lines only when the render beats the coverage teardown — so the total drifts with machine speed/load (a fast local box reads higher than a slow CI runner), diffusely across many specs. The tolerance absorbs that drift, so set the baseline *below* the slow-CI floor, never to a fast-local `make test-ui-coverage-baseline` number, or CI flaps.
- **Raising coverage is cheap:** a *render-smoke* spec (navigate to a route, assert its header renders) mounts a lazy page and runs its full render + initial effects, capturing most of its lines in a few lines of test — see `e2e/page-render-smoke.spec.js`. Auth is disabled in the test server (`isAdmin=true`), so `RequireAdmin`/`RequireFeature` routes render without a mock. The most *deterministic* win is removing a race: make a spec `await` a rendered element before ending (see `e2e/agents.spec.js` → AgentCreate) so its lines count every run.
-
-Rules (both gates):
- **Install the hooks:** `make install-hooks` once per clone so lint + coverage run pre-commit. Don't lean on CI for what the hook catches.
- **Don't work around the gate:** never `git commit --no-verify`, and never hand-lower a baseline or widen a tolerance to turn a red gate green. The ratchet only moves up.
- If a change drops coverage, **add tests** (sort `coverage-summary.json` by line% ascending to find untested code) rather than editing the baseline. When coverage legitimately rises, commit the regenerated baseline (`make test-coverage-baseline` / `test-ui-coverage-baseline`).
- The Go gate is **strict — no tolerance**; `covermode=atomic` keeps it deterministic. The UI gate keeps a small tolerance only because its e2e coverage isn't.
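The two gates differ only in the tolerance they allow below the baseline. As a rough sketch of that comparison (the real checks live in `scripts/coverage-check.sh` and `scripts/ui-coverage-check.sh`, whose exact arithmetic is not shown here), with `gatePasses` and the sample percentages being hypothetical:

```go
package main

import "fmt"

// gatePasses reports whether a coverage figure clears its baseline.
// A tolerance of 0 models the strict Go gate; a small positive tolerance
// models the UI gate, which only fails when coverage drops by more than
// the tolerance below the baseline.
func gatePasses(current, baseline, tolerance float64) bool {
	return current >= baseline-tolerance
}

func main() {
	fmt.Println("go gate, slight drop:", gatePasses(41.9, 42.0, 0))
	fmt.Println("ui gate, drop within tolerance:", gatePasses(41.6, 42.0, 0.5))
	fmt.Println("ui gate, drop beyond tolerance:", gatePasses(41.4, 42.0, 0.5))
}
```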
--- a/.agents/ci-caching.md
+++ b/.agents/ci-caching.md
@@ -1,250 +0,0 @@
-# CI Build Caching
-
-Container builds — both the root LocalAI image (`Dockerfile`) and the per-backend images (`backend/Dockerfile.*`) — share a registry-backed BuildKit cache plus a layered set of prebuilt base images. This file explains how the cache is laid out, what invalidates it, and how to bypass it.
-
-## Workflow surfaces
-
-| Workflow | Purpose | Triggers |
-|---|---|---|
-| `.github/workflows/backend.yml` | Backend container images on master | `push` to master + tags, weekly Sunday cron, `workflow_dispatch` |
-| `.github/workflows/backend_pr.yml` | Backend container images on PRs | `pull_request` |
-| `.github/workflows/backend_build.yml` | Reusable: builds one backend (one arch) by digest | `workflow_call` from above |
-| `.github/workflows/backend_merge.yml` | Reusable: assembles per-arch digests into a multi-arch manifest list | `workflow_call` |
-| `.github/workflows/backend_build_darwin.yml` | Reusable: macOS-native backend builds | `workflow_call` |
-| `.github/workflows/image.yml` / `image-pr.yml` | Root LocalAI image (push / PR) | push / PR |
-| `.github/workflows/image_build.yml` / `image_merge.yml` | Reusable: per-arch root-image build + merge | `workflow_call` |
-| `.github/workflows/base-images.yml` | Builds the prebuilt `base-grpc-*` builder bases | Saturdays 05:00 UTC cron, `workflow_dispatch`, master push touching `Dockerfile.base-grpc-builder`, `.docker/install-base-deps.sh`, `.docker/apt-mirror.sh`, or this workflow |
-
-The matrix that drives `backend.yml` / `backend_pr.yml` lives in **`.github/backend-matrix.yml`** (data-only YAML, not embedded in the workflow). `scripts/changed-backends.js` parses it, applies path-filter logic against the PR diff (PR events) or the GitHub Compare API (push events), and emits the filtered matrix plus a `merge-matrix` for backends with multiple per-arch entries.
-
-## Cache layout
-
- **Cache registry**: `quay.io/go-skynet/ci-cache`
- **One tag per matrix entry per arch**, derived from `tag-suffix` and `platform-tag`:
-  - Backend builds (`backend_build.yml`): `cache<tag-suffix>-<platform-tag>`
-    - e.g. `cache-cpu-faster-whisper-amd64`, `cache-cpu-faster-whisper-arm64`, `cache-gpu-nvidia-cuda-13-llama-cpp-amd64`
-  - Root image builds (`image_build.yml`): `cache-localai<tag-suffix>-<platform-tag>` (with a `-core` placeholder when `tag-suffix` is empty, so `cache-localai-core-amd64` for the core image)
-  - Pre-built base images (`base-images.yml`): `cache-base-grpc-<variant>` (one per `(BUILD_TYPE, arch)` permutation)
- Each tag stores a BuildKit cache manifest exported with `mode=max`, so every intermediate stage is reusable, not just the final image.
-
-The per-arch suffix exists because amd64 and arm64 builds produce different intermediate content; sharing one cache key would thrash on every cross-arch rebuild.
-
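As a rough illustration of the naming scheme above, the tags can be assembled as follows. The function names `cache_tag` and `root_cache_tag` are hypothetical helpers invented for this sketch, not scripts from the repository; only the tag shapes come from the text.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: rebuild the cache tag names described above.

# Backend builds: cache<tag-suffix>-<platform-tag>
cache_tag() {
  local tag_suffix="$1" platform_tag="$2"
  printf 'cache%s-%s\n' "$tag_suffix" "$platform_tag"
}

# Root image builds: cache-localai<tag-suffix>-<platform-tag>,
# falling back to the -core placeholder when tag-suffix is empty.
root_cache_tag() {
  local tag_suffix="${1:--core}" platform_tag="$2"
  printf 'cache-localai%s-%s\n' "$tag_suffix" "$platform_tag"
}

cache_tag -cpu-faster-whisper amd64
root_cache_tag "" amd64
```

Because the architecture is part of the tag, the amd64 and arm64 builds of the same backend never share a cache key.
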
-## Read/write semantics
-
-| Trigger | `cache-from` | `cache-to` |
-|---|---|---|
-| `push` to `master` / tag / cron / dispatch | yes | yes (`mode=max,ignore-error=true`) |
-| `pull_request` | yes | **no** |
-
-PR builds read master's warm cache but never write — this prevents PRs from polluting the shared cache with their experimental state. After merge, the master build for that matrix entry refreshes the cache.
-
-`ignore-error=true` on the write side means a transient quay push failure does not fail the build; the next master push retries.
-
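The read/write split above can be sketched with plain `docker buildx build` cache flags. `cache_flags` is a hypothetical helper for illustration, and the real workflows may wire these options through their build step differently; the registry name and the `mode=max,ignore-error=true` options are taken from the text.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the buildx cache flags for a given event.
# $1 = GitHub event name, $2 = cache tag (e.g. cache-cpu-faster-whisper-amd64)
cache_flags() {
  local event="$1" ref="quay.io/go-skynet/ci-cache:$2"

  # Every build reads the shared cache.
  echo "--cache-from type=registry,ref=${ref}"

  # Only non-PR builds write it back; ignore-error keeps a failed
  # cache push from failing the build.
  if [ "$event" != "pull_request" ]; then
    echo "--cache-to type=registry,ref=${ref},mode=max,ignore-error=true"
  fi
}

cache_flags pull_request cache-cpu-faster-whisper-amd64
cache_flags push cache-cpu-faster-whisper-amd64
```
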
-## Pre-built base images (`base-grpc-*`)
-
-The C++ backend Dockerfiles (`Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}`) compile gRPC from source. On a cold build that's ~25–35 min before any LocalAI source compiles. To skip that on CI, `.github/workflows/base-images.yml` builds and pushes a set of pre-prepped builder bases:
-
-| Tag | Contents |
-|---|---|
-| `base-grpc-amd64` / `base-grpc-arm64` | Ubuntu 24.04 + apt build deps + protoc + cmake + gRPC at `/opt/grpc` |
-| `base-grpc-cuda-12-amd64` | the above + CUDA 12.8 toolkit |
-| `base-grpc-cuda-13-amd64` | the above + CUDA 13.0 toolkit (Ubuntu 22.04 base) |
-| `base-grpc-cuda-13-arm64` | the above + CUDA 13.0 sbsa toolkit (Ubuntu 24.04 base) |
-| `base-grpc-l4t-cuda-12-arm64` | JetPack r36.4.0 base (CUDA preinstalled, `SKIP_DRIVERS=true`) + gRPC |
-| `base-grpc-rocm-amd64` | rocm/dev-ubuntu-24.04:7.2.1 base + hipblas/hipblaslt/rocblas + gRPC |
-| `base-grpc-vulkan-amd64` / `base-grpc-vulkan-arm64` | Ubuntu 24.04 + Vulkan SDK 1.4.335 + gRPC |
-| `base-grpc-intel-amd64` | intel/oneapi-basekit:2025.3.2 base + gRPC |
-
-**Single source of truth**: the install logic for all 10 variants lives in `.docker/install-base-deps.sh`. Both `Dockerfile.base-grpc-builder` AND each variant Dockerfile's `builder-fromsource` stage bind-mount and execute the same script — so the prebuilt CI base and the local from-source path are bit-equivalent by construction.
-
-### How variant Dockerfiles consume the base
-
-`Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}` are multi-target, with two alternative builder stages, an aliasing stage, and a final `scratch` stage:
-
- `builder-fromsource` — `FROM ${BASE_IMAGE}` then runs `install-base-deps.sh` and the per-backend compile script. Used when `BUILDER_TARGET=builder-fromsource` (the default; local `make backends/<name>`).
- `builder-prebuilt` — `FROM ${BUILDER_BASE_IMAGE}` (one of the prebuilt `base-grpc-*` tags) and runs only the per-backend compile script. Used when `BUILDER_TARGET=builder-prebuilt` (CI when the matrix entry sets `builder-base-image`).
- `FROM ${BUILDER_TARGET} AS builder` — alias resolves the ARG-selected stage to a fixed name (BuildKit doesn't allow ARG expansion in `COPY --from=`).
- `FROM scratch` + `COPY --from=builder ...package/. ./` — emits the final scratch image with just the package contents.
-
-BuildKit prunes the unreferenced builder stage, so each build only runs the path it needs. `backend_build.yml` derives `BUILDER_TARGET=builder-prebuilt` automatically when the matrix entry has a non-empty `builder-base-image`; otherwise it defaults to `builder-fromsource`.
-
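The selection rule described above is small enough to sketch directly. `builder_target` is a hypothetical function name; the two stage names and the "non-empty `builder-base-image`" condition come from the text.

```bash
#!/usr/bin/env bash
# Hypothetical helper: choose the builder stage from a matrix entry's
# builder-base-image value (empty or unset means "build from source").
builder_target() {
  local builder_base_image="${1:-}"
  if [ -n "$builder_base_image" ]; then
    echo "builder-prebuilt"
  else
    echo "builder-fromsource"
  fi
}

builder_target "base-grpc-amd64"
builder_target ""
```
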
-The matrix `(build-type, platforms)` → `builder-base-image` mapping for llama-cpp / ik-llama-cpp / turboquant entries:
-
-| `build-type` | `platforms` | tag |
-|---|---|---|
-| `''` | `linux/amd64` | `base-grpc-amd64` |
-| `''` | `linux/arm64` | `base-grpc-arm64` |
-| `cublas` cuda 12 | `linux/amd64` | `base-grpc-cuda-12-amd64` |
-| `cublas` cuda 13 | `linux/amd64` | `base-grpc-cuda-13-amd64` |
-| `cublas` cuda 13 | `linux/arm64` | `base-grpc-cuda-13-arm64` |
-| `cublas` cuda 12 + JetPack base | `linux/arm64` | `base-grpc-l4t-cuda-12-arm64` |
-| `hipblas` | `linux/amd64` | `base-grpc-rocm-amd64` |
-| `vulkan` | `linux/amd64` | `base-grpc-vulkan-amd64` |
-| `vulkan` | `linux/arm64` | `base-grpc-vulkan-arm64` |
-| `sycl_*` | `linux/amd64` | `base-grpc-intel-amd64` |
-
-### Bootstrap order when adding a new variant
-
-If you add a new entry to `base-images.yml`'s matrix, the new tag does not exist on quay until the workflow runs. To consume it from a variant entry safely, dispatch the base-images workflow on the branch first:
-
-```bash
-gh workflow run base-images.yml --ref <feature-branch>
-```
-
-Wait for the new variant to push, then merge the consumer change. Otherwise the consumer's CI fails with "image not found."
-
-## Per-arch native builds + manifest merge
-
-Multi-arch backends (and the core LocalAI image) build natively per arch instead of running both arches under QEMU emulation on a single x86 runner. The pattern:
-
- The matrix has TWO entries per multi-arch backend, sharing the same `tag-suffix` but distinct `platforms` + `platform-tag` + `runs-on`. Example: `-cpu-faster-whisper` has one amd64 entry on `ubuntu-latest` and one arm64 entry on `ubuntu-24.04-arm`.
- Each per-arch build pushes by **canonical digest only** (no tags) via `outputs: type=image,push-by-digest=true,name-canonical=true,push=true`. The digest is uploaded as an artifact named `digests<tag-suffix>-<platform-tag>` (or `digests-localai<...>` for root-image builds).
- `scripts/changed-backends.js` detects shared `tag-suffix` and emits a `merge-matrix` output. `backend.yml` / `backend_pr.yml` have a `backend-merge-jobs` job that consumes it and calls `backend_merge.yml`.
- `backend_merge.yml` downloads all matching digest artifacts and runs `docker buildx imagetools create` to publish the final tagged manifest list pointing at both per-arch digests. Same `docker/metadata-action` config as the original monolithic build, so consumers see no tag-shape change.
- `image_merge.yml` is the equivalent for the root LocalAI image (`-core` placeholder when `tag-suffix` is empty so the artifact-name glob doesn't over-match across `core` and `gpu-vulkan`).
-
-**`provenance: false` is required on multi-registry digest pushes**: with the default `mode=max` provenance attestation, BuildKit bundles a per-registry attestation manifest into each registry's manifest list, making the resulting list digest diverge across registries. `steps.build.outputs.digest` only matches one of them and the merge step's `imagetools create <reg>@sha256:<digest>` lookup fails on the other. Setting `provenance: false` keeps the digest content-only and identical across registries.
-
-## Path filter on master push
-
-Both `backend.yml` (push) and `backend_pr.yml` (PR) generate their matrix dynamically through `scripts/changed-backends.js`:
-
- **PR events**: paginated `pulls/{n}/files` API → filter the matrix to entries whose `dockerfile` path prefix matches the PR diff.
- **Push events**: GitHub Compare API (`/repos/{owner}/{repo}/compare/{before}...{after}`) → same path-filter logic. Falls back to "run everything" on first-branch push (`event.before` zero), API truncation (≥300 changed files), missing API token, or any thrown error.
- **Tag pushes**: `FORCE_ALL=true` is set from the workflow side (`startsWith(github.ref, 'refs/tags/')`) — releases rebuild every backend regardless of diff.
- **Schedule / `workflow_dispatch`**: no `event.before`, falls through to "run everything" automatically.
-
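The fallback conditions above amount to a short decision function. The real logic lives in `scripts/changed-backends.js`; the shell version below is only a sketch, and `run_everything` with its argument order is an assumption made for illustration. It does not model the "any thrown error" case.

```bash
#!/usr/bin/env bash
# 40 zeros: the "before" SHA GitHub reports for the first push of a branch.
ZERO_SHA="$(printf '%040d' 0)"

# Hypothetical helper: returns 0 (true) when the full matrix should run.
# $1 = FORCE_ALL, $2 = event.before, $3 = changed file count, $4 = API token
run_everything() {
  local force_all="$1" before="$2" changed_files="$3" token="$4"

  # Tag pushes set FORCE_ALL=true from the workflow side.
  if [ "$force_all" = "true" ]; then return 0; fi
  # Schedule and workflow_dispatch have no "before"; first pushes report zeros.
  if [ -z "$before" ] || [ "$before" = "$ZERO_SHA" ]; then return 0; fi
  # Without a token the Compare API cannot be queried.
  if [ -z "$token" ]; then return 0; fi
  # At 300 or more changed files the file list is treated as truncated.
  if [ "$changed_files" -ge 300 ]; then return 0; fi

  return 1
}

if run_everything false "abc123" 12 "token"; then
  echo "run everything"
else
  echo "apply path filter"
fi
```
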
-The Sunday 06:00 UTC cron on `backend.yml` exists specifically because path filtering can leave Python backends frozen on stale wheels. `DEPS_REFRESH` (below) only fires when the build actually runs, so an untouched Python backend would never re-resolve its unpinned deps. The weekly cron is the safety net.
-
-## The `DEPS_REFRESH` cache-buster (Python backends)
-
-Every Python backend goes through the shared `backend/Dockerfile.python`, which ends with:
-
-```dockerfile
-ARG DEPS_REFRESH=initial
-RUN cd /${BACKEND} && PORTABLE_PYTHON=true make
-```
-
-Most Python backends ship `requirements*.txt` files that **do not pin every transitive dep** (`torch`, `transformers`, `vllm`, `diffusers`, etc. are listed without a `==` pin, or with `>=` lower bounds only). With a warm BuildKit cache, the `make` layer hashes only on Dockerfile instructions + COPYed source — not on what `pip install` resolves at runtime. So a warm cache would ship the *first* version of `vllm` ever cached and never pick up upstream releases.
-
-`DEPS_REFRESH` defends against that:
-
- `backend_build.yml` computes `date -u +%Y-W%V` (ISO week, e.g. `2026-W19`) before each build and passes it as a build-arg.
- The `RUN ... make` layer's BuildKit hash now includes that string, so the layer invalidates **at most once per week**, automatically picking up newer wheels.
- Within a week, builds stay warm.
-
-This applies only to `Dockerfile.python` because:
- Go (`Dockerfile.golang`) pins versions in `go.mod` / `go.sum`.
- Rust (`Dockerfile.rust`) pins via `Cargo.lock`.
- C++ backends pin gRPC (`v1.65.0`) and llama.cpp at a specific commit; their inputs don't drift between rebuilds.
-
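The weekly `DEPS_REFRESH` value described above can be reproduced locally. The `date` format string is the one quoted in the text; the variable names and the way the build-arg is printed are assumptions of this sketch.

```bash
#!/usr/bin/env bash
# Compute the ISO-week marker used as the cache-buster, e.g. 2026-W19.
deps_refresh="$(date -u +%Y-W%V)"

# The value only changes when the week rolls over, so the layer that
# depends on it is rebuilt at most once per week.
build_arg="--build-arg DEPS_REFRESH=${deps_refresh}"

echo "$build_arg"
```

One detail worth knowing: `%V` is the ISO week number while `%Y` is the calendar year, so in the days around New Year the pair can look odd (the ISO year is `%G`). The key still changes once per week, which is all the cache-buster needs.
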
-### Adjusting the cadence
-
-Bump the format to daily (`+%Y-%m-%d`) or hourly (`+%Y-%m-%d-%H`) for faster refreshes. For one-shot rebuilds without changing the schedule, append a marker to the tag-suffix in the matrix or temporarily delete that backend's cache tag in quay.
-
-## ccache for C++ backend builds
-
-`Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}` declare a BuildKit cache mount on `/root/.ccache`:
-
-```dockerfile
-RUN --mount=type=cache,target=/root/.ccache,id=<backend>-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
-    bash /usr/local/sbin/compile.sh
-```
-
-The compile script exports `CMAKE_C/CXX/CUDA_COMPILER_LAUNCHER=ccache` so CMake threads ccache through gcc/g++/nvcc. `cache-to: type=registry,mode=max` exports the cache mount data into the registry cache, so subsequent builds restore it.
-
-On a `LLAMA_VERSION` bump, most translation units are byte-identical to the previous version's preprocessed source — ccache returns the previous `.o` and skips the real compile. Same for LocalAI source changes that don't actually touch llama.cpp's CMake inputs. Cache scope is per `(TARGETARCH, BUILD_TYPE)` so e.g. cublas-12 doesn't share with cublas-13 (their CUDA headers differ; cross-pollination would just be cache misses anyway).
-
-## Composite actions
-
-Two composite actions handle runner-side prep:
-
- **`.github/actions/free-disk-space/action.yml`** — wraps `jlumbroso/free-disk-space@main` plus an explicit apt purge of dotnet/android/ghc/mono/etc. Reclaims ~6–10 GB on `ubuntu-latest`. No-op on self-hosted runners. Used by `backend_build.yml`, `image_build.yml`, `test.yml`, `tests-aio.yml`, etc.
- **`.github/actions/setup-build-disk/action.yml`** — relocates Docker's data-root to `/mnt` on hosted X64 runners. GHA hosted `ubuntu-latest` ships ~75 GB of unused space at `/mnt`; combined with the free-disk-space cleanup this gives ~100 GB working space — enough for ROCm dev image + vLLM torch install + flash-attn intermediate layers. No-op on self-hosted and on non-X64 hosted runners. Used by `backend_build.yml`, `image_build.yml`, `base-images.yml`.
-
-Both actions run before any docker buildx step.
-
-## Concurrency
-
-All `backend.yml` / `image.yml` / `test.yml` / etc. workflows use:
-
-```yaml
-concurrency:
-  group: ci-<workflow>-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-```
-
- **PR events** group by PR number → newer pushes to the same PR cancel old runs (intended).
- **Push events** group by `github.sha` → each master commit gets its own run, so rapid-fire merges don't cancel each other. This was a real problem before: two master pushes 11 seconds apart would cancel the first one's CI.
-
-## Self-warming, no separate populator
-
-There is no cron job that pre-warms the BuildKit cache for individual backends. The production builds *are* the populators. The first master build of a given matrix entry pays the cold cost; subsequent same-entry master builds reuse everything that hasn't changed (apt installs, gRPC compile in the variant `builder-fromsource` stage or skipped entirely when consuming `base-grpc-*`, Python wheel installs, etc.). The base-images workflow's weekly cron is the closest thing to a populator and only refreshes the prebuilt builder bases.
-
-## Manually evicting cache
-
-To force a fully cold build for one backend or the whole image:
-
-```bash
-# Delete a single tag (requires quay credentials with admin on the repo)
-curl -X DELETE \
-  -H "Authorization: Bearer ${QUAY_TOKEN}" \
-  https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/cache-gpu-nvidia-cuda-12-vllm-amd64
-
-# List all tags
-curl -s -H "Authorization: Bearer ${QUAY_TOKEN}" \
-  "https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/?limit=100" | jq '.tags[].name'
-```
-
-Eviction is rarely needed in normal operation: `DEPS_REFRESH` handles weekly drift, source changes invalidate naturally, and the per-entry, per-arch cache tags keep each cache scoped so a stale tag never bleeds into a different build.
-
-## What the cache does **not** cover
-
- The `free-disk-space` and `setup-build-disk` composite actions run on every job — these reclaim runner-state, not Docker layers, so BuildKit caches don't apply.
- Intermediate artifacts of `Build (PR)` are not pushed anywhere — PRs only build for verification.
- Darwin builds (see below) — macOS runners have no Docker daemon, so the registry-backed BuildKit cache cannot apply.
-
-## Darwin native caches
-
-`backend_build_darwin.yml` runs natively on `macos-14` GitHub-hosted runners, so there is no Docker, no BuildKit, and no cross-job registry cache. Instead, the reusable workflow uses `actions/cache@v4` for four native caches that mirror the spirit of the Linux cache (warm by default, weekly refresh for unpinned Python deps, PRs read-only).
-
-| Cache | Path(s) | Key | Scope |
-|---|---|---|---|
-| Go modules + build | `~/go/pkg/mod`, `~/Library/Caches/go-build` | `go.sum` (managed by `actions/setup-go@v5` `cache: true`) | All darwin jobs |
-| Homebrew | `~/Library/Caches/Homebrew/downloads`, selected `/opt/homebrew/Cellar/*` | hash of `backend_build_darwin.yml` | All darwin jobs |
-| ccache (llama.cpp CMake) | `~/Library/Caches/ccache` | pinned `LLAMA_VERSION` from `backend/cpp/llama-cpp/Makefile` | `inputs.backend == 'llama-cpp'` only |
-| Python wheels (uv + pip) | `~/Library/Caches/pip`, `~/Library/Caches/uv` | `inputs.backend` + ISO week (`+%Y-W%V`) + hash of that backend's `requirements*.txt` | `inputs.lang == 'python'` only |
-
-Read/write semantics match the BuildKit cache: `actions/cache/restore` runs every time, `actions/cache/save` is gated on `github.event_name != 'pull_request'`. PRs read master's warm cache but never write back.
-
-The Python wheel cache uses the same ISO-week cache-buster as the Linux `DEPS_REFRESH` build-arg — same problem (unpinned `torch`/`mlx`/`diffusers`/`transformers` resolve to fresh wheels weekly), same ~one-cold-rebuild-per-week solution.
-
-The brew Cellar cache requires `HOMEBREW_NO_AUTO_UPDATE=1` and `HOMEBREW_NO_INSTALL_CLEANUP=1` (set as job-level env). Without those, `brew install` would mutate the very directories that were just restored, defeating the cache.
-
-**Force-link after cache restore**: `actions/cache` restores `/opt/homebrew/Cellar/*` but NOT the `/opt/homebrew/bin/*` symlinks. After a cache hit, `brew install` sees the Cellar entries and decides "already installed" without re-running its link step, leaving the formulas off PATH. The Dependencies step explicitly runs `brew link --overwrite` for every cached formula afterwards to ensure the symlinks exist.
-
-For ccache, the workflow exports `CMAKE_ARGS=… -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache` via `$GITHUB_ENV` before running `make build-darwin-go-backend`. The Makefile in `backend/cpp/llama-cpp/` already forwards `CMAKE_ARGS` through to each variant build (`fallback`, `grpc`, `rpc-server`), so no script changes are needed. The three variants share most TUs, so ccache dedupes object files across them.
-
-`backend_build_darwin.yml` also has a llama-cpp-specific build-step branch that runs `make backends/llama-cpp-darwin` (the bespoke script that compiles three CMake variants and bundles dylibs via `otool`), distinct from the generic `make build-darwin-${lang}-backend` path. This was consolidated from a previously-bespoke top-level `llama-cpp-darwin` job in `backend.yml` so llama-cpp on Darwin honors the same path filter as the other 34 Darwin backends.
-
-### Cache budget on Darwin
-
-GitHub Actions caches are limited to 10 GB per repo. Steady-state worst case: ~800 MB Go cache + ~2 GB brew Cellar + up to 2 GB ccache + ~1.5 GB × 5 python backends. If the cap is hit, prefer collapsing the per-backend Python keys into a shared `pyenv-darwin-shared-<week>` key (accepts more cross-backend churn for a smaller footprint) before reducing other caches.
-
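Adding up the worst-case figures quoted above shows why the cap matters. The numbers come straight from the text; the variable names are just for this sketch.

```bash
#!/usr/bin/env bash
# Worst-case steady-state cache sizes in GB, as quoted above.
go_cache=0.8
brew_cellar=2
ccache=2
python_per_backend=1.5
python_backends=5
cap=10

# LC_ALL=C keeps the decimal separator a dot regardless of locale.
total="$(LC_ALL=C awk -v g="$go_cache" -v b="$brew_cellar" -v c="$ccache" \
  -v p="$python_per_backend" -v n="$python_backends" \
  'BEGIN { printf "%.1f", g + b + c + p * n }')"

echo "worst case: ${total} GB (cap: ${cap} GB)"
```

The total comes to 12.3 GB, so the worst case already exceeds the 10 GB limit, with the per-backend Python caches (7.5 GB) as the largest share. That is consistent with collapsing the Python keys first.
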
-## Self-hosted runners
-
-`.github/backend-matrix.yml` has zero references to `arc-runner-set` or `bigger-runner` — all backends run on GHA free-tier hosted runners (`ubuntu-latest` for amd64, `ubuntu-24.04-arm` for arm64 native, `macos-14` for Darwin). The migration off self-hosted relied on the per-arch native split (no QEMU emulation) plus `setup-build-disk`'s `/mnt` relocation (~100 GB working space, enough for ROCm dev image + vLLM/torch installs).
-
-One residual self-hosted reference remains in `test-extra.yml` (`tests-vibevoice-cpp-grpc-transcription` uses `bigger-runner` for the 30s JFK-decode timeout headroom). That's a separate concern.
-
-## Touching the cache pipeline
-
-When changing `image_build.yml`, `backend_build.yml`, any of the `backend/Dockerfile.*` files, `Dockerfile.base-grpc-builder`, `.docker/install-base-deps.sh`, `.docker/<backend>-compile.sh`, or `scripts/changed-backends.js`:
-
-1. **Don't drop `DEPS_REFRESH=...` from the build-args** without a replacement strategy (lockfiles, pinned requirements). Otherwise master will silently freeze on whichever versions were cached at the time.
-2. **Keep `(tag-suffix, platform-tag)` unique per matrix entry** — together they're the cache namespace. Two matrix entries sharing a key would clobber each other's cache.
-3. **Keep `cache-to` gated on `github.event_name != 'pull_request'`** — PRs must not write.
-4. **Keep `ignore-error=true` on `cache-to`** — quay registry hiccups must not fail builds.
-5. **Keep `provenance: false` on push-by-digest steps** — multi-registry digest divergence is the Bug We Already Fixed; reintroducing provenance attestation re-breaks the merge.
-6. **`install-base-deps.sh` is the single source of truth for base contents.** Both `Dockerfile.base-grpc-builder` (CI) and the variant Dockerfiles' `builder-fromsource` (local) bind-mount and execute it. If you add a package to one path, add it to the script — don't fork the logic into a Dockerfile RUN.
-7. **After adding a `base-images.yml` matrix variant, run the workflow on your branch before merging consumer changes** that depend on the new tag — otherwise the consumer's CI fails "image not found."
--- a/.agents/coding-style.md
+++ b/.agents/coding-style.md
@@ -1,71 +0,0 @@
-# Coding Style
-
-The project has the following .editorconfig:
-
-```
-root = true
-
-[*]
-indent_style = space
-indent_size = 2
-end_of_line = lf
-charset = utf-8
-trim_trailing_whitespace = true
-insert_final_newline = true
-
-[*.go]
-indent_style = tab
-
-[Makefile]
-indent_style = tab
-
-[*.proto]
-indent_size = 2
-
-[*.py]
-indent_size = 4
-
-[*.js]
-indent_size = 2
-
-[*.yaml]
-indent_size = 2
-
-[*.md]
-trim_trailing_whitespace = false
-```
-
- Use comments sparingly to explain why code does something, not what it does. Comments are there to add context that would be difficult to deduce from reading the code.
- Prefer modern Go, e.g. use `any` instead of `interface{}`.
-
-## Logging
-
-Use `github.com/mudler/xlog` for logging; it has the same API as `slog`.
-
-## Go tests
-
-All Go tests — including backend tests — must use [Ginkgo](https://onsi.github.io/ginkgo/) (v2) with Gomega matchers, not the stdlib `testing` package with `t.Run` / `t.Errorf`. A test file should register a suite with `RegisterFailHandler(Fail)` in a `TestXxx(t *testing.T)` bootstrap and use `Describe`/`Context`/`It` blocks for the actual cases. Look at any existing `*_test.go` under `core/` or `pkg/` for a template.
-
-Do not mix styles within a package. If you are extending tests in a package that already uses Ginkgo, keep using Ginkgo. If you find stdlib-style Go tests in the tree, treat them as tech debt to be migrated rather than as a pattern to follow.
-
-This is enforced by `golangci-lint` via the `forbidigo` linter (see `.golangci.yml`); calls like `t.Errorf` / `t.Fatalf` / `t.Run` / `t.Skip` / `t.Logf` are flagged. Run `make lint` locally before submitting; the same check runs in CI (`.github/workflows/lint.yml`).
-
-## Outbound HTTP
-
-All outbound HTTP must go through `github.com/mudler/LocalAI/pkg/httpclient` rather than the standard library's default client. Use `httpclient.New(...)` (no body deadline — safe for streaming/SSE) or `httpclient.NewWithTimeout(d, ...)` (simple request/response). Both **refuse redirects by default** and set a TLS 1.2 floor.
-
-The reason is GHSA-3mj3-57v2-4636: the standard library's default client follows redirects, and on a *cross-host* redirect Go forwards custom credential headers (e.g. Anthropic's `x-api-key`) to the redirect target, leaking the secret. `httpclient` fails closed instead.
-
- Need to follow redirects (download CDNs, registry blobs, GitHub asset URLs)? Pass `httpclient.WithFollowRedirects()` — it still strips credential headers on any cross-host hop.
- Have a custom transport (IP-pinned dialer, HTTP/2 tuning, a credential-injecting `RoundTripper`)? Pass `httpclient.WithTransport(rt)`, basing the transport on `httpclient.HardenedTransport()` to keep the TLS floor. Handed a `*http.Client` by a library? `httpclient.Harden(c)` applies the policy in place.
-
-This is enforced by `forbidigo` (see `.golangci.yml`): `http.DefaultClient` and `http.Get`/`Post`/`PostForm`/`Head` are flagged. The `&http.Client{}` composite literal can't be matched precisely by forbidigo without also flagging legitimate `*http.Client` type references, so that form is caught by review — don't construct raw clients.
-
-## Documentation
-
-The project documentation is located in `docs/content`. When adding new features or changing existing functionality, it is crucial to update the documentation to reflect these changes. This helps users understand how to use the new capabilities and ensures the documentation stays relevant.
-
- **Feature Documentation**: If you add a new feature (like a new backend or API endpoint), create a new markdown file in `docs/content/features/` explaining what it is, how to configure it, and how to use it.
- **Configuration**: If you modify configuration options, update the relevant sections in `docs/content/`.
- **Examples**: Provide concrete examples (like YAML configuration blocks) to help users get started quickly.
- **Shortcodes**: Use `{{% notice note %}}`, `{{% notice tip %}}`, or `{{% notice warning %}}` for callout boxes. Do **not** use `{{% alert %}}` — that shortcode does not exist in this project's Hugo theme and will break the docs build.
--- a/.agents/debugging-backends.md
+++ b/.agents/debugging-backends.md
@@ -1,141 +0,0 @@
-# Debugging and Rebuilding Backends
-
-When a backend fails at runtime (e.g. a gRPC method error, a Python import error, or a dependency conflict), use this guide to diagnose, fix, and rebuild.
-
-## Architecture Overview
-
- **Source directory**: `backend/python/<name>/` (or `backend/go/<name>/`, `backend/cpp/<name>/`)
- **Installed directory**: `backends/<name>/` — this is what LocalAI actually runs. It is populated by `make backends/<name>` which builds a Docker image, exports it, and installs it via `local-ai backends install`.
- **Virtual environment**: `backends/<name>/venv/` — the installed Python venv (for Python backends). The Python binary is at `backends/<name>/venv/bin/python`.
-
-Editing files in `backend/python/<name>/` does **not** affect the running backend until you rebuild with `make backends/<name>`.
-
-## Diagnosing Failures
-
-### 1. Check the logs
-
-Backend gRPC processes log to LocalAI's stdout/stderr. Look for lines tagged with the backend's model ID:
-
-```
-GRPC stderr id="trl-finetune-127.0.0.1:37335" line="..."
-```
-
-Common error patterns:
- **"Method not implemented"** — the backend is missing a gRPC method that the Go side calls. The model loader (`pkg/model/initializers.go`) always calls `LoadModel` after `Health`; fine-tuning backends must implement it even as a no-op stub.
- **Python import errors / `AttributeError`** — usually a dependency version mismatch (e.g. `pyarrow` removing `PyExtensionType`).
- **"failed to load backend"** — the gRPC process crashed or never started. Check stderr lines for the traceback.
-
-### 2. Test the Python environment directly
-
-You can run the installed venv's Python to check imports without starting the full server:
-
-```bash
-backends/<name>/venv/bin/python -c "import datasets; print(datasets.__version__)"
-```
-
-If `pip` is missing from the venv, bootstrap it:
-
-```bash
-backends/<name>/venv/bin/python -m ensurepip
-```
-
-Then use `backends/<name>/venv/bin/python -m pip install ...` to test fixes in the installed venv before committing them to the source requirements.
-
-### 3. Check upstream dependency constraints
-
-When you hit a dependency conflict, check what the main library expects. For example, TRL's upstream `requirements.txt`:
-
-```
-https://github.com/huggingface/trl/blob/main/requirements.txt
-```
-
-Pin minimum versions in the backend's requirements files to match upstream.
-
-## Common Fixes
-
-### Missing gRPC methods
-
-If the Go side calls a method the backend doesn't implement (e.g. `LoadModel`), add a no-op stub in `backend.py`:
-
-```python
-def LoadModel(self, request, context):
-    """No-op — actual loading happens elsewhere."""
-    return backend_pb2.Result(success=True, message="OK")
-```
-
-The gRPC contract requires `LoadModel` to succeed for the model loader to return a usable client, even if the backend doesn't need upfront model loading.
-
-### Dependency version conflicts
-
-Python backends often break when a transitive dependency releases a breaking change (e.g. `pyarrow` removing `PyExtensionType`). Steps:
-
-1. Identify the broken import in the logs
-2. Test in the installed venv: `backends/<name>/venv/bin/python -c "import <module>"`
-3. Check upstream requirements for version constraints
-4. Update **all** requirements files in `backend/python/<name>/`:
-   - `requirements.txt` — base deps (grpcio, protobuf)
-   - `requirements-cpu.txt` — CPU-specific (includes PyTorch CPU index)
-   - `requirements-cublas12.txt` — CUDA 12
-   - `requirements-cublas13.txt` — CUDA 13
-5. Rebuild: `make backends/<name>`
-
-### PyTorch index conflicts (uv resolver)
-
-The Docker build uses `uv` for pip installs. When `--extra-index-url` points to the PyTorch wheel index, `uv` may refuse to fetch packages like `requests` from PyPI if it finds a different version on the PyTorch index first. Fix this by adding `--index-strategy=unsafe-first-match` to `install.sh`:
-
-```bash
-EXTRA_PIP_INSTALL_FLAGS+=" --upgrade --index-strategy=unsafe-first-match"
-installRequirements
-```
-
-Most Python backends already do this — check `backend/python/transformers/install.sh` or similar for reference.
-
-## Rebuilding
-
-### Rebuild a single backend
-
-```bash
-make backends/<name>
-```
-
-This runs the Docker build (`Dockerfile.python`), exports the image to `backend-images/<name>.tar`, and installs it into `backends/<name>/`. It also rebuilds the `local-ai` Go binary (without extra tags).
-
-**Important**: If you were previously running with `GO_TAGS=auth`, the `make backends/<name>` step will overwrite your binary without that tag. Rebuild the Go binary afterward:
-
-```bash
-GO_TAGS=auth make build
-```
-
-### Rebuild and restart
-
-After rebuilding a backend, you must restart LocalAI for it to pick up the new backend files. The backend gRPC process is spawned on demand when the model is first loaded.
-
-```bash
-# Kill existing process
-kill <pid>
-
-# Restart
-./local-ai run --debug [your flags]
-```
-
-### Quick iteration (skip Docker rebuild)
-
-For fast iteration on a Python backend's `backend.py` without a full Docker rebuild, you can edit the installed copy directly:
-
-```bash
-# Edit the installed copy
-vim backends/<name>/backend.py
-
-# Restart LocalAI to respawn the gRPC process
-```
-
-This is useful for testing but **does not persist** — the next `make backends/<name>` will overwrite it. Always commit fixes to the source in `backend/python/<name>/`.
-
-## Verification
-
-After fixing and rebuilding:
-
-1. Start LocalAI and confirm the backend registers: look for `Registering backend name="<name>"` in the logs
-2. Trigger the operation that failed (e.g. start a fine-tuning job)
-3. Watch the GRPC stderr/stdout lines for the backend's model ID
-4. Confirm no errors in the traceback
--- a/.agents/ds4-backend.md
+++ b/.agents/ds4-backend.md
@@ -1,145 +0,0 @@
-# Working on the ds4 Backend
-
-`antirez/ds4` is a single-model inference engine for DeepSeek V4 Flash.
-LocalAI wraps the engine's C API (`ds4/ds4.h`) with a fresh C++ gRPC server at
-`backend/cpp/ds4/` - NOT a fork of llama-cpp's grpc-server.cpp.
-
-## Pin
-
-`backend/cpp/ds4/Makefile` pins `DS4_VERSION?=<sha>` at the top. The `ds4`
-target in the Makefile clones `antirez/ds4` at that commit (mirroring the
-llama-cpp / ik-llama-cpp / turboquant pattern). The bump-deps bot
-(`.github/workflows/bump_deps.yaml`) finds this pin via grep and opens a
-daily PR to update it. To bump manually: edit the `DS4_VERSION?=` line,
-then `make purge && make` (or rely on CI's clean build).
-
-## Wire shape
-
-| RPC | Implementation |
-|---|---|
-| Health, Free, Status | Trivial; no engine dependency for Health |
-| LoadModel | `ds4_engine_open` + `ds4_session_create`; backend is compile-time (DS4_NO_GPU → CPU, __APPLE__ → Metal, otherwise CUDA) |
-| TokenizeString | `ds4_tokenize_text` |
-| Predict | `ds4_engine_generate_argmax` + `DsmlParser` → one ChatDelta with content / reasoning_content / tool_calls[] |
-| PredictStream | Same, per-token ChatDelta writes |
-
-## DSML
-
-ds4 emits tool calls as literal text markers (`<｜DSML｜tool_calls>` etc.) -
-NOT special tokens. `dsml_parser.{h,cpp}` is our streaming state machine that
-classifies token bytes into CONTENT / REASONING / TOOL_START / TOOL_ARGS / TOOL_END
-events. `dsml_renderer.{h,cpp}` does the prompt direction: turns
-OpenAI tool_calls + role=tool messages back into DSML for the next turn.
-
-## Thinking modes
-
-`PredictOptions.Metadata["enable_thinking"]` gates thinking on/off (default ON).
-`["reasoning_effort"] == "max" | "xhigh"` selects `DS4_THINK_MAX`; anything else
-maps to `DS4_THINK_HIGH`. We pass the chosen mode to `ds4_chat_append_assistant_prefix`.
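The mode selection above can be sketched as follows. This is a minimal Python illustration, not the C++ backend code: the string encoding of `enable_thinking` (anything other than `"false"` counts as on) and the `None` return for "thinking off" are assumptions made for the example.

```python
from typing import Mapping, Optional

# Names mirror the constants mentioned above; the string values are illustrative only.
DS4_THINK_HIGH = "DS4_THINK_HIGH"
DS4_THINK_MAX = "DS4_THINK_MAX"


def select_think_mode(metadata: Mapping[str, str]) -> Optional[str]:
    """Return the thinking mode for a request, or None when thinking is disabled."""
    # Assumption: the flag arrives as a string and "false" (any case) turns thinking off.
    enabled = metadata.get("enable_thinking", "true").strip().lower() != "false"
    if not enabled:
        return None
    effort = metadata.get("reasoning_effort", "")
    # Only "max" and "xhigh" select the MAX mode; everything else falls back to HIGH.
    return DS4_THINK_MAX if effort in ("max", "xhigh") else DS4_THINK_HIGH
```

Note that an unrecognised effort string is not an error in this sketch; it simply lands on `DS4_THINK_HIGH`, matching the "anything else" wording above.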
-
-## Disk KV cache
-
-`kv_cache.{h,cpp}` implements an SHA1-keyed file cache using ds4's public
-`ds4_session_save_payload` / `ds4_session_load_payload` API. Enable per model
-via `ModelOptions.Options[] = "kv_cache_dir:/some/path"`. Format is **our own** -
-NOT bit-compatible with ds4-server's KVC files (interop is a follow-up plan).
-
-## Engine options (LoadModel)
-
-`LoadModel` maps `ModelOptions.Options[]` (`"key:value"`, from model-YAML
-`options:`) onto `ds4_engine_options` through a **declarative table**
-(`kEngineOptSpecs` + `apply_engine_option` in `grpc-server.cpp`). The struct is
-plain C with no reflection, so the field set is enumerated once in the table;
-adding a future engine knob is a one-line table row, not a new branch. Unknown
-keys are ignored (back-compat). A bare flag (`ssd_streaming` with no value)
-means `true`. Path-type values (`mtp_path`, `expert_profile_path`,
-`directional_steering_file`) resolve **relative to the model directory**, so a
-gallery entry can reference a companion file it downloaded by bare filename;
-absolute values pass through. `ds4_role` / `ds4_layers` / `ds4_listen` /
-`ds4_route_timeout` / `kv_cache_dir` keep their dedicated handling (validation
-+ coordinator wiring) and are not in the table.
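The parsing rules in this paragraph (first-colon split, bare flag means `true`, relative paths resolved against the model directory, unknown keys dropped) can be sketched in Python. The real table is `kEngineOptSpecs` in `grpc-server.cpp`; the `KNOWN_KEYS` subset and the function name below are hypothetical, and values are left as strings rather than converted to their C types.

```python
from pathlib import Path
from typing import Dict, Iterable, Union

# Subset of keys taken from the text; the authoritative list lives in grpc-server.cpp.
PATH_KEYS = {"mtp_path", "expert_profile_path", "directional_steering_file"}
KNOWN_KEYS = PATH_KEYS | {"ssd_streaming", "mtp_draft", "prefill_chunk", "quality"}


def parse_engine_options(options: Iterable[str], model_dir: str) -> Dict[str, Union[str, bool]]:
    parsed: Dict[str, Union[str, bool]] = {}
    for entry in options:
        # Split on the first colon only, so a value that itself contains ":" stays intact.
        key, sep, value = entry.partition(":")
        if key not in KNOWN_KEYS:
            continue  # unknown keys are ignored for back-compat
        if not sep:
            parsed[key] = True  # bare flag means true
        elif key in PATH_KEYS:
            path = Path(value)
            # Relative paths resolve against the model directory; absolute ones pass through.
            parsed[key] = str(path if path.is_absolute() else Path(model_dir) / path)
        else:
            parsed[key] = value
    return parsed
```

For example, `parse_engine_options(["ssd_streaming", "mtp_path:mtp.gguf"], "/models")` yields a `True` flag plus a path under `/models` on a POSIX host.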
-
-Wired keys: `mtp_path`, `mtp_draft`, `mtp_margin`, `prefill_chunk`,
-`power_percent`, `warm_weights`, `quality`, `ssd_streaming`,
-`ssd_streaming_cold`, `ssd_streaming_preload_experts`,
-`ssd_streaming_cache_experts` (count or `NGB`, sets both experts+bytes via
-`ds4_parse_streaming_cache_experts_arg`), `simulate_used_memory` (`NGB` via
-`ds4_parse_gib_arg`), `expert_profile_path`, `directional_steering_file`,
-`directional_steering_attn`, `directional_steering_ffn`.
-
-## SSD streaming (running models larger than RAM)
-
-ds4's **SSD streaming** keeps non-routed weights resident and streams routed MoE
-experts from the GGUF on cache misses, turning "does it fit in RAM" into a speed
-spectrum. **Metal (Darwin) only** - it is a no-op on CUDA/CPU. Enable with
-`options: ["ssd_streaming"]`; size the routed-expert cache with
-`ssd_streaming_cache_experts:NGB` (omit for ds4's automatic 80%-of-working-set
-budget). Gallery entries built on this: `deepseek-v4-flash-q4-ssd` (153 GB Flash
-on a 128 GB Mac) and `deepseek-v4-pro-q2-ssd` (433 GB Pro, experimental).
-
-## Build matrix
-
-| Build | Where | Notes |
-|---|---|---|
-| `cpu-ds4` (amd64 + arm64) | Linux GHA | ds4 considers CPU debug-only; useful only for wiring tests |
-| `cuda13-ds4` (amd64 + arm64) | Linux GHA + DGX Spark validation | Primary production path on Linux |
-| `ds4-darwin` (arm64) | macOS GHA runners | Metal; uses `scripts/build/ds4-darwin.sh` like llama-cpp-darwin |
-
-cuda12 is intentionally omitted. ROCm / Vulkan / SYCL are not applicable.
-
-## Hardware-gated validation
-
-`tests/e2e-backends/backend_test.go` in `BACKEND_BINARY` mode:
-
-```bash
-BACKEND_BINARY=$(pwd)/backend/cpp/ds4/package/run.sh \
-BACKEND_TEST_MODEL_FILE=/path/to/ds4flash.gguf \
-BACKEND_TEST_CAPS=health,load,predict,stream,tools \
-BACKEND_TEST_TOOL_PROMPT="What's the weather in Paris?" \
-go test -count=1 -timeout=30m -v ./tests/e2e-backends/...
-```
-
-CI does not load the model; the suite is opt-in via env vars.
-
-## Distributed mode
-
-ds4 supports **layer-split** distributed inference (a model too big for one host,
-split by transformer layer; the GGUF must be present on every machine, each loads
-only its slice). Topology is **inverted** vs llama.cpp: the coordinator listens,
-workers dial in.
-
-- **`ds4-worker` binary**: built and packaged next to `grpc-server` (`package.sh`
-  copies it into `package/`). Links the same engine objects plus `ds4_distributed.o`;
-  **no gRPC/protobuf dependency** (speaks ds4's own TCP transport), so it builds
-  even where `grpc-server` can't. Runs the worker serving loop (`ds4_dist_run`).
-- **Coordinator wiring**: the ds4 `grpc-server` acts as coordinator when `LoadModel`
-  `ModelOptions.Options` (from model-YAML `options:`) carry:
-  - `ds4_role:coordinator` (enables distributed mode; absent → single-node, back-compat)
-  - `ds4_layers:0:19` (coordinator's own slice, inclusive; `N:output` includes the head)
-  - `ds4_listen:0.0.0.0:1234` (address workers dial into)
-  - `ds4_route_timeout:60` (optional; seconds Predict/PredictStream wait for the route
-    to form before returning gRPC `UNAVAILABLE`; default 60)
-- **Worker CLI**: `local-ai worker ds4-distributed -- <ds4-worker args>` resolves the
-  ds4 backend and execs the packaged `ds4-worker` (raw passthrough), e.g.
-  `--role worker --model /models/ds4flash.gguf --layers 20:output --coordinator <host> 1234`.
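The layer-slice syntax used by `ds4_layers` and `--layers` (`0:19` inclusive, `N:output` including the head) can be illustrated with a small parser. This is a hedged Python sketch, not ds4's own parsing code; the function name, the returned tuple shape, and the reading of `output` as "through the last layer plus the head" are assumptions.

```python
from typing import Optional, Tuple


def parse_layers(spec: str) -> Tuple[int, Optional[int], bool]:
    """Parse "START:END" or "START:output" into (start, end, includes_head)."""
    start_text, sep, end_text = spec.partition(":")
    if not sep:
        raise ValueError(f"expected START:END, got {spec!r}")
    start = int(start_text)
    if end_text == "output":
        # Assumption: "output" means "through the last layer, plus the output head".
        return start, None, True
    end = int(end_text)
    if end < start:
        raise ValueError(f"end {end} is before start {start}")
    return start, end, False
```

With the example values above, a coordinator on `0:19` owns 20 layers (both ends inclusive) and a worker on `20:output` picks up at layer 20.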
-
-Opt-in e2e in `tests/e2e-backends/backend_test.go`, gated by
-`BACKEND_TEST_DS4_DISTRIBUTED=1` (plus `BACKEND_TEST_DS4_WORKER_BINARY`,
-`BACKEND_TEST_DS4_WORKER_LAYERS`, `BACKEND_TEST_DS4_COORDINATOR_LAYERS`,
-`BACKEND_TEST_DS4_LISTEN`). Design spec:
-`docs/superpowers/specs/2026-05-30-ds4-distributed-inference-design.md`.
-
-## Importer
-
-`core/gallery/importers/ds4.go` (`DS4Importer`) auto-detects ds4 weights by
-matching the `antirez/deepseek-v4-gguf` repo URI or the
-`DeepSeek-V4-Flash-*.gguf` filename pattern. **Registered BEFORE
-`LlamaCPPImporter`** in `defaultImporters` - both match `.gguf` but ds4 is more
-specific, and first-match-wins. The importer emits `backend: ds4`, uses
-`ds4flash.gguf` as the local filename (matches ds4's own CLI default), and
-disables the Go-side automatic tool-parsing fallback (the C++ backend emits
-ChatDelta.tool_calls natively via `DsmlParser`).
-
-ds4 is also listed in `core/http/endpoints/localai/backend.go`'s pref-only
-slice so the `/import-model` UI surfaces it as a manual choice for users who
-want to force the backend on a non-canonical URI.
--- a/.agents/llama-cpp-backend.md
+++ b/.agents/llama-cpp-backend.md
@@ -1,83 +0,0 @@
-# llama.cpp Backend
-
-The llama.cpp backend (`backend/cpp/llama-cpp/grpc-server.cpp`) is a gRPC adaptation of the upstream HTTP server (`llama.cpp/tools/server/server.cpp`). It uses the same underlying server infrastructure from `llama.cpp/tools/server/server-context.cpp`.
-
-## Building and Testing
-
-- Test llama.cpp backend compilation: `make backends/llama-cpp`
-- The backend is built as part of the main build process
-- Check `backend/cpp/llama-cpp/Makefile` for build configuration
-
-## Architecture
-
-- **grpc-server.cpp**: gRPC server implementation, adapts HTTP server patterns to gRPC
-- Uses shared server infrastructure: `server-context.cpp`, `server-task.cpp`, `server-queue.cpp`, `server-common.cpp`
-- The gRPC server mirrors the HTTP server's functionality but uses gRPC instead of HTTP
-
-## Common Issues When Updating llama.cpp
-
-When fixing compilation errors after upstream changes:
-1. Check how `server.cpp` (HTTP server) handles the same change
-2. Look for new public APIs or getter methods
-3. Store copies of needed data instead of accessing private members
-4. Update function calls to match new signatures
-5. Test with `make backends/llama-cpp`
-
-## Key Differences from HTTP Server
-
-- gRPC uses `BackendServiceImpl` class with gRPC service methods
-- HTTP server uses `server_routes` with HTTP handlers
-- Both use the same `server_context` and task queue infrastructure
-- gRPC methods: `LoadModel`, `Predict`, `PredictStream`, `Embedding`, `Rerank`, `TokenizeString`, `GetMetrics`, `Health`
-
-## Tool Call Parsing Maintenance
-
-When working on JSON/XML tool call parsing functionality, always check llama.cpp for reference implementation and updates:
-
-### Checking for XML Parsing Changes
-
-1. **Review XML Format Definitions**: Check `llama.cpp/common/chat-parser-xml-toolcall.h` for `xml_tool_call_format` struct changes
-2. **Review Parsing Logic**: Check `llama.cpp/common/chat-parser-xml-toolcall.cpp` for parsing algorithm updates
-3. **Review Format Presets**: Check `llama.cpp/common/chat-parser.cpp` for new XML format presets (search for `xml_tool_call_format form`)
-4. **Review Model Lists**: Check `llama.cpp/common/chat.h` for `COMMON_CHAT_FORMAT_*` enum values that use XML parsing:
-   - `COMMON_CHAT_FORMAT_GLM_4_5`
-   - `COMMON_CHAT_FORMAT_MINIMAX_M2`
-   - `COMMON_CHAT_FORMAT_KIMI_K2`
-   - `COMMON_CHAT_FORMAT_QWEN3_CODER_XML`
-   - `COMMON_CHAT_FORMAT_APRIEL_1_5`
-   - `COMMON_CHAT_FORMAT_XIAOMI_MIMO`
-   - Any new formats added
-
-### Model Configuration Options
-
-Always check `llama.cpp` for new model configuration options that should be supported in LocalAI:
-
-1. **Check Server Context**: Review `llama.cpp/tools/server/server-context.cpp` for new parameters
-2. **Check Chat Params**: Review `llama.cpp/common/chat.h` for `common_chat_params` struct changes
-3. **Check Server Options**: Review `llama.cpp/tools/server/server.cpp` for command-line argument changes
-4. **Examples of options to check**:
-   - `ctx_shift` - Context shifting support
-   - `parallel_tool_calls` - Parallel tool calling
-   - `reasoning_format` - Reasoning format options
-   - Any new flags or parameters
-
-### Speculative Decoding Types
-
-The `spec_type` option in `grpc-server.cpp` delegates to upstream's `common_speculative_types_from_names()`, so new speculative types added to the `common_speculative_type_from_name` map in `common/speculative.cpp` are picked up automatically with no code changes - only docs need an entry in `docs/content/advanced/model-configuration.md`. Current values: `none`, `draft-simple`, `draft-eagle3`, `draft-mtp`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, `ngram-cache`.
-
-`draft-mtp` (Multi-Token Prediction, [ggml-org/llama.cpp#22673](https://github.com/ggml-org/llama.cpp/pull/22673)) does not need a separate draft GGUF: when `spec_type` includes `draft-mtp` and `draftmodel` is empty, the upstream server creates an MTP context off the target model itself. LocalAI's gRPC layer needs no changes for this — it works through the existing `params.speculative.types` plumbing and the derived `cparams.n_rs_seq = params.speculative.need_n_rs_seq()` in `common_context_params_to_llama`.
-
-### Implementation Guidelines
-
-1. **Feature Parity**: Always aim for feature parity with llama.cpp's implementation
-2. **Test Coverage**: Add tests for new features matching llama.cpp's behavior
-3. **Documentation**: Update relevant documentation when adding new formats or options
-4. **Backward Compatibility**: Ensure changes don't break existing functionality
-
-### Files to Monitor
-
-- `llama.cpp/common/chat-parser-xml-toolcall.h` - Format definitions
-- `llama.cpp/common/chat-parser-xml-toolcall.cpp` - Parsing logic
-- `llama.cpp/common/chat-parser.cpp` - Format presets and model-specific handlers
-- `llama.cpp/common/chat.h` - Format enums and parameter structures
-- `llama.cpp/tools/server/server-context.cpp` - Server configuration options
--- a/.agents/localai-assistant-mcp.md
+++ b/.agents/localai-assistant-mcp.md
@@ -1,97 +0,0 @@
-# LocalAI Assistant — admin MCP server
-
-This document is the contract for **anyone** (human or AI agent) touching LocalAI's admin REST surface, the in-process MCP server that wraps it, or the embedded skill prompts that teach the assistant how to use it. Read this before adding/removing/renaming admin endpoints, MCP tools, or skill recipes.
-
-## What this feature is
-
-`pkg/mcp/localaitools/` is a public Go package that exposes LocalAI's admin/management surface as an MCP server. It is used in two ways:
-
-1. **In-process**: when an admin opens a chat with `metadata.localai_assistant=true`, the chat handler injects the in-memory MCP server (paired `net.Pipe()` transport, no HTTP loopback) so the LLM can install models, manage backends and edit configs by chatting.
-2. **Standalone**: the `local-ai mcp-server --target=…` subcommand serves the same MCP server over stdio, talking HTTP to a remote LocalAI instance.
-
-The two modes share **all** tool definitions and skill prompts. They differ only in their `LocalAIClient` implementation (`inproc/` calls services directly; `httpapi/` calls REST).
-
-## The three things you must keep in sync
-
-When you change LocalAI's admin surface, three layers must stay aligned:
-
-1. **REST endpoint** in `core/http/endpoints/localai/*.go`.
-2. **MCP tool registration** in `pkg/mcp/localaitools/tools_*.go`, plus a method on `LocalAIClient` (in `client.go`) and implementations in both `inproc/client.go` **and** `httpapi/client.go`.
-3. **Skill prompt** under `pkg/mcp/localaitools/prompts/skills/*.md` — the markdown that teaches the LLM how to use the new tool. If the new tool fits an existing recipe, update that recipe; otherwise add a new file.
-
-If you ship a REST endpoint without (2) and (3), conversational admins won't see the feature.
-
-## Checklist for adding a new admin endpoint
-
-- [ ] REST endpoint exists in `core/http/endpoints/localai/*.go` and is gated by `auth.RequireAdmin()` in `core/http/routes/localai.go`.
-- [ ] `LocalAIClient` interface in `pkg/mcp/localaitools/client.go` has a method covering the new operation.
-- [ ] DTOs added/updated in `pkg/mcp/localaitools/dto.go` (JSON-tagged; never expose raw service types).
-- [ ] `inproc/client.go` implements the new method by calling the service directly (not via HTTP loopback).
-- [ ] `httpapi/client.go` implements the new method by calling the REST endpoint.
-- [ ] Tool registration added in the appropriate `pkg/mcp/localaitools/tools_*.go`. Mutating tools must reference safety rule 1 in the description.
-- [ ] If the tool is mutating, ensure `Options{DisableMutating: true}` skips it (mirror the pattern in `tools_models.go`).
-- [ ] Skill prompt added or updated under `pkg/mcp/localaitools/prompts/skills/`. The prompt must instruct the LLM when to call the tool, what to ask the user first, and what to do on error.
-- [ ] Tests:
-   - `pkg/mcp/localaitools/server_test.go` adds the tool name to `expectedFullCatalog` and `expectedReadOnlyCatalog` (if read-only).
-   - Tool dispatch is added to `TestEachToolDispatchesToClient`.
-   - `pkg/mcp/localaitools/httpapi/client_test.go` covers the new HTTP path.
-
-## Adding a new skill recipe (no new tool)
-
-Sometimes you want to teach the LLM a new pattern that uses existing tools. Drop a markdown file under `pkg/mcp/localaitools/prompts/skills/<verb>_<noun>.md`. The file is automatically embedded by `//go:embed` and assembled into the system prompt in lexicographic order. No Go changes needed.
-
-Conventions:
-- Filename: `<verb>_<noun>.md` (e.g. `install_chat_model.md`, `upgrade_backend.md`).
-- First line: `# Skill: <Title Case description>`.
-- Number the steps. Reference exact tool names in backticks.
-- If the skill mutates state, remind the LLM to confirm with the user.
-
-## Code conventions
-
-These rules guard against the magic-literal drift that surfaced in the first audit. Do not re-introduce bare strings.
-
-- **Tool names** always come from the `Tool*` constants in `pkg/mcp/localaitools/tools.go`. Tool registrations, the test catalog (`server_test.go`'s `expectedFullCatalog` / `expectedReadOnlyCatalog`), and dispatch tables reference the constants. The embedded skill prompts under `prompts/` keep bare strings — that's the one allowed exception, and `TestPromptsContainSafetyAnchors` enforces alignment.
-- **Toggle/pin actions** use the `modeladmin.Action` type (`pkg/mcp/localaitools` and `core/services/modeladmin`). Use `ActionEnable`/`ActionDisable`/`ActionPin`/`ActionUnpin`; never bare `"enable"`/`"pin"` strings.
-- **Capability tags** for `list_installed_models` use the `localaitools.Capability` type (`capability.go`). The `LocalAIClient.ListInstalledModels` interface takes a typed `Capability`, and the `inproc` switch only accepts canonical values (`"embed"`/`"embedding"` are not aliases — only `CapabilityEmbeddings`).
-- **HTTP error checks** in `httpapi.Client` use `errors.Is(err, ErrHTTPNotFound)`, not substring matches on `err.Error()`. The typed `*HTTPError` carries `StatusCode` and `Body`; add new sentinel errors as needed rather than re-introducing string matching.
-- **Channel sends** to `GalleryService.ModelGalleryChannel` / `BackendGalleryChannel` from inproc clients MUST select on `ctx.Done()` so a cancelled chat completion releases the goroutine. See `inproc.sendModelOp` / `sendBackendOp`.
-- **Disk writes** of model config YAML go through `modeladmin.writeFileAtomic` (temp file + `os.Rename`). A plain `os.WriteFile` truncates the target before writing, so a crash mid-write leaves a corrupted model config.
-- **MCP server lifecycle**: every initialised holder MUST register `Close()` with `signals.RegisterGracefulTerminationHandler`. The standalone `mcp-server` CLI uses `signal.NotifyContext` to honour SIGINT/SIGTERM.
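The "temp file + rename" rule for disk writes can be illustrated outside Go. The following is a Python analog of the pattern, offered as a sketch only: it is not the code of `modeladmin.writeFileAtomic`, and the function name and temp-file prefix are made up for the example.

```python
import os
import tempfile


def write_file_atomic(path: str, data: bytes) -> None:
    """Write data so readers see either the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX when both paths share a filesystem
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The design point is the same in either language: the existing file is never opened for truncation, so an interrupted write can only leave behind a stray temp file.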
-
-## File map (where to look)
-
-```
-pkg/mcp/localaitools/
-  client.go              # LocalAIClient interface + DTO registry
-  dto.go                 # JSON-tagged DTOs shared by both client impls
-  server.go              # NewServer(client, opts) — registers tools
-  tools.go               # Tool* name constants (single source of truth)
-  capability.go          # Capability type + constants
-  tools_models.go        # gallery_search, install_model, import_model_uri, ...
-  tools_backends.go
-  tools_config.go
-  tools_system.go
-  tools_state.go
-  prompts.go             # //go:embed loader + SystemPrompt(opts)
-  prompts/00_role.md
-  prompts/10_safety.md   # SAFETY RULES — change with care
-  prompts/20_tools.md    # curated tool catalog with one-liners
-  prompts/skills/*.md
-  inproc/client.go       # in-process LocalAIClient (services-direct)
-  httpapi/client.go      # REST LocalAIClient (for standalone CLI / remote)
-core/http/endpoints/mcp/
-  localai_assistant.go   # process-wide holder + LocalToolExecutor
-core/cli/mcp_server.go   # local-ai mcp-server subcommand
-```
-
-## Why two clients
-
-The in-process MCP server runs inside the same LocalAI binary that serves chat. Going over HTTP loopback would (a) require minting a synthetic admin API key for the server to authenticate against itself, (b) double-marshal every tool dispatch, and (c) lose access to in-process channels (e.g. `GalleryService.ModelGalleryChannel` for streaming install progress). So in-process uses `inproc.Client`. The standalone stdio CLI talks to a *remote* LocalAI; HTTP is the only option, so it uses `httpapi.Client`. Both implement the same `LocalAIClient` interface, and the parity test in `pkg/mcp/localaitools/parity_test.go` (when present) keeps their output equivalent.
-
-## Why prompt-enforced confirmation, not code gates
-
-The user chose KISS. Every mutating tool has a safety rule (`prompts/10_safety.md` rule 1) that requires the LLM to summarise the action and wait for explicit user confirmation before calling it. There is no `plan_*`/`apply_*` two-step in code. If you add a mutating tool, do **not** add per-tool confirmation logic in Go — instead, list the new tool name in `prompts/10_safety.md` so the LLM knows it falls under the confirmation rule.
-
-## Distributed mode
-
-The in-memory MCP server runs only on the head node (where the chat handler runs). `inproc.Client` wraps services that are already distributed-aware (`GalleryService` coordinates with workers; `ListNodes` reads the NATS-populated registry). No NATS routing of MCP tools — the admin surface lives on the head, period.
--- a/.agents/sglang-backend.md
+++ b/.agents/sglang-backend.md
@@ -1,62 +0,0 @@
-# Working on the SGLang Backend
-
-The SGLang backend lives at `backend/python/sglang/backend.py` (async gRPC). It wraps SGLang's `Engine` (`sglang.srt.entrypoints.engine.Engine`) and translates LocalAI's gRPC `PredictOptions` into SGLang sampling params + outputs into `Reply.chat_deltas`. Structurally it mirrors `backend/python/vllm/backend.py` — keep them shaped the same so changes in one have an obvious analog in the other.
-
-## `engine_args` is the universal escape hatch
-
-A small fixed set of fields on `ModelOptions` is mapped to typed SGLang kwargs in `LoadModel` (model, quantization, load_format, gpu_memory_utilization → mem_fraction_static, trust_remote_code, enforce_eager → disable_cuda_graph, tensor_parallel_size → tp_size, max_model_len → context_length, dtype). **Everything else** flows through the `engine_args:` YAML map.
-
-Validation happens in `_apply_engine_args`. Keys are checked against `dataclasses.fields(ServerArgs)` (`sglang.srt.server_args.ServerArgs` is a flat `@dataclass` with ~380 fields). Unknown keys raise `ValueError` at LoadModel time with a `difflib.get_close_matches` suggestion — same shape as the vLLM backend.
-
-**Precedence:** typed `ModelOptions` fields populate `engine_kwargs` first, then `engine_args` overrides them. So a YAML that sets both `gpu_memory_utilization: 0.9` and `engine_args.mem_fraction_static: 0.5` ends up at `0.5`. Document this when answering "why didn't my YAML field stick?".
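The validation and precedence rules above can be sketched with the same standard-library pieces the text names (`dataclasses.fields`, `difflib.get_close_matches`). `FakeServerArgs` is a hypothetical stand-in for `ServerArgs` with placeholder defaults, and `apply_engine_args` is an illustrative function, not a copy of `_apply_engine_args`.

```python
import dataclasses
import difflib
from typing import Any, Dict


@dataclasses.dataclass
class FakeServerArgs:
    # Hypothetical stand-in for sglang.srt.server_args.ServerArgs; defaults are placeholders.
    model_path: str = ""
    mem_fraction_static: float = 0.0
    speculative_algorithm: str = ""
    speculative_num_steps: int = 0


def apply_engine_args(engine_kwargs: Dict[str, Any], engine_args: Dict[str, Any]) -> Dict[str, Any]:
    """Validate engine_args keys against the dataclass, then let them override typed fields."""
    valid = {f.name for f in dataclasses.fields(FakeServerArgs)}
    for key in engine_args:
        if key not in valid:
            hint = difflib.get_close_matches(key, sorted(valid), n=1)
            suffix = f" (did you mean {hint[0]!r}?)" if hint else ""
            raise ValueError(f"unknown engine_args key {key!r}{suffix}")
    # engine_args is merged last, so it wins over values derived from typed ModelOptions fields.
    return {**engine_kwargs, **engine_args}
```

Called with `{"mem_fraction_static": 0.9}` as the typed kwargs and `{"mem_fraction_static": 0.5}` as `engine_args`, the merged result carries `0.5`, which is the behaviour described for the YAML example above.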
-
-**ServerArgs is flat.** Unlike vLLM, where speculative decoding is nested under `engine_args.speculative_config: {...}`, SGLang exposes flat top-level fields: `speculative_algorithm`, `speculative_draft_model_path`, `speculative_num_steps`, `speculative_eagle_topk`, `speculative_num_draft_tokens`, `speculative_dflash_block_size`, etc. There is no `speculative_config:` dict. Same goes for compilation, kv-transfer, attention — all flat.
-
-The canonical reference is `python/sglang/srt/server_args.py:ServerArgs` (line ~304). When SGLang adds new flags, no LocalAI code change is needed — they're automatically available via `engine_args:`. The validator picks them up because it introspects the live dataclass.
-
-## Speculative decoding cheatsheet
-
-`--speculative-algorithm` accepts `EAGLE`, `EAGLE3`, `NEXTN`, `STANDALONE`, `NGRAM`, `DFLASH`. `NEXTN` is silently rewritten to `EAGLE` in `ServerArgs.__post_init__` (`server_args.py:3286-3287`). MTP (Multi-Token Prediction) is the same EAGLE path with `num_steps=1, eagle_topk=1, num_draft_tokens=2` against a target whose architecture has multi-token heads (e.g. MiMo-7B-RL, DeepSeek-V3-MTP).
-
-| Algorithm | Drafter requirement | Gallery demo target | Gallery demo drafter |
-|-----------|--------------------|---------------------|----------------------|
-| `NEXTN` / `EAGLE` (MTP) | Assistant drafter or built-in heads | google/gemma-4-E2B-it, google/gemma-4-E4B-it | google/gemma-4-E2B-it-assistant, google/gemma-4-E4B-it-assistant |
-| `EAGLE3` | EAGLE3 draft head | (no gallery entry yet) | e.g. jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B |
-| `DFLASH` | Block-diffusion drafter | (no gallery entry yet) | e.g. z-lab/Qwen3-4B-DFlash-b16 |
-| `STANDALONE` | Smaller LLM as drafter | (no gallery entry yet) | any smaller chat-tuned LLM in the same family |
-| `NGRAM` | None — uses prefix history | (no gallery entry yet) | n/a |
-
-The Gemma 4 demos use `mem_fraction_static: 0.85` (cookbook default) and the cookbook's `num_steps=5, num_draft_tokens=6, eagle_topk=1` parameters. Other algorithms are reachable from any user YAML via `engine_args:` but don't have shipped demos yet — that's a deliberate gallery scope choice, not a backend limitation.
-
-Gemma 4 support requires sglang built from a commit that includes [PR #21952](https://github.com/sgl-project/sglang/pull/21952). LocalAI's pinned release for cublas12 / cublas13 includes it. The `l4t13` (JetPack 7 / sbsa cu130) build floors at `sglang>=0.5.0` because the `pypi.jetson-ai-lab.io` mirror still ships only `0.5.1.post2` as of 2026-05-06 — Gemma 4 / MTP recipes are therefore not available on l4t13 until that mirror catches up. `backend.py` keeps backward compat with the 0.5.x → 0.5.11 `SamplingParams.seed` → `sampling_seed` rename via runtime detection.
-
-Compatibility caveats per the SGLang docs: DFLASH and NGRAM are incompatible with `enable_dp_attention`; DFLASH requires `pp_size == 1`; STANDALONE is incompatible with `enable_dp_attention`; NGRAM is CUDA-only and disables the overlap scheduler.
-
-### `mem_fraction_static` + quantization + MTP on consumer GPUs
-
-When combining online weight quantization (`engine_args.quantization: fp8` / `awq` / etc.) with built-in-head MTP (`speculative_algorithm: EAGLE`/`NEXTN`) on a tight VRAM budget, sglang's default `mem_fraction_static: 0.85` will OOM during draft-worker init. The reason: sglang quantizes the **target** model's transformer blocks but loads the **MTP draft worker's vocab embedding** at the source dtype (typically bf16). For a 7 B-class model with a 150k-token vocab × 4096 hidden, that's another ~1.2 GiB allocated *after* the static pool is reserved. At 0.85 fraction on a 16 GB card there's no room left.
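The embedding-size estimate above is simple arithmetic and can be checked directly. A small Python sketch, assuming a single bf16 embedding matrix (2 bytes per parameter) and the round numbers from the text:

```python
def embedding_bytes(vocab_size: int, hidden_size: int, bytes_per_param: int = 2) -> int:
    """Size of one vocab embedding matrix; bf16 uses 2 bytes per parameter."""
    return vocab_size * hidden_size * bytes_per_param


size = embedding_bytes(150_000, 4096)
print(f"{size / 1e9:.2f} GB, {size / 2**30:.2f} GiB")  # prints "1.23 GB, 1.14 GiB"
```

So the "~1.2" figure corresponds to decimal gigabytes; in binary GiB the same allocation is a little over 1.1. Either way it is memory that has to fit outside the static pool.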
-
-Workaround: drop `mem_fraction_static` to ~0.7 so the post-static heap can absorb the MTP embedding alloc + CUDA graph private pools. Verified end-to-end on MiMo-7B-RL + fp8 + MTP on a 16 GB RTX 5070 Ti (`gallery/sglang-mimo-7b-mtp.yaml`) at ~88 tok/s. Models with larger vocabs or more MTP layers (e.g. DeepSeek-V3-MTP) need an even smaller fraction.
-
-This isn't documented anywhere upstream as of 2026-05-06 — the SGLang Gemma 4 cookbook uses 0.85 because their MTP path doesn't go through `eagle_worker_v2.py` for an embedding-bearing draft module. Don't blanket-apply 0.7 across all sglang YAMLs; only when MTP-with-built-in-heads + quantization combine.
-
-## Tool-call and reasoning parsers stay on `Options[]`
-
-ServerArgs has `tool_call_parser` and `reasoning_parser` fields, and the backend does pass them through to `Engine` so SGLang's own HTTP/OAI surface keeps working. But for the **LocalAI** request path the backend constructs fresh per-request parser instances in `_make_parsers` (`backend.py:286`) because the parsers are stateful — the streaming and non-streaming paths each need their own.
-
-So the user-facing knob stays on `Options[]`:
-
-```yaml
-options:
-  - tool_parser:hermes
-  - reasoning_parser:deepseek_r1
-```
-
-Putting these in `engine_args:` will set them on `ServerArgs` but the LocalAI-level streaming `ChatDelta` will not pick them up. Don't recommend that path.
-
-## What's missing today (out of scope, but worth tracking)
-
-- `core/config/hooks_sglang.go` — there is no SGLang equivalent of `hooks_vllm.go`. The vLLM hook auto-selects parsers for known model families from `parser_defaults.json` and seeds production engine_args defaults. A symmetric hook for SGLang could reuse the same `parser_defaults.json` (the SGLang parser names are different but the family detection is shared) and seed defaults like `enable_metrics: true` or attention-backend choices.
-- `core/gallery/importers/sglang.go` — vLLM has an importer that resolves model architecture → parser defaults at gallery-import time. A matching importer for SGLang would let `local-ai install` populate sensible parsers automatically.
-
-These should be a follow-up PR, not a blocker for the engine_args feature.
--- a/.agents/testing-mcp-apps.md
+++ b/.agents/testing-mcp-apps.md
@@ -1,120 +0,0 @@
-# Testing MCP Apps (Interactive Tool UIs)
-
-MCP Apps is an extension to MCP where tools declare interactive HTML UIs via `_meta.ui.resourceUri`. When the LLM calls such a tool, the UI renders the app in a sandboxed iframe inline in the chat. The app communicates bidirectionally with the host via `postMessage` (JSON-RPC) and can call server tools, send messages, and update model context.
-
-Spec: https://modelcontextprotocol.io/extensions/apps/overview
-
-## Quick Start: Run a Test MCP App Server
-
-The `@modelcontextprotocol/server-basic-react` npm package is a ready-to-use test server that exposes a `get-time` tool with an interactive React clock UI. It requires Node >= 20, so run it in Docker:
-
-```bash
-docker run -d --name mcp-app-test -p 3001:3001 node:22-slim \
-  sh -c 'npx -y @modelcontextprotocol/server-basic-react'
-```
-
-Wait ~10 seconds for it to start, then verify:
-
-```bash
-# Check it's running
-docker logs mcp-app-test
-# Expected: "MCP server listening on http://localhost:3001/mcp"
-
-# Verify MCP protocol works
-curl -s -X POST http://localhost:3001/mcp \
-  -H 'Content-Type: application/json' \
-  -H 'Accept: application/json, text/event-stream' \
-  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"test","version":"1.0.0"}}}'
-
-# List tools — should show get-time with _meta.ui.resourceUri
-curl -s -X POST http://localhost:3001/mcp \
-  -H 'Content-Type: application/json' \
-  -H 'Accept: application/json, text/event-stream' \
-  -d '{"jsonrpc":"2.0","id":2,"method":"tools/list","params":{}}'
-```
-
-The `tools/list` response should contain:
-```json
-{
-  "name": "get-time",
-  "_meta": {
-    "ui": { "resourceUri": "ui://get-time/mcp-app.html" }
-  }
-}
-```
-
-## Testing in LocalAI's UI
-
-1. Make sure LocalAI is running (e.g. `http://localhost:8080`)
-2. Build the React UI: `cd core/http/react-ui && npm install && npm run build`
-3. Open the Chat page in your browser
-4. Click **"Client MCP"** in the chat header
-5. Add a new client MCP server:
-   - **URL**: `http://localhost:3001/mcp`
-   - **Use CORS proxy**: enabled (default) — required because the browser can't hit `localhost:3001` directly due to CORS; LocalAI's proxy at `/api/cors-proxy` handles it
-6. The server should connect and discover the `get-time` tool
-7. Select a model and send: **"What time is it?"**
-8. The LLM should call the `get-time` tool
-9. The tool result should render the interactive React clock app in an iframe as a standalone chat message (not inside the collapsed activity group)
-
-## What to Verify
-
-- [ ] Tool appears in the connected tools list (not filtered — `get-time` is callable by the LLM)
-- [ ] The iframe renders as a standalone chat message with a puzzle-piece icon
-- [ ] The app loads and is interactive (clock UI, buttons work)
-- [ ] No "Reconnect to MCP server" overlay (connection is live)
-- [ ] Console logs show bidirectional communication:
-  - `tools/call` messages from app to host (app calling server tools)
-  - `ui/message` notifications (app sending messages)
-- [ ] After the app renders, the LLM continues and produces a text response with the time
-- [ ] Non-UI tools continue to work normally (text-only results)
-- [ ] Page reload shows the HTML statically with a reconnect overlay until you reconnect
-
-## Console Log Patterns
-
-Healthy bidirectional communication looks like:
-
-```
-Parsed message { jsonrpc: "2.0", id: N, result: {...} }     // Bridge init
-get-time result: { content: [...] }                          // Tool result received
-Calling get-time tool...                                     // App calls tool
-Sending message { method: "tools/call", ... }                // App -> host -> server
-Parsed message { jsonrpc: "2.0", id: N, result: {...} }     // Server response
-Sending message text to Host: ...                            // App sends message
-Sending message { method: "ui/message", ... }                // Message notification
-Message accepted                                             // Host acknowledged
-```
-
-Benign warnings to ignore:
- `Source map error: ... about:srcdoc` — browser devtools can't find source maps for srcdoc iframes
- `Ignoring message from unknown source` — duplicate postMessage from iframe navigation
- `notifications/cancelled` — app cleaning up previous requests
-
-## Architecture Notes
-
- **No server-side changes needed** — the MCP App protocol runs entirely in the browser
- `PostMessageTransport` wraps `window.postMessage` between host and `srcdoc` iframe
- `AppBridge` (from `@modelcontextprotocol/ext-apps`) auto-forwards `tools/call`, `resources/read`, `resources/list` from the app to the MCP server via the host's `Client`
- The iframe uses `sandbox="allow-scripts allow-forms"` (no `allow-same-origin`) — opaque origin, no access to host cookies/DOM/localStorage
- App-only tools (`_meta.ui.visibility: "app-only"`) are filtered from the LLM's tool list but remain callable by the app iframe
-
-## Key Files
-
- `core/http/react-ui/src/components/MCPAppFrame.jsx` — iframe + AppBridge component
- `core/http/react-ui/src/hooks/useMCPClient.js` — MCP client hook with app UI helpers (`hasAppUI`, `getAppResource`, `getClientForTool`, `getToolDefinition`)
- `core/http/react-ui/src/hooks/useChat.js` — agentic loop, attaches `appUI` to tool_result messages
- `core/http/react-ui/src/pages/Chat.jsx` — renders MCPAppFrame as standalone chat messages
-
-## Other Test Servers
-
-The `@modelcontextprotocol/ext-apps` repo has many example servers:
- `@modelcontextprotocol/server-basic-react` — simple clock (React)
- More examples at https://github.com/modelcontextprotocol/ext-apps/tree/main/examples
-
-All examples support both stdio and HTTP transport. Run without `--stdio` for HTTP mode on port 3001.
-
-## Cleanup
-
-```bash
-docker rm -f mcp-app-test
-```
--- a/.agents/vllm-backend.md
+++ b/.agents/vllm-backend.md
@@ -1,115 +0,0 @@
-# Working on the vLLM Backend
-
-The vLLM backend lives at `backend/python/vllm/backend.py` (async gRPC) and the multimodal variant at `backend/python/vllm-omni/backend.py` (sync gRPC). Both wrap vLLM's `AsyncLLMEngine` / `Omni`, translate the LocalAI gRPC `PredictOptions` into vLLM `SamplingParams`, and map the engine outputs into `Reply.chat_deltas`.
-
-This file captures the non-obvious bits — most of the bring-up was a single PR (`feat/vllm-parity`) and the things below are easy to get wrong.
-
-## Tool calling and reasoning use vLLM's *native* parsers
-
-Do not write regex-based tool-call extractors for vLLM. vLLM ships:
-
- `vllm.tool_parsers.ToolParserManager` — 50+ registered parsers (`hermes`, `llama3_json`, `llama4_pythonic`, `mistral`, `qwen3_xml`, `deepseek_v3`, `granite4`, `openai`, `kimi_k2`, `glm45`, …)
- `vllm.reasoning.ReasoningParserManager` — 25+ registered parsers (`deepseek_r1`, `qwen3`, `mistral`, `gemma4`, …)
-
-Both can be used standalone: instantiate with a tokenizer, call `extract_tool_calls(text, request=None)` / `extract_reasoning(text, request=None)`. The backend stores the parser *classes* on `self.tool_parser_cls` / `self.reasoning_parser_cls` at LoadModel time and instantiates them per request.
-
-**Selection:** vLLM does *not* auto-detect parsers from model name — neither does the LocalAI backend. The user (or `core/config/hooks_vllm.go`) must pick one and pass it via `Options[]`:
-
-```yaml
-options:
-  - tool_parser:hermes
-  - reasoning_parser:qwen3
-```
-
-Auto-defaults for known model families live in `core/config/parser_defaults.json` and are applied:
- at gallery import time by `core/gallery/importers/vllm.go`
- at model load time by the `vllm` / `vllm-omni` backend hook in `core/config/hooks_vllm.go`
-
-User-supplied `tool_parser:`/`reasoning_parser:` in the config wins over defaults — the hook checks for existing entries before appending.
-
-**When to update `parser_defaults.json`:** any time vLLM ships a new tool or reasoning parser, or you onboard a new model family that LocalAI users will pull from HuggingFace. The file is keyed by *family pattern* matched against `normalizeModelID(cfg.Model)` (lowercase, org-prefix stripped, `_`→`-`). Patterns are checked **longest-first** — keep `qwen3.5` before `qwen3`, `llama-3.3` before `llama-3`, etc., or the wrong family wins. Add a covering test in `core/config/hooks_test.go`.
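The normalization and longest-first lookup described above can be sketched in Python. This is a minimal illustration, not the Go implementation: `normalize_model_id`, `match_family`, and the table entries are hypothetical names and values, and substring matching is an assumption (the real matcher may compare patterns differently).

```python
from typing import Dict, Optional


def normalize_model_id(model: str) -> str:
    """Lowercase, drop an org prefix such as 'Qwen/', and map '_' to '-'."""
    name = model.lower().rsplit("/", 1)[-1]
    return name.replace("_", "-")


# Hypothetical entries for illustration only; the real table lives in
# core/config/parser_defaults.json and may map these families differently.
PARSER_DEFAULTS: Dict[str, Dict[str, str]] = {
    "qwen3": {"tool_parser": "hermes"},
    "qwen3.5": {"tool_parser": "qwen3_xml"},
}


def match_family(model: str, table: Dict[str, Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the defaults for the longest pattern found in the normalized ID."""
    normalized = normalize_model_id(model)
    # Longest pattern first, so "qwen3.5" is tried before "qwen3".
    for pattern in sorted(table, key=len, reverse=True):
        if pattern in normalized:  # assumption: substring match
            return table[pattern]
    return None
```

Without the length ordering, `qwen3` would also match a `qwen3.5` model and could win depending on iteration order, which is the failure mode the ordering rule guards against.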
-
-**Sister file — `core/config/inference_defaults.json`:** same pattern but for sampling parameters (temperature, top_p, top_k, min_p, repeat_penalty, presence_penalty). Loaded by `core/config/inference_defaults.go` and applied by `ApplyInferenceDefaults()`. The schema is `map[string]float64` only — *strings don't fit*, which is why parser defaults needed their own JSON file. The inference file is **auto-generated from unsloth** via `go generate ./core/config/` (see `core/config/gen_inference_defaults/`) — don't hand-edit it; instead update the upstream source or regenerate. Both files share `normalizeModelID()` and the longest-first pattern ordering.
-
-**Constructor compatibility gotcha:** the abstract `ToolParser.__init__` accepts `tools=`, but several concrete parsers (Hermes2ProToolParser, etc.) override `__init__` and *only* accept `tokenizer`. Always:
-
-```python
-try:
-    tp = self.tool_parser_cls(self.tokenizer, tools=tools)
-except TypeError:
-    tp = self.tool_parser_cls(self.tokenizer)
-```
-
-## ChatDelta is the streaming contract
-
-The Go side (`core/backend/llm.go`, `pkg/functions/chat_deltas.go`) consumes `Reply.chat_deltas` to assemble the OpenAI response. For tool calls to surface in `chat/completions`, the Python backend **must** populate `Reply.chat_deltas[].tool_calls` with `ToolCallDelta{index, id, name, arguments}`. Returning the raw `<tool_call>...</tool_call>` text in `Reply.message` is *not* enough — the Go regex fallback exists for llama.cpp, not for vllm.
-
-Same story for `reasoning_content` — emit it on `ChatDelta.reasoning_content`, not as part of `content`.
-
-## Message conversion to chat templates
-
-`tokenizer.apply_chat_template()` expects a list of dicts, not proto Messages. The shared helper in `backend/python/common/vllm_utils.py` (`messages_to_dicts`) handles the mapping including:
-
- `tool_call_id` and `name` for `role="tool"` messages
- `tool_calls` JSON-string field → parsed Python list for `role="assistant"`
- `reasoning_content` for thinking models
-
-Pass `tools=json.loads(request.Tools)` and (when `request.Metadata.get("enable_thinking") == "true"`) `enable_thinking=True` to `apply_chat_template`. Wrap in `try/except TypeError` because not every tokenizer template accepts those kwargs.
-
-## CPU support and the SIMD/library minefield
-
-vLLM publishes prebuilt CPU wheels at `https://github.com/vllm-project/vllm/releases/...`. The pin lives in `backend/python/vllm/requirements-cpu-after.txt`.
-
-**Version compatibility — important:** newer vllm CPU wheels (≥ 0.15) declare `torch==2.10.0+cpu` as a hard dep, but `torch==2.10.0` only exists on the PyTorch test channel and pulls in an incompatible `torchvision`. Stay on **`vllm 0.14.1+cpu` + `torch 2.9.1+cpu`** until both upstream catch up. Bumping requires verifying torchvision/torchaudio match.
-
-`requirements-cpu.txt` uses `--extra-index-url https://download.pytorch.org/whl/cpu`. `install.sh` adds `--index-strategy=unsafe-best-match` for the `cpu` profile so uv resolves transformers/vllm from PyPI while pulling torch from the PyTorch index.
-
-**SIMD baseline:** the prebuilt CPU wheel is compiled with AVX-512 VNNI/BF16. On a CPU without those instructions, importing `vllm.model_executor.models.registry` SIGILLs at `_run_in_subprocess` time during model inspection. There is no runtime flag to disable it. Workarounds:
-
-1. **Run on a host with the right SIMD baseline** (default — fast)
-2. **Build from source** with `FROM_SOURCE=true` env var. Plumbing exists end-to-end:
-   - `install.sh` hides `requirements-cpu-after.txt`, runs `installRequirements` for the base deps, then clones vllm and `VLLM_TARGET_DEVICE=cpu uv pip install --no-deps .`
-   - `backend/Dockerfile.python` declares `ARG FROM_SOURCE` + `ENV FROM_SOURCE`
-   - `Makefile` `docker-build-backend` macro forwards `--build-arg FROM_SOURCE=$(FROM_SOURCE)` when set
-   - Source build takes 30–50 minutes — too slow for per-PR CI but fine for local.
-
-**Runtime shared libraries:** vLLM's `vllm._C` extension `dlopen`s `libnuma.so.1` at import time. If missing, the C extension silently fails and `torch.ops._C_utils.init_cpu_threads_env` is never registered → `EngineCore` crashes on `init_device` with:
-
-```
-AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env'
-```
-
-`backend/python/vllm/package.sh` bundles `libnuma.so.1` and `libgomp.so.1` into `${BACKEND}/lib/`, which `libbackend.sh` adds to `LD_LIBRARY_PATH` at run time. The builder stage in `backend/Dockerfile.python` installs `libnuma1`/`libgomp1` so package.sh has something to copy. Do *not* assume the production host has these — backend images are `FROM scratch`.
-
-## Backend hook system (`core/config/backend_hooks.go`)
-
-Per-backend defaults that used to be hardcoded in `ModelConfig.Prepare()` now live in `core/config/hooks_*.go` files and self-register via `init()`:
-
- `hooks_llamacpp.go` → GGUF metadata parsing, context size, GPU layers, jinja template
- `hooks_vllm.go` → tool/reasoning parser auto-selection from `parser_defaults.json`
-
-Hook keys:
- `"llama-cpp"`, `"vllm"`, `"vllm-omni"`, … — backend-specific
- `""` — runs only when `cfg.Backend` is empty (auto-detect case)
- `"*"` — global catch-all, runs for every backend before specific hooks
-
-Multiple hooks per key are supported and run in registration order. Adding a new backend default:
-
-```go
-// core/config/hooks_<backend>.go
-func init() {
-    RegisterBackendHook("<backend>", myDefaults)
-}
-func myDefaults(cfg *ModelConfig, modelPath string) {
-    // only fill in fields the user didn't set
-}
-```
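The dispatch order described above (global `"*"` hooks first, then the hooks for the configured backend, each list in registration order) can be modeled in a few lines of Python. This is only a sketch of the ordering rules: `register_backend_hook`, `run_backend_hooks`, and the dict-based config are hypothetical stand-ins for the Go code, and it assumes `"*"` hooks also run in the auto-detect case.

```python
from typing import Callable, Dict, List

Hook = Callable[[dict, str], None]

# Hooks per key, kept in registration order.
_hooks: Dict[str, List[Hook]] = {}


def register_backend_hook(backend: str, hook: Hook) -> None:
    _hooks.setdefault(backend, []).append(hook)


def run_backend_hooks(cfg: dict, model_path: str) -> None:
    # Global catch-all hooks run before anything backend-specific.
    for hook in _hooks.get("*", []):
        hook(cfg, model_path)
    # An empty backend selects the "" key, i.e. the auto-detect hooks.
    for hook in _hooks.get(cfg.get("backend", ""), []):
        hook(cfg, model_path)
```

Keying the auto-detect hooks under `""` means no special case is needed: an unset backend simply looks up the empty string.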
-
-## The `Messages.ToProto()` fields you need to set
-
-`core/schema/message.go:ToProto()` must serialize:
- `ToolCallID` → `proto.Message.ToolCallId` (for `role="tool"` messages — links result back to the call)
- `Reasoning` → `proto.Message.ReasoningContent`
- `ToolCalls` → `proto.Message.ToolCalls` (JSON-encoded string)
-
-These were originally not serialized and tool-calling conversations broke silently — the C++ llama.cpp backend reads them but always got empty strings. Any new field added to `schema.Message` *and* `proto.Message` needs a matching line in `ToProto()`.
--- a/.air.toml
+++ b/.air.toml
@@ -1,8 +0,0 @@
-# .air.toml
-[build]
-cmd = "make build"
-bin = "./local-ai"
-args_bin = [ "--debug" ]
-include_ext = ["go", "html", "yaml", "toml", "json", "txt", "md"]
-exclude_dir = ["pkg/grpc/proto"]
-delay = 1000
--- a/.devcontainer-scripts/postcreate.sh
+++ b/.devcontainer-scripts/postcreate.sh
@@ -1,17 +0,0 @@
-#!/bin/bash
-
-cd /workspace
-
-# Get the files into the volume without a bind mount
-if [ ! -d ".git" ]; then
-    git clone https://github.com/mudler/LocalAI.git .
-else
-    git fetch
-fi
-
-echo "Standard Post-Create script completed."
-
-if [ -f "/devcontainer-customization/postcreate.sh" ]; then
-    echo "Launching customization postcreate.sh"
-    bash "/devcontainer-customization/postcreate.sh"
-fi
--- a/.devcontainer-scripts/poststart.sh
+++ b/.devcontainer-scripts/poststart.sh
@@ -1,13 +0,0 @@
-#!/bin/bash
-
-cd /workspace
-
-# Ensures generated source files are present upon load
-make prepare
-
-echo "Standard Post-Start script completed."
-
-if [ -f "/devcontainer-customization/poststart.sh" ]; then
-    echo "Launching customization poststart.sh"
-    bash "/devcontainer-customization/poststart.sh"
-fi
--- a/.devcontainer-scripts/utils.sh
+++ b/.devcontainer-scripts/utils.sh
@@ -1,55 +0,0 @@
-#!/bin/bash
-
-# This file contains some really simple functions that are useful when building up customization scripts.
-
-
-# Checks if the git config has a user registered - and sets it up if not.
-#
-# Param 1: name
-# Param 2: email
-#
-config_user() {
-    echo "Configuring git for $1 <$2>"
-    local gcn=$(git config --global user.name)
-    if [ -z "${gcn}" ]; then
-        echo "Setting up git user / remote"
-        git config --global user.name "$1"
-        git config --global user.email "$2"
-        
-    fi
-}
-
-# Checks if the git remote is configured - and sets it up if not. Fetches either way.
-#
-# Param 1: remote name
-# Param 2: remote url
-#
-config_remote() {
-    echo "Adding git remote and fetching $2 as $1"
-    local gr=$(git remote -v | grep "$1")
-    if [ -z "${gr}" ]; then
-        git remote add "$1" "$2"
-    fi
-    git fetch "$1"
-}
-
-# Setup special .ssh files
-# Prints out lines of text to make things pretty
-# Param 1: bash array, filenames relative to the customization directory that should be copied to ~/.ssh
-setup_ssh() {
-    echo "starting ~/.ssh directory setup..."
-    mkdir -p "${HOME}/.ssh"
-    chmod 0700 "${HOME}/.ssh"
-    echo "-----"
-    local files=("$@")
-    for file in "${files[@]}" ; do
-        local cfile="/devcontainer-customization/${file}"
-        local hfile="${HOME}/.ssh/${file}"
-        if [ ! -f "${hfile}" ]; then
-            echo "copying \"${file}\""
-            cp "${cfile}" "${hfile}"
-            chmod 600 "${hfile}"
-        fi
-    done
-    echo "~/.ssh directory setup complete!"
-}
--- a/.devcontainer/customization/README.md
+++ b/.devcontainer/customization/README.md
@@ -1,25 +0,0 @@
-Place any additional resources your environment requires in this directory
-
-Script hooks are currently called for:
-`postcreate.sh` and `poststart.sh`
-
-If files with those names exist here, they will be called at the end of the normal script.
-
-This is a good place to set things like `git config --global user.name` and to handle any other files that are mounted via this directory.
-
-To assist in doing so, `source /.devcontainer-scripts/utils.sh` will provide utility functions that may be useful - for example:
-
-```
-#!/bin/bash
-
-source "/.devcontainer-scripts/utils.sh"
-
-sshfiles=("config" "key.pub")
-
-setup_ssh "${sshfiles[@]}"
-
-config_user "YOUR NAME" "YOUR EMAIL"
-
-config_remote "REMOTE NAME" "REMOTE URL"
-
-```
--- a/.devcontainer/devcontainer.json
+++ b/.devcontainer/devcontainer.json
@@ -1,24 +0,0 @@
-{
-    "$schema": "https://raw.githubusercontent.com/devcontainers/spec/main/schemas/devContainer.schema.json",
-    "name": "LocalAI",
-    "workspaceFolder": "/workspace",
-    "dockerComposeFile": [ "./docker-compose-devcontainer.yml" ],
-    "service": "api",
-    "shutdownAction": "stopCompose",
-    "customizations": {
-        "vscode": {
-            "extensions": [
-                "golang.go",
-                "ms-vscode.makefile-tools",
-                "ms-azuretools.vscode-docker",
-                "ms-python.python",
-                "ms-python.debugpy",
-                "wayou.vscode-todo-highlight",
-                "waderyan.gitblame"
-            ]
-        }
-    },
-    "forwardPorts": [8080, 3000],
-    "postCreateCommand": "bash /.devcontainer-scripts/postcreate.sh",
-    "postStartCommand": "bash /.devcontainer-scripts/poststart.sh"
-}
--- a/.devcontainer/docker-compose-devcontainer.yml
+++ b/.devcontainer/docker-compose-devcontainer.yml
@@ -1,48 +0,0 @@
-services:
-  api:
-    build:
-      context: ..
-      dockerfile: Dockerfile
-      target: devcontainer
-    env_file:
-      - ../.env
-    ports:
-      - 8080:8080
-    volumes:
-      - localai_workspace:/workspace
-      - models:/host-models
-      - backends:/host-backends
-      - ./customization:/devcontainer-customization
-    command: /bin/sh -c "while sleep 1000; do :; done"
-    cap_add:
-      - SYS_PTRACE
-    security_opt:
-      - seccomp:unconfined
-  prometheus:
-    image: prom/prometheus
-    container_name: prometheus
-    command:
-      - '--config.file=/etc/prometheus/prometheus.yml'
-    ports:
-      - 9090:9090
-    restart: unless-stopped
-    volumes:
-      - ./prometheus:/etc/prometheus
-      - prom_data:/prometheus
-  grafana:
-    image: grafana/grafana
-    container_name: grafana
-    ports:
-      - 3000:3000
-    restart: unless-stopped
-    environment:
-      - GF_SECURITY_ADMIN_USER=admin
-      - GF_SECURITY_ADMIN_PASSWORD=grafana
-    volumes:
-      - ./grafana:/etc/grafana/provisioning/datasources
-
-volumes:
-  prom_data:
-  localai_workspace:
-  models:
-  backends:
--- a/.devcontainer/grafana/datasource.yml
+++ b/.devcontainer/grafana/datasource.yml
@@ -1,10 +0,0 @@
-
-apiVersion: 1
-
-datasources:
-- name: Prometheus
-  type: prometheus
-  url: http://prometheus:9090 
-  isDefault: true
-  access: proxy
-  editable: true
--- a/.devcontainer/prometheus/prometheus.yml
+++ b/.devcontainer/prometheus/prometheus.yml
@@ -1,21 +0,0 @@
-global:
-  scrape_interval: 15s
-  scrape_timeout: 10s
-  evaluation_interval: 15s
-alerting:
-  alertmanagers:
-  - static_configs:
-    - targets: []
-    scheme: http
-    timeout: 10s
-    api_version: v1
-scrape_configs:
-- job_name: prometheus
-  honor_timestamps: true
-  scrape_interval: 15s
-  scrape_timeout: 10s
-  metrics_path: /metrics
-  scheme: http
-  static_configs:
-  - targets:
-    - localhost:9090
--- a/.docker/apt-mirror.sh
+++ b/.docker/apt-mirror.sh
@@ -1,39 +0,0 @@
-#!/bin/sh
-# Reconfigure Ubuntu apt sources to point at an alternate mirror.
-#
-# Used by Dockerfiles via `RUN --mount=type=bind,source=.docker/apt-mirror.sh,...`
-# and by CI workflows on the runner to mitigate outages of the default
-# archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com pool.
-#
-# Inputs (env):
-#   APT_MIRROR        Replacement for archive.ubuntu.com and security.ubuntu.com
-#                     (e.g. "http://azure.archive.ubuntu.com" or
-#                      "https://mirrors.edge.kernel.org").
-#                     Leave empty to keep upstream. The trailing "/ubuntu/..."
-#                     path is preserved by the rewrite.
-#   APT_PORTS_MIRROR  Replacement for ports.ubuntu.com (arm64/ppc64el/...).
-#                     Leave empty to keep upstream.
-#
-# Both default to empty, in which case the script is a no-op.
-
-set -e
-
-if [ -z "${APT_MIRROR}" ] && [ -z "${APT_PORTS_MIRROR}" ]; then
-    exit 0
-fi
-
-# Ubuntu 24.04 (noble) ships DEB822 sources at /etc/apt/sources.list.d/ubuntu.sources;
-# older releases use /etc/apt/sources.list. We rewrite whichever exists.
-for f in /etc/apt/sources.list.d/ubuntu.sources /etc/apt/sources.list; do
-    [ -f "$f" ] || continue
-    if [ -n "${APT_MIRROR}" ]; then
-        # Use a comma delimiter so the alternation pipe in the regex
-        # is not interpreted as the s/// separator.
-        sed -i -E "s,https?://(archive\.ubuntu\.com|security\.ubuntu\.com),${APT_MIRROR},g" "$f"
-    fi
-    if [ -n "${APT_PORTS_MIRROR}" ]; then
-        sed -i -E "s,https?://ports\.ubuntu\.com,${APT_PORTS_MIRROR},g" "$f"
-    fi
-done
-
-echo "apt-mirror: rewrote sources (APT_MIRROR='${APT_MIRROR}', APT_PORTS_MIRROR='${APT_PORTS_MIRROR}')"
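To see what the `APT_MIRROR` rewrite in the script above does to a sources line, the same substitution can be reproduced in Python. This is a sketch for experimentation only: `rewrite_sources` is a hypothetical helper, not part of the repository, and it covers just the `archive`/`security` branch of the script.

```python
import re

# Same alternation as the sed expression in apt-mirror.sh.
UBUNTU_HOSTS = re.compile(r"https?://(archive\.ubuntu\.com|security\.ubuntu\.com)")


def rewrite_sources(text: str, apt_mirror: str) -> str:
    """Replace the scheme and host only; the trailing /ubuntu/... path is kept."""
    if not apt_mirror:
        return text  # empty mirror: no-op, like the script
    # A function replacement avoids backslash processing of the mirror string.
    return UBUNTU_HOSTS.sub(lambda _match: apt_mirror, text)
```

For example, `rewrite_sources("deb http://archive.ubuntu.com/ubuntu/ noble main", "http://azure.archive.ubuntu.com")` yields a line pointing at the Azure mirror with the `/ubuntu/ noble main` suffix unchanged.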
--- a/.docker/ik-llama-cpp-compile.sh
+++ b/.docker/ik-llama-cpp-compile.sh
@@ -1,30 +0,0 @@
-#!/usr/bin/env bash
-# Shared compile logic for backend/Dockerfile.ik-llama-cpp.
-# Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.
-
-set -euxo pipefail
-
-export CCACHE_DIR=/root/.ccache
-ccache --max-size=5G || true
-ccache -z || true
-
-export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"
-
-if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
-  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
-  export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
-  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
-  rm -rf /LocalAI/backend/cpp/ik-llama-cpp-*-build
-fi
-
-cd /LocalAI/backend/cpp/ik-llama-cpp
-
-if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
-  # ARM64 / ROCm: build without x86 SIMD
-  make ik-llama-cpp-fallback
-else
-  # ik_llama.cpp's IQK kernels require at least AVX2
-  make ik-llama-cpp-avx2
-fi
-
-ccache -s || true
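The `${CUDA_DOCKER_ARCH//;/\\;}` expansion in the script above puts a backslash in front of every semicolon before the list is appended to `CMAKE_ARGS`, presumably so the architecture list survives later shell and make expansion as a single value. A Python equivalent, with hypothetical helper names, makes the transformation easy to check:

```python
def escape_cuda_archs(cuda_docker_arch: str) -> str:
    """Mirror ${CUDA_DOCKER_ARCH//;/\\;}: each ';' becomes a backslash plus ';'."""
    return cuda_docker_arch.replace(";", "\\;")


def append_cuda_archs(cmake_args: str, cuda_docker_arch: str) -> str:
    """Append -DCMAKE_CUDA_ARCHITECTURES only when an arch list is provided."""
    if not cuda_docker_arch:
        return cmake_args
    escaped = escape_cuda_archs(cuda_docker_arch)
    return f"{cmake_args} -DCMAKE_CUDA_ARCHITECTURES={escaped}"
```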
--- a/.docker/install-base-deps.sh
+++ b/.docker/install-base-deps.sh
@@ -1,250 +0,0 @@
-#!/usr/bin/env bash
-# Single source of truth for builder-base contents.
-#
-# Used by:
-#   - backend/Dockerfile.base-grpc-builder        (CI prebuilt-base source of truth)
-#   - backend/Dockerfile.llama-cpp                (builder-fromsource stage)
-#   - backend/Dockerfile.ik-llama-cpp             (builder-fromsource stage)
-#   - backend/Dockerfile.turboquant               (builder-fromsource stage)
-#
-# All four files invoke this script via
-#   RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
-#       --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
-#       bash /usr/local/sbin/install-base-deps
-#
-# so the prebuilt CI base image and the from-source local-dev path are
-# bit-equivalent by construction.
-#
-# Inputs (env, populated from Dockerfile ARG/ENV):
-#   BUILD_TYPE                ("cublas"|"l4t"|"hipblas"|"vulkan"|"sycl"|"clblas"|"")
-#   CUDA_MAJOR_VERSION        ("12" | "13" | "")
-#   CUDA_MINOR_VERSION        ("8" | "0" | "")
-#   TARGETARCH                ("amd64" | "arm64")
-#   UBUNTU_VERSION            ("2204" | "2404")
-#   SKIP_DRIVERS              ("false" | "true")
-#   CMAKE_FROM_SOURCE         ("false" | "true")
-#   CMAKE_VERSION             ("3.31.10")
-#   GRPC_VERSION              ("v1.65.0")
-#   GRPC_MAKEFLAGS            ("-j4 -Otarget")
-#   APT_MIRROR / APT_PORTS_MIRROR  (optional; consumed by /usr/local/sbin/apt-mirror)
-#   AMDGPU_TARGETS            (optional; only relevant for hipblas downstream)
-#
-# IMPORTANT: install logic is copied verbatim from the prior in-Dockerfile
-# RUN blocks. Do not paraphrase apt invocations / version pins / sed line
-# numbers / deb URLs — the bit-equivalence guarantee depends on it.
-
-set -eux
-
-# --- 0. apt mirror rewrite (no-op when APT_MIRROR / APT_PORTS_MIRROR unset) ---
-if [ -x /usr/local/sbin/apt-mirror ]; then
-    APT_MIRROR="${APT_MIRROR:-}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR:-}" \
-        sh /usr/local/sbin/apt-mirror
-fi
-
-export DEBIAN_FRONTEND=noninteractive
-export MAKEFLAGS="${GRPC_MAKEFLAGS:-}"
-
-# --- 1. Base apt build deps ---
-apt-get update
-apt-get install -y --no-install-recommends \
-    build-essential \
-    ccache git \
-    ca-certificates \
-    make \
-    pkg-config libcurl4-openssl-dev \
-    curl unzip \
-    libssl-dev wget
-apt-get clean
-rm -rf /var/lib/apt/lists/*
-
-# --- 2. Vulkan SDK (BUILD_TYPE=vulkan) ---
-# NB: this block intentionally installs `cmake` via apt as part of the
-# Vulkan tooling — must run before the dedicated CMake step below.
-if [ "${BUILD_TYPE:-}" = "vulkan" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; then
-    apt-get update
-    apt-get install -y  --no-install-recommends \
-        software-properties-common pciutils wget gpg-agent
-    apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
-        libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
-        libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
-        git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
-        ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
-        clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
-    # Mesa Vulkan ICD drivers (ANV/RADV/lavapipe + Arm SoC) and their ICD
-    # manifests. The LunarG SDK below only provides the loader and shader
-    # tooling, not hardware drivers — without Mesa the packaged Vulkan backend
-    # would ship a loader that finds no GPU. package-gpu-libs.sh bundles these
-    # .so files plus their deps into the backend so it stays self-contained.
-    apt-get install -y mesa-vulkan-drivers libdrm2
-    if [ "amd64" = "${TARGETARCH:-}" ]; then
-        wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz"
-        tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz
-        rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz
-        mkdir -p /opt/vulkan-sdk
-        mv 1.4.335.0 /opt/vulkan-sdk/
-        ( cd /opt/vulkan-sdk/1.4.335.0 && \
-          ./vulkansdk --no-deps --maxjobs \
-              vulkan-loader \
-              vulkan-validationlayers \
-              vulkan-extensionlayer \
-              vulkan-tools \
-              shaderc )
-        cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/
-        cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/
-        cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/
-        cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/
-        rm -rf /opt/vulkan-sdk
-    fi
-    if [ "arm64" = "${TARGETARCH:-}" ]; then
-        mkdir vulkan
-        ( cd vulkan && \
-          curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
-          tar -xvf vulkan-sdk.tar.xz && \
-          rm vulkan-sdk.tar.xz && \
-          cd 1.4.335.0 && \
-          cp -rfv aarch64/bin/* /usr/bin/ && \
-          cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
-          cp -rfv aarch64/include/* /usr/include/ && \
-          cp -rfv aarch64/share/* /usr/share/ )
-        rm -rf vulkan
-    fi
-    ldconfig
-    apt-get clean
-    rm -rf /var/lib/apt/lists/*
-fi
-
-# --- 3. CUDA toolkit (BUILD_TYPE=cublas|l4t) ---
-if { [ "${BUILD_TYPE:-}" = "cublas" ] || [ "${BUILD_TYPE:-}" = "l4t" ]; } && [ "${SKIP_DRIVERS:-false}" = "false" ]; then
-    apt-get update
-    apt-get install -y  --no-install-recommends \
-        software-properties-common pciutils
-    if [ "amd64" = "${TARGETARCH:-}" ]; then
-        curl -O "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb"
-    fi
-    if [ "arm64" = "${TARGETARCH:-}" ]; then
-        if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
-            curl -O "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb"
-        else
-            curl -O "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb"
-        fi
-    fi
-    dpkg -i cuda-keyring_1.1-1_all.deb
-    rm -f cuda-keyring_1.1-1_all.deb
-    apt-get update
-    apt-get install -y --no-install-recommends \
-        "cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
-        "libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
-        "libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
-        "libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
-        "libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
-        "libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}"
-    if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "${TARGETARCH:-}" ]; then
-        apt-get install -y --no-install-recommends \
-            "libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
-            "libcudnn9-cuda-${CUDA_MAJOR_VERSION}" \
-            "cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
-            "libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}"
-    fi
-    apt-get clean
-    rm -rf /var/lib/apt/lists/*
-fi
-
-# --- 4. cuDSS / NVPL on arm64 + cublas (legacy JetPack / Tegra) ---
-# https://github.com/NVIDIA/Isaac-GR00T/issues/343
-if [ "${BUILD_TYPE:-}" = "cublas" ] && [ "${TARGETARCH:-}" = "arm64" ]; then
-    wget "https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb"
-    dpkg -i "cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb"
-    cp /var/cudss-local-tegra-repo-ubuntu"${UBUNTU_VERSION}"-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/
-    apt-get update
-    apt-get -y install cudss "cudss-cuda-${CUDA_MAJOR_VERSION}"
-    wget "https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb"
-    dpkg -i "nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb"
-    cp /var/nvpl-local-repo-ubuntu"${UBUNTU_VERSION}"-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/
-    apt-get update
-    apt-get install -y nvpl
-fi
-
-# --- 5. clBLAS (BUILD_TYPE=clblas) ---
-# Present in variant Dockerfiles' from-source path but not in master's
-# Dockerfile.base-grpc-builder. No CI matrix entry currently uses this,
-# but keep parity so a future BUILD_TYPE=clblas build doesn't drift.
-if [ "${BUILD_TYPE:-}" = "clblas" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; then
-    apt-get update
-    apt-get install -y --no-install-recommends \
-        libclblast-dev
-    apt-get clean
-    rm -rf /var/lib/apt/lists/*
-fi
-
-# --- 6. ROCm / HIP build deps (BUILD_TYPE=hipblas) ---
-if [ "${BUILD_TYPE:-}" = "hipblas" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; then
-    apt-get update
-    apt-get install -y --no-install-recommends \
-        hipblas-dev \
-        hipblaslt-dev \
-        rocblas-dev
-    apt-get clean
-    rm -rf /var/lib/apt/lists/*
-    # I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install,
-    # which results in local-ai and others not being able to locate the libraries.
-    # We run ldconfig ourselves to work around this packaging deficiency.
-    ldconfig
-    # Log which GPU architectures have rocBLAS kernel support
-    echo "rocBLAS library data architectures:"
-    (ls /opt/rocm*/lib/rocblas/library/Kernels* 2>/dev/null || ls /opt/rocm*/lib64/rocblas/library/Kernels* 2>/dev/null) | grep -oP 'gfx[0-9a-z+-]+' | sort -u || \
-        echo "WARNING: No rocBLAS kernel data found"
-fi
-
-echo "TARGETARCH: ${TARGETARCH:-}"
-
-# --- 7. protoc (always) ---
-# The version in 22.04 is too old. We will create one as part of installing
-# the GRPC build below but that will also bring in a newer version of absl
-# which stablediffusion cannot compile with. This version of protoc is only
-# here so that we can generate the grpc code for the stablediffusion build.
-if [ "amd64" = "${TARGETARCH:-}" ]; then
-    curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip
-    unzip -j -d /usr/local/bin protoc.zip bin/protoc
-    rm protoc.zip
-fi
-if [ "arm64" = "${TARGETARCH:-}" ]; then
-    curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip
-    unzip -j -d /usr/local/bin protoc.zip bin/protoc
-    rm protoc.zip
-fi
-
-# --- 8. CMake (apt or compiled from source) ---
-# The version in 22.04 is too old. Vulkan path above already pulled cmake
-# via apt; the from-source branch here will install over it which is fine.
-if [ "${CMAKE_FROM_SOURCE:-false}" = "true" ]; then
-    curl -L -s "https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz" -o cmake.tar.gz
-    tar xvf cmake.tar.gz
-    ( cd "cmake-${CMAKE_VERSION}" && ./configure && make && make install )
-else
-    apt-get update
-    apt-get install -y \
-        cmake
-    apt-get clean
-    rm -rf /var/lib/apt/lists/*
-fi
-
-# --- 9. gRPC compile + install at /opt/grpc ---
-# We install GRPC to a different prefix here so that we can copy in only
-# the build artifacts later — saves several hundred MB on the final docker
-# image size vs copying in the entire GRPC source tree and running
-# `make install` in the target container.
-#
-# The TESTONLY abseil sed patch and /opt/grpc prefix are load-bearing —
-# downstream Dockerfiles `COPY` /opt/grpc to /usr/local (or rely on the
-# prebuilt base having it at /opt/grpc).
-mkdir -p /build
-cd /build
-git clone --recurse-submodules --jobs 4 -b "${GRPC_VERSION}" --depth 1 --shallow-submodules https://github.com/grpc/grpc
-mkdir -p /build/grpc/cmake/build
-cd /build/grpc/cmake/build
-sed -i "216i\\  TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt"
-cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../..
-make
-make install
-cd /
-rm -rf /build
--- a/.docker/llama-cpp-compile.sh
+++ b/.docker/llama-cpp-compile.sh
@@ -1,45 +0,0 @@
-#!/usr/bin/env bash
-# Shared compile logic for backend/Dockerfile.llama-cpp.
-# Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.
-
-set -euxo pipefail
-
-export CCACHE_DIR=/root/.ccache
-ccache --max-size=5G || true
-ccache -z || true
-
-export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"
-
-if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
-  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
-  export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
-  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
-  rm -rf /LocalAI/backend/cpp/llama-cpp-*-build
-fi
-
-cd /LocalAI/backend/cpp/llama-cpp
-if [ -z "${BUILD_TYPE:-}" ]; then
-  # Pure CPU image (BUILD_TYPE empty): one build with ggml CPU_ALL_VARIANTS replaces the
-  # per-microarch binaries (x86: avx/avx2/avx512/fallback; arm64: armv8.x/armv9.x). ggml
-  # dlopens the best libggml-cpu-*.so at runtime by probing host CPU features.
-  #
-  # arm64: the CPU_ALL_VARIANTS table includes armv9.2 SME variants whose -march=...+sme is
-  # rejected by the Ubuntu 24.04 default gcc-13. gcc-14 accepts it, so build the arm64
-  # variants with it (the host never *selects* SME unless it has it, but every variant must
-  # still compile).
-  if [ "${TARGETARCH}" = "arm64" ]; then
-    apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
-    export CC=gcc-14 CXX=g++-14
-  fi
-  make llama-cpp-cpu-all
-else
-  # GPU build (cublas/hipblas/sycl/vulkan/...): the accelerator does the compute, so a
-  # single fallback CPU build is enough - no per-microarch CPU variants needed. (This also
-  # keeps the heavy GPU backend compile from also building the whole CPU variant matrix,
-  # and avoids the gcc-14 apt step on GPU base images such as nvidia l4t.)
-  make llama-cpp-fallback
-fi
-make llama-cpp-grpc
-make llama-cpp-rpc-server
-
-ccache -s || true
--- a/.docker/turboquant-compile.sh
+++ b/.docker/turboquant-compile.sh
@@ -1,39 +0,0 @@
-#!/usr/bin/env bash
-# Shared compile logic for backend/Dockerfile.turboquant.
-# Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.
-
-set -euxo pipefail
-
-export CCACHE_DIR=/root/.ccache
-ccache --max-size=5G || true
-ccache -z || true
-
-export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"
-
-if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
-  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
-  export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
-  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
-  rm -rf /LocalAI/backend/cpp/turboquant-*-build
-fi
-
-cd /LocalAI/backend/cpp/turboquant
-
-if [ -z "${BUILD_TYPE:-}" ]; then
-  # Pure CPU image: one ggml CPU_ALL_VARIANTS build replaces the per-microarch binaries.
-  # arm64: the armv9.2 SME variants need gcc-14 (gcc-13 rejects +sme).
-  if [ "${TARGETARCH}" = "arm64" ]; then
-    apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
-    export CC=gcc-14 CXX=g++-14
-  fi
-  make turboquant-cpu-all
-else
-  # GPU build (cublas/hipblas/sycl/vulkan/...): single fallback CPU build, the accelerator
-  # does the compute. Keeps the GPU compile from also building the CPU variant matrix and
-  # avoids the gcc-14 apt step on GPU base images such as nvidia l4t.
-  make turboquant-fallback
-fi
-make turboquant-grpc
-make turboquant-rpc-server
-
-ccache -s || true
--- a/.dockerignore
+++ b/.dockerignore
@@ -1,57 +1,11 @@
 .idea
 .github
 .vscode
-.devcontainer
 models
-backends
-volumes
 examples/chatbot-ui/models
-backend/go/image/stablediffusion-ggml/build/
-backend/go/*/build
-backend/go/*/.cache
-backend/go/*/sources
-backend/go/*/package
 examples/rwkv/models
 examples/**/models
 Dockerfile*
-__pycache__

 # SonarQube
-.scannerwork
-
-# backend virtual environments
-**/venv
-backend/python/**/source
-
-# In-place llama.cpp clone + per-variant build copies. The Makefile
-# clones llama.cpp itself at the pinned LLAMA_VERSION; if a stale
-# local checkout is COPY'd into the image, the `llama.cpp:` target
-# sees the directory and skips re-cloning, so grpc-server.cpp ends
-# up compiled against whatever (likely older) commit the host had.
-backend/cpp/llama-cpp/llama.cpp
-backend/cpp/llama-cpp-*-build
-
-# privacy-filter: same in-place pattern. The Makefile fetches privacy-filter.cpp
-# at the pinned commit (or symlinks a PRIVACY_FILTER_SRC checkout for local dev).
-# A stale dir/symlink COPY'd into the image makes the clone step fail (dangling
-# symlink) or compile against the wrong commit, so keep host build state out.
-backend/cpp/privacy-filter/privacy-filter.cpp
-backend/cpp/privacy-filter/build
-backend/cpp/privacy-filter/grpc-server
-backend/cpp/privacy-filter/package
-
-# Rust backend build output (sources are tracked; target/ is generated)
-backend/rust/*/target
-
-# Local-only artifacts that bloat the build context but the image never needs.
-# Saved image tarballs, locally-installed backends, the host-built binary, and
-# assorted tool/scratch dirs. None of these are git-tracked.
-backend-images
-local-backends
-local-ai
-.crush
-protoc
-tests
-
-# Installed via npm inside the build stage; no need to ship the host copy.
-**/node_modules
+.scannerwork
--- a/.env
+++ b/.env
@@ -10,7 +10,7 @@
 #
 ## Define galleries.
 ## Models available to install will be visible in `/models/available`
-# LOCALAI_GALLERIES=[{"name":"localai", "url":"github:mudler/LocalAI/gallery/index.yaml@master"}]
+# LOCALAI_GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}]

 ## CORS settings
 # LOCALAI_CORS=true
@@ -26,14 +26,24 @@
 ## Disables COMPEL (Diffusers)
 # COMPEL=0

-## Disables SD_EMBED (Diffusers)
-# SD_EMBED=0
-
 ## Enable/Disable single backend (useful if only one GPU is available)
 # LOCALAI_SINGLE_ACTIVE_BACKEND=true

-# Forces shutdown of the backends if busy (only if LOCALAI_SINGLE_ACTIVE_BACKEND is set)
-# LOCALAI_FORCE_BACKEND_SHUTDOWN=true
+## Specify a build type. Available: cublas, openblas, clblas.
+## cuBLAS: This is a GPU-accelerated version of the complete standard BLAS (Basic Linear Algebra Subprograms) library. It's provided by Nvidia and is part of their CUDA toolkit.
+## OpenBLAS: This is an open-source implementation of the BLAS library that aims to provide highly optimized code for various platforms. It includes support for multi-threading and can be compiled to use hardware-specific features for additional performance. OpenBLAS can run on many kinds of hardware, including CPUs from Intel, AMD, and ARM.
+## clBLAS:   This is an open-source implementation of the BLAS library that uses OpenCL, a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. clBLAS is designed to take advantage of the parallel computing power of GPUs but can also run on any hardware that supports OpenCL. This includes hardware from different vendors like Nvidia, AMD, and Intel.
+# BUILD_TYPE=openblas
+
+## Uncomment and set to true to enable rebuilding from source
+# REBUILD=true
+
+## Enable go tags, available: stablediffusion, tts
+## stablediffusion: image generation with stablediffusion
+## tts: enables text-to-speech with go-piper 
+## (requires REBUILD=true)
+#
+# GO_TAGS=stablediffusion

 ## Path where to store generated images
 # LOCALAI_IMAGE_PATH=/tmp/generated/images
@@ -61,26 +71,9 @@
 ### Define the number of parallel LLAMA.cpp workers (Defaults to 1)
 # LLAMACPP_PARALLEL=1

-### Define a list of GRPC Servers for llama-cpp workers to distribute the load
-# https://github.com/ggerganov/llama.cpp/pull/6829
-# https://github.com/ggerganov/llama.cpp/blob/master/tools/rpc/README.md
-# LLAMACPP_GRPC_SERVERS=""
-
 ### Enable to run parallel requests
 # LOCALAI_PARALLEL_REQUESTS=true

-# Enable to allow p2p mode
-# LOCALAI_P2P=true
-
-# Enable to use federated mode
-# LOCALAI_FEDERATED=true
-
-# Enable to start federation server
-# FEDERATED_SERVER=true
-
-# Define to use federation token
-# TOKEN=""
-
 ### Watchdog settings
 ###
 # Enables watchdog to kill backends that are inactive for too much time
@@ -93,4 +86,4 @@
 # LOCALAI_WATCHDOG_BUSY=true
 #
 # Time in duration format (e.g. 1h30m) after which a backend is considered busy
-# LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m
+# LOCALAI_WATCHDOG_BUSY_TIMEOUT=5m
--- a/.gitattributes
+++ b/.gitattributes
@@ -1,2 +1 @@
 *.sh text eol=lf
-backend/cpp/llama/*.hpp linguist-vendored
--- a/.githooks/pre-commit
+++ b/.githooks/pre-commit
@@ -1,72 +0,0 @@
-#!/usr/bin/env sh
-#
-# LocalAI pre-commit hook. Install it (once per clone) with:
-#
-#     make install-hooks
-#
-# Runs only the checks relevant to what's staged:
-#   - Go files          -> make lint + make test-coverage-check
-#   - core/http/react-ui -> make test-ui-coverage-check (Playwright e2e + gate)
-#   - realtime state machines / specs -> make test-realtime-conformance
-#       (respcoord/**, turncoord/**, or formal-verification/** -- a pure .fizz
-#        spec edit must still re-verify the design, detected separately from Go)
-# A commit touching none of these is skipped entirely (other docs/YAML can't
-# change lint findings, Go coverage, the UI, or the realtime conformance gate).
-#
-# To bypass for a single commit (e.g. a WIP checkpoint): git commit --no-verify
-set -eu
-
-repo_root="$(git rev-parse --show-toplevel)"
-cd "$repo_root"
-
-staged="$(git diff --cached --name-only --diff-filter=ACMRD)"
-
-go_changed=0
-ui_changed=0
-rt_changed=0
-if echo "$staged" | grep -qE '\.go$'; then go_changed=1; fi
-if echo "$staged" | grep -qE '^core/http/react-ui/'; then ui_changed=1; fi
-if echo "$staged" | grep -qE '^(core/http/endpoints/openai/(coordinator|respcoord|turncoord|conncoord|compactcoord|ttscoord)/|formal-verification/)'; then rt_changed=1; fi
-
-if [ "$go_changed" -eq 0 ] && [ "$ui_changed" -eq 0 ] && [ "$rt_changed" -eq 0 ]; then
-	echo "pre-commit: no Go, React UI, or realtime-spec changes staged — skipping."
-	exit 0
-fi
-
-if [ "$go_changed" -eq 1 ]; then
-	# Resolve the ref golangci-lint's new-from-merge-base should compare
-	# against. .golangci.yml pins origin/master, which is correct in CI
-	# (origin == the canonical repo) but wrong from a fork clone, where
-	# origin/master lags behind and lint would report the whole upstream
-	# backlog. Prefer upstream/master, then origin/master, then master.
-	lint_base=""
-	for ref in upstream/master origin/master master; do
-		if git rev-parse --verify --quiet "${ref}^{commit}" >/dev/null 2>&1; then
-			lint_base="$ref"
-			break
-		fi
-	done
-
-	echo "pre-commit ▶ golangci-lint (make lint${lint_base:+, new-from $lint_base})"
-	make lint LINT_NEW_FROM="$lint_base"
-
-	echo "pre-commit ▶ coverage gate (make test-coverage-check) — builds and runs the"
-	echo "             pkg/core suites plus tests/e2e; can take a few minutes."
-	make test-coverage-check
-fi
-
-if [ "$ui_changed" -eq 1 ]; then
-	echo "pre-commit ▶ React UI e2e + coverage gate (make test-ui-coverage-check) —"
-	echo "             rebuilds the UI + ui-test-server, runs the Playwright specs, and"
-	echo "             fails if line coverage regressed; can take a couple of minutes."
-	make test-ui-coverage-check
-fi
-
-if [ "$rt_changed" -eq 1 ]; then
-	echo "pre-commit ▶ realtime state-machine conformance (make test-realtime-conformance) —"
-	echo "             Go transition/rapid tests under -race + FizzBee model check of the"
-	echo "             authoritative specs. Fail-closed: needs FizzBee (make install-fizzbee)."
-	make test-realtime-conformance
-fi
-
-echo "pre-commit ✓ all relevant checks passed"
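The path-based dispatch in the removed hook can be sketched in Python as follows. This is a minimal illustration, not the hook itself: the three regular expressions are copied from the `grep -qE` calls above, while `checks_for` is a hypothetical helper name introduced only for this sketch.

```python
import re

# Patterns copied from the hook's `grep -qE` expressions.
GO_RE = re.compile(r"\.go$")
UI_RE = re.compile(r"^core/http/react-ui/")
RT_RE = re.compile(
    r"^(core/http/endpoints/openai/"
    r"(coordinator|respcoord|turncoord|conncoord|compactcoord|ttscoord)/"
    r"|formal-verification/)"
)


def checks_for(staged_paths):
    """Return the make targets the hook would run for these staged paths.

    `checks_for` is a hypothetical name; the hook itself uses three 0/1 flags.
    """
    targets = []
    if any(GO_RE.search(p) for p in staged_paths):
        targets += ["lint", "test-coverage-check"]
    if any(UI_RE.search(p) for p in staged_paths):
        targets.append("test-ui-coverage-check")
    if any(RT_RE.search(p) for p in staged_paths):
        targets.append("test-realtime-conformance")
    return targets


if __name__ == "__main__":
    # A docs-only commit triggers nothing; a coordinator Go file triggers two gates.
    print(checks_for(["README.md"]))
    print(checks_for(["core/http/endpoints/openai/respcoord/state.go"]))
```

An empty result corresponds to the hook's early `exit 0` branch.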
--- a/.github/actions/configure-apt-mirror/action.yml
+++ b/.github/actions/configure-apt-mirror/action.yml
@@ -1,100 +0,0 @@
-name: 'Configure apt mirror'
-description: |
-  Reconfigure the GitHub Actions runner's Ubuntu apt sources to use an
-  alternate mirror, and emit the effective URLs as outputs so callers can
-  forward them as Docker build-args.
-
-  Two mirror profiles depending on where the runner lives, because the
-  best mirror differs by network:
-
-    * github-hosted runners run on Azure, so they default to the
-      Azure-hosted Ubuntu mirror (lowest latency, same VPC).
-    * self-hosted runners (arc-runner-set, bigger-runner, ...) typically
-      cannot route to azure.archive.ubuntu.com, so they default to the
-      kernel.org mirror, which is publicly reachable from anywhere.
-
-  Pass an empty string to either input to skip the rewrite for that
-  profile and keep upstream archive.ubuntu.com / ports.ubuntu.com.
-
-inputs:
-  github-hosted-mirror:
-    description: 'archive/security mirror URL for github-hosted runners (empty = upstream)'
-    required: false
-    default: 'http://azure.archive.ubuntu.com'
-  github-hosted-ports-mirror:
-    description: 'ports.ubuntu.com mirror URL for github-hosted runners (empty = upstream)'
-    required: false
-    default: 'http://azure.ports.ubuntu.com'
-  self-hosted-mirror:
-    description: 'archive/security mirror URL for self-hosted runners (empty = upstream)'
-    required: false
-    # HTTP, not HTTPS: the bare ubuntu:24.04 builder image doesn't ship
-    # ca-certificates, so the very first apt-get update over TLS would
-    # fail with "No system certificates available" before it can install
-    # anything. apt validates package integrity via GPG signatures, so
-    # plain HTTP is safe for the archive itself.
-    default: 'http://mirrors.edge.kernel.org'
-  self-hosted-ports-mirror:
-    description: 'ports.ubuntu.com mirror URL for self-hosted runners (empty = upstream)'
-    required: false
-    # mirrors.edge.kernel.org does NOT carry /ubuntu-ports/ — only the
-    # main /ubuntu/ archive — so arm64 builds 404 there. Leave ports
-    # upstream by default. The original DDoS was on archive.ubuntu.com
-    # so ports.ubuntu.com remains the path of least surprise.
-    default: ''
-
-outputs:
-  effective-mirror:
-    description: 'The mirror URL actually applied for this runner (or empty)'
-    value: ${{ steps.pick.outputs.mirror }}
-  effective-ports-mirror:
-    description: 'The ports mirror URL actually applied for this runner (or empty)'
-    value: ${{ steps.pick.outputs.ports-mirror }}
-
-runs:
-  using: 'composite'
-  steps:
-    - name: Pick effective mirror for this runner
-      id: pick
-      shell: bash
-      env:
-        RUNNER_ENV: ${{ runner.environment }}
-        GH_MIRROR: ${{ inputs.github-hosted-mirror }}
-        GH_PORTS_MIRROR: ${{ inputs.github-hosted-ports-mirror }}
-        SH_MIRROR: ${{ inputs.self-hosted-mirror }}
-        SH_PORTS_MIRROR: ${{ inputs.self-hosted-ports-mirror }}
-      run: |
-        if [ "${RUNNER_ENV}" = "github-hosted" ]; then
-          MIRROR="${GH_MIRROR}"
-          PORTS_MIRROR="${GH_PORTS_MIRROR}"
-        else
-          MIRROR="${SH_MIRROR}"
-          PORTS_MIRROR="${SH_PORTS_MIRROR}"
-        fi
-        echo "configure-apt-mirror: runner=${RUNNER_ENV} mirror='${MIRROR}' ports-mirror='${PORTS_MIRROR}'"
-        echo "mirror=${MIRROR}" >> "$GITHUB_OUTPUT"
-        echo "ports-mirror=${PORTS_MIRROR}" >> "$GITHUB_OUTPUT"
-
-    - name: Rewrite apt sources
-      if: steps.pick.outputs.mirror != '' || steps.pick.outputs.ports-mirror != ''
-      shell: bash
-      env:
-        APT_MIRROR: ${{ steps.pick.outputs.mirror }}
-        APT_PORTS_MIRROR: ${{ steps.pick.outputs.ports-mirror }}
-      run: |
-        set -e
-        # Ubuntu 24.04 (noble) ships DEB822 sources at
-        # /etc/apt/sources.list.d/ubuntu.sources; older releases use
-        # /etc/apt/sources.list. Rewrite whichever exists.
-        for f in /etc/apt/sources.list.d/ubuntu.sources /etc/apt/sources.list; do
-          sudo test -f "$f" || continue
-          if [ -n "${APT_MIRROR}" ]; then
-            # Comma delimiter so the alternation pipe in the regex is not
-            # interpreted as the s/// separator.
-            sudo sed -i -E "s,https?://(archive\.ubuntu\.com|security\.ubuntu\.com),${APT_MIRROR},g" "$f"
-          fi
-          if [ -n "${APT_PORTS_MIRROR}" ]; then
-            sudo sed -i -E "s,https?://ports\.ubuntu\.com,${APT_PORTS_MIRROR},g" "$f"
-          fi
-        done
-        echo "Runner apt mirror configured (APT_MIRROR='${APT_MIRROR}', APT_PORTS_MIRROR='${APT_PORTS_MIRROR}')"
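The two `sed -E` substitutions in the removed action can be approximated in Python as shown here. It is a sketch under stated assumptions: the regular expressions mirror the ones above, `rewrite_sources` is a hypothetical function name, and the sample text is an invented DEB822-style fragment rather than a real runner file.

```python
import re

# Same host patterns as the action's sed expressions.
ARCHIVE_RE = re.compile(r"https?://(archive\.ubuntu\.com|security\.ubuntu\.com)")
PORTS_RE = re.compile(r"https?://ports\.ubuntu\.com")


def rewrite_sources(text, mirror="", ports_mirror=""):
    """Rewrite apt source URLs; an empty mirror leaves that profile untouched."""
    if mirror:
        # A callable replacement avoids backslash interpretation in the mirror URL.
        text = ARCHIVE_RE.sub(lambda m: mirror, text)
    if ports_mirror:
        text = PORTS_RE.sub(lambda m: ports_mirror, text)
    return text


if __name__ == "__main__":
    sample = (
        "URIs: http://archive.ubuntu.com/ubuntu/\n"
        "URIs: http://security.ubuntu.com/ubuntu/\n"
        "URIs: http://ports.ubuntu.com/ubuntu-ports/\n"
    )
    # Self-hosted default from the action: archive mirror set, ports left upstream.
    print(rewrite_sources(sample, mirror="http://mirrors.edge.kernel.org"))
```

With the self-hosted defaults, only the archive and security lines change, which matches the comment that ports stay on upstream by default.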
--- a/.github/actions/free-disk-space/action.yml
+++ b/.github/actions/free-disk-space/action.yml
@@ -1,65 +0,0 @@
-name: 'Free disk space on hosted runners'
-description: |
-  Aggressively clean GitHub-hosted ubuntu-latest runners to reclaim ~6-10 GB
-  of working space before docker buildx steps. Combines jlumbroso/free-disk-space
-  with explicit apt purges of large packages we never use (dotnet, ghc, mono,
-  android, jdk, ...).
-
-  No-op on self-hosted runners; pass mode=skip to force-disable.
-
-inputs:
-  mode:
-    description: 'hosted (default — clean) or skip (no-op)'
-    required: false
-    default: 'hosted'
-
-runs:
-  using: 'composite'
-  steps:
-    - name: Free Disk Space (Ubuntu)
-      if: inputs.mode == 'hosted' && runner.environment == 'github-hosted'
-      uses: jlumbroso/free-disk-space@main
-      with:
-        tool-cache: true
-        android: true
-        dotnet: true
-        haskell: true
-        large-packages: true
-        docker-images: true
-        swap-storage: true
-
-    - name: Release space from worker
-      if: inputs.mode == 'hosted' && runner.environment == 'github-hosted'
-      shell: bash
-      run: |
-        echo "Listing top largest packages"
-        pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
-        head -n 30 <<< "${pkgs}"
-        df -h
-        sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
-        sudo apt-get remove --auto-remove android-sdk-platform-tools snapd || true
-        sudo apt-get purge --auto-remove android-sdk-platform-tools snapd || true
-        sudo rm -rf /usr/local/lib/android
-        sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
-        sudo rm -rf /usr/share/dotnet
-        sudo apt-get remove -y '^mono-.*' || true
-        sudo apt-get remove -y '^ghc-.*' || true
-        sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
-        sudo apt-get remove -y 'php.*' || true
-        sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
-        sudo apt-get remove -y '^google-.*' || true
-        sudo apt-get remove -y azure-cli || true
-        sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
-        sudo apt-get remove -y '^gfortran-.*' || true
-        sudo apt-get remove -y microsoft-edge-stable || true
-        sudo apt-get remove -y firefox || true
-        sudo apt-get remove -y powershell || true
-        sudo apt-get remove -y r-base-core || true
-        sudo apt-get autoremove -y
-        sudo apt-get clean
-        sudo rm -rfv build || true
-        sudo rm -rf /usr/share/dotnet || true
-        sudo rm -rf /opt/ghc || true
-        sudo rm -rf "/usr/local/share/boost" || true
-        sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
-        df -h
--- a/.github/actions/setup-build-disk/action.yml
+++ b/.github/actions/setup-build-disk/action.yml
@@ -1,59 +0,0 @@
-name: 'Set up build disk on hosted runners'
-description: |
-  Relocate Docker's data-root to /mnt (which has ~75 GB free, vs ~20 GB
-  on / after free-disk-space). Combined with the apt cleanup, gives
-  ~100 GB working space for buildx — enough for ROCm dev image + vLLM
-  torch install + flash-attn build.
-
-  No-op on:
-    - self-hosted runners (no /mnt expectation)
-    - non-X64 runners (verify /mnt shape on ubuntu-24.04-arm separately
-      before enabling there — see Task 3.2 in the migration plan)
-    - mode=skip (force-disable from caller)
-
-  Must run after free-disk-space (which removes large packages — would
-  fail mid-uninstall if Docker were stopped) and before any Docker
-  operation (setup-qemu, setup-buildx, login, build) so the relocated
-  data-root catches all subsequent docker activity.
-
-inputs:
-  mode:
-    description: 'auto (default — relocate on hosted X64 only) or skip'
-    required: false
-    default: 'auto'
-
-runs:
-  using: 'composite'
-  steps:
-    - name: Relocate Docker data-root to /mnt
-      if: inputs.mode == 'auto' && runner.environment == 'github-hosted' && runner.arch == 'X64'
-      shell: bash
-      run: |
-        set -euo pipefail
-        echo "Before relocation:"
-        df -h / /mnt || true
-        sudo systemctl stop docker docker.socket
-        sudo mkdir -p /mnt/docker-data /mnt/docker-tmp
-        # buildx CLI runs as the unprivileged runner user and creates
-        # config dirs under TMPDIR before binding them into the buildkit
-        # container. /mnt is owned by root by default; mirror /tmp's
-        # 1777 (world-writable + sticky) so non-root processes can write.
-        sudo chmod 1777 /mnt/docker-tmp
-        if [ -d /var/lib/docker ] && [ ! -L /var/lib/docker ]; then
-          sudo rsync -a /var/lib/docker/ /mnt/docker-data/
-          sudo rm -rf /var/lib/docker
-          sudo ln -s /mnt/docker-data /var/lib/docker
-        fi
-        # daemon.json may not exist; merge data-root in or create minimal.
-        if [ -f /etc/docker/daemon.json ]; then
-          sudo jq '."data-root" = "/mnt/docker-data"' /etc/docker/daemon.json | sudo tee /etc/docker/daemon.json.new >/dev/null
-          sudo mv /etc/docker/daemon.json.new /etc/docker/daemon.json
-        else
-          echo '{"data-root":"/mnt/docker-data"}' | sudo tee /etc/docker/daemon.json
-        fi
-        sudo systemctl start docker
-        # Make TMPDIR persist for subsequent steps in the same job.
-        echo "TMPDIR=/mnt/docker-tmp" >> "$GITHUB_ENV"
-        echo "After relocation:"
-        df -h / /mnt
-        docker info | grep -i 'docker root dir' || true
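The `daemon.json` handling above (merge `data-root` into an existing file, or create a minimal one) can be sketched in Python. The function name `merge_data_root` and the sample input are assumptions made for illustration; the action itself uses `jq` and `tee`.

```python
import json


def merge_data_root(existing_json, data_root="/mnt/docker-data"):
    """Return daemon.json content with data-root set, keeping other keys.

    `existing_json` is the current file content, or None/"" when the file
    does not exist (the action's else branch).
    """
    config = json.loads(existing_json) if existing_json else {}
    config["data-root"] = data_root
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Hypothetical existing config; unrelated keys survive the merge.
    print(merge_data_root('{"features": {"containerd-snapshotter": true}}'))
    # No file present: a minimal config is produced.
    print(merge_data_root(None))
```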
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
--- a/.github/bump_deps.sh
+++ b/.github/bump_deps.sh
@@ -3,25 +3,7 @@ set -xe
 REPO=$1
 BRANCH=$2
 VAR=$3
-FILE=$4
-
-if [ -z "$FILE" ]; then
-    FILE="Makefile"
-fi

 LAST_COMMIT=$(curl -s -H "Accept: application/vnd.github.VERSION.sha" "https://api.github.com/repos/$REPO/commits/$BRANCH")

-# Read $VAR from Makefile (only first match)
-set +e
-CURRENT_COMMIT="$(grep -m1 "^$VAR?=" $FILE | cut -d'=' -f2)"
-set -e
-
-sed -i $FILE -e "s/$VAR?=.*/$VAR?=$LAST_COMMIT/"
-
-if [ -z "$CURRENT_COMMIT" ]; then
-    echo "Could not find $VAR in Makefile."
-    exit 0
-fi
-
-echo "Changes: https://github.com/$REPO/compare/${CURRENT_COMMIT}..${LAST_COMMIT}" >> "${VAR}_message.txt"
-echo "${LAST_COMMIT}" >> "${VAR}_commit.txt"
+sed -i Makefile -e "s/$VAR?=.*/$VAR?=$LAST_COMMIT/"
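The difference between the two versions of `bump_deps.sh` is easier to see in a small sketch: the removed version records the previous pin before rewriting it, while the added line only rewrites. The Python below is an illustration under assumptions. `bump_pin` and `EXAMPLE_VERSION` are hypothetical names, and the pattern is anchored at line start for both lookup and rewrite, whereas the script's `sed` expression is unanchored.

```python
import re


def bump_pin(makefile_text, var, new_commit):
    """Return (updated_text, previous_value) for a `VAR?=value` pin.

    previous_value is None when the variable is not pinned in the text.
    """
    pattern = re.compile(rf"^{re.escape(var)}\?=(.*)$", re.MULTILINE)
    match = pattern.search(makefile_text)
    previous = match.group(1) if match else None
    updated = pattern.sub(lambda m: f"{var}?={new_commit}", makefile_text)
    return updated, previous


if __name__ == "__main__":
    makefile = "EXAMPLE_VERSION?=abc123\nOTHER?=keep\n"  # hypothetical Makefile
    updated, previous = bump_pin(makefile, "EXAMPLE_VERSION", "def456")
    print(updated, end="")
    print("previous pin:", previous)
```

Keeping the previous value is what lets the removed script build a compare link between the old and new commits.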
--- a/.github/bump_docs.sh
+++ b/.github/bump_docs.sh
@@ -2,6 +2,6 @@
 set -xe
 REPO=$1

-LATEST_TAG=$(curl -s "https://api.github.com/repos/$REPO/releases/latest" | jq -r '.tag_name')
+LATEST_TAG=$(curl -s "https://api.github.com/repos/$REPO/releases/latest" | jq -r '.name')

 cat <<< $(jq ".version = \"$LATEST_TAG\"" docs/data/version.json) > docs/data/version.json
--- a/.github/bump_vllm_metal.sh
+++ b/.github/bump_vllm_metal.sh
@@ -1,55 +0,0 @@
-#!/bin/bash
-# Bump the single vllm-metal pin (VLLM_METAL_VERSION) in the vLLM backend's
-# darwin (Apple Silicon) install path. The macOS/Metal build
-# (backend/python/vllm/install.sh, Darwin branch) installs vllm-metal, which is
-# version-locked to a specific vLLM source release. install.sh derives that vLLM
-# version at build time from vllm-metal's own installer (`vllm_v=`) at the pinned
-# tag, so there is only ONE value to bump here -- mirroring bump_vllm_wheel.sh,
-# which bumps the Linux cu130 wheel pin.
-#
-# This deliberately tracks vllm-project/vllm-metal, NOT vllm-project/vllm: the
-# darwin build can only use the exact vLLM version vllm-metal supports, so it may
-# lag the Linux pin (requirements-cublas13-after.txt) until vllm-metal catches up.
-set -xe
-REPO=$1   # vllm-project/vllm-metal
-FILE=$2   # backend/python/vllm/install.sh
-VAR=$3    # VLLM_METAL_VERSION (used for the workflow's output file names)
-
-if [ -z "$FILE" ] || [ -z "$REPO" ] || [ -z "$VAR" ]; then
-    echo "usage: $0 <repo> <install-file> <var-name>" >&2
-    exit 1
-fi
-
-# vllm-metal ships frequent dev releases, all flagged as non-prerelease, so
-# /releases/latest returns the newest one (with its cp312 wheel asset).
-LATEST_TAG=$(curl -sS -H "Accept: application/vnd.github+json" \
-    "https://api.github.com/repos/$REPO/releases/latest" \
-    | python3 -c "import json,sys; print(json.load(sys.stdin)['tag_name'])")
-
-# The coupled vLLM source version lives in vllm-metal's installer at that tag.
-NEW_VLLM_VERSION=$(curl -fsSL \
-    "https://raw.githubusercontent.com/$REPO/$LATEST_TAG/install.sh" \
-    | grep -oE 'vllm_v="[0-9]+\.[0-9]+\.[0-9]+"' | head -1 | cut -d'"' -f2)
-
-if [ -z "$LATEST_TAG" ] || [ -z "$NEW_VLLM_VERSION" ]; then
-    echo "Could not resolve vllm-metal tag ($LATEST_TAG) or its vllm_v ($NEW_VLLM_VERSION)." >&2
-    exit 1
-fi
-
-set +e
-CURRENT_TAG=$(grep -oE 'VLLM_METAL_VERSION="[^"]*"' "$FILE" | head -1 | cut -d'"' -f2)
-set -e
-
-# Rewrite the single pin. install.sh derives VLLM_VERSION from this tag at build
-# time, so there is nothing else to touch. peter-evans/create-pull-request opens
-# no PR on a clean tree, so a no-op rewrite (already current) is safe.
-sed -i "$FILE" \
-    -e "s|VLLM_METAL_VERSION=\"[^\"]*\"|VLLM_METAL_VERSION=\"$LATEST_TAG\"|"
-
-if [ -z "$CURRENT_TAG" ]; then
-    echo "Could not find VLLM_METAL_VERSION=\"...\" in $FILE." >&2
-    exit 0
-fi
-
-echo "vllm-metal ${CURRENT_TAG} -> ${LATEST_TAG} (builds vLLM ${NEW_VLLM_VERSION}): https://github.com/$REPO/releases/tag/${LATEST_TAG}" >> "${VAR}_message.txt"
-echo "${LATEST_TAG}" >> "${VAR}_commit.txt"
--- a/.github/bump_vllm_wheel.sh
+++ b/.github/bump_vllm_wheel.sh
@@ -1,45 +0,0 @@
-#!/bin/bash
-# Bump the cublas13 vLLM wheel pin in requirements-cublas13-after.txt.
-#
-# vLLM's PyPI wheel is built against CUDA 12 so the cublas13 build pulls a
-# cu130-flavoured wheel from vLLM's per-tag index at
-# https://wheels.vllm.ai/<TAG>/cu130/. That URL segment is itself version-locked
-# (no /latest/ alias upstream), so bumping vLLM means rewriting both the URL
-# segment and the version constraint atomically. bump_deps.sh handles git-sha
-# vars in Makefiles; this script handles the two-value rewrite specific to the
-# vLLM requirements file.
-set -xe
-REPO=$1   # vllm-project/vllm
-FILE=$2   # backend/python/vllm/requirements-cublas13-after.txt
-VAR=$3    # VLLM_VERSION (used for output file names so the workflow can read them)
-
-if [ -z "$FILE" ] || [ -z "$REPO" ] || [ -z "$VAR" ]; then
-    echo "usage: $0 <repo> <requirements-file> <var-name>" >&2
-    exit 1
-fi
-
-# /releases/latest returns the most recent non-prerelease tag.
-LATEST_TAG=$(curl -sS -H "Accept: application/vnd.github+json" \
-    "https://api.github.com/repos/$REPO/releases/latest" \
-    | python3 -c "import json,sys; print(json.load(sys.stdin)['tag_name'])")
-
-# Strip leading 'v' (vLLM tags are 'v0.20.0', the URL/version use '0.20.0').
-NEW_VERSION="${LATEST_TAG#v}"
-
-set +e
-CURRENT_VERSION=$(grep -oE '^vllm==[0-9]+\.[0-9]+\.[0-9]+' "$FILE" | head -1 | cut -d= -f3)
-set -e
-
-# sed both lines unconditionally — peter-evans/create-pull-request opens no PR
-# when the working tree is clean, so a no-op rewrite is safe.
-sed -i "$FILE" \
-    -e "s|wheels\.vllm\.ai/[^/]*/cu130|wheels.vllm.ai/$NEW_VERSION/cu130|g" \
-    -e "s|^vllm==.*|vllm==$NEW_VERSION|"
-
-if [ -z "$CURRENT_VERSION" ]; then
-    echo "Could not find vllm==X.Y.Z in $FILE."
-    exit 0
-fi
-
-echo "Changes: https://github.com/$REPO/compare/v${CURRENT_VERSION}...${LATEST_TAG}" >> "${VAR}_message.txt"
-echo "${NEW_VERSION}" >> "${VAR}_commit.txt"
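The two-value rewrite described in the script's header comment can be sketched as follows. This is an approximation: `bump_vllm` is a hypothetical name, and the layout of the sample requirements text is an assumption, since the actual contents of `requirements-cublas13-after.txt` are not shown in this diff.

```python
import re


def bump_vllm(requirements_text, latest_tag):
    """Rewrite the wheel-index URL segment and the version pin together."""
    # Mirrors ${LATEST_TAG#v}: strip one leading 'v' if present.
    new_version = latest_tag[1:] if latest_tag.startswith("v") else latest_tag
    text = re.sub(
        r"wheels\.vllm\.ai/[^/]*/cu130",
        lambda m: f"wheels.vllm.ai/{new_version}/cu130",
        requirements_text,
    )
    text = re.sub(
        r"^vllm==.*$",
        lambda m: f"vllm=={new_version}",
        text,
        flags=re.MULTILINE,
    )
    return text


if __name__ == "__main__":
    # Assumed file layout, for illustration only.
    sample = "--extra-index-url https://wheels.vllm.ai/0.19.0/cu130\nvllm==0.19.0\n"
    print(bump_vllm(sample, "v0.20.0"), end="")
```

Applying both substitutions in one pass keeps the URL segment and the version constraint from drifting apart, which is the point the script's comment makes about rewriting them atomically.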
--- a/.github/check_and_update.py
+++ b/.github/check_and_update.py
@@ -1,85 +0,0 @@
-import hashlib
-from huggingface_hub import hf_hub_download, get_paths_info
-import requests
-import sys
-import os
-
-uri = sys.argv[1]
-file_name = uri.split('/')[-1]
-
-# Function to parse the URI and determine download method
-def parse_uri(uri):
-    if uri.startswith('huggingface://'):
-        repo_id = uri.split('://')[1]
-        return 'huggingface', repo_id.rsplit('/', 1)[0]
-    elif 'huggingface.co' in uri:
-        parts = uri.split('/resolve/')
-        if len(parts) > 1:
-            repo_path = parts[0].split('https://huggingface.co/')[-1]
-            return 'huggingface', repo_path
-    return 'direct', uri
-
-def calculate_sha256(file_path):
-    sha256_hash = hashlib.sha256()
-    with open(file_path, 'rb') as f:
-        for byte_block in iter(lambda: f.read(4096), b''):
-            sha256_hash.update(byte_block)
-    return sha256_hash.hexdigest()
-
-def manual_safety_check_hf(repo_id):
-    scanResponse = requests.get('https://huggingface.co/api/models/' + repo_id + "/scan")
-    scan = scanResponse.json()
-    # Check if 'hasUnsafeFile' exists in the response
-    if 'hasUnsafeFile' in scan:
-        if scan['hasUnsafeFile']:
-            return scan
-        else:
-            return None
-    else:
-        return None
-
-download_type, repo_id_or_url = parse_uri(uri)
-
-new_checksum =  None
-file_path = None
-
-# Decide download method based on URI type
-if download_type == 'huggingface':
-    # Check if the repo is flagged as dangerous by HF
-    hazard = manual_safety_check_hf(repo_id_or_url)
-    if hazard is not None:
-        print(f'Error: HuggingFace has detected security problems for {repo_id_or_url}: {str(hazard)}', file=sys.stderr)
-        sys.exit(5)
-    # Use HF API to pull sha
-    for file in get_paths_info(repo_id_or_url, [file_name], repo_type='model'):
-        try:
-            new_checksum = file.lfs.sha256
-            break
-        except Exception as e:
-            print(f'Error from Hugging Face Hub: {str(e)}', file=sys.stderr)
-            sys.exit(2)
-    if new_checksum is None:
-        try:
-            file_path = hf_hub_download(repo_id=repo_id_or_url, filename=file_name)
-        except Exception as e:
-            print(f'Error from Hugging Face Hub: {str(e)}', file=sys.stderr)
-            sys.exit(2)
-else:
-    response = requests.get(repo_id_or_url)
-    if response.status_code == 200:
-        with open(file_name, 'wb') as f:
-            f.write(response.content)
-        file_path = file_name
-    elif response.status_code == 404:
-        print(f'File not found: {response.status_code}', file=sys.stderr)
-        sys.exit(2)
-    else:
-        print(f'Error downloading file: {response.status_code}', file=sys.stderr)
-        sys.exit(1)
-
-if new_checksum is None:
-    new_checksum = calculate_sha256(file_path)
-    print(new_checksum)
-    os.remove(file_path)
-else:
-    print(new_checksum)
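To make the URI handling in the removed helper concrete, the sketch below copies `parse_uri` from the file above and runs it on three inputs. The URIs are invented examples, not entries from the gallery.

```python
def parse_uri(uri):
    """Copied from the removed check_and_update.py for illustration."""
    if uri.startswith('huggingface://'):
        repo_id = uri.split('://')[1]
        return 'huggingface', repo_id.rsplit('/', 1)[0]
    elif 'huggingface.co' in uri:
        parts = uri.split('/resolve/')
        if len(parts) > 1:
            repo_path = parts[0].split('https://huggingface.co/')[-1]
            return 'huggingface', repo_path
    return 'direct', uri


if __name__ == "__main__":
    # Hypothetical URIs chosen to exercise each branch.
    for uri in (
        "huggingface://example-org/example-model/model.gguf",
        "https://huggingface.co/example-org/example-model/resolve/main/model.gguf",
        "https://example.com/files/model.bin",
    ):
        print(parse_uri(uri))
```

Note that a `huggingface.co` URL without a `/resolve/` segment falls through to the `direct` branch, so it would be downloaded with `requests` rather than looked up through the Hub API.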
--- a/.github/checksum_checker.sh
+++ b/.github/checksum_checker.sh
@@ -1,63 +0,0 @@
-#!/bin/bash
-# This script needs yq and huggingface_hub to be installed.
-# To install huggingface_hub, run: pip install huggingface_hub
-
-# Path to the input YAML file
-input_yaml=$1
-
-# Function to download file and check checksum using Python
-function check_and_update_checksum() {
-    model_name="$1"
-    file_name="$2"
-    uri="$3"
-    old_checksum="$4"
-    idx="$5"
-
-    # Download the file and calculate new checksum using Python
-    new_checksum=$(python3 ./.github/check_and_update.py $uri)
-    result=$?
-
-    if [[ $result -eq 5 ]]; then
-        echo "Contaminated entry detected, deleting entry for $model_name..."
-        yq eval -i "del(.[$idx])" "$input_yaml"
-        return
-    fi
-
-    if [[ "$new_checksum" == "" ]]; then
-        echo "Error calculating checksum for $file_name. Skipping..."
-        return
-    fi
-
-    echo "Checksum for $file_name: $new_checksum"
-
-    # Compare and update the YAML file if checksums do not match
-    
-    if [[ $result -eq 2 ]]; then
-        echo "File not found, deleting entry for $file_name..."
-        # yq eval -i "del(.[$idx].files[] | select(.filename == \"$file_name\"))" "$input_yaml"
-    elif [[ "$old_checksum" != "$new_checksum" ]]; then
-        echo "Checksum mismatch for $file_name. Updating..."
-        yq eval -i "del(.[$idx].files[] | select(.filename == \"$file_name\").sha256)" "$input_yaml"
-        yq eval -i "(.[$idx].files[] | select(.filename == \"$file_name\")).sha256 = \"$new_checksum\"" "$input_yaml"
-    elif [[ $result -ne 0 ]]; then
-        echo "Error downloading file $file_name. Skipping..."
-    else
-        echo "Checksum match for $file_name. No update needed."
-    fi
-}
-
-# Read the YAML and process each file
-len=$(yq eval '. | length' "$input_yaml")
-for ((i=0; i<$len; i++))
-do
-    name=$(yq eval ".[$i].name" "$input_yaml")
-    files_len=$(yq eval ".[$i].files | length" "$input_yaml")
-    for ((j=0; j<$files_len; j++))
-    do
-        filename=$(yq eval ".[$i].files[$j].filename" "$input_yaml")
-        uri=$(yq eval ".[$i].files[$j].uri" "$input_yaml")
-        checksum=$(yq eval ".[$i].files[$j].sha256" "$input_yaml")
-        echo "Checking model $name, file $filename. URI = $uri, Checksum = $checksum"
-        check_and_update_checksum "$name" "$filename" "$uri" "$checksum" "$i"
-    done
-done
--- a/.github/ci/modelslist.go
+++ b/.github/ci/modelslist.go
@@ -1,304 +0,0 @@
-package main
-
-import (
-	"fmt"
-	"html/template"
-	"io/ioutil"
-	"os"
-
-	"github.com/microcosm-cc/bluemonday"
-	"gopkg.in/yaml.v3"
-)
-
-var modelPageTemplate string = `
-<!DOCTYPE html>
-<html>
-<head>
-    <meta charset="UTF-8">
-    <meta name="viewport" content="width=device-width, initial-scale=1.0">
-    <title>LocalAI models</title>
-    <link href="https://cdnjs.cloudflare.com/ajax/libs/flowbite/2.3.0/flowbite.min.css" rel="stylesheet" />
-    <script src="https://cdn.jsdelivr.net/npm/vanilla-lazyload@19.1.3/dist/lazyload.min.js"></script>
-
-    <link
-    rel="stylesheet"
-    href="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.8.0/build/styles/default.min.css"
-  />
-    <script
-    defer
-    src="https://cdn.jsdelivr.net/gh/highlightjs/cdn-release@11.8.0/build/highlight.min.js"
-  ></script>
-    <script
-    defer
-    src="https://cdn.jsdelivr.net/npm/alpinejs@3.x.x/dist/cdn.min.js"
-  ></script>
-  <script
-    defer
-    src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"
-  ></script>
-  <script
-    defer
-    src="https://cdn.jsdelivr.net/npm/dompurify@3.0.6/dist/purify.min.js"
-  ></script>
-
-  <link href="/static/general.css" rel="stylesheet" />
-    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&family=Roboto:wght@400;500&display=swap" rel="stylesheet">
-    <link
-    href="https://fonts.googleapis.com/css?family=Roboto:300,400,500,700,900&display=swap"
-    rel="stylesheet" />
-  <link
-    rel="stylesheet"
-    href="https://cdn.jsdelivr.net/npm/tw-elements/css/tw-elements.min.css" />
-  <script src="https://cdn.tailwindcss.com/3.3.0"></script>
-  <script>
-    tailwind.config = {
-      darkMode: "class",
-      theme: {
-        fontFamily: {
-          sans: ["Roboto", "sans-serif"],
-          body: ["Roboto", "sans-serif"],
-          mono: ["ui-monospace", "monospace"],
-        },
-      },
-      corePlugins: {
-        preflight: false,
-      },
-    };
-  </script>
-    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.1.1/css/all.min.css">
-    <script src="https://unpkg.com/htmx.org@1.9.12" integrity="sha384-ujb1lZYygJmzgSwoxRggbCHcjc0rB2XoQrxeTUQyRjrOnlCoYta87iKBWq3EsdM2" crossorigin="anonymous"></script>
-</head>
-
-<body class="bg-gray-900 text-gray-200">
-<div class="flex flex-col min-h-screen">
-
-<nav class="bg-gray-800 shadow-lg">
-    <div class="container mx-auto px-4 py-4">
-        <div class="flex items-center justify-between">
-            <div class="flex items-center">
-                <a href="/" class="text-white text-xl font-bold"><img src="https://github.com/mudler/LocalAI/assets/2420543/0966aa2a-166e-4f99-a3e5-6c915fc997dd" alt="LocalAI Logo" class="h-10 mr-3 border-2 border-gray-300 shadow rounded"></a>
-                <a href="/" class="text-white text-xl font-bold">LocalAI</a>
-            </div>
-            <!-- Menu button for small screens -->
-            <div class="lg:hidden">
-                <button id="menu-toggle" class="text-gray-400 hover:text-white focus:outline-none">
-                    <i class="fas fa-bars fa-lg"></i>
-                </button>
-            </div>
-            <!-- Navigation links -->
-            <div class="hidden lg:flex lg:items-center lg:justify-end lg:flex-1 lg:w-0">
-                <a href="https://localai.io" class="text-gray-400 hover:text-white px-3 py-2 rounded" target="_blank" ><i class="fas fa-book-reader pr-2"></i> Documentation</a>
-            </div>
-        </div>
-        <!-- Collapsible menu for small screens -->
-        <div class="hidden lg:hidden" id="mobile-menu">
-            <div class="pt-4 pb-3 border-t border-gray-700">
-
-                <a href="https://localai.io" class="block text-gray-400 hover:text-white px-3 py-2 rounded mt-1" target="_blank" ><i class="fas fa-book-reader pr-2"></i> Documentation</a>
-
-            </div>
-        </div>
-    </div>
-</nav>
-
-<style>
-  .is-hidden {
-	display: none;
-	  }
-</style>
-
-<div class="container mx-auto px-4 flex-grow">
-
-<div class="models mt-12">
-	<h2 class="text-center text-3xl font-semibold text-gray-100">
-	LocalAI model gallery list </h2><br>
-
-	<h2 class="text-center text-3xl font-semibold text-gray-100">
-
-	 🖼️ Available {{.AvailableModels}} models <a href="https://localai.io/models/" target="_blank" >
-			<i class="fas fa-circle-info pr-2"></i>
-		</a></h2>
-
-	<h3>
-	Refer to the Model gallery <a href="https://localai.io/models/" target="_blank" ><i class="fas fa-circle-info pr-2"></i></a> for more information on how to use the models with LocalAI.<br>
-
-	You can install models with the CLI command <code>local-ai models install &lt;model-name&gt;</code> or by using the WebUI.
-	</h3>
-
-	<input class="form-control appearance-none block w-full mt-5 px-3 py-2 text-base font-normal text-gray-300 pb-2 mb-5 bg-gray-800 bg-clip-padding border border-solid border-gray-600 rounded transition ease-in-out m-0 focus:text-gray-300 focus:bg-gray-900 focus:border-blue-500 focus:outline-none" type="search"
-	id="searchbox" placeholder="Live search keyword..">
-	  <div class="dark grid grid-cols-1 grid-rows-1 md:grid-cols-3 block rounded-lg shadow-secondary-1 dark:bg-surface-dark">
-		{{ range $_, $model := .Models }}
-		<div class="box me-4 mb-2 block rounded-lg bg-white shadow-secondary-1  dark:bg-gray-800 dark:bg-surface-dark dark:text-white text-surface pb-2">
-		<div>
-		    {{ $icon := "https://upload.wikimedia.org/wikipedia/commons/6/65/No-Image-Placeholder.svg" }}
-			{{ if $model.Icon }}
-	  		{{ $icon = $model.Icon }}
-	  		{{ end }}
-			<div class="flex justify-center items-center">
-				<img data-src="{{ $icon }}" alt="{{$model.Name}}" class="rounded-t-lg max-h-48 max-w-96 object-cover mt-3 lazy">
-			</div>
-	  		<div class="p-6 text-surface dark:text-white">
-				<h5 class="mb-2 text-xl font-medium leading-tight">{{$model.Name}}</h5>
-
-
-				<p class="mb-4 text-base truncate">{{ $model.Description }}</p>
-
-			</div>
-			<div class="px-6 pt-4 pb-2">
-
-      <!-- Modal toggle -->
-      <button data-modal-target="{{ $model.Name}}-modal" data-modal-toggle="{{ $model.Name }}-modal" class="block text-white bg-blue-700 hover:bg-blue-800 focus:ring-4 focus:outline-none focus:ring-blue-300 font-medium rounded-lg text-sm px-5 py-2.5 text-center dark:bg-blue-600 dark:hover:bg-blue-700 dark:focus:ring-blue-800" type="button">
-        More info
-      </button>
-
-    <!-- Main modal -->
-    <div id="{{ $model.Name}}-modal" tabindex="-1" aria-hidden="true" class="hidden overflow-y-auto overflow-x-hidden fixed top-0 right-0 left-0 z-50 justify-center items-center w-full md:inset-0 h-[calc(100%-1rem)] max-h-full">
-        <div class="relative p-4 w-full max-w-2xl max-h-full">
-            <!-- Modal content -->
-            <div class="relative bg-white rounded-lg shadow dark:bg-gray-700">
-                <!-- Modal header -->
-                <div class="flex items-center justify-between p-4 md:p-5 border-b rounded-t dark:border-gray-600">
-                    <h3 class="text-xl font-semibold text-gray-900 dark:text-white">
-                        {{ $model.Name}}
-                    </h3>
-                    <button type="button" class="text-gray-400 bg-transparent hover:bg-gray-200 hover:text-gray-900 rounded-lg text-sm w-8 h-8 ms-auto inline-flex justify-center items-center dark:hover:bg-gray-600 dark:hover:text-white" data-modal-hide="{{$model.Name}}-modal">
-                        <svg class="w-3 h-3" aria-hidden="true" xmlns="http://www.w3.org/2000/svg" fill="none" viewBox="0 0 14 14">
-                            <path stroke="currentColor" stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="m1 1 6 6m0 0 6 6M7 7l6-6M7 7l-6 6"/>
-                        </svg>
-                        <span class="sr-only">Close modal</span>
-                    </button>
-                </div>
-                <!-- Modal body -->
-                <div class="p-4 md:p-5 space-y-4">
-                    <div class="flex justify-center items-center">
-                    <img data-src="{{ $icon }}" alt="{{$model.Name}}" class="lazy rounded-t-lg max-h-48 max-w-96 object-cover mt-3">
-                  </div>
-
-                    <p class="text-base leading-relaxed text-gray-500 dark:text-gray-400">
-                    {{ $model.Description }}
-
-                    </p>
-
-                    <p class="text-base leading-relaxed text-gray-500 dark:text-gray-400">
-                    To install the model with the CLI, run: <br>
-                    <code> local-ai models install {{$model.Name}} </code> <br>
-
-                    <hr>
-                    See also <a href="https://localai.io/models/" target="_blank" >
-                    Installation <i class="fas fa-circle-info pr-2"></i>
-                    </a> to see how to install models with the REST API.
-                    </p>
-
-                    <p class="text-base leading-relaxed text-gray-500 dark:text-gray-400">
-                    <ul>
-                    {{ range $_, $u := $model.URLs }}
-                    <li><a href="{{ $u }}" target=_blank><i class="fa-solid fa-link"></i> {{ $u }}</a></li>
-                    {{ end }}
-                    </ul>
-                    </p>
-                </div>
-                <!-- Modal footer -->
-                <div class="flex items-center p-4 md:p-5 border-t border-gray-200 rounded-b dark:border-gray-600">
-                    <button data-modal-hide="{{ $model.Name}}-modal" type="button" class="py-2.5 px-5 ms-3 text-sm font-medium text-gray-900 focus:outline-none bg-white rounded-lg border border-gray-200 hover:bg-gray-100 hover:text-blue-700 focus:z-10 focus:ring-4 focus:ring-gray-100 dark:focus:ring-gray-700 dark:bg-gray-800 dark:text-gray-400 dark:border-gray-600 dark:hover:text-white dark:hover:bg-gray-700">Close</button>
-                </div>
-            </div>
-        </div>
-    </div>
-
-
-			</div>
-		</div>
-		</div>
-		{{ end }}
-
-		</div>
-  </div>
-</div>
-
-<script>
-var lazyLoadInstance = new LazyLoad({
-  // Your custom settings go here
-});
-
-let cards = document.querySelectorAll('.box')
-
-function liveSearch() {
-    let search_query = document.getElementById("searchbox").value;
-
-    //Use innerText if all contents are visible
-    //Use textContent for including hidden elements
-    for (var i = 0; i < cards.length; i++) {
-        if(cards[i].textContent.toLowerCase()
-                .includes(search_query.toLowerCase())) {
-            cards[i].classList.remove("is-hidden");
-        } else {
-            cards[i].classList.add("is-hidden");
-        }
-    }
-}
-
-//A little delay
-let typingTimer;
-let typeInterval = 500;
-let searchInput = document.getElementById('searchbox');
-
-searchInput.addEventListener('keyup', () => {
-    clearTimeout(typingTimer);
-    typingTimer = setTimeout(liveSearch, typeInterval);
-});
-</script>
-
-</div>
-
-<script src="https://cdnjs.cloudflare.com/ajax/libs/flowbite/2.3.0/flowbite.min.js"></script>
-</body>
-</html>
-`
-
-type GalleryModel struct {
-	Name        string   `json:"name" yaml:"name"`
-	URLs        []string `json:"urls" yaml:"urls"`
-	Icon        string   `json:"icon" yaml:"icon"`
-	Description string   `json:"description" yaml:"description"`
-}
-
-func main() {
-	// read the YAML file which contains the models
-
-	f, err := ioutil.ReadFile(os.Args[1])
-	if err != nil {
-		fmt.Println("Error reading file:", err)
-		return
-	}
-
-	models := []*GalleryModel{}
-	err = yaml.Unmarshal(f, &models)
-	if err != nil {
-		// write to stderr
-		os.Stderr.WriteString("Error unmarshaling YAML: " + err.Error() + "\n")
-		return
-	}
-
-	// Ensure that all arbitrary text content is sanitized before display
-	for i, m := range models {
-		models[i].Name = bluemonday.StrictPolicy().Sanitize(m.Name)
-		models[i].Description = bluemonday.StrictPolicy().Sanitize(m.Description)
-	}
-
-	// render the template
-	data := struct {
-		Models          []*GalleryModel
-		AvailableModels int
-	}{
-		Models:          models,
-		AvailableModels: len(models),
-	}
-	tmpl := template.Must(template.New("modelPage").Parse(modelPageTemplate))
-
-	err = tmpl.Execute(os.Stdout, data)
-	if err != nil {
-		fmt.Println("Error executing template:", err)
-		return
-	}
-}
--- a/.github/dependabot.yml
+++ b/.github/dependabot.yml
@@ -1,16 +1,10 @@
 # https://docs.github.com/en/code-security/dependabot/dependabot-version-updates/configuration-options-for-the-dependabot.yml-file
 version: 2
 updates:
-  - package-ecosystem: "gitsubmodule"
-    directory: "/"
-    schedule:
-      interval: "weekly"
  - package-ecosystem: "gomod"
    directory: "/"
    schedule:
      interval: "weekly"
-    ignore:
-    - dependency-name: "github.com/mudler/LocalAI/pkg/grpc/proto"
  - package-ecosystem: "github-actions"
    # Workflow files stored in the default location of `.github/workflows`. (You don't need to specify `/.github/workflows` for `directory`. You can use `directory: "/"`.)
    directory: "/"
@@ -29,91 +23,3 @@ updates:
    schedule:
      # Check for updates to GitHub Actions every weekday
      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/backend/python/bark"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/backend/python/common/template"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/backend/python/coqui"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/backend/python/diffusers"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/backend/python/exllama"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/backend/python/exllama2"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/backend/python/mamba"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/backend/python/openvoice"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/backend/python/rerankers"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/backend/python/sentencetransformers"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/backend/python/transformers"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/backend/python/vllm"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/examples/chainlit"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/examples/functions"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/examples/langchain/langchainpy-localai-example"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/examples/langchain-chroma"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "pip"
-    directory: "/examples/streamlit-bot"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "docker"
-    directory: "/examples/k8sgpt"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "docker"
-    directory: "/examples/kubernetes"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "docker"
-    directory: "/examples/langchain"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "gomod"
-    directory: "/examples/semantic-todo"
-    schedule:
-      interval: "weekly"
-  - package-ecosystem: "docker"
-    directory: "/examples/telegram-bot"
-    schedule:
-      interval: "weekly"
--- a/.github/gallery-agent/gallery.go
+++ b/.github/gallery-agent/gallery.go
@@ -1,213 +0,0 @@
-package main
-
-import (
-	"context"
-	"encoding/json"
-	"fmt"
-	"os"
-	"strings"
-
-	"github.com/mudler/LocalAI/core/gallery/importers"
-	"sigs.k8s.io/yaml"
-)
-
-func formatTextContent(text string) string {
-	return formatTextContentWithIndent(text, 4, 6)
-}
-
-// formatTextContentWithIndent formats text content with specified base and list item indentation
-func formatTextContentWithIndent(text string, baseIndent int, listItemIndent int) string {
-	var formattedLines []string
-	lines := strings.Split(text, "\n")
-	for _, line := range lines {
-		trimmed := strings.TrimRight(line, " \t\r")
-		if trimmed == "" {
-			// Keep empty lines as empty (no indentation)
-			formattedLines = append(formattedLines, "")
-		} else {
-			// Preserve relative indentation from yaml.Marshal output
-			// Count existing leading spaces to preserve relative structure
-			leadingSpaces := len(trimmed) - len(strings.TrimLeft(trimmed, " \t"))
-			trimmedStripped := strings.TrimLeft(trimmed, " \t")
-
-			var totalIndent int
-			if strings.HasPrefix(trimmedStripped, "-") {
-				// List items: use listItemIndent (ignore existing leading spaces)
-				totalIndent = listItemIndent
-			} else {
-				// Regular lines: use baseIndent + preserve relative indentation
-				// This handles both top-level keys (leadingSpaces=0) and nested properties (leadingSpaces>0)
-				totalIndent = baseIndent + leadingSpaces
-			}
-
-			indentStr := strings.Repeat(" ", totalIndent)
-			formattedLines = append(formattedLines, indentStr+trimmedStripped)
-		}
-	}
-	formattedText := strings.Join(formattedLines, "\n")
-	// Remove any trailing spaces from the formatted description
-	formattedText = strings.TrimRight(formattedText, " \t")
-	return formattedText
-}
-
-// generateYAMLEntry generates a gallery YAML entry for a model at the given quantization
-func generateYAMLEntry(model ProcessedModel, quantization string) string {
-	modelConfig, err := importers.DiscoverModelConfig("https://huggingface.co/"+model.ModelID, json.RawMessage(`{ "quantization": "`+quantization+`"}`))
-	if err != nil {
-		panic(err)
-	}
-
-	// Extract model name from ModelID
-	parts := strings.Split(model.ModelID, "/")
-	modelName := model.ModelID
-	if len(parts) > 0 {
-		modelName = strings.ToLower(parts[len(parts)-1])
-	}
-	// Remove common suffixes
-	modelName = strings.ReplaceAll(modelName, "-gguf", "")
-	modelName = strings.ReplaceAll(modelName, "-q4_k_m", "")
-	modelName = strings.ReplaceAll(modelName, "-q4_k_s", "")
-	modelName = strings.ReplaceAll(modelName, "-q3_k_m", "")
-	modelName = strings.ReplaceAll(modelName, "-q2_k", "")
-
-	description := model.ReadmeContent
-	if description == "" {
-		description = fmt.Sprintf("AI model: %s", modelName)
-	}
-
-	// Clean up description to prevent YAML linting issues
-	description = cleanTextContent(description)
-	formattedDescription := formatTextContent(description)
-
-	// Strip name and description from config file since they are
-	// already present at the gallery entry level and should not
-	// appear under overrides.
-	configFileContent := modelConfig.ConfigFile
-	var cfgMap map[string]any
-	if err := yaml.Unmarshal([]byte(configFileContent), &cfgMap); err == nil {
-		delete(cfgMap, "name")
-		delete(cfgMap, "description")
-		if cleaned, err := yaml.Marshal(cfgMap); err == nil {
-			configFileContent = string(cleaned)
-		}
-	}
-
-	configFile := formatTextContent(configFileContent)
-
-	filesYAML, _ := yaml.Marshal(modelConfig.Files)
-
-	// Files section: list items need 4 spaces (not 6), since files: is at 2 spaces
-	files := formatTextContentWithIndent(string(filesYAML), 4, 4)
-
-	// Build metadata sections
-	var metadataSections []string
-
-	// Add license if present
-	if model.License != "" {
-		metadataSections = append(metadataSections, fmt.Sprintf(`  license: "%s"`, model.License))
-	}
-
-	// Add tags if present
-	if len(model.Tags) > 0 {
-		tagsYAML, _ := yaml.Marshal(model.Tags)
-		tagsFormatted := formatTextContentWithIndent(string(tagsYAML), 4, 4)
-		tagsFormatted = strings.TrimRight(tagsFormatted, "\n")
-		metadataSections = append(metadataSections, fmt.Sprintf("  tags:\n%s", tagsFormatted))
-	}
-
-	// Add icon if present
-	if model.Icon != "" {
-		metadataSections = append(metadataSections, fmt.Sprintf(`  icon: %s`, model.Icon))
-	}
-
-	// Build the metadata block
-	metadataBlock := ""
-	if len(metadataSections) > 0 {
-		metadataBlock = strings.Join(metadataSections, "\n") + "\n"
-	}
-
-	yamlTemplate := `- name: "%s"
-  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
-  urls:
-    - https://huggingface.co/%s
-  description: |
-%s%s
-  overrides:
-%s
-  files:
-%s`
-	// Trim trailing newlines from formatted sections to prevent extra blank lines
-	formattedDescription = strings.TrimRight(formattedDescription, "\n")
-	configFile = strings.TrimRight(configFile, "\n")
-	files = strings.TrimRight(files, "\n")
-	// Add newline before metadata block if present
-	if metadataBlock != "" {
-		metadataBlock = "\n" + strings.TrimRight(metadataBlock, "\n")
-	}
-	return fmt.Sprintf(yamlTemplate,
-		modelName,
-		model.ModelID,
-		formattedDescription,
-		metadataBlock,
-		configFile,
-		files,
-	)
-}
-
-// generateYAMLForModels generates YAML entries for the selected models and prepends them to index.yaml
-func generateYAMLForModels(ctx context.Context, models []ProcessedModel, quantization string) error {
-
-	// Generate YAML entries for each model
-	var yamlEntries []string
-	for _, model := range models {
-		fmt.Printf("Generating YAML entry for model: %s\n", model.ModelID)
-
-		// Generate YAML entry
-		yamlEntry := generateYAMLEntry(model, quantization)
-		yamlEntries = append(yamlEntries, yamlEntry)
-	}
-
-	// Prepend to index.yaml (write at the top)
-	if len(yamlEntries) > 0 {
-		indexPath := getGalleryIndexPath()
-		fmt.Printf("Prepending YAML entries to %s...\n", indexPath)
-
-		// Read current content
-		content, err := os.ReadFile(indexPath)
-		if err != nil {
-			return fmt.Errorf("failed to read %s: %w", indexPath, err)
-		}
-
-		existingContent := string(content)
-		yamlBlock := strings.Join(yamlEntries, "\n")
-
-		// Check if file starts with "---"
-		var newContent string
-		if strings.HasPrefix(existingContent, "---\n") {
-			// File starts with "---", prepend new entries after it
-			restOfContent := strings.TrimPrefix(existingContent, "---\n")
-			// Ensure proper spacing: "---\n" + new entries + "\n" + rest of content
-			newContent = "---\n" + yamlBlock + "\n" + restOfContent
-		} else if strings.HasPrefix(existingContent, "---") {
-			// File starts with "---" but no newline after
-			restOfContent := strings.TrimPrefix(existingContent, "---")
-			newContent = "---\n" + yamlBlock + "\n" + strings.TrimPrefix(restOfContent, "\n")
-		} else {
-			// No "---" at start, prepend new entries at the very beginning
-			// Trim leading whitespace from existing content
-			existingContent = strings.TrimLeft(existingContent, " \t\n\r")
-			newContent = yamlBlock + "\n" + existingContent
-		}
-
-		// Write back to file
-		err = os.WriteFile(indexPath, []byte(newContent), 0644)
-		if err != nil {
-			return fmt.Errorf("failed to write %s: %w", indexPath, err)
-		}
-
-		fmt.Printf("Successfully prepended %d models to %s\n", len(yamlEntries), indexPath)
-	}
-
-	return nil
-}
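The prepend logic in `generateYAMLForModels` has three cases: the index starts with `---` followed by a newline, starts with a bare `---`, or has no document marker at all. The sketch below isolates that string handling in one pure function so it can be exercised without touching the filesystem. `prependAfterDocMarker` is a hypothetical name introduced for illustration and is not part of the removed file.

```go
package main

import "strings"

// prependAfterDocMarker places block at the top of existing YAML content,
// keeping a leading "---" document marker (if present) as the first line.
func prependAfterDocMarker(existing, block string) string {
	switch {
	case strings.HasPrefix(existing, "---\n"):
		// Marker followed by a newline: insert right after it.
		return "---\n" + block + "\n" + strings.TrimPrefix(existing, "---\n")
	case strings.HasPrefix(existing, "---"):
		// Marker without a trailing newline: normalise it, then insert.
		rest := strings.TrimPrefix(existing, "---")
		return "---\n" + block + "\n" + strings.TrimPrefix(rest, "\n")
	default:
		// No marker: drop leading whitespace and put the block first.
		return block + "\n" + strings.TrimLeft(existing, " \t\n\r")
	}
}
```

Keeping this step free of I/O means the read and write calls around it stay trivial, and the marker handling can be covered by table-driven tests.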
--- a/.github/gallery-agent/helpers.go
+++ b/.github/gallery-agent/helpers.go
@@ -1,301 +0,0 @@
-package main
-
-import (
-	"encoding/json"
-	"fmt"
-	"io"
-	"net/http"
-	"os"
-	"regexp"
-	"strings"
-
-	hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
-	"sigs.k8s.io/yaml"
-)
-
-var galleryIndexPath = os.Getenv("GALLERY_INDEX_PATH")
-
-// getGalleryIndexPath returns the gallery index file path, with a default fallback
-func getGalleryIndexPath() string {
-	if galleryIndexPath != "" {
-		return galleryIndexPath
-	}
-	return "gallery/index.yaml"
-}
-
-type galleryModel struct {
-	Name string   `yaml:"name"`
-	Urls []string `yaml:"urls"`
-}
-
-// loadGalleryURLSet parses gallery/index.yaml once and returns the set of
-// HuggingFace model URLs already present in the gallery.
-func loadGalleryURLSet() (map[string]struct{}, error) {
-	indexPath := getGalleryIndexPath()
-	content, err := os.ReadFile(indexPath)
-	if err != nil {
-		return nil, fmt.Errorf("failed to read %s: %w", indexPath, err)
-	}
-
-	var galleryModels []galleryModel
-	if err := yaml.Unmarshal(content, &galleryModels); err != nil {
-		return nil, fmt.Errorf("failed to unmarshal %s: %w", indexPath, err)
-	}
-
-	set := make(map[string]struct{}, len(galleryModels))
-	for _, gm := range galleryModels {
-		for _, u := range gm.Urls {
-			set[u] = struct{}{}
-		}
-	}
-
-	// Also skip URLs already proposed in open (unmerged) gallery-agent PRs.
-	// The workflow injects these via EXTRA_SKIP_URLS so we don't keep
-	// re-proposing the same model every run while a PR is waiting to merge.
-	for _, line := range strings.FieldsFunc(os.Getenv("EXTRA_SKIP_URLS"), func(r rune) bool {
-		return r == '\n' || r == ',' || r == ' '
-	}) {
-		u := strings.TrimSpace(line)
-		if u != "" {
-			set[u] = struct{}{}
-		}
-	}
-
-	return set, nil
-}
-
-// modelAlreadyInGallery checks whether a HuggingFace model repo is already
-// referenced in the gallery URL set.
-func modelAlreadyInGallery(set map[string]struct{}, modelID string) bool {
-	_, ok := set["https://huggingface.co/"+modelID]
-	return ok
-}
-
-// baseModelFromTags returns the first `base_model:<repo>` value found in the
-// tag list, or "" if none is present. HuggingFace surfaces the base model
-// declared in the model card's YAML frontmatter as such a tag.
-func baseModelFromTags(tags []string) string {
-	for _, t := range tags {
-		if strings.HasPrefix(t, "base_model:") {
-			return strings.TrimPrefix(t, "base_model:")
-		}
-	}
-	return ""
-}
-
-// licenseFromTags returns the `license:<id>` value from the tag list, or "".
-func licenseFromTags(tags []string) string {
-	for _, t := range tags {
-		if strings.HasPrefix(t, "license:") {
-			return strings.TrimPrefix(t, "license:")
-		}
-	}
-	return ""
-}
-
-// curatedTags produces the gallery tag list from HuggingFace's raw tag set.
-// Always includes llm + gguf, then adds whitelisted family / capability
-// markers when they appear in the HF tag list.
-func curatedTags(hfTags []string) []string {
-	whitelist := []string{
-		"gpu", "cpu",
-		"llama", "mistral", "mixtral", "qwen", "qwen2", "qwen3",
-		"gemma", "gemma2", "gemma3", "phi", "phi3", "phi4",
-		"deepseek", "yi", "falcon", "command-r",
-		"vision", "multimodal", "code", "chat",
-		"instruction-tuned", "reasoning", "thinking",
-	}
-	seen := map[string]struct{}{}
-	out := []string{"llm", "gguf"}
-	seen["llm"] = struct{}{}
-	seen["gguf"] = struct{}{}
-
-	hfSet := map[string]struct{}{}
-	for _, t := range hfTags {
-		hfSet[strings.ToLower(t)] = struct{}{}
-	}
-	for _, w := range whitelist {
-		if _, ok := hfSet[w]; ok {
-			if _, dup := seen[w]; !dup {
-				out = append(out, w)
-				seen[w] = struct{}{}
-			}
-		}
-	}
-	return out
-}
-
-// resolveReadme fetches a description-quality README for a (possibly
-// quantized) repo: if a `base_model:` tag is present, fetch the base repo's
-// README; otherwise fall back to the repo's own README.
-func resolveReadme(client *hfapi.Client, modelID string, hfTags []string) (string, error) {
-	if base := baseModelFromTags(hfTags); base != "" && base != modelID {
-		if content, err := client.GetReadmeContent(base, "README.md"); err == nil && strings.TrimSpace(content) != "" {
-			return cleanTextContent(content), nil
-		}
-	}
-	content, err := client.GetReadmeContent(modelID, "README.md")
-	if err != nil {
-		return "", err
-	}
-	return cleanTextContent(content), nil
-}
-
-// extractDescription turns a raw HuggingFace README into a concise plain-text
-// description suitable for embedding in gallery/index.yaml: strips YAML
-// frontmatter, HTML tags/comments, markdown images, link URLs (keeping the
-// link text), markdown tables, and then truncates at a paragraph boundary
-// around ~1200 characters. Raw README should still be used for icon
-// extraction — call this only for the `description:` field.
-func extractDescription(readme string) string {
-	s := readme
-
-	// Strip leading YAML frontmatter: `---\n...\n---\n` at start of file.
-	if strings.HasPrefix(strings.TrimLeft(s, " \t\n"), "---") {
-		trimmed := strings.TrimLeft(s, " \t\n")
-		rest := strings.TrimPrefix(trimmed, "---")
-		if idx := strings.Index(rest, "\n---"); idx >= 0 {
-			after := rest[idx+len("\n---"):]
-			after = strings.TrimPrefix(after, "\n")
-			s = after
-		}
-	}
-
-	// Strip HTML comments and tags.
-	s = regexp.MustCompile(`(?s)<!--.*?-->`).ReplaceAllString(s, "")
-	s = regexp.MustCompile(`(?is)<[^>]+>`).ReplaceAllString(s, "")
-
-	// Strip markdown images entirely.
-	s = regexp.MustCompile(`!\[[^\]]*\]\([^)]*\)`).ReplaceAllString(s, "")
-	// Replace markdown links `[text](url)` with just `text`.
-	s = regexp.MustCompile(`\[([^\]]+)\]\([^)]+\)`).ReplaceAllString(s, "$1")
-
-	// Drop table lines and horizontal rules, and flatten all leading
-	// whitespace: generateYAMLEntry embeds this under a `description: |`
-	// literal block whose indentation is set by the first non-empty line.
-	// If any line has extra leading whitespace (e.g. from an indented
-	// `<p align="center">` block in the original README), YAML will pick
-	// that up as the block's indent and every later line at a smaller
-	// indent blows the block scalar. Stripping leading whitespace here
-	// guarantees uniform 4-space indentation after formatTextContent runs.
-	var kept []string
-	for _, line := range strings.Split(s, "\n") {
-		t := strings.TrimLeft(line, " \t")
-		ts := strings.TrimSpace(t)
-		if strings.HasPrefix(ts, "|") {
-			continue
-		}
-		if strings.HasPrefix(ts, ":--") || strings.HasPrefix(ts, "---") || strings.HasPrefix(ts, "===") {
-			continue
-		}
-		kept = append(kept, t)
-	}
-	s = strings.Join(kept, "\n")
-
-	// Normalise whitespace and drop any leading blank lines so the literal
-	// block in YAML doesn't start with a blank first line (which would
-	// break the indentation detector the same way).
-	s = cleanTextContent(s)
-	s = strings.TrimLeft(s, " \t\n")
-
-	// Truncate at a paragraph boundary around maxLen chars.
-	const maxLen = 1200
-	if len(s) > maxLen {
-		cut := strings.LastIndex(s[:maxLen], "\n\n")
-		if cut < maxLen/3 {
-			cut = maxLen
-		}
-		s = strings.TrimRight(s[:cut], " \t\n") + "\n\n..."
-	}
-
-	return s
-}
-
-// cleanTextContent removes trailing spaces/tabs and collapses multiple empty
-// lines so README content embeds cleanly into YAML without lint noise.
-func cleanTextContent(text string) string {
-	lines := strings.Split(text, "\n")
-	var cleaned []string
-	var prevEmpty bool
-	for _, line := range lines {
-		trimmed := strings.TrimRight(line, " \t\r")
-		if trimmed == "" {
-			if !prevEmpty {
-				cleaned = append(cleaned, "")
-			}
-			prevEmpty = true
-		} else {
-			cleaned = append(cleaned, trimmed)
-			prevEmpty = false
-		}
-	}
-	return strings.TrimRight(strings.Join(cleaned, "\n"), "\n")
-}
-
-// extractIconFromReadme scans README content for an image URL usable as a
-// gallery entry icon.
-func extractIconFromReadme(readmeContent string) string {
-	if readmeContent == "" {
-		return ""
-	}
-
-	markdownImageRegex := regexp.MustCompile(`(?i)!\[[^\]]*\]\(([^)]+\.(png|jpg|jpeg|svg|webp|gif))\)`)
-	htmlImageRegex := regexp.MustCompile(`(?i)<img[^>]+src=["']([^"']+\.(png|jpg|jpeg|svg|webp|gif))["']`)
-	plainImageRegex := regexp.MustCompile(`(?i)https?://[^\s<>"']+\.(png|jpg|jpeg|svg|webp|gif)`)
-
-	if m := markdownImageRegex.FindStringSubmatch(readmeContent); len(m) > 1 && strings.HasPrefix(strings.ToLower(m[1]), "http") {
-		return strings.TrimSpace(m[1])
-	}
-	if m := htmlImageRegex.FindStringSubmatch(readmeContent); len(m) > 1 && strings.HasPrefix(strings.ToLower(m[1]), "http") {
-		return strings.TrimSpace(m[1])
-	}
-	if m := plainImageRegex.FindStringSubmatch(readmeContent); len(m) > 0 && strings.HasPrefix(strings.ToLower(m[0]), "http") {
-		return strings.TrimSpace(m[0])
-	}
-	return ""
-}
-
-// getHuggingFaceAvatarURL returns the HF avatar URL for a user, or "".
-func getHuggingFaceAvatarURL(author string) string {
-	if author == "" {
-		return ""
-	}
-	userURL := fmt.Sprintf("https://huggingface.co/api/users/%s/overview", author)
-	resp, err := http.Get(userURL)
-	if err != nil {
-		return ""
-	}
-	defer resp.Body.Close()
-	if resp.StatusCode != http.StatusOK {
-		return ""
-	}
-	body, err := io.ReadAll(resp.Body)
-	if err != nil {
-		return ""
-	}
-	var info map[string]any
-	if err := json.Unmarshal(body, &info); err != nil {
-		return ""
-	}
-	if v, ok := info["avatarUrl"].(string); ok && v != "" {
-		return v
-	}
-	if v, ok := info["avatar"].(string); ok && v != "" {
-		return v
-	}
-	return ""
-}
-
-// extractModelIcon extracts an icon URL from the README, falling back to the
-// HuggingFace user avatar.
-func extractModelIcon(model ProcessedModel) string {
-	if icon := extractIconFromReadme(model.ReadmeContent); icon != "" {
-		return icon
-	}
-	if model.Author != "" {
-		if avatar := getHuggingFaceAvatarURL(model.Author); avatar != "" {
-			return avatar
-		}
-	}
-	return ""
-}
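Reviewer note: the truncation step at the end of the removed `extractDescription` can be read in isolation. The following is a minimal sketch of that step only, assuming the same rule as the removed code (cut at the last blank line before `maxLen`, falling back to a hard cut when that would keep less than a third of the budget). The function name `truncateAtParagraph` is hypothetical and does not appear in the removed file.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateAtParagraph is a hypothetical helper that mirrors the truncation
// rule in the removed extractDescription: prefer the last paragraph break
// ("\n\n") before maxLen, but cut hard at maxLen if that break falls in the
// first third of the budget (or is missing entirely).
func truncateAtParagraph(s string, maxLen int) string {
	if len(s) <= maxLen {
		return s
	}
	cut := strings.LastIndex(s[:maxLen], "\n\n")
	if cut < maxLen/3 {
		// Covers cut == -1 (no break found) as well as an early break.
		cut = maxLen
	}
	return strings.TrimRight(s[:cut], " \t\n") + "\n\n..."
}

func main() {
	text := "first paragraph\n\nsecond paragraph that runs long"
	fmt.Printf("%q\n", truncateAtParagraph(text, 30))
}
```

The cut is byte-indexed, as in the removed code, so a hard cut at `maxLen` can split a multi-byte UTF-8 character; a caller that cares would need to back up to a rune boundary first.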
--- a/.github/gallery-agent/main.go
+++ b/.github/gallery-agent/main.go
@@ -1,291 +0,0 @@
-package main
-
-import (
-	"context"
-	"encoding/json"
-	"errors"
-	"fmt"
-	"os"
-	"strconv"
-	"time"
-
-	hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
-)
-
-// ProcessedModelFile represents a processed model file with additional metadata
-type ProcessedModelFile struct {
-	Path     string `json:"path"`
-	Size     int64  `json:"size"`
-	SHA256   string `json:"sha256"`
-	IsReadme bool   `json:"is_readme"`
-	FileType string `json:"file_type"` // "model", "readme", "other"
-}
-
-// ProcessedModel represents a processed model with all gathered metadata
-type ProcessedModel struct {
-	ModelID                 string               `json:"model_id"`
-	Author                  string               `json:"author"`
-	Downloads               int                  `json:"downloads"`
-	LastModified            string               `json:"last_modified"`
-	Files                   []ProcessedModelFile `json:"files"`
-	PreferredModelFile      *ProcessedModelFile  `json:"preferred_model_file,omitempty"`
-	ReadmeFile              *ProcessedModelFile  `json:"readme_file,omitempty"`
-	ReadmeContent           string               `json:"readme_content,omitempty"`
-	ReadmeContentPreview    string               `json:"readme_content_preview,omitempty"`
-	QuantizationPreferences []string             `json:"quantization_preferences"`
-	ProcessingError         string               `json:"processing_error,omitempty"`
-	Tags                    []string             `json:"tags,omitempty"`
-	License                 string               `json:"license,omitempty"`
-	Icon                    string               `json:"icon,omitempty"`
-}
-
-// AddedModelSummary represents a summary of models added to the gallery
-type AddedModelSummary struct {
-	SearchTerm     string   `json:"search_term"`
-	TotalFound     int      `json:"total_found"`
-	ModelsAdded    int      `json:"models_added"`
-	AddedModelIDs  []string `json:"added_model_ids"`
-	AddedModelURLs []string `json:"added_model_urls"`
-	Quantization   string   `json:"quantization"`
-	ProcessingTime string   `json:"processing_time"`
-}
-
-func main() {
-	startTime := time.Now()
-
-	// Synthetic mode for local testing
-	if sm := os.Getenv("SYNTHETIC_MODE"); sm == "true" || sm == "1" {
-		fmt.Println("Running in SYNTHETIC MODE - generating random test data")
-		if err := runSyntheticMode(); err != nil {
-			fmt.Fprintf(os.Stderr, "Error in synthetic mode: %v\n", err)
-			os.Exit(1)
-		}
-		return
-	}
-
-	searchTerm := os.Getenv("SEARCH_TERM")
-	if searchTerm == "" {
-		searchTerm = "GGUF"
-	}
-
-	limitStr := os.Getenv("LIMIT")
-	if limitStr == "" {
-		limitStr = "15"
-	}
-	limit, err := strconv.Atoi(limitStr)
-	if err != nil {
-		fmt.Fprintf(os.Stderr, "Error parsing LIMIT: %v\n", err)
-		os.Exit(1)
-	}
-
-	quantization := os.Getenv("QUANTIZATION")
-	if quantization == "" {
-		quantization = "Q4_K_M"
-	}
-
-	maxModelsStr := os.Getenv("MAX_MODELS")
-	if maxModelsStr == "" {
-		maxModelsStr = "1"
-	}
-	maxModels, err := strconv.Atoi(maxModelsStr)
-	if err != nil {
-		fmt.Fprintf(os.Stderr, "Error parsing MAX_MODELS: %v\n", err)
-		os.Exit(1)
-	}
-
-	fmt.Printf("Gallery Agent Configuration:\n")
-	fmt.Printf("  Search Term: %s\n", searchTerm)
-	fmt.Printf("  Limit: %d\n", limit)
-	fmt.Printf("  Quantization: %s\n", quantization)
-	fmt.Printf("  Max Models to Add: %d\n", maxModels)
-	fmt.Printf("  Gallery Index Path: %s\n", getGalleryIndexPath())
-	fmt.Println()
-
-	// Phase 1: load current gallery and query HuggingFace.
-	gallerySet, err := loadGalleryURLSet()
-	if err != nil {
-		fmt.Fprintf(os.Stderr, "Error loading gallery index: %v\n", err)
-		os.Exit(1)
-	}
-	fmt.Printf("Loaded %d existing gallery entries\n", len(gallerySet))
-
-	client := hfapi.NewClient()
-
-	fmt.Println("Searching for trending models on HuggingFace...")
-	rawModels, err := client.GetTrending(searchTerm, limit)
-	if err != nil {
-		if errors.Is(err, hfapi.ErrRateLimited) {
-			fmt.Printf("HuggingFace API is rate limited after retries, skipping this run: %v\n", err)
-			writeSummary(AddedModelSummary{
-				SearchTerm:     searchTerm,
-				TotalFound:     0,
-				ModelsAdded:    0,
-				Quantization:   quantization,
-				ProcessingTime: time.Since(startTime).String(),
-			})
-			return
-		}
-		fmt.Fprintf(os.Stderr, "Error fetching models: %v\n", err)
-		os.Exit(1)
-	}
-	fmt.Printf("Found %d trending models matching %q\n", len(rawModels), searchTerm)
-	totalFound := len(rawModels)
-
-	// Phase 2: drop anything already in the gallery *before* any expensive
-	// per-model work (GetModelDetails, README fetches, icon lookups).
-	fresh := rawModels[:0]
-	for _, m := range rawModels {
-		if modelAlreadyInGallery(gallerySet, m.ModelID) {
-			fmt.Printf("Skipping existing model: %s\n", m.ModelID)
-			continue
-		}
-		fresh = append(fresh, m)
-	}
-	fmt.Printf("%d candidates after gallery dedup\n", len(fresh))
-
-	// Phase 3: HuggingFace already returned these in trendingScore order —
-	// just cap to MAX_MODELS.
-	if len(fresh) > maxModels {
-		fresh = fresh[:maxModels]
-	}
-	if len(fresh) == 0 {
-		fmt.Println("No new models to add to the gallery.")
-		writeSummary(AddedModelSummary{
-			SearchTerm:     searchTerm,
-			TotalFound:     totalFound,
-			ModelsAdded:    0,
-			Quantization:   quantization,
-			ProcessingTime: time.Since(startTime).String(),
-		})
-		return
-	}
-
-	// Phase 4: fetch details and build ProcessedModel entries for survivors.
-	var processed []ProcessedModel
-	quantPrefs := []string{quantization, "Q4_K_M", "Q4_K_S", "Q3_K_M", "Q2_K", "Q8_0"}
-	for _, m := range fresh {
-		fmt.Printf("Processing model: %s (downloads=%d)\n", m.ModelID, m.Downloads)
-
-		pm := ProcessedModel{
-			ModelID:                 m.ModelID,
-			Author:                  m.Author,
-			Downloads:               m.Downloads,
-			LastModified:            m.LastModified,
-			QuantizationPreferences: quantPrefs,
-		}
-
-		details, err := client.GetModelDetails(m.ModelID)
-		if err != nil {
-			fmt.Printf("  Error getting model details: %v (skipping)\n", err)
-			continue
-		}
-
-		preferred := hfapi.FindPreferredModelFile(details.Files, quantPrefs)
-		if preferred == nil {
-			fmt.Printf("  No GGUF file matching %v — skipping\n", quantPrefs)
-			continue
-		}
-
-		pm.Files = make([]ProcessedModelFile, len(details.Files))
-		for j, f := range details.Files {
-			fileType := "other"
-			if f.IsReadme {
-				fileType = "readme"
-			} else if f.Path == preferred.Path {
-				fileType = "model"
-			}
-			pm.Files[j] = ProcessedModelFile{
-				Path:     f.Path,
-				Size:     f.Size,
-				SHA256:   f.SHA256,
-				IsReadme: f.IsReadme,
-				FileType: fileType,
-			}
-			if f.Path == preferred.Path {
-				copyFile := pm.Files[j]
-				pm.PreferredModelFile = &copyFile
-			}
-			if f.IsReadme {
-				copyFile := pm.Files[j]
-				pm.ReadmeFile = &copyFile
-			}
-		}
-
-		// Deterministic README resolution: follow base_model tag if set.
-		// Keep the raw (HTML-bearing) README around while we extract the
-		// icon, then strip it down to a plain-text description for the
-		// `description:` YAML field.
-		readme, err := resolveReadme(client, m.ModelID, m.Tags)
-		if err != nil {
-			fmt.Printf("  Warning: failed to fetch README: %v\n", err)
-		}
-		pm.ReadmeContent = readme
-
-		pm.License = licenseFromTags(m.Tags)
-		pm.Tags = curatedTags(m.Tags)
-		pm.Icon = extractModelIcon(pm)
-
-		if pm.ReadmeContent != "" {
-			pm.ReadmeContent = extractDescription(pm.ReadmeContent)
-			pm.ReadmeContentPreview = truncateString(pm.ReadmeContent, 200)
-		}
-
-		fmt.Printf("  License: %s, Tags: %v, Icon: %s\n", pm.License, pm.Tags, pm.Icon)
-		processed = append(processed, pm)
-	}
-
-	if len(processed) == 0 {
-		fmt.Println("No processable models after detail fetch.")
-		writeSummary(AddedModelSummary{
-			SearchTerm:     searchTerm,
-			TotalFound:     totalFound,
-			ModelsAdded:    0,
-			Quantization:   quantization,
-			ProcessingTime: time.Since(startTime).String(),
-		})
-		return
-	}
-
-	// Phase 5: write YAML entries.
-	var addedIDs, addedURLs []string
-	for _, pm := range processed {
-		addedIDs = append(addedIDs, pm.ModelID)
-		addedURLs = append(addedURLs, "https://huggingface.co/"+pm.ModelID)
-	}
-
-	fmt.Println("Generating YAML entries for selected models...")
-	if err := generateYAMLForModels(context.Background(), processed, quantization); err != nil {
-		fmt.Fprintf(os.Stderr, "Error generating YAML entries: %v\n", err)
-		os.Exit(1)
-	}
-
-	writeSummary(AddedModelSummary{
-		SearchTerm:     searchTerm,
-		TotalFound:     totalFound,
-		ModelsAdded:    len(addedIDs),
-		AddedModelIDs:  addedIDs,
-		AddedModelURLs: addedURLs,
-		Quantization:   quantization,
-		ProcessingTime: time.Since(startTime).String(),
-	})
-}
-
-func writeSummary(summary AddedModelSummary) {
-	data, err := json.MarshalIndent(summary, "", "  ")
-	if err != nil {
-		fmt.Fprintf(os.Stderr, "Error marshaling summary: %v\n", err)
-		return
-	}
-	if err := os.WriteFile("gallery-agent-summary.json", data, 0644); err != nil {
-		fmt.Fprintf(os.Stderr, "Error writing summary file: %v\n", err)
-		return
-	}
-	fmt.Println("Summary written to gallery-agent-summary.json")
-}
-
-func truncateString(s string, maxLen int) string {
-	if len(s) <= maxLen {
-		return s
-	}
-	return s[:maxLen] + "..."
-}
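Reviewer note: Phases 2 and 3 of the removed `main` filter the trending list in place (`rawModels[:0]`) and then cap it to `MAX_MODELS`. Below is a minimal sketch of that idiom, assuming the gallery set behaves like a `map[string]struct{}` keyed by model ID; the real return type of `loadGalleryURLSet` is not shown in this diff, and `selectCandidates` is a hypothetical name.

```go
package main

import "fmt"

// selectCandidates is a hypothetical helper: it drops IDs already present in
// seen and caps the result to limit entries. It filters in place, reusing the
// backing array of ids, so the caller must not rely on ids afterwards.
// limit is assumed to be non-negative.
func selectCandidates(ids []string, seen map[string]struct{}, limit int) []string {
	fresh := ids[:0]
	for _, id := range ids {
		if _, ok := seen[id]; ok {
			continue // already in the gallery
		}
		fresh = append(fresh, id)
	}
	if len(fresh) > limit {
		fresh = fresh[:limit]
	}
	return fresh
}

func main() {
	seen := map[string]struct{}{"b/y": {}}
	ids := []string{"a/x", "b/y", "c/z", "d/w"}
	fmt.Println(selectCandidates(ids, seen, 2))
}
```

Filtering into `ids[:0]` avoids a second allocation, at the cost of overwriting the input slice. That is safe here because the write index never passes the read index.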
--- a/.github/gallery-agent/testing.go
+++ b/.github/gallery-agent/testing.go
@@ -1,224 +0,0 @@
-package main
-
-import (
-	"context"
-	"fmt"
-	"math/rand/v2"
-	"strings"
-	"time"
-)
-
-// runSyntheticMode generates synthetic test data and appends it to the gallery
-func runSyntheticMode() error {
-	generator := NewSyntheticDataGenerator()
-
-	// Generate a random number of synthetic models (1-3)
-	numModels := generator.rand.IntN(3) + 1
-	fmt.Printf("Generating %d synthetic models for testing...\n", numModels)
-
-	var models []ProcessedModel
-	for range numModels {
-		model := generator.GenerateProcessedModel()
-		models = append(models, model)
-		fmt.Printf("Generated synthetic model: %s\n", model.ModelID)
-	}
-
-	// Generate YAML entries and append to gallery/index.yaml
-	fmt.Println("Generating YAML entries for synthetic models...")
-	err := generateYAMLForModels(context.Background(), models, "Q4_K_M")
-	if err != nil {
-		return fmt.Errorf("error generating YAML entries: %w", err)
-	}
-
-	fmt.Printf("Successfully added %d synthetic models to the gallery for testing!\n", len(models))
-	return nil
-}
-
-// SyntheticDataGenerator provides methods to generate synthetic test data
-type SyntheticDataGenerator struct {
-	rand *rand.Rand
-}
-
-// NewSyntheticDataGenerator creates a new synthetic data generator
-func NewSyntheticDataGenerator() *SyntheticDataGenerator {
-	return &SyntheticDataGenerator{
-		rand: rand.New(rand.NewPCG(uint64(time.Now().UnixNano()), 0)),
-	}
-}
-
-// GenerateProcessedModelFile creates a synthetic ProcessedModelFile
-func (g *SyntheticDataGenerator) GenerateProcessedModelFile() ProcessedModelFile {
-	fileTypes := []string{"model", "readme", "other"}
-	fileType := fileTypes[g.rand.IntN(len(fileTypes))]
-
-	var path string
-	var isReadme bool
-
-	switch fileType {
-	case "model":
-		path = fmt.Sprintf("model-%s.gguf", g.randomString(8))
-		isReadme = false
-	case "readme":
-		path = "README.md"
-		isReadme = true
-	default:
-		path = fmt.Sprintf("file-%s.txt", g.randomString(6))
-		isReadme = false
-	}
-
-	return ProcessedModelFile{
-		Path:     path,
-		Size:     int64(g.rand.IntN(1000000000) + 1000000), // 1MB to 1GB
-		SHA256:   g.randomSHA256(),
-		IsReadme: isReadme,
-		FileType: fileType,
-	}
-}
-
-// GenerateProcessedModel creates a synthetic ProcessedModel
-func (g *SyntheticDataGenerator) GenerateProcessedModel() ProcessedModel {
-	authors := []string{"microsoft", "meta", "google", "openai", "anthropic", "mistralai", "huggingface"}
-	modelNames := []string{"llama", "gpt", "claude", "mistral", "gemma", "phi", "qwen", "codellama"}
-
-	author := authors[g.rand.IntN(len(authors))]
-	modelName := modelNames[g.rand.IntN(len(modelNames))]
-	modelID := fmt.Sprintf("%s/%s-%s", author, modelName, g.randomString(6))
-
-	// Generate files
-	numFiles := g.rand.IntN(5) + 2 // 2-6 files
-	files := make([]ProcessedModelFile, numFiles)
-
-	// Ensure at least one model file and one readme
-	hasModelFile := false
-	hasReadme := false
-
-	for i := range numFiles {
-		files[i] = g.GenerateProcessedModelFile()
-		if files[i].FileType == "model" {
-			hasModelFile = true
-		}
-		if files[i].FileType == "readme" {
-			hasReadme = true
-		}
-	}
-
-	// Add required files if missing
-	if !hasModelFile {
-		modelFile := g.GenerateProcessedModelFile()
-		modelFile.FileType = "model"
-		modelFile.Path = fmt.Sprintf("%s-Q4_K_M.gguf", modelName)
-		files = append(files, modelFile)
-	}
-
-	if !hasReadme {
-		readmeFile := g.GenerateProcessedModelFile()
-		readmeFile.FileType = "readme"
-		readmeFile.Path = "README.md"
-		readmeFile.IsReadme = true
-		files = append(files, readmeFile)
-	}
-
-	// Find preferred model file
-	var preferredModelFile *ProcessedModelFile
-	for i := range files {
-		if files[i].FileType == "model" {
-			preferredModelFile = &files[i]
-			break
-		}
-	}
-
-	// Find readme file
-	var readmeFile *ProcessedModelFile
-	for i := range files {
-		if files[i].FileType == "readme" {
-			readmeFile = &files[i]
-			break
-		}
-	}
-
-	readmeContent := g.generateReadmeContent(modelName, author)
-
-	// Generate sample metadata
-	licenses := []string{"apache-2.0", "mit", "llama2", "gpl-3.0", "bsd", ""}
-	license := licenses[g.rand.IntN(len(licenses))]
-
-	sampleTags := []string{"llm", "gguf", "gpu", "cpu", "text-to-text", "chat", "instruction-tuned"}
-	numTags := g.rand.IntN(4) + 3 // 3-6 tags
-	tags := make([]string, numTags)
-	for i := range numTags {
-		tags[i] = sampleTags[g.rand.IntN(len(sampleTags))]
-	}
-	// Remove duplicates
-	tags = g.removeDuplicates(tags)
-
-	// Optionally include icon (50% chance)
-	icon := ""
-	if g.rand.IntN(2) == 0 {
-		icon = fmt.Sprintf("https://cdn-avatars.huggingface.co/v1/production/uploads/%s.png", g.randomString(24))
-	}
-
-	return ProcessedModel{
-		ModelID:                 modelID,
-		Author:                  author,
-		Downloads:               g.rand.IntN(1000000) + 1000,
-		LastModified:            g.randomDate(),
-		Files:                   files,
-		PreferredModelFile:      preferredModelFile,
-		ReadmeFile:              readmeFile,
-		ReadmeContent:           readmeContent,
-		ReadmeContentPreview:    truncateString(readmeContent, 200),
-		QuantizationPreferences: []string{"Q4_K_M", "Q4_K_S", "Q3_K_M", "Q2_K"},
-		ProcessingError:         "",
-		Tags:                    tags,
-		License:                 license,
-		Icon:                    icon,
-	}
-}
-
-// Helper methods for synthetic data generation
-func (g *SyntheticDataGenerator) randomString(length int) string {
-	const charset = "abcdefghijklmnopqrstuvwxyz0123456789"
-	b := make([]byte, length)
-	for i := range b {
-		b[i] = charset[g.rand.IntN(len(charset))]
-	}
-	return string(b)
-}
-
-func (g *SyntheticDataGenerator) randomSHA256() string {
-	const charset = "0123456789abcdef"
-	b := make([]byte, 64)
-	for i := range b {
-		b[i] = charset[g.rand.IntN(len(charset))]
-	}
-	return string(b)
-}
-
-func (g *SyntheticDataGenerator) randomDate() string {
-	now := time.Now()
-	daysAgo := g.rand.IntN(365) // Random date within last year
-	pastDate := now.AddDate(0, 0, -daysAgo)
-	return pastDate.Format("2006-01-02T15:04:05.000Z")
-}
-
-func (g *SyntheticDataGenerator) removeDuplicates(slice []string) []string {
-	keys := make(map[string]bool)
-	result := []string{}
-	for _, item := range slice {
-		if !keys[item] {
-			keys[item] = true
-			result = append(result, item)
-		}
-	}
-	return result
-}
-
-func (g *SyntheticDataGenerator) generateReadmeContent(modelName, author string) string {
-	templates := []string{
-		fmt.Sprintf("# %s Model\n\nThis is a %s model developed by %s. It's designed for various natural language processing tasks including text generation, question answering, and conversation.\n\n## Features\n\n- High-quality text generation\n- Efficient inference\n- Multiple quantization options\n- Easy to use with LocalAI\n\n## Usage\n\nUse this model with LocalAI for various AI tasks.", strings.Title(modelName), modelName, author),
-		fmt.Sprintf("# %s\n\nA powerful language model from %s. This model excels at understanding and generating human-like text across multiple domains.\n\n## Capabilities\n\n- Text completion\n- Code generation\n- Creative writing\n- Technical documentation\n\n## Model Details\n\n- Architecture: Transformer-based\n- Training: Large-scale supervised learning\n- Quantization: Available in multiple formats", strings.Title(modelName), author),
-		fmt.Sprintf("# %s Language Model\n\nDeveloped by %s, this model represents state-of-the-art performance in natural language understanding and generation.\n\n## Key Features\n\n- Multilingual support\n- Context-aware responses\n- Efficient memory usage\n- Fast inference speed\n\n## Applications\n\n- Chatbots and virtual assistants\n- Content generation\n- Code completion\n- Educational tools", strings.Title(modelName), author),
-	}
-
-	return templates[g.rand.IntN(len(templates))]
-}
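Reviewer note: the removed generator seeds its PCG source from the wall clock, so each synthetic run produces different data. If reproducible runs were ever wanted, a fixed seed would be enough. This is a minimal sketch under that assumption; `newSeeded` and the free function `randomString` are hypothetical stand-ins for the removed constructor and method.

```go
package main

import (
	"fmt"
	"math/rand/v2"
)

const charset = "abcdefghijklmnopqrstuvwxyz0123456789"

// newSeeded is a hypothetical constructor: unlike the removed
// NewSyntheticDataGenerator, it takes the seed from the caller so that two
// generators built with the same seed yield the same sequence.
func newSeeded(seed uint64) *rand.Rand {
	return rand.New(rand.NewPCG(seed, 0))
}

// randomString draws length characters from charset using r.
func randomString(r *rand.Rand, length int) string {
	b := make([]byte, length)
	for i := range b {
		b[i] = charset[r.IntN(len(charset))]
	}
	return string(b)
}

func main() {
	a, b := newSeeded(42), newSeeded(42)
	// Same seed, same sequence: the two strings are equal.
	fmt.Println(randomString(a, 8) == randomString(b, 8))
}
```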
--- a/.github/labeler.yml
+++ b/.github/labeler.yml
@@ -1,15 +1,6 @@
-enhancement:
+enhancements:
 - head-branch: ['^feature', 'feature']

-dependencies:
- any:
-  - changed-files:
-    - any-glob-to-any-file: 'Makefile'
-  - changed-files:
-    - any-glob-to-any-file: '*.mod'
-  - changed-files:
-    - any-glob-to-any-file: '*.sum'
-
 kind/documentation:
 - any:
  - changed-files:
@@ -17,11 +8,6 @@ kind/documentation:
  - changed-files:
    - any-glob-to-any-file: '*.md'

-area/ai-model:
- any:
-  - changed-files:
-    - any-glob-to-any-file: 'gallery/*'
-
 examples:
 - any:
  - changed-files:
@@ -30,4 +16,4 @@ examples:
 ci:
 - any:
  - changed-files:
-    - any-glob-to-any-file: '.github/*'
+    - any-glob-to-any-file: '.github/*'
--- a/.github/release.yml
+++ b/.github/release.yml
@@ -13,9 +13,6 @@ changelog:
      labels:
        - bug
        - regression
-    - title: "🖧 P2P area"
-      labels:
-         - area/p2p
    - title: Exciting New Features 🎉
      labels:
        - Semver-Minor
--- a/.github/scripts/anchor-digest-in-cache.sh
+++ b/.github/scripts/anchor-digest-in-cache.sh
@@ -1,46 +0,0 @@
-#!/usr/bin/env bash
-# Anchor a backend per-arch digest in quay.io/go-skynet/ci-cache so quay's
-# garbage collector won't reap the manifest before backend_merge.yml runs.
-#
-# Context: backend_build.yml pushes by canonical digest only
-# (push-by-digest=true). Unreferenced manifests on quay can be reaped within
-# ~1-2h, but backend-merge-jobs runs only after the *entire* per-arch build
-# matrix drains (max-parallel: 8 × dozens of entries → ~2h+). Without an
-# anchoring tag, the earliest digests are gone by the time `imagetools create`
-# tries to read them, producing "manifest not found" merge failures.
-#
-# We tag the digest under our internal ci-cache image; quay does not GC tagged
-# manifests. The user-facing manifest list still references the original
-# digest in local-ai-backends. backend_merge.yml deletes the anchor tag after
-# the user-facing manifest is published — see cleanup-keepalive-tags.sh.
-#
-# Required env:
-#   GITHUB_RUN_ID  - current workflow run id (set automatically by GHA)
-#   TAG_SUFFIX     - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
-#   PLATFORM_TAG   - amd64 / arm64 / single (single = singleton matrix entry)
-#   DIGEST         - canonical content digest from build step (sha256:...)
-#
-# Optional env:
-#   ANCHOR_IMAGE   - target image (default: quay.io/go-skynet/ci-cache)
-#   SOURCE_IMAGE   - source image (default: quay.io/go-skynet/local-ai-backends)
-#   GITHUB_STEP_SUMMARY - if set, an anchored-by line is appended to it
-set -euo pipefail
-
-: "${GITHUB_RUN_ID:?}"
-: "${TAG_SUFFIX:?}"
-: "${PLATFORM_TAG:?}"
-: "${DIGEST:?}"
-
-anchor_image="${ANCHOR_IMAGE:-quay.io/go-skynet/ci-cache}"
-source_image="${SOURCE_IMAGE:-quay.io/go-skynet/local-ai-backends}"
-
-tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${PLATFORM_TAG}"
-
-docker buildx imagetools create \
-  -t "${anchor_image}:${tag}" \
-  "${source_image}@${DIGEST}"
-
-echo "anchored ${DIGEST} as ${anchor_image}:${tag}"
-if [[ -n "${GITHUB_STEP_SUMMARY:-}" ]]; then
-  echo "anchored \`${DIGEST}\` as \`${anchor_image}:${tag}\`" >> "${GITHUB_STEP_SUMMARY}"
-fi
--- a/.github/scripts/cleanup-keepalive-tags.sh
+++ b/.github/scripts/cleanup-keepalive-tags.sh
@@ -1,49 +0,0 @@
-#!/usr/bin/env bash
-# Best-effort cleanup of the keepalive anchor tags written by
-# anchor-digest-in-cache.sh. Called from backend_merge.yml after the
-# user-facing manifest list has been published.
-#
-# Quay's docker registry v2 doesn't allow tag deletes — only digest deletes.
-# The proper delete is the quay REST API, which requires an OAuth-scoped
-# token. We try QUAY_TOKEN as a bearer token: if the secret is an OAuth app
-# token (typical for service accounts) the delete succeeds; otherwise this
-# is a soft no-op and the tag persists until manually pruned.
-#
-# Cleanup failure MUST NOT fail the merge — the merge has already produced
-# the user-facing manifest list at this point and the keepalive tags are
-# pure overhead. We always exit 0.
-#
-# Required env:
-#   GITHUB_RUN_ID  - current workflow run id (set automatically by GHA)
-#   TAG_SUFFIX     - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
-#   QUAY_TOKEN     - bearer token for quay's REST API
-#
-# Optional env:
-#   QUAY_REPO      - target repo (default: go-skynet/ci-cache)
-#   PLATFORM_TAGS  - space-separated list of platform-tag values to try
-#                    (default: "amd64 arm64 single")
-#                    We don't know which platform-tag(s) exist for this
-#                    tag-suffix without an extra API call, so we just try
-#                    all three and ignore 404s for the ones that don't.
-set -uo pipefail
-
-: "${GITHUB_RUN_ID:?}"
-: "${TAG_SUFFIX:?}"
-: "${QUAY_TOKEN:?}"
-
-quay_repo="${QUAY_REPO:-go-skynet/ci-cache}"
-platform_tags="${PLATFORM_TAGS:-amd64 arm64 single}"
-
-for plat in $platform_tags; do
-  tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${plat}"
-  url="https://quay.io/api/v1/repository/${quay_repo}/tag/${tag}"
-  http=$(curl -sS -o /dev/null -w '%{http_code}' \
-    -X DELETE -H "Authorization: Bearer ${QUAY_TOKEN}" "$url" || echo "000")
-  case "$http" in
-    204|200) echo "deleted $tag" ;;
-    404)     echo "not present: $tag" ;;
-    401|403) echo "auth not OAuth-scoped (http $http) for $tag - skipping; orphan tag will persist" ;;
-    *)       echo "unexpected http $http deleting $tag - skipping" ;;
-  esac
-done
-exit 0
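Reviewer note: both removed scripts depend on the same tag naming scheme, and the cleanup script maps HTTP status codes to a best-effort outcome. The sketch below restates those two pieces in Go for readers following the Go parts of this change. The function names are hypothetical, and nothing here calls quay; it only mirrors the string format and the `case` table from the shell scripts.

```go
package main

import "fmt"

// keepaliveTag mirrors the shell expression
// "keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${PLATFORM_TAG}".
// tagSuffix is expected to carry its own leading dash, as in the scripts.
func keepaliveTag(runID, tagSuffix, platform string) string {
	return "keepalive-" + runID + tagSuffix + "-" + platform
}

// classifyDelete mirrors the case table in cleanup-keepalive-tags.sh.
// Every outcome is non-fatal; the script always exits 0.
func classifyDelete(status int) string {
	switch status {
	case 200, 204:
		return "deleted"
	case 404:
		return "not present"
	case 401, 403:
		return "auth not OAuth-scoped"
	default:
		return "unexpected"
	}
}

func main() {
	for _, plat := range []string{"amd64", "arm64", "single"} {
		fmt.Println(keepaliveTag("123", "-gpu-nvidia-cuda-12-vllm", plat))
	}
	fmt.Println(classifyDelete(404))
}
```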
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
@@ -1,370 +0,0 @@
---
-name: 'build backend container images'
-
-on:
-  push:
-    branches:
-      - master
-    tags:
-      - '*'
-  schedule:
-    # Weekly full-matrix rebuild to pick up upstream Python wheel updates
-    # (torch, transformers, vllm, ...) which most backends pull unpinned.
-    # The DEPS_REFRESH build-arg in backend_build.yml busts the install
-    # layer cache on a new ISO week, but only fires when the build runs.
-    # Path filtering on commit-driven pushes (scripts/changed-backends.js)
-    # skips untouched backends, so without this cron those images would
-    # drift on stale wheels indefinitely. C++/Go backends with pinned
-    # deps cache-hit and finish fast.
-    #
-    # Schedule events have no event.ref / event.before, so the script's
-    # changedFiles==null fallback emits the full matrix automatically —
-    # no script changes needed.
-    - cron: '0 6 * * 0'  # Sundays 06:00 UTC
-  workflow_dispatch:
-
-concurrency:
-  group: ci-backends-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-
-jobs:
-  generate-matrix:
-    if: github.repository == 'mudler/LocalAI'
-    runs-on: ubuntu-latest
-    outputs:
-      matrix-multiarch: ${{ steps.set-matrix.outputs['matrix-multiarch'] }}
-      matrix-darwin: ${{ steps.set-matrix.outputs['matrix-darwin'] }}
-      merge-matrix-multiarch: ${{ steps.set-matrix.outputs['merge-matrix-multiarch'] }}
-      has-backends-multiarch: ${{ steps.set-matrix.outputs['has-backends-multiarch'] }}
-      has-backends-darwin: ${{ steps.set-matrix.outputs['has-backends-darwin'] }}
-      has-merges-multiarch: ${{ steps.set-matrix.outputs['has-merges-multiarch'] }}
-      # Single-arch backends are sharded across SINGLEARCH_SHARDS matrix jobs to
-      # stay under GitHub's 256-jobs-per-matrix limit (see changed-backends.js).
-      matrix-singlearch-1: ${{ steps.set-matrix.outputs['matrix-singlearch-1'] }}
-      merge-matrix-singlearch-1: ${{ steps.set-matrix.outputs['merge-matrix-singlearch-1'] }}
-      has-backends-singlearch-1: ${{ steps.set-matrix.outputs['has-backends-singlearch-1'] }}
-      has-merges-singlearch-1: ${{ steps.set-matrix.outputs['has-merges-singlearch-1'] }}
-      matrix-singlearch-2: ${{ steps.set-matrix.outputs['matrix-singlearch-2'] }}
-      merge-matrix-singlearch-2: ${{ steps.set-matrix.outputs['merge-matrix-singlearch-2'] }}
-      has-backends-singlearch-2: ${{ steps.set-matrix.outputs['has-backends-singlearch-2'] }}
-      has-merges-singlearch-2: ${{ steps.set-matrix.outputs['has-merges-singlearch-2'] }}
-      matrix-singlearch-3: ${{ steps.set-matrix.outputs['matrix-singlearch-3'] }}
-      merge-matrix-singlearch-3: ${{ steps.set-matrix.outputs['merge-matrix-singlearch-3'] }}
-      has-backends-singlearch-3: ${{ steps.set-matrix.outputs['has-backends-singlearch-3'] }}
-      has-merges-singlearch-3: ${{ steps.set-matrix.outputs['has-merges-singlearch-3'] }}
-      matrix-singlearch-4: ${{ steps.set-matrix.outputs['matrix-singlearch-4'] }}
-      merge-matrix-singlearch-4: ${{ steps.set-matrix.outputs['merge-matrix-singlearch-4'] }}
-      has-backends-singlearch-4: ${{ steps.set-matrix.outputs['has-backends-singlearch-4'] }}
-      has-merges-singlearch-4: ${{ steps.set-matrix.outputs['has-merges-singlearch-4'] }}
-    steps:
-      - name: Checkout repository
-        uses: actions/checkout@v7
-
-      - name: Setup Bun
-        uses: oven-sh/setup-bun@v2
-
-      - name: Install dependencies
-        run: |
-          bun add js-yaml
-          bun add @octokit/core
-
-      # Filter the backend matrix from .github/backend-matrix.yml against the
-      # files changed by this push. Tag pushes set FORCE_ALL=true so the script
-      # falls through to the full matrix (releases must rebuild everything).
-      # The script splits the linux matrix into single-arch and multi-arch
-      # groups so backend-merge-jobs can `needs:` only the multi-arch one —
-      # see the comment block above the merge job for context.
-      - name: Filter matrix for changed backends
-        id: set-matrix
-        env:
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          GITHUB_EVENT_PATH: ${{ github.event_path }}
-          FORCE_ALL: ${{ startsWith(github.ref, 'refs/tags/') && 'true' || 'false' }}
-        run: bun run scripts/changed-backends.js
-
-  # Multi-arch backends — entries with a `platform-tag` set, paired with a
-  # sibling entry sharing the same `tag-suffix` (one amd64 leg, one arm64
-  # leg). Their digests are the inputs to backend-merge-jobs, so they're in
-  # their own matrix to bound how long the merge waits before quay GCs the
-  # untagged digests.
-  backend-jobs-multiarch:
-    needs: generate-matrix
-    if: needs.generate-matrix.outputs['has-backends-multiarch'] == 'true'
-    uses: ./.github/workflows/backend_build.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-      build-type: ${{ matrix.build-type }}
-      cuda-major-version: ${{ matrix.cuda-major-version }}
-      cuda-minor-version: ${{ matrix.cuda-minor-version }}
-      platforms: ${{ matrix.platforms }}
-      platform-tag: ${{ matrix.platform-tag || '' }}
-      runs-on: ${{ matrix.runs-on }}
-      builder-base-image: ${{ matrix.builder-base-image || '' }}
-      base-image: ${{ matrix.base-image }}
-      backend: ${{ matrix.backend }}
-      dockerfile: ${{ matrix.dockerfile }}
-      skip-drivers: ${{ matrix.skip-drivers }}
-      context: ${{ matrix.context }}
-      ubuntu-version: ${{ matrix.ubuntu-version }}
-      amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
-    secrets:
-      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      max-parallel: 8
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-multiarch']) }}
-
-  # Single-arch backends — no `platform-tag`. Heavy ones (CUDA, ROCm, Intel
-  # oneAPI, vLLM/sglang) live here. Independent of the merge job: they can
-  # take their full ~6h cold without blocking manifest assembly for the
-  # multi-arch backends whose per-arch digests would otherwise sit untagged
-  # on quay long enough to be GC'd.
-  backend-jobs-singlearch-1:
-    needs: generate-matrix
-    if: needs.generate-matrix.outputs['has-backends-singlearch-1'] == 'true'
-    uses: ./.github/workflows/backend_build.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-      build-type: ${{ matrix.build-type }}
-      cuda-major-version: ${{ matrix.cuda-major-version }}
-      cuda-minor-version: ${{ matrix.cuda-minor-version }}
-      platforms: ${{ matrix.platforms }}
-      platform-tag: ${{ matrix.platform-tag || '' }}
-      runs-on: ${{ matrix.runs-on }}
-      builder-base-image: ${{ matrix.builder-base-image || '' }}
-      base-image: ${{ matrix.base-image }}
-      backend: ${{ matrix.backend }}
-      dockerfile: ${{ matrix.dockerfile }}
-      skip-drivers: ${{ matrix.skip-drivers }}
-      context: ${{ matrix.context }}
-      ubuntu-version: ${{ matrix.ubuntu-version }}
-      amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
-    secrets:
-      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      max-parallel: 8
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch-1']) }}
-
-  backend-jobs-singlearch-2:
-    needs: generate-matrix
-    if: needs.generate-matrix.outputs['has-backends-singlearch-2'] == 'true'
-    uses: ./.github/workflows/backend_build.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-      build-type: ${{ matrix.build-type }}
-      cuda-major-version: ${{ matrix.cuda-major-version }}
-      cuda-minor-version: ${{ matrix.cuda-minor-version }}
-      platforms: ${{ matrix.platforms }}
-      platform-tag: ${{ matrix.platform-tag || '' }}
-      runs-on: ${{ matrix.runs-on }}
-      builder-base-image: ${{ matrix.builder-base-image || '' }}
-      base-image: ${{ matrix.base-image }}
-      backend: ${{ matrix.backend }}
-      dockerfile: ${{ matrix.dockerfile }}
-      skip-drivers: ${{ matrix.skip-drivers }}
-      context: ${{ matrix.context }}
-      ubuntu-version: ${{ matrix.ubuntu-version }}
-      amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
-    secrets:
-      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      max-parallel: 8
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch-2']) }}
-
-  backend-jobs-singlearch-3:
-    needs: generate-matrix
-    if: needs.generate-matrix.outputs['has-backends-singlearch-3'] == 'true'
-    uses: ./.github/workflows/backend_build.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-      build-type: ${{ matrix.build-type }}
-      cuda-major-version: ${{ matrix.cuda-major-version }}
-      cuda-minor-version: ${{ matrix.cuda-minor-version }}
-      platforms: ${{ matrix.platforms }}
-      platform-tag: ${{ matrix.platform-tag || '' }}
-      runs-on: ${{ matrix.runs-on }}
-      builder-base-image: ${{ matrix.builder-base-image || '' }}
-      base-image: ${{ matrix.base-image }}
-      backend: ${{ matrix.backend }}
-      dockerfile: ${{ matrix.dockerfile }}
-      skip-drivers: ${{ matrix.skip-drivers }}
-      context: ${{ matrix.context }}
-      ubuntu-version: ${{ matrix.ubuntu-version }}
-      amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
-    secrets:
-      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      max-parallel: 8
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch-3']) }}
-
-  backend-jobs-singlearch-4:
-    needs: generate-matrix
-    if: needs.generate-matrix.outputs['has-backends-singlearch-4'] == 'true'
-    uses: ./.github/workflows/backend_build.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-      build-type: ${{ matrix.build-type }}
-      cuda-major-version: ${{ matrix.cuda-major-version }}
-      cuda-minor-version: ${{ matrix.cuda-minor-version }}
-      platforms: ${{ matrix.platforms }}
-      platform-tag: ${{ matrix.platform-tag || '' }}
-      runs-on: ${{ matrix.runs-on }}
-      builder-base-image: ${{ matrix.builder-base-image || '' }}
-      base-image: ${{ matrix.base-image }}
-      backend: ${{ matrix.backend }}
-      dockerfile: ${{ matrix.dockerfile }}
-      skip-drivers: ${{ matrix.skip-drivers }}
-      context: ${{ matrix.context }}
-      ubuntu-version: ${{ matrix.ubuntu-version }}
-      amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
-    secrets:
-      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      max-parallel: 8
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch-4']) }}
-
-  # Apply tags to per-arch digests via `imagetools create`. Split into two
-  # jobs that mirror the build split so each merge waits ONLY on its
-  # corresponding build matrix:
-  #
-  #   - backend-merge-jobs-multiarch  needs backend-jobs-multiarch  (~2-3h)
-  #   - backend-merge-jobs-singlearch needs backend-jobs-singlearch (up to ~6h)
-  #
-  # If a single shared merge job depended on both, slow CUDA singlearch
-  # builds would block multiarch merges long enough for quay's GC to reap
-  # the multiarch per-arch digests (the bug fixed by PR #9746). Singletons
-  # also need a merge step because backend_build.yml pushes by canonical
-  # digest only — no tags are applied at build time.
-  backend-merge-jobs-multiarch:
-    needs: [generate-matrix, backend-jobs-multiarch]
-    # !cancelled() lets the merge run even when a few build legs failed.
-    # Without it, GHA's default `needs:` cascade skips the entire merge
-    # matrix on a single failed/cancelled cell. We still want to publish
-    # the manifest lists for tag-suffixes whose legs all succeeded.
-    # Observed in v4.2.1: 2 singlearch build failures cascade-skipped all
-    # ~199 singlearch merge entries.
-    if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
-    uses: ./.github/workflows/backend_merge.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-    secrets:
-      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-multiarch']) }}
-
-  # One merge shard per build shard: backend-merge-jobs-singlearch-<n> needs only
-  # backend-jobs-singlearch-<n>, preserving the "merge waits only on its own
-  # build" property while staying under the 256-jobs-per-matrix limit.
-  backend-merge-jobs-singlearch-1:
-    needs: [generate-matrix, backend-jobs-singlearch-1]
-    if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-singlearch-1'] == 'true' }}
-    uses: ./.github/workflows/backend_merge.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-    secrets:
-      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch-1']) }}
-
-  backend-merge-jobs-singlearch-2:
-    needs: [generate-matrix, backend-jobs-singlearch-2]
-    if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-singlearch-2'] == 'true' }}
-    uses: ./.github/workflows/backend_merge.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-    secrets:
-      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch-2']) }}
-
-  backend-merge-jobs-singlearch-3:
-    needs: [generate-matrix, backend-jobs-singlearch-3]
-    if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-singlearch-3'] == 'true' }}
-    uses: ./.github/workflows/backend_merge.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-    secrets:
-      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch-3']) }}
-
-  backend-merge-jobs-singlearch-4:
-    needs: [generate-matrix, backend-jobs-singlearch-4]
-    if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-singlearch-4'] == 'true' }}
-    uses: ./.github/workflows/backend_merge.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-    secrets:
-      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch-4']) }}
-
-  backend-jobs-darwin:
-    needs: generate-matrix
-    if: needs.generate-matrix.outputs.has-backends-darwin == 'true'
-    uses: ./.github/workflows/backend_build_darwin.yml
-    with:
-      backend: ${{ matrix.backend }}
-      build-type: ${{ matrix.build-type }}
-      go-version: "1.25.x"
-      tag-suffix: ${{ matrix.tag-suffix }}
-      lang: ${{ matrix.lang || 'python' }}
-      use-pip: ${{ matrix.backend == 'diffusers' }}
-      runs-on: "macos-latest"
-    secrets:
-      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs.matrix-darwin) }}
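Note on the shard split in the workflow above: the comments say the singlearch build and merge matrices are split into numbered shards to stay under the 256-jobs-per-matrix limit. The sketch below shows only the arithmetic such a split implies. The function names and the contiguous-chunk assignment are assumptions for illustration; the diff does not show how `scripts/changed-backends.js` actually distributes entries.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Per-matrix job cap cited in the workflow comments.
MATRIX_LIMIT=256

# shard_count TOTAL: smallest number of shards that keeps every shard at or
# under MATRIX_LIMIT entries (ceiling division).
shard_count() {
  local total=$1
  echo $(( (total + MATRIX_LIMIT - 1) / MATRIX_LIMIT ))
}

# shard_of INDEX: 1-based shard number for a 0-based entry index, assuming
# contiguous chunks (hypothetical; the real generator may assign differently).
shard_of() {
  local index=$1
  echo $(( index / MATRIX_LIMIT + 1 ))
}

echo "entries=600 shards=$(shard_count 600) entry#300 -> shard $(shard_of 300)"
```

Under this assumption, a build shard and its merge shard hold the same entries, so each merge job needs only its own build job.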
--- a/.github/workflows/backend_build.yml
+++ b/.github/workflows/backend_build.yml
@@ -1,288 +0,0 @@
---
-name: 'build backend container images (reusable)'
-
-on:
-  workflow_call:
-    inputs:
-      base-image:
-        description: 'Base image'
-        required: true
-        type: string
-      build-type:
-        description: 'Build type'
-        default: ''
-        type: string
-      cuda-major-version:
-        description: 'CUDA major version'
-        default: "12"
-        type: string
-      cuda-minor-version:
-        description: 'CUDA minor version'
-        default: "1"
-        type: string
-      platforms:
-        description: 'Platforms'
-        default: ''
-        type: string
-      platform-tag:
-        description: |
-          Short tag identifying the platform leg, e.g. "amd64" or "arm64".
-          Used to scope the per-arch registry cache and the digest artifact name.
-          Required for split-and-merge multi-arch builds; pass "amd64" for
-          single-arch amd64 builds too. Optional (default '') during the
-          migration to per-arch matrix expansion; will be flipped to
-          required: true in Phase 6 once all callers pass an explicit value.
-        required: false
-        default: ''
-        type: string
-      tag-latest:
-        description: 'Tag latest'
-        default: ''
-        type: string
-      tag-suffix:
-        description: 'Tag suffix'
-        default: ''
-        type: string
-      runs-on:
-        description: 'Runs on'
-        required: true
-        default: ''
-        type: string
-      backend:
-        description: 'Backend to build'
-        required: true
-        type: string
-      context:
-        description: 'Build context'
-        required: true
-        type: string
-      dockerfile:
-        description: 'Build Dockerfile'
-        required: true
-        type: string
-      skip-drivers:
-        description: 'Skip drivers'
-        default: 'false'
-        type: string
-      ubuntu-version:
-        description: 'Ubuntu version'
-        required: false
-        default: '2204'
-        type: string
-      amdgpu-targets:
-        description: 'AMD GPU targets for ROCm/HIP builds'
-        required: false
-        default: ''
-        type: string
-      builder-base-image:
-        description: |
-          Pre-built builder base image (e.g. quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64).
-          When set, the variant Dockerfile uses its `builder-prebuilt` stage which FROMs this
-          image directly instead of running its own gRPC stage + apt installs. Empty for
-          backends whose Dockerfile doesn't support a prebuilt base.
-        required: false
-        default: ''
-        type: string
-    secrets:
-      dockerUsername:
-        required: false
-      dockerPassword:
-        required: false
-      quayUsername:
-        required: true
-      quayPassword:
-        required: true
-
-jobs:
-  backend-build:
-    runs-on: ${{ inputs.runs-on }}
-    env:
-        quay_username: ${{ secrets.quayUsername }}
-    steps:
-
-      - name: Checkout
-        uses: actions/checkout@v7
-        with:
-          submodules: true
-
-      - name: Configure apt mirror on runner
-        id: apt_mirror
-        uses: ./.github/actions/configure-apt-mirror
-
-      - name: Free disk space
-        uses: ./.github/actions/free-disk-space
-        with:
-          mode: ${{ inputs.runs-on == 'ubuntu-latest' && 'hosted' || 'skip' }}
-
-      - name: Set up build disk
-        uses: ./.github/actions/setup-build-disk
-
-      - name: Docker meta
-        id: meta
-        if: github.event_name != 'pull_request'
-        uses: docker/metadata-action@v6
-        with:
-          images: |
-            quay.io/go-skynet/local-ai-backends
-            localai/localai-backends
-          tags: |
-            type=ref,event=branch
-            type=semver,pattern={{raw}}
-            type=sha
-          flavor: |
-            latest=${{ inputs.tag-latest }}
-            suffix=${{ inputs.tag-suffix }},onlatest=true
-
-      - name: Docker meta for PR
-        id: meta_pull_request
-        if: github.event_name == 'pull_request'
-        uses: docker/metadata-action@v6
-        with:
-          images: |
-            quay.io/go-skynet/ci-tests
-          tags: |
-            type=ref,event=branch,suffix=${{ github.event.number }}-${{ inputs.backend }}-${{ inputs.build-type }}-${{ inputs.cuda-major-version }}-${{ inputs.cuda-minor-version }}
-            type=semver,pattern={{raw}},suffix=${{ github.event.number }}-${{ inputs.backend }}-${{ inputs.build-type }}-${{ inputs.cuda-major-version }}-${{ inputs.cuda-minor-version }}
-            type=sha,suffix=${{ github.event.number }}-${{ inputs.backend }}-${{ inputs.build-type }}-${{ inputs.cuda-major-version }}-${{ inputs.cuda-minor-version }}
-          flavor: |
-            latest=${{ inputs.tag-latest }}
-            suffix=${{ inputs.tag-suffix }},onlatest=true
-## End testing image
-      - name: Set up QEMU
-        uses: docker/setup-qemu-action@master
-        with:
-          platforms: all
-
-      - name: Set up Docker Buildx
-        id: buildx
-        uses: docker/setup-buildx-action@master
-
-      - name: Login to DockerHub
-        if: github.event_name != 'pull_request'
-        uses: docker/login-action@v4
-        with:
-          username: ${{ secrets.dockerUsername }}
-          password: ${{ secrets.dockerPassword }}
-
-      - name: Login to Quay.io
-        if: ${{ env.quay_username != '' }}
-        uses: docker/login-action@v4
-        with:
-          registry: quay.io
-          username: ${{ secrets.quayUsername }}
-          password: ${{ secrets.quayPassword }}
-
-      # Weekly cache-buster for the per-backend `make` step. Most Python
-      # backends list unpinned deps (torch, transformers, vllm, ...), so a
-      # warm cache freezes upstream versions indefinitely. Rolling this
-      # weekly forces a re-resolve of the install layer at most once per
-      # week, picking up newer wheels without a full cold rebuild.
-      - name: Compute deps refresh key
-        id: deps_refresh
-        run: echo "key=$(date -u +%Y-W%V)" >> "$GITHUB_OUTPUT"
-
-      - name: Build and push by digest
-        id: build
-        uses: docker/build-push-action@v7
-        if: github.event_name != 'pull_request'
-        with:
-          builder: ${{ steps.buildx.outputs.name }}
-          build-args: |
-            BUILD_TYPE=${{ inputs.build-type }}
-            SKIP_DRIVERS=${{ inputs.skip-drivers }}
-            CUDA_MAJOR_VERSION=${{ inputs.cuda-major-version }}
-            CUDA_MINOR_VERSION=${{ inputs.cuda-minor-version }}
-            BASE_IMAGE=${{ inputs.base-image }}
-            BACKEND=${{ inputs.backend }}
-            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
-            AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
-            APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
-            APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
-            DEPS_REFRESH=${{ steps.deps_refresh.outputs.key }}
-            BUILDER_BASE_IMAGE=${{ inputs.builder-base-image }}
-            BUILDER_TARGET=${{ inputs.builder-base-image != '' && 'builder-prebuilt' || 'builder-fromsource' }}
-          context: ${{ inputs.context }}
-          file: ${{ inputs.dockerfile }}
-          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
-          cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}-${{ inputs.platform-tag }},mode=max,ignore-error=true
-          platforms: ${{ inputs.platforms }}
-          outputs: |
-            type=image,name=quay.io/go-skynet/local-ai-backends,push-by-digest=true,name-canonical=true,push=true
-            type=image,name=localai/localai-backends,push-by-digest=true,name-canonical=true,push=true
-          # Disable provenance: with mode=max (the default for push:true)
-          # buildx bundles a per-registry attestation manifest into each
-          # registry's manifest list, which makes the resulting list digest
-          # diverge across registries. steps.build.outputs.digest then
-          # only matches one of them, and the merge job's
-          # `imagetools create <reg>@sha256:<digest>` lookup fails on the
-          # other. Disabling provenance keeps the digest content-only and
-          # identical across both registries — required for digest-based
-          # cross-registry merge.
-          provenance: false
-          labels: ${{ steps.meta.outputs.labels }}
-
-      - name: Export digest
-        if: github.event_name != 'pull_request'
-        run: |
-          mkdir -p /tmp/digests
-          digest="${{ steps.build.outputs.digest }}"
-          touch "/tmp/digests/${digest#sha256:}"
-
-      # See .github/scripts/anchor-digest-in-cache.sh for why this is needed
-      # and how it interacts with backend_merge.yml's cleanup step.
-      - name: Anchor digest in ci-cache so quay GC won't reap before merge
-        if: github.event_name != 'pull_request'
-        env:
-          TAG_SUFFIX: ${{ inputs.tag-suffix }}
-          PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
-          DIGEST: ${{ steps.build.outputs.digest }}
-        run: .github/scripts/anchor-digest-in-cache.sh
-
-      # Artifact name uses a `--` separator between tag-suffix and platform-tag
-      # to avoid prefix collisions during the merge job's pattern-based download.
-      # Tag-suffixes are not prefix-disjoint (e.g. -gpu-nvidia-cuda-12-vllm is a
-      # prefix of -gpu-nvidia-cuda-12-vllm-omni); a single `-` separator plus the
-      # merge-side `digests<tag-suffix>-*` glob would let one merge over-match
-      # the other backend's artifacts. The `-single` placeholder for empty
-      # platform-tag (single-arch entries) keeps the artifact name non-trailing.
-      - name: Upload digest artifact
-        if: github.event_name != 'pull_request'
-        uses: actions/upload-artifact@v7
-        with:
-          name: digests${{ inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
-          path: /tmp/digests/*
-          if-no-files-found: error
-          retention-days: 1
-
-      - name: Build (PR)
-        uses: docker/build-push-action@v7
-        if: github.event_name == 'pull_request'
-        with:
-          builder: ${{ steps.buildx.outputs.name }}
-          build-args: |
-            BUILD_TYPE=${{ inputs.build-type }}
-            SKIP_DRIVERS=${{ inputs.skip-drivers }}
-            CUDA_MAJOR_VERSION=${{ inputs.cuda-major-version }}
-            CUDA_MINOR_VERSION=${{ inputs.cuda-minor-version }}
-            BASE_IMAGE=${{ inputs.base-image }}
-            BACKEND=${{ inputs.backend }}
-            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
-            AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
-            APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
-            APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
-            DEPS_REFRESH=${{ steps.deps_refresh.outputs.key }}
-            BUILDER_BASE_IMAGE=${{ inputs.builder-base-image }}
-            BUILDER_TARGET=${{ inputs.builder-base-image != '' && 'builder-prebuilt' || 'builder-fromsource' }}
-          context: ${{ inputs.context }}
-          file: ${{ inputs.dockerfile }}
-          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
-          platforms: ${{ inputs.platforms }}
-          push: ${{ env.quay_username != '' }}
-          tags: ${{ steps.meta_pull_request.outputs.tags }}
-          labels: ${{ steps.meta_pull_request.outputs.labels }}
-
-
-
-      - name: job summary
-        run: |
-          echo "Built image: ${{ steps.meta.outputs.labels }}" >> $GITHUB_STEP_SUMMARY
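Note on the artifact-name separator in `backend_build.yml` above: the comment on the "Upload digest artifact" step says a single `-` separator would let one merge job's glob pick up another backend's digests, because tag-suffixes are not prefix-disjoint. The sketch below reproduces that collision with the two suffixes named in the comment. It uses a shell `case` pattern as a stand-in for the merge job's artifact pattern, and the double-dash merge-side glob `digests<tag-suffix>--*` is an assumption, since `backend_merge.yml` is not shown here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# matches_glob NAME PATTERN: succeed when NAME matches the glob PATTERN.
matches_glob() {
  local name=$1 pattern=$2
  # $pattern is deliberately unquoted so that `*` acts as a wildcard.
  case "$name" in
    $pattern) return 0 ;;
    *) return 1 ;;
  esac
}

vllm_suffix="-gpu-nvidia-cuda-12-vllm"
omni_suffix="-gpu-nvidia-cuda-12-vllm-omni"

# Artifact the vllm-omni build would upload under each naming scheme.
omni_single_dash="digests${omni_suffix}-single"
omni_double_dash="digests${omni_suffix}--single"

# Download pattern of the vllm merge job under each scheme (assumed form).
vllm_glob_single="digests${vllm_suffix}-*"
vllm_glob_double="digests${vllm_suffix}--*"

if matches_glob "$omni_single_dash" "$vllm_glob_single"; then
  echo "single dash: vllm merge over-matches ${omni_single_dash}"
fi
if ! matches_glob "$omni_double_dash" "$vllm_glob_double"; then
  echo "double dash: vllm merge ignores ${omni_double_dash}"
fi
```

The double dash works because the suffixes use single dashes between words, so `--` after the vllm suffix cannot match the `-omni` continuation of the longer one.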
--- a/.github/workflows/backend_build_darwin.yml
+++ b/.github/workflows/backend_build_darwin.yml
@@ -1,341 +0,0 @@
---
-name: 'build darwin python backend container images (reusable)'
-
-on:
-  workflow_call:
-    inputs:
-      backend:
-        description: 'Backend to build'
-        required: true
-        type: string
-      build-type:
-        description: 'Build type (e.g., mps)'
-        default: ''
-        type: string
-      use-pip:
-        description: 'Use pip to install dependencies'
-        default: false
-        type: boolean
-      lang:
-        description: 'Programming language (e.g. go)'
-        default: 'python'
-        type: string
-      go-version:
-        description: 'Go version to use'
-        default: '1.24.x'
-        type: string
-      tag-suffix:
-        description: 'Tag suffix for the built image'
-        required: true
-        type: string
-      runs-on:
-        description: 'Runner to use'
-        default: 'macOS-14'
-        type: string
-    secrets:
-      dockerUsername:
-        required: false
-      dockerPassword:
-        required: false
-      quayUsername:
-        required: true
-      quayPassword:
-        required: true
-
-jobs:
-  darwin-backend-build:
-    runs-on: ${{ inputs.runs-on }}
-    strategy:
-      matrix:
-        go-version: ['${{ inputs.go-version }}']
-    env:
-      # Keep the brew Cellar stable across cache restores. Without these,
-      # `brew install` would auto-update brew itself and re-link formulas,
-      # mutating the very paths the cache just restored.
-      HOMEBREW_NO_AUTO_UPDATE: '1'
-      HOMEBREW_NO_INSTALL_CLEANUP: '1'
-      HOMEBREW_NO_ANALYTICS: '1'
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-        with:
-          submodules: true
-
-      - name: Setup Go ${{ matrix.go-version }}
-        uses: actions/setup-go@v5
-        with:
-          go-version: ${{ matrix.go-version }}
-          # Caches ~/go/pkg/mod and ~/Library/Caches/go-build keyed on go.sum.
-          # Shared across every darwin matrix entry — first job in a run warms
-          # it, the rest hit warm.
-          cache: true
-
-      # You can test your matrix by printing the current Go version
-      - name: Display Go version
-        run: go version
-
-      # ---- Homebrew cache ----
-      # macOS runners have no Docker daemon, so the BuildKit registry cache used
-      # for Linux backend images (see .agents/ci-caching.md) doesn't apply here.
-      # We cache the brew downloads + Cellar entries for the formulas we install
-      # below. Read on every run, write only on master/tag pushes — same policy
-      # as the Linux registry cache.
-      - name: Restore Homebrew cache
-        id: brew-cache
-        uses: actions/cache/restore@v6
-        with:
-          path: |
-            ~/Library/Caches/Homebrew/downloads
-            /opt/homebrew/Cellar/protobuf
-            /opt/homebrew/Cellar/grpc
-            /opt/homebrew/Cellar/protoc-gen-go
-            /opt/homebrew/Cellar/protoc-gen-go-grpc
-            /opt/homebrew/Cellar/libomp
-            /opt/homebrew/Cellar/llvm
-            /opt/homebrew/Cellar/ccache
-            /opt/homebrew/Cellar/blake3
-            /opt/homebrew/Cellar/fmt
-            /opt/homebrew/Cellar/hiredis
-            /opt/homebrew/Cellar/xxhash
-            /opt/homebrew/Cellar/zstd
-            /opt/homebrew/Cellar/nlohmann-json
-            /opt/homebrew/Cellar/opus
-          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
-
-      - name: Dependencies
-        run: |
-          # ccache is always installed (used by the llama-cpp variant build) so
-          # the brew cache content stays stable across every backend in the
-          # matrix — they all share one cache key.
-          # blake3, fmt, hiredis, xxhash, zstd are ccache's runtime dylib deps.
-          # Without explicitly installing them, a brew cache-hit run restores
-          # ccache's Cellar dir but skips installing those transitive deps,
-          # and ccache fails at runtime with `dyld: Library not loaded`.
-          # nlohmann-json is header-only and required by the ds4 backend
-          # (dsml_renderer.cpp includes <nlohmann/json.hpp>); on Linux it comes
-          # from the apt-installed nlohmann-json3-dev in the build image.
-          # opus + pkg-config are required by the opus go backend: its
-          # Makefile/package.sh call `pkg-config --cflags/--libs opus` to build
-          # libopusshim.dylib and to locate libopus.dylib for bundling. brew's
-          # pkg-config defaults its search path to the Homebrew prefix so the
-          # opus.pc is found.
-          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd nlohmann-json opus pkg-config
-          # Force-reinstall ccache so brew re-validates its full runtime-dep
-          # closure on every run. This is the durable fix: when the upstream
-          # ccache formula gains a new transitive dep (as it has multiple times
-          # already), we don't have to chase missing dylibs one at a time.
-          # The downloads cache makes the reinstall fast (~5s on a hit).
-          brew reinstall ccache
-          # Same pattern for grpc: its CMake config (used by the llama-cpp
-          # `grpc-server` target) does find_package(absl). The cache restores
-          # /opt/homebrew/Cellar/grpc so brew above no-ops the install, but
-          # abseil isn't in our Cellar cache list and never gets installed
-          # alongside, leaving grpc's CMake unable to resolve it. Reinstalling
-          # grpc re-validates and pulls abseil in, mirroring the ccache fix.
-          brew reinstall grpc
-          # The brew cache restores the Cellar dirs but NOT the bin symlinks
-          # at /opt/homebrew/bin/*. brew install above sees the Cellar present
-          # and decides "already installed" without re-linking, so on a cache-
-          # hit run the formulas aren't on PATH. Force-link them; --overwrite
-          # tolerates pre-existing symlinks from earlier installs.
-          brew link --overwrite protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd nlohmann-json opus pkg-config 2>/dev/null || true
-
-      - name: Save Homebrew cache
-        if: github.event_name != 'pull_request' && steps.brew-cache.outputs.cache-hit != 'true'
-        uses: actions/cache/save@v6
-        with:
-          path: |
-            ~/Library/Caches/Homebrew/downloads
-            /opt/homebrew/Cellar/protobuf
-            /opt/homebrew/Cellar/grpc
-            /opt/homebrew/Cellar/protoc-gen-go
-            /opt/homebrew/Cellar/protoc-gen-go-grpc
-            /opt/homebrew/Cellar/libomp
-            /opt/homebrew/Cellar/llvm
-            /opt/homebrew/Cellar/ccache
-            /opt/homebrew/Cellar/blake3
-            /opt/homebrew/Cellar/fmt
-            /opt/homebrew/Cellar/hiredis
-            /opt/homebrew/Cellar/xxhash
-            /opt/homebrew/Cellar/zstd
-            /opt/homebrew/Cellar/nlohmann-json
-            /opt/homebrew/Cellar/opus
-          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
-
-      # ---- ccache for llama.cpp CMake builds ----
-      # Three CMake variants (fallback, grpc, rpc-server) compile the same
-      # llama.cpp source tree with overlapping flags — ccache dedupes object
-      # files across them. Key on the pinned LLAMA_VERSION so a pin bump
-      # invalidates cleanly; restore-keys fall back to the latest entry for the
-      # same pin so unchanged TUs stay warm even when the cache is fresh.
-      - name: Compute llama.cpp version
-        if: inputs.backend == 'llama-cpp'
-        id: llama-version
-        run: |
-          version=$(grep '^LLAMA_VERSION' backend/cpp/llama-cpp/Makefile | head -1 | cut -d= -f2 | cut -d'?' -f1 | tr -d ' ')
-          echo "version=${version}" >> "$GITHUB_OUTPUT"
-
-      - name: Restore ccache
-        if: inputs.backend == 'llama-cpp'
-        id: ccache-cache
-        uses: actions/cache/restore@v6
-        with:
-          path: ~/Library/Caches/ccache
-          key: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-${{ github.run_id }}
-          restore-keys: |
-            ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-
-
-      - name: Configure ccache
-        if: inputs.backend == 'llama-cpp'
-        run: |
-          mkdir -p "$HOME/Library/Caches/ccache"
-          ccache -M 2G
-          ccache -z
-          # llama-cpp-darwin.sh reads CMAKE_ARGS / CCACHE_DIR from env.
-          {
-            echo "CMAKE_ARGS=${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache"
-            echo "CCACHE_DIR=$HOME/Library/Caches/ccache"
-          } >> "$GITHUB_ENV"
-
-      # ---- Python wheel cache (uv + pip) ----
-      # Mirrors the Linux DEPS_REFRESH cadence (see .agents/ci-caching.md): the
-      # ISO-week segment of the cache key forces at most one cold rebuild per
-      # backend per week, automatically picking up newer wheels for unpinned
-      # deps (torch, mlx, diffusers, …). Restore-keys fall back to the most
-      # recent build of the same backend so off-week PRs still hit warm.
-      - name: Compute weekly cache bucket
-        if: inputs.lang == 'python'
-        id: weekly
-        run: echo "bucket=$(date -u +%Y-W%V)" >> "$GITHUB_OUTPUT"
-
-      - name: Restore Python wheel cache
-        if: inputs.lang == 'python'
-        id: pyenv-cache
-        uses: actions/cache/restore@v6
-        with:
-          path: |
-            ~/Library/Caches/pip
-            ~/Library/Caches/uv
-          key: pyenv-darwin-${{ inputs.backend }}-${{ steps.weekly.outputs.bucket }}-${{ hashFiles(format('backend/python/{0}/requirements*.txt', inputs.backend)) }}
-          restore-keys: |
-            pyenv-darwin-${{ inputs.backend }}-
-
-      # llama-cpp on Darwin uses a bespoke build script (scripts/build/llama-cpp-darwin.sh)
-      # that compiles three CMake variants from backend/cpp/llama-cpp and bundles dylibs
-      # via otool — it doesn't fit the build-darwin-go-backend / build-darwin-python-backend
-      # mold. Drive it via its dedicated `backends/llama-cpp-darwin` make target instead.
-      - name: Build ${{ inputs.backend }}-darwin (llama-cpp)
-        if: inputs.backend == 'llama-cpp'
-        run: |
-          make protogen-go
-          make backends/llama-cpp-darwin
-
-      - name: Build ds4 backend (Darwin Metal)
-        if: inputs.backend == 'ds4'
-        run: |
-          make backends/ds4-darwin
-
-      # privacy-filter is a C++/ggml backend like ds4 - a single grpc-server with
-      # otool dylib bundling - so it gets its own bespoke darwin script rather than
-      # the generic build-darwin-go-backend path.
-      - name: Build privacy-filter backend (Darwin Metal)
-        if: inputs.backend == 'privacy-filter'
-        run: |
-          make protogen-go
-          make backends/privacy-filter-darwin
-
-      - name: Build ${{ inputs.backend }}-darwin
-        if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4' && inputs.backend != 'privacy-filter'
-        run: |
-          make protogen-go
-          BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend
-
-      - name: ccache stats
-        if: inputs.backend == 'llama-cpp'
-        run: ccache -s
-
-      - name: Save ccache
-        if: inputs.backend == 'llama-cpp' && github.event_name != 'pull_request'
-        uses: actions/cache/save@v6
-        with:
-          path: ~/Library/Caches/ccache
-          key: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-${{ github.run_id }}
-
-      - name: Save Python wheel cache
-        if: inputs.lang == 'python' && github.event_name != 'pull_request' && steps.pyenv-cache.outputs.cache-hit != 'true'
-        uses: actions/cache/save@v6
-        with:
-          path: |
-            ~/Library/Caches/pip
-            ~/Library/Caches/uv
-          key: pyenv-darwin-${{ inputs.backend }}-${{ steps.weekly.outputs.bucket }}-${{ hashFiles(format('backend/python/{0}/requirements*.txt', inputs.backend)) }}
-
-      - name: Upload ${{ inputs.backend }}.tar
-        uses: actions/upload-artifact@v7
-        with:
-          name: ${{ inputs.backend }}-tar
-          path: backend-images/${{ inputs.backend }}.tar
-
-  darwin-backend-publish:
-    needs: darwin-backend-build
-    if: github.event_name != 'pull_request'
-    runs-on: ubuntu-latest
-    steps:
-      - name: Download ${{ inputs.backend }}.tar
-        uses: actions/download-artifact@v8
-        with:
-          name: ${{ inputs.backend }}-tar
-          path: .
-
-      - name: Install crane
-        run: |
-          curl -L https://github.com/google/go-containerregistry/releases/latest/download/go-containerregistry_Linux_x86_64.tar.gz | tar -xz
-          sudo mv crane /usr/local/bin/
-
-      - name: Log in to DockerHub
-        run: |
-          echo "${{ secrets.dockerPassword }}" | crane auth login docker.io -u "${{ secrets.dockerUsername }}" --password-stdin
-
-      - name: Log in to quay.io
-        run: |
-          echo "${{ secrets.quayPassword }}" | crane auth login quay.io -u "${{ secrets.quayUsername }}" --password-stdin
-
-      - name: Docker meta (DockerHub)
-        id: meta
-        uses: docker/metadata-action@v6
-        with:
-          images: |
-            localai/localai-backends
-          tags: |
-            type=ref,event=branch
-            type=semver,pattern={{raw}}
-            type=sha
-          flavor: |
-            latest=auto
-            suffix=${{ inputs.tag-suffix }},onlatest=true
-
-      - name: Docker meta (Quay)
-        id: quaymeta
-        uses: docker/metadata-action@v6
-        with:
-          images: |
-            quay.io/go-skynet/local-ai-backends
-          tags: |
-            type=ref,event=branch
-            type=semver,pattern={{raw}}
-            type=sha
-          flavor: |
-            latest=auto
-            suffix=${{ inputs.tag-suffix }},onlatest=true
-
-      - name: Push Docker image (DockerHub)
-        run: |
-          for tag in $(echo "${{ steps.meta.outputs.tags }}" | tr ',' '\n'); do
-            crane push "${{ inputs.backend }}.tar" "$tag"
-          done
-
-      - name: Push Docker image (Quay)
-        run: |
-          for tag in $(echo "${{ steps.quaymeta.outputs.tags }}" | tr ',' '\n'); do
-            crane push "${{ inputs.backend }}.tar" "$tag"
-          done
--- a/.github/workflows/backend_merge.yml
+++ b/.github/workflows/backend_merge.yml
@@ -1,217 +0,0 @@
---
-name: 'merge backend manifest list (reusable)'
-
-# Reusable workflow that joins per-arch digest artifacts (uploaded by
-# backend_build.yml when called with platform-tag) into a single tagged
-# multi-arch manifest list. Called once per backend by backend.yml after
-# both per-arch build jobs succeed.
-
-on:
-  workflow_call:
-    inputs:
-      tag-latest:
-        description: 'Whether the manifest list should also be tagged latest (auto/false/true)'
-        required: false
-        type: string
-        default: ''
-      tag-suffix:
-        description: 'Backend tag suffix (e.g. -cpu-faster-whisper). Used to compute the artifact pattern and the final tag suffix.'
-        required: true
-        type: string
-    secrets:
-      dockerUsername:
-        required: false
-      dockerPassword:
-        required: false
-      quayUsername:
-        required: true
-      quayPassword:
-        required: true
-
-jobs:
-  merge:
-    runs-on: ubuntu-latest
-    # id-token: write is required for keyless cosign — the workflow
-    # exchanges the GitHub OIDC token for a short-lived Fulcio cert that
-    # signs each pushed manifest. Without this permission the runner
-    # cannot mint the token, and `cosign sign` fails with "no token".
-    permissions:
-      contents: read
-      id-token: write
-    env:
-      quay_username: ${{ secrets.quayUsername }}
-      # cosign v2.4.x still gates --registry-referrers-mode=oci-1-1 behind
-      # this flag. Without it, signing fails with:
-      #   invalid argument "oci-1-1" for "--registry-referrers-mode" flag:
-      #   in order to use mode "oci-1-1", you must set COSIGN_EXPERIMENTAL=1
-      COSIGN_EXPERIMENTAL: '1'
-    steps:
-      # Sparse checkout: the merge job needs `.github/scripts/` (for the
-      # keepalive cleanup script) but none of the source tree.
-      - name: Checkout (.github/scripts only)
-        uses: actions/checkout@v7
-        with:
-          sparse-checkout: |
-            .github/scripts
-          sparse-checkout-cone-mode: false
-
-      # `--` separator anchors the glob so we don't over-match sibling
-      # backends whose tag-suffix happens to be a prefix of ours
-      # (e.g. -cpu-vllm vs -cpu-vllm-omni). Must stay in sync with the
-      # upload-artifact name in backend_build.yml.
-      - name: Download digests
-        uses: actions/download-artifact@v8
-        with:
-          pattern: digests${{ inputs.tag-suffix }}--*
-          merge-multiple: true
-          path: /tmp/digests
-
-      - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@master
-
-      # cosign signs each pushed manifest list with --recursive so the
-      # index and every per-arch entry get an attached Sigstore bundle.
-      # Recent cosign releases always emit the new bundle format, so
-      # there's no extra CLI flag to opt into it.
-      - name: Install cosign
-        if: github.event_name != 'pull_request'
-        uses: sigstore/cosign-installer@v3
-        with:
-          cosign-release: 'v2.4.1'
-
-      - name: Login to DockerHub
-        if: github.event_name != 'pull_request'
-        uses: docker/login-action@v4
-        with:
-          username: ${{ secrets.dockerUsername }}
-          password: ${{ secrets.dockerPassword }}
-
-      - name: Login to Quay.io
-        if: ${{ env.quay_username != '' }}
-        uses: docker/login-action@v4
-        with:
-          registry: quay.io
-          username: ${{ secrets.quayUsername }}
-          password: ${{ secrets.quayPassword }}
-
-      - name: Docker meta
-        id: meta
-        if: github.event_name != 'pull_request'
-        uses: docker/metadata-action@v6
-        with:
-          images: |
-            quay.io/go-skynet/local-ai-backends
-            localai/localai-backends
-          tags: |
-            type=ref,event=branch
-            type=semver,pattern={{raw}}
-            type=sha
-          flavor: |
-            latest=${{ inputs.tag-latest }}
-            suffix=${{ inputs.tag-suffix }},onlatest=true
-
-      # Source from ci-cache, not local-ai-backends.
-      #
-      # The build job pushes per-arch manifests to local-ai-backends with
-      # push-by-digest=true (no tag), then anchors a tagged copy into
-      # ci-cache so the manifest can be retrieved hours later when this
-      # merge runs. Quay's manifest GC, however, is per-repository: the
-      # anchor tag in ci-cache protects the manifest there, but the same
-      # digest in local-ai-backends has no tag in *that* repo and gets
-      # reaped independently. Sourcing local-ai-backends@<digest> here
-      # then fails with "manifest not found" — exactly the regression
-      # we hit on v4.2.2 (19/37 multiarch merges failed).
-      #
-      # ci-cache@<digest> resolves because we anchored it there. buildx
-      # imagetools create copies the manifest into local-ai-backends
-      # (cross-repo within the same registry, blobs already cross-mounted
-      # from the original push so no transfer needed) and publishes the
-      # manifest list with the user-facing tags. The resulting manifest
-      # list is fully self-contained in local-ai-backends — child digests
-      # only, no embedded references to ci-cache.
-      - name: Create manifest list and push (quay)
-        if: github.event_name != 'pull_request'
-        working-directory: /tmp/digests
-        run: |
-          set -euo pipefail
-          tags=$(jq -cr '
-            .tags
-            | map(select(startswith("quay.io/")))
-            | map("-t " + .)
-            | join(" ")
-          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
-          if [ -z "$tags" ]; then
-            echo "No quay.io tags from docker/metadata-action; skipping quay merge"
-            exit 0
-          fi
-          # shellcheck disable=SC2086
-          docker buildx imagetools create $tags \
-            $(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
-          # Resolve the manifest-list digest (any tag points at it) so
-          # cosign can sign by digest. Signing by tag would leave the
-          # signature orphaned the next time the tag moves.
-          first_tag=$(jq -cr '
-            .tags | map(select(startswith("quay.io/"))) | .[0]
-          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
-          digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
-          # --recursive walks the list and signs every per-arch entry
-          # too — clients that resolve a tag to a platform-specific
-          # manifest before checking signatures need the per-arch
-          # signatures, not just the list-level one.
-          cosign sign --yes --recursive \
-            --registry-referrers-mode=oci-1-1 \
-            "quay.io/go-skynet/local-ai-backends@${digest}"
-
-      - name: Create manifest list and push (dockerhub)
-        if: github.event_name != 'pull_request'
-        working-directory: /tmp/digests
-        run: |
-          set -euo pipefail
-          tags=$(jq -cr '
-            .tags
-            | map(select(startswith("localai/")))
-            | map("-t " + .)
-            | join(" ")
-          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
-          if [ -z "$tags" ]; then
-            echo "No dockerhub tags from docker/metadata-action; skipping dockerhub merge"
-            exit 0
-          fi
-          # shellcheck disable=SC2086
-          docker buildx imagetools create $tags \
-            $(printf 'localai/localai-backends@sha256:%s ' *)
-          first_tag=$(jq -cr '
-            .tags | map(select(startswith("localai/"))) | .[0]
-          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
-          digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
-          cosign sign --yes --recursive \
-            --registry-referrers-mode=oci-1-1 \
-            "localai/localai-backends@${digest}"
-
-      - name: Inspect manifest
-        if: github.event_name != 'pull_request'
-        run: |
-          set -euo pipefail
-          first_tag=$(jq -cr '.tags[0]' <<< "$DOCKER_METADATA_OUTPUT_JSON")
-          if [ -n "$first_tag" ] && [ "$first_tag" != "null" ]; then
-            docker buildx imagetools inspect "$first_tag"
-          fi
-
-      # See .github/scripts/cleanup-keepalive-tags.sh for why this is
-      # best-effort and what the failure modes are.
-      - name: Cleanup keepalive tags in ci-cache
-        if: github.event_name != 'pull_request' && success()
-        env:
-          TAG_SUFFIX: ${{ inputs.tag-suffix }}
-          QUAY_TOKEN: ${{ secrets.quayPassword }}
-        run: .github/scripts/cleanup-keepalive-tags.sh
-
-      - name: Job summary
-        if: github.event_name != 'pull_request'
-        run: |
-          set -euo pipefail
-          echo "Merged manifest tags:" >> "$GITHUB_STEP_SUMMARY"
-          jq -r '.tags[]' <<< "$DOCKER_METADATA_OUTPUT_JSON" | sed 's/^/- /' >> "$GITHUB_STEP_SUMMARY"
-          echo >> "$GITHUB_STEP_SUMMARY"
-          echo "Per-arch digests:" >> "$GITHUB_STEP_SUMMARY"
-          ls -1 /tmp/digests | sed 's/^/- sha256:/' >> "$GITHUB_STEP_SUMMARY"
--- a/.github/workflows/backend_pr.yml
+++ b/.github/workflows/backend_pr.yml
@@ -1,294 +0,0 @@
-name: 'build backend container images (PR-filtered)'
-
-on:
-  pull_request:
-
-concurrency:
-  group: ci-backends-pr-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-
-jobs:
-  generate-matrix:
-    runs-on: ubuntu-latest
-    outputs:
-      matrix-multiarch: ${{ steps.set-matrix.outputs['matrix-multiarch'] }}
-      matrix-darwin: ${{ steps.set-matrix.outputs['matrix-darwin'] }}
-      merge-matrix-multiarch: ${{ steps.set-matrix.outputs['merge-matrix-multiarch'] }}
-      has-backends-multiarch: ${{ steps.set-matrix.outputs['has-backends-multiarch'] }}
-      has-backends-darwin: ${{ steps.set-matrix.outputs['has-backends-darwin'] }}
-      has-merges-multiarch: ${{ steps.set-matrix.outputs['has-merges-multiarch'] }}
-      # Single-arch backends are sharded across SINGLEARCH_SHARDS matrix jobs to
-      # stay under GitHub's 256-jobs-per-matrix limit (see changed-backends.js).
-      matrix-singlearch-1: ${{ steps.set-matrix.outputs['matrix-singlearch-1'] }}
-      merge-matrix-singlearch-1: ${{ steps.set-matrix.outputs['merge-matrix-singlearch-1'] }}
-      has-backends-singlearch-1: ${{ steps.set-matrix.outputs['has-backends-singlearch-1'] }}
-      has-merges-singlearch-1: ${{ steps.set-matrix.outputs['has-merges-singlearch-1'] }}
-      matrix-singlearch-2: ${{ steps.set-matrix.outputs['matrix-singlearch-2'] }}
-      merge-matrix-singlearch-2: ${{ steps.set-matrix.outputs['merge-matrix-singlearch-2'] }}
-      has-backends-singlearch-2: ${{ steps.set-matrix.outputs['has-backends-singlearch-2'] }}
-      has-merges-singlearch-2: ${{ steps.set-matrix.outputs['has-merges-singlearch-2'] }}
-      matrix-singlearch-3: ${{ steps.set-matrix.outputs['matrix-singlearch-3'] }}
-      merge-matrix-singlearch-3: ${{ steps.set-matrix.outputs['merge-matrix-singlearch-3'] }}
-      has-backends-singlearch-3: ${{ steps.set-matrix.outputs['has-backends-singlearch-3'] }}
-      has-merges-singlearch-3: ${{ steps.set-matrix.outputs['has-merges-singlearch-3'] }}
-      matrix-singlearch-4: ${{ steps.set-matrix.outputs['matrix-singlearch-4'] }}
-      merge-matrix-singlearch-4: ${{ steps.set-matrix.outputs['merge-matrix-singlearch-4'] }}
-      has-backends-singlearch-4: ${{ steps.set-matrix.outputs['has-backends-singlearch-4'] }}
-      has-merges-singlearch-4: ${{ steps.set-matrix.outputs['has-merges-singlearch-4'] }}
-    steps:
-      - name: Checkout repository
-        uses: actions/checkout@v7
-
-      - name: Setup Bun
-        uses: oven-sh/setup-bun@v2
-
-      - name: Install dependencies
-        run: |
-          bun add js-yaml
-          bun add @octokit/core
-
-      # Filters the matrix in backend.yml and splits it into single-arch and
-      # multi-arch groups so backend-merge-jobs can `needs:` only the latter
-      # (matches backend.yml's structure).
-      - name: Filter matrix for changed backends
-        id: set-matrix
-        env:
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          GITHUB_EVENT_PATH: ${{ github.event_path }}
-        run: bun run scripts/changed-backends.js
-
-  backend-jobs-multiarch:
-    needs: generate-matrix
-    uses: ./.github/workflows/backend_build.yml
-    if: needs.generate-matrix.outputs['has-backends-multiarch'] == 'true'
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-      build-type: ${{ matrix.build-type }}
-      cuda-major-version: ${{ matrix.cuda-major-version }}
-      cuda-minor-version: ${{ matrix.cuda-minor-version }}
-      platforms: ${{ matrix.platforms }}
-      platform-tag: ${{ matrix.platform-tag || '' }}
-      runs-on: ${{ matrix.runs-on }}
-      builder-base-image: ${{ matrix.builder-base-image || '' }}
-      base-image: ${{ matrix.base-image }}
-      backend: ${{ matrix.backend }}
-      dockerfile: ${{ matrix.dockerfile }}
-      skip-drivers: ${{ matrix.skip-drivers }}
-      context: ${{ matrix.context }}
-      ubuntu-version: ${{ matrix.ubuntu-version }}
-      amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: true
-      max-parallel: 8
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-multiarch']) }}
-  backend-jobs-singlearch-1:
-    needs: generate-matrix
-    if: needs.generate-matrix.outputs['has-backends-singlearch-1'] == 'true'
-    uses: ./.github/workflows/backend_build.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-      build-type: ${{ matrix.build-type }}
-      cuda-major-version: ${{ matrix.cuda-major-version }}
-      cuda-minor-version: ${{ matrix.cuda-minor-version }}
-      platforms: ${{ matrix.platforms }}
-      platform-tag: ${{ matrix.platform-tag || '' }}
-      runs-on: ${{ matrix.runs-on }}
-      builder-base-image: ${{ matrix.builder-base-image || '' }}
-      base-image: ${{ matrix.base-image }}
-      backend: ${{ matrix.backend }}
-      dockerfile: ${{ matrix.dockerfile }}
-      skip-drivers: ${{ matrix.skip-drivers }}
-      context: ${{ matrix.context }}
-      ubuntu-version: ${{ matrix.ubuntu-version }}
-      amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: true
-      max-parallel: 8
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch-1']) }}
-
-  backend-jobs-singlearch-2:
-    needs: generate-matrix
-    if: needs.generate-matrix.outputs['has-backends-singlearch-2'] == 'true'
-    uses: ./.github/workflows/backend_build.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-      build-type: ${{ matrix.build-type }}
-      cuda-major-version: ${{ matrix.cuda-major-version }}
-      cuda-minor-version: ${{ matrix.cuda-minor-version }}
-      platforms: ${{ matrix.platforms }}
-      platform-tag: ${{ matrix.platform-tag || '' }}
-      runs-on: ${{ matrix.runs-on }}
-      builder-base-image: ${{ matrix.builder-base-image || '' }}
-      base-image: ${{ matrix.base-image }}
-      backend: ${{ matrix.backend }}
-      dockerfile: ${{ matrix.dockerfile }}
-      skip-drivers: ${{ matrix.skip-drivers }}
-      context: ${{ matrix.context }}
-      ubuntu-version: ${{ matrix.ubuntu-version }}
-      amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: true
-      max-parallel: 8
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch-2']) }}
-
-  backend-jobs-singlearch-3:
-    needs: generate-matrix
-    if: needs.generate-matrix.outputs['has-backends-singlearch-3'] == 'true'
-    uses: ./.github/workflows/backend_build.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-      build-type: ${{ matrix.build-type }}
-      cuda-major-version: ${{ matrix.cuda-major-version }}
-      cuda-minor-version: ${{ matrix.cuda-minor-version }}
-      platforms: ${{ matrix.platforms }}
-      platform-tag: ${{ matrix.platform-tag || '' }}
-      runs-on: ${{ matrix.runs-on }}
-      builder-base-image: ${{ matrix.builder-base-image || '' }}
-      base-image: ${{ matrix.base-image }}
-      backend: ${{ matrix.backend }}
-      dockerfile: ${{ matrix.dockerfile }}
-      skip-drivers: ${{ matrix.skip-drivers }}
-      context: ${{ matrix.context }}
-      ubuntu-version: ${{ matrix.ubuntu-version }}
-      amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: true
-      max-parallel: 8
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch-3']) }}
-
-  backend-jobs-singlearch-4:
-    needs: generate-matrix
-    if: needs.generate-matrix.outputs['has-backends-singlearch-4'] == 'true'
-    uses: ./.github/workflows/backend_build.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-      build-type: ${{ matrix.build-type }}
-      cuda-major-version: ${{ matrix.cuda-major-version }}
-      cuda-minor-version: ${{ matrix.cuda-minor-version }}
-      platforms: ${{ matrix.platforms }}
-      platform-tag: ${{ matrix.platform-tag || '' }}
-      runs-on: ${{ matrix.runs-on }}
-      builder-base-image: ${{ matrix.builder-base-image || '' }}
-      base-image: ${{ matrix.base-image }}
-      backend: ${{ matrix.backend }}
-      dockerfile: ${{ matrix.dockerfile }}
-      skip-drivers: ${{ matrix.skip-drivers }}
-      context: ${{ matrix.context }}
-      ubuntu-version: ${{ matrix.ubuntu-version }}
-      amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: true
-      max-parallel: 8
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch-4']) }}
-  backend-merge-jobs-multiarch:
-    needs: [generate-matrix, backend-jobs-multiarch]
-    # backend_merge.yml's push-side steps are all gated on
-    # github.event_name != 'pull_request', so on a PR the merge job would
-    # do nothing. Skip it entirely to avoid spinning up an empty runner.
-    # !cancelled() lets the merge run even when a few build legs fail —
-    # see the matching note in backend.yml.
-    if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
-    uses: ./.github/workflows/backend_merge.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-multiarch']) }}
-
-  backend-merge-jobs-singlearch-1:
-    needs: [generate-matrix, backend-jobs-singlearch-1]
-    if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-singlearch-1'] == 'true' }}
-    uses: ./.github/workflows/backend_merge.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch-1']) }}
-
-  backend-merge-jobs-singlearch-2:
-    needs: [generate-matrix, backend-jobs-singlearch-2]
-    if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-singlearch-2'] == 'true' }}
-    uses: ./.github/workflows/backend_merge.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch-2']) }}
-
-  backend-merge-jobs-singlearch-3:
-    needs: [generate-matrix, backend-jobs-singlearch-3]
-    if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-singlearch-3'] == 'true' }}
-    uses: ./.github/workflows/backend_merge.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch-3']) }}
-
-  backend-merge-jobs-singlearch-4:
-    needs: [generate-matrix, backend-jobs-singlearch-4]
-    if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-singlearch-4'] == 'true' }}
-    uses: ./.github/workflows/backend_merge.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch-4']) }}
-  backend-jobs-darwin:
-    needs: generate-matrix
-    uses: ./.github/workflows/backend_build_darwin.yml
-    if: needs.generate-matrix.outputs.has-backends-darwin == 'true'
-    with:
-      backend: ${{ matrix.backend }}
-      build-type: ${{ matrix.build-type }}
-      go-version: "1.25.x"
-      tag-suffix: ${{ matrix.tag-suffix }}
-      lang: ${{ matrix.lang || 'python' }}
-      use-pip: ${{ matrix.backend == 'diffusers' }}
-      runs-on: "macos-latest"
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: true
-      matrix: ${{ fromJson(needs.generate-matrix.outputs.matrix-darwin) }}
--- a/.github/workflows/base-images.yml
+++ b/.github/workflows/base-images.yml
@@ -1,161 +0,0 @@
---
-name: 'build base-grpc images'
-
-# Builds + pushes pre-compiled builder base images that downstream
-# llama-cpp / ik-llama-cpp / turboquant variant Dockerfiles will FROM
-# (PR 2). Each base contains apt deps + protoc + cmake + gRPC at
-# /opt/grpc + (conditionally) CUDA / ROCm / Vulkan toolchains.
-#
-# Triggers:
-#   - schedule (Saturdays 05:00 UTC) - picks up Ubuntu/CUDA/ROCm
-#     security updates and re-runs ahead of the backend.yml weekly
-#     cron (Sundays 06:00 UTC).
-#   - workflow_dispatch - manual one-off rebuild.
-#   - push to master that touches Dockerfile.base-grpc-builder or
-#     this workflow itself - keeps bases in sync with their inputs.
-#
-# Bootstrap (one-time after this PR merges):
-#   gh workflow run base-images.yml --ref master
-# Wait ~30 min for all 9 matrix variants to push to
-# quay.io/go-skynet/ci-cache:base-grpc-* before merging PR 2.
-
-on:
-  schedule:
-    - cron: '0 5 * * 6'
-  workflow_dispatch:
-  push:
-    branches: [master]
-    paths:
-      - 'backend/Dockerfile.base-grpc-builder'
-      - '.github/workflows/base-images.yml'
-      # The install logic and apt-mirror helper are bind-mounted into
-      # Dockerfile.base-grpc-builder at build time — changes to either
-      # affect the produced base images and must trigger a rebuild.
-      - '.docker/install-base-deps.sh'
-      - '.docker/apt-mirror.sh'
-
-concurrency:
-  group: ci-base-images-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-
-jobs:
-  build:
-    if: github.repository == 'mudler/LocalAI'
-    runs-on: ${{ matrix.runs-on }}
-    strategy:
-      fail-fast: false
-      matrix:
-        include:
-          - tag: 'base-grpc-amd64'
-            runs-on: 'ubuntu-latest'
-            base-image: 'ubuntu:24.04'
-            build-type: ''
-            cuda-major-version: ''
-            cuda-minor-version: ''
-            ubuntu-version: '2404'
-          - tag: 'base-grpc-arm64'
-            runs-on: 'ubuntu-24.04-arm'
-            base-image: 'ubuntu:24.04'
-            build-type: ''
-            cuda-major-version: ''
-            cuda-minor-version: ''
-            ubuntu-version: '2404'
-          - tag: 'base-grpc-cuda-12-amd64'
-            runs-on: 'ubuntu-latest'
-            base-image: 'ubuntu:24.04'
-            build-type: 'cublas'
-            cuda-major-version: '12'
-            cuda-minor-version: '8'
-            ubuntu-version: '2404'
-          - tag: 'base-grpc-cuda-13-amd64'
-            runs-on: 'ubuntu-latest'
-            base-image: 'ubuntu:22.04'
-            build-type: 'cublas'
-            cuda-major-version: '13'
-            cuda-minor-version: '0'
-            ubuntu-version: '2204'
-          - tag: 'base-grpc-cuda-13-arm64'
-            runs-on: 'ubuntu-24.04-arm'
-            base-image: 'ubuntu:24.04'
-            build-type: 'cublas'
-            cuda-major-version: '13'
-            cuda-minor-version: '0'
-            ubuntu-version: '2404'
-          - tag: 'base-grpc-rocm-amd64'
-            runs-on: 'ubuntu-latest'
-            base-image: 'rocm/dev-ubuntu-24.04:7.2.1'
-            build-type: 'hipblas'
-            cuda-major-version: ''
-            cuda-minor-version: ''
-            ubuntu-version: '2404'
-          - tag: 'base-grpc-vulkan-amd64'
-            runs-on: 'ubuntu-latest'
-            base-image: 'ubuntu:24.04'
-            build-type: 'vulkan'
-            cuda-major-version: ''
-            cuda-minor-version: ''
-            ubuntu-version: '2404'
-          - tag: 'base-grpc-vulkan-arm64'
-            runs-on: 'ubuntu-24.04-arm'
-            base-image: 'ubuntu:24.04'
-            build-type: 'vulkan'
-            cuda-major-version: ''
-            cuda-minor-version: ''
-            ubuntu-version: '2404'
-          - tag: 'base-grpc-intel-amd64'
-            runs-on: 'ubuntu-latest'
-            base-image: 'intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04'
-            build-type: 'sycl'
-            cuda-major-version: ''
-            cuda-minor-version: ''
-            ubuntu-version: '2404'
-          # Legacy JetPack r36.4.0 base for older Jetson devices (CUDA 12).
-          # Distinct from base-grpc-cuda-13-arm64 (Ubuntu 24.04 + CUDA 13 sbsa)
-          # which targets newer Jetsons. Some matrix entries
-          # (-nvidia-l4t-arm64-llama-cpp / -turboquant) still build against
-          # the JetPack image, so we need a matching base.
-          - tag: 'base-grpc-l4t-cuda-12-arm64'
-            runs-on: 'ubuntu-24.04-arm'
-            base-image: 'nvcr.io/nvidia/l4t-jetpack:r36.4.0'
-            build-type: 'l4t'
-            cuda-major-version: '12'
-            cuda-minor-version: '0'
-            ubuntu-version: '2204'
-            # JetPack r36.4.0 already ships CUDA preinstalled at /usr/local/cuda;
-            # apt-installing cuda-nvcc-12-0 from the public repos fails because
-            # those packages aren't published for the JetPack apt feed. Match
-            # the original l4t matrix entry which set skip-drivers: 'true'.
-            skip-drivers: 'true'
-    steps:
-      - uses: actions/checkout@v7
-        with:
-          submodules: false
-      - name: Free disk space
-        uses: ./.github/actions/free-disk-space
-      - name: Set up build disk
-        uses: ./.github/actions/setup-build-disk
-      - uses: docker/setup-qemu-action@master
-        with:
-          platforms: all
-      - uses: docker/setup-buildx-action@master
-      - uses: docker/login-action@v4
-        with:
-          registry: quay.io
-          username: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-          password: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-      - uses: docker/build-push-action@v7
-        with:
-          context: .
-          file: ./backend/Dockerfile.base-grpc-builder
-          build-args: |
-            BASE_IMAGE=${{ matrix.base-image }}
-            BUILD_TYPE=${{ matrix.build-type }}
-            CUDA_MAJOR_VERSION=${{ matrix.cuda-major-version }}
-            CUDA_MINOR_VERSION=${{ matrix.cuda-minor-version }}
-            UBUNTU_VERSION=${{ matrix.ubuntu-version }}
-            SKIP_DRIVERS=${{ matrix.skip-drivers || 'false' }}
-          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-${{ matrix.tag }}
-          cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache-${{ matrix.tag }},mode=max,ignore-error=true
-          provenance: false
-          tags: quay.io/go-skynet/ci-cache:${{ matrix.tag }}
-          push: true
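The `build-args` block in the removed `base-images.yml` above fills `SKIP_DRIVERS` with `matrix.skip-drivers || 'false'`, so only the l4t entry has to set that key. Below is a minimal shell sketch of the same fallback, which may help when reproducing one matrix entry locally. The function name `print_build_args` is hypothetical and not part of the repository, and `${6:-false}` is only an approximate shell analogue of the Actions `||` expression.

```shell
#!/bin/sh
set -eu

# Hypothetical helper (not in the repository): print the KEY=value lines that
# the workflow's `build-args: |` block would produce for one matrix entry.
# Arguments mirror the matrix keys, in order: base-image, build-type,
# cuda-major-version, cuda-minor-version, ubuntu-version, [skip-drivers].
print_build_args() {
  printf 'BASE_IMAGE=%s\n' "$1"
  printf 'BUILD_TYPE=%s\n' "$2"
  printf 'CUDA_MAJOR_VERSION=%s\n' "$3"
  printf 'CUDA_MINOR_VERSION=%s\n' "$4"
  printf 'UBUNTU_VERSION=%s\n' "$5"
  # Falls back to "false" when the sixth argument is unset or empty,
  # standing in for `matrix.skip-drivers || 'false'`.
  printf 'SKIP_DRIVERS=%s\n' "${6:-false}"
}

# The plain amd64 entry does not set skip-drivers; the l4t entry does.
cpu_args="$(print_build_args 'ubuntu:24.04' '' '' '' '2404')"
l4t_args="$(print_build_args 'nvcr.io/nvidia/l4t-jetpack:r36.4.0' 'l4t' '12' '0' '2204' 'true')"

printf '%s\n' "$cpu_args"
printf '%s\n' "$l4t_args"
```

Each printed line could be passed to a local `docker build` as a `--build-arg` value, with `backend/Dockerfile.base-grpc-builder` as the Dockerfile.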
--- a/.github/workflows/build-test.yaml
+++ b/.github/workflows/build-test.yaml
@@ -1,69 +0,0 @@
-name: Build test
-
-on:
-  push:
-    branches:
-      - master
-  pull_request:
-
-jobs:
-  build-test:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v7
-        with:
-          fetch-depth: 0
-      - name: Set up Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: 1.25
-      - name: Run GoReleaser
-        run: |
-          make dev-dist
-  launcher-build-darwin:
-    runs-on: macos-latest
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v7
-        with:
-          fetch-depth: 0
-      - name: Set up Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: 1.25
-      - name: Build launcher for macOS ARM64
-        run: |
-          make build-launcher-darwin
-          ls -liah dist
-      - name: Upload macOS launcher artifacts
-        uses: actions/upload-artifact@v7
-        with:
-          name: launcher-macos
-          path: dist/
-          retention-days: 30
-      
-  launcher-build-linux:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v7
-        with:
-          fetch-depth: 0
-      - name: Configure apt mirror on runner
-        uses: ./.github/actions/configure-apt-mirror
-      - name: Set up Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: 1.25
-      - name: Build launcher for Linux
-        run: |
-          sudo apt-get update
-          sudo apt-get install golang gcc libgl1-mesa-dev xorg-dev libxkbcommon-dev
-          make build-launcher-linux
-      - name: Upload Linux launcher artifacts
-        uses: actions/upload-artifact@v7
-        with:
-          name: launcher-linux
-          path: local-ai-launcher-linux.tar.xz
-          retention-days: 30
--- a/.github/workflows/bump-inference-defaults.yml
+++ b/.github/workflows/bump-inference-defaults.yml
@@ -1,48 +0,0 @@
-name: Bump inference defaults
-
-on:
-  schedule:
-    # Run daily at 06:00 UTC
-    - cron: '0 6 * * *'
-  workflow_dispatch: # Allow manual trigger
-
-permissions:
-  contents: write
-  pull-requests: write
-
-jobs:
-  bump:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v7
-
-      - uses: actions/setup-go@v5
-        with:
-          go-version-file: go.mod
-
-      - name: Re-fetch inference defaults
-        run: make generate-force
-
-      - name: Check for changes
-        id: diff
-        run: |
-          if git diff --quiet core/config/inference_defaults.json; then
-            echo "changed=false" >> "$GITHUB_OUTPUT"
-          else
-            echo "changed=true" >> "$GITHUB_OUTPUT"
-          fi
-
-      - name: Create Pull Request
-        if: steps.diff.outputs.changed == 'true'
-        uses: peter-evans/create-pull-request@v8
-        with:
-          commit-message: "chore: bump inference defaults from unsloth"
-          title: "chore: bump inference defaults from unsloth"
-          body: |
-            Auto-generated update of `core/config/inference_defaults.json` from
-            [unsloth's inference_defaults.json](https://github.com/unslothai/unsloth/blob/main/studio/backend/assets/configs/inference_defaults.json).
-
-            This PR was created automatically by the `bump-inference-defaults` workflow.
-          branch: chore/bump-inference-defaults
-          delete-branch: true
-          labels: automated
--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -1,200 +1,63 @@
-name: Bump Backend dependencies
+name: Bump dependencies
 on:
  schedule:
    - cron: 0 20 * * *
  workflow_dispatch:
 jobs:
-  bump-backends:
-    if: github.repository == 'mudler/LocalAI'
+  bump:
    strategy:
      fail-fast: false
      matrix:
        include:
-          - repository: "ggml-org/llama.cpp"
-            variable: "LLAMA_VERSION"
+          - repository: "go-skynet/go-llama.cpp"
+            variable: "GOLLAMA_VERSION"
            branch: "master"
-            file: "backend/cpp/llama-cpp/Makefile"
-          - repository: "ikawrakow/ik_llama.cpp"
-            variable: "IK_LLAMA_VERSION"
-            branch: "main"
-            file: "backend/cpp/ik-llama-cpp/Makefile"
-          - repository: "TheTom/llama-cpp-turboquant"
-            variable: "TURBOQUANT_VERSION"
-            branch: "feature/turboquant-kv-cache"
-            file: "backend/cpp/turboquant/Makefile"
-          - repository: "antirez/ds4"
-            variable: "DS4_VERSION"
-            branch: "main"
-            file: "backend/cpp/ds4/Makefile"
-          - repository: "localai-org/privacy-filter.cpp"
-            variable: "PRIVACY_FILTER_VERSION"
+          - repository: "ggerganov/llama.cpp"
+            variable: "CPPLLAMA_VERSION"
            branch: "master"
-            file: "backend/cpp/privacy-filter/Makefile"
-          - repository: "ggml-org/whisper.cpp"
+          - repository: "go-skynet/go-ggml-transformers.cpp"
+            variable: "GOGGMLTRANSFORMERS_VERSION"
+            branch: "master"
+          - repository: "donomii/go-rwkv.cpp"
+            variable: "RWKV_VERSION"
+            branch: "main"
+          - repository: "ggerganov/whisper.cpp"
            variable: "WHISPER_CPP_VERSION"
            branch: "master"
-            file: "backend/go/whisper/Makefile"
-          - repository: "CrispStrobe/CrispASR"
-            variable: "CRISPASR_VERSION"
+          - repository: "go-skynet/go-bert.cpp"
+            variable: "BERT_VERSION"
+            branch: "master"
+          - repository: "go-skynet/bloomz.cpp"
+            variable: "BLOOMZ_VERSION"
            branch: "main"
-            file: "backend/go/crispasr/Makefile"
-          - repository: "mudler/parakeet.cpp"
-            variable: "PARAKEET_VERSION"
+          - repository: "nomic-ai/gpt4all"
+            variable: "GPT4ALL_VERSION"
+            branch: "main"
+          - repository: "mudler/go-ggllm.cpp"
+            variable: "GOGGLLM_VERSION"
            branch: "master"
-            file: "backend/go/parakeet-cpp/Makefile"
-          - repository: "mudler/ced.cpp"
-            variable: "CED_VERSION"
+          - repository: "mudler/go-stable-diffusion"
+            variable: "STABLEDIFFUSION_VERSION"
            branch: "master"
-            file: "backend/go/ced/Makefile"
-          - repository: "mudler/voice-detect.cpp"
-            variable: "VOICEDETECT_VERSION"
-            branch: "master"
-            file: "backend/go/voice-detect/Makefile"
-          - repository: "mudler/face-detect.cpp"
-            variable: "FACEDETECT_VERSION"
-            branch: "master"
-            file: "backend/go/face-detect/Makefile"
-          - repository: "mudler/depth-anything.cpp"
-            variable: "DEPTHANYTHING_VERSION"
-            branch: "master"
-            file: "backend/go/depth-anything-cpp/Makefile"
-          - repository: "leejet/stable-diffusion.cpp"
-            variable: "STABLEDIFFUSION_GGML_VERSION"
-            branch: "master"
-            file: "backend/go/stablediffusion-ggml/Makefile"
          - repository: "mudler/go-piper"
            variable: "PIPER_VERSION"
            branch: "master"
-            file: "backend/go/piper/Makefile"
-          - repository: "antirez/voxtral.c"
-            variable: "VOXTRAL_VERSION"
-            branch: "main"
-            file: "backend/go/voxtral/Makefile"
-          - repository: "ace-step/acestep.cpp"
-            variable: "ACESTEP_CPP_VERSION"
-            branch: "master"
-            file: "backend/go/acestep-cpp/Makefile"
-          - repository: "PABannier/sam3.cpp"
-            variable: "SAM3_VERSION"
-            branch: "main"
-            file: "backend/go/sam3-cpp/Makefile"
-          - repository: "mudler/rf-detr.cpp"
-            variable: "RFDETR_VERSION"
-            branch: "main"
-            file: "backend/go/rfdetr-cpp/Makefile"
-          - repository: "mudler/locate-anything.cpp"
-            variable: "LOCATEANYTHING_VERSION"
-            branch: "master"
-            file: "backend/go/locate-anything-cpp/Makefile"
-          - repository: "ServeurpersoCom/qwentts.cpp"
-            variable: "QWEN3TTS_CPP_VERSION"
-            branch: "master"
-            file: "backend/go/qwen3-tts-cpp/Makefile"
-          - repository: "ServeurpersoCom/omnivoice.cpp"
-            variable: "OMNIVOICE_VERSION"
-            branch: "master"
-            file: "backend/go/omnivoice-cpp/Makefile"
-          - repository: "localai-org/vibevoice.cpp"
-            variable: "VIBEVOICE_CPP_VERSION"
-            branch: "master"
-            file: "backend/go/vibevoice-cpp/Makefile"
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v4
      - name: Bump dependencies 🔧
-        id: bump
        run: |
-          bash .github/bump_deps.sh ${{ matrix.repository }} ${{ matrix.branch }} ${{ matrix.variable }} ${{ matrix.file }}
-          {
-            echo 'message<<EOF'
-            cat "${{ matrix.variable }}_message.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          {
-            echo 'commit<<EOF'
-            cat "${{ matrix.variable }}_commit.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          rm -rfv ${{ matrix.variable }}_message.txt
-          rm -rfv ${{ matrix.variable }}_commit.txt
+          bash .github/bump_deps.sh ${{ matrix.repository }} ${{ matrix.branch }} ${{ matrix.variable }}
      - name: Create Pull Request
-        uses: peter-evans/create-pull-request@v8
+        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ secrets.UPDATE_BOT_TOKEN }}
          push-to-fork: ci-forks/LocalAI
          commit-message: ':arrow_up: Update ${{ matrix.repository }}'
-          title: 'chore: :arrow_up: Update ${{ matrix.repository }} to `${{ steps.bump.outputs.commit }}`'
+          title: ':arrow_up: Update ${{ matrix.repository }}'
          branch: "update/${{ matrix.variable }}"
-          body: ${{ steps.bump.outputs.message }}
+          body: Bump of ${{ matrix.repository }} version
          signoff: true

-  bump-vllm-wheel:
-    # vLLM's cu130 wheel comes from a per-tag index URL (no /latest/ alias),
-    # so the cublas13 requirements file pins both a URL segment and a version
-    # constraint. bump_deps.sh handles git-sha-in-Makefile only — this job
-    # rewrites both values atomically when a new vLLM stable tag ships.
-    if: github.repository == 'mudler/LocalAI'
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v7
-      - name: Bump vLLM cu130 wheel pin 🔧
-        id: bump
-        run: |
-          bash .github/bump_vllm_wheel.sh vllm-project/vllm backend/python/vllm/requirements-cublas13-after.txt VLLM_VERSION
-          {
-            echo 'message<<EOF'
-            cat "VLLM_VERSION_message.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          {
-            echo 'commit<<EOF'
-            cat "VLLM_VERSION_commit.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          rm -rfv VLLM_VERSION_message.txt VLLM_VERSION_commit.txt
-      - name: Create Pull Request
-        uses: peter-evans/create-pull-request@v8
-        with:
-          token: ${{ secrets.UPDATE_BOT_TOKEN }}
-          push-to-fork: ci-forks/LocalAI
-          commit-message: ':arrow_up: Update vllm-project/vllm cu130 wheel'
-          title: 'chore: :arrow_up: Update vllm-project/vllm cu130 wheel to `${{ steps.bump.outputs.commit }}`'
-          branch: "update/VLLM_VERSION"
-          body: ${{ steps.bump.outputs.message }}
-          signoff: true

-  bump-vllm-metal:
-    # The darwin (Apple Silicon) vLLM build installs vllm-metal, which is locked
-    # to a specific vLLM source release. install.sh pins both VLLM_METAL_VERSION
-    # (the wheel release) and VLLM_VERSION (the vLLM it builds against); this job
-    # tracks vllm-project/vllm-metal and rewrites both atomically. Separate from
-    # bump-vllm-wheel because darwin follows vllm-metal, not vllm/vllm latest.
-    if: github.repository == 'mudler/LocalAI'
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v7
-      - name: Bump vllm-metal pin 🔧
-        id: bump
-        run: |
-          bash .github/bump_vllm_metal.sh vllm-project/vllm-metal backend/python/vllm/install.sh VLLM_METAL_VERSION
-          {
-            echo 'message<<EOF'
-            cat "VLLM_METAL_VERSION_message.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          {
-            echo 'commit<<EOF'
-            cat "VLLM_METAL_VERSION_commit.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          rm -rfv VLLM_METAL_VERSION_message.txt VLLM_METAL_VERSION_commit.txt
-      - name: Create Pull Request
-        uses: peter-evans/create-pull-request@v8
-        with:
-          token: ${{ secrets.UPDATE_BOT_TOKEN }}
-          push-to-fork: ci-forks/LocalAI
-          commit-message: ':arrow_up: Update vllm-project/vllm-metal (darwin)'
-          title: 'chore: :arrow_up: Update vllm-metal (darwin) to `${{ steps.bump.outputs.commit }}`'
-          branch: "update/VLLM_METAL_VERSION"
-          body: ${{ steps.bump.outputs.message }}
-          signoff: true
+
--- a/.github/workflows/bump_docs.yaml
+++ b/.github/workflows/bump_docs.yaml
@@ -1,11 +1,10 @@
-name: Bump Documentation
+name: Bump dependencies
 on:
  schedule:
    - cron: 0 20 * * *
  workflow_dispatch:
 jobs:
-  bump-docs:
-    if: github.repository == 'mudler/LocalAI'
+  bump:
    strategy:
      fail-fast: false
      matrix:
@@ -13,17 +12,17 @@ jobs:
          - repository: "mudler/LocalAI"
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
+      - uses: actions/checkout@v4
      - name: Bump dependencies 🔧
        run: |
          bash .github/bump_docs.sh ${{ matrix.repository }}
      - name: Create Pull Request
-        uses: peter-evans/create-pull-request@v8
+        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ secrets.UPDATE_BOT_TOKEN }}
          push-to-fork: ci-forks/LocalAI
          commit-message: ':arrow_up: Update docs version ${{ matrix.repository }}'
-          title: 'docs: :arrow_up: update docs version ${{ matrix.repository }}'
+          title: ':arrow_up: Update docs version ${{ matrix.repository }}'
          branch: "update/docs"
          body: Bump of ${{ matrix.repository }} version inside docs
          signoff: true
--- a/.github/workflows/checksum_checker.yaml
+++ b/.github/workflows/checksum_checker.yaml
@@ -1,41 +0,0 @@
-name: Check if checksums are up-to-date
-on:
-  schedule:
-    - cron: 0 20 * * *
-  workflow_dispatch:
-jobs:
-  checksum_check:
-    if: github.repository == 'mudler/LocalAI'
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v7
-      - name: Configure apt mirror on runner
-        uses: ./.github/actions/configure-apt-mirror
-      - name: Install dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y pip wget
-          pip install huggingface_hub
-      - name: 'Setup yq'
-        uses: dcarbone/install-yq-action@v1.3.1
-        with:
-          version: 'v4.44.2'
-          download-compressed: true
-          force: true
-
-      - name: Checksum checker 🔧
-        run: |
-          export HF_HOME=/hf_cache
-          sudo mkdir /hf_cache
-          sudo chmod 777 /hf_cache
-          bash .github/checksum_checker.sh gallery/index.yaml
-      - name: Create Pull Request
-        uses: peter-evans/create-pull-request@v8
-        with:
-          token: ${{ secrets.UPDATE_BOT_TOKEN }}
-          push-to-fork: ci-forks/LocalAI
-          commit-message: ':arrow_up: Checksum updates in gallery/index.yaml'
-          title: 'chore(model-gallery): :arrow_up: update checksum'
-          branch: "update/checksum"
-          body: Updating checksums in gallery/index.yaml
-          signoff: true
--- a/.github/workflows/disabled/dependabot_auto.yml
+++ b/.github/workflows/disabled/dependabot_auto.yml
@@ -9,18 +9,18 @@ permissions:

 jobs:
  dependabot:
-    if: github.repository == 'mudler/LocalAI' && github.actor == 'dependabot[bot]'
    runs-on: ubuntu-latest
+    if: ${{ github.actor == 'dependabot[bot]' }}
    steps:
      - name: Dependabot metadata
        id: metadata
-        uses: dependabot/fetch-metadata@v2.5.0
+        uses: dependabot/fetch-metadata@v2.0.0
        with:
          github-token: "${{ secrets.GITHUB_TOKEN }}"
          skip-commit-verification: true

      - name: Checkout repository
-        uses: actions/checkout@v6
+        uses: actions/checkout@v4

      - name: Approve a PR if not already approved
        run: |
--- a/.github/workflows/deploy-explorer.yaml
+++ b/.github/workflows/deploy-explorer.yaml
@@ -1,65 +0,0 @@
-name: Explorer deployment
-
-on:
-  push:
-    branches:
-      - master
-    tags:
-      - 'v*'
-
-concurrency:
-  group: ci-deploy-${{ github.head_ref || github.ref }}-${{ github.repository }}
-
-jobs:
-  build-linux:
-    if: github.repository == 'mudler/LocalAI'
-    runs-on: ubuntu-latest
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-        with:
-          submodules: true
-      - uses: actions/setup-go@v5
-        with:
-          go-version: '1.21.x'
-          cache: false
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y wget curl build-essential ffmpeg protobuf-compiler ccache upx-ucl gawk cmake libgmock-dev
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          make protogen-go
-      - name: Build api
-        run: |
-          CGO_ENABLED=0 make build
-      - name: rm
-        uses: appleboy/ssh-action@v1.2.5
-        with:
-            host: ${{ secrets.EXPLORER_SSH_HOST }}
-            username: ${{ secrets.EXPLORER_SSH_USERNAME }}
-            key: ${{ secrets.EXPLORER_SSH_KEY }}
-            port: ${{ secrets.EXPLORER_SSH_PORT }}
-            script: |
-                sudo rm -rf local-ai/ || true
-      - name: copy file via ssh
-        uses: appleboy/scp-action@v1.0.0
-        with:
-            host: ${{ secrets.EXPLORER_SSH_HOST }}
-            username: ${{ secrets.EXPLORER_SSH_USERNAME }}
-            key: ${{ secrets.EXPLORER_SSH_KEY }}
-            port: ${{ secrets.EXPLORER_SSH_PORT }}
-            source: "local-ai"
-            overwrite: true
-            rm: true
-            target: ./local-ai
-      - name: restarting
-        uses: appleboy/ssh-action@v1.2.5
-        with:
-            host: ${{ secrets.EXPLORER_SSH_HOST }}
-            username: ${{ secrets.EXPLORER_SSH_USERNAME }}
-            key: ${{ secrets.EXPLORER_SSH_KEY }}
-            port: ${{ secrets.EXPLORER_SSH_PORT }}
-            script: |
-                sudo cp -rfv local-ai/local-ai /usr/bin/local-ai
-                sudo systemctl restart local-ai
--- a/.github/workflows/disabled/comment-pr.yaml
+++ b/.github/workflows/disabled/comment-pr.yaml
@@ -1,83 +0,0 @@
-name: Comment PRs
-on:
-  pull_request_target:
-
-jobs:
-  comment-pr:
-    env:
-        MODEL_NAME: hermes-2-theta-llama-3-8b
-    runs-on: ubuntu-latest
-    steps:
-    - name: Checkout code
-      uses: actions/checkout@v3
-      with:
-        ref: "${{ github.event.pull_request.merge_commit_sha }}"
-        fetch-depth: 0 # needed to checkout all branches for this Action to work
-    - uses: mudler/localai-github-action@v1
-      with:
-        model: 'hermes-2-theta-llama-3-8b' # Any from models.localai.io, or from huggingface.co with: "huggingface://<repository>/file"
-      # Check the PR diff using the current branch and the base branch of the PR
-    - uses: GrantBirki/git-diff-action@v2.7.0
-      id: git-diff-action
-      with:
-            json_diff_file_output: diff.json
-            raw_diff_file_output: diff.txt
-            file_output_only: "true"
-            base_branch: ${{ github.event.pull_request.base.sha }}
-    - name: Show diff
-      env:
-        DIFF: ${{ steps.git-diff-action.outputs.raw-diff-path }}
-      run: |
-            cat $DIFF
-    - name: Summarize
-      env:
-        DIFF: ${{ steps.git-diff-action.outputs.raw-diff-path }}
-      id: summarize
-      run: |
-            input="$(cat $DIFF)"
-
-            # Define the LocalAI API endpoint
-            API_URL="http://localhost:8080/chat/completions"
-
-            # Create a JSON payload using jq to handle special characters
-            json_payload=$(jq -n --arg input "$input" '{
-            model: "'$MODEL_NAME'",
-            messages: [
-                {
-                role: "system",
-                content: "You are LocalAI-bot in Github that helps understanding PRs and assess complexity. Explain what has changed in this PR diff and why"
-                },
-                {
-                role: "user",
-                content: $input
-                }
-            ]
-            }')
-
-            # Send the request to LocalAI
-            response=$(curl -s -X POST $API_URL \
-            -H "Content-Type: application/json" \
-            -d "$json_payload")
-
-            # Extract the summary from the response
-            summary="$(echo $response | jq -r '.choices[0].message.content')"
-
-            # Print the summary
-            #  -H "Authorization: Bearer $API_KEY" \
-            echo "Summary:"
-            echo "$summary"
-            echo "payload sent"
-            echo "$json_payload"
-            {
-                echo 'message<<EOF'
-                echo "$summary"
-                echo EOF
-              } >> "$GITHUB_OUTPUT"
-            docker logs --tail 10 local-ai
-    - uses: mshick/add-pr-comment@v2
-      if: always()
-      with:
-          repo-token: ${{ secrets.UPDATE_BOT_TOKEN }}
-          message: ${{ steps.summarize.outputs.message }}
-          message-failure: |
-            Uh oh! Could not analyze this PR, maybe it's too big?
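The `Summarize` step in the removed `comment-pr.yaml` above passes a multi-line value to later steps by appending a `name<<DELIMITER` block to `$GITHUB_OUTPUT`. Below is a minimal sketch of that pattern as a reusable shell function. The name `set_multiline_output` is hypothetical, and the per-process delimiter is an assumption added here: the workflow uses a fixed `EOF`, which would end the block early if the model's summary ever contained a line reading `EOF`.

```shell
#!/bin/sh
set -eu

# Hypothetical helper (not part of the workflow): append a multi-line value
# to the file named by $GITHUB_OUTPUT using the name<<DELIMITER form.
# $1 = output name, $2 = value (may span several lines).
set_multiline_output() {
  # Assumption: a delimiter derived from the process ID instead of the fixed
  # "EOF", so a value containing a bare "EOF" line cannot close the block.
  delimiter="ghadelim_$$"
  {
    printf '%s<<%s\n' "$1" "$delimiter"
    printf '%s\n' "$2"
    printf '%s\n' "$delimiter"
  } >> "$GITHUB_OUTPUT"
}

# For a local run, point GITHUB_OUTPUT at a scratch file; inside a workflow
# step the runner provides this variable.
GITHUB_OUTPUT="$(mktemp)"

set_multiline_output message "$(printf 'line one\nEOF\nline three')"
cat "$GITHUB_OUTPUT"
```

A later step would read the value as `steps.<id>.outputs.message`, in the same way the `add-pr-comment` step above does.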
--- a/.github/workflows/disabled/notify-models.yaml
+++ b/.github/workflows/disabled/notify-models.yaml
@@ -1,174 +0,0 @@
-name: Notifications for new models
-on:
-  pull_request_target:
-     types:
-       - closed
-
-permissions:
-  contents: read
-  pull-requests: read
-
-jobs:
-  notify-discord:
-    if: github.repository == 'mudler/LocalAI' && (github.event.pull_request.merged == true) && (contains(github.event.pull_request.labels.*.name, 'area/ai-model'))
-    env:
-        MODEL_NAME: gemma-3-12b-it-qat
-    runs-on: ubuntu-latest
-    steps:
-    - uses: actions/checkout@v6
-      with:
-        fetch-depth: 0 # needed to checkout all branches for this Action to work
-        ref: ${{ github.event.pull_request.head.sha }} # Checkout the PR head to get the actual changes
-    - uses: mudler/localai-github-action@v1
-      with:
-        model: 'gemma-3-12b-it-qat' # Any from models.localai.io, or from huggingface.co with: "huggingface://<repository>/file"
-        # Check the PR diff using the current branch and the base branch of the PR
-    - uses: GrantBirki/git-diff-action@v2.8.1
-      id: git-diff-action
-      with:
-            json_diff_file_output: diff.json
-            raw_diff_file_output: diff.txt
-            file_output_only: "true"
-    - name: Summarize
-      env:
-        DIFF: ${{ steps.git-diff-action.outputs.raw-diff-path }}
-      id: summarize
-      run: |
-            input="$(cat $DIFF)"
-
-            # Define the LocalAI API endpoint
-            API_URL="http://localhost:8080/chat/completions"
-
-            # Create a JSON payload using jq to handle special characters
-            json_payload=$(jq -n --arg input "$input" '{
-            model: "'$MODEL_NAME'",
-            messages: [
-                {
-                role: "system",
-                content: "You are LocalAI-bot. Write a discord message to notify everyone about the new model from the git diff. Make it informal. An example can include: the URL of the model, the name, and a brief description of the model if exists. Also add an hint on how to install it in LocalAI and that can be browsed over https://models.localai.io. For example: local-ai run model_name_here"
-                },
-                {
-                role: "user",
-                content: $input
-                }
-            ]
-            }')
-
-            # Send the request to LocalAI
-            response=$(curl -s -X POST $API_URL \
-            -H "Content-Type: application/json" \
-            -d "$json_payload")
-
-            # Extract the summary from the response
-            summary="$(echo $response | jq -r '.choices[0].message.content')"
-
-            # Print the summary
-            #  -H "Authorization: Bearer $API_KEY" \
-            echo "Summary:"
-            echo "$summary"
-            echo "payload sent"
-            echo "$json_payload"
-            {
-                echo 'message<<EOF'
-                echo "$summary"
-                echo EOF
-              } >> "$GITHUB_OUTPUT"
-            docker logs --tail 10 local-ai
-    - name: Discord notification
-      env:
-        DISCORD_WEBHOOK: ${{ secrets.DISCORD_WEBHOOK_URL }}
-        DISCORD_USERNAME: "LocalAI-Bot"
-        DISCORD_AVATAR: "https://avatars.githubusercontent.com/u/139863280?v=4"
-      uses: Ilshidur/action-discord@master
-      with:
-        args: ${{ steps.summarize.outputs.message }}
-    - name: Setup tmate session if fails
-      if: ${{ failure() }}
-      uses: mxschmitt/action-tmate@v3.23
-      with:
-        detached: true
-        connect-timeout-seconds: 180
-        limit-access-to-actor: true
-  notify-twitter:
-    if: github.repository == 'mudler/LocalAI' && (github.event.pull_request.merged == true) && (contains(github.event.pull_request.labels.*.name, 'area/ai-model'))
-    env:
-        MODEL_NAME: gemma-3-12b-it-qat
-    runs-on: ubuntu-latest
-    steps:
-    - uses: actions/checkout@v6
-      with:
-        fetch-depth: 0 # needed to checkout all branches for this Action to work
-        ref: ${{ github.event.pull_request.head.sha }} # Checkout the PR head to get the actual changes
-    - name: Start LocalAI
-      run: |
-        echo "Starting LocalAI..."
-        docker run -d --name local-ai -p 8080:8080 localai/localai:master run --debug $MODEL_NAME
-        until [ "`docker inspect -f {{.State.Health.Status}} local-ai`" == "healthy" ]; do echo "Waiting for container to be ready";  docker logs --tail 10 local-ai; sleep 2; done
-      # Check the PR diff using the current branch and the base branch of the PR
-    - uses: GrantBirki/git-diff-action@v2.8.1
-      id: git-diff-action
-      with:
-            json_diff_file_output: diff.json
-            raw_diff_file_output: diff.txt
-            file_output_only: "true"
-    - name: Summarize
-      env:
-        DIFF: ${{ steps.git-diff-action.outputs.raw-diff-path }}
-      id: summarize
-      run: |
-            input="$(cat $DIFF)"
-
-            # Define the LocalAI API endpoint
-            API_URL="http://localhost:8080/chat/completions"
-
-            # Create a JSON payload using jq to handle special characters
-            json_payload=$(jq -n --arg input "$input" --arg model "$MODEL_NAME" '{
-            model: $model,
-            messages: [
-                {
-                role: "system",
-                content: "You are LocalAI-bot. Write a Twitter message to notify everyone about the new model from the git diff. Make it informal and really short. An example can include: the name, and a brief description of the model if one exists. Also add a hint on how to install it in LocalAI. For example: local-ai run model_name_here"
-                },
-                {
-                role: "user",
-                content: $input
-                }
-            ]
-            }')
-
-            # Send the request to LocalAI
-            response=$(curl -s -X POST "$API_URL" \
-            -H "Content-Type: application/json" \
-            -d "$json_payload")
-
-            # Extract the summary from the response
-            summary="$(echo "$response" | jq -r '.choices[0].message.content')"
-
-            # Print the summary
-            #  -H "Authorization: Bearer $API_KEY" \
-            echo "Summary:"
-            echo "$summary"
-            echo "payload sent"
-            echo "$json_payload"
-            {
-                echo 'message<<EOF'
-                echo "$summary"
-                echo EOF
-              } >> "$GITHUB_OUTPUT"
-            docker logs --tail 10 local-ai
-    - uses: Eomm/why-don-t-you-tweet@v2
-      with:
-        tweet-message: ${{ steps.summarize.outputs.message }}
-      env:
-        # Get your tokens from https://developer.twitter.com/apps
-        TWITTER_CONSUMER_API_KEY: ${{ secrets.TWITTER_APP_KEY }}
-        TWITTER_CONSUMER_API_SECRET: ${{ secrets.TWITTER_APP_SECRET }}
-        TWITTER_ACCESS_TOKEN: ${{ secrets.TWITTER_ACCESS_TOKEN }}
-        TWITTER_ACCESS_TOKEN_SECRET: ${{ secrets.TWITTER_ACCESS_TOKEN_SECRET }}
-    - name: Setup tmate session if fails
-      if: ${{ failure() }}
-      uses: mxschmitt/action-tmate@v3.23
-      with:
-        detached: true
-        connect-timeout-seconds: 180
-        limit-access-to-actor: true
--- a/.github/workflows/disabled/prlint.yaml
+++ b/.github/workflows/disabled/prlint.yaml
@@ -1,28 +0,0 @@
-name: Check PR style
-
-on:
-  pull_request_target:
-    types:
-      - opened
-      - reopened
-      - edited
-      - synchronize
-
-jobs:
-  title-lint:
-    runs-on: ubuntu-latest
-    permissions:
-      statuses: write
-    steps:
-      - uses: aslafy-z/conventional-pr-title-action@v3
-        env:
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-#  check-pr-description:
-#    runs-on: ubuntu-latest
-#    steps:
-#      - uses: actions/checkout@v2
-#      - uses: jadrol/pr-description-checker-action@v1.0.0
-#        id: description-checker
-#        with:
-#          repo-token: ${{ secrets.GITHUB_TOKEN }}
-#          exempt-labels: no qa
--- a/.github/workflows/gallery-agent.yaml
+++ b/.github/workflows/gallery-agent.yaml
@@ -1,214 +0,0 @@
-name: Gallery Agent
-on:
-
-  schedule:
-    - cron: '0 */12 * * *'  # Run every 12 hours
-  workflow_dispatch:
-    inputs:
-      search_term:
-        description: 'Search term for models'
-        required: false
-        default: 'GGUF'
-        type: string
-      limit:
-        description: 'Maximum number of models to process'
-        required: false
-        default: '15'
-        type: string
-      quantization:
-        description: 'Preferred quantization format'
-        required: false
-        default: 'Q4_K_M'
-        type: string
-      max_models:
-        description: 'Maximum number of models to add to the gallery'
-        required: false
-        default: '1'
-        type: string
-jobs:
-  gallery-agent:
-    if: github.repository == 'mudler/LocalAI'
-    runs-on: ubuntu-latest
-    steps:
-      - name: Checkout repository
-        uses: actions/checkout@v7
-        with:
-          token: ${{ secrets.GITHUB_TOKEN }}
-
-      - name: Set up Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: '1.21'
-      - name: Proto Dependencies
-        run: |
-          # Install protoc
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Process gallery-agent PR commands
-        env:
-          GH_TOKEN: ${{ secrets.UPDATE_BOT_TOKEN }}
-          REPO: ${{ github.repository }}
-          SEARCH: 'gallery agent in:title'
-        run: |
-          # Walk gallery-agent PRs and act on maintainer comments:
-          #   /gallery-agent blacklist → label `gallery-agent/blacklisted` + close (never repropose)
-          #   /gallery-agent recreate  → close without label (next run may repropose)
-          # Only comments from OWNER / MEMBER / COLLABORATOR are honored so
-          # random users can't drive the bot.
-          #
-          # We scan both open PRs AND recently-closed PRs that don't already
-          # carry the blacklist label. This covers the common flow where a
-          # maintainer writes /gallery-agent blacklist and immediately clicks
-          # Close — without this, the next scheduled run wouldn't see the
-          # command (PR is already closed) and would repropose the model.
-          gh label create gallery-agent/blacklisted \
-            --repo "$REPO" --color ededed \
-            --description "gallery-agent must not repropose this model" 2>/dev/null || true
-
-          prs_open=$(gh pr list --repo "$REPO" --state open --search "$SEARCH" \
-            --json number --jq '.[].number')
-          # Closed PRs from the last 14 days that don't yet have the blacklist label.
-          # Bounded window keeps the scan cheap while covering late-applied commands.
-          since=$(date -u -d '14 days ago' +%Y-%m-%d)
-          prs_closed=$(gh pr list --repo "$REPO" --state closed \
-            --search "$SEARCH closed:>=$since -label:gallery-agent/blacklisted" \
-            --json number --jq '.[].number')
-          prs=$(printf '%s\n%s\n' "$prs_open" "$prs_closed" | sort -u | sed '/^$/d')
-          for pr in $prs; do
-            state=$(gh pr view "$pr" --repo "$REPO" --json state --jq '.state')
-            cmds=$(gh pr view "$pr" --repo "$REPO" --json comments \
-              --jq '.comments[] | select(.authorAssociation=="OWNER" or .authorAssociation=="MEMBER" or .authorAssociation=="COLLABORATOR") | .body')
-            if echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+blacklist([[:space:]]|$)'; then
-              echo "PR #$pr: blacklist command found (state=$state)"
-              gh pr edit "$pr" --repo "$REPO" --add-label gallery-agent/blacklisted || true
-              if [ "$state" = "OPEN" ]; then
-                gh pr close "$pr" --repo "$REPO" --comment "Blacklisted via \`/gallery-agent blacklist\`. This model will not be reproposed." || true
-              fi
-            elif [ "$state" = "OPEN" ] && echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+recreate([[:space:]]|$)'; then
-              echo "PR #$pr: recreate command found"
-              gh pr close "$pr" --repo "$REPO" --comment "Closed via \`/gallery-agent recreate\`. The next scheduled run will propose this model again." || true
-            fi
-          done
-
-      - name: Collect skip URLs for the gallery agent
-        id: open_prs
-        env:
-          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          REPO: ${{ github.repository }}
-          SEARCH: 'gallery agent in:title'
-        run: |
-          # Skip set =
-          #   URLs from any open gallery-agent PR (avoid duplicate PRs for the same model while one is pending)
-          # + URLs from closed PRs carrying the `gallery-agent/blacklisted` label (hard blacklist)
-          # Plain-closed PRs without the label are ignored — closing a PR is
-          # not by itself a "never propose again" signal; maintainers must
-          # opt in via the /gallery-agent blacklist comment command.
-          urls_open=$(gh pr list --repo "$REPO" --state open --search "$SEARCH" \
-            --json body --jq '[.[].body] | join("\n")' \
-            | grep -oE 'https://huggingface\.co/[^ )]+' || true)
-          urls_blacklist=$(gh pr list --repo "$REPO" --state closed --search "$SEARCH" \
-            --label gallery-agent/blacklisted \
-            --json body --jq '[.[].body] | join("\n")' \
-            | grep -oE 'https://huggingface\.co/[^ )]+' || true)
-          urls=$(printf '%s\n%s\n' "$urls_open" "$urls_blacklist" | sort -u | sed '/^$/d')
-          echo "Skip URLs:"
-          echo "$urls"
-          {
-            echo "urls<<EOF"
-            echo "$urls"
-            echo "EOF"
-          } >> "$GITHUB_OUTPUT"
-
-      - name: Run gallery agent
-        env:
-          SEARCH_TERM: ${{ github.event.inputs.search_term || 'GGUF' }}
-          LIMIT: ${{ github.event.inputs.limit || '15' }}
-          QUANTIZATION: ${{ github.event.inputs.quantization || 'Q4_K_M' }}
-          MAX_MODELS: ${{ github.event.inputs.max_models || '1' }}
-          EXTRA_SKIP_URLS: ${{ steps.open_prs.outputs.urls }}
-        run: |
-          export GALLERY_INDEX_PATH=$PWD/gallery/index.yaml
-          go run ./.github/gallery-agent
-
-      - name: Check for changes
-        id: check_changes
-        run: |
-          if git diff --quiet gallery/index.yaml; then
-            echo "changes=false" >> $GITHUB_OUTPUT
-            echo "No changes detected in gallery/index.yaml"
-          else
-            echo "changes=true" >> $GITHUB_OUTPUT
-            echo "Changes detected in gallery/index.yaml"
-            git diff gallery/index.yaml
-          fi
-
-      - name: Read gallery agent summary
-        id: read_summary
-        if: steps.check_changes.outputs.changes == 'true'
-        run: |
-          if [ -f "./gallery-agent-summary.json" ]; then
-            echo "summary_exists=true" >> $GITHUB_OUTPUT
-            # Extract summary data using jq
-            echo "search_term=$(jq -r '.search_term' ./gallery-agent-summary.json)" >> $GITHUB_OUTPUT
-            echo "total_found=$(jq -r '.total_found' ./gallery-agent-summary.json)" >> $GITHUB_OUTPUT
-            echo "models_added=$(jq -r '.models_added' ./gallery-agent-summary.json)" >> $GITHUB_OUTPUT
-            echo "quantization=$(jq -r '.quantization' ./gallery-agent-summary.json)" >> $GITHUB_OUTPUT
-            echo "processing_time=$(jq -r '.processing_time' ./gallery-agent-summary.json)" >> $GITHUB_OUTPUT
-            
-            # Create a formatted list of added models with URLs
-            added_models=$(jq -r 'range(0; .added_model_ids | length) as $i | "- [\(.added_model_ids[$i])](\(.added_model_urls[$i]))"' ./gallery-agent-summary.json)
-            echo "added_models<<EOF" >> $GITHUB_OUTPUT
-            echo "$added_models" >> $GITHUB_OUTPUT
-            echo "EOF" >> $GITHUB_OUTPUT
-            rm -f ./gallery-agent-summary.json
-          else
-            echo "summary_exists=false" >> $GITHUB_OUTPUT
-          fi
-
-      - name: Create Pull Request
-        if: steps.check_changes.outputs.changes == 'true'
-        uses: peter-evans/create-pull-request@v8
-        with:
-          token: ${{ secrets.UPDATE_BOT_TOKEN }}
-          push-to-fork: ci-forks/LocalAI
-          commit-message: 'chore(model gallery): :robot: add new models via gallery agent'
-          title: 'chore(model gallery): :robot: add ${{ steps.read_summary.outputs.models_added || 0 }} new models via gallery agent'
-          # The branch name must be unique so that PRs do not overwrite each other
-          branch-suffix: timestamp
-          body: |
-            This PR was automatically created by the gallery agent workflow.
-            
-            **Summary:**
-            - **Search Term:** ${{ steps.read_summary.outputs.search_term || github.event.inputs.search_term || 'GGUF' }}
-            - **Models Found:** ${{ steps.read_summary.outputs.total_found || 'N/A' }}
-            - **Models Added:** ${{ steps.read_summary.outputs.models_added || '0' }}
-            - **Quantization:** ${{ steps.read_summary.outputs.quantization || github.event.inputs.quantization || 'Q4_K_M' }}
-            - **Processing Time:** ${{ steps.read_summary.outputs.processing_time || 'N/A' }}
-            
-            **Added Models:**
-            ${{ steps.read_summary.outputs.added_models || '- No models added' }}
-
-            ### Bot commands
-
-            Maintainers (owner / member / collaborator) can control this PR
-            by leaving a comment with one of:
-
-            - `/gallery-agent recreate` — close this PR; the next scheduled
-              run will propose this model again (useful if the entry needs
-              to be regenerated with fresh metadata).
-            - `/gallery-agent blacklist` — close this PR and permanently
-              prevent the gallery agent from ever reproposing this model.
-
-            Plain "Close" (without a command) is treated as a no-op: the
-            model may be reproposed by a future run.
-
-            **Workflow Details:**
-            - Triggered by: `${{ github.event_name }}`
-            - Run ID: `${{ github.run_id }}`
-            - Commit: `${{ github.sha }}`
-          signoff: true
-          delete-branch: true
--- a/.github/workflows/generate_grpc_cache.yaml
+++ b/.github/workflows/generate_grpc_cache.yaml
@@ -0,0 +1,90 @@
+name: 'generate and publish GRPC docker caches'
+
+on:
+- workflow_dispatch
+
+concurrency:
+  group: grpc-cache-${{ github.head_ref || github.ref }}-${{ github.repository }}
+  cancel-in-progress: true
+
+jobs:
+  generate_caches:
+    strategy:
+      matrix:
+        include:
+          - grpc-base-image: ubuntu:22.04
+            runs-on: 'ubuntu-latest'
+            platforms: 'linux/amd64'
+    runs-on: ${{matrix.runs-on}}
+    steps:
+      - name: Release space from worker
+        if: matrix.runs-on == 'ubuntu-latest'
+        run: |
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          df -h
+          echo
+          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
+          sudo apt-get remove --auto-remove android-sdk-platform-tools || true
+          sudo apt-get purge --auto-remove android-sdk-platform-tools || true
+          sudo rm -rf /usr/local/lib/android
+          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
+          sudo rm -rf /usr/share/dotnet
+          sudo apt-get remove -y '^mono-.*' || true
+          sudo apt-get remove -y '^ghc-.*' || true
+          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
+          sudo apt-get remove -y 'php.*' || true
+          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
+          sudo apt-get remove -y '^google-.*' || true
+          sudo apt-get remove -y azure-cli || true
+          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
+          sudo apt-get remove -y '^gfortran-.*' || true
+          sudo apt-get remove -y microsoft-edge-stable || true
+          sudo apt-get remove -y firefox || true
+          sudo apt-get remove -y powershell || true
+          sudo apt-get remove -y r-base-core || true
+          sudo apt-get autoremove -y
+          sudo apt-get clean
+          echo
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          sudo rm -rfv build || true
+          sudo rm -rf /usr/share/dotnet || true
+          sudo rm -rf /opt/ghc || true
+          sudo rm -rf "/usr/local/share/boost" || true
+          sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
+          df -h
+
+      - name: Set up QEMU
+        uses: docker/setup-qemu-action@master
+        with:
+          platforms: all
+
+      - name: Set up Docker Buildx
+        id: buildx
+        uses: docker/setup-buildx-action@master
+
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Cache GRPC
+        uses: docker/build-push-action@v5
+        with:
+          builder: ${{ steps.buildx.outputs.name }}
+          # The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
+          # This means that even the MAKEFLAGS have to be an EXACT match.
+          # If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
+          build-args: |
+            GRPC_BASE_IMAGE=${{ matrix.grpc-base-image }}
+            MAKEFLAGS=--jobs=4 --output-sync=target
+            GRPC_VERSION=v1.58.0
+          context: .
+          file: ./Dockerfile
+          cache-to: type=gha,ignore-error=true
+          target: grpc
+          platforms: ${{ matrix.platforms }}
+          push: false
--- a/.github/workflows/generate_intel_image.yaml
+++ b/.github/workflows/generate_intel_image.yaml
@@ -1,60 +0,0 @@
-name: 'generate and publish intel docker caches'
-
-on:
-  workflow_dispatch:
-  push:
-    branches:
-      - master
-
-concurrency:
-  group: intel-cache-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-
-jobs:
-  generate_caches:
-    if: github.repository == 'mudler/LocalAI'
-    strategy:
-      matrix:
-        include:
-          - base-image: intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04
-            runs-on: 'arc-runner-set'
-            platforms: 'linux/amd64'
-    runs-on: ${{matrix.runs-on}}
-    steps:
-      - name: Set up QEMU
-        uses: docker/setup-qemu-action@master
-        with:
-          platforms: all
-      - name: Login to DockerHub
-        if: github.event_name != 'pull_request'
-        uses: docker/login-action@v4
-        with:
-          username: ${{ secrets.DOCKERHUB_USERNAME }}
-          password: ${{ secrets.DOCKERHUB_PASSWORD }}
-
-      - name: Login to quay
-        if: github.event_name != 'pull_request'
-        uses: docker/login-action@v4
-        with:
-          registry: quay.io
-          username: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-          password: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-      - name: Set up Docker Buildx
-        id: buildx
-        uses: docker/setup-buildx-action@master
-
-      - name: Checkout
-        uses: actions/checkout@v7
-
-      - name: Cache Intel images
-        uses: docker/build-push-action@v7
-        with:
-          builder: ${{ steps.buildx.outputs.name }}
-          build-args: |
-            BASE_IMAGE=${{ matrix.base-image }}
-          context: .
-          file: ./Dockerfile
-          tags: quay.io/go-skynet/intel-oneapi-base:24.04
-          push: true
-          target: intel
-          platforms: ${{ matrix.platforms }}
--- a/.github/workflows/gh-pages.yml
+++ b/.github/workflows/gh-pages.yml
@@ -1,75 +0,0 @@
-name: Deploy docs to GitHub Pages
-
-on:
-  push:
-    branches:
-      - master
-    paths:
-      - 'docs/**'
-      - 'gallery/**'
-      - 'images/**'
-      - '.github/ci/modelslist.go'
-      - '.github/workflows/gh-pages.yml'
-  workflow_dispatch:
-
-permissions:
-  contents: read
-  pages: write
-  id-token: write
-
-concurrency:
-  group: pages
-  cancel-in-progress: false
-
-jobs:
-  build:
-    runs-on: ubuntu-latest
-    env:
-      HUGO_VERSION: "0.146.3"
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v7
-        with:
-          fetch-depth: 0  # needed for enableGitInfo
-          submodules: true
-
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: '1.22'
-          cache: false
-
-      - name: Setup Hugo
-        uses: peaceiris/actions-hugo@v3
-        with:
-          hugo-version: ${{ env.HUGO_VERSION }}
-          extended: true
-
-      - name: Setup Pages
-        id: pages
-        uses: actions/configure-pages@v6
-
-      - name: Generate gallery
-        run: go run ./.github/ci/modelslist.go ./gallery/index.yaml > docs/static/gallery.html
-
-      - name: Build site
-        working-directory: docs
-        run: |
-          mkdir -p layouts/_default
-          hugo --minify --baseURL "${{ steps.pages.outputs.base_url }}/"
-
-      - name: Upload artifact
-        uses: actions/upload-pages-artifact@v5
-        with:
-          path: docs/public
-
-  deploy:
-    environment:
-      name: github-pages
-      url: ${{ steps.deployment.outputs.page_url }}
-    runs-on: ubuntu-latest
-    needs: build
-    steps:
-      - name: Deploy to GitHub Pages
-        id: deployment
-        uses: actions/deploy-pages@v5
--- a/.github/workflows/image-pr.yml
+++ b/.github/workflows/image-pr.yml
@@ -1,103 +1,130 @@
 ---
-  name: 'build container images tests'
-  
-  on:
-    pull_request:
-  
-  concurrency:
-    group: ci-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-    cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-  
-  jobs:
-    image-build:
-      uses: ./.github/workflows/image_build.yml
-      with:
-        tag-latest: ${{ matrix.tag-latest }}
-        tag-suffix: ${{ matrix.tag-suffix }}
-        build-type: ${{ matrix.build-type }}
-        cuda-major-version: ${{ matrix.cuda-major-version }}
-        cuda-minor-version: ${{ matrix.cuda-minor-version }}
-        platforms: ${{ matrix.platforms }}
-        platform-tag: ${{ matrix.platform-tag || '' }}
-        runs-on: ${{ matrix.runs-on }}
-        base-image: ${{ matrix.base-image }}
-        makeflags: ${{ matrix.makeflags }}
-        ubuntu-version: ${{ matrix.ubuntu-version }}
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-      strategy:
-        # Pushing with all jobs in parallel
-        # eats the bandwidth of all the nodes
-        max-parallel: ${{ github.event_name != 'pull_request' && 4 || 8 }}
-        fail-fast: false
-        matrix:
-          include:
-            - build-type: 'cublas'
-              cuda-major-version: "12"
-              cuda-minor-version: "8"
-              platforms: 'linux/amd64'
-              tag-latest: 'false'
-              tag-suffix: '-gpu-nvidia-cuda-12'
-              runs-on: 'ubuntu-latest'
-              base-image: "ubuntu:24.04"
-              makeflags: "--jobs=3 --output-sync=target"
-              ubuntu-version: '2404'
-            - build-type: 'cublas'
-              cuda-major-version: "13"
-              cuda-minor-version: "0"
-              platforms: 'linux/amd64'
-              tag-latest: 'false'
-              tag-suffix: '-gpu-nvidia-cuda-13'
-              runs-on: 'ubuntu-latest'
-              base-image: "ubuntu:22.04"
-              makeflags: "--jobs=3 --output-sync=target"
-              ubuntu-version: '2404'
-            - build-type: 'hipblas'
-              platforms: 'linux/amd64'
-              tag-latest: 'false'
-              tag-suffix: '-hipblas'
-              base-image: "rocm/dev-ubuntu-24.04:7.2.1"
-              runs-on: 'ubuntu-latest'
-              makeflags: "--jobs=3 --output-sync=target"
-              ubuntu-version: '2404'
-            - build-type: 'sycl'
-              platforms: 'linux/amd64'
-              tag-latest: 'false'
-              base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
-              tag-suffix: 'sycl'
-              runs-on: 'ubuntu-latest'
-              makeflags: "--jobs=3 --output-sync=target"
-              ubuntu-version: '2404'
-            - build-type: 'vulkan'
-              platforms: 'linux/amd64'
-              platform-tag: 'amd64'
-              tag-latest: 'false'
-              tag-suffix: '-vulkan-core'
-              runs-on: 'ubuntu-latest'
-              base-image: "ubuntu:24.04"
-              makeflags: "--jobs=4 --output-sync=target"
-              ubuntu-version: '2404'
-            - build-type: 'vulkan'
-              platforms: 'linux/arm64'
-              platform-tag: 'arm64'
-              tag-latest: 'false'
-              tag-suffix: '-vulkan-core'
-              runs-on: 'ubuntu-24.04-arm'
-              base-image: "ubuntu:24.04"
-              makeflags: "--jobs=4 --output-sync=target"
-              ubuntu-version: '2404'
-            - build-type: 'cublas'
-              cuda-major-version: "13"
-              cuda-minor-version: "0"
-              platforms: 'linux/arm64'
-              tag-latest: 'false'
-              tag-suffix: '-nvidia-l4t-arm64-cuda-13'
-              base-image: "ubuntu:24.04"
-              runs-on: 'ubuntu-24.04-arm'
-              makeflags: "--jobs=4 --output-sync=target"
-              skip-drivers: 'false'
-              ubuntu-version: '2404'
-  
+name: 'build container images tests'
+
+on:
+  pull_request:
+
+concurrency:
+  group: ci-${{ github.head_ref || github.ref }}-${{ github.repository }}
+  cancel-in-progress: true
+
+jobs:
+  extras-image-build:
+    uses: ./.github/workflows/image_build.yml
+    with:
+      tag-latest: ${{ matrix.tag-latest }}
+      tag-suffix: ${{ matrix.tag-suffix }}
+      ffmpeg: ${{ matrix.ffmpeg }}
+      image-type: ${{ matrix.image-type }}
+      build-type: ${{ matrix.build-type }}
+      cuda-major-version: ${{ matrix.cuda-major-version }}
+      cuda-minor-version: ${{ matrix.cuda-minor-version }}
+      platforms: ${{ matrix.platforms }}
+      runs-on: ${{ matrix.runs-on }}
+      base-image: ${{ matrix.base-image }}
+      grpc-base-image: ${{ matrix.grpc-base-image }}
+      makeflags: ${{ matrix.makeflags }}
+    secrets:
+      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
+      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
+      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+    strategy:
+      # Pushing with all jobs in parallel
+      # eats the bandwidth of all the nodes
+      max-parallel: ${{ github.event_name != 'pull_request' && 2 || 4 }}
+      matrix:
+        include:
+          - build-type: ''
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-ffmpeg'
+            ffmpeg: 'true'
+            image-type: 'extras'
+            runs-on: 'arc-runner-set'
+            base-image: "ubuntu:22.04"
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'cublas'
+            cuda-major-version: "12"
+            cuda-minor-version: "1"
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-cublas-cuda12-ffmpeg'
+            ffmpeg: 'true'
+            image-type: 'extras'
+            runs-on: 'arc-runner-set'
+            base-image: "ubuntu:22.04"
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'hipblas'
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-hipblas'
+            ffmpeg: 'false'
+            image-type: 'extras'
+            base-image: "rocm/dev-ubuntu-22.04:6.0-complete"
+            grpc-base-image: "ubuntu:22.04"
+            runs-on: 'arc-runner-set'
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'sycl_f16'
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            base-image: "intel/oneapi-basekit:2024.0.1-devel-ubuntu22.04"
+            grpc-base-image: "ubuntu:22.04"
+            tag-suffix: 'sycl-f16-ffmpeg'
+            ffmpeg: 'true'
+            image-type: 'extras'
+            runs-on: 'arc-runner-set'
+            makeflags: "--jobs=3 --output-sync=target"
+  core-image-build:
+    uses: ./.github/workflows/image_build.yml
+    with:
+      tag-latest: ${{ matrix.tag-latest }}
+      tag-suffix: ${{ matrix.tag-suffix }}
+      ffmpeg: ${{ matrix.ffmpeg }}
+      image-type: ${{ matrix.image-type }}
+      build-type: ${{ matrix.build-type }}
+      cuda-major-version: ${{ matrix.cuda-major-version }}
+      cuda-minor-version: ${{ matrix.cuda-minor-version }}
+      platforms: ${{ matrix.platforms }}
+      runs-on: ${{ matrix.runs-on }}
+      base-image: ${{ matrix.base-image }}
+      grpc-base-image: ${{ matrix.grpc-base-image }}
+      makeflags: ${{ matrix.makeflags }}
+    secrets:
+      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
+      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
+      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+    strategy:
+      matrix:
+        include:
+          - build-type: ''
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-ffmpeg-core'
+            ffmpeg: 'true'
+            image-type: 'core'
+            runs-on: 'ubuntu-latest'
+            base-image: "ubuntu:22.04"
+            makeflags: "--jobs=4 --output-sync=target"
+          - build-type: 'sycl_f16'
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            base-image: "intel/oneapi-basekit:2024.0.1-devel-ubuntu22.04"
+            grpc-base-image: "ubuntu:22.04"
+            tag-suffix: 'sycl-f16-ffmpeg-core'
+            ffmpeg: 'true'
+            image-type: 'core'
+            runs-on: 'arc-runner-set'
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'cublas'
+            cuda-major-version: "12"
+            cuda-minor-version: "1"
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-cublas-cuda12-ffmpeg-core'
+            ffmpeg: 'true'
+            image-type: 'core'
+            runs-on: 'ubuntu-latest'
+            base-image: "ubuntu:22.04"
+            makeflags: "--jobs=4 --output-sync=target"
--- a/.github/workflows/image.yml
+++ b/.github/workflows/image.yml
@@ -1,315 +1,317 @@
 ---
-  name: 'build container images'
+name: 'build container images'
+
+on:
+  push:
+    branches:
+      - master
+    tags:
+      - '*'
+
+concurrency:
+  group: ci-${{ github.head_ref || github.ref }}-${{ github.repository }}
+  cancel-in-progress: true
+
+jobs:
+  self-hosted-jobs:
+    uses: ./.github/workflows/image_build.yml
+    with:
+      tag-latest: ${{ matrix.tag-latest }}
+      tag-suffix: ${{ matrix.tag-suffix }}
+      ffmpeg: ${{ matrix.ffmpeg }}
+      image-type: ${{ matrix.image-type }}
+      build-type: ${{ matrix.build-type }}
+      cuda-major-version: ${{ matrix.cuda-major-version }}
+      cuda-minor-version: ${{ matrix.cuda-minor-version }}
+      platforms: ${{ matrix.platforms }}
+      runs-on: ${{ matrix.runs-on }}
+      base-image: ${{ matrix.base-image }}
+      grpc-base-image: ${{ matrix.grpc-base-image }}
+      aio: ${{ matrix.aio }}
+      makeflags: ${{ matrix.makeflags }}
+      latest-image: ${{ matrix.latest-image }}
+      latest-image-aio: ${{ matrix.latest-image-aio }}
+    secrets:
+      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
+      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
+      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+    strategy:
+      # Pushing with all jobs in parallel
+      # eats the bandwidth of all the nodes
+      max-parallel: ${{ github.event_name != 'pull_request' && 2 || 4 }}
+      matrix:
+        include:
+          # Extra images
+          - build-type: ''
+            #platforms: 'linux/amd64,linux/arm64'
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: ''
+            ffmpeg: ''
+            image-type: 'extras'
+            runs-on: 'arc-runner-set'
+            base-image: "ubuntu:22.04"
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: ''
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-ffmpeg'
+            ffmpeg: 'true'
+            image-type: 'extras'
+            runs-on: 'arc-runner-set'
+            base-image: "ubuntu:22.04"
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'cublas'
+            cuda-major-version: "11"
+            cuda-minor-version: "7"
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-cublas-cuda11'
+            ffmpeg: ''
+            image-type: 'extras'
+            runs-on: 'arc-runner-set'
+            base-image: "ubuntu:22.04"
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'cublas'
+            cuda-major-version: "12"
+            cuda-minor-version: "1"
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-cublas-cuda12'
+            ffmpeg: ''
+            image-type: 'extras'
+            runs-on: 'arc-runner-set'
+            base-image: "ubuntu:22.04"
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'cublas'
+            cuda-major-version: "11"
+            cuda-minor-version: "7"
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-cublas-cuda11-ffmpeg'
+            ffmpeg: 'true'
+            image-type: 'extras'
+            runs-on: 'arc-runner-set'
+            base-image: "ubuntu:22.04"
+            aio: "-aio-gpu-nvidia-cuda-11"
+            latest-image: 'latest-gpu-nvidia-cuda-11'
+            latest-image-aio: 'latest-aio-gpu-nvidia-cuda-11'
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'cublas'
+            cuda-major-version: "12"
+            cuda-minor-version: "1"
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-cublas-cuda12-ffmpeg'
+            ffmpeg: 'true'
+            image-type: 'extras'
+            runs-on: 'arc-runner-set'
+            base-image: "ubuntu:22.04"
+            aio: "-aio-gpu-nvidia-cuda-12"
+            latest-image: 'latest-gpu-nvidia-cuda-12'
+            latest-image-aio: 'latest-aio-gpu-nvidia-cuda-12'
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: ''
+            #platforms: 'linux/amd64,linux/arm64'
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: ''
+            ffmpeg: ''
+            image-type: 'extras'
+            base-image: "ubuntu:22.04"
+            runs-on: 'arc-runner-set'
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'hipblas'
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-hipblas-ffmpeg'
+            ffmpeg: 'true'
+            image-type: 'extras'
+            aio: "-aio-gpu-hipblas"
+            base-image: "rocm/dev-ubuntu-22.04:6.0-complete"
+            grpc-base-image: "ubuntu:22.04"
+            latest-image: 'latest-gpu-hipblas'
+            latest-image-aio: 'latest-aio-gpu-hipblas'
+            runs-on: 'arc-runner-set'
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'hipblas'
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-hipblas'
+            ffmpeg: 'false'
+            image-type: 'extras'
+            base-image: "rocm/dev-ubuntu-22.04:6.0-complete"
+            grpc-base-image: "ubuntu:22.04"
+            runs-on: 'arc-runner-set'
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'sycl_f16'
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            base-image: "intel/oneapi-basekit:2024.0.1-devel-ubuntu22.04"
+            grpc-base-image: "ubuntu:22.04"
+            tag-suffix: '-sycl-f16-ffmpeg'
+            ffmpeg: 'true'
+            image-type: 'extras'
+            runs-on: 'arc-runner-set'
+            aio: "-aio-gpu-intel-f16"
+            latest-image: 'latest-gpu-intel-f16'
+            latest-image-aio: 'latest-aio-gpu-intel-f16'
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'sycl_f32'
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            base-image: "intel/oneapi-basekit:2024.0.1-devel-ubuntu22.04"
+            grpc-base-image: "ubuntu:22.04"
+            tag-suffix: '-sycl-f32-ffmpeg'
+            ffmpeg: 'true'
+            image-type: 'extras'
+            runs-on: 'arc-runner-set'
+            aio: "-aio-gpu-intel-f32"
+            latest-image: 'latest-gpu-intel-f32'
+            latest-image-aio: 'latest-aio-gpu-intel-f32'
+            makeflags: "--jobs=3 --output-sync=target"
+          # Core images
+          - build-type: 'sycl_f16'
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            base-image: "intel/oneapi-basekit:2024.0.1-devel-ubuntu22.04"
+            grpc-base-image: "ubuntu:22.04"
+            tag-suffix: '-sycl-f16-core'
+            ffmpeg: 'false'
+            image-type: 'core'
+            runs-on: 'arc-runner-set'
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'sycl_f32'
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            base-image: "intel/oneapi-basekit:2024.0.1-devel-ubuntu22.04"
+            grpc-base-image: "ubuntu:22.04"
+            tag-suffix: '-sycl-f32-core'
+            ffmpeg: 'false'
+            image-type: 'core'
+            runs-on: 'arc-runner-set'
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'sycl_f16'
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            base-image: "intel/oneapi-basekit:2024.0.1-devel-ubuntu22.04"
+            grpc-base-image: "ubuntu:22.04"
+            tag-suffix: '-sycl-f16-ffmpeg-core'
+            ffmpeg: 'true'
+            image-type: 'core'
+            runs-on: 'arc-runner-set'
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'sycl_f32'
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            base-image: "intel/oneapi-basekit:2024.0.1-devel-ubuntu22.04"
+            grpc-base-image: "ubuntu:22.04"
+            tag-suffix: '-sycl-f32-ffmpeg-core'
+            ffmpeg: 'true'
+            image-type: 'core'
+            runs-on: 'arc-runner-set'
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'hipblas'
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-hipblas-ffmpeg-core'
+            ffmpeg: 'true'
+            image-type: 'core'
+            base-image: "rocm/dev-ubuntu-22.04:6.0-complete"
+            grpc-base-image: "ubuntu:22.04"
+            runs-on: 'arc-runner-set'
+            makeflags: "--jobs=3 --output-sync=target"
+          - build-type: 'hipblas'
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-hipblas-core'
+            ffmpeg: 'false'
+            image-type: 'core'
+            base-image: "rocm/dev-ubuntu-22.04:6.0-complete"
+            grpc-base-image: "ubuntu:22.04"
+            runs-on: 'arc-runner-set'
+            makeflags: "--jobs=3 --output-sync=target"
  
-  on:
-    push:
-      branches:
-        - master
-      tags:
-        - '*'
-  
-  concurrency:
-    group: ci-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-    cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-  
-  jobs:
-    hipblas-jobs:
-      if: github.repository == 'mudler/LocalAI'
-      uses: ./.github/workflows/image_build.yml
-      with:
-        tag-latest: ${{ matrix.tag-latest }}
-        tag-suffix: ${{ matrix.tag-suffix }}
-        build-type: ${{ matrix.build-type }}
-        cuda-major-version: ${{ matrix.cuda-major-version }}
-        cuda-minor-version: ${{ matrix.cuda-minor-version }}
-        platforms: ${{ matrix.platforms }}
-        runs-on: ${{ matrix.runs-on }}
-        base-image: ${{ matrix.base-image }}
-        makeflags: ${{ matrix.makeflags }}
-        ubuntu-version: ${{ matrix.ubuntu-version }}
-        ubuntu-codename: ${{ matrix.ubuntu-codename }}
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-      strategy:
-        matrix:
-          include:
-            - build-type: 'hipblas'
-              platforms: 'linux/amd64'
-              tag-latest: 'auto'
-              tag-suffix: '-gpu-hipblas'
-              base-image: "rocm/dev-ubuntu-24.04:7.2.1"
-              runs-on: 'ubuntu-latest'
-              makeflags: "--jobs=3 --output-sync=target"
-              ubuntu-version: '2404'
-              ubuntu-codename: 'noble'
-
-    core-image-build:
-      if: github.repository == 'mudler/LocalAI'
-      uses: ./.github/workflows/image_build.yml
-      with:
-        tag-latest: ${{ matrix.tag-latest }}
-        tag-suffix: ${{ matrix.tag-suffix }}
-        build-type: ${{ matrix.build-type }}
-        cuda-major-version: ${{ matrix.cuda-major-version }}
-        cuda-minor-version: ${{ matrix.cuda-minor-version }}
-        platforms: ${{ matrix.platforms }}
-        platform-tag: ${{ matrix.platform-tag || '' }}
-        runs-on: ${{ matrix.runs-on }}
-        base-image: ${{ matrix.base-image }}
-        makeflags: ${{ matrix.makeflags }}
-        skip-drivers: ${{ matrix.skip-drivers }}
-        ubuntu-version: ${{ matrix.ubuntu-version }}
-        ubuntu-codename: ${{ matrix.ubuntu-codename }}
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-      strategy:
-        #max-parallel: ${{ github.event_name != 'pull_request' && 2 || 4 }}
-        matrix:
-          include:
-            - build-type: ''
-              platforms: 'linux/amd64'
-              platform-tag: 'amd64'
-              tag-latest: 'auto'
-              tag-suffix: ''
-              base-image: "ubuntu:24.04"
-              runs-on: 'ubuntu-latest'
-              makeflags: "--jobs=4 --output-sync=target"
-              skip-drivers: 'false'
-              ubuntu-version: '2404'
-              ubuntu-codename: 'noble'
-            - build-type: ''
-              platforms: 'linux/arm64'
-              platform-tag: 'arm64'
-              tag-latest: 'auto'
-              tag-suffix: ''
-              base-image: "ubuntu:24.04"
-              runs-on: 'ubuntu-24.04-arm'
-              makeflags: "--jobs=4 --output-sync=target"
-              skip-drivers: 'false'
-              ubuntu-version: '2404'
-              ubuntu-codename: 'noble'
-            - build-type: 'cublas'
-              cuda-major-version: "12"
-              cuda-minor-version: "8"
-              platforms: 'linux/amd64'
-              tag-latest: 'auto'
-              tag-suffix: '-gpu-nvidia-cuda-12'
-              runs-on: 'ubuntu-latest'
-              base-image: "ubuntu:24.04"
-              skip-drivers: 'false'
-              makeflags: "--jobs=4 --output-sync=target"
-              ubuntu-version: '2404'
-              ubuntu-codename: 'noble'
-            - build-type: 'cublas'
-              cuda-major-version: "13"
-              cuda-minor-version: "0"
-              platforms: 'linux/amd64'
-              tag-latest: 'auto'
-              tag-suffix: '-gpu-nvidia-cuda-13'
-              runs-on: 'ubuntu-latest'
-              base-image: "ubuntu:22.04"
-              skip-drivers: 'false'
-              makeflags: "--jobs=4 --output-sync=target"
-              ubuntu-version: '2404'
-              ubuntu-codename: 'noble'
-            - build-type: 'vulkan'
-              platforms: 'linux/amd64'
-              platform-tag: 'amd64'
-              tag-latest: 'auto'
-              tag-suffix: '-gpu-vulkan'
-              runs-on: 'ubuntu-latest'
-              base-image: "ubuntu:24.04"
-              skip-drivers: 'false'
-              makeflags: "--jobs=4 --output-sync=target"
-              ubuntu-version: '2404'
-              ubuntu-codename: 'noble'
-            - build-type: 'vulkan'
-              platforms: 'linux/arm64'
-              platform-tag: 'arm64'
-              tag-latest: 'auto'
-              tag-suffix: '-gpu-vulkan'
-              runs-on: 'ubuntu-24.04-arm'
-              base-image: "ubuntu:24.04"
-              skip-drivers: 'false'
-              makeflags: "--jobs=4 --output-sync=target"
-              ubuntu-version: '2404'
-              ubuntu-codename: 'noble'
-            - build-type: 'intel'
-              platforms: 'linux/amd64'
-              tag-latest: 'auto'
-              base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
-              tag-suffix: '-gpu-intel'
-              runs-on: 'ubuntu-latest'
-              makeflags: "--jobs=3 --output-sync=target"
-              ubuntu-version: '2404'
-              ubuntu-codename: 'noble'
-
-    core-image-merge:
-      # !cancelled(): without it, GHA's default `needs:` cascade skips the
-      # merge whenever any matrix cell of the parent build fails or is
-      # cancelled. Same fix as backend.yml's merge jobs — we still want to
-      # publish the manifest list for tag-suffixes whose legs all succeeded.
-      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
-      needs: core-image-build
-      uses: ./.github/workflows/image_merge.yml
-      with:
-        tag-latest: 'auto'
-        tag-suffix: ''
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-
-    gpu-vulkan-image-merge:
-      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
-      needs: core-image-build
-      uses: ./.github/workflows/image_merge.yml
-      with:
-        tag-latest: 'auto'
-        tag-suffix: '-gpu-vulkan'
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-
-    # Single-arch server-image merges. Same conceptual fix as the backend
-    # singletons in PR #9781: image_build.yml pushes by canonical digest
-    # only, so without a downstream merge step there's no tag for consumers
-    # (no :latest-gpu-nvidia-cuda-12, no :v<X>-gpu-nvidia-cuda-12, etc.).
-    # Each merge job needs only its parent build matrix and is filtered by
-    # tag-suffix in image_merge.yml's artifact-download pattern.
-    gpu-nvidia-cuda-12-image-merge:
-      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
-      needs: core-image-build
-      uses: ./.github/workflows/image_merge.yml
-      with:
-        tag-latest: 'auto'
-        tag-suffix: '-gpu-nvidia-cuda-12'
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-
-    gpu-nvidia-cuda-13-image-merge:
-      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
-      needs: core-image-build
-      uses: ./.github/workflows/image_merge.yml
-      with:
-        tag-latest: 'auto'
-        tag-suffix: '-gpu-nvidia-cuda-13'
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-
-    gpu-intel-image-merge:
-      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
-      needs: core-image-build
-      uses: ./.github/workflows/image_merge.yml
-      with:
-        tag-latest: 'auto'
-        tag-suffix: '-gpu-intel'
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-
-    gpu-hipblas-image-merge:
-      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
-      needs: hipblas-jobs
-      uses: ./.github/workflows/image_merge.yml
-      with:
-        tag-latest: 'auto'
-        tag-suffix: '-gpu-hipblas'
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-
-    nvidia-l4t-arm64-image-merge:
-      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
-      needs: gh-runner
-      uses: ./.github/workflows/image_merge.yml
-      with:
-        tag-latest: 'auto'
-        tag-suffix: '-nvidia-l4t-arm64'
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-
-    nvidia-l4t-arm64-cuda-13-image-merge:
-      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
-      needs: gh-runner
-      uses: ./.github/workflows/image_merge.yml
-      with:
-        tag-latest: 'auto'
-        tag-suffix: '-nvidia-l4t-arm64-cuda-13'
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-
-    gh-runner:
-      if: github.repository == 'mudler/LocalAI'
-      uses: ./.github/workflows/image_build.yml
-      with:
-        tag-latest: ${{ matrix.tag-latest }}
-        tag-suffix: ${{ matrix.tag-suffix }}
-        build-type: ${{ matrix.build-type }}
-        cuda-major-version: ${{ matrix.cuda-major-version }}
-        cuda-minor-version: ${{ matrix.cuda-minor-version }}
-        platforms: ${{ matrix.platforms }}
-        runs-on: ${{ matrix.runs-on }}
-        base-image: ${{ matrix.base-image }}
-        makeflags: ${{ matrix.makeflags }}
-        skip-drivers: ${{ matrix.skip-drivers }}
-        ubuntu-version: ${{ matrix.ubuntu-version }}
-        ubuntu-codename: ${{ matrix.ubuntu-codename }}
-      secrets:
-        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
-        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
-        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-      strategy:
-        matrix:
-          include:
-            - build-type: 'cublas'
-              cuda-major-version: "12"
-              cuda-minor-version: "0"
-              platforms: 'linux/arm64'
-              tag-latest: 'auto'
-              tag-suffix: '-nvidia-l4t-arm64'
-              base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
-              runs-on: 'ubuntu-24.04-arm'
-              makeflags: "--jobs=4 --output-sync=target"
-              skip-drivers: 'true'
-              ubuntu-version: "2204"
-              ubuntu-codename: 'jammy'
-            - build-type: 'cublas'
-              cuda-major-version: "13"
-              cuda-minor-version: "0"
-              platforms: 'linux/arm64'
-              tag-latest: 'auto'
-              tag-suffix: '-nvidia-l4t-arm64-cuda-13'
-              base-image: "ubuntu:24.04"
-              runs-on: 'ubuntu-24.04-arm'
-              makeflags: "--jobs=4 --output-sync=target"
-              skip-drivers: 'false'
-              ubuntu-version: '2404'
-              ubuntu-codename: 'noble'
-  
+  core-image-build:
+    uses: ./.github/workflows/image_build.yml
+    with:
+      tag-latest: ${{ matrix.tag-latest }}
+      tag-suffix: ${{ matrix.tag-suffix }}
+      ffmpeg: ${{ matrix.ffmpeg }}
+      image-type: ${{ matrix.image-type }}
+      build-type: ${{ matrix.build-type }}
+      cuda-major-version: ${{ matrix.cuda-major-version }}
+      cuda-minor-version: ${{ matrix.cuda-minor-version }}
+      platforms: ${{ matrix.platforms }}
+      runs-on: ${{ matrix.runs-on }}
+      aio: ${{ matrix.aio }}
+      base-image: ${{ matrix.base-image }}
+      grpc-base-image: ${{ matrix.grpc-base-image }}
+      makeflags: ${{ matrix.makeflags }}
+      latest-image: ${{ matrix.latest-image }}
+      latest-image-aio: ${{ matrix.latest-image-aio }}
+    secrets:
+      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
+      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
+      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+    strategy:
+      matrix:
+        include:
+          - build-type: ''
+            platforms: 'linux/amd64'
+            tag-latest: 'auto'
+            tag-suffix: '-ffmpeg-core'
+            ffmpeg: 'true'
+            image-type: 'core'
+            base-image: "ubuntu:22.04"
+            runs-on: 'ubuntu-latest'
+            aio: "-aio-cpu"
+            latest-image: 'latest-cpu'
+            latest-image-aio: 'latest-aio-cpu'
+            makeflags: "--jobs=4 --output-sync=target"
+          - build-type: 'cublas'
+            cuda-major-version: "11"
+            cuda-minor-version: "7"
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-cublas-cuda11-core'
+            ffmpeg: ''
+            image-type: 'core'
+            base-image: "ubuntu:22.04"
+            runs-on: 'ubuntu-latest'
+            makeflags: "--jobs=4 --output-sync=target"
+          - build-type: 'cublas'
+            cuda-major-version: "12"
+            cuda-minor-version: "1"
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-cublas-cuda12-core'
+            ffmpeg: ''
+            image-type: 'core'
+            base-image: "ubuntu:22.04"
+            runs-on: 'ubuntu-latest'
+            makeflags: "--jobs=4 --output-sync=target"
+          - build-type: 'cublas'
+            cuda-major-version: "11"
+            cuda-minor-version: "7"
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-cublas-cuda11-ffmpeg-core'
+            ffmpeg: 'true'
+            image-type: 'core'
+            runs-on: 'ubuntu-latest'
+            base-image: "ubuntu:22.04"
+            makeflags: "--jobs=4 --output-sync=target"
+          - build-type: 'cublas'
+            cuda-major-version: "12"
+            cuda-minor-version: "1"
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-cublas-cuda12-ffmpeg-core'
+            ffmpeg: 'true'
+            image-type: 'core'
+            runs-on: 'ubuntu-latest'
+            base-image: "ubuntu:22.04"
+            makeflags: "--jobs=4 --output-sync=target"
--- a/.github/workflows/image_build.yml
+++ b/.github/workflows/image_build.yml
@@ -8,42 +8,50 @@ on:
        description: 'Base image'
        required: true
        type: string
+      grpc-base-image:
+        description: 'gRPC base image; must be compatible with base-image'
+        required: false
+        default: ''
+        type: string
      build-type:
        description: 'Build type'
        default: ''
        type: string
      cuda-major-version:
        description: 'CUDA major version'
-        default: "12"
+        default: "11"
        type: string
      cuda-minor-version:
        description: 'CUDA minor version'
-        default: "9"
+        default: "7"
        type: string
      platforms:
        description: 'Platforms'
        default: ''
        type: string
-      platform-tag:
-        description: |
-          Short tag identifying the platform leg, e.g. "amd64" or "arm64".
-          Used to scope the per-arch registry cache and the digest artifact name.
-          Optional during the migration; will be flipped to required: true once
-          every caller passes an explicit value.
-        required: false
-        default: ''
-        type: string
      tag-latest:
        description: 'Tag latest'
        default: ''
        type: string
+      latest-image:
+        description: 'Latest image tag'
+        default: ''
+        type: string
+      latest-image-aio:
+        description: 'Latest AIO image tag'
+        default: ''
+        type: string
      tag-suffix:
        description: 'Tag suffix'
        default: ''
        type: string
-      skip-drivers:
-        description: 'Skip drivers by default'
-        default: 'false'
+      ffmpeg:
+        description: 'FFMPEG'
+        default: ''
+        type: string
+      image-type:
+        description: 'Image type'
+        default: ''
        type: string
      runs-on:
        description: 'Runs on'
@@ -55,15 +63,10 @@ on:
        required: false
        default: '--jobs=4 --output-sync=target'
        type: string
-      ubuntu-version:
-        description: 'Ubuntu version'
+      aio:
+        description: 'AIO Image Name'
        required: false
-        default: '2204'
-        type: string
-      ubuntu-codename:
-        description: 'Ubuntu codename'
-        required: false
-        default: 'noble'
+        default: ''
        type: string
    secrets:
      dockerUsername:
@@ -78,26 +81,62 @@ jobs:
  reusable_image-build:
    runs-on: ${{ inputs.runs-on }}
    steps:
-
+      - name: Force Install GIT latest
+        run: |
+          sudo apt-get update \
+          && sudo apt-get install -y software-properties-common \
+          && sudo apt-get update \
+          && sudo add-apt-repository -y ppa:git-core/ppa \
+          && sudo apt-get update \
+          && sudo apt-get install -y git
      - name: Checkout
-        uses: actions/checkout@v7
+        uses: actions/checkout@v4

-      - name: Configure apt mirror on runner
-        id: apt_mirror
-        uses: ./.github/actions/configure-apt-mirror
-
-      - name: Free disk space
-        uses: ./.github/actions/free-disk-space
-        with:
-          mode: ${{ inputs.runs-on == 'ubuntu-latest' && 'hosted' || 'skip' }}
-
-      - name: Set up build disk
-        uses: ./.github/actions/setup-build-disk
+      - name: Release space from worker
+        if: inputs.runs-on == 'ubuntu-latest'
+        run: |
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          df -h
+          echo
+          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
+          sudo apt-get remove -y --auto-remove android-sdk-platform-tools || true
+          sudo apt-get purge -y --auto-remove android-sdk-platform-tools || true
+          sudo rm -rf /usr/local/lib/android
+          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
+          sudo rm -rf /usr/share/dotnet
+          sudo apt-get remove -y '^mono-.*' || true
+          sudo apt-get remove -y '^ghc-.*' || true
+          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
+          sudo apt-get remove -y 'php.*' || true
+          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
+          sudo apt-get remove -y '^google-.*' || true
+          sudo apt-get remove -y azure-cli || true
+          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
+          sudo apt-get remove -y '^gfortran-.*' || true
+          sudo apt-get remove -y microsoft-edge-stable || true
+          sudo apt-get remove -y firefox || true
+          sudo apt-get remove -y powershell || true
+          sudo apt-get remove -y r-base-core || true
+          sudo apt-get autoremove -y
+          sudo apt-get clean
+          echo
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          sudo rm -rfv build || true
+          sudo rm -rf /usr/share/dotnet || true
+          sudo rm -rf /opt/ghc || true
+          sudo rm -rf "/usr/local/share/boost" || true
+          sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
+          df -h

      - name: Docker meta
        id: meta
-        if: github.event_name != 'pull_request'
-        uses: docker/metadata-action@v6
+        uses: docker/metadata-action@v5
        with:
          images: |
            quay.io/go-skynet/local-ai
@@ -106,24 +145,38 @@ jobs:
            type=ref,event=branch
            type=semver,pattern={{raw}}
            type=sha
-            type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
-          flavor: |
-            latest=${{ inputs.tag-latest }}
-            suffix=${{ inputs.tag-suffix }},onlatest=true
-      - name: Docker meta for PR
-        id: meta_pull_request
-        if: github.event_name == 'pull_request'
-        uses: docker/metadata-action@v6
-        with:
-          images: |
-            quay.io/go-skynet/ci-tests
-          tags: |
-            type=ref,event=branch,suffix=localai${{ github.event.number }}-${{ inputs.build-type }}-${{ inputs.cuda-major-version }}-${{ inputs.cuda-minor-version }}
-            type=semver,pattern={{raw}},suffix=localai${{ github.event.number }}-${{ inputs.build-type }}-${{ inputs.cuda-major-version }}-${{ inputs.cuda-minor-version }}
-            type=sha,suffix=localai${{ github.event.number }}-${{ inputs.build-type }}-${{ inputs.cuda-major-version }}-${{ inputs.cuda-minor-version }}
          flavor: |
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.tag-suffix }}
+
+      - name: Docker meta AIO (quay.io)
+        if: inputs.aio != ''
+        id: meta_aio
+        uses: docker/metadata-action@v5
+        with:
+          images: |
+            quay.io/go-skynet/local-ai
+          tags: |
+            type=ref,event=branch
+            type=semver,pattern={{raw}}
+          flavor: |
+            latest=${{ inputs.tag-latest }}
+            suffix=${{ inputs.aio }}
+
+      - name: Docker meta AIO (dockerhub)
+        if: inputs.aio != ''
+        id: meta_aio_dockerhub
+        uses: docker/metadata-action@v5
+        with:
+          images: |
+            localai/localai
+          tags: |
+            type=ref,event=branch
+            type=semver,pattern={{raw}}
+          flavor: |
+            latest=${{ inputs.tag-latest }}
+            suffix=${{ inputs.aio }}
+
      - name: Set up QEMU
        uses: docker/setup-qemu-action@master
        with:
@@ -135,107 +188,123 @@ jobs:

      - name: Login to DockerHub
        if: github.event_name != 'pull_request'
-        uses: docker/login-action@v4
+        uses: docker/login-action@v3
        with:
          username: ${{ secrets.dockerUsername }}
          password: ${{ secrets.dockerPassword }}

      - name: Login to Quay.io
        if: github.event_name != 'pull_request'
-        uses: docker/login-action@v4
+        uses: docker/login-action@v3
        with:
          registry: quay.io
          username: ${{ secrets.quayUsername }}
          password: ${{ secrets.quayPassword }}

-      - name: Build and push by digest
-        id: build
-        uses: docker/build-push-action@v7
-        if: github.event_name != 'pull_request'
+      - name: Cache GRPC
+        uses: docker/build-push-action@v5
        with:
          builder: ${{ steps.buildx.outputs.name }}
+          # The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
+          # This means that even the MAKEFLAGS have to be an EXACT match.
+          # If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
          build-args: |
-            BUILD_TYPE=${{ inputs.build-type }}
-            CUDA_MAJOR_VERSION=${{ inputs.cuda-major-version }}
-            CUDA_MINOR_VERSION=${{ inputs.cuda-minor-version }}
-            BASE_IMAGE=${{ inputs.base-image }}
-            MAKEFLAGS=${{ inputs.makeflags }}
-            SKIP_DRIVERS=${{ inputs.skip-drivers }}
-            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
-            UBUNTU_CODENAME=${{ inputs.ubuntu-codename }}
-            APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
-            APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
+            GRPC_BASE_IMAGE=${{ inputs.grpc-base-image || inputs.base-image }}
+            MAKEFLAGS=--jobs=4 --output-sync=target
+            GRPC_VERSION=v1.58.0
          context: .
          file: ./Dockerfile
-          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
-          cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}-${{ inputs.platform-tag }},mode=max,ignore-error=true
+          cache-from: type=gha
+          target: grpc
          platforms: ${{ inputs.platforms }}
-          outputs: |
-            type=image,name=quay.io/go-skynet/local-ai,push-by-digest=true,name-canonical=true,push=true
-            type=image,name=localai/localai,push-by-digest=true,name-canonical=true,push=true
-          # See backend_build.yml for the rationale — provenance=mode=max
-          # diverges the manifest-list digest per registry, breaking the
-          # downstream imagetools create lookup.
-          provenance: false
+          push: false
+          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

-      - name: Export digest
-        if: github.event_name != 'pull_request'
-        run: |
-          mkdir -p /tmp/digests
-          digest="${{ steps.build.outputs.digest }}"
-          touch "/tmp/digests/${digest#sha256:}"
-
-      # See .github/scripts/anchor-digest-in-cache.sh for why this is needed
-      # and how it interacts with image_merge.yml's cleanup step. Mirrors the
-      # same anchor in backend_build.yml — quay's per-repo manifest GC reaps
-      # untagged manifests in local-ai before the merge runs.
-      - name: Anchor digest in ci-cache so quay GC won't reap before merge
-        if: github.event_name != 'pull_request'
-        env:
-          TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
-          PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
-          DIGEST: ${{ steps.build.outputs.digest }}
-          SOURCE_IMAGE: quay.io/go-skynet/local-ai
-        run: .github/scripts/anchor-digest-in-cache.sh
-
-      - name: Upload digest artifact
-        if: github.event_name != 'pull_request'
-        uses: actions/upload-artifact@v7
-        with:
-          # `--` separator + 'single' placeholder for empty platform-tag —
-          # same pattern as backend_build.yml. Prevents prefix collisions
-          # in the merge-side glob (e.g. -nvidia-l4t-arm64 is a prefix of
-          # -nvidia-l4t-arm64-cuda-13).
-          name: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
-          path: /tmp/digests/*
-          if-no-files-found: error
-          retention-days: 1
-### Start testing image
      - name: Build and push
-        uses: docker/build-push-action@v7
-        if: github.event_name == 'pull_request'
+        uses: docker/build-push-action@v5
        with:
          builder: ${{ steps.buildx.outputs.name }}
          build-args: |
            BUILD_TYPE=${{ inputs.build-type }}
            CUDA_MAJOR_VERSION=${{ inputs.cuda-major-version }}
            CUDA_MINOR_VERSION=${{ inputs.cuda-minor-version }}
+            FFMPEG=${{ inputs.ffmpeg }}
+            IMAGE_TYPE=${{ inputs.image-type }}
            BASE_IMAGE=${{ inputs.base-image }}
            MAKEFLAGS=${{ inputs.makeflags }}
-            SKIP_DRIVERS=${{ inputs.skip-drivers }}
-            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
-            UBUNTU_CODENAME=${{ inputs.ubuntu-codename }}
-            APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
-            APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
          context: .
          file: ./Dockerfile
-          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
+          cache-from: type=gha
          platforms: ${{ inputs.platforms }}
-          #push: true
-          tags: ${{ steps.meta_pull_request.outputs.tags }}
-          labels: ${{ steps.meta_pull_request.outputs.labels }}
-## End testing image
+          push: ${{ github.event_name != 'pull_request' }}
+          tags: ${{ steps.meta.outputs.tags }}
+          labels: ${{ steps.meta.outputs.labels }}
+
+      - name: Inspect image
+        if: github.event_name != 'pull_request'
+        run: |
+          docker pull localai/localai:${{ steps.meta.outputs.version }}
+          docker image inspect localai/localai:${{ steps.meta.outputs.version }}
+          docker pull quay.io/go-skynet/local-ai:${{ steps.meta.outputs.version }}
+          docker image inspect quay.io/go-skynet/local-ai:${{ steps.meta.outputs.version }}
+
+      - name: Build and push AIO image
+        if: inputs.aio != ''
+        uses: docker/build-push-action@v5
+        with:
+          builder: ${{ steps.buildx.outputs.name }}
+          build-args: |
+            BASE_IMAGE=quay.io/go-skynet/local-ai:${{ steps.meta.outputs.version }}
+            MAKEFLAGS=${{ inputs.makeflags }}
+          context: .
+          file: ./Dockerfile.aio
+          platforms: ${{ inputs.platforms }}
+          push: ${{ github.event_name != 'pull_request' }}
+          tags: ${{ steps.meta_aio.outputs.tags }}
+          labels: ${{ steps.meta_aio.outputs.labels }}
+
+      - name: Build and push AIO image (dockerhub)
+        if: inputs.aio != ''
+        uses: docker/build-push-action@v5
+        with:
+          builder: ${{ steps.buildx.outputs.name }}
+          build-args: |
+            BASE_IMAGE=localai/localai:${{ steps.meta.outputs.version }}
+            MAKEFLAGS=${{ inputs.makeflags }}
+          context: .
+          file: ./Dockerfile.aio
+          platforms: ${{ inputs.platforms }}
+          push: ${{ github.event_name != 'pull_request' }}
+          tags: ${{ steps.meta_aio_dockerhub.outputs.tags }}
+          labels: ${{ steps.meta_aio_dockerhub.outputs.labels }}
+
+      - name: Latest tag
+        # Run only for tag pushes (never pull requests), and only when latest-image is set.
+        if: github.event_name != 'pull_request' && inputs.latest-image != ''  && github.ref_type == 'tag'
+        run: |
+          docker pull localai/localai:${{ steps.meta.outputs.version }}
+          docker tag localai/localai:${{ steps.meta.outputs.version }} localai/localai:${{ inputs.latest-image }}
+          docker push localai/localai:${{ inputs.latest-image }}
+          docker pull quay.io/go-skynet/local-ai:${{ steps.meta.outputs.version }}
+          docker tag quay.io/go-skynet/local-ai:${{ steps.meta.outputs.version }} quay.io/go-skynet/local-ai:${{ inputs.latest-image }}
+          docker push quay.io/go-skynet/local-ai:${{ inputs.latest-image }}
+      - name: Latest AIO tag
+        # Run only for tag pushes (never pull requests), and only when latest-image-aio is set.
+        if: github.event_name != 'pull_request' && inputs.latest-image-aio != ''  && github.ref_type == 'tag'
+        run: |
+          docker pull localai/localai:${{ steps.meta_aio_dockerhub.outputs.version }}
+          docker tag localai/localai:${{ steps.meta_aio_dockerhub.outputs.version }} localai/localai:${{ inputs.latest-image-aio }}
+          docker push localai/localai:${{ inputs.latest-image-aio }}
+          docker pull quay.io/go-skynet/local-ai:${{ steps.meta_aio.outputs.version }}
+          docker tag quay.io/go-skynet/local-ai:${{ steps.meta_aio.outputs.version }} quay.io/go-skynet/local-ai:${{ inputs.latest-image-aio }}
+          docker push quay.io/go-skynet/local-ai:${{ inputs.latest-image-aio }}
+  
      - name: job summary
        run: |
          echo "Built image: ${{ steps.meta.outputs.labels }}" >> $GITHUB_STEP_SUMMARY
+
+      - name: job summary (AIO)
+        if: inputs.aio != ''
+        run: |
+          echo "Built image: ${{ steps.meta_aio.outputs.labels }}" >> $GITHUB_STEP_SUMMARY
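The removed upload step named each digest artifact `digests-localai<suffix>--<platform>`, substituting `-core` for an empty tag suffix and `single` for an empty platform tag, so that the merge-side glob cannot over-match a sibling suffix. A minimal shell sketch of that naming scheme follows. `artifact_name` and `match_artifacts` are hypothetical helpers, the artifact list is made up for illustration, and shell `case` globbing only stands in for the pattern matching done by `actions/download-artifact`.

```shell
# Sketch only: helper names and sample values are illustrative assumptions.

# artifact_name SUFFIX PLATFORM: build the artifact name, using the same
# placeholders as the workflow expression (-core and single) for empty inputs.
artifact_name() {
  suffix=${1:--core}
  platform=${2:-single}
  printf 'digests-localai%s--%s\n' "$suffix" "$platform"
}

# Two sibling tag suffixes where one is a prefix of the other.
artifacts="$(artifact_name -nvidia-l4t-arm64 '')
$(artifact_name -nvidia-l4t-arm64-cuda-13 '')"

# match_artifacts PATTERN: print each artifact name matching the glob PATTERN.
match_artifacts() {
  pattern=$1
  for name in $artifacts; do
    case $name in
      # An unquoted variable used as a case pattern is matched as a glob.
      $pattern) printf '%s\n' "$name" ;;
    esac
  done
}

echo "without separator:"
match_artifacts 'digests-localai-nvidia-l4t-arm64*'
echo "with -- separator:"
match_artifacts 'digests-localai-nvidia-l4t-arm64--*'
```

Without the separator the shorter suffix also picks up its `-cuda-13` sibling. Anchoring the glob on `--` restricts the match to the intended artifact.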
--- a/.github/workflows/image_merge.yml
+++ b/.github/workflows/image_merge.yml
@@ -1,146 +0,0 @@
---
-name: 'merge LocalAI image manifest list (reusable)'
-
-# Reusable workflow that joins per-arch digest artifacts (uploaded by
-# image_build.yml when called with platform-tag) into a single tagged
-# multi-arch manifest list.
-
-on:
-  workflow_call:
-    inputs:
-      tag-latest:
-        description: 'Whether the manifest list should also be tagged latest (auto/false/true)'
-        required: false
-        type: string
-        default: ''
-      tag-suffix:
-        description: 'Image tag suffix (empty for core image). Used in artifact pattern with a -core placeholder for empty.'
-        required: true
-        type: string
-    secrets:
-      dockerUsername:
-        required: false
-      dockerPassword:
-        required: false
-      quayUsername:
-        required: true
-      quayPassword:
-        required: true
-
-jobs:
-  merge:
-    runs-on: ubuntu-latest
-    env:
-      quay_username: ${{ secrets.quayUsername }}
-    steps:
-      # Sparse checkout: needed for .github/scripts/ (the keepalive cleanup
-      # script). Skips the rest of the source tree.
-      - name: Checkout (.github/scripts only)
-        uses: actions/checkout@v7
-        with:
-          sparse-checkout: |
-            .github/scripts
-          sparse-checkout-cone-mode: false
-
-      - name: Download digests
-        uses: actions/download-artifact@v8
-        with:
-          # `--` separator anchors the glob so we don't over-match sibling
-          # tag-suffixes (e.g. -nvidia-l4t-arm64 vs -nvidia-l4t-arm64-cuda-13).
-          # Must stay in sync with image_build.yml's upload-artifact name.
-          pattern: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--*
-          merge-multiple: true
-          path: /tmp/digests
-
-      - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@master
-
-      - name: Login to DockerHub
-        if: github.event_name != 'pull_request'
-        uses: docker/login-action@v4
-        with:
-          username: ${{ secrets.dockerUsername }}
-          password: ${{ secrets.dockerPassword }}
-
-      - name: Login to Quay.io
-        uses: docker/login-action@v4
-        with:
-          registry: quay.io
-          username: ${{ secrets.quayUsername }}
-          password: ${{ secrets.quayPassword }}
-
-      - name: Docker meta
-        id: meta
-        uses: docker/metadata-action@v6
-        with:
-          images: |
-            quay.io/go-skynet/local-ai
-            localai/localai
-          tags: |
-            type=ref,event=branch
-            type=semver,pattern={{raw}}
-            type=sha
-            type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
-          flavor: |
-            latest=${{ inputs.tag-latest }}
-            suffix=${{ inputs.tag-suffix }},onlatest=true
-
-      # Source from ci-cache, not local-ai. See backend_merge.yml for the
-      # detailed rationale — quay's manifest GC is per-repository, so the
-      # untagged digest in local-ai gets reaped while the same content lives
-      # tagged under ci-cache (anchored by image_build.yml). buildx imagetools
-      # create copies the manifest into local-ai (blobs already cross-mounted)
-      # and publishes the manifest list with user-facing tags. End state in
-      # local-ai is self-contained; no embedded reference to ci-cache.
-      - name: Create manifest list and push (quay)
-        working-directory: /tmp/digests
-        run: |
-          set -euo pipefail
-          tags=$(jq -cr '.tags | map(select(startswith("quay.io/"))) | map("-t " + .) | join(" ")' <<< "$DOCKER_METADATA_OUTPUT_JSON")
-          if [ -z "$tags" ]; then
-            echo "No quay.io tags from docker/metadata-action; skipping quay merge"
-          else
-            # shellcheck disable=SC2086
-            docker buildx imagetools create $tags \
-              $(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
-          fi
-
-      - name: Create manifest list and push (dockerhub)
-        if: github.event_name != 'pull_request'
-        working-directory: /tmp/digests
-        run: |
-          set -euo pipefail
-          tags=$(jq -cr '.tags | map(select(startswith("localai/"))) | map("-t " + .) | join(" ")' <<< "$DOCKER_METADATA_OUTPUT_JSON")
-          if [ -z "$tags" ]; then
-            echo "No dockerhub tags from docker/metadata-action; skipping dockerhub merge"
-          else
-            # shellcheck disable=SC2086
-            docker buildx imagetools create $tags \
-              $(printf 'localai/localai@sha256:%s ' *)
-          fi
-
-      - name: Inspect manifest
-        run: |
-          set -euo pipefail
-          first_tag=$(jq -cr '.tags[0]' <<< "$DOCKER_METADATA_OUTPUT_JSON")
-          if [ -n "$first_tag" ] && [ "$first_tag" != "null" ]; then
-            docker buildx imagetools inspect "$first_tag"
-          fi
-
-      # See .github/scripts/cleanup-keepalive-tags.sh for the best-effort
-      # semantics — fails soft when the registry credential isn't OAuth-scoped.
-      - name: Cleanup keepalive tags in ci-cache
-        if: github.event_name != 'pull_request' && success()
-        env:
-          TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
-          QUAY_TOKEN: ${{ secrets.quayPassword }}
-        run: .github/scripts/cleanup-keepalive-tags.sh
-
-      - name: Job summary
-        run: |
-          set -euo pipefail
-          echo "Merged manifest tags:" >> "$GITHUB_STEP_SUMMARY"
-          jq -r '.tags[]' <<< "$DOCKER_METADATA_OUTPUT_JSON" | sed 's/^/- /' >> "$GITHUB_STEP_SUMMARY"
-          echo >> "$GITHUB_STEP_SUMMARY"
-          echo "Per-arch digests:" >> "$GITHUB_STEP_SUMMARY"
-          ls -1 /tmp/digests | sed 's/^/- sha256:/' >> "$GITHUB_STEP_SUMMARY"
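The removed build and merge workflows passed digests between jobs as empty files. The build side stripped the `sha256:` prefix and created a file named after the hex digest, and the merge side expanded the file names back into `repo@sha256:<hex>` references for `docker buildx imagetools create`. The round trip can be sketched as below. `record_digest` and `digest_refs` are hypothetical helper names, the digests are made-up placeholders, and a temporary directory stands in for `/tmp/digests`.

```shell
# Sketch only: helper names and digest values are illustrative assumptions.

digest_dir=$(mktemp -d)   # stands in for /tmp/digests

# record_digest DIGEST: store a digest as an empty file named after its hex part.
record_digest() {
  digest=$1
  # ${digest#sha256:} removes the leading "sha256:" prefix if present.
  touch "$digest_dir/${digest#sha256:}"
}

# digest_refs REPO: print one "REPO@sha256:<hex> " reference per recorded digest.
digest_refs() {
  repo=$1
  for file in "$digest_dir"/*; do
    # Skip the unexpanded pattern when the directory is empty.
    [ -e "$file" ] || continue
    # ${file##*/} keeps only the file name, i.e. the hex digest.
    printf '%s@sha256:%s ' "$repo" "${file##*/}"
  done
}

record_digest "sha256:aaa111"
record_digest "sha256:bbb222"

# Prints the references that would be appended to `imagetools create`.
digest_refs quay.io/go-skynet/ci-cache
echo
```

Because the file names carry all of the information, the artifact can stay tiny and several per-arch artifacts can be merged into one directory without conflicts, as long as the digests differ.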
--- a/.github/workflows/disabled/labeler.yml
+++ b/.github/workflows/disabled/labeler.yml
@@ -9,4 +9,4 @@ jobs:
      pull-requests: write
    runs-on: ubuntu-latest
    steps:
-    - uses: actions/labeler@v6
+    - uses: actions/labeler@v5
--- a/.github/workflows/lint.yml
+++ b/.github/workflows/lint.yml
@@ -1,48 +0,0 @@
---
-name: 'lint'
-
-on:
-  pull_request:
-    paths-ignore:
-      - 'docs/**'
-      - 'examples/**'
-      - 'README.md'
-      - '**/*.md'
-  push:
-    branches:
-      - master
-
-concurrency:
-  group: ci-lint-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-
-jobs:
-  golangci-lint:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v7
-        with:
-          # Full history so golangci-lint's new-from-merge-base can reach
-          # origin/master and compute the diff against it.
-          fetch-depth: 0
-      - uses: actions/setup-go@v5
-        with:
-          go-version: '1.26.x'
-          cache: false
-      - name: install golangci-lint
-        run: |
-          curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh \
-            | sh -s -- -b "$(go env GOPATH)/bin" v2.11.4
-      - name: generate grpc proto sources
-        # pkg/grpc/proto/*.go is generated, not checked in. Several packages
-        # import it, so without this step typecheck fails project-wide.
-        run: make protogen-go
-      - name: stub react-ui dist for go:embed
-        # core/http/app.go has //go:embed react-ui/dist/*; the glob needs at
-        # least one non-hidden entry to satisfy typecheck. We don't run
-        # `make react-ui` here because lint doesn't need the real bundle.
-        run: |
-          mkdir -p core/http/react-ui/dist
-          touch core/http/react-ui/dist/index.html
-      - name: lint
-        run: make lint
--- a/.github/workflows/disabled/localaibot_automerge.yml
+++ b/.github/workflows/disabled/localaibot_automerge.yml
@@ -6,15 +6,14 @@ permissions:
  contents: write
  pull-requests: write
  packages: read
-  issues: write # for Homebrew/actions/post-comment
-  actions: write # to dispatch publish workflow
+
 jobs:
  dependabot:
-    if: github.repository == 'mudler/LocalAI' && github.actor == 'localai-bot' && contains(github.event.pull_request.title, 'chore:')
    runs-on: ubuntu-latest
+    if: ${{ github.actor == 'localai-bot' }}
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v6
+        uses: actions/checkout@v4

      - name: Approve a PR if not already approved
        run: |
--- a/.github/workflows/notify-releases.yaml
+++ b/.github/workflows/notify-releases.yaml
@@ -1,65 +0,0 @@
-name: Release notifications
-on:
-  release:
-    types:
-      - published
-
-jobs:
-  notify-discord:
-    if: github.repository == 'mudler/LocalAI'
-    runs-on: ubuntu-latest
-    env:
-        RELEASE_BODY: ${{ github.event.release.body }}
-        RELEASE_TITLE: ${{ github.event.release.name }}
-        RELEASE_TAG_NAME: ${{ github.event.release.tag_name }}
-        MODEL_NAME: gemma-3-12b-it-qat
-    steps:
-    - uses: mudler/localai-github-action@v1
-      with:
-        model: 'gemma-3-12b-it-qat' # Any from models.localai.io, or from huggingface.com with: "huggingface://<repository>/file"
-    - name: Summarize
-      id: summarize
-      run: |
-            input="$RELEASE_TITLE\b$RELEASE_BODY"
-
-            # Define the LocalAI API endpoint
-            API_URL="http://localhost:8080/chat/completions"
-
-            # Create a JSON payload using jq to handle special characters
-            json_payload=$(jq -n --arg input "$input" '{
-            model: "'$MODEL_NAME'",
-            messages: [
-                {
-                role: "system",
-                content: "Write a discord message with a bullet point summary of the release notes."
-                },
-                {
-                role: "user",
-                content: $input
-                }
-            ]
-            }')
-
-            # Send the request to LocalAI API
-            response=$(curl -s -X POST $API_URL \
-            -H "Content-Type: application/json" \
-            -d "$json_payload")
-
-            # Extract the summary from the response
-            summary=$(echo $response | jq -r '.choices[0].message.content')
-
-            # Print the summary
-            #  -H "Authorization: Bearer $API_KEY" \
-            {
-                echo 'message<<EOF'
-                echo "$summary"
-                echo EOF
-              } >> "$GITHUB_OUTPUT"
-    - name: Discord notification
-      env:
-        DISCORD_WEBHOOK: ${{ secrets.DISCORD_WEBHOOK_URL_RELEASE }}
-        DISCORD_USERNAME: "LocalAI-Bot"
-        DISCORD_AVATAR: "https://avatars.githubusercontent.com/u/139863280?v=4"
-      uses: Ilshidur/action-discord@master
-      with:
-        args: ${{ steps.summarize.outputs.message }}
--- a/.github/workflows/realtime-conformance.yml
+++ b/.github/workflows/realtime-conformance.yml
@@ -1,69 +0,0 @@
---
-name: 'realtime-conformance'
-
-# Verifies the realtime state-machine implementations conform to their formal
-# designs (docs/design/realtime-state-machines.md, formal-verification/). BOTH
-# layers are enforced and the gate is fail-closed: the Go conformance layer
-# (respcoord + turncoord transition/rapid tests under -race) AND the FizzBee model check of
-# the authoritative specs. FizzBee is pinned + checksum-verified
-# (formal-verification/fizzbee.sha256), so a failed install fails the job rather
-# than silently skipping verification.
-
-on:
-  pull_request:
-    paths:
-      - 'core/http/endpoints/openai/coordinator/**'
-      - 'core/http/endpoints/openai/respcoord/**'
-      - 'core/http/endpoints/openai/turncoord/**'
-      - 'core/http/endpoints/openai/conncoord/**'
-      - 'core/http/endpoints/openai/compactcoord/**'
-      - 'core/http/endpoints/openai/ttscoord/**'
-      - 'formal-verification/**'
-      - 'scripts/realtime-conformance.sh'
-      - 'scripts/install-fizzbee.sh'
-      - '.github/workflows/realtime-conformance.yml'
-  push:
-    branches:
-      - master
-    paths:
-      - 'core/http/endpoints/openai/coordinator/**'
-      - 'core/http/endpoints/openai/respcoord/**'
-      - 'core/http/endpoints/openai/turncoord/**'
-      - 'core/http/endpoints/openai/conncoord/**'
-      - 'core/http/endpoints/openai/compactcoord/**'
-      - 'core/http/endpoints/openai/ttscoord/**'
-      - 'formal-verification/**'
-      - 'scripts/realtime-conformance.sh'
-
-concurrency:
-  group: realtime-conformance-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-
-jobs:
-  conformance:
-    runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        go-version: ['1.26.x']
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-      - name: Setup Go ${{ matrix.go-version }}
-        uses: actions/setup-go@v5
-        with:
-          go-version: ${{ matrix.go-version }}
-          cache: false
-      - name: Cache FizzBee
-        uses: actions/cache@v6
-        with:
-          path: .tools/fizzbee
-          key: fizzbee-v0.5.2-${{ runner.os }}-${{ hashFiles('formal-verification/fizzbee.sha256') }}
-      - name: Install FizzBee (pinned, checksum-verified)
-        # No `|| true`: a failed/forged download must fail the job, not silently
-        # drop the design verification. install-fizzbee.sh is a no-op if the
-        # cached binary is already present and valid.
-        run: ./scripts/install-fizzbee.sh
-      - name: Run conformance gate (fail-closed)
-        # No skip env: both the Go conformance and the FizzBee model check are
-        # required. The gate auto-detects .tools/fizzbee/fizz.
-        run: make test-realtime-conformance
--- a/.github/workflows/release.yaml
+++ b/.github/workflows/release.yaml
@@ -1,81 +1,218 @@
-name: goreleaser
+name: Build and Release

-on:
-  push:
-    tags:
-      - 'v*'
+on: 
+- push
+- pull_request
+
+env:
+  GRPC_VERSION: v1.58.0
+
+permissions:
+  contents: write
+
+concurrency:
+  group: ci-releases-${{ github.head_ref || github.ref }}-${{ github.repository }}
+  cancel-in-progress: true

 jobs:
-  goreleaser:
+  build-linux:
+    strategy:
+      matrix:
+        include:
+          - build: 'avx2'
+            defines: ''
+          - build: 'avx'
+            defines: '-DLLAMA_AVX2=OFF'
+          - build: 'avx512'
+            defines: '-DLLAMA_AVX512=ON'
+          - build: 'cuda12'
+            defines: ''
+          - build: 'cuda11'
+            defines: ''
    runs-on: ubuntu-latest
    steps:
-      - name: Checkout
-        uses: actions/checkout@v7
+      - name: Clone
+        uses: actions/checkout@v4
        with:
-          fetch-depth: 0
-      - name: Set up Go
-        uses: actions/setup-go@v5
+          submodules: true
+      - uses: actions/setup-go@v5
        with:
-          go-version: 1.23
-      - name: Run GoReleaser
-        uses: goreleaser/goreleaser-action@v7
-        with:
-          version: v2.11.0
-          args: release --clean
-        env:
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          MACOS_SIGN_P12: ${{ secrets.MACOS_CERTIFICATE }}
-          MACOS_SIGN_PASSWORD: ${{ secrets.MACOS_CERTIFICATE_PWD }}
-          MACOS_NOTARY_KEY: ${{ secrets.MACOS_NOTARY_KEY }}
-          MACOS_NOTARY_KEY_ID: ${{ secrets.MACOS_NOTARY_KEY_ID }}
-          MACOS_NOTARY_ISSUER_ID: ${{ secrets.MACOS_NOTARY_ISSUER_ID }}
-  launcher-build-darwin:
-    runs-on: macos-latest
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v7
-        with:
-          fetch-depth: 0
-      - name: Set up Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: 1.23
-      - name: Import signing certificate
-        env:
-          MACOS_CERTIFICATE: ${{ secrets.MACOS_CERTIFICATE }}
-          MACOS_CERTIFICATE_PWD: ${{ secrets.MACOS_CERTIFICATE_PWD }}
-          MACOS_CI_KEYCHAIN_PWD: ${{ secrets.MACOS_CI_KEYCHAIN_PWD }}
-        run: bash contrib/macos/sign-and-notarize.sh import-cert
-      - name: Build, sign and notarize the DMG
-        env:
-          MACOS_SIGN_IDENTITY: ${{ secrets.MACOS_SIGN_IDENTITY }}
-          MACOS_NOTARY_KEY: ${{ secrets.MACOS_NOTARY_KEY }}
-          MACOS_NOTARY_KEY_ID: ${{ secrets.MACOS_NOTARY_KEY_ID }}
-          MACOS_NOTARY_ISSUER_ID: ${{ secrets.MACOS_NOTARY_ISSUER_ID }}
-        run: make release-launcher-darwin
-      - name: Upload DMG to Release
-        uses: softprops/action-gh-release@v3
-        with:
-          files: ./dist/LocalAI.dmg
-  launcher-build-linux:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v7
-        with:
-          fetch-depth: 0
-      - name: Configure apt mirror on runner
-        uses: ./.github/actions/configure-apt-mirror
-      - name: Set up Go
-        uses: actions/setup-go@v5
-        with:
-          go-version: 1.23
-      - name: Build launcher for Linux
+          go-version: '1.21.x'
+          cache: false
+      - name: Dependencies
        run: |
          sudo apt-get update
-          sudo apt-get install golang gcc libgl1-mesa-dev xorg-dev libxkbcommon-dev
-          make build-launcher-linux
-      - name: Upload Linux launcher artifacts
-        uses: softprops/action-gh-release@v3
+          sudo apt-get install build-essential ffmpeg protobuf-compiler
+      - name: Install CUDA Dependencies
+        if: ${{ matrix.build == 'cuda12' || matrix.build == 'cuda11' }}
+        run: |
+          if [ "${{ matrix.build }}" == "cuda12" ]; then
+            export CUDA_VERSION=12-3
+          else
+            export CUDA_VERSION=11-7
+          fi
+          curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
+          sudo dpkg -i cuda-keyring_1.1-1_all.deb
+          sudo apt-get update
+          sudo apt-get install -y cuda-nvcc-${CUDA_VERSION} libcublas-dev-${CUDA_VERSION}
+      - name: Cache grpc
+        id: cache-grpc
+        uses: actions/cache@v4
        with:
-          files: ./local-ai-launcher-linux.tar.xz
+          path: grpc
+          key: ${{ runner.os }}-grpc-${{ env.GRPC_VERSION }}
+      - name: Build grpc
+        if: steps.cache-grpc.outputs.cache-hit != 'true'
+        run: |
+          git clone --recurse-submodules -b ${{ env.GRPC_VERSION }} --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
+          cd grpc && mkdir -p cmake/build && cd cmake/build && cmake -DgRPC_INSTALL=ON \
+            -DgRPC_BUILD_TESTS=OFF \
+            ../.. && sudo make --jobs 5 --output-sync=target
+      - name: Install gRPC
+        run: |
+          cd grpc && cd cmake/build && sudo make --jobs 5 --output-sync=target install
+      - name: Build
+        id: build
+        env:
+          CMAKE_ARGS: "${{ matrix.defines }}"
+          BUILD_ID: "${{ matrix.build }}"
+        run: |
+          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest
+          go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
+          export PATH=$PATH:$GOPATH/bin
+          if [ "${{ matrix.build }}" == "cuda12" ] || [ "${{ matrix.build }}" == "cuda11" ]; then
+            export BUILD_TYPE=cublas
+            export PATH=/usr/local/cuda/bin:$PATH
+            make dist
+          else
+            STATIC=true make dist
+          fi
+      - uses: actions/upload-artifact@v4
+        with:
+          name: LocalAI-linux-${{ matrix.build }}
+          path: release/
+      - name: Release
+        uses: softprops/action-gh-release@v2
+        if: startsWith(github.ref, 'refs/tags/')
+        with:
+          files: |
+            release/*
+
+  build-stablediffusion:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Clone
+        uses: actions/checkout@v4
+        with:
+          submodules: true
+      - uses: actions/setup-go@v5
+        with:
+          go-version: '1.21.x'
+          cache: false
+      - name: Dependencies
+        run: |
+          sudo apt-get install -y --no-install-recommends libopencv-dev protobuf-compiler
+          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest
+          go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
+      - name: Build stablediffusion
+        run: |
+          export PATH=$PATH:$GOPATH/bin
+          make backend-assets/grpc/stablediffusion
+          mkdir -p release && cp backend-assets/grpc/stablediffusion release
+      - uses: actions/upload-artifact@v4
+        with:
+          name: stablediffusion
+          path: release/
+
+  build-macOS:
+    strategy:
+      matrix:
+        include:
+          - build: 'avx2'
+            defines: ''
+          - build: 'avx'
+            defines: '-DLLAMA_AVX2=OFF'
+          - build: 'avx512'
+            defines: '-DLLAMA_AVX512=ON'
+    runs-on: macOS-latest
+    steps:
+      - name: Clone
+        uses: actions/checkout@v4
+        with:
+          submodules: true
+      - uses: actions/setup-go@v5
+        with:
+          go-version: '1.21.x'
+          cache: false
+      - name: Dependencies
+        run: |
+          brew install protobuf grpc
+          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest
+          go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
+      - name: Build
+        id: build
+        env:
+          CMAKE_ARGS: "${{ matrix.defines }}"
+          BUILD_ID: "${{ matrix.build }}"
+        run: |
+          export C_INCLUDE_PATH=/usr/local/include
+          export CPLUS_INCLUDE_PATH=/usr/local/include
+          export PATH=$PATH:$GOPATH/bin
+          make dist
+      - uses: actions/upload-artifact@v4
+        with:
+          name: LocalAI-MacOS-${{ matrix.build }}
+          path: release/
+      - name: Release
+        uses: softprops/action-gh-release@v2
+        if: startsWith(github.ref, 'refs/tags/')
+        with:
+          files: |
+            release/*
+
+
+  build-macOS-arm64:
+    strategy:
+      matrix:
+        include:
+          - build: 'avx2'
+            defines: ''
+          - build: 'avx'
+            defines: '-DLLAMA_AVX2=OFF'
+          - build: 'avx512'
+            defines: '-DLLAMA_AVX512=ON'
+    runs-on: macos-14
+    steps:
+      - name: Clone
+        uses: actions/checkout@v4
+        with:
+          submodules: true
+      - uses: actions/setup-go@v5
+        with:
+          go-version: '1.21.x'
+          cache: false
+      - name: Dependencies
+        run: |
+          brew install protobuf grpc
+          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest
+          go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
+      - name: Build
+        id: build
+        env:
+          CMAKE_ARGS: "${{ matrix.defines }}"
+          BUILD_ID: "${{ matrix.build }}"
+        run: |
+          export C_INCLUDE_PATH=/usr/local/include
+          export CPLUS_INCLUDE_PATH=/usr/local/include
+          export PATH=$PATH:$GOPATH/bin
+          make dist
+      - uses: actions/upload-artifact@v4
+        with:
+          name: LocalAI-MacOS-arm64-${{ matrix.build }}
+          path: release/
+      - name: Release
+        uses: softprops/action-gh-release@v2
+        if: startsWith(github.ref, 'refs/tags/')
+        with:
+          files: |
+            release/*
--- a/.github/workflows/secscan.yaml
+++ b/.github/workflows/secscan.yaml
@@ -14,20 +14,17 @@ jobs:
      GO111MODULE: on
    steps:
      - name: Checkout Source
-        uses: actions/checkout@v7
+        uses: actions/checkout@v4
        if: ${{ github.actor != 'dependabot[bot]' }}
      - name: Run Gosec Security Scanner
        if: ${{ github.actor != 'dependabot[bot]' }}
-        uses: securego/gosec@v2.27.1
+        uses: securego/gosec@master
        with:
          # we let the report trigger content trigger a failure using the GitHub Security features.
-          # backend/go/supertonic is excluded: it vendors upstream supertone-inc/supertonic
-          # (helper.go), whose findings (G304 model-file loads, G404 math/rand for flow-matching
-          # noise, G104 unhandled errors) are inherent to that upstream code, not ours to rewrite.
-          args: '-no-fail -exclude-dir=backend/go/supertonic -fmt sarif -out results.sarif ./...'
+          args: '-no-fail -fmt sarif -out results.sarif ./...'
      - name: Upload SARIF file
        if: ${{ github.actor != 'dependabot[bot]' }}
-        uses: github/codeql-action/upload-sarif@v4
+        uses: github/codeql-action/upload-sarif@v3
        with:
          # Path to SARIF file relative to the root of the repository
          sarif_file: results.sarif
--- a/.github/workflows/stalebot.yml
+++ b/.github/workflows/stalebot.yml
@@ -1,25 +0,0 @@
-name: 'Close stale issues and PRs'
-permissions:
-  issues: write
-  pull-requests: write
-on:
-  schedule:
-    - cron: '30 1 * * *'
-
-jobs:
-  stale:
-    if: github.repository == 'mudler/LocalAI'
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/stale@eb5cf3af3ac0a1aa4c9c45633dd1ae542a27a899 # v9
-        with:
-          stale-issue-message: 'This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.'
-          stale-pr-message: 'This PR is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 10 days.'
-          close-issue-message: 'This issue was closed because it has been stalled for 5 days with no activity.'
-          close-pr-message: 'This PR was closed because it has been stalled for 10 days with no activity.'
-          days-before-issue-stale: 90
-          days-before-pr-stale: 90
-          days-before-issue-close: 5
-          days-before-pr-close: 10
-          exempt-issue-labels: 'roadmap'
-          exempt-pr-labels: 'roadmap'
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -9,23 +9,56 @@ on:
    tags:
      - '*'

+env:
+  GRPC_VERSION: v1.58.0
+
 concurrency:
-  group: ci-tests-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+  group: ci-tests-${{ github.head_ref || github.ref }}-${{ github.repository }}
+  cancel-in-progress: true

 jobs:
  tests-linux:
    runs-on: ubuntu-latest
    strategy:
      matrix:
-        go-version: ['1.26.x']
+        go-version: ['1.21.x']
    steps:
+      - name: Release space from worker
+        run: |
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          df -h
+          echo
+          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
+          sudo apt-get remove --auto-remove android-sdk-platform-tools || true
+          sudo apt-get purge --auto-remove android-sdk-platform-tools || true
+          sudo rm -rf /usr/local/lib/android
+          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
+          sudo rm -rf /usr/share/dotnet
+          sudo apt-get remove -y '^mono-.*' || true
+          sudo apt-get remove -y '^ghc-.*' || true
+          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
+          sudo apt-get remove -y 'php.*' || true
+          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
+          sudo apt-get remove -y '^google-.*' || true
+          sudo apt-get remove -y azure-cli || true
+          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
+          sudo apt-get remove -y '^gfortran-.*' || true
+          sudo apt-get autoremove -y
+          sudo apt-get clean
+          echo
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          sudo rm -rfv build || true
+          df -h
      - name: Clone
-        uses: actions/checkout@v7
-        with:
+        uses: actions/checkout@v4
+        with: 
          submodules: true
-      - name: Free disk space
-        uses: ./.github/actions/free-disk-space
      - name: Setup Go ${{ matrix.go-version }}
        uses: actions/setup-go@v5
        with:
@@ -34,58 +67,130 @@ jobs:
      # You can test your matrix by printing the current Go version
      - name: Display Go version
        run: go version
-      - name: Proto Dependencies
-        run: |
-          # Install protoc
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
      - name: Dependencies
        run: |
          sudo apt-get update
-          sudo apt-get install curl ffmpeg libopus-dev
-      - name: Setup Node.js
-        uses: actions/setup-node@v6
+          sudo apt-get install build-essential curl ffmpeg
+          curl https://repo.anaconda.com/pkgs/misc/gpgkeys/anaconda.asc | gpg --dearmor > conda.gpg && \
+             sudo install -o root -g root -m 644 conda.gpg /usr/share/keyrings/conda-archive-keyring.gpg && \
+             gpg --keyring /usr/share/keyrings/conda-archive-keyring.gpg --no-default-keyring --fingerprint 34161F5BF5EB1D4BFBBB8F0A8AEB4F8B29D82806 && \
+             sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" > /etc/apt/sources.list.d/conda.list' && \
+             sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" | tee -a /etc/apt/sources.list.d/conda.list' && \
+             sudo apt-get update && \
+             sudo apt-get install -y conda
+          sudo apt-get install -y ca-certificates cmake patch python3-pip unzip
+          sudo apt-get install -y libopencv-dev
+
+          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
+          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
+          rm protoc.zip
+
+          go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
+          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest
+
+          # The python3-grpc-tools package in 22.04 is too old
+          pip install --user grpcio-tools
+
+          sudo rm -rfv /usr/bin/conda || true
+          PATH=$PATH:/opt/conda/bin make -C backend/python/sentencetransformers
+
+          # Pre-build piper before we start tests in order to have shared libraries in place
+          make sources/go-piper && \
+          GO_TAGS="tts" make -C sources/go-piper piper.o && \
+          sudo cp -rfv sources/go-piper/piper-phonemize/pi/lib/. /usr/lib/ && \
+          # Pre-build stable diffusion before we install a newer version of abseil (not compatible with stablediffusion-ncnn)
+          PATH="$PATH:/root/go/bin" GO_TAGS="stablediffusion tts" GRPC_BACKENDS=backend-assets/grpc/stablediffusion make build
+      - name: Cache grpc
+        id: cache-grpc
+        uses: actions/cache@v4
        with:
-          node-version: '22'
-      - name: Build React UI
-        run: make react-ui
-      # Runs the core suite with coverage and fails if total coverage dropped
-      # below the committed baseline (coverage-baseline.txt). The gate is
-      # strict — any decrease fails. Raise the baseline with
-      # `make test-coverage-baseline` and commit it when coverage rises.
-      - name: Test (with coverage gate)
+          path: grpc
+          key: ${{ runner.os }}-grpc-${{ env.GRPC_VERSION }}
+      - name: Build grpc
+        if: steps.cache-grpc.outputs.cache-hit != 'true'
        run: |
-          PATH="$PATH:/root/go/bin" make --jobs 5 --output-sync=target test-coverage-check
-      - name: Upload coverage report
-        if: ${{ always() }}
-        uses: actions/upload-artifact@v4
-        with:
-          name: coverage-linux
-          path: |
-            coverage/coverage.out
-            coverage/coverage.html
-          if-no-files-found: ignore
+          git clone --recurse-submodules -b ${{ env.GRPC_VERSION }} --depth 1 --jobs 5 --shallow-submodules https://github.com/grpc/grpc && \
+          cd grpc && mkdir -p cmake/build && cd cmake/build && cmake -DgRPC_INSTALL=ON \
+            -DgRPC_BUILD_TESTS=OFF \
+            ../.. && sudo make --jobs 5
+      - name: Install gRPC
+        run: |
+          cd grpc && cd cmake/build && sudo make --jobs 5 install
+      - name: Test
+        run: |
+          PATH="$PATH:/root/go/bin" GO_TAGS="stablediffusion tts" make --jobs 5 --output-sync=target test
      - name: Setup tmate session if tests fail
        if: ${{ failure() }}
-        uses: mxschmitt/action-tmate@v3.23
+        uses: mxschmitt/action-tmate@v3.18
+        with:
+          detached: true
+          connect-timeout-seconds: 180
+          limit-access-to-actor: true
+
+  tests-aio-container:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Release space from worker
+        run: |
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          df -h
+          echo
+          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
+          sudo apt-get remove --auto-remove android-sdk-platform-tools || true
+          sudo apt-get purge --auto-remove android-sdk-platform-tools || true
+          sudo rm -rf /usr/local/lib/android
+          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
+          sudo rm -rf /usr/share/dotnet
+          sudo apt-get remove -y '^mono-.*' || true
+          sudo apt-get remove -y '^ghc-.*' || true
+          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
+          sudo apt-get remove -y 'php.*' || true
+          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
+          sudo apt-get remove -y '^google-.*' || true
+          sudo apt-get remove -y azure-cli || true
+          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
+          sudo apt-get remove -y '^gfortran-.*' || true
+          sudo apt-get autoremove -y
+          sudo apt-get clean
+          echo
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          sudo rm -rfv build || true
+          df -h
+      - name: Clone
+        uses: actions/checkout@v4
+        with: 
+          submodules: true
+      - name: Build images
+        run: |
+          docker build --build-arg FFMPEG=true --build-arg IMAGE_TYPE=core --build-arg MAKEFLAGS="--jobs=5 --output-sync=target" -t local-ai:tests -f Dockerfile .
+          BASE_IMAGE=local-ai:tests DOCKER_AIO_IMAGE=local-ai-aio:test make docker-aio
+      - name: Test
+        run: |
+          LOCALAI_MODELS_DIR=$PWD/models LOCALAI_IMAGE_TAG=test LOCALAI_IMAGE=local-ai-aio \
+            make run-e2e-aio
+      - name: Setup tmate session if tests fail
+        if: ${{ failure() }}
+        uses: mxschmitt/action-tmate@v3.18
        with:
          detached: true
          connect-timeout-seconds: 180
          limit-access-to-actor: true

  tests-apple:
-    runs-on: macos-latest
+    runs-on: macOS-14
    strategy:
      matrix:
-        go-version: ['1.26.x']
+        go-version: ['1.21.x']
    steps:
      - name: Clone
-        uses: actions/checkout@v7
-        with:
+        uses: actions/checkout@v4
+        with: 
          submodules: true
      - name: Setup Go ${{ matrix.go-version }}
        uses: actions/setup-go@v5
@@ -97,43 +202,19 @@ jobs:
        run: go version
      - name: Dependencies
        run: |
-          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm opus ffmpeg
-          pip install --user --no-cache-dir grpcio-tools grpcio
-      - name: Setup Node.js
-        uses: actions/setup-node@v6
-        with:
-          node-version: '22'
-      - name: Build React UI
-        run: make react-ui
+          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc
+          pip install --user grpcio-tools
      - name: Test
        run: |
          export C_INCLUDE_PATH=/usr/local/include
          export CPLUS_INCLUDE_PATH=/usr/local/include
-          export CC=/opt/homebrew/opt/llvm/bin/clang
          # Used to run the newer GNUMake version from brew that supports --output-sync
          export PATH="/opt/homebrew/opt/make/libexec/gnubin:$PATH"
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-          PATH="$PATH:$HOME/go/bin" BUILD_TYPE="GITHUB_CI_HAS_BROKEN_METAL" CMAKE_ARGS="-DGGML_F16C=OFF -DGGML_AVX512=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF" make --jobs 4 --output-sync=target test
+          BUILD_TYPE="GITHUB_CI_HAS_BROKEN_METAL" CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF" make --jobs 4 --output-sync=target test
      - name: Setup tmate session if tests fail
        if: ${{ failure() }}
-        uses: mxschmitt/action-tmate@v3.23
+        uses: mxschmitt/action-tmate@v3.18
        with:
          detached: true
          connect-timeout-seconds: 180
          limit-access-to-actor: true
-
-  # Fast standalone unit tests for the backends' pure C++ helpers - currently the
-  # llama-cpp message reconstruction (backend/cpp/llama-cpp/message_content.h),
-  # which guards the OpenAI chat content normalization (mudler/LocalAI#10524,
-  # #7324, #7528). The runner discovers every *_test.cpp under backend/cpp/, so
-  # new pure-C++ unit tests are picked up with no CI changes. These need only the
-  # C++ stdlib + nlohmann/json, so they run on every PR without the full
-  # llama.cpp + gRPC backend build. (The same suite is also wired as an opt-in
-  # CMake/ctest target, -DLLAMA_GRPC_BUILD_TESTS=ON, for in-backend-build runs.)
-  tests-backend-cpp:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-      - name: Run backend C++ unit tests
-        run: make test-backend-cpp
--- a/.github/workflows/tests-aio.yml
+++ b/.github/workflows/tests-aio.yml
@@ -1,86 +0,0 @@
---
-name: 'tests-aio'
-
-# Runs the all-in-one (AIO) Docker image with real backends + real models.
-# Heavy: builds llama-cpp/whisper/piper/silero-vad/stablediffusion-ggml/local-store
-# and exercises end-to-end inference inside the container. Moved out of test.yml
-# (which used to run on every PR) so PR CI no longer pays this cost.
-#
-# Triggers:
-#   - schedule (nightly @ 04:00 UTC) — catches packaging/image regressions within 24h
-#   - workflow_dispatch — manual run on-demand
-#   - push to master/tags — sanity check after merge / before release
-
-on:
-  schedule:
-    - cron: '0 4 * * *'
-  workflow_dispatch:
-  push:
-    branches:
-      - master
-    tags:
-      - '*'
-
-concurrency:
-  group: ci-tests-aio-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-
-jobs:
-  tests-aio:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Release space from worker
-        run: |
-          echo "Listing top largest packages"
-          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
-          head -n 30 <<< "${pkgs}"
-          echo
-          df -h
-          echo
-          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
-          sudo apt-get remove --auto-remove android-sdk-platform-tools || true
-          sudo apt-get purge --auto-remove android-sdk-platform-tools || true
-          sudo rm -rf /usr/local/lib/android
-          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
-          sudo rm -rf /usr/share/dotnet
-          sudo apt-get remove -y '^mono-.*' || true
-          sudo apt-get remove -y '^ghc-.*' || true
-          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
-          sudo apt-get remove -y 'php.*' || true
-          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
-          sudo apt-get remove -y '^google-.*' || true
-          sudo apt-get remove -y azure-cli || true
-          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
-          sudo apt-get remove -y '^gfortran-.*' || true
-          sudo apt-get autoremove -y
-          sudo apt-get clean
-          echo
-          echo "Listing top largest packages"
-          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
-          head -n 30 <<< "${pkgs}"
-          echo
-          sudo rm -rfv build || true
-          df -h
-      - name: Clone
-        uses: actions/checkout@v7
-        with:
-          submodules: true
-      - name: Dependencies
-        run: |
-          # Install protoc
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Test
-        run: |
-            PATH="$PATH:$HOME/go/bin" make backends/local-store backends/silero-vad backends/llama-cpp backends/whisper backends/piper backends/stablediffusion-ggml docker-build-e2e e2e-aio
-      - name: Setup tmate session if tests fail
-        if: ${{ failure() }}
-        uses: mxschmitt/action-tmate@v3.23
-        with:
-          detached: true
-          connect-timeout-seconds: 180
-          limit-access-to-actor: true
--- a/.github/workflows/tests-e2e.yml
+++ b/.github/workflows/tests-e2e.yml
@@ -1,64 +0,0 @@
---
-name: 'E2E Backend Tests'
-
-on:
-  pull_request:
-  push:
-    branches:
-      - master
-    tags:
-      - '*'
-
-concurrency:
-  group: ci-tests-e2e-backend-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-
-jobs:
-  tests-e2e-backend:
-    runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        go-version: ['1.25.x']
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-        with:
-          submodules: true
-      - name: Configure apt mirror on runner
-        uses: ./.github/actions/configure-apt-mirror
-      - name: Setup Go ${{ matrix.go-version }}
-        uses: actions/setup-go@v5
-        with:
-          go-version: ${{ matrix.go-version }}
-          cache: false
-      - name: Display Go version
-        run: go version
-      - name: Proto Dependencies
-        run: |
-          # Install protoc
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y build-essential libopus-dev
-      - name: Setup Node.js
-        uses: actions/setup-node@v6
-        with:
-          node-version: '22'
-      - name: Build React UI
-        run: make react-ui
-      - name: Test Backend E2E
-        run: |
-          PATH="$PATH:$HOME/go/bin" make build-mock-backend test-e2e
-      - name: Setup tmate session if tests fail
-        if: ${{ failure() }}
-        uses: mxschmitt/action-tmate@v3.23
-        with:
-          detached: true
-          connect-timeout-seconds: 180
-          limit-access-to-actor: true
--- a/.github/workflows/tests-pii-ner-e2e.yml
+++ b/.github/workflows/tests-pii-ner-e2e.yml
@@ -1,97 +0,0 @@
---
-name: 'PII NER tier E2E (live GGUF, CPU)'
-
-# Runs the real privacy-filter GGUF NER tier end-to-end on CPU — the gap the
-# hermetic tests/e2e suite cannot cover (it only exercises the in-process
-# pattern tier). Heavy (builds the C++ backend image + downloads a ~2.7 GB
-# GGUF), so it is path-filtered on PRs and otherwise runs nightly / on demand.
-#
-# This drives the container-level harness (tests/e2e-backends) via
-# `make test-extra-backend-privacy-filter`: it builds the privacy-filter image,
-# downloads the model, loads it on CPU, and asserts byte-correct, UTF-8-aligned
-# TokenClassify spans. The complementary HTTP-path specs in tests/e2e
-# (e2e_pii_ner_test.go) Skip unless PII_NER_MODEL_GGUF is wired.
-
-on:
-  workflow_dispatch:
-  schedule:
-    - cron: '0 3 * * *'
-  push:
-    branches:
-      - master
-    paths:
-      - 'backend/cpp/privacy-filter/**'
-      - 'backend/Dockerfile.privacy-filter'
-      - 'core/services/routing/pii/**'
-      - 'core/services/routing/piidetector/**'
-      - 'core/backend/token_classify.go'
-      - 'core/http/endpoints/localai/pii.go'
-      - 'core/schema/pii.go'
-      - 'tests/e2e-backends/**'
-      - 'tests/e2e/e2e_pii_ner_test.go'
-      - 'tests/e2e/e2e_suite_test.go'
-      - '.github/workflows/tests-pii-ner-e2e.yml'
-  pull_request:
-    paths:
-      - 'backend/cpp/privacy-filter/**'
-      - 'backend/Dockerfile.privacy-filter'
-      - 'core/services/routing/pii/**'
-      - 'core/services/routing/piidetector/**'
-      - 'core/backend/token_classify.go'
-      - 'core/http/endpoints/localai/pii.go'
-      - 'core/schema/pii.go'
-      - 'tests/e2e-backends/**'
-      - 'tests/e2e/e2e_pii_ner_test.go'
-      - 'tests/e2e/e2e_suite_test.go'
-      - '.github/workflows/tests-pii-ner-e2e.yml'
-
-concurrency:
-  group: ci-tests-pii-ner-e2e-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-
-jobs:
-  tests-pii-ner-e2e:
-    runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        go-version: ['1.25.x']
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-        with:
-          submodules: true
-      - name: Free disk space
-        run: |
-          sudo rm -rf /usr/share/dotnet /usr/local/lib/android /opt/ghc /opt/hostedtoolcache/CodeQL || true
-          sudo docker image prune --all --force || true
-          df -h
-      - name: Configure apt mirror on runner
-        uses: ./.github/actions/configure-apt-mirror
-      - name: Setup Go ${{ matrix.go-version }}
-        uses: actions/setup-go@v5
-        with:
-          go-version: ${{ matrix.go-version }}
-          cache: false
-      - name: Proto Dependencies
-        run: |
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y build-essential
-      # Builds local-ai-backend:privacy-filter, downloads the GGUF, loads it on
-      # CPU and runs the token_classify capability spec (byte-offset contract).
-      - name: Run live PII NER backend E2E
-        run: PATH="$PATH:$HOME/go/bin" make test-extra-backend-privacy-filter
-      - name: Setup tmate session if tests fail
-        if: ${{ failure() }}
-        uses: mxschmitt/action-tmate@v3.23
-        with:
-          detached: true
-          connect-timeout-seconds: 180
-          limit-access-to-actor: true
--- a/.github/workflows/tests-ui-e2e.yml
+++ b/.github/workflows/tests-ui-e2e.yml
@@ -1,82 +0,0 @@
---
-name: 'UI E2E Tests'
-
-on:
-  pull_request:
-    paths:
-      - 'core/http/**'
-      - 'tests/e2e-ui/**'
-      - 'tests/e2e/mock-backend/**'
-  push:
-    branches:
-      - master
-
-concurrency:
-  group: ci-tests-ui-e2e-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
-
-jobs:
-  tests-ui-e2e:
-    runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        go-version: ['1.26.x']
-    steps:
-      - name: Clone
-        uses: actions/checkout@v7
-        with:
-          submodules: true
-      - name: Configure apt mirror on runner
-        uses: ./.github/actions/configure-apt-mirror
-      - name: Setup Go ${{ matrix.go-version }}
-        uses: actions/setup-go@v5
-        with:
-          go-version: ${{ matrix.go-version }}
-          cache: false
-      - name: Setup Node.js
-        uses: actions/setup-node@v6
-        with:
-          node-version: '22'
-      - name: Setup Bun
-        uses: oven-sh/setup-bun@v2
-        with:
-          bun-version: '1.3.11'
-      - name: Proto Dependencies
-        run: |
-          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
-          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
-          rm protoc.zip
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-      - name: System Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install -y build-essential libopus-dev
-      # Builds an instrumented UI bundle, runs the Playwright specs, and fails
-      # if line coverage regressed beyond the jitter tolerance (the gate is
-      # in `make test-ui-coverage-check`). PLAYWRIGHT_CHROMIUM_PATH is unset
-      # here, so scripts/ensure-playwright-browser.sh installs Chromium via apt.
-      - name: Run UI e2e + coverage gate
-        run: PATH="$PATH:$HOME/go/bin" make test-ui-coverage-check
-      - name: Upload Playwright report
-        if: ${{ failure() }}
-        uses: actions/upload-artifact@v7
-        with:
-          name: playwright-report
-          path: core/http/react-ui/playwright-report/
-          retention-days: 7
-      - name: Upload UI coverage report
-        if: ${{ always() }}
-        uses: actions/upload-artifact@v7
-        with:
-          name: ui-coverage
-          path: core/http/react-ui/coverage/
-          if-no-files-found: ignore
-          retention-days: 7
-      - name: Setup tmate session if tests fail
-        if: ${{ failure() }}
-        uses: mxschmitt/action-tmate@v3.23
-        with:
-          detached: true
-          connect-timeout-seconds: 180
-          limit-access-to-actor: true
--- a/.github/workflows/update_swagger.yaml
+++ b/.github/workflows/update_swagger.yaml
@@ -5,30 +5,21 @@ on:
  workflow_dispatch:
 jobs:
  swagger:
-    if: github.repository == 'mudler/LocalAI'
    strategy:
      fail-fast: false
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v7
-      - name: Configure apt mirror on runner
-        uses: ./.github/actions/configure-apt-mirror
+      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: 'stable'
-      - name: Dependencies
-        run: |
-          sudo apt-get update
-          sudo apt-get install protobuf-compiler
      - run: |
          go install github.com/swaggo/swag/cmd/swag@latest
-          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
-          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
      - name: Bump swagger 🔧
        run: |
-          make protogen-go swagger
+          make swagger
      - name: Create Pull Request
-        uses: peter-evans/create-pull-request@v8
+        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ secrets.UPDATE_BOT_TOKEN }}
          push-to-fork: ci-forks/LocalAI
--- a/.github/workflows/yaml-check.yml
+++ b/.github/workflows/yaml-check.yml
@@ -8,7 +8,7 @@ jobs:
    steps:
      - name: 'Checkout'
        uses: actions/checkout@master
-      - name: 'Yamllint model gallery'
+      - name: 'Yamllint'
        uses: karancode/yamllint-github-action@master
        with:
          yamllint_file_or_dir: 'gallery'
@@ -16,11 +16,3 @@ jobs:
          yamllint_comment: true
        env:
          GITHUB_ACCESS_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-      - name: 'Yamllint Backend gallery'
-        uses: karancode/yamllint-github-action@master
-        with:
-          yamllint_file_or_dir: 'backend'
-          yamllint_strict: false
-          yamllint_comment: true
-        env:
-          GITHUB_ACCESS_TOKEN: ${{ secrets.GITHUB_TOKEN }}
--- a/.gitignore
+++ b/.gitignore
@@ -2,34 +2,21 @@
 /sources/
 __pycache__/
 *.a
-*.o
 get-sources
 prepare-sources
-/backend/cpp/llama-cpp/grpc-server
-/backend/cpp/llama-cpp/llama.cpp
-/backend/cpp/llama-*
-!backend/cpp/llama-cpp
-/backends
-/backend-images
-/result.yaml
-protoc
-
-*.log
+/backend/cpp/llama/grpc-server
+/backend/cpp/llama/llama.cpp

 go-ggml-transformers
 go-gpt2
+go-rwkv
 whisper.cpp
 /bloomz
 go-bert

 # LocalAI build binary
 LocalAI
-/local-ai
-/local-ai-launcher
-# Root-level build artifacts when running `go build ./...` against
-# Go backend packages whose main lives under backend/go/.
-/cloud-proxy
-/local-store
+local-ai
 # prevent above rules from omitting the helm chart
 !charts/*
 # prevent above rules from omitting the api/localai folder
@@ -40,8 +27,6 @@ LocalAI
 models/*
 test-models/
 test-dir/
-tests/e2e-aio/backends
-mock-backend

 release/

@@ -54,7 +39,6 @@ backend-assets/*
 !backend-assets/.keep
 prepare
 /ggml-metal.metal
-docs/static/gallery.html

 # Protobuf generated files
 *.pb.go
@@ -62,47 +46,4 @@ docs/static/gallery.html
 *pb2_grpc.py

 # SonarQube
-.scannerwork
-
-# backend virtual environments
-**/venv
-
-# per-developer customization files for the development container
-.devcontainer/customization/*
-
-# Coverage profiles (the committed baseline is coverage-baseline.txt)
-/coverage/
-
-# React UI build artifacts (keep placeholder dist/index.html)
-core/http/react-ui/node_modules/
-core/http/react-ui/dist
-
-# React UI coverage (vite-plugin-istanbul + nyc, via `make test-ui-coverage`)
-core/http/react-ui/.nyc_output/
-core/http/react-ui/coverage/
-
-# Extracted backend binaries for container-based testing
-local-backends/
-
-# UI E2E test artifacts
-tests/e2e-ui/ui-test-server
-core/http/react-ui/playwright-report/
-core/http/react-ui/test-results/
-
-# Local worktrees
-.worktrees/
-
-# SDD / brainstorm scratch (agent-driven development)
-.superpowers/
-
-# Local Apple signing material (never commit)
-.certs/
-
-# Pinned dev tools (e.g. FizzBee for the realtime-conformance gate)
-.tools/
-
-# FizzBee model-check artifacts: the parser emits <spec>.json next to each
-# .fizz and the checker writes run dirs under out/. Both are regenerated by
-# the realtime-conformance gate; only the .fizz sources are authoritative.
-formal-verification/*.json
-formal-verification/out/
+.scannerwork
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,6 +1,6 @@
 [submodule "docs/themes/hugo-theme-relearn"]
 	path = docs/themes/hugo-theme-relearn
 	url = https://github.com/McShelby/hugo-theme-relearn.git
-[submodule "backend/rust/kokoros/sources/Kokoros"]
-	path = backend/rust/kokoros/sources/Kokoros
-	url = https://github.com/lucasjinreal/Kokoros
+[submodule "docs/themes/lotusdocs"]
+	path = docs/themes/lotusdocs
+	url = https://github.com/colinwilson/lotusdocs
--- a/.golangci.yml
+++ b/.golangci.yml
@@ -1,128 +0,0 @@
-version: "2"
-
-# Only issues introduced relative to master are reported. Pre-existing issues
-# in the codebase do not fail the lint job; they're treated as a baseline that
-# can be cleaned up incrementally. New code (added lines on a branch) is held
-# to the full linter set. Locally, `make lint-all` overrides this and reports
-# every issue.
-issues:
-  # origin/master because in shallow CI checkouts only the remote-tracking
-  # branch exists; a bare 'master' ref isn't reachable locally.
-  new-from-merge-base: origin/master
-
-linters:
-  default: standard
-  # staticcheck is noisy on this codebase (mostly QF style suggestions like
-  # "could use tagged switch" or "unnecessary fmt.Sprintf"). Re-enable
-  # selectively if a high-signal subset is identified.
-  disable:
-    - staticcheck
-  enable:
-    - forbidigo
-  settings:
-    forbidigo:
-      forbid:
-        - pattern: '^t\.Errorf$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(...) instead of t.Errorf. See .agents/coding-style.md.'
-        - pattern: '^t\.Error$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(...) instead of t.Error. See .agents/coding-style.md.'
-        - pattern: '^t\.Fatalf$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(Succeed()) / Fail(...) instead of t.Fatalf. See .agents/coding-style.md.'
-        - pattern: '^t\.Fatal$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(Succeed()) / Fail(...) instead of t.Fatal. See .agents/coding-style.md.'
-        - pattern: '^t\.Run$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Describe/Context/It instead of t.Run. See .agents/coding-style.md.'
-        - pattern: '^t\.Skip$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Skip(...) instead of t.Skip. See .agents/coding-style.md.'
-        - pattern: '^t\.Skipf$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Skip(...) instead of t.Skipf. See .agents/coding-style.md.'
-        - pattern: '^t\.SkipNow$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Skip(...) instead of t.SkipNow. See .agents/coding-style.md.'
-        - pattern: '^t\.Logf$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use GinkgoWriter / fmt.Fprintf(GinkgoWriter, ...) instead of t.Logf. See .agents/coding-style.md.'
-        - pattern: '^t\.Log$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use GinkgoWriter / fmt.Fprintln(GinkgoWriter, ...) instead of t.Log. See .agents/coding-style.md.'
-        - pattern: '^t\.Fail$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.Fail. See .agents/coding-style.md.'
-        - pattern: '^t\.FailNow$'
-          msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.FailNow. See .agents/coding-style.md.'
-        # In-process config should flow through ApplicationConfig / kong-bound
-        # CLI flags, not via os.Getenv. The CLI layer is the legitimate
-        # env→struct boundary (kong's `env:"..."` tag); anything deeper that
-        # reads env directly leaks process state into business logic and
-        # makes flags impossible to test or override per-request. Backend
-        # subprocesses, the system/capabilities probe, and a few places that
-        # read non-LocalAI env vars (HOME, PATH, AUTH_TOKEN passed by parent)
-        # are exempt — see linters.exclusions.rules below.
-        - pattern: '^os\.(Getenv|LookupEnv|Environ)$'
-          msg: 'Plumb config through ApplicationConfig (or the relevant CLI struct) instead of reading env directly. CLI entry points (core/cli/) bind env vars via kong''s `env:` tag — that is the only sanctioned env→struct boundary. See .agents/coding-style.md.'
-        # Outbound HTTP must go through pkg/httpclient, which refuses redirects
-        # by default and sets a TLS floor. The std-library default client and
-        # the http.Get/Post/... convenience helpers follow redirects (up to 10)
-        # and, on a cross-host redirect, forward custom credential headers such
-        # as Anthropic's x-api-key to the redirect target — leaking the secret
-        # (GHSA-3mj3-57v2-4636). forbidigo can't precisely match the
-        # `&http.Client{}` composite literal without also flagging legitimate
-        # `*http.Client` type references, so that form is enforced by
-        # convention + review; these two patterns catch the implicit-default
-        # client, which is the common footgun.
-        - pattern: '^http\.DefaultClient$'
-          msg: 'Use pkg/httpclient (httpclient.New / NewWithTimeout) instead of http.DefaultClient — the std client follows redirects and leaks credential headers cross-host (GHSA-3mj3-57v2-4636). See .agents/coding-style.md.'
-        - pattern: '^http\.(Get|Post|PostForm|Head)$'
-          msg: 'Use pkg/httpclient (httpclient.New / NewWithTimeout) instead of http.Get/Post/PostForm/Head — these use http.DefaultClient, which follows redirects and leaks credential headers cross-host (GHSA-3mj3-57v2-4636). See .agents/coding-style.md.'
-  exclusions:
-    paths:
-      # Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
-      - 'backend/go/whisper/sources'
-      # Vendored upstream supertonic pipeline (supertone-inc/supertonic go/helper.go).
-      - 'backend/go/supertonic/helper.go'
-      - 'docs/'
-    rules:
-      # CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
-      # boundary, and a handful of subcommands legitimately propagate values
-      # to spawned subprocesses (LLAMACPP_GRPC_SERVERS, MLX hostfile, ...).
-      - path: ^core/cli/
-        text: 'os\.(Getenv|LookupEnv|Environ)'
-        linters: [forbidigo]
-      # Backend subprocesses are independent binaries with their own env
-      # surface; they're not "in-process config" of the LocalAI server.
-      - path: ^backend/
-        text: 'os\.(Getenv|LookupEnv|Environ)'
-        linters: [forbidigo]
-      # System capability probe reads HOME, PATH-style vars to discover
-      # GPUs, default paths, etc. — not LocalAI config.
-      - path: ^pkg/system/
-        text: 'os\.(Getenv|LookupEnv|Environ)'
-        linters: [forbidigo]
-      # gRPC server reads AUTH_TOKEN passed in by the parent process at spawn
-      # time; model.Loader sets/inherits env to communicate with subprocesses.
-      - path: ^pkg/grpc/
-        text: 'os\.(Getenv|LookupEnv|Environ)'
-        linters: [forbidigo]
-      - path: ^pkg/model/
-        text: 'os\.(Getenv|LookupEnv|Environ)'
-        linters: [forbidigo]
-      # Top-level main binaries (local-ai, launcher) are entry points.
-      - path: ^cmd/
-        text: 'os\.(Getenv|LookupEnv|Environ)'
-        linters: [forbidigo]
-      # Tests legitimately read $HOME, $TMPDIR, and gating env vars
-      # (LOCALAI_COSIGN_LIVE, etc.) to skip live-network specs.
-      - path: _test\.go$
-        text: 'os\.(Getenv|LookupEnv|Environ)'
-        linters: [forbidigo]
-      # pkg/httpclient is the sanctioned home for outbound HTTP clients; it
-      # necessarily references net/http directly.
-      - path: ^pkg/httpclient/
-        text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
-        linters: [forbidigo]
-      # Tests drive local httptest servers where redirect/TLS hardening is
-      # irrelevant; the std client is fine there.
-      - path: _test\.go$
-        text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
-        linters: [forbidigo]
-      # Vendored upstream whisper.cpp Go bindings are a separate module and
-      # cannot import pkg/httpclient.
-      - path: ^backend/go/whisper/sources/
-        text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
-        linters: [forbidigo]
--- a/.goreleaser.yaml
+++ b/.goreleaser.yaml
@@ -1,54 +0,0 @@
-version: 2
-before:
-  hooks:
-    - make protogen-go
-    - make react-ui
-    - go mod tidy
-dist: release
-source:
-  enabled: true
-  name_template: '{{ .ProjectName }}-{{ .Tag }}-source'
-builds:
-  - id: local-ai
-    main: ./cmd/local-ai
-    env:
-      - CGO_ENABLED=0
-    ldflags:
-      - -s -w
-      - -X "github.com/mudler/LocalAI/internal.Version={{ .Tag }}"
-      - -X "github.com/mudler/LocalAI/internal.Commit={{ .FullCommit }}"
-    goos:
-      - linux
-      - darwin
-      #- windows
-    goarch:
-      - amd64
-      - arm64
-    ignore:
-      - goos: darwin
-        goarch: amd64
-archives:
-  - formats: [ 'binary' ] # this removes the tar of the archives, leaving the binaries alone
-    name_template: local-ai-{{ .Tag }}-{{ .Os }}-{{ .Arch }}{{ if .Arm }}v{{ .Arm }}{{ end }}
-checksum:
-  name_template: '{{ .ProjectName }}-{{ .Tag }}-checksums.txt'
-snapshot:
-  version_template: "{{ .Tag }}-next"
-changelog:
-  use: github-native
-# Sign + notarize the macOS server binary via the quill backend (runs on Linux,
-# no macOS runner needed). Disabled automatically when MACOS_SIGN_P12 is unset
-# (forks / PRs), so those builds stay unsigned and green.
-notarize:
-  macos:
-    - enabled: '{{ isEnvSet "MACOS_SIGN_P12" }}'
-      ids:
-        - local-ai
-      sign:
-        certificate: "{{.Env.MACOS_SIGN_P12}}"
-        password: "{{.Env.MACOS_SIGN_PASSWORD}}"
-      notarize:
-        issuer_id: "{{.Env.MACOS_NOTARY_ISSUER_ID}}"
-        key_id: "{{.Env.MACOS_NOTARY_KEY_ID}}"
-        key: "{{.Env.MACOS_NOTARY_KEY}}"
-        wait: true
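
GoReleaser name templates use Go's `text/template` syntax, so the removed `archives.name_template` can be tried out with the standard library alone. The sketch below is not GoReleaser code: the `artifact` struct is a hypothetical stand-in for the fields GoReleaser exposes, and the tag `v0.0.0` is a placeholder.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// artifact is a hypothetical stand-in for the values GoReleaser passes to
// name templates; only the fields used by the template are modelled.
type artifact struct {
	Tag  string
	Os   string
	Arch string
	Arm  string
}

// nameTemplate is copied from the removed .goreleaser.yaml.
const nameTemplate = `local-ai-{{ .Tag }}-{{ .Os }}-{{ .Arch }}{{ if .Arm }}v{{ .Arm }}{{ end }}`

// renderName expands the template for one artifact.
func renderName(a artifact) (string, error) {
	tmpl, err := template.New("name").Parse(nameTemplate)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, a); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// The first two entries mirror goos/goarch pairs from the removed config;
	// the third is an illustrative 32-bit ARM case that exercises .Arm.
	targets := []artifact{
		{Tag: "v0.0.0", Os: "linux", Arch: "amd64"},
		{Tag: "v0.0.0", Os: "darwin", Arch: "arm64"},
		{Tag: "v0.0.0", Os: "linux", Arch: "arm", Arm: "7"},
	}
	for _, a := range targets {
		name, err := renderName(a)
		if err != nil {
			fmt.Println("render failed:", err)
			continue
		}
		fmt.Println(name)
	}
}
```

Because the removed config builds only `amd64` and `arm64`, the `{{ if .Arm }}` branch would presumably stay empty for every artifact it produced; the third target is included only to show what the suffix looks like.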
--- a/.vscode/launch.json
+++ b/.vscode/launch.json
@@ -3,12 +3,12 @@
    "configurations": [
        {
            "name": "Python: Current File",
-            "type": "debugpy",
+            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": false,
-            "cwd": "${fileDirname}",
+            "cwd": "${workspaceFolder}/examples/langchain-chroma",
            "env": {
                "OPENAI_API_BASE": "http://localhost:8080/v1",
                "OPENAI_API_KEY": "abc"
@@ -19,16 +19,15 @@
            "type": "go",
            "request": "launch",
            "mode": "debug",
-            "program": "${workspaceRoot}",
-            "args": [],
+            "program": "${workspaceFolder}/main.go",
+            "args": [
+                "api"
+            ],
            "env": {
-                "LOCALAI_LOG_LEVEL": "debug",
-                "LOCALAI_P2P": "true",
-                "LOCALAI_FEDERATED": "true"
-            },
-            "buildFlags": ["-tags", "", "-v"],
-            "envFile": "${workspaceFolder}/.env",
-            "cwd": "${workspaceRoot}"
+                "C_INCLUDE_PATH": "${workspaceFolder}/go-llama:${workspaceFolder}/go-stable-diffusion/:${workspaceFolder}/gpt4all/gpt4all-bindings/golang/:${workspaceFolder}/go-gpt2:${workspaceFolder}/go-rwkv:${workspaceFolder}/whisper.cpp:${workspaceFolder}/go-bert:${workspaceFolder}/bloomz",
+                "LIBRARY_PATH": "${workspaceFolder}/go-llama:${workspaceFolder}/go-stable-diffusion/:${workspaceFolder}/gpt4all/gpt4all-bindings/golang/:${workspaceFolder}/go-gpt2:${workspaceFolder}/go-rwkv:${workspaceFolder}/whisper.cpp:${workspaceFolder}/go-bert:${workspaceFolder}/bloomz",
+                "DEBUG": "true"
+            }
        }
    ]
 }
--- a/.yamllint
+++ b/.yamllint
@@ -1,4 +0,0 @@
-extends: default
-
-rules:
-    line-length: disable
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,47 +0,0 @@
-# LocalAI Agent Instructions
-
-This file is the entry point for AI coding assistants (Claude Code, Cursor, Copilot, Codex, Aider, etc.) working on LocalAI. It is an index to detailed topic guides in the `.agents/` directory. Read the relevant file(s) for the task at hand — you don't need to load all of them.
-
-Human contributors: see [CONTRIBUTING.md](CONTRIBUTING.md) for the development workflow.
-
-## Policy for AI-Assisted Contributions
-
-LocalAI follows the Linux kernel project's [guidelines for AI coding assistants](https://docs.kernel.org/process/coding-assistants.html). Before submitting AI-assisted code, read [.agents/ai-coding-assistants.md](.agents/ai-coding-assistants.md). Key rules:
-
-- **No `Signed-off-by` from AI.** Only the human submitter may sign off on the Developer Certificate of Origin.
-- **No `Co-Authored-By: <AI>` trailers.** The human contributor owns the change.
-- **Use an `Assisted-by:` trailer** to attribute AI involvement. Format: `Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]`.
-- **The human submitter is responsible** for reviewing, testing, and understanding every line of generated code.
-
-## Topics
-
-| File | When to read |
-|------|-------------|
-| [.agents/ai-coding-assistants.md](.agents/ai-coding-assistants.md) | Policy for AI-assisted contributions — licensing, DCO, attribution |
-| [.agents/building-and-testing.md](.agents/building-and-testing.md) | Building the project, running tests, Docker builds for specific platforms |
-| [.agents/ci-caching.md](.agents/ci-caching.md) | CI build cache layout (registry-backed BuildKit cache on quay.io/go-skynet/ci-cache, per-arch keys), `DEPS_REFRESH` weekly cache-buster for unpinned Python deps, prebuilt `base-grpc-*` images for llama.cpp variants, per-arch native + manifest-merge pattern, `setup-build-disk` `/mnt` relocation, path filter on master push, manual eviction |
-| [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist, including importer integration (the `/import-model` dropdown is server-driven from `GET /backends/known`) |
-| [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
-| [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
-| [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
-| [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
-| [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
-| [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
-| [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
-| [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
-| [.agents/adding-gallery-models.md](.agents/adding-gallery-models.md) | Adding GGUF models from HuggingFace to the model gallery |
-| [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) | LocalAI Assistant chat modality — adding admin tools to the in-process MCP server, editing skill prompts, keeping REST + MCP + skills in sync |
-| [.agents/backend-signing.md](.agents/backend-signing.md) | Backend OCI image signing (keyless cosign + sigstore-go) — producer-side CI setup, consumer-side gallery `verification:` block, strict mode (`LOCALAI_REQUIRE_BACKEND_INTEGRITY`), revocation via `not_before` |
-
-## Quick Reference
-
-- **Git hooks & coverage gates**: Run `make install-hooks` once per clone so the pre-commit lint + coverage gates run. **Never bypass them with `git commit --no-verify`, and never lower a coverage baseline or widen a gate's tolerance to turn a red gate green** — the coverage ratchet only moves up. If a change drops coverage, add tests to raise it (e.g. render-smoke specs). See [.agents/building-and-testing.md](.agents/building-and-testing.md).
-- **Logging**: Use `github.com/mudler/xlog` (same API as slog)
-- **Go style**: Prefer `any` over `interface{}`
-- **Comments**: Explain *why*, not *what*
-- **Docs**: Update `docs/content/` when adding features or changing config
-- **New API endpoints**: LocalAI advertises its capability surface in several independent places — swagger `@Tags`, `/api/instructions` registry, auth `RouteFeatureRegistry`, React UI `capabilities.js`, docs. Read [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) and follow its checklist — missing any surface means clients, admins, and the UI won't know the endpoint exists.
-- **Admin endpoints → MCP tool**: every admin endpoint that an admin would manage conversationally (install/list/edit/toggle/upgrade) MUST also be exposed as an MCP tool in `pkg/mcp/localaitools/`. The LocalAI Assistant chat modality and the standalone `local-ai mcp-server` consume that package; drift between REST and MCP is a real risk. Read [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) — the `TestToolHTTPRouteMappingComplete` test fails until you wire the new tool and update the route map.
-- **Build**: Inspect `Makefile` and `.github/workflows/` — ask the user before running long builds
-- **Backend OS coverage**: a new backend must target every OS it can build for, not just Linux. `.github/backend-matrix.yml` has two matrices — `include:` (Linux) and `includeDarwin:` (macOS / Apple Silicon). Most C/C++/GGML and many Python backends build on Darwin too — wire the `includeDarwin` entry + `backend/index.yaml` `metal:` entries, or say in the PR why an OS is unsupported. See the darwin checklist in [.agents/adding-backends.md](.agents/adding-backends.md).
-- **UI**: The active UI is the React app in `core/http/react-ui/`. The older Alpine.js/HTML UI in `core/http/static/` is pending deprecation — all new UI work goes in the React UI
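
The removed AGENTS.md gives the attribution trailer format as `Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]`. As a rough sketch, and assuming the square brackets mark optional tool names rather than literal characters, such a trailer could be assembled as below. The helper `assistedByTrailer` and the agent, model, and tool names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// assistedByTrailer builds an Assisted-by commit trailer.
// Assumption: the [TOOL] entries in the documented format are optional and
// are written without brackets, separated by single spaces.
func assistedByTrailer(agent, model string, tools ...string) string {
	parts := append([]string{agent + ":" + model}, tools...)
	return "Assisted-by: " + strings.Join(parts, " ")
}

func main() {
	// Hypothetical agent, model, and tool names, for illustration only.
	fmt.Println(assistedByTrailer("ExampleAgent", "example-model-1"))
	fmt.Println(assistedByTrailer("ExampleAgent", "example-model-1", "gofmt", "golangci-lint"))
}
```

Per the same policy, the `Signed-off-by` line still has to come from the human submitter, so a helper like this would only ever add the `Assisted-by` line.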