Initial plan

2026-05-24 16:51:44 -04:00 · 2026-02-02 22:01:37 +00:00
1756 changed files with 264192 additions and 513028 deletions
--- a/.agents/adding-backends.md
+++ b/.agents/adding-backends.md
@@ -1,273 +0,0 @@
 # Adding a New Backend
 When adding a new backend to LocalAI, you need to update several files to ensure the backend is properly built, tested, and registered. Here's a step-by-step guide based on the pattern used for adding backends like `moonshine`:
 ## 1. Create Backend Directory Structure
 Create the backend directory under the appropriate location:
 - **Python backends**: `backend/python/<backend-name>/`
 - **Go backends**: `backend/go/<backend-name>/`
 - **C++ backends**: `backend/cpp/<backend-name>/`
 - **Rust backends**: `backend/rust/<backend-name>/`
 For Python backends, you'll typically need:
 - `backend.py` - Main gRPC server implementation
 - `Makefile` - Build configuration
 - `install.sh` - Installation script for dependencies
 - `protogen.sh` - Protocol buffer generation script
 - `requirements.txt` - Python dependencies
 - `run.sh` - Runtime script
 - `test.py` / `test.sh` - Test files
 For Rust backends, you'll typically need (see `backend/rust/kokoros/` as a reference):
 - `Cargo.toml` - Crate manifest; depend on the upstream project as a submodule under `sources/`
 - `build.rs` - Invokes `tonic_build` to generate gRPC stubs from `backend/backend.proto` (use the `BACKEND_PROTO_PATH` env var so the Makefile can inject the canonical copy)
 - `src/` - The gRPC server implementation (implement `Backend` via `tonic`)
 - `Makefile` - Copies `backend.proto` into the crate, runs `cargo build --release`, then `package.sh`
 - `package.sh` - Uses `ldd` to bundle the binary's dynamic deps and `ld.so` into `package/lib/`
 - `run.sh` - Sets `LD_LIBRARY_PATH`/`SSL_CERT_DIR` and execs the binary via the bundled `lib/ld.so`
 - `sources/<UpstreamProject>/` - Git submodule with the upstream Rust crate
 ## 2. Add Build Configurations to `.github/backend-matrix.yml`
 The build matrix is data-only YAML at `.github/backend-matrix.yml` (not inside `backend.yml` itself). `backend.yml` (master push) and `backend_pr.yml` (PR) load it via `scripts/changed-backends.js`, which also handles per-file path filtering so only touched backends rebuild on PRs and master pushes alike. Add build matrix entries to `.github/backend-matrix.yml` for each platform/GPU type you want to support. Look at similar backends for reference — `chatterbox`/`faster-whisper` for Python, `piper`/`silero-vad` for Go, `kokoros` for Rust.
 **Without an entry here no image is ever built or pushed, and the gallery entry in `backend/index.yaml` will point at a tag that does not exist.** The `dockerfile:` field must point at `./backend/Dockerfile.<lang>` matching the language bucket from step 1 (e.g. `Dockerfile.python`, `Dockerfile.golang`, `Dockerfile.rust`). The `tag-suffix` must match the `uri:` in the corresponding `backend/index.yaml` image entry exactly.
 **`scripts/changed-backends.js` registration — REQUIRED for any new dockerfile suffix.** This is the single most common omission, because it has no effect on the PR that adds the backend (when no prior path filter could catch it anyway) — it only breaks the *next* PR that touches your backend's directory, which then gets zero CI jobs and looks broken for unrelated reasons. Edit `scripts/changed-backends.js:inferBackendPath` and add a branch BEFORE the more-generic suffixes:
 ```js
 if (item.dockerfile.endsWith("<your-dockerfile-suffix>")) {
    return `backend/cpp/<your-backend>/`;   // or backend/python|go|rust/...
 }
 ```
 The `endsWith()` test runs against the matrix entry's `dockerfile:` value (e.g. `./backend/Dockerfile.ds4` → `endsWith("ds4")`). Specificity order matters here just as it does for importers: more-specific suffixes go BEFORE more-generic ones (e.g. `ds4` before `llama-cpp`, even though the two do not collide today, because some upstream might one day call itself `super-ds4-llama-cpp`). Verify locally before pushing:
 ```bash
 # Confirm your dockerfile suffix is unique enough
 node -e "
 const yaml = require('js-yaml'); const fs = require('fs');
 const m = yaml.load(fs.readFileSync('.github/backend-matrix.yml','utf8'));
 for (const e of m.include.filter(e => e.backend === '<your-backend>')) {
  console.log(e.dockerfile, '->', e.dockerfile.endsWith('<suffix>'));
 }"
 ```
 A quick way to find the right insertion point: `grep -n 'item.dockerfile.endsWith' scripts/changed-backends.js`.
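 Why branch order matters can be seen in a small sketch. It is written in Go rather than the script's JavaScript, and the `rule` type, the helper, and the `backend/cpp/ik-llama-cpp/` path are illustrative assumptions, not the contents of `scripts/changed-backends.js`. Because `Dockerfile.ik-llama-cpp` also ends with `llama-cpp`, a generic branch placed first would claim it:
 ```go
 package main

 import (
     "fmt"
     "strings"
 )

 // rule pairs a dockerfile suffix with the backend directory it maps to.
 // Both the type and the paths below are hypothetical.
 type rule struct {
     suffix string
     path   string
 }

 // inferBackendPath returns the path of the first rule whose suffix matches.
 func inferBackendPath(dockerfile string, rules []rule) string {
     for _, r := range rules {
         if strings.HasSuffix(dockerfile, r.suffix) {
             return r.path
         }
     }
     return ""
 }

 func main() {
     specificFirst := []rule{
         {"ik-llama-cpp", "backend/cpp/ik-llama-cpp/"},
         {"llama-cpp", "backend/cpp/llama-cpp/"},
     }
     genericFirst := []rule{specificFirst[1], specificFirst[0]}
     dockerfile := "./backend/Dockerfile.ik-llama-cpp"

     fmt.Println(inferBackendPath(dockerfile, specificFirst)) // backend/cpp/ik-llama-cpp/
     fmt.Println(inferBackendPath(dockerfile, genericFirst))  // backend/cpp/llama-cpp/ (shadowed)
 }
 ```
 With the generic rule first, a PR touching only the more specific backend would be attributed to the wrong directory, which is the failure mode described above.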
 **`bump_deps.yaml` registration — REQUIRED for any backend pinning an upstream commit.** If your backend's Makefile has a `*_VERSION?=<sha>` pin to a third-party repo, the daily auto-bump bot at `.github/workflows/bump_deps.yaml` won't notice it unless you register the backend in its matrix. The bot runs `.github/bump_deps.sh` which `grep`s for `^$VAR?=` in the Makefile you list — so the pin MUST live in the Makefile (not in a separate shell script). The bump for ds4 (#9761) had to walk this back because the original landed the pin in `prepare.sh`, which the bot can't see. Pattern (for `antirez/ds4`):
 ```yaml
 # .github/workflows/bump_deps.yaml
 matrix:
  include:
    - repository: "antirez/ds4"
      variable: "DS4_VERSION"
      branch: "main"
      file: "backend/cpp/ds4/Makefile"
 ```
 And the corresponding Makefile shape (mirror `backend/cpp/llama-cpp/Makefile`):
 ```makefile
 DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
 DS4_REPO?=https://github.com/antirez/ds4
 ...
 ds4:
 	mkdir -p ds4
 	cd ds4 && git init -q && \
 	git remote add origin $(DS4_REPO) && \
 	git fetch --depth 1 origin $(DS4_VERSION) && \
 	git checkout FETCH_HEAD
 ```
 If you have a `prepare.sh` doing the clone, delete it — the recipe belongs in the Makefile target so `make purge && make` works as a clean-and-rebuild and so the bump bot finds the pin.
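 To see why the pin has to use the `VAR?=` form, here is a minimal Go re-creation of the `^$VAR?=` match described above. It is a sketch of the matching rule only, not the code in `.github/bump_deps.sh`; the `findPin` helper and the `script` sample are assumptions made for illustration:
 ```go
 package main

 import (
     "fmt"
     "regexp"
 )

 // findPin looks for a line that starts with "<variable>?=" and returns the
 // value that follows. The "?" is escaped because Go's regexp syntax treats
 // it as a quantifier, whereas basic grep treats it as a literal.
 func findPin(text, variable string) (string, bool) {
     re := regexp.MustCompile(`(?m)^` + regexp.QuoteMeta(variable) + `\?=(\S+)`)
     m := re.FindStringSubmatch(text)
     if m == nil {
         return "", false
     }
     return m[1], true
 }

 func main() {
     makefile := "DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0\n" +
         "DS4_REPO?=https://github.com/antirez/ds4\n"
     // Hypothetical shell-style assignment, as a prepare.sh might contain.
     script := "DS4_VERSION=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0\n"

     sha, ok := findPin(makefile, "DS4_VERSION")
     fmt.Println(sha, ok)
     _, ok = findPin(script, "DS4_VERSION")
     fmt.Println(ok) // false: a plain "=" assignment is not matched
 }
 ```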
 **Placement in file:**
 - CPU builds: Add after other CPU builds (e.g., after `cpu-chatterbox`)
 - CUDA 12 builds: Add after other CUDA 12 builds (e.g., after `gpu-nvidia-cuda-12-chatterbox`)
 - CUDA 13 builds: Add after other CUDA 13 builds (e.g., after `gpu-nvidia-cuda-13-chatterbox`)
 **Additional build types you may need:**
 - ROCm/HIP: Use `build-type: 'hipblas'` with `base-image: "rocm/dev-ubuntu-24.04:7.2.1"`
 - Intel/SYCL: Use `build-type: 'intel'` or `build-type: 'sycl_f16'`/`sycl_f32` with `base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"`
 - L4T (ARM): Use `build-type: 'l4t'` with `platforms: 'linux/arm64'` and `runs-on: 'ubuntu-24.04-arm'`
 **Per-arch native builds (`linux/amd64` + `linux/arm64`):**
 Multi-arch backends are NOT a single matrix entry with `platforms: 'linux/amd64,linux/arm64'`. Instead, add **two** entries — one with `platforms: 'linux/amd64'` + `platform-tag: 'amd64'` + `runs-on: 'ubuntu-latest'`, one with `platforms: 'linux/arm64'` + `platform-tag: 'arm64'` + `runs-on: 'ubuntu-24.04-arm'` — both sharing the same `tag-suffix`. The script detects the shared `tag-suffix` and emits a `merge-matrix` entry, so `backend-merge-jobs` (in `backend.yml`/`backend_pr.yml`) automatically assembles the manifest list from per-arch digest artifacts. See `-cpu-faster-whisper` in `.github/backend-matrix.yml` for a reference shape.
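 One plausible reading of the shared-`tag-suffix` detection can be sketched in Go as follows. The real logic lives in `scripts/changed-backends.js` and may differ; the `entry` type, the `mergeCandidates` helper, and the `-cpu-my-backend` entry are hypothetical:
 ```go
 package main

 import "fmt"

 // entry holds the three matrix fields this sketch cares about.
 type entry struct {
     TagSuffix   string
     Platforms   string
     PlatformTag string
 }

 // mergeCandidates returns, in first-seen order, every tag-suffix that is
 // shared by more than one per-arch entry (an entry with a platform-tag).
 func mergeCandidates(entries []entry) []string {
     seen := map[string]int{}
     var order []string
     for _, e := range entries {
         if e.PlatformTag == "" {
             continue // not a per-arch entry
         }
         if seen[e.TagSuffix] == 0 {
             order = append(order, e.TagSuffix)
         }
         seen[e.TagSuffix]++
     }
     var merged []string
     for _, suffix := range order {
         if seen[suffix] > 1 {
             merged = append(merged, suffix)
         }
     }
     return merged
 }

 func main() {
     matrix := []entry{
         {"-cpu-faster-whisper", "linux/amd64", "amd64"},
         {"-cpu-faster-whisper", "linux/arm64", "arm64"},
         {"-cpu-my-backend", "linux/amd64", ""}, // hypothetical single-arch entry
     }
     fmt.Println(mergeCandidates(matrix)) // [-cpu-faster-whisper]
 }
 ```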
 **llama-cpp / ik-llama-cpp / turboquant variants only — `builder-base-image`:**
 Entries whose `dockerfile` is `./backend/Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}` must also set a `builder-base-image` field pointing at a prebuilt base from `quay.io/go-skynet/ci-cache:base-grpc-*` (CI builds these via `.github/workflows/base-images.yml`). The mapping is by `(build-type, platforms)` — see existing entries for the pattern. CI uses these prebuilt bases to skip the gRPC compile (~25–35 min cold). Local `make backends/<name>` ignores `builder-base-image` and uses the from-source path inside the Dockerfile, so you don't need quay access for local builds.
 ## 3. Add Backend Metadata to `backend/index.yaml`
 **Step 3a: Add Meta Definition**
 Add a YAML anchor definition in the `## metas` section (around lines 2-300). Use a similar backend such as `diffusers` or `chatterbox` as a template.
 **Step 3b: Add Image Entries**
 Add image entries at the end of the file, following the pattern of similar backends such as `diffusers` or `chatterbox`. Include both `latest` (production) and `master` (development) tags.
 **Note on integrity:** OCI backends installed from a gallery whose `verification:` block is set are verified against a keyless-cosign policy before extraction; tarball/HTTP backends use the optional `sha256:` field. New backends do not need any extra YAML — the gallery-level `verification:` block covers every entry. See [.agents/backend-signing.md](backend-signing.md) for the producer-side CI step.
 ## 4. Update the Makefile
 The Makefile needs to be updated in several places to support building and testing the new backend:
 **Step 4a: Add to `.NOTPARALLEL`**
 Add `backends/<backend-name>` to the `.NOTPARALLEL` line (around line 2) to prevent parallel execution conflicts:
 ```makefile
 .NOTPARALLEL: ... backends/<backend-name>
 ```
 **Step 4b: Add to `prepare-test-extra`**
 Add the backend to the `prepare-test-extra` target to prepare it for testing. Use the path matching your language bucket (`backend/python/`, `backend/go/`, `backend/rust/`, …):
 ```makefile
 prepare-test-extra: protogen-python
 	...
 	$(MAKE) -C backend/<lang>/<backend-name>
 ```
 For Rust backends the target is usually the crate build target itself (e.g. `$(MAKE) -C backend/rust/<backend-name> <backend-name>-grpc`) so the binary is in place before `test` runs.
 **Step 4c: Add to `test-extra`**
 Add the backend to the `test-extra` target to run its tests — applies to Go and Rust backends too, not only Python:
 ```makefile
 test-extra: prepare-test-extra
 	...
 	$(MAKE) -C backend/<lang>/<backend-name> test
 ```
 Each backend's own `Makefile` should define a `test` target so this line works regardless of language. Integration tests that need large model downloads should be gated behind an env var (see `backend/rust/kokoros/`'s `KOKOROS_MODEL_PATH` pattern) so CI only runs unit tests.
 **Step 4d: Add Backend Definition**
 Add a backend definition variable in the backend definitions section (around line 428-457). The format depends on the backend type:
 **For Python backends with root context** (like `faster-whisper`, `coqui`):
 ```makefile
 BACKEND_<BACKEND_NAME> = <backend-name>|python|.|false|true
 ```
 **For Python backends with `./backend` context** (like `chatterbox`, `moonshine`):
 ```makefile
 BACKEND_<BACKEND_NAME> = <backend-name>|python|./backend|false|true
 ```
 **For Go backends**:
 ```makefile
 BACKEND_<BACKEND_NAME> = <backend-name>|golang|.|false|true
 ```
 **For Rust backends**:
 ```makefile
 BACKEND_<BACKEND_NAME> = <backend-name>|rust|.|false|true
 ```
 The language field (`python`/`golang`/`rust`/…) must match a `backend/Dockerfile.<lang>` file.
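 The shape of a backend definition can be checked with a short Go sketch. It assumes only what the examples above show: five `|`-separated fields whose first three are the name, the language, and the build context. The last two fields are left uninterpreted, and the `backendDef` type and `parseBackendDef` helper are hypothetical:
 ```go
 package main

 import (
     "fmt"
     "strings"
 )

 // backendDef captures the fields this sketch interprets, plus the
 // Dockerfile path the language field is expected to resolve to.
 type backendDef struct {
     Name       string
     Lang       string
     Context    string
     Dockerfile string
 }

 // parseBackendDef splits a "name|lang|context|x|y" definition and derives
 // backend/Dockerfile.<lang> from the language field.
 func parseBackendDef(def string) (backendDef, error) {
     fields := strings.Split(def, "|")
     if len(fields) != 5 {
         return backendDef{}, fmt.Errorf("expected 5 fields, got %d", len(fields))
     }
     for i, f := range fields {
         if f == "" {
             return backendDef{}, fmt.Errorf("field %d is empty", i+1)
         }
     }
     return backendDef{
         Name:       fields[0],
         Lang:       fields[1],
         Context:    fields[2],
         Dockerfile: "backend/Dockerfile." + fields[1],
     }, nil
 }

 func main() {
     def, err := parseBackendDef("moonshine|python|./backend|false|true")
     fmt.Println(def.Name, def.Context, def.Dockerfile, err)
 }
 ```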
 **Step 4e: Generate Docker Build Target**
 Add an eval call to generate the docker-build target (around line 480-501):
 ```makefile
 $(eval $(call generate-docker-build-target,$(BACKEND_<BACKEND_NAME>)))
 ```
 **Step 4f: Add to `docker-build-backends`**
 Add `docker-build-<backend-name>` to the `docker-build-backends` target (around line 507):
 ```makefile
 docker-build-backends: ... docker-build-<backend-name>
 ```
 **Determining the Context:**
 - If the backend is in `backend/python/<backend-name>/` and uses `./backend` as context in the workflow file, use `./backend` context
 - If the backend is in `backend/python/<backend-name>/` but uses `.` as context in the workflow file, use `.` context
 - Check similar backends to determine the correct context
 ## 5. Verification Checklist
 After adding a new backend, verify:
 - [ ] Backend directory structure is complete with all necessary files
 - [ ] Build configurations added to `.github/backend-matrix.yml` for all desired platforms (per-arch entries with `platform-tag` for multi-arch; `builder-base-image` for llama-cpp / ik-llama-cpp / turboquant)
 - [ ] Meta definition added to `backend/index.yaml` in the `## metas` section
 - [ ] Image entries added to `backend/index.yaml` for all build variants (latest + development)
 - [ ] Tag suffixes match between `.github/backend-matrix.yml` and `backend/index.yaml`
 - [ ] Makefile updated with all 6 required changes (`.NOTPARALLEL`, `prepare-test-extra`, `test-extra`, backend definition, docker-build target eval, `docker-build-backends`)
 - [ ] No YAML syntax errors (check with linter)
 - [ ] No Makefile syntax errors (check with linter)
 - [ ] Follows the same pattern as similar backends (e.g., if it's a transcription backend, follow `faster-whisper` pattern)
 ## Bundling runtime shared libraries (`package.sh`)
 The final `Dockerfile.python` stage is `FROM scratch`: there is no system `libc`, no `apt`, no fallback library path. Only files explicitly copied from the builder stage end up in the backend image. That means any shared library your backend (or its Python deps) loads with `dlopen` at run time **must** be packaged into `${BACKEND}/lib/`.
 Pattern:
 1. Make sure the library is installed in the builder stage of `backend/Dockerfile.python` (add it to the top-level `apt-get install`).
 2. Drop a `package.sh` in your backend directory that copies the library — and its soname symlinks — into `$(dirname $0)/lib`. See `backend/python/vllm/package.sh` for a reference implementation that walks `/usr/lib/x86_64-linux-gnu`, `/usr/lib/aarch64-linux-gnu`, etc.
 3. `Dockerfile.python` already runs `package.sh` automatically if it exists, after `package-gpu-libs.sh`.
 4. `libbackend.sh` automatically prepends `${EDIR}/lib` to `LD_LIBRARY_PATH` at run time, so anything packaged this way is found by `dlopen`.
 How to find missing libs: when a Python module silently fails to register torch ops or you see `AttributeError: '_OpNamespace' '...' object has no attribute '...'`, run the backend image's Python with `LD_DEBUG=libs` to see which `dlopen` failed. The filename in the error message (e.g. `libnuma.so.1`) is what you need to package.
 To verify packaging works without trusting the host:
 ```bash
 make docker-build-<backend>
 CID=$(docker create --entrypoint=/run.sh local-ai-backend:<backend>)
 docker cp "$CID":/lib /tmp/check && docker rm "$CID"
 ls /tmp/check    # expect the bundled .so files + symlinks
 ```
 Then boot it inside a fresh `ubuntu:24.04` (which intentionally does *not* have the lib installed) to confirm it actually loads from the backend dir.
 ## Importer integration
 When you add a new backend, you MUST also make it importable via the model import form (`/import-model`). The import form dropdown is sourced dynamically from `GET /backends/known` — it reads the importer registry at `core/gallery/importers/importers.go`, so the steps below are the ONLY way to make your backend show up.
 Required steps:
 1. **If your backend has unambiguous detection signals** (unique file extension, HF `pipeline_tag`, unique repo name pattern, unique artefact like `modules.json`):
   - Create an importer file at `core/gallery/importers/<backend>.go` following the Match/Import pattern in `llama-cpp.go`.
   - Register it in `importers.go:defaultImporters` in **specificity order** — more specific detectors must appear BEFORE more generic ones (e.g. `sentencetransformers` before `transformers`, `stablediffusion-ggml` before `llama-cpp`, `vllm-omni` before `vllm`). First match wins.
 2. **If your backend is a drop-in replacement** (same artefacts as another backend, e.g. `ik-llama-cpp` and `turboquant` both consume GGUF the same way `llama-cpp` does):
   - Do NOT create a new importer. Extend the existing importer's `Import()` to swap the emitted `backend:` field when `preferences.backend` matches. See `llama-cpp.go` for the pattern.
 3. **If your backend has no reliable auto-detect signal** (preference-only — e.g. `sglang`, `tinygrad`, `whisperx`):
   - Do NOT create an importer. Instead add the backend name to the curated pref-only slice in `core/http/endpoints/localai/backend.go` that feeds `/backends/known`. A single line addition.
 4. **Always** add a table-driven test in `core/gallery/importers/importers_test.go` (Ginkgo/Gomega):
   - Use a real public HuggingFace repo URI as the test fixture (existing tests already hit the live HF API — follow that pattern).
   - Cover detection (auto-match without preferences), preference-override (explicit `backend:` in preferences wins), and — if the backend's modality has a common `pipeline_tag` but ambiguous artefacts — an ambiguity test asserting `errors.Is(err, importers.ErrAmbiguousImport)`.
 Rules of thumb:
 - When in doubt, lean pref-only. A wrong auto-detect is worse than a forced preference.
 - Never silently emit a modality mismatch (e.g. emit `llama-cpp` for a TTS repo because `.gguf` is present). Return `ErrAmbiguousImport` instead.
 - Registration order is the single most common source of bugs. Check by running `go test ./core/gallery/importers/...` — the existing suite will fail if you've shadowed a pre-existing detector.
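 The first-match-wins rule and the preference override can be illustrated with a self-contained Go sketch. None of this is the code in `core/gallery/importers/`: the `repo` and `importer` types, the detection signals (`modules.json`, `config.json`), and the choice to return `ErrAmbiguousImport` when nothing matches are assumptions made for the example:
 ```go
 package main

 import (
     "errors"
     "fmt"
 )

 // ErrAmbiguousImport is a local stand-in for the sentinel named in the guide.
 var ErrAmbiguousImport = errors.New("ambiguous import")

 // repo is a hypothetical summary of what an importer inspects.
 type repo struct {
     Files []string
 }

 // importer pairs a backend name with its detection function.
 type importer struct {
     Name  string
     Match func(r repo) bool
 }

 func hasFile(r repo, name string) bool {
     for _, f := range r.Files {
         if f == name {
             return true
         }
     }
     return false
 }

 // registry lists importers in specificity order: the first match wins.
 var registry = []importer{
     {"sentencetransformers", func(r repo) bool { return hasFile(r, "modules.json") }},
     {"transformers", func(r repo) bool { return hasFile(r, "config.json") }},
 }

 // pick returns the preferred backend if one is given, else the first match.
 func pick(r repo, preferred string) (string, error) {
     if preferred != "" {
         return preferred, nil // an explicit preference always wins
     }
     for _, imp := range registry {
         if imp.Match(r) {
             return imp.Name, nil
         }
     }
     return "", ErrAmbiguousImport
 }

 func main() {
     r := repo{Files: []string{"config.json", "modules.json"}}
     name, _ := pick(r, "")
     fmt.Println(name) // sentencetransformers
     name, _ = pick(r, "vllm")
     fmt.Println(name) // vllm
     _, err := pick(repo{Files: []string{"README.md"}}, "")
     fmt.Println(errors.Is(err, ErrAmbiguousImport)) // true
 }
 ```
 Swapping the two registry entries would make `transformers` claim the repo, which is the shadowing that the existing test suite is meant to catch.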
 ## 6. Example: Adding a Python Backend
 For reference, when `moonshine` was added:
 - **Files created**: `backend/python/moonshine/{backend.py, Makefile, install.sh, protogen.sh, requirements.txt, run.sh, test.py, test.sh}`
 - **Workflow entries**: 3 build configurations (CPU, CUDA 12, CUDA 13)
 - **Index entries**: 1 meta definition + 6 image entries (cpu, cuda12, cuda13 × latest/development)
 - **Makefile updates**:
  - Added to `.NOTPARALLEL` line
  - Added to `prepare-test-extra` and `test-extra` targets
  - Added `BACKEND_MOONSHINE = moonshine|python|./backend|false|true`
  - Added eval for docker-build target generation
  - Added `docker-build-moonshine` to `docker-build-backends`
--- a/.agents/adding-gallery-models.md
+++ b/.agents/adding-gallery-models.md
@@ -1,111 +0,0 @@
 # Adding GGUF Models from HuggingFace to the Gallery
 When adding a GGUF model from HuggingFace to the LocalAI model gallery, follow this guide.
 ## Gallery file
 All models are defined in `gallery/index.yaml`. Find the appropriate section (embedding models near other embeddings, chat models near similar chat models) and add a new entry.
 ## Getting the SHA256
 GGUF files on HuggingFace expose their SHA256 via the `x-linked-etag` HTTP header. Fetch it with:
 ```bash
 curl -sI "https://huggingface.co/<org>/<repo>/resolve/main/<filename>.gguf" | grep -i x-linked-etag
 ```
 The value (without quotes) is the SHA256 hash. Example:
 ```bash
 curl -sI "https://huggingface.co/ggml-org/embeddinggemma-300m-qat-q8_0-GGUF/resolve/main/embeddinggemma-300m-qat-Q8_0.gguf" | grep -i x-linked-etag
 # x-linked-etag: "6fa0c02a9c302be6f977521d399b4de3a46310a4f2621ee0063747881b673f67"
 ```
 **Important**: Pay attention to exact filename casing — HuggingFace filenames are case-sensitive (e.g., `Q8_0` vs `q8_0`). Check the repo's file listing to get the exact name.
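 If you script the lookup, it is worth validating the header value before pasting it into the gallery. The Go sketch below only parses and checks a value you already fetched; the `sha256FromEtag` helper is hypothetical and performs no network access:
 ```go
 package main

 import (
     "encoding/hex"
     "fmt"
     "strings"
 )

 // sha256FromEtag strips the surrounding quotes from an x-linked-etag value
 // and checks that what remains is a 32-byte (64 hex character) digest.
 func sha256FromEtag(header string) (string, error) {
     value := strings.Trim(strings.TrimSpace(header), `"`)
     raw, err := hex.DecodeString(value)
     if err != nil {
         return "", fmt.Errorf("not a hex digest: %w", err)
     }
     if len(raw) != 32 {
         return "", fmt.Errorf("expected 32 bytes, got %d", len(raw))
     }
     return strings.ToLower(value), nil
 }

 func main() {
     etag := `"6fa0c02a9c302be6f977521d399b4de3a46310a4f2621ee0063747881b673f67"`
     sum, err := sha256FromEtag(etag)
     fmt.Println(sum, err)
 }
 ```
 In a real client the input would typically come from `resp.Header.Get("X-Linked-Etag")` after an `http.Head` request to the `resolve/main` URL.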
 ## Entry format — Embedding models
 Embedding models use `gallery/virtual.yaml` as the base config and set `embeddings: true`:
 ```yaml
 - name: "model-name"
  url: github:mudler/LocalAI/gallery/virtual.yaml@master
  urls:
    - https://huggingface.co/<original-model-org>/<original-model-name>
    - https://huggingface.co/<gguf-org>/<gguf-repo-name>
  description: |
    Short description of the model, its size, and capabilities.
  tags:
    - embeddings
  overrides:
    backend: llama-cpp
    embeddings: true
    parameters:
      model: <filename>.gguf
  files:
    - filename: <filename>.gguf
      uri: huggingface://<gguf-org>/<gguf-repo-name>/<filename>.gguf
      sha256: <sha256-hash>
 ```
 ## Entry format — Chat/LLM models
 Chat models typically reference a template config (e.g., `gallery/gemma.yaml`, `gallery/chatml.yaml`) that defines the prompt format. Use YAML anchors (`&name` / `*name`) if adding multiple quantization variants of the same model:
 ```yaml
 - &model-anchor
  url: "github:mudler/LocalAI/gallery/<template>.yaml@master"
  name: "model-name"
  icon: https://example.com/icon.png
  license: <license>
  urls:
    - https://huggingface.co/<org>/<model>
    - https://huggingface.co/<gguf-org>/<gguf-repo>
  description: |
    Model description.
  tags:
    - llm
    - gguf
    - gpu
    - cpu
  overrides:
    parameters:
      model: <filename>-Q4_K_M.gguf
  files:
    - filename: <filename>-Q4_K_M.gguf
      sha256: <sha256>
      uri: huggingface://<gguf-org>/<gguf-repo>/<filename>-Q4_K_M.gguf
 ```
 To add a variant (e.g., different quantization), use YAML merge:
 ```yaml
 - !!merge <<: *model-anchor
  name: "model-name-q8"
  overrides:
    parameters:
      model: <filename>-Q8_0.gguf
  files:
    - filename: <filename>-Q8_0.gguf
      sha256: <sha256>
      uri: huggingface://<gguf-org>/<gguf-repo>/<filename>-Q8_0.gguf
 ```
 ## Available template configs
 Look at existing `.yaml` files in `gallery/` to find the right prompt template for your model architecture:
 - `gemma.yaml` — Gemma-family models (gemma, embeddinggemma, etc.)
 - `chatml.yaml` — ChatML format (many Mistral/OpenHermes models)
 - `deepseek.yaml` — DeepSeek models
 - `virtual.yaml` — Minimal base (good for embedding models that don't need chat templates)
 ## Checklist
 1. **Find the GGUF file** on HuggingFace — note exact filename (case-sensitive)
 2. **Get the SHA256** using the `curl -sI` + `x-linked-etag` method above
 3. **Choose the right template** config from `gallery/` based on model architecture
 4. **Add the entry** to `gallery/index.yaml` near similar models
 5. **Set `embeddings: true`** if it's an embedding model
 6. **Include both URLs** — the original model page and the GGUF repo
 7. **Write a description** — mention model size, capabilities, and quantization type
--- a/.agents/ai-coding-assistants.md
+++ b/.agents/ai-coding-assistants.md
@@ -1,101 +0,0 @@
 # AI Coding Assistants
 This document provides guidance for AI tools and developers using AI
 assistance when contributing to LocalAI.
 **LocalAI follows the same guidelines as the Linux kernel project for
 AI-assisted contributions.** See the upstream policy here:
 <https://docs.kernel.org/process/coding-assistants.html>
 The rules below mirror that policy, adapted to LocalAI's license and
 project layout. If anything is unclear, the kernel document is the
 authoritative reference for intent.
 AI tools helping with LocalAI development should follow the standard
 project development process:
 - [CONTRIBUTING.md](../CONTRIBUTING.md) — development workflow, commit
  conventions, and PR guidelines
 - [.agents/coding-style.md](coding-style.md) — code style, editorconfig,
  logging, and documentation conventions
 - [.agents/building-and-testing.md](building-and-testing.md) — build and
  test procedures
 ## Licensing and Legal Requirements
 All contributions must comply with LocalAI's licensing requirements:
 - LocalAI is licensed under the **MIT License** — see the [LICENSE](../LICENSE)
  file
 - New source files should use the SPDX license identifier `MIT` where
  applicable to the file type
 - Contributions must be compatible with the MIT License and must not
  introduce code under incompatible licenses (e.g., GPL) without an
  explicit discussion with maintainers
 ## Signed-off-by and Developer Certificate of Origin
 **AI agents MUST NOT add `Signed-off-by` tags.** Only humans can legally
 certify the Developer Certificate of Origin (DCO). The human submitter
 is responsible for:
 - Reviewing all AI-generated code
 - Ensuring compliance with licensing requirements
 - Adding their own `Signed-off-by` tag (when the project requires DCO)
  to certify the contribution
 - Taking full responsibility for the contribution
 AI agents MUST NOT add `Co-Authored-By` trailers for themselves either.
 A human reviewer owns the contribution; the AI's involvement is recorded
 via `Assisted-by` (see below).
 ## Attribution
 When AI tools contribute to LocalAI development, proper attribution helps
 track the evolving role of AI in the development process. Contributions
 should include an `Assisted-by` tag in the commit message trailer in the
 following format:
 ```
 Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]
 ```
 Where:
 - `AGENT_NAME` — name of the AI tool or framework (e.g., `Claude`,
  `Copilot`, `Cursor`)
 - `MODEL_VERSION` — specific model version used (e.g.,
  `claude-opus-4-7`, `gpt-5`)
 - `[TOOL1] [TOOL2]` — optional specialized analysis tools invoked by the
  agent (e.g., `golangci-lint`, `staticcheck`, `go vet`)
 Basic development tools (git, go, make, editors) should **not** be listed.
 ### Example
 ```
 fix(llama-cpp): handle empty tool call arguments
 Previously the parser panicked when the model returned a tool call with
 an empty arguments object. Fall back to an empty JSON object in that
 case so downstream consumers receive a valid payload.
 Assisted-by: Claude:claude-opus-4-7 golangci-lint
 Signed-off-by: Jane Developer <jane@example.com>
 ```
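 A commit hook or review script could check the trailer shape with something like the Go sketch below. It assumes the format shown above (a single `AGENT_NAME:MODEL_VERSION` token followed by optional space-separated tools); the `assistedBy` type and `parseAssistedBy` helper are hypothetical and not part of the LocalAI tree:
 ```go
 package main

 import (
     "fmt"
     "strings"
 )

 // assistedBy holds the parsed pieces of an Assisted-by trailer.
 type assistedBy struct {
     Agent string
     Model string
     Tools []string
 }

 // parseAssistedBy parses "Assisted-by: AGENT:MODEL [TOOL...]" and reports
 // whether the line matched that shape.
 func parseAssistedBy(line string) (assistedBy, bool) {
     const prefix = "Assisted-by: "
     if !strings.HasPrefix(line, prefix) {
         return assistedBy{}, false
     }
     fields := strings.Fields(strings.TrimPrefix(line, prefix))
     if len(fields) == 0 {
         return assistedBy{}, false
     }
     parts := strings.SplitN(fields[0], ":", 2)
     if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
         return assistedBy{}, false
     }
     return assistedBy{Agent: parts[0], Model: parts[1], Tools: fields[1:]}, true
 }

 func main() {
     t, ok := parseAssistedBy("Assisted-by: Claude:claude-opus-4-7 golangci-lint")
     fmt.Println(t.Agent, t.Model, t.Tools, ok)
 }
 ```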
 ## Scope and Responsibility
 Using an AI assistant does not reduce the contributor's responsibility.
 The human submitter must:
 - Understand every line that lands in the PR
 - Verify that generated code compiles, passes tests, and follows the
  project style
 - Confirm that any referenced APIs, flags, or file paths actually exist
  in the current tree (AI models may hallucinate identifiers)
 - Not submit AI output verbatim without review
 Reviewers may ask for clarification on any change regardless of how it
 was produced. "An AI wrote it" is not an acceptable answer to a design
 question.
--- a/.agents/api-endpoints-and-auth.md
+++ b/.agents/api-endpoints-and-auth.md
@@ -1,355 +0,0 @@
 # API Endpoints and Authentication
 This guide covers how to add new API endpoints and properly integrate them with the auth/permissions system.
 > **Before you ship a new endpoint or capability surface**, re-read the [checklist at the bottom of this file](#checklist). LocalAI advertises its feature surface in several independent places — miss any one of them and clients/admins/UI won't know the endpoint exists.
 ## Architecture overview
 Authentication and authorization flow through three layers:
 1. **Global auth middleware** (`core/http/auth/middleware.go` → `auth.Middleware`) — applied to every request in `core/http/app.go`. Handles session cookies, Bearer tokens, API keys, and legacy API keys. Populates `auth_user` and `auth_role` in the Echo context.
 2. **Feature middleware** (`auth.RequireFeature`) — per-feature access control applied to route groups or individual routes. Checks if the authenticated user has the specific feature enabled.
 3. **Admin middleware** (`auth.RequireAdmin`) — restricts endpoints to admin users only.
 When auth is disabled (no auth DB, no legacy API keys), all middleware becomes pass-through (`auth.NoopMiddleware`).
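 The pass-through behavior can be sketched as follows. The example uses the standard library's `net/http` instead of Echo so that it runs without third-party modules, and the `X-Role` header is a hypothetical stand-in for the `auth_role` value that the real global middleware stores in the Echo context:
 ```go
 package main

 import (
     "fmt"
     "net/http"
     "net/http/httptest"
 )

 type middleware func(http.Handler) http.Handler

 // noop is the pass-through used when auth is disabled.
 func noop(next http.Handler) http.Handler { return next }

 // requireAdmin rejects any request whose role is not "admin".
 func requireAdmin(next http.Handler) http.Handler {
     return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
         if r.Header.Get("X-Role") != "admin" {
             http.Error(w, "forbidden", http.StatusForbidden)
             return
         }
         next.ServeHTTP(w, r)
     })
 }

 // adminMiddleware picks the real check or the pass-through.
 func adminMiddleware(authEnabled bool) middleware {
     if !authEnabled {
         return noop
     }
     return requireAdmin
 }

 var okHandler http.Handler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
     w.WriteHeader(http.StatusOK)
 })

 // status sends one request through h and returns the response code.
 func status(h http.Handler, role string) int {
     req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
     if role != "" {
         req.Header.Set("X-Role", role)
     }
     rec := httptest.NewRecorder()
     h.ServeHTTP(rec, req)
     return rec.Code
 }

 func main() {
     fmt.Println(status(adminMiddleware(true)(okHandler), "user"))  // 403
     fmt.Println(status(adminMiddleware(true)(okHandler), "admin")) // 200
     fmt.Println(status(adminMiddleware(false)(okHandler), ""))     // 200
 }
 ```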
 ## Adding a new API endpoint
 ### Step 1: Create the handler
 Write the endpoint handler in the appropriate package under `core/http/endpoints/`. Follow existing patterns:
 ```go
 // core/http/endpoints/localai/my_feature.go
 func MyFeatureEndpoint(app *application.Application) echo.HandlerFunc {
    return func(c echo.Context) error {
        // Use auth.GetUser(c) to get the authenticated user (may be nil if auth is disabled)
        user := auth.GetUser(c)
        // Your logic here
        return c.JSON(http.StatusOK, result)
    }
 }
 ```
 ### Step 2: Register routes
 Add routes in the appropriate file under `core/http/routes/`. The file you use depends on the endpoint category:
 | File | Category |
 |------|----------|
 | `routes/openai.go` | OpenAI-compatible API endpoints (`/v1/...`) |
 | `routes/localai.go` | LocalAI-specific endpoints (`/api/...`, `/models/...`, `/backends/...`) |
 | `routes/agents.go` | Agent pool endpoints (`/api/agents/...`) |
 | `routes/auth.go` | Auth endpoints (`/api/auth/...`) |
 | `routes/ui_api.go` | UI backend API endpoints |
 ### Step 3: Apply the right middleware
 Choose the appropriate protection level:
 #### No auth required (public)
 Exempt paths bypass auth entirely. Add to `isExemptPath()` in `middleware.go` or use the `/api/auth/` prefix (always exempt). Use sparingly — most endpoints should require auth.
 #### Standard auth (any authenticated user)
 The global middleware already handles this. API paths (`/api/`, `/v1/`, etc.) automatically require authentication when auth is enabled. You don't need to add any extra middleware.
 ```go
 router.GET("/v1/my-endpoint", myHandler)  // auth enforced by global middleware
 ```
 #### Admin only
 Pass `adminMiddleware` to the route. This is set up in `app.go` and passed to `Register*Routes` functions:
 ```go
 // In the Register function signature, accept the middleware:
 func RegisterMyRoutes(router *echo.Echo, app *application.Application, adminMiddleware echo.MiddlewareFunc) {
    router.POST("/models/apply", myHandler, adminMiddleware)
 }
 ```
 #### Feature-gated
 For endpoints that should be toggleable per-user, use feature middleware. There are two approaches:
 **Approach A: Route-level middleware** (preferred for groups of related endpoints)
 ```go
 // In app.go, create the feature middleware:
 myFeatureMw := auth.RequireFeature(application.AuthDB(), auth.FeatureMyFeature)
 // Pass it to the route registration function:
 routes.RegisterMyRoutes(e, app, myFeatureMw)
 // In the routes file, apply to a group:
 g := e.Group("/api/my-feature", myFeatureMw)
 g.GET("", listHandler)
 g.POST("", createHandler)
 ```
 **Approach B: RouteFeatureRegistry** (preferred for individual OpenAI-compatible endpoints)
 Add an entry to `RouteFeatureRegistry` in `core/http/auth/features.go`. The `RequireRouteFeature` global middleware will automatically enforce it:
 ```go
 var RouteFeatureRegistry = []RouteFeature{
    // ... existing entries ...
    {"POST", "/v1/my-endpoint", FeatureMyFeature},
 }
 ```
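 How a registry of this shape can be consulted is sketched below. The field names, the exact-match lookup, and the `featureFor` helper are assumptions for illustration; the real `RequireRouteFeature` middleware may match routes differently:
 ```go
 package main

 import "fmt"

 // RouteFeature mirrors the three positional fields used in the registry
 // literal above; the field names here are hypothetical.
 type RouteFeature struct {
     Method  string
     Path    string
     Feature string
 }

 const FeatureMyFeature = "my_feature"

 var RouteFeatureRegistry = []RouteFeature{
     {"POST", "/v1/my-endpoint", FeatureMyFeature},
 }

 // featureFor returns the feature gating a route, if any is registered.
 func featureFor(method, path string) (string, bool) {
     for _, rf := range RouteFeatureRegistry {
         if rf.Method == method && rf.Path == path {
             return rf.Feature, true
         }
     }
     return "", false
 }

 func main() {
     fmt.Println(featureFor("POST", "/v1/my-endpoint")) // my_feature true
     fmt.Println(featureFor("GET", "/v1/my-endpoint"))  //  false
 }
 ```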
 ## Adding a new feature
 When you need a new toggleable feature (not just a new endpoint under an existing feature):
 ### 1. Define the feature constant
 Add to `core/http/auth/permissions.go`:
 ```go
 const (
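    // Add to exactly one group. Declaring the constant twice in the same
    // block does not compile, so the second placement is shown commented out.
    // Agent features (default OFF for new users):
    FeatureMyFeature = "my_feature"
    // OR API features (default ON for new users):
    // FeatureMyFeature = "my_feature"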
 )
 ```
 Then add it to the appropriate slice:
 ```go
 // Default OFF — user must be explicitly granted access:
 var AgentFeatures = []string{..., FeatureMyFeature}
 // Default ON — user has access unless explicitly revoked:
 var APIFeatures = []string{..., FeatureMyFeature}
 ```
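 The difference between the two slices comes down to the default applied when a user has no explicit setting. A minimal Go sketch of that rule follows; the feature names, the `overrides` map, and the `hasAccess` helper are hypothetical and do not reproduce `auth.HasFeatureAccess`:
 ```go
 package main

 import "fmt"

 // Hypothetical feature names, one per group.
 var agentFeatures = []string{"my_agent_feature"} // default OFF
 var apiFeatures = []string{"my_api_feature"}     // default ON

 func contains(list []string, s string) bool {
     for _, v := range list {
         if v == s {
             return true
         }
     }
     return false
 }

 // hasAccess applies an explicit per-user override when present and falls
 // back to the group default otherwise. Unknown features default to OFF.
 func hasAccess(feature string, overrides map[string]bool) bool {
     if v, ok := overrides[feature]; ok {
         return v
     }
     return contains(apiFeatures, feature)
 }

 func main() {
     for _, f := range append(append([]string{}, agentFeatures...), apiFeatures...) {
         fmt.Println(f, hasAccess(f, nil))
     }
     fmt.Println(hasAccess("my_agent_feature", map[string]bool{"my_agent_feature": true}))
 }
 ```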
 ### 2. Add feature metadata
 In `core/http/auth/features.go`, add to the appropriate `FeatureMetas` function so the admin UI can display it:
 ```go
 func AgentFeatureMetas() []FeatureMeta {
    return []FeatureMeta{
        // ... existing ...
        {FeatureMyFeature, "My Feature", false},  // false = default OFF
    }
 }
 ```
 ### 3. Wire up the middleware
 In `core/http/app.go`:
 ```go
 myFeatureMw := auth.RequireFeature(application.AuthDB(), auth.FeatureMyFeature)
 ```
 Then pass it to the route registration function.
 ### 4. Register route-feature mappings (if applicable)
 If your feature gates standard API endpoints (like `/v1/...`), add entries to `RouteFeatureRegistry` in `features.go` instead of using per-route middleware.
 ## Accessing the authenticated user in handlers
 ```go
 import "github.com/mudler/LocalAI/core/http/auth"
 func MyHandler(c echo.Context) error {
    // Get the user (nil when auth is disabled or unauthenticated)
    user := auth.GetUser(c)
    if user == nil {
        // Handle unauthenticated — or let middleware handle it
    }
    // Check role (user is nil when auth is disabled, so guard before dereferencing)
    if user != nil && user.Role == auth.RoleAdmin {
        // admin-specific logic
    }
    // Check feature access programmatically (when you need conditional behavior, not full blocking)
    if auth.HasFeatureAccess(db, user, auth.FeatureMyFeature) {
        // feature-specific logic
    }
    // Check model access
    if !auth.IsModelAllowed(db, user, modelName) {
        return c.JSON(http.StatusForbidden, ...)
    }
 }
 ```
 ## Middleware composition patterns
 Middleware can be composed at different levels. Here are the patterns used in the codebase:
 ### Group-level middleware (agents pattern)
 ```go
 // All routes in the group share the middleware
 g := e.Group("/api/agents", poolReadyMw, agentsMw)
 g.GET("", listHandler)
 g.POST("", createHandler)
 ```
 ### Per-route middleware (localai pattern)
 ```go
 // Individual routes get middleware as extra arguments
 router.POST("/models/apply", applyHandler, adminMiddleware)
 router.GET("/metrics", metricsHandler, adminMiddleware)
 ```
 ### Middleware slice (openai pattern)
 ```go
 // Build a middleware chain for a handler
 chatMiddleware := []echo.MiddlewareFunc{
    usageMiddleware,
    traceMiddleware,
    modelFilterMiddleware,
 }
 app.POST("/v1/chat/completions", chatHandler, chatMiddleware...)
 ```
 ## Error response format
 Always use `schema.ErrorResponse` for auth/permission errors to stay consistent with the OpenAI-compatible API:
 ```go
 return c.JSON(http.StatusForbidden, schema.ErrorResponse{
    Error: &schema.APIError{
        Message: "feature not enabled for your account",
        Code:    http.StatusForbidden,
        Type:    "authorization_error",
    },
 })
 ```
 Use these HTTP status codes:
 - `401 Unauthorized` — no valid credentials provided
 - `403 Forbidden` — authenticated but lacking permission
 - `429 Too Many Requests` — rate limited (auth endpoints)
 ## Usage tracking
 If your endpoint should be tracked for usage (token counts, request counts), add the `usageMiddleware` to its middleware chain. See `core/http/middleware/usage.go` and how it's applied in `routes/openai.go`.
 ## Advertising surfaces — where to register a new capability
 Beyond routing and auth, LocalAI publishes its capability surface in **four independent places**. When you add an endpoint — especially one introducing a net-new capability like a new media type or a new auth-gated feature — you must update every relevant surface. These aren't optional: missing them means the endpoint works but is invisible to clients, admins, and the UI.
 ### 1. Swagger `@Tags` annotation (mandatory)
 Every handler needs a swagger block so the endpoint appears in `/swagger/index.html` and in the `/api/instructions` output. The `@Tags` value is what groups the endpoint into a capability area:
 ```go
 // MyEndpoint does X.
 // @Summary Do X.
 // @Tags my-capability
 // @Param request body schema.MyRequest true "payload"
 // @Success 200 {object} schema.MyResponse "Response"
 // @Router /v1/my-endpoint [post]
 func MyEndpoint(...) echo.HandlerFunc { ... }
 ```
 Use an existing tag when the endpoint extends an existing area (e.g. `audio`, `images`, `face-recognition`). Create a new tag only when the endpoint introduces a genuinely new capability surface — and in that case, also register it in step 2.
 After adding endpoints, regenerate the embedded spec so the runtime serves it:
 ```bash
 make protogen-go         # ensures gRPC codegen is fresh first
 make swagger             # regenerates swagger/swagger.json
 ```
 ### 2. `/api/instructions` registry (for new capability areas)
 `core/http/endpoints/localai/api_instructions.go` defines `instructionDefs` — a lightweight, machine-readable index of capability areas that groups swagger endpoints by tag. It's the primary discovery surface for agents and SDKs ("what can this server do?").
 **When to update:** only when adding a new capability area (a new swagger tag). Existing-tag additions automatically surface without any change here.
 Add an entry to `instructionDefs`:
 ```go
 {
    Name:        "my-capability",             // URL segment at /api/instructions/my-capability
    Description: "Short sentence describing the capability",
    Tags:        []string{"my-capability"},   // must match swagger @Tags
    Intro:       "Optional gotcha/context that isn't in the swagger descriptions (caveats, defaults, cross-references to other endpoints).",
 },
 ```
 Also bump the expected-length count in `api_instructions_test.go` and add the name to the `ContainElements` assertion.
 ### 3. `capabilities.js` symbol (for new model-config FLAG_* flags)
 If your feature needs a new `FLAG_*` usecase flag in `core/config/model_config.go` (so users can filter gallery models by it, and so `/v1/models` surfaces it), you need to update **all** of:
 - `Usecase<Name>` string constant in `core/config/backend_capabilities.go`
 - `UsecaseInfoMap` entry mapping the string to its flag + gRPC method
 - `FLAG_<NAME>` bitmask in `core/config/model_config.go`
 - `GetAllModelConfigUsecases()` map entry (otherwise the YAML loader silently ignores the string)
 - `ModalityGroups` membership if the flag should affect `IsMultimodal()` (e.g. realtime_audio is in both speech-input and audio-output groups so a lone flag still reads as multimodal)
 - `GuessUsecases()` branch listing the backends that own this capability
 - `usecaseFilters` in `core/http/routes/ui_api.go` (drives the gallery filter dropdown)
 - `Models.jsx` `FILTERS` array + matching `filters.<camelCase>` i18n key in `core/http/react-ui/public/locales/en/models.json`
 - `core/http/react-ui/src/utils/capabilities.js`:
 ```js
 export const CAP_MY_CAPABILITY = 'FLAG_MY_CAPABILITY'
 ```
 React pages that want to filter the ModelSelector by capability import this symbol. Declare it even if you're not building the UI page yet — the declaration keeps the Go/JS vocabularies in sync.
 ### 4. `docs/content/` (user-facing documentation)
 A new capability deserves its own page under `docs/content/features/`, plus cross-links from related features and an entry in `docs/content/whats-new.md`. See the pattern used by `face-recognition.md` / `object-detection.md`.
 ## Path protection rules
 The global auth middleware classifies paths as API paths or non-API paths:
 - **API paths** (always require auth when auth is enabled): `/api/`, `/v1/`, `/models/`, `/backends/`, `/backend/`, `/tts`, `/vad`, `/video`, `/stores/`, `/system`, `/ws/`, `/metrics`
 - **Exempt paths** (never require auth): `/api/auth/` prefix, anything in `appConfig.PathWithoutAuth`
 - **Non-API paths** (UI, static assets): pass through without auth — the React UI handles login redirects client-side
 If you add endpoints under a new top-level path prefix, add it to `isAPIPath()` in `middleware.go` to ensure it requires authentication.
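 As a rough illustration of the three-way classification, the sketch below re-implements it with the standard library only. It is not the real `isAPIPath()` from `middleware.go`: the `classify` helper is hypothetical, the prefix list is copied from the bullets above, and treating `PathWithoutAuth` entries as prefixes is an assumption.

 ```go
 package main

 import (
    "fmt"
    "strings"
 )

 // apiPrefixes mirrors the API path list documented above.
 var apiPrefixes = []string{
    "/api/", "/v1/", "/models/", "/backends/", "/backend/", "/tts",
    "/vad", "/video", "/stores/", "/system", "/ws/", "/metrics",
 }

 // hasAnyPrefix reports whether path starts with one of the prefixes.
 func hasAnyPrefix(path string, prefixes []string) bool {
    for _, p := range prefixes {
        if strings.HasPrefix(path, p) {
            return true
        }
    }
    return false
 }

 // classify returns "exempt", "api", or "non-api" for a request path.
 // Exemptions are checked first so /api/auth/ wins over the /api/ prefix.
 func classify(path string, pathWithoutAuth []string) string {
    if strings.HasPrefix(path, "/api/auth/") || hasAnyPrefix(path, pathWithoutAuth) {
        return "exempt"
    }
    if hasAnyPrefix(path, apiPrefixes) {
        return "api"
    }
    return "non-api"
 }

 func main() {
    for _, p := range []string{"/v1/chat/completions", "/api/auth/login", "/app/index.html"} {
        fmt.Println(p, "->", classify(p, nil))
    }
 }
 ```

 A route under a prefix that is missing from the list falls into the last branch and is served without auth, which is the failure mode the `isAPIPath()` rule guards against.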
 ## Checklist
 When adding a new endpoint:
 **Routing & auth**
 - [ ] Handler in `core/http/endpoints/`
 - [ ] Route registered in appropriate `core/http/routes/` file
 - [ ] Auth level chosen: public / standard / admin / feature-gated
 - [ ] Entry added to `RouteFeatureRegistry` in `core/http/auth/features.go` (one row per route/method — all /v1/* routes gate through this, not per-route middleware)
 - [ ] If new feature: constant in `permissions.go`, added to the right slice (`APIFeatures` default-ON / `AgentFeatures` default-OFF), metadata in `features.go` `*FeatureMetas()`
 - [ ] If feature uses group middleware: wired in `core/http/app.go` and passed to the route registration function
 - [ ] If new path prefix: added to `isAPIPath()` in `middleware.go`
 - [ ] If token-counting: `usageMiddleware` added to middleware chain
 **Advertising surfaces (easy to miss — see the [Advertising surfaces](#advertising-surfaces--where-to-register-a-new-capability) section)**
 - [ ] Swagger block on the handler: `@Summary`, `@Tags`, `@Param`, `@Success`, `@Router`
 - [ ] If new capability area (new swagger tag): entry in `instructionDefs` in `core/http/endpoints/localai/api_instructions.go` + test count bumped in `api_instructions_test.go`
 - [ ] If new `FLAG_*` usecase flag: matching `CAP_*` symbol exported from `core/http/react-ui/src/utils/capabilities.js`
 - [ ] `docs/content/features/<feature>.md` created; cross-links from related feature pages; entry in `docs/content/whats-new.md`
 **Quality**
 - [ ] Error responses use `schema.ErrorResponse` format (or `echo.NewHTTPError` with a mapped gRPC status — see the `mapBackendError` helper in `core/http/endpoints/localai/images.go`)
 - [ ] Tests cover both authenticated and unauthenticated access
 - [ ] Swagger regenerated (`make swagger`) if you changed any `@Router`/`@Tags`/`@Param` annotation
 ## Companion: MCP admin tool surface
 **Required for admin endpoints.** Every new admin endpoint MUST be considered for the MCP admin tool surface — the REST API and the MCP tool catalog can drift silently otherwise, and both the LocalAI Assistant chat modality and the standalone `local-ai mcp-server` rely on `pkg/mcp/localaitools/` to mirror REST.
 Two outcomes are acceptable; one is not:
 - **Tool added.** The new endpoint is something an admin would manage conversationally (install, list, edit, toggle, upgrade). Follow the full checklist in [.agents/localai-assistant-mcp.md](localai-assistant-mcp.md): add a `LocalAIClient` interface method, implement it in both `inproc` and `httpapi`, register the tool with a `Tool*` constant, update the skill prompts, **and add the route to `toolToHTTPRoute` in `pkg/mcp/localaitools/coverage_test.go`**.
 - **Tool deliberately skipped.** The endpoint is internal/diagnostic and adding a chat path would be misleading. Document the decision in the PR description; no code action.
 - **Forgot.** This breaks the contract. The `TestToolHTTPRouteMappingComplete` test in `pkg/mcp/localaitools` is a partial guard (it checks every `Tool*` has a route mapping), but it does NOT detect new REST endpoints without a tool — that's still a process check on the PR author.
 **Add to the bottom of the checklist above**:
 - [ ] If admin: decided whether MCP coverage is needed; if yes, tool registered + map updated; if no, skip-reason in PR description.
--- a/.agents/backend-signing.md
+++ b/.agents/backend-signing.md
@@ -1,126 +0,0 @@
 # Backend image signing & verification
 LocalAI verifies backend OCI images against a per-gallery keyless-cosign
 policy. This page documents the trust model, the producer side
 (`.github/workflows/backend_merge.yml` in this repo), and the consumer
 side (`pkg/oci/cosignverify` plus the gallery YAML).
 ## Trust model
 - **Producer:** `.github/workflows/backend_merge.yml` signs each pushed
  manifest list with `cosign sign --recursive` in keyless mode after
  `docker buildx imagetools create`. The signing cert is issued by
  Fulcio bound to the workflow's OIDC identity. There is no long-lived
  signing key. `--recursive` signs both the manifest list and every
  per-arch entry — needed because our consumer resolves a tag to a
  per-arch manifest before checking signatures.
 - **Storage:** Signatures are written as OCI 1.1 referrers
  (`--registry-referrers-mode=oci-1-1`) in the new Sigstore bundle format
  (current cosign releases do this by default; no `--new-bundle-format`
  flag). No `:sha256-<hex>.sig` tag clutter.
 - **Consumer:** `pkg/oci/cosignverify` discovers the bundle via the
  referrers API, hands it to `sigstore-go`, and verifies it against the
  policy declared in the gallery YAML (`Gallery.Verification`).
 - **Revocation:** Keyless cosign certs are ephemeral (10-minute Fulcio
  validity), so revocation is policy-side, not CA-side. The gallery's
  `verification.not_before` (RFC3339) is the kill-switch — advance it to
  invalidate every signature produced before a known compromise window.
 ## Producer setup
 `backend_merge.yml` is the workflow that joins per-arch digests into the
 multi-arch manifest list users actually pull, so it's also the right place
 to sign. The job needs:
 - `permissions: { id-token: write, contents: read }` at the job level so
  the runner can exchange its GitHub OIDC token for a Fulcio cert.
 - `sigstore/cosign-installer@v3` step (current cosign releases already
  default to the new bundle format).
 - After each `docker buildx imagetools create`, resolve the resulting
  list digest with `docker buildx imagetools inspect <tag> --format
  '{{.Manifest.Digest}}'` and sign:
 ```sh
 cosign sign --yes --recursive \
  --registry-referrers-mode=oci-1-1 \
  "${REGISTRY_REPO}@${DIGEST}"
 ```
 Sign by digest, never by tag — signing by tag binds the signature to
 whatever the tag points at *now*, and a subsequent tag push orphans it.
 `--registry-referrers-mode=oci-1-1` is still gated behind
 `COSIGN_EXPERIMENTAL=1` in cosign v2.4.x (set at the job env level in
 `backend_merge.yml`). Re-evaluate when bumping the pinned cosign release
 — newer versions are expected to graduate this flag and the env var can
 then be dropped.
 `backend_build_darwin.yml` builds and pushes single-arch darwin images
 that bypass the manifest-list merge. If/when those entries get a gallery
 `verification:` policy, the equivalent cosign step has to land there
 too.
 ## Consumer setup (in `mudler/LocalAI` gallery YAML)
 Once CI is signing, add a `verification:` block to the backend gallery
 entry (`backend/index.yaml`):
 ```yaml
 - name: localai
  url: github:mudler/LocalAI/backend/index.yaml@master
  verification:
    issuer: "https://token.actions.githubusercontent.com"
    identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@refs/heads/master$"
    # Optional revocation cutoff; advance during incident response.
    # not_before: "2026-06-01T00:00:00Z"
 ```
 Identity matching pins the OIDC subject Fulcio issued the signing cert
 to. Without this, any image signed by *anyone* with a Fulcio cert would
 pass — the regex is what makes a signature mean "produced by our CI".
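 To make the role of the identity regex concrete, here is a minimal sketch of the policy match in isolation. It is not the `pkg/oci/cosignverify` implementation (the real check is delegated to `sigstore-go`); the `policy` type and `matches` method are hypothetical, and only the issuer and regex values come from the YAML above.

 ```go
 package main

 import (
    "fmt"
    "regexp"
 )

 // policy is a hypothetical in-memory form of the gallery verification block.
 type policy struct {
    Issuer        string
    IdentityRegex *regexp.Regexp
 }

 // ciPolicy uses the issuer and identity_regex from the gallery example.
 var ciPolicy = policy{
    Issuer:        "https://token.actions.githubusercontent.com",
    IdentityRegex: regexp.MustCompile(`^https://github\.com/mudler/LocalAI/\.github/workflows/backend_merge\.yml@refs/heads/master$`),
 }

 // matches reports whether a certificate's issuer and subject satisfy the policy.
 func (p policy) matches(issuer, identity string) bool {
    return issuer == p.Issuer && p.IdentityRegex.MatchString(identity)
 }

 func main() {
    ours := "https://github.com/mudler/LocalAI/.github/workflows/backend_merge.yml@refs/heads/master"
    fork := "https://github.com/someone/LocalAI/.github/workflows/backend_merge.yml@refs/heads/master"
    fmt.Println(ciPolicy.matches(ciPolicy.Issuer, ours)) // true
    fmt.Println(ciPolicy.matches(ciPolicy.Issuer, fork)) // false
 }
 ```

 The anchors `^` and `$` matter here: without them a longer subject that merely contains the expected workflow path would also match.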
 ## Strict mode
 Default behaviour: OCI backends without a `verification:` block install
 with a warning (logs include `installing OCI backend without signature
 verification`). Tarball/HTTP backends without a `sha256` field log a
 similar warning.
 For production, set `LOCALAI_REQUIRE_BACKEND_INTEGRITY=1` (or pass
 `--require-backend-integrity` to `local-ai run` / `local-ai backends
 install` / `local-ai models install`). The warning becomes a hard error
 and unverifiable backends refuse to install.
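 The warn-versus-fail behaviour reduces to a small decision, sketched below under stated assumptions. The `checkIntegrity` function and `errUnverifiable` value are hypothetical and do not reflect how the flag is actually wired through `core/cli`; only the warning text is quoted from the log message above.

 ```go
 package main

 import (
    "errors"
    "fmt"
 )

 // errUnverifiable is a hypothetical error for backends with no integrity policy.
 var errUnverifiable = errors.New("backend cannot be verified and strict mode is enabled")

 // checkIntegrity returns a warning to log (possibly empty) and an error.
 // hasPolicy is true when the backend has a verification block or sha256 field.
 func checkIntegrity(hasPolicy, strict bool) (string, error) {
    switch {
    case hasPolicy:
        return "", nil // verification proceeds normally
    case strict:
        return "", errUnverifiable // strict mode: refuse to install
    default:
        return "installing OCI backend without signature verification", nil
    }
 }

 func main() {
    warning, err := checkIntegrity(false, false)
    fmt.Println(warning, err)
    _, err = checkIntegrity(false, true)
    fmt.Println(err)
 }
 ```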
 ## Revocation playbook
 If `backend_merge.yml` (or any workflow with `id-token: write`) is
 compromised and we've shipped malicious signed images:
 1. **Identify the compromise window.** Find the earliest IntegratedTime
   from the bad signatures (Rekor search by `subject` filter).
 2. **Set `verification.not_before`** in `backend/index.yaml` to a
   timestamp just *after* that window's start.
 3. **Push the YAML.** Deployed LocalAI instances pick it up on next
   gallery refresh (1-hour cache in `core/gallery/gallery.go`).
 4. **Fix the underlying compromise** in the workflow and re-sign images
   with the new build, which will have IntegratedTime > `not_before`.
 5. **Optional:** for absolute decisiveness, also rotate to a new
   workflow path (`backend_merge_v2.yml`) and update `identity_regex`.
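 The effect of `not_before` in this playbook can be sketched as a plain timestamp comparison. This is an illustration, not the NotBefore enforcement in `pkg/oci/cosignverify`: the `acceptedAfterCutoff` helper is hypothetical, and accepting a signature whose time equals the cutoff exactly is an assumption.

 ```go
 package main

 import (
    "fmt"
    "time"
 )

 // acceptedAfterCutoff reports whether a signature with the given integrated
 // time survives the not_before cutoff. An empty cutoff disables the check.
 func acceptedAfterCutoff(notBefore string, integrated time.Time) (bool, error) {
    if notBefore == "" {
        return true, nil
    }
    cutoff, err := time.Parse(time.RFC3339, notBefore)
    if err != nil {
        return false, fmt.Errorf("invalid not_before: %w", err)
    }
    return !integrated.Before(cutoff), nil
 }

 func main() {
    cutoff := "2026-06-01T00:00:00Z"
    old := time.Date(2026, 5, 31, 12, 0, 0, 0, time.UTC)
    fresh := time.Date(2026, 6, 2, 12, 0, 0, 0, time.UTC)
    fmt.Println(acceptedAfterCutoff(cutoff, old))   // false <nil>
    fmt.Println(acceptedAfterCutoff(cutoff, fresh)) // true <nil>
 }
 ```

 Re-signed images pass because their integrated time falls after the cutoff, while everything signed earlier is rejected without touching the registry.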
 ## Where the code lives
 - `pkg/oci/cosignverify/` — verifier, policy, OCI referrer fetch, NotBefore enforcement.
 - `pkg/downloader/uri.go` — `WithImageVerifier` option threaded through `DownloadFileWithContext`.
 - `core/gallery/backends.go` — `backendDownloadOptions` builds the verifier from the gallery's policy.
 - `core/config/gallery.go` — `Gallery.Verification` YAML schema.
 - `core/cli/run.go`, `core/cli/backends.go`, `core/cli/models.go` — `--require-backend-integrity` flag propagation.
 - `.github/workflows/backend_merge.yml` — producer-side `cosign sign --recursive` after each multi-arch manifest list push.
 ## Out of scope (follow-ups)
 - **Signing the gallery YAML itself.** The index is fetched over HTTPS
  from GitHub; we trust the host. A cosign blob signature on the YAML
  would close that gap but adds key-management overhead. Revisit this
  page if/when added.
 - **Tarball/HTTP backend signing.** Cosign can sign arbitrary blobs, but
  for now non-OCI backends keep using the `sha256:` field in YAML.
--- a/.agents/building-and-testing.md
+++ b/.agents/building-and-testing.md
@@ -1,17 +0,0 @@
 # Build and Testing
 Building and testing the project depends on the components involved and the platform where development is taking place. Because of the amount of context required, it's usually best not to try building or testing the project unless the user requests it. If you must build the project, inspect the Makefile in the project root and the Makefiles of any backends that are affected by the changes you are making. The workflows in `.github/workflows` can also be used as a reference when it is unclear how to build or test a component. The primary Makefile contains targets for building inside or outside Docker; if the user has not previously specified a preference, ask which they would like to use.
 ## Building a specified backend
 Suppose the user wants to build a particular backend for a given platform, for example coqui for ROCm/hipblas:
 - The Makefile has targets like `docker-build-coqui` created with `generate-docker-build-target` at the time of writing. Recently added backends may require a new target.
 - At a minimum we need to set the `BUILD_TYPE` and `BASE_IMAGE` build-args
  - Use `.github/backend-matrix.yml` as a reference — it's the data-only YAML that lists every backend variant's `build-type`, `base-image`, `platforms`, etc. (`backend.yml` and `backend_pr.yml` consume it via `scripts/changed-backends.js`).
  - l4t and cublas also require the CUDA major and minor version.
  - For llama-cpp / ik-llama-cpp / turboquant the matrix also sets `builder-base-image` pointing at a prebuilt `quay.io/go-skynet/ci-cache:base-grpc-*` tag. Local `make backends/<name>` defaults to `BUILDER_TARGET=builder-fromsource` and doesn't need it — the Dockerfile's from-source stage installs everything itself.
 - You can pretty print a command like `DOCKER_MAKEFLAGS=-j$(nproc --ignore=1) BUILD_TYPE=hipblas BASE_IMAGE=rocm/dev-ubuntu-24.04:7.2.1 make docker-build-coqui`
 - Unless the user explicitly asks you to run the command, just print it: not all agent frontends handle long-running jobs well, and the output may overflow your context.
 - The user may say AMD or ROCm instead of hipblas, Intel instead of SYCL, or NVIDIA instead of l4t or cublas. Ask for confirmation if there is ambiguity.
 - Sometimes the user may need extra parameters to be added to `docker build` (e.g. `--platform` for cross-platform builds or `--progress` to view the full logs), in which case you can generate the `docker build` command directly.
--- a/.agents/ci-caching.md
+++ b/.agents/ci-caching.md
@@ -1,250 +0,0 @@
 # CI Build Caching
 Container builds — both the root LocalAI image (`Dockerfile`) and the per-backend images (`backend/Dockerfile.*`) — share a registry-backed BuildKit cache plus a layered set of prebuilt base images. This file explains how the cache is laid out, what invalidates it, and how to bypass it.
 ## Workflow surfaces
 | Workflow | Purpose | Triggers |
 |---|---|---|
 | `.github/workflows/backend.yml` | Backend container images on master | `push` to master + tags, weekly Sunday cron, `workflow_dispatch` |
 | `.github/workflows/backend_pr.yml` | Backend container images on PRs | `pull_request` |
 | `.github/workflows/backend_build.yml` | Reusable: builds one backend (one arch) by digest | `workflow_call` from above |
 | `.github/workflows/backend_merge.yml` | Reusable: assembles per-arch digests into a multi-arch manifest list | `workflow_call` |
 | `.github/workflows/backend_build_darwin.yml` | Reusable: macOS-native backend builds | `workflow_call` |
 | `.github/workflows/image.yml` / `image-pr.yml` | Root LocalAI image (push / PR) | push / PR |
 | `.github/workflows/image_build.yml` / `image_merge.yml` | Reusable: per-arch root-image build + merge | `workflow_call` |
 | `.github/workflows/base-images.yml` | Builds the prebuilt `base-grpc-*` builder bases | Saturdays 05:00 UTC cron, `workflow_dispatch`, master push touching `Dockerfile.base-grpc-builder`, `.docker/install-base-deps.sh`, `.docker/apt-mirror.sh`, or this workflow |
 The matrix that drives `backend.yml` / `backend_pr.yml` lives in **`.github/backend-matrix.yml`** (data-only YAML, not embedded in the workflow). `scripts/changed-backends.js` parses it, applies path-filter logic against the PR diff (PR events) or the GitHub Compare API (push events), and emits the filtered matrix plus a `merge-matrix` for backends with multiple per-arch entries.
 ## Cache layout
 - **Cache registry**: `quay.io/go-skynet/ci-cache`
 - **One tag per matrix entry per arch**, derived from `tag-suffix` and `platform-tag`:
  - Backend builds (`backend_build.yml`): `cache<tag-suffix>-<platform-tag>`
    - e.g. `cache-cpu-faster-whisper-amd64`, `cache-cpu-faster-whisper-arm64`, `cache-gpu-nvidia-cuda-13-llama-cpp-amd64`
  - Root image builds (`image_build.yml`): `cache-localai<tag-suffix>-<platform-tag>` (with a `-core` placeholder when `tag-suffix` is empty, so `cache-localai-core-amd64` for the core image)
  - Pre-built base images (`base-images.yml`): `cache-base-grpc-<variant>` (one per `(BUILD_TYPE, arch)` permutation)
 - Each tag stores a multi-arch BuildKit cache manifest (`mode=max`), so every intermediate stage is re-usable, not just the final image.
 The per-arch suffix exists because amd64 and arm64 builds produce different intermediate content; sharing one cache key would thrash on every cross-arch rebuild.
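 The tag naming scheme above is simple string assembly. The sketch below reproduces it with two hypothetical helpers (`backendCacheTag` and `rootImageCacheTag` are illustrative names, not functions from the workflows), using the example values from the list.

 ```go
 package main

 import "fmt"

 // backendCacheTag builds cache<tag-suffix>-<platform-tag> for backend builds.
 func backendCacheTag(tagSuffix, platformTag string) string {
    return "cache" + tagSuffix + "-" + platformTag
 }

 // rootImageCacheTag builds cache-localai<tag-suffix>-<platform-tag>,
 // substituting the -core placeholder when the suffix is empty.
 func rootImageCacheTag(tagSuffix, platformTag string) string {
    if tagSuffix == "" {
        tagSuffix = "-core"
    }
    return "cache-localai" + tagSuffix + "-" + platformTag
 }

 func main() {
    fmt.Println(backendCacheTag("-cpu-faster-whisper", "amd64")) // cache-cpu-faster-whisper-amd64
    fmt.Println(backendCacheTag("-cpu-faster-whisper", "arm64")) // cache-cpu-faster-whisper-arm64
    fmt.Println(rootImageCacheTag("", "amd64"))                  // cache-localai-core-amd64
 }
 ```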
 ## Read/write semantics
 | Trigger | `cache-from` | `cache-to` |
 |---|---|---|
 | `push` to `master` / tag / cron / dispatch | yes | yes (`mode=max,ignore-error=true`) |
 | `pull_request` | yes | **no** |
 PR builds read master's warm cache but never write — this prevents PRs from polluting the shared cache with their experimental state. After merge, the master build for that matrix entry refreshes the cache.
 `ignore-error=true` on the write side means a transient quay push failure does not fail the build; the next master push retries.
 ## Pre-built base images (`base-grpc-*`)
 The C++ backend Dockerfiles (`Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}`) compile gRPC from source. On a cold build that's ~25–35 min before any LocalAI source compiles. To skip that on CI, `.github/workflows/base-images.yml` builds and pushes a set of pre-prepped builder bases:
 | Tag | Contents |
 |---|---|
 | `base-grpc-amd64` / `base-grpc-arm64` | Ubuntu 24.04 + apt build deps + protoc + cmake + gRPC at `/opt/grpc` |
 | `base-grpc-cuda-12-amd64` | the above + CUDA 12.8 toolkit |
 | `base-grpc-cuda-13-amd64` | the above + CUDA 13.0 toolkit (Ubuntu 22.04 base) |
 | `base-grpc-cuda-13-arm64` | the above + CUDA 13.0 sbsa toolkit (Ubuntu 24.04 base) |
 | `base-grpc-l4t-cuda-12-arm64` | JetPack r36.4.0 base (CUDA preinstalled, `SKIP_DRIVERS=true`) + gRPC |
 | `base-grpc-rocm-amd64` | rocm/dev-ubuntu-24.04:7.2.1 base + hipblas/hipblaslt/rocblas + gRPC |
 | `base-grpc-vulkan-amd64` / `base-grpc-vulkan-arm64` | Ubuntu 24.04 + Vulkan SDK 1.4.335 + gRPC |
 | `base-grpc-intel-amd64` | intel/oneapi-basekit:2025.3.2 base + gRPC |
 **Single source of truth**: the install logic for all 10 variants lives in `.docker/install-base-deps.sh`. Both `Dockerfile.base-grpc-builder` AND each variant Dockerfile's `builder-fromsource` stage bind-mount and execute the same script — so the prebuilt CI base and the local from-source path are bit-equivalent by construction.
 ### How variant Dockerfiles consume the base
 `Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}` are multi-target, with two builder stages, an aliasing stage, and a final `scratch` stage:
 - `builder-fromsource` — `FROM ${BASE_IMAGE}` then runs `install-base-deps.sh` and the per-backend compile script. Used when `BUILDER_TARGET=builder-fromsource` (the default; local `make backends/<name>`).
 - `builder-prebuilt` — `FROM ${BUILDER_BASE_IMAGE}` (one of the prebuilt `base-grpc-*` tags) and runs only the per-backend compile script. Used when `BUILDER_TARGET=builder-prebuilt` (CI when the matrix entry sets `builder-base-image`).
 - `FROM ${BUILDER_TARGET} AS builder` — alias resolves the ARG-selected stage to a fixed name (BuildKit doesn't allow ARG expansion in `COPY --from=`).
 - `FROM scratch` + `COPY --from=builder ...package/. ./` — emits the final scratch image with just the package contents.
 BuildKit prunes the unreferenced builder stage, so each build only runs the path it needs. `backend_build.yml` derives `BUILDER_TARGET=builder-prebuilt` automatically when the matrix entry has a non-empty `builder-base-image`; otherwise it defaults to `builder-fromsource`.
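 The selection rule is a one-line conditional. The sketch below expresses it in Go for clarity only; the actual derivation lives in the `backend_build.yml` workflow, and the `builderTarget` function name is hypothetical.

 ```go
 package main

 import "fmt"

 // builderTarget picks the Dockerfile stage from the matrix entry's
 // builder-base-image value: prebuilt when set, from-source otherwise.
 func builderTarget(builderBaseImage string) string {
    if builderBaseImage != "" {
        return "builder-prebuilt"
    }
    return "builder-fromsource"
 }

 func main() {
    fmt.Println(builderTarget("quay.io/go-skynet/ci-cache:base-grpc-amd64")) // builder-prebuilt
    fmt.Println(builderTarget(""))                                           // builder-fromsource
 }
 ```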
 The matrix `(build-type, platforms)` → `builder-base-image` mapping for llama-cpp / ik-llama-cpp / turboquant entries:
 | `build-type` | `platforms` | tag |
 |---|---|---|
 | `''` | `linux/amd64` | `base-grpc-amd64` |
 | `''` | `linux/arm64` | `base-grpc-arm64` |
 | `cublas` cuda 12 | `linux/amd64` | `base-grpc-cuda-12-amd64` |
 | `cublas` cuda 13 | `linux/amd64` | `base-grpc-cuda-13-amd64` |
 | `cublas` cuda 13 | `linux/arm64` | `base-grpc-cuda-13-arm64` |
 | `cublas` cuda 12 + JetPack base | `linux/arm64` | `base-grpc-l4t-cuda-12-arm64` |
 | `hipblas` | `linux/amd64` | `base-grpc-rocm-amd64` |
 | `vulkan` | `linux/amd64` | `base-grpc-vulkan-amd64` |
 | `vulkan` | `linux/arm64` | `base-grpc-vulkan-arm64` |
 | `sycl_*` | `linux/amd64` | `base-grpc-intel-amd64` |
 ### Bootstrap order when adding a new variant
 If you add a new entry to `base-images.yml`'s matrix, the new tag does not exist on quay until the workflow runs. To consume it from a variant entry safely, dispatch the base-images workflow on the branch first:
 ```bash
 gh workflow run base-images.yml --ref <feature-branch>
 ```
 Wait for the new variant to push, then merge the consumer change. Otherwise the consumer's CI fails with "image not found."
 ## Per-arch native builds + manifest merge
 Multi-arch backends (and the core LocalAI image) build natively per arch instead of running both arches under QEMU emulation on a single x86 runner. The pattern:
 - The matrix has TWO entries per multi-arch backend, sharing the same `tag-suffix` but distinct `platforms` + `platform-tag` + `runs-on`. Example: `-cpu-faster-whisper` has one amd64 entry on `ubuntu-latest` and one arm64 entry on `ubuntu-24.04-arm`.
 - Each per-arch build pushes by **canonical digest only** (no tags) via `outputs: type=image,push-by-digest=true,name-canonical=true,push=true`. The digest is uploaded as an artifact named `digests<tag-suffix>-<platform-tag>` (or `digests-localai<...>` for root-image builds).
 - `scripts/changed-backends.js` detects shared `tag-suffix` and emits a `merge-matrix` output. `backend.yml` / `backend_pr.yml` have a `backend-merge-jobs` job that consumes it and calls `backend_merge.yml`.
 - `backend_merge.yml` downloads all matching digest artifacts and runs `docker buildx imagetools create` to publish the final tagged manifest list pointing at both per-arch digests. Same `docker/metadata-action` config as the original monolithic build, so consumers see no tag-shape change.
 - `image_merge.yml` is the equivalent for the root LocalAI image (`-core` placeholder when `tag-suffix` is empty so the artifact-name glob doesn't over-match across `core` and `gpu-vulkan`).
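 The shared-suffix detection can be sketched as a grouping pass over the matrix. This is not the logic in `scripts/changed-backends.js` (which is JavaScript and emits a richer `merge-matrix`); the `entry` type and `mergeSuffixes` function are hypothetical and only show the grouping idea.

 ```go
 package main

 import "fmt"

 // entry is a hypothetical, trimmed-down matrix entry.
 type entry struct {
    TagSuffix   string
    PlatformTag string
 }

 // mergeSuffixes returns, in first-seen order, every tag-suffix that appears
 // on more than one entry and therefore needs a manifest-list merge.
 func mergeSuffixes(entries []entry) []string {
    counts := map[string]int{}
    var order []string
    for _, e := range entries {
        if counts[e.TagSuffix] == 0 {
            order = append(order, e.TagSuffix)
        }
        counts[e.TagSuffix]++
    }
    var out []string
    for _, s := range order {
        if counts[s] > 1 {
            out = append(out, s)
        }
    }
    return out
 }

 func main() {
    matrix := []entry{
        {"-cpu-faster-whisper", "amd64"},
        {"-cpu-faster-whisper", "arm64"},
        {"-gpu-nvidia-cuda-12-vllm", "amd64"},
    }
    fmt.Println(mergeSuffixes(matrix)) // [-cpu-faster-whisper]
 }
 ```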
 **`provenance: false` is required on multi-registry digest pushes**: with the default `mode=max` provenance attestation, BuildKit bundles a per-registry attestation manifest into each registry's manifest list, making the resulting list digest diverge across registries. `steps.build.outputs.digest` only matches one of them and the merge step's `imagetools create <reg>@sha256:<digest>` lookup fails on the other. Setting `provenance: false` keeps the digest content-only and identical across registries.
 ## Path filter on master push
 Both `backend.yml` (push) and `backend_pr.yml` (PR) generate their matrix dynamically through `scripts/changed-backends.js`:
 - **PR events**: paginated `pulls/{n}/files` API → filter the matrix to entries whose `dockerfile` path prefix matches the PR diff.
 - **Push events**: GitHub Compare API (`/repos/{owner}/{repo}/compare/{before}...{after}`) → same path-filter logic. Falls back to "run everything" on the first push of a branch (`event.before` is all zeros), API truncation (≥300 changed files), a missing API token, or any thrown error.
 - **Tag pushes**: `FORCE_ALL=true` is set from the workflow side (`startsWith(github.ref, 'refs/tags/')`) — releases rebuild every backend regardless of diff.
 - **Schedule / `workflow_dispatch`**: no `event.before`, falls through to "run everything" automatically.
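 Taken together, the fallback rules form a short decision function. The sketch below is written in Go for illustration and is not a port of `scripts/changed-backends.js`; the `buildEvent` type and its field names are hypothetical, while the conditions come from the list above.

 ```go
 package main

 import (
    "fmt"
    "strings"
 )

 // zeroSHA is the all-zero commit id reported as "before" on a first push.
 var zeroSHA = strings.Repeat("0", 40)

 // buildEvent is a hypothetical summary of what the script knows about a run.
 type buildEvent struct {
    IsTag        bool   // tag push, i.e. FORCE_ALL=true
    Before       string // event.before; empty for schedule / workflow_dispatch
    HasToken     bool   // API token available
    APIFailed    bool   // Compare API call threw
    ChangedFiles int    // files reported by the Compare API
 }

 // runEverything reports whether path filtering must be skipped.
 func runEverything(ev buildEvent) bool {
    switch {
    case ev.IsTag:
        return true
    case ev.Before == "" || ev.Before == zeroSHA:
        return true
    case !ev.HasToken || ev.APIFailed:
        return true
    case ev.ChangedFiles >= 300: // response may be truncated
        return true
    }
    return false
 }

 func main() {
    normal := buildEvent{Before: "abc123", HasToken: true, ChangedFiles: 4}
    fmt.Println(runEverything(normal))                                   // false
    fmt.Println(runEverything(buildEvent{Before: zeroSHA, HasToken: true})) // true
 }
 ```

 Every uncertain case resolves to a full rebuild, so a filtering failure costs CI time rather than a skipped backend.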
 The Sunday 06:00 UTC cron on `backend.yml` exists specifically because path filtering can leave Python backends frozen on stale wheels. `DEPS_REFRESH` (below) only fires when the build actually runs, so an untouched Python backend would never re-resolve its unpinned deps. The weekly cron is the safety net.
 ## The `DEPS_REFRESH` cache-buster (Python backends)
 Every Python backend goes through the shared `backend/Dockerfile.python`, which ends with:
 ```dockerfile
 ARG DEPS_REFRESH=initial
 RUN cd /${BACKEND} && PORTABLE_PYTHON=true make
 ```
 Most Python backends ship `requirements*.txt` files that **do not pin every transitive dep** (`torch`, `transformers`, `vllm`, `diffusers`, etc. are listed without a `==` pin, or with `>=` lower bounds only). With a warm BuildKit cache, the `make` layer hashes only on Dockerfile instructions + COPYed source — not on what `pip install` resolves at runtime. So a warm cache would ship the *first* version of `vllm` ever cached and never pick up upstream releases.
 `DEPS_REFRESH` defends against that:
 - `backend_build.yml` computes `date -u +%Y-W%V` (ISO week, e.g. `2026-W19`) before each build and passes it as a build-arg.
 - The `RUN ... make` layer's BuildKit hash now includes that string, so the layer invalidates **at most once per week**, automatically picking up newer wheels.
 - Within a week, builds stay warm.
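 For readers who want to reproduce the cache-buster value outside the workflow, here is a small Go equivalent of `date -u +%Y-W%V`. The `depsRefresh` function name is hypothetical; the workflow itself uses the shell command.

 ```go
 package main

 import (
    "fmt"
    "time"
 )

 // depsRefresh mirrors `date -u +%Y-W%V`: the calendar year (%Y) followed by
 // the zero-padded ISO 8601 week number (%V), both evaluated in UTC.
 func depsRefresh(t time.Time) string {
    t = t.UTC()
    _, week := t.ISOWeek()
    return fmt.Sprintf("%d-W%02d", t.Year(), week)
 }

 func main() {
    fmt.Println(depsRefresh(time.Date(2026, 5, 6, 12, 0, 0, 0, time.UTC))) // 2026-W19
 }
 ```

 Because `%Y` is the calendar year rather than the ISO week-numbering year, the string can change twice around New Year (for example at the start of ISO week 1 and again on January 1), which only costs an extra cache invalidation.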
 This applies only to `Dockerfile.python` because:
 - Go (`Dockerfile.golang`) pins versions in `go.mod` / `go.sum`.
 - Rust (`Dockerfile.rust`) pins via `Cargo.lock`.
 - C++ backends pin gRPC (`v1.65.0`) and llama.cpp at a specific commit; their inputs don't drift between rebuilds.
 ### Adjusting the cadence
 Bump the format to daily (`+%Y-%m-%d`) or hourly (`+%Y-%m-%d-%H`) for faster refreshes. For one-shot rebuilds without changing the schedule, append a marker to the tag-suffix in the matrix or temporarily delete that backend's cache tag in quay.
 ## ccache for C++ backend builds
 `Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}` declare a BuildKit cache mount on `/root/.ccache`:
 ```dockerfile
 RUN --mount=type=cache,target=/root/.ccache,id=<backend>-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
    bash /usr/local/sbin/compile.sh
 ```
 The compile script exports `CMAKE_C/CXX/CUDA_COMPILER_LAUNCHER=ccache` so CMake threads ccache through gcc/g++/nvcc. `cache-to: type=registry,mode=max` exports the cache mount data into the registry cache, so subsequent builds restore it.
 On a `LLAMA_VERSION` bump, most translation units are byte-identical to the previous version's preprocessed source — ccache returns the previous `.o` and skips the real compile. Same for LocalAI source changes that don't actually touch llama.cpp's CMake inputs. Cache scope is per `(TARGETARCH, BUILD_TYPE)` so e.g. cublas-12 doesn't share with cublas-13 (their CUDA headers differ; cross-pollination would just be cache misses anyway).
 ## Composite actions
 Two composite actions handle runner-side prep:
 - **`.github/actions/free-disk-space/action.yml`** — wraps `jlumbroso/free-disk-space@main` plus an explicit apt purge of dotnet/android/ghc/mono/etc. Reclaims ~6–10 GB on `ubuntu-latest`. No-op on self-hosted runners. Used by `backend_build.yml`, `image_build.yml`, `test.yml`, `tests-aio.yml`, etc.
 - **`.github/actions/setup-build-disk/action.yml`** — relocates Docker's data-root to `/mnt` on hosted X64 runners. GHA hosted `ubuntu-latest` ships ~75 GB of unused space at `/mnt`; combined with the free-disk-space cleanup this gives ~100 GB working space — enough for ROCm dev image + vLLM torch install + flash-attn intermediate layers. No-op on self-hosted and on non-X64 hosted runners. Used by `backend_build.yml`, `image_build.yml`, `base-images.yml`.
 Both actions run before any docker buildx step.
 ## Concurrency
 The `backend.yml`, `image.yml`, `test.yml`, and similar workflows all use:
 ```yaml
 concurrency:
   group: ci-<workflow>-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
   cancel-in-progress: ${{ github.event_name == 'pull_request' }}
 ```
 - **PR events** group by PR number → newer pushes to the same PR cancel old runs (intended).
 - **Push events** group by `github.sha` → each master commit gets its own run; rapid-fire merges don't cancel each other (this used to be a real problem: two master pushes 11 seconds apart would cancel the first one's CI).
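The grouping rule can be modelled in a few lines of Go. This is only a sketch of how the two expressions resolve, not code from the workflows: the `event` struct and both function names are hypothetical, and a PR number of 0 stands in for "not a pull request".

```go
package main

import (
	"fmt"
	"strconv"
)

// event is a hypothetical, reduced view of the context fields the expressions read.
type event struct {
	Name     string // github.event_name, e.g. "pull_request" or "push"
	PRNumber int    // github.event.pull_request.number; 0 when the event is not a PR
	SHA      string // github.sha
	Repo     string // github.repository
}

// concurrencyGroup mirrors `pull_request.number || github.sha`:
// the PR number wins when present, otherwise the commit SHA is used.
func concurrencyGroup(workflow string, ev event) string {
	id := ev.SHA
	if ev.PRNumber != 0 {
		id = strconv.Itoa(ev.PRNumber)
	}
	return "ci-" + workflow + "-" + id + "-" + ev.Repo
}

// cancelInProgress mirrors `github.event_name == 'pull_request'`.
func cancelInProgress(ev event) bool {
	return ev.Name == "pull_request"
}

func main() {
	pr := event{Name: "pull_request", PRNumber: 42, SHA: "aaa111", Repo: "owner/repo"}
	push := event{Name: "push", SHA: "bbb222", Repo: "owner/repo"}
	fmt.Println(concurrencyGroup("backend", pr), cancelInProgress(pr))
	fmt.Println(concurrencyGroup("backend", push), cancelInProgress(push))
}
```

Two pushes to the same PR produce the same group string and so cancel each other, while two master commits always differ in their SHA and therefore never share a group.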
 ## Self-warming, no separate populator
 There is no cron job that pre-warms the BuildKit cache for individual backends. The production builds *are* the populators. The first master build of a given matrix entry pays the cold cost; subsequent same-entry master builds reuse everything that hasn't changed (apt installs, gRPC compile in the variant `builder-fromsource` stage or skipped entirely when consuming `base-grpc-*`, Python wheel installs, etc.). The base-images workflow's weekly cron is the closest thing to a populator and only refreshes the prebuilt builder bases.
 ## Manually evicting cache
 To force a fully cold build for one backend or the whole image:
 ```bash
 # Delete a single tag (requires quay credentials with admin on the repo)
 curl -X DELETE \
  -H "Authorization: Bearer ${QUAY_TOKEN}" \
  https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/cache-gpu-nvidia-cuda-12-vllm-amd64
 # List all tags
 curl -s -H "Authorization: Bearer ${QUAY_TOKEN}" \
  "https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/?limit=100" | jq '.tags[].name'
 ```
 Eviction is rarely needed in normal operation — `DEPS_REFRESH` handles weekly drift, source changes invalidate naturally, and `mode=max` keeps the cache scoped per matrix entry per arch so a stale tag never bleeds into a different build.
 ## What the cache does **not** cover
 - The `free-disk-space` and `setup-build-disk` composite actions run on every job — these reclaim runner-state, not Docker layers, so BuildKit caches don't apply.
 - Intermediate artifacts of `Build (PR)` are not pushed anywhere — PRs only build for verification.
 - Darwin builds (see below) — macOS runners have no Docker daemon, so the registry-backed BuildKit cache cannot apply.
 ## Darwin native caches
 `backend_build_darwin.yml` runs natively on `macOS-14` GitHub-hosted runners — there is no Docker, no BuildKit, no cross-job registry cache. Instead, the reusable workflow uses `actions/cache@v4` for four native caches that mirror the spirit of the Linux cache (warm by default, weekly refresh for unpinned Python deps, PRs read-only).
 | Cache | Path(s) | Key | Scope |
 |---|---|---|---|
 | Go modules + build | `~/go/pkg/mod`, `~/Library/Caches/go-build` | `go.sum` (managed by `actions/setup-go@v5` `cache: true`) | All darwin jobs |
 | Homebrew | `~/Library/Caches/Homebrew/downloads`, selected `/opt/homebrew/Cellar/*` | hash of `backend_build_darwin.yml` | All darwin jobs |
 | ccache (llama.cpp CMake) | `~/Library/Caches/ccache` | pinned `LLAMA_VERSION` from `backend/cpp/llama-cpp/Makefile` | `inputs.backend == 'llama-cpp'` only |
 | Python wheels (uv + pip) | `~/Library/Caches/pip`, `~/Library/Caches/uv` | `inputs.backend` + ISO week (`+%Y-W%V`) + hash of that backend's `requirements*.txt` | `inputs.lang == 'python'` only |
 Read/write semantics match the BuildKit cache: `actions/cache/restore` runs every time, `actions/cache/save` is gated on `github.event_name != 'pull_request'`. PRs read master's warm cache but never write back.
 The Python wheel cache uses the same ISO-week cache-buster as the Linux `DEPS_REFRESH` build-arg — same problem (unpinned `torch`/`mlx`/`diffusers`/`transformers` resolve to fresh wheels weekly), same ~one-cold-rebuild-per-week solution.
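To make the weekly rollover concrete, here is a minimal Go sketch that produces a key of the same shape as `+%Y-W%V`. It assumes UTC and, like the `date` format, pairs the calendar year with the ISO week number; `weekKey` and `weekKeyForDate` are illustrative names, not functions from the workflows.

```go
package main

import (
	"fmt"
	"time"
)

// weekKey has the same shape as `date +%Y-W%V`: the calendar year plus the
// zero-padded ISO week number. The value changes once per ISO week, on Monday.
func weekKey(t time.Time) string {
	t = t.UTC()
	_, week := t.ISOWeek()
	return fmt.Sprintf("%d-W%02d", t.Year(), week)
}

// weekKeyForDate is a convenience wrapper for a given calendar date at noon UTC.
func weekKeyForDate(year int, month time.Month, day int) string {
	return weekKey(time.Date(year, month, day, 12, 0, 0, 0, time.UTC))
}

func main() {
	// A Sunday and the following Monday fall into different ISO weeks,
	// so a cache keyed on weekKey goes cold exactly once at that boundary.
	fmt.Println(weekKeyForDate(2026, time.February, 1))
	fmt.Println(weekKeyForDate(2026, time.February, 2))
}
```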
 The brew Cellar cache requires `HOMEBREW_NO_AUTO_UPDATE=1` and `HOMEBREW_NO_INSTALL_CLEANUP=1` (set as job-level env). Without those, `brew install` would mutate the very directories that were just restored, defeating the cache.
 **Force-link after cache restore**: `actions/cache` restores `/opt/homebrew/Cellar/*` but NOT the `/opt/homebrew/bin/*` symlinks. After a cache hit, `brew install` sees the Cellar entries and decides "already installed" without re-running its link step, leaving the formulas off PATH. The Dependencies step explicitly runs `brew link --overwrite` for every cached formula afterwards to ensure the symlinks exist.
 For ccache, the workflow exports `CMAKE_ARGS=… -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache` via `$GITHUB_ENV` before running `make build-darwin-go-backend`. The Makefile in `backend/cpp/llama-cpp/` already forwards `CMAKE_ARGS` through to each variant build (`fallback`, `grpc`, `rpc-server`), so no script changes are needed. The three variants share most TUs, so ccache dedupes object files across them.
 `backend_build_darwin.yml` also has a llama-cpp-specific build-step branch that runs `make backends/llama-cpp-darwin` (the bespoke script that compiles three CMake variants and bundles dylibs via `otool`), distinct from the generic `make build-darwin-${lang}-backend` path. This was consolidated from a previously-bespoke top-level `llama-cpp-darwin` job in `backend.yml` so llama-cpp on Darwin honors the same path filter as the other 34 Darwin backends.
 ### Cache budget on Darwin
 GitHub Actions caches are limited to 10 GB per repo. Steady-state worst case: ~800 MB Go cache + ~2 GB brew Cellar + up to 2 GB ccache + ~1.5 GB × 5 python backends. If the cap is hit, prefer collapsing the per-backend Python keys into a shared `pyenv-darwin-shared-<week>` key (accepts more cross-backend churn for a smaller footprint) before reducing other caches.
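Adding up the figures above shows why the Python keys are the first candidate for collapsing. The sketch below is plain arithmetic on the quoted numbers; it treats 1 GB as 1000 MB for a rough estimate, and the function and constant names are made up for illustration.

```go
package main

import "fmt"

// capMB is the 10 GB per-repo limit, using 1 GB = 1000 MB for a rough estimate.
const capMB = 10_000

// worstCaseMB sums the steady-state worst-case figures quoted above
// for a given number of Python backends with their own cache key.
func worstCaseMB(pythonBackends int) int {
	const (
		goCacheMB   = 800   // Go modules + build cache
		brewMB      = 2_000 // brew Cellar
		ccacheMB    = 2_000 // ccache upper bound
		perPythonMB = 1_500 // one Python wheel cache
	)
	return goCacheMB + brewMB + ccacheMB + perPythonMB*pythonBackends
}

func main() {
	total := worstCaseMB(5)
	fmt.Printf("worst case: %d MB, over cap: %v\n", total, total > capMB)
}
```

With five per-backend Python caches the worst case is about 12.3 GB, above the cap, and the Python share (7.5 GB) is the largest single term.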
 ## Self-hosted runners
 `.github/backend-matrix.yml` has zero references to `arc-runner-set` or `bigger-runner` — all backends run on GHA free-tier hosted runners (`ubuntu-latest` for amd64, `ubuntu-24.04-arm` for arm64 native, `macos-14` for Darwin). The migration off self-hosted relied on the per-arch native split (no QEMU emulation) plus `setup-build-disk`'s `/mnt` relocation (~100 GB working space, enough for ROCm dev image + vLLM/torch installs).
 One residual self-hosted reference remains in `test-extra.yml` (`tests-vibevoice-cpp-grpc-transcription` uses `bigger-runner` for the 30s JFK-decode timeout headroom). That's a separate concern.
 ## Touching the cache pipeline
 When changing `image_build.yml`, `backend_build.yml`, any of the `backend/Dockerfile.*` files, `Dockerfile.base-grpc-builder`, `.docker/install-base-deps.sh`, `.docker/<backend>-compile.sh`, or `scripts/changed-backends.js`:
 1. **Don't drop `DEPS_REFRESH=...` from the build-args** without a replacement strategy (lockfiles, pinned requirements). Otherwise master will silently freeze on whichever versions were cached at the time.
 2. **Keep `(tag-suffix, platform-tag)` unique per matrix entry** — together they're the cache namespace. Two matrix entries sharing a key would clobber each other's cache.
 3. **Keep `cache-to` gated on `github.event_name != 'pull_request'`** — PRs must not write.
 4. **Keep `ignore-error=true` on `cache-to`** — quay registry hiccups must not fail builds.
 5. **Keep `provenance: false` on push-by-digest steps** — multi-registry digest divergence is the Bug We Already Fixed; reintroducing provenance attestation re-breaks the merge.
 6. **`install-base-deps.sh` is the single source of truth for base contents.** Both `Dockerfile.base-grpc-builder` (CI) and the variant Dockerfiles' `builder-fromsource` (local) bind-mount and execute it. If you add a package to one path, add it to the script — don't fork the logic into a Dockerfile RUN.
 7. **After adding a `base-images.yml` matrix variant, run the workflow on your branch before merging consumer changes** that depend on the new tag — otherwise the consumer's CI fails "image not found."
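Rule 2 is easy to check mechanically. The following Go sketch is one way such a check could look, assuming the matrix has already been parsed into a slice; the `entry` type, the function name, and the sample values are hypothetical and not taken from `scripts/changed-backends.js` or the workflows.

```go
package main

import "fmt"

// entry holds the two matrix fields that together form the cache namespace.
type entry struct {
	TagSuffix   string
	PlatformTag string
}

// duplicateKeys returns every (tag-suffix, platform-tag) pair that appears
// more than once, in first-collision order. An empty result means no two
// matrix entries would write to the same cache tag.
func duplicateKeys(matrix []entry) []entry {
	seen := make(map[entry]int, len(matrix))
	var dups []entry
	for _, e := range matrix {
		seen[e]++
		if seen[e] == 2 { // report each colliding pair once
			dups = append(dups, e)
		}
	}
	return dups
}

func main() {
	matrix := []entry{
		{TagSuffix: "-cpu-example", PlatformTag: "amd64"},
		{TagSuffix: "-cpu-example", PlatformTag: "arm64"},
		{TagSuffix: "-cpu-example", PlatformTag: "amd64"}, // collides with the first entry
	}
	fmt.Println(duplicateKeys(matrix))
}
```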
--- a/.agents/coding-style.md
+++ b/.agents/coding-style.md
@@ -1,60 +0,0 @@
 # Coding Style
 The project has the following .editorconfig:
 ```
 root = true
 [*]
 indent_style = space
 indent_size = 2
 end_of_line = lf
 charset = utf-8
 trim_trailing_whitespace = true
 insert_final_newline = true
 [*.go]
 indent_style = tab
 [Makefile]
 indent_style = tab
 [*.proto]
 indent_size = 2
 [*.py]
 indent_size = 4
 [*.js]
 indent_size = 2
 [*.yaml]
 indent_size = 2
 [*.md]
 trim_trailing_whitespace = false
 ```
 - Use comments sparingly to explain why code does something, not what it does. Comments are there to add context that would be difficult to deduce from reading the code.
 - Prefer modern Go, e.g. use `any` rather than `interface{}`.
 ## Logging
 Use `github.com/mudler/xlog` for logging; it has the same API as slog.
 ## Go tests
 All Go tests — including backend tests — must use [Ginkgo](https://onsi.github.io/ginkgo/) (v2) with Gomega matchers, not the stdlib `testing` package with `t.Run` / `t.Errorf`. A test file should register a suite with `RegisterFailHandler(Fail)` in a `TestXxx(t *testing.T)` bootstrap and use `Describe`/`Context`/`It` blocks for the actual cases. Look at any existing `*_test.go` under `core/` or `pkg/` for a template.
 Do not mix styles within a package. If you are extending tests in a package that already uses Ginkgo, keep using Ginkgo. If you find stdlib-style Go tests in the tree, treat them as tech debt to be migrated rather than as a pattern to follow.
 This is enforced by `golangci-lint` via the `forbidigo` linter (see `.golangci.yml`); calls like `t.Errorf` / `t.Fatalf` / `t.Run` / `t.Skip` / `t.Logf` are flagged. Run `make lint` locally before submitting; the same check runs in CI (`.github/workflows/lint.yml`).
 ## Documentation
 The project documentation is located in `docs/content`. When adding new features or changing existing functionality, it is crucial to update the documentation to reflect these changes. This helps users understand how to use the new capabilities and ensures the documentation stays relevant.
 - **Feature Documentation**: If you add a new feature (like a new backend or API endpoint), create a new markdown file in `docs/content/features/` explaining what it is, how to configure it, and how to use it.
 - **Configuration**: If you modify configuration options, update the relevant sections in `docs/content/`.
 - **Examples**: Provide concrete examples (such as YAML configuration blocks) to help users get started quickly.
 - **Shortcodes**: Use `{{% notice note %}}`, `{{% notice tip %}}`, or `{{% notice warning %}}` for callout boxes. Do **not** use `{{% alert %}}` — that shortcode does not exist in this project's Hugo theme and will break the docs build.
--- a/.agents/debugging-backends.md
+++ b/.agents/debugging-backends.md
@@ -1,141 +0,0 @@
 # Debugging and Rebuilding Backends
 When a backend fails at runtime (e.g. a gRPC method error, a Python import error, or a dependency conflict), use this guide to diagnose, fix, and rebuild.
 ## Architecture Overview
 - **Source directory**: `backend/python/<name>/` (or `backend/go/<name>/`, `backend/cpp/<name>/`)
 - **Installed directory**: `backends/<name>/` — this is what LocalAI actually runs. It is populated by `make backends/<name>` which builds a Docker image, exports it, and installs it via `local-ai backends install`.
 - **Virtual environment**: `backends/<name>/venv/` — the installed Python venv (for Python backends). The Python binary is at `backends/<name>/venv/bin/python`.
 Editing files in `backend/python/<name>/` does **not** affect the running backend until you rebuild with `make backends/<name>`.
 ## Diagnosing Failures
 ### 1. Check the logs
 Backend gRPC processes log to LocalAI's stdout/stderr. Look for lines tagged with the backend's model ID:
 ```
 GRPC stderr id="trl-finetune-127.0.0.1:37335" line="..."
 ```
 Common error patterns:
 - **"Method not implemented"** — the backend is missing a gRPC method that the Go side calls. The model loader (`pkg/model/initializers.go`) always calls `LoadModel` after `Health`; fine-tuning backends must implement it even as a no-op stub.
 - **Python import errors / `AttributeError`** — usually a dependency version mismatch (e.g. `pyarrow` removing `PyExtensionType`).
 - **"failed to load backend"** — the gRPC process crashed or never started. Check stderr lines for the traceback.
 ### 2. Test the Python environment directly
 You can run the installed venv's Python to check imports without starting the full server:
 ```bash
 backends/<name>/venv/bin/python -c "import datasets; print(datasets.__version__)"
 ```
 If `pip` is missing from the venv, bootstrap it:
 ```bash
 backends/<name>/venv/bin/python -m ensurepip
 ```
 Then use `backends/<name>/venv/bin/python -m pip install ...` to test fixes in the installed venv before committing them to the source requirements.
 ### 3. Check upstream dependency constraints
 When you hit a dependency conflict, check what the main library expects. For example, TRL's upstream `requirements.txt`:
 ```
 https://github.com/huggingface/trl/blob/main/requirements.txt
 ```
 Pin minimum versions in the backend's requirements files to match upstream.
 ## Common Fixes
 ### Missing gRPC methods
 If the Go side calls a method the backend doesn't implement (e.g. `LoadModel`), add a no-op stub in `backend.py`:
 ```python
 def LoadModel(self, request, context):
     """No-op: actual loading happens elsewhere."""
     return backend_pb2.Result(success=True, message="OK")
 ```
 The gRPC contract requires `LoadModel` to succeed for the model loader to return a usable client, even if the backend doesn't need upfront model loading.
 ### Dependency version conflicts
 Python backends often break when a transitive dependency releases a breaking change (e.g. `pyarrow` removing `PyExtensionType`). Steps:
 1. Identify the broken import in the logs
 2. Test in the installed venv: `backends/<name>/venv/bin/python -c "import <module>"`
 3. Check upstream requirements for version constraints
 4. Update **all** requirements files in `backend/python/<name>/`:
   - `requirements.txt` — base deps (grpcio, protobuf)
   - `requirements-cpu.txt` — CPU-specific (includes PyTorch CPU index)
   - `requirements-cublas12.txt` — CUDA 12
   - `requirements-cublas13.txt` — CUDA 13
 5. Rebuild: `make backends/<name>`
 ### PyTorch index conflicts (uv resolver)
 The Docker build uses `uv` for pip installs. When `--extra-index-url` points to the PyTorch wheel index, `uv` may refuse to fetch packages like `requests` from PyPI if it finds a different version on the PyTorch index first. Fix this by adding `--index-strategy=unsafe-first-match` to `install.sh`:
 ```bash
 EXTRA_PIP_INSTALL_FLAGS+=" --upgrade --index-strategy=unsafe-first-match"
 installRequirements
 ```
 Most Python backends already do this — check `backend/python/transformers/install.sh` or similar for reference.
 ## Rebuilding
 ### Rebuild a single backend
 ```bash
 make backends/<name>
 ```
 This runs the Docker build (`Dockerfile.python`), exports the image to `backend-images/<name>.tar`, and installs it into `backends/<name>/`. It also rebuilds the `local-ai` Go binary (without extra tags).
 **Important**: If you were previously running with `GO_TAGS=auth`, the `make backends/<name>` step will overwrite your binary without that tag. Rebuild the Go binary afterward:
 ```bash
 GO_TAGS=auth make build
 ```
 ### Rebuild and restart
 After rebuilding a backend, you must restart LocalAI for it to pick up the new backend files. The backend gRPC process is spawned on demand when the model is first loaded.
 ```bash
 # Kill existing process
 kill <pid>
 # Restart
 ./local-ai run --debug [your flags]
 ```
 ### Quick iteration (skip Docker rebuild)
 For fast iteration on a Python backend's `backend.py` without a full Docker rebuild, you can edit the installed copy directly:
 ```bash
 # Edit the installed copy
 vim backends/<name>/backend.py
 # Restart LocalAI to respawn the gRPC process
 ```
 This is useful for testing but **does not persist** — the next `make backends/<name>` will overwrite it. Always commit fixes to the source in `backend/python/<name>/`.
 ## Verification
 After fixing and rebuilding:
 1. Start LocalAI and confirm the backend registers: look for `Registering backend name="<name>"` in the logs
 2. Trigger the operation that failed (e.g. start a fine-tuning job)
 3. Watch the GRPC stderr/stdout lines for the backend's model ID
 4. Confirm no errors in the traceback
--- a/.agents/ds4-backend.md
+++ b/.agents/ds4-backend.md
@@ -1,84 +0,0 @@
 # Working on the ds4 Backend
 `antirez/ds4` is a single-model inference engine for DeepSeek V4 Flash.
 LocalAI wraps the engine's C API (`ds4/ds4.h`) with a fresh C++ gRPC server at
 `backend/cpp/ds4/` - NOT a fork of llama-cpp's grpc-server.cpp.
 ## Pin
 `backend/cpp/ds4/Makefile` pins `DS4_VERSION?=<sha>` at the top. The `ds4`
 target in the Makefile clones `antirez/ds4` at that commit (mirroring the
 llama-cpp / ik-llama-cpp / turboquant pattern). The bump-deps bot
 (`.github/workflows/bump_deps.yaml`) finds this pin via grep and opens a
 daily PR to update it. To bump manually: edit the `DS4_VERSION?=` line,
 then `make purge && make` (or rely on CI's clean build).
 ## Wire shape
 | RPC | Implementation |
 |---|---|
 | Health, Free, Status | Trivial; no engine dependency for Health |
 | LoadModel | `ds4_engine_open` + `ds4_session_create`; backend is compile-time (DS4_NO_GPU → CPU, __APPLE__ → Metal, otherwise CUDA) |
 | TokenizeString | `ds4_tokenize_text` |
 | Predict | `ds4_engine_generate_argmax` + `DsmlParser` → one ChatDelta with content / reasoning_content / tool_calls[] |
 | PredictStream | Same, per-token ChatDelta writes |
 ## DSML
 ds4 emits tool calls as literal text markers (`<｜DSML｜tool_calls>` etc.) -
 NOT special tokens. `dsml_parser.{h,cpp}` is our streaming state machine that
 classifies token bytes into CONTENT / REASONING / TOOL_START / TOOL_ARGS / TOOL_END
 events. `dsml_renderer.{h,cpp}` does the prompt direction: turns
 OpenAI tool_calls + role=tool messages back into DSML for the next turn.
 ## Thinking modes
 `PredictOptions.Metadata["enable_thinking"]` gates thinking on/off (default ON).
 `["reasoning_effort"] == "max" | "xhigh"` selects `DS4_THINK_MAX`; anything else
 maps to `DS4_THINK_HIGH`. We pass the chosen mode to `ds4_chat_append_assistant_prefix`.
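The selection logic is small enough to sketch. The backend itself is C++; the Go below only illustrates the decision table. It assumes the metadata arrives as a string map and that the literal value `"false"` is what disables thinking, neither of which is confirmed above, and `thinkOff` is a hypothetical name for the disabled state.

```go
package main

import "fmt"

type thinkMode string

const (
	thinkOff  thinkMode = "off" // hypothetical: the disabled state is not named in this document
	thinkHigh thinkMode = "DS4_THINK_HIGH"
	thinkMax  thinkMode = "DS4_THINK_MAX"
)

// selectThinkMode applies the rules described above: thinking is on unless
// explicitly disabled, and only "max" or "xhigh" select the MAX mode.
func selectThinkMode(metadata map[string]string) thinkMode {
	// Assumption: only the literal "false" disables thinking; a missing key keeps the default (ON).
	if metadata["enable_thinking"] == "false" {
		return thinkOff
	}
	switch metadata["reasoning_effort"] {
	case "max", "xhigh":
		return thinkMax
	default:
		return thinkHigh
	}
}

func main() {
	fmt.Println(selectThinkMode(nil))
	fmt.Println(selectThinkMode(map[string]string{"reasoning_effort": "xhigh"}))
	fmt.Println(selectThinkMode(map[string]string{"enable_thinking": "false"}))
}
```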
 ## Disk KV cache
 `kv_cache.{h,cpp}` implements an SHA1-keyed file cache using ds4's public
 `ds4_session_save_payload` / `ds4_session_load_payload` API. Enable per request
 via `ModelOptions.Options[] = "kv_cache_dir:/some/path"`. Format is **our own** -
 NOT bit-compatible with ds4-server's KVC files (interop is a follow-up plan).
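As a rough illustration of the two moving parts (reading the `kv_cache_dir:` option and deriving a SHA-1 file name), consider the Go sketch below. The real code is C++ in `kv_cache.{h,cpp}`; what exactly gets hashed and how files are named is not stated here, so the hashed prefix bytes and the `.kv` extension are assumptions made for the example.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"path/filepath"
	"strings"
)

// kvCacheDir scans "key:value" option strings for kv_cache_dir.
func kvCacheDir(options []string) (string, bool) {
	for _, opt := range options {
		key, value, ok := strings.Cut(opt, ":")
		if ok && key == "kv_cache_dir" && value != "" {
			return value, true
		}
	}
	return "", false
}

// cachePath derives a file path from the SHA-1 of some prefix bytes.
// Hypothetical layout: <dir>/<sha1-hex>.kv
func cachePath(dir string, prefix []byte) string {
	sum := sha1.Sum(prefix)
	return filepath.Join(dir, hex.EncodeToString(sum[:])+".kv")
}

func main() {
	dir, ok := kvCacheDir([]string{"kv_cache_dir:/some/path"})
	if !ok {
		fmt.Println("disk KV cache disabled")
		return
	}
	fmt.Println(cachePath(dir, []byte("example prompt prefix")))
}
```

Keying on a content hash means identical prefixes map to the same file, so a later request with the same prefix can find the saved payload without any index.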
 ## Build matrix
 | Build | Where | Notes |
 |---|---|---|
 | `cpu-ds4` (amd64 + arm64) | Linux GHA | ds4 considers CPU debug-only; useful only for wiring tests |
 | `cuda13-ds4` (amd64 + arm64) | Linux GHA + DGX Spark validation | Primary production path on Linux |
 | `ds4-darwin` (arm64) | macOS GHA runners | Metal; uses `scripts/build/ds4-darwin.sh` like llama-cpp-darwin |
 cuda12 is intentionally omitted. ROCm / Vulkan / SYCL are not applicable.
 ## Hardware-gated validation
 `tests/e2e-backends/backend_test.go` in `BACKEND_BINARY` mode:
 ```
 BACKEND_BINARY=$(pwd)/backend/cpp/ds4/package/run.sh \
 BACKEND_TEST_MODEL_FILE=/path/to/ds4flash.gguf \
 BACKEND_TEST_CAPS=health,load,predict,stream,tools \
 BACKEND_TEST_TOOL_PROMPT="What's the weather in Paris?" \
 go test -count=1 -timeout=30m -v ./tests/e2e-backends/...
 ```
 CI does not load the model; the suite is opt-in via env vars.
 ## Importer
 `core/gallery/importers/ds4.go` (`DS4Importer`) auto-detects ds4 weights by
 matching the `antirez/deepseek-v4-gguf` repo URI or the
 `DeepSeek-V4-Flash-*.gguf` filename pattern. **Registered BEFORE
 `LlamaCPPImporter`** in `defaultImporters` - both match `.gguf` but ds4 is more
 specific, and first-match-wins. The importer emits `backend: ds4`, uses
 `ds4flash.gguf` as the local filename (matches ds4's own CLI default), and
 disables the Go-side automatic tool-parsing fallback (the C++ backend emits
 ChatDelta.tool_calls natively via `DsmlParser`).
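The ordering requirement is easiest to see in a reduced model. The Go sketch below is not the code in `core/gallery/importers/`; the `importer` struct, the matcher functions, and the `llama-cpp` backend string are simplified stand-ins, and only the repo URI and filename pattern come from the description above.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// importer is a hypothetical, reduced stand-in for the real importer types.
type importer struct {
	backend string
	match   func(uri string) bool
}

// matchDS4 accepts the known repo URI or the DeepSeek-V4-Flash filename pattern.
func matchDS4(uri string) bool {
	if strings.Contains(uri, "antirez/deepseek-v4-gguf") {
		return true
	}
	ok, _ := path.Match("DeepSeek-V4-Flash-*.gguf", path.Base(uri))
	return ok
}

// matchGGUF accepts any .gguf file, which is why it must come after ds4.
func matchGGUF(uri string) bool {
	return strings.HasSuffix(strings.ToLower(uri), ".gguf")
}

// pick returns the backend of the first importer that matches (first-match-wins).
func pick(importers []importer, uri string) string {
	for _, imp := range importers {
		if imp.match(uri) {
			return imp.backend
		}
	}
	return ""
}

var (
	ds4First   = []importer{{"ds4", matchDS4}, {"llama-cpp", matchGGUF}}
	llamaFirst = []importer{{"llama-cpp", matchGGUF}, {"ds4", matchDS4}}
)

func main() {
	uri := "https://example.com/models/DeepSeek-V4-Flash-Q4.gguf"
	fmt.Println(pick(ds4First, uri))   // specific importer registered first
	fmt.Println(pick(llamaFirst, uri)) // generic importer shadows the specific one
}
```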
 ds4 is also listed in `core/http/endpoints/localai/backend.go`'s pref-only
 slice so the `/import-model` UI surfaces it as a manual choice for users who
 want to force the backend on a non-canonical URI.
--- a/.agents/llama-cpp-backend.md
+++ b/.agents/llama-cpp-backend.md
@@ -1,83 +0,0 @@
 # llama.cpp Backend
 The llama.cpp backend (`backend/cpp/llama-cpp/grpc-server.cpp`) is a gRPC adaptation of the upstream HTTP server (`llama.cpp/tools/server/server.cpp`). It uses the same underlying server infrastructure from `llama.cpp/tools/server/server-context.cpp`.
 ## Building and Testing
 - Test llama.cpp backend compilation: `make backends/llama-cpp`
 - The backend is built as part of the main build process
 - Check `backend/cpp/llama-cpp/Makefile` for build configuration
 ## Architecture
 - **grpc-server.cpp**: gRPC server implementation, adapts HTTP server patterns to gRPC
 - Uses shared server infrastructure: `server-context.cpp`, `server-task.cpp`, `server-queue.cpp`, `server-common.cpp`
 - The gRPC server mirrors the HTTP server's functionality but uses gRPC instead of HTTP
 ## Common Issues When Updating llama.cpp
 When fixing compilation errors after upstream changes:
 1. Check how `server.cpp` (HTTP server) handles the same change
 2. Look for new public APIs or getter methods
 3. Store copies of needed data instead of accessing private members
 4. Update function calls to match new signatures
 5. Test with `make backends/llama-cpp`
 ## Key Differences from HTTP Server
 - gRPC uses `BackendServiceImpl` class with gRPC service methods
 - HTTP server uses `server_routes` with HTTP handlers
 - Both use the same `server_context` and task queue infrastructure
 - gRPC methods: `LoadModel`, `Predict`, `PredictStream`, `Embedding`, `Rerank`, `TokenizeString`, `GetMetrics`, `Health`
 ## Tool Call Parsing Maintenance
 When working on JSON/XML tool call parsing functionality, always check llama.cpp for reference implementation and updates:
 ### Checking for XML Parsing Changes
 1. **Review XML Format Definitions**: Check `llama.cpp/common/chat-parser-xml-toolcall.h` for `xml_tool_call_format` struct changes
 2. **Review Parsing Logic**: Check `llama.cpp/common/chat-parser-xml-toolcall.cpp` for parsing algorithm updates
 3. **Review Format Presets**: Check `llama.cpp/common/chat-parser.cpp` for new XML format presets (search for `xml_tool_call_format form`)
 4. **Review Model Lists**: Check `llama.cpp/common/chat.h` for `COMMON_CHAT_FORMAT_*` enum values that use XML parsing:
   - `COMMON_CHAT_FORMAT_GLM_4_5`
   - `COMMON_CHAT_FORMAT_MINIMAX_M2`
   - `COMMON_CHAT_FORMAT_KIMI_K2`
   - `COMMON_CHAT_FORMAT_QWEN3_CODER_XML`
   - `COMMON_CHAT_FORMAT_APRIEL_1_5`
   - `COMMON_CHAT_FORMAT_XIAOMI_MIMO`
   - Any new formats added
 ### Model Configuration Options
 Always check `llama.cpp` for new model configuration options that should be supported in LocalAI:
 1. **Check Server Context**: Review `llama.cpp/tools/server/server-context.cpp` for new parameters
 2. **Check Chat Params**: Review `llama.cpp/common/chat.h` for `common_chat_params` struct changes
 3. **Check Server Options**: Review `llama.cpp/tools/server/server.cpp` for command-line argument changes
 4. **Examples of options to check**:
   - `ctx_shift` - Context shifting support
   - `parallel_tool_calls` - Parallel tool calling
   - `reasoning_format` - Reasoning format options
   - Any new flags or parameters
 ### Speculative Decoding Types
 The `spec_type` option in `grpc-server.cpp` delegates to upstream's `common_speculative_types_from_names()`, so new speculative types added to the `common_speculative_type_from_name` map in `common/speculative.cpp` are picked up automatically with no code changes - only docs need an entry in `docs/content/advanced/model-configuration.md`. Current values: `none`, `draft-simple`, `draft-eagle3`, `draft-mtp`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, `ngram-cache`.
 `draft-mtp` (Multi-Token Prediction, [ggml-org/llama.cpp#22673](https://github.com/ggml-org/llama.cpp/pull/22673)) does not need a separate draft GGUF: when `spec_type` includes `draft-mtp` and `draftmodel` is empty, the upstream server creates an MTP context off the target model itself. LocalAI's gRPC layer needs no changes for this — it works through the existing `params.speculative.types` plumbing and the derived `cparams.n_rs_seq = params.speculative.need_n_rs_seq()` in `common_context_params_to_llama`.
 ### Implementation Guidelines
 1. **Feature Parity**: Always aim for feature parity with llama.cpp's implementation
 2. **Test Coverage**: Add tests for new features matching llama.cpp's behavior
 3. **Documentation**: Update relevant documentation when adding new formats or options
 4. **Backward Compatibility**: Ensure changes don't break existing functionality
 ### Files to Monitor
 - `llama.cpp/common/chat-parser-xml-toolcall.h` - Format definitions
 - `llama.cpp/common/chat-parser-xml-toolcall.cpp` - Parsing logic
 - `llama.cpp/common/chat-parser.cpp` - Format presets and model-specific handlers
 - `llama.cpp/common/chat.h` - Format enums and parameter structures
 - `llama.cpp/tools/server/server-context.cpp` - Server configuration options
--- a/.agents/localai-assistant-mcp.md
+++ b/.agents/localai-assistant-mcp.md
@@ -1,97 +0,0 @@
 # LocalAI Assistant — admin MCP server
 This document is the contract for **anyone** (human or AI agent) touching LocalAI's admin REST surface, the in-process MCP server that wraps it, or the embedded skill prompts that teach the assistant how to use it. Read this before adding/removing/renaming admin endpoints, MCP tools, or skill recipes.
 ## What this feature is
 `pkg/mcp/localaitools/` is a public Go package that exposes LocalAI's admin/management surface as an MCP server. It is used in two ways:
 1. **In-process**: when an admin opens a chat with `metadata.localai_assistant=true`, the chat handler injects the in-memory MCP server (paired `net.Pipe()` transport, no HTTP loopback) so the LLM can install models, manage backends and edit configs by chatting.
 2. **Standalone**: the `local-ai mcp-server --target=…` subcommand serves the same MCP server over stdio, talking HTTP to a remote LocalAI instance.
 The two modes share **all** tool definitions and skill prompts. They differ only in their `LocalAIClient` implementation (`inproc/` calls services directly; `httpapi/` calls REST).
 ## The three things you must keep in sync
 When you change LocalAI's admin surface, three layers must stay aligned:
 1. **REST endpoint** in `core/http/endpoints/localai/*.go`.
 2. **MCP tool registration** in `pkg/mcp/localaitools/tools_*.go`, plus a method on `LocalAIClient` (in `client.go`) and implementations in both `inproc/client.go` **and** `httpapi/client.go`.
 3. **Skill prompt** under `pkg/mcp/localaitools/prompts/skills/*.md` — the markdown that teaches the LLM how to use the new tool. If the new tool fits an existing recipe, update that recipe; otherwise add a new file.
 If you ship a REST endpoint without (2) and (3), conversational admins won't see the feature.
 ## Checklist for adding a new admin endpoint
 - [ ] REST endpoint exists in `core/http/endpoints/localai/*.go` and is gated by `auth.RequireAdmin()` in `core/http/routes/localai.go`.
 - [ ] `LocalAIClient` interface in `pkg/mcp/localaitools/client.go` has a method covering the new operation.
 - [ ] DTOs added/updated in `pkg/mcp/localaitools/dto.go` (JSON-tagged; never expose raw service types).
 - [ ] `inproc/client.go` implements the new method by calling the service directly (not via HTTP loopback).
 - [ ] `httpapi/client.go` implements the new method by calling the REST endpoint.
 - [ ] Tool registration added in the appropriate `pkg/mcp/localaitools/tools_*.go`. Mutating tools must reference safety rule 1 in the description.
 - [ ] If the tool is mutating, ensure `Options{DisableMutating: true}` skips it (mirror the pattern in `tools_models.go`).
 - [ ] Skill prompt added or updated under `pkg/mcp/localaitools/prompts/skills/`. The prompt must instruct the LLM when to call the tool, what to ask the user first, and what to do on error.
 - [ ] Tests:
   - `pkg/mcp/localaitools/server_test.go` adds the tool name to `expectedFullCatalog` and `expectedReadOnlyCatalog` (if read-only).
   - Tool dispatch is added to `TestEachToolDispatchesToClient`.
   - `pkg/mcp/localaitools/httpapi/client_test.go` covers the new HTTP path.
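The `DisableMutating` item in the checklist can be pictured as follows. This is a simplified sketch, not the registration code in `tools_models.go`: the `toolSpec` type and `catalog` function are hypothetical, and classifying `gallery_search` as read-only and `install_model` as mutating is an assumption based on their names.

```go
package main

import "fmt"

// Tool name constants; in the real package these live in tools.go.
const (
	ToolGallerySearch = "gallery_search"
	ToolInstallModel  = "install_model"
)

// Options mirrors the field named in the checklist.
type Options struct{ DisableMutating bool }

// toolSpec is a hypothetical registration record.
type toolSpec struct {
	Name     string
	Mutating bool
}

var allTools = []toolSpec{
	{Name: ToolGallerySearch, Mutating: false},
	{Name: ToolInstallModel, Mutating: true},
}

// catalog lists the tool names that would be registered for the given options;
// mutating tools are skipped when DisableMutating is set.
func catalog(opts Options) []string {
	var names []string
	for _, t := range allTools {
		if t.Mutating && opts.DisableMutating {
			continue
		}
		names = append(names, t.Name)
	}
	return names
}

func main() {
	fmt.Println("full:     ", catalog(Options{}))
	fmt.Println("read-only:", catalog(Options{DisableMutating: true}))
}
```

The two outputs correspond to what the tests call `expectedFullCatalog` and `expectedReadOnlyCatalog`, which is why a new tool has to be added to one or both lists.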
 ## Adding a new skill recipe (no new tool)
 Sometimes you want to teach the LLM a new pattern that uses existing tools. Drop a markdown file under `pkg/mcp/localaitools/prompts/skills/<verb>_<noun>.md`. The file is automatically embedded by `//go:embed` and assembled into the system prompt in lexicographic order. No Go changes needed.
 Conventions:
 - Filename: `<verb>_<noun>.md` (e.g. `install_chat_model.md`, `upgrade_backend.md`).
 - First line: `# Skill: <Title Case description>`.
 - Number the steps. Reference exact tool names in backticks.
 - If the skill mutates state, remind the LLM to confirm with the user.
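Because assembly is lexicographic, the filename decides where a skill lands in the system prompt. The Go sketch below shows that ordering with an in-memory filesystem standing in for the embedded one; `assembleSkills` and `demoFS` are illustrative names, and the real loader in `prompts.go` may differ in detail.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"strings"
	"testing/fstest"
)

// assembleSkills concatenates every .md file in dir. fs.ReadDir returns
// entries sorted by filename, which gives the lexicographic order.
func assembleSkills(fsys fs.FS, dir string) (string, error) {
	entries, err := fs.ReadDir(fsys, dir)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	for _, e := range entries {
		if e.IsDir() || !strings.HasSuffix(e.Name(), ".md") {
			continue
		}
		data, err := fs.ReadFile(fsys, path.Join(dir, e.Name()))
		if err != nil {
			return "", err
		}
		b.Write(data)
		b.WriteString("\n")
	}
	return b.String(), nil
}

// demoFS stands in for the embedded filesystem; the file contents are made up.
func demoFS() fs.FS {
	return fstest.MapFS{
		"skills/upgrade_backend.md":    {Data: []byte("# Skill: Upgrade A Backend")},
		"skills/install_chat_model.md": {Data: []byte("# Skill: Install A Chat Model")},
	}
}

func main() {
	out, err := assembleSkills(demoFS(), "skills")
	if err != nil {
		fmt.Println("assemble failed:", err)
		return
	}
	fmt.Print(out)
}
```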
 ## Code conventions
 These rules guard against the magic-literal drift that surfaced in the first audit. Do not re-introduce bare strings.
 - **Tool names** always come from the `Tool*` constants in `pkg/mcp/localaitools/tools.go`. Tool registrations, the test catalog (`server_test.go`'s `expectedFullCatalog` / `expectedReadOnlyCatalog`), and dispatch tables reference the constants. The embedded skill prompts under `prompts/` keep bare strings — that's the one allowed exception, and `TestPromptsContainSafetyAnchors` enforces alignment.
 - **Toggle/pin actions** use the `modeladmin.Action` type (`pkg/mcp/localaitools` and `core/services/modeladmin`). Use `ActionEnable`/`ActionDisable`/`ActionPin`/`ActionUnpin`; never bare `"enable"`/`"pin"` strings.
 - **Capability tags** for `list_installed_models` use the `localaitools.Capability` type (`capability.go`). The `LocalAIClient.ListInstalledModels` interface takes a typed `Capability`, and the `inproc` switch only accepts canonical values (`"embed"`/`"embedding"` are not aliases — only `CapabilityEmbeddings`).
 - **HTTP error checks** in `httpapi.Client` use `errors.Is(err, ErrHTTPNotFound)`, not substring matches on `err.Error()`. The typed `*HTTPError` carries `StatusCode` and `Body`; add new sentinel errors as needed rather than re-introducing string matching.
 - **Channel sends** to `GalleryService.ModelGalleryChannel` / `BackendGalleryChannel` from inproc clients MUST select on `ctx.Done()` so a cancelled chat completion releases the goroutine. See `inproc.sendModelOp` / `sendBackendOp`.
 - **Disk writes** of model config YAML go through `modeladmin.writeFileAtomic` (temp file + `os.Rename`). `os.WriteFile` truncates on crash and corrupts the model.
 - **MCP server lifecycle**: every initialised holder MUST register `Close()` with `signals.RegisterGracefulTerminationHandler`. The standalone `mcp-server` CLI uses `signal.NotifyContext` to honour SIGINT/SIGTERM.
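Three of these conventions (sentinel error checks, context-aware channel sends, atomic writes) are sketched below in plain Go. The types and functions are simplified stand-ins written for this note, not the code in `httpapi`, `inproc`, or `modeladmin`; `galleryOp`, `isNotFound`, and `sendAfterCancel` in particular are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// ErrHTTPNotFound is the sentinel callers test for with errors.Is.
var ErrHTTPNotFound = errors.New("not found")

// HTTPError is a reduced stand-in for the typed error described above.
type HTTPError struct {
	StatusCode int
	Body       string
}

func (e *HTTPError) Error() string { return fmt.Sprintf("http %d: %s", e.StatusCode, e.Body) }

// Is makes errors.Is(err, ErrHTTPNotFound) true for any 404, even when wrapped.
func (e *HTTPError) Is(target error) bool {
	return target == ErrHTTPNotFound && e.StatusCode == 404
}

// isNotFound is the check to use instead of matching on err.Error().
func isNotFound(err error) bool { return errors.Is(err, ErrHTTPNotFound) }

// galleryOp is a hypothetical operation type sent to a gallery channel.
type galleryOp struct{ Name string }

// sendOp delivers op unless ctx is cancelled first, so a cancelled request
// never leaves the calling goroutine blocked on the channel.
func sendOp(ctx context.Context, ch chan<- galleryOp, op galleryOp) error {
	select {
	case ch <- op:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// sendAfterCancel shows the cancelled path: nobody reads ch, yet sendOp returns.
func sendAfterCancel() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return sendOp(ctx, make(chan galleryOp), galleryOp{Name: "install"})
}

// writeFileAtomic writes to a temp file next to path and renames it into
// place, so readers see either the old file or the complete new one.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless once the rename has succeeded
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(perm); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

func main() {
	wrapped := fmt.Errorf("get model: %w", &HTTPError{StatusCode: 404, Body: "no such model"})
	fmt.Println("not found:", isNotFound(wrapped))
	fmt.Println("cancelled send:", sendAfterCancel())

	target := filepath.Join(os.TempDir(), "model-example.yaml")
	if err := writeFileAtomic(target, []byte("name: example\n"), 0o644); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	defer os.Remove(target)
	fmt.Println("wrote", target)
}
```

The temp file is created in the same directory as the target so that the final rename stays on one filesystem, which is what makes the replace step atomic on POSIX systems.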
 ## File map (where to look)
 ```
 pkg/mcp/localaitools/
  client.go              # LocalAIClient interface + DTO registry
  dto.go                 # JSON-tagged DTOs shared by both client impls
  server.go              # NewServer(client, opts) — registers tools
  tools.go               # Tool* name constants (single source of truth)
  capability.go          # Capability type + constants
  tools_models.go        # gallery_search, install_model, import_model_uri, ...
  tools_backends.go
  tools_config.go
  tools_system.go
  tools_state.go
  prompts.go             # //go:embed loader + SystemPrompt(opts)
  prompts/00_role.md
  prompts/10_safety.md   # SAFETY RULES — change with care
  prompts/20_tools.md    # curated tool catalog with one-liners
  prompts/skills/*.md
  inproc/client.go       # in-process LocalAIClient (services-direct)
  httpapi/client.go      # REST LocalAIClient (for standalone CLI / remote)
 core/http/endpoints/mcp/
  localai_assistant.go   # process-wide holder + LocalToolExecutor
 core/cli/mcp_server.go   # local-ai mcp-server subcommand
 ```
 ## Why two clients
 The in-process MCP server runs inside the same LocalAI binary that serves chat. Going over HTTP loopback would (a) require minting a synthetic admin API key for the server to authenticate against itself, (b) double-marshal every tool dispatch, and (c) lose access to in-process channels (e.g. `GalleryService.ModelGalleryChannel` for streaming install progress). So in-process uses `inproc.Client`. The standalone stdio CLI talks to a *remote* LocalAI; HTTP is the only option, so it uses `httpapi.Client`. Both implement the same `LocalAIClient` interface, and the parity test in `pkg/mcp/localaitools/parity_test.go` (when present) keeps their output equivalent.
 ## Why prompt-enforced confirmation, not code gates
 The user chose KISS. Every mutating tool has a safety rule (`prompts/10_safety.md` rule 1) that requires the LLM to summarise the action and wait for explicit user confirmation before calling it. There is no `plan_*`/`apply_*` two-step in code. If you add a mutating tool, do **not** add per-tool confirmation logic in Go — instead, list the new tool name in `prompts/10_safety.md` so the LLM knows it falls under the confirmation rule.
 ## Distributed mode
 The in-process MCP server runs only on the head node (where the chat handler runs). `inproc.Client` wraps services that are already distributed-aware (`GalleryService` coordinates with workers; `ListNodes` reads the NATS-populated registry). There is no NATS routing of MCP tools: the admin surface lives on the head, period.
--- a/.agents/sglang-backend.md
+++ b/.agents/sglang-backend.md
@@ -1,62 +0,0 @@
 # Working on the SGLang Backend
 The SGLang backend lives at `backend/python/sglang/backend.py` (async gRPC). It wraps SGLang's `Engine` (`sglang.srt.entrypoints.engine.Engine`), translating LocalAI's gRPC `PredictOptions` into SGLang sampling params and the engine's outputs into `Reply.chat_deltas`. Structurally it mirrors `backend/python/vllm/backend.py`; keep the two shaped the same so changes in one have an obvious analog in the other.
 ## `engine_args` is the universal escape hatch
 A small fixed set of fields on `ModelOptions` is mapped to typed SGLang kwargs in `LoadModel` (model, quantization, load_format, gpu_memory_utilization → mem_fraction_static, trust_remote_code, enforce_eager → disable_cuda_graph, tensor_parallel_size → tp_size, max_model_len → context_length, dtype). **Everything else** flows through the `engine_args:` YAML map.
 Validation happens in `_apply_engine_args`. Keys are checked against `dataclasses.fields(ServerArgs)` (`sglang.srt.server_args.ServerArgs` is a flat `@dataclass` with ~380 fields). Unknown keys raise `ValueError` at LoadModel time with a `difflib.get_close_matches` suggestion — same shape as the vLLM backend.
 **Precedence:** typed `ModelOptions` fields populate `engine_kwargs` first, then `engine_args` overrides them. So a YAML that sets both `gpu_memory_utilization: 0.9` and `engine_args.mem_fraction_static: 0.5` ends up at `0.5`. Document this when answering "why didn't my YAML field stick?".
 **ServerArgs is flat.** Unlike vLLM, where speculative decoding is nested under `engine_args.speculative_config: {...}`, SGLang exposes flat top-level fields: `speculative_algorithm`, `speculative_draft_model_path`, `speculative_num_steps`, `speculative_eagle_topk`, `speculative_num_draft_tokens`, `speculative_dflash_block_size`, etc. There is no `speculative_config:` dict. Same goes for compilation, kv-transfer, attention — all flat.
 The canonical reference is `python/sglang/srt/server_args.py:ServerArgs` (line ~304). When SGLang adds new flags, no LocalAI code change is needed — they're automatically available via `engine_args:`. The validator picks them up because it introspects the live dataclass.
 ## Speculative decoding cheatsheet
 `--speculative-algorithm` accepts `EAGLE`, `EAGLE3`, `NEXTN`, `STANDALONE`, `NGRAM`, `DFLASH`. `NEXTN` is silently rewritten to `EAGLE` in `ServerArgs.__post_init__` (`server_args.py:3286-3287`). MTP (Multi-Token Prediction) is the same EAGLE path with `num_steps=1, eagle_topk=1, num_draft_tokens=2` against a target whose architecture has multi-token heads (e.g. MiMo-7B-RL, DeepSeek-V3-MTP).
 | Algorithm | Drafter requirement | Gallery demo target | Gallery demo drafter |
 |-----------|--------------------|---------------------|----------------------|
 | `NEXTN` / `EAGLE` (MTP) | Assistant drafter or built-in heads | google/gemma-4-E2B-it, google/gemma-4-E4B-it | google/gemma-4-E2B-it-assistant, google/gemma-4-E4B-it-assistant |
 | `EAGLE3` | EAGLE3 draft head | (no gallery entry yet) | e.g. jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B |
 | `DFLASH` | Block-diffusion drafter | (no gallery entry yet) | e.g. z-lab/Qwen3-4B-DFlash-b16 |
 | `STANDALONE` | Smaller LLM as drafter | (no gallery entry yet) | any smaller chat-tuned LLM in the same family |
 | `NGRAM` | None — uses prefix history | (no gallery entry yet) | n/a |
 The Gemma 4 demos use `mem_fraction_static: 0.85` (cookbook default) and the cookbook's `num_steps=5, num_draft_tokens=6, eagle_topk=1` parameters. Other algorithms are reachable from any user YAML via `engine_args:` but don't have shipped demos yet — that's a deliberate gallery scope choice, not a backend limitation.
 Gemma 4 support requires sglang built from a commit that includes [PR #21952](https://github.com/sgl-project/sglang/pull/21952). LocalAI's pinned release for cublas12 / cublas13 includes it. The `l4t13` (JetPack 7 / sbsa cu130) build floors at `sglang>=0.5.0` because the `pypi.jetson-ai-lab.io` mirror still ships only `0.5.1.post2` as of 2026-05-06 — Gemma 4 / MTP recipes are therefore not available on l4t13 until that mirror catches up. `backend.py` keeps backward compat with the 0.5.x → 0.5.11 `SamplingParams.seed` → `sampling_seed` rename via runtime detection.
 Compatibility caveats per the SGLang docs: DFLASH and NGRAM are incompatible with `enable_dp_attention`; DFLASH requires `pp_size == 1`; STANDALONE is incompatible with `enable_dp_attention`; NGRAM is CUDA-only and disables the overlap scheduler.
 ### `mem_fraction_static` + quantization + MTP on consumer GPUs
 When combining online weight quantization (`engine_args.quantization: fp8` / `awq` / etc.) with built-in-head MTP (`speculative_algorithm: EAGLE`/`NEXTN`) on a tight VRAM budget, sglang's default `mem_fraction_static: 0.85` will OOM during draft-worker init. The reason: sglang quantizes the **target** model's transformer blocks but loads the **MTP draft worker's vocab embedding** at the source dtype (typically bf16). For a 7 B-class model with a 150k-token vocab × 4096 hidden, that's another ~1.2 GiB allocated *after* the static pool is reserved. At 0.85 fraction on a 16 GB card there's no room left.
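 The embedding estimate above can be checked with a few lines of arithmetic. The sketch below uses the round numbers from the text (150k vocab, 4096 hidden, 2 bytes per bf16 element); real models have slightly different vocab sizes, so treat the result as an order-of-magnitude figure rather than a measurement.
 ```go
 package main
 import "fmt"
 // embeddingBytes returns the size in bytes of a vocab x hidden embedding table.
 func embeddingBytes(vocab, hidden, bytesPerElem int64) int64 {
     return vocab * hidden * bytesPerElem
 }
 func main() {
     // Assumed round numbers from the text: 150k vocab, 4096 hidden, bf16 (2 bytes).
     b := embeddingBytes(150_000, 4096, 2)
     fmt.Printf("%d bytes = %.2f GB = %.2f GiB\n", b, float64(b)/1e9, float64(b)/(1<<30))
     // prints: 1228800000 bytes = 1.23 GB = 1.14 GiB
 }
 ```
 With these inputs the extra allocation is about 1.2 GB (about 1.1 GiB), which matches the estimate in the paragraph above.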
 Workaround: drop `mem_fraction_static` to ~0.7 so the post-static heap can absorb the MTP embedding alloc + CUDA graph private pools. Verified end-to-end on MiMo-7B-RL + fp8 + MTP on a 16 GB RTX 5070 Ti (`gallery/sglang-mimo-7b-mtp.yaml`) at ~88 tok/s. Models with larger vocabs or more MTP layers (e.g. DeepSeek-V3-MTP) need an even smaller fraction.
 This isn't documented anywhere upstream as of 2026-05-06 — the SGLang Gemma 4 cookbook uses 0.85 because their MTP path doesn't go through `eagle_worker_v2.py` for an embedding-bearing draft module. Don't blanket-apply 0.7 across all sglang YAMLs; only when MTP-with-built-in-heads + quantization combine.
 ## Tool-call and reasoning parsers stay on `Options[]`
 ServerArgs has `tool_call_parser` and `reasoning_parser` fields, and the backend does pass them through to `Engine` so SGLang's own HTTP/OAI surface keeps working. But for the **LocalAI** request path the backend constructs fresh per-request parser instances in `_make_parsers` (`backend.py:286`) because the parsers are stateful — the streaming and non-streaming paths each need their own.
 So the user-facing knob stays on `Options[]`:
 ```yaml
 options:
  - tool_parser:hermes
  - reasoning_parser:deepseek_r1
 ```
 Putting these in `engine_args:` will set them on `ServerArgs` but the LocalAI-level streaming `ChatDelta` will not pick them up. Don't recommend that path.
 ## What's missing today (out of scope, but worth tracking)
 - `core/config/hooks_sglang.go` — there is no SGLang equivalent of `hooks_vllm.go`. The vLLM hook auto-selects parsers for known model families from `parser_defaults.json` and seeds production engine_args defaults. A symmetric hook for SGLang could reuse the same `parser_defaults.json` (the SGLang parser names are different but the family detection is shared) and seed defaults like `enable_metrics: true` or attention-backend choices.
 - `core/gallery/importers/sglang.go` — vLLM has an importer that resolves model architecture → parser defaults at gallery-import time. A matching importer for SGLang would let `local-ai install` populate sensible parsers automatically.
 These should be a follow-up PR, not a blocker for the engine_args feature.
--- a/.agents/testing-mcp-apps.md
+++ b/.agents/testing-mcp-apps.md
@@ -1,120 +0,0 @@
 # Testing MCP Apps (Interactive Tool UIs)
 MCP Apps is an extension to MCP where tools declare interactive HTML UIs via `_meta.ui.resourceUri`. When the LLM calls such a tool, the UI renders the app in a sandboxed iframe inline in the chat. The app communicates bidirectionally with the host via `postMessage` (JSON-RPC) and can call server tools, send messages, and update model context.
 Spec: https://modelcontextprotocol.io/extensions/apps/overview
 ## Quick Start: Run a Test MCP App Server
 The `@modelcontextprotocol/server-basic-react` npm package is a ready-to-use test server that exposes a `get-time` tool with an interactive React clock UI. It requires Node >= 20, so run it in Docker:
 ```bash
 docker run -d --name mcp-app-test -p 3001:3001 node:22-slim \
  sh -c 'npx -y @modelcontextprotocol/server-basic-react'
 ```
 Wait ~10 seconds for it to start, then verify:
 ```bash
 # Check it's running
 docker logs mcp-app-test
 # Expected: "MCP server listening on http://localhost:3001/mcp"
 # Verify MCP protocol works
 curl -s -X POST http://localhost:3001/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"test","version":"1.0.0"}}}'
 # List tools — should show get-time with _meta.ui.resourceUri
 curl -s -X POST http://localhost:3001/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{"jsonrpc":"2.0","id":2,"method":"tools/list","params":{}}'
 ```
 The `tools/list` response should contain:
 ```json
 {
  "name": "get-time",
  "_meta": {
    "ui": { "resourceUri": "ui://get-time/mcp-app.html" }
  }
 }
 ```
 ## Testing in LocalAI's UI
 1. Make sure LocalAI is running (e.g. `http://localhost:8080`)
 2. Build the React UI: `cd core/http/react-ui && npm install && npm run build`
 3. Open the Chat page in your browser
 4. Click **"Client MCP"** in the chat header
 5. Add a new client MCP server:
   - **URL**: `http://localhost:3001/mcp`
   - **Use CORS proxy**: enabled (default) — required because the browser can't hit `localhost:3001` directly due to CORS; LocalAI's proxy at `/api/cors-proxy` handles it
 6. The server should connect and discover the `get-time` tool
 7. Select a model and send: **"What time is it?"**
 8. The LLM should call the `get-time` tool
 9. The tool result should render the interactive React clock app in an iframe as a standalone chat message (not inside the collapsed activity group)
 ## What to Verify
 - [ ] Tool appears in the connected tools list (not filtered — `get-time` is callable by the LLM)
 - [ ] The iframe renders as a standalone chat message with a puzzle-piece icon
 - [ ] The app loads and is interactive (clock UI, buttons work)
 - [ ] No "Reconnect to MCP server" overlay (connection is live)
 - [ ] Console logs show bidirectional communication:
  - `tools/call` messages from app to host (app calling server tools)
  - `ui/message` notifications (app sending messages)
 - [ ] After the app renders, the LLM continues and produces a text response with the time
 - [ ] Non-UI tools continue to work normally (text-only results)
 - [ ] Page reload shows the HTML statically with a reconnect overlay until you reconnect
 ## Console Log Patterns
 Healthy bidirectional communication looks like:
 ```
 Parsed message { jsonrpc: "2.0", id: N, result: {...} }     // Bridge init
 get-time result: { content: [...] }                          // Tool result received
 Calling get-time tool...                                     // App calls tool
 Sending message { method: "tools/call", ... }                // App -> host -> server
 Parsed message { jsonrpc: "2.0", id: N, result: {...} }     // Server response
 Sending message text to Host: ...                            // App sends message
 Sending message { method: "ui/message", ... }                // Message notification
 Message accepted                                             // Host acknowledged
 ```
 Benign warnings to ignore:
 - `Source map error: ... about:srcdoc` — browser devtools can't find source maps for srcdoc iframes
 - `Ignoring message from unknown source` — duplicate postMessage from iframe navigation
 - `notifications/cancelled` — app cleaning up previous requests
 ## Architecture Notes
 - **No server-side changes needed** — the MCP App protocol runs entirely in the browser
 - `PostMessageTransport` wraps `window.postMessage` between host and `srcdoc` iframe
 - `AppBridge` (from `@modelcontextprotocol/ext-apps`) auto-forwards `tools/call`, `resources/read`, `resources/list` from the app to the MCP server via the host's `Client`
 - The iframe uses `sandbox="allow-scripts allow-forms"` (no `allow-same-origin`) — opaque origin, no access to host cookies/DOM/localStorage
 - App-only tools (`_meta.ui.visibility: "app-only"`) are filtered from the LLM's tool list but remain callable by the app iframe
 ## Key Files
 - `core/http/react-ui/src/components/MCPAppFrame.jsx` — iframe + AppBridge component
 - `core/http/react-ui/src/hooks/useMCPClient.js` — MCP client hook with app UI helpers (`hasAppUI`, `getAppResource`, `getClientForTool`, `getToolDefinition`)
 - `core/http/react-ui/src/hooks/useChat.js` — agentic loop, attaches `appUI` to tool_result messages
 - `core/http/react-ui/src/pages/Chat.jsx` — renders MCPAppFrame as standalone chat messages
 ## Other Test Servers
 The `@modelcontextprotocol/ext-apps` repo has many example servers:
 - `@modelcontextprotocol/server-basic-react` — simple clock (React)
 - More examples at https://github.com/modelcontextprotocol/ext-apps/tree/main/examples
 All examples support both stdio and HTTP transport. Run without `--stdio` for HTTP mode on port 3001.
 ## Cleanup
 ```bash
 docker rm -f mcp-app-test
 ```
--- a/.agents/vllm-backend.md
+++ b/.agents/vllm-backend.md
@@ -1,115 +0,0 @@
 # Working on the vLLM Backend
 The vLLM backend lives at `backend/python/vllm/backend.py` (async gRPC) and the multimodal variant at `backend/python/vllm-omni/backend.py` (sync gRPC). Both wrap vLLM's `AsyncLLMEngine` / `Omni`, translating LocalAI's gRPC `PredictOptions` into vLLM `SamplingParams` and the engine's outputs into `Reply.chat_deltas`.
 This file captures the non-obvious bits — most of the bring-up was a single PR (`feat/vllm-parity`) and the things below are easy to get wrong.
 ## Tool calling and reasoning use vLLM's *native* parsers
 Do not write regex-based tool-call extractors for vLLM. vLLM ships:
 - `vllm.tool_parsers.ToolParserManager` — 50+ registered parsers (`hermes`, `llama3_json`, `llama4_pythonic`, `mistral`, `qwen3_xml`, `deepseek_v3`, `granite4`, `openai`, `kimi_k2`, `glm45`, …)
 - `vllm.reasoning.ReasoningParserManager` — 25+ registered parsers (`deepseek_r1`, `qwen3`, `mistral`, `gemma4`, …)
 Both can be used standalone: instantiate with a tokenizer, call `extract_tool_calls(text, request=None)` / `extract_reasoning(text, request=None)`. The backend stores the parser *classes* on `self.tool_parser_cls` / `self.reasoning_parser_cls` at LoadModel time and instantiates them per request.
 **Selection:** vLLM does *not* auto-detect parsers from model name — neither does the LocalAI backend. The user (or `core/config/hooks_vllm.go`) must pick one and pass it via `Options[]`:
 ```yaml
 options:
  - tool_parser:hermes
  - reasoning_parser:qwen3
 ```
 Auto-defaults for known model families live in `core/config/parser_defaults.json` and are applied:
 - at gallery import time by `core/gallery/importers/vllm.go`
 - at model load time by the `vllm` / `vllm-omni` backend hook in `core/config/hooks_vllm.go`
 User-supplied `tool_parser:`/`reasoning_parser:` in the config wins over defaults — the hook checks for existing entries before appending.
 **When to update `parser_defaults.json`:** any time vLLM ships a new tool or reasoning parser, or you onboard a new model family that LocalAI users will pull from HuggingFace. The file is keyed by *family pattern* matched against `normalizeModelID(cfg.Model)` (lowercase, org-prefix stripped, `_`→`-`). Patterns are checked **longest-first** — keep `qwen3.5` before `qwen3`, `llama-3.3` before `llama-3`, etc., or the wrong family wins. Add a covering test in `core/config/hooks_test.go`.
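 The normalization and longest-first rule can be illustrated with a small sketch. Everything here is hypothetical: `normalizeID` and `matchFamily` are illustrative names, substring matching is an assumption, and sorting by length at lookup time is a simplification (per the text above, the real file depends on entry order).
 ```go
 package main
 import (
     "fmt"
     "sort"
     "strings"
 )
 // normalizeID mirrors the normalization described above: lowercase,
 // strip the org prefix, and replace '_' with '-'. Hypothetical stand-in.
 func normalizeID(model string) string {
     id := strings.ToLower(model)
     if i := strings.LastIndex(id, "/"); i >= 0 {
         id = id[i+1:]
     }
     return strings.ReplaceAll(id, "_", "-")
 }
 // matchFamily returns the longest pattern contained in the normalized ID.
 // Substring matching is an assumption made for this sketch.
 func matchFamily(model string, patterns []string) (string, bool) {
     sorted := append([]string(nil), patterns...)
     sort.SliceStable(sorted, func(i, j int) bool { return len(sorted[i]) > len(sorted[j]) })
     id := normalizeID(model)
     for _, p := range sorted {
         if strings.Contains(id, p) {
             return p, true
         }
     }
     return "", false
 }
 func main() {
     patterns := []string{"qwen3", "qwen3.5", "llama-3", "llama-3.3"}
     for _, m := range []string{"Qwen/Qwen3.5-7B-Instruct", "meta-llama/Llama_3.3-70B"} {
         p, _ := matchFamily(m, patterns)
         fmt.Println(m, "->", p)
     }
 }
 ```
 If `qwen3` were tried before `qwen3.5`, both model IDs containing `qwen3.5` would resolve to the shorter family, which is the failure mode the ordering rule guards against.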
 **Sister file — `core/config/inference_defaults.json`:** same pattern but for sampling parameters (temperature, top_p, top_k, min_p, repeat_penalty, presence_penalty). Loaded by `core/config/inference_defaults.go` and applied by `ApplyInferenceDefaults()`. The schema is `map[string]float64` only — *strings don't fit*, which is why parser defaults needed their own JSON file. The inference file is **auto-generated from unsloth** via `go generate ./core/config/` (see `core/config/gen_inference_defaults/`) — don't hand-edit it; instead update the upstream source or regenerate. Both files share `normalizeModelID()` and the longest-first pattern ordering.
 **Constructor compatibility gotcha:** the abstract `ToolParser.__init__` accepts `tools=`, but several concrete parsers (Hermes2ProToolParser, etc.) override `__init__` and *only* accept `tokenizer`. Always:
 ```python
 try:
    tp = self.tool_parser_cls(self.tokenizer, tools=tools)
 except TypeError:
    tp = self.tool_parser_cls(self.tokenizer)
 ```
 ## ChatDelta is the streaming contract
 The Go side (`core/backend/llm.go`, `pkg/functions/chat_deltas.go`) consumes `Reply.chat_deltas` to assemble the OpenAI response. For tool calls to surface in `chat/completions`, the Python backend **must** populate `Reply.chat_deltas[].tool_calls` with `ToolCallDelta{index, id, name, arguments}`. Returning the raw `<tool_call>...</tool_call>` text in `Reply.message` is *not* enough — the Go regex fallback exists for llama.cpp, not for vllm.
 Same story for `reasoning_content` — emit it on `ChatDelta.reasoning_content`, not as part of `content`.
 ## Message conversion to chat templates
 `tokenizer.apply_chat_template()` expects a list of dicts, not proto Messages. The shared helper in `backend/python/common/vllm_utils.py` (`messages_to_dicts`) handles the mapping including:
 - `tool_call_id` and `name` for `role="tool"` messages
 - `tool_calls` JSON-string field → parsed Python list for `role="assistant"`
 - `reasoning_content` for thinking models
 Pass `tools=json.loads(request.Tools)` and (when `request.Metadata.get("enable_thinking") == "true"`) `enable_thinking=True` to `apply_chat_template`. Wrap in `try/except TypeError` because not every tokenizer template accepts those kwargs.
 ## CPU support and the SIMD/library minefield
 vLLM publishes prebuilt CPU wheels at `https://github.com/vllm-project/vllm/releases/...`. The pin lives in `backend/python/vllm/requirements-cpu-after.txt`.
 **Version compatibility (important):** newer vllm CPU wheels (≥ 0.15) declare `torch==2.10.0+cpu` as a hard dep, but `torch==2.10.0` only exists on the PyTorch test channel and pulls in an incompatible `torchvision`. Stay on **`vllm 0.14.1+cpu` + `torch 2.9.1+cpu`** until both upstreams catch up. Bumping requires verifying that the torchvision/torchaudio versions match.
 `requirements-cpu.txt` uses `--extra-index-url https://download.pytorch.org/whl/cpu`. `install.sh` adds `--index-strategy=unsafe-best-match` for the `cpu` profile so uv resolves transformers/vllm from PyPI while pulling torch from the PyTorch index.
 **SIMD baseline:** the prebuilt CPU wheel is compiled with AVX-512 VNNI/BF16. On a CPU without those instructions, importing `vllm.model_executor.models.registry` SIGILLs at `_run_in_subprocess` time during model inspection. There is no runtime flag to disable it. Workarounds:
 1. **Run on a host with the right SIMD baseline** (default — fast)
 2. **Build from source** with `FROM_SOURCE=true` env var. Plumbing exists end-to-end:
   - `install.sh` hides `requirements-cpu-after.txt`, runs `installRequirements` for the base deps, then clones vllm and `VLLM_TARGET_DEVICE=cpu uv pip install --no-deps .`
   - `backend/Dockerfile.python` declares `ARG FROM_SOURCE` + `ENV FROM_SOURCE`
   - `Makefile` `docker-build-backend` macro forwards `--build-arg FROM_SOURCE=$(FROM_SOURCE)` when set
   - Source build takes 30–50 minutes — too slow for per-PR CI but fine for local.
 **Runtime shared libraries:** vLLM's `vllm._C` extension `dlopen`s `libnuma.so.1` at import time. If missing, the C extension silently fails and `torch.ops._C_utils.init_cpu_threads_env` is never registered → `EngineCore` crashes on `init_device` with:
 ```
 AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env'
 ```
 `backend/python/vllm/package.sh` bundles `libnuma.so.1` and `libgomp.so.1` into `${BACKEND}/lib/`, which `libbackend.sh` adds to `LD_LIBRARY_PATH` at run time. The builder stage in `backend/Dockerfile.python` installs `libnuma1`/`libgomp1` so package.sh has something to copy. Do *not* assume the production host has these — backend images are `FROM scratch`.
 ## Backend hook system (`core/config/backend_hooks.go`)
 Per-backend defaults that used to be hardcoded in `ModelConfig.Prepare()` now live in `core/config/hooks_*.go` files and self-register via `init()`:
 - `hooks_llamacpp.go` → GGUF metadata parsing, context size, GPU layers, jinja template
 - `hooks_vllm.go` → tool/reasoning parser auto-selection from `parser_defaults.json`
 Hook keys:
 - `"llama-cpp"`, `"vllm"`, `"vllm-omni"`, … — backend-specific
 - `""` — runs only when `cfg.Backend` is empty (auto-detect case)
 - `"*"` — global catch-all, runs for every backend before specific hooks
 Multiple hooks per key are supported and run in registration order. Adding a new backend default:
 ```go
 // core/config/hooks_<backend>.go
 func init() {
    RegisterBackendHook("<backend>", myDefaults)
 }
 func myDefaults(cfg *ModelConfig, modelPath string) {
    // only fill in fields the user didn't set
 }
 ```
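 The key semantics described above (`"*"` first, then the backend-specific key, each in registration order) can be modelled with a toy registry. This is a sketch under stated assumptions: `registerHook`, `runHooks`, the trimmed `ModelConfig`, and the `Applied` field are hypothetical and exist only to make the ordering visible; the real registry in `core/config/backend_hooks.go` may be structured differently.
 ```go
 package main
 import "fmt"
 // ModelConfig is a trimmed, hypothetical stand-in for the real struct.
 type ModelConfig struct {
     Backend string
     Applied []string // records which hooks ran, for illustration only
 }
 type hookFn func(cfg *ModelConfig, modelPath string)
 var hooks = map[string][]hookFn{}
 // registerHook appends fn to the list for key, preserving registration order.
 func registerHook(key string, fn hookFn) {
     hooks[key] = append(hooks[key], fn)
 }
 // runHooks applies the "*" hooks first, then the hooks keyed by cfg.Backend.
 // An empty Backend selects the "" (auto-detect) hooks.
 func runHooks(cfg *ModelConfig, modelPath string) {
     for _, fn := range hooks["*"] {
         fn(cfg, modelPath)
     }
     for _, fn := range hooks[cfg.Backend] {
         fn(cfg, modelPath)
     }
 }
 // tag builds a hook that only records its own name on the config.
 func tag(name string) hookFn {
     return func(cfg *ModelConfig, _ string) { cfg.Applied = append(cfg.Applied, name) }
 }
 func init() {
     registerHook("vllm", tag("vllm-parsers"))
     registerHook("*", tag("global"))
     registerHook("", tag("auto-detect"))
     registerHook("vllm", tag("vllm-extra"))
 }
 func main() {
     cfg := &ModelConfig{Backend: "vllm"}
     runHooks(cfg, "/models/example")
     fmt.Println(cfg.Applied) // [global vllm-parsers vllm-extra]
 }
 ```
 Note that the global hook runs first even though it was registered second, while the two `vllm` hooks keep their registration order.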
 ## The `Messages.ToProto()` fields you need to set
 `core/schema/message.go:ToProto()` must serialize:
 - `ToolCallID` → `proto.Message.ToolCallId` (for `role="tool"` messages — links result back to the call)
 - `Reasoning` → `proto.Message.ReasoningContent`
 - `ToolCalls` → `proto.Message.ToolCalls` (JSON-encoded string)
 These were originally not serialized and tool-calling conversations broke silently — the C++ llama.cpp backend reads them but always got empty strings. Any new field added to `schema.Message` *and* `proto.Message` needs a matching line in `ToProto()`.
--- a/.devcontainer/docker-compose-devcontainer.yml
+++ b/.devcontainer/docker-compose-devcontainer.yml
@@ -10,8 +10,7 @@ services:
      - 8080:8080
    volumes:
      - localai_workspace:/workspace
-      - models:/host-models
+      - ../models:/host-models
      - backends:/host-backends
      - ./customization:/devcontainer-customization
    command: /bin/sh -c "while sleep 1000; do :; done"
    cap_add:
@@ -40,9 +39,6 @@ services:
      - GF_SECURITY_ADMIN_PASSWORD=grafana
    volumes:
      - ./grafana:/etc/grafana/provisioning/datasources
 volumes:
  prom_data:
-  localai_workspace:
+  localai_workspace:
  models:
  backends:
--- a/.docker/apt-mirror.sh
+++ b/.docker/apt-mirror.sh
@@ -1,39 +0,0 @@
 #!/bin/sh
 # Reconfigure Ubuntu apt sources to point at an alternate mirror.
 #
 # Used by Dockerfiles via `RUN --mount=type=bind,source=.docker/apt-mirror.sh,...`
 # and by CI workflows on the runner to mitigate outages of the default
 # archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com pool.
 #
 # Inputs (env):
 #   APT_MIRROR        Replacement for archive.ubuntu.com and security.ubuntu.com
 #                     (e.g. "http://azure.archive.ubuntu.com" or
 #                      "https://mirrors.edge.kernel.org").
 #                     Leave empty to keep upstream. The trailing "/ubuntu/..."
 #                     path is preserved by the rewrite.
 #   APT_PORTS_MIRROR  Replacement for ports.ubuntu.com (arm64/ppc64el/...).
 #                     Leave empty to keep upstream.
 #
 # Both default to empty, in which case the script is a no-op.
 set -e
 if [ -z "${APT_MIRROR}" ] && [ -z "${APT_PORTS_MIRROR}" ]; then
    exit 0
 fi
 # Ubuntu 24.04 (noble) ships DEB822 sources at /etc/apt/sources.list.d/ubuntu.sources;
 # older releases use /etc/apt/sources.list. We rewrite whichever exists.
 for f in /etc/apt/sources.list.d/ubuntu.sources /etc/apt/sources.list; do
    [ -f "$f" ] || continue
    if [ -n "${APT_MIRROR}" ]; then
        # Use a comma delimiter so the alternation pipe in the regex
        # is not interpreted as the s/// separator.
        sed -i -E "s,https?://(archive\.ubuntu\.com|security\.ubuntu\.com),${APT_MIRROR},g" "$f"
    fi
    if [ -n "${APT_PORTS_MIRROR}" ]; then
        sed -i -E "s,https?://ports\.ubuntu\.com,${APT_PORTS_MIRROR},g" "$f"
    fi
 done
 echo "apt-mirror: rewrote sources (APT_MIRROR='${APT_MIRROR}', APT_PORTS_MIRROR='${APT_PORTS_MIRROR}')"
--- a/.docker/ik-llama-cpp-compile.sh
+++ b/.docker/ik-llama-cpp-compile.sh
@@ -1,30 +0,0 @@
 #!/usr/bin/env bash
 # Shared compile logic for backend/Dockerfile.ik-llama-cpp.
 # Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.
 set -euxo pipefail
 export CCACHE_DIR=/root/.ccache
 ccache --max-size=5G || true
 ccache -z || true
 export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"
 if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
  export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
  rm -rf /LocalAI/backend/cpp/ik-llama-cpp-*-build
 fi
 cd /LocalAI/backend/cpp/ik-llama-cpp
 if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
  # ARM64 / ROCm: build without x86 SIMD
  make ik-llama-cpp-fallback
 else
  # ik_llama.cpp's IQK kernels require at least AVX2
  make ik-llama-cpp-avx2
 fi
 ccache -s || true
--- a/.docker/install-base-deps.sh
+++ b/.docker/install-base-deps.sh
@@ -1,244 +0,0 @@
 #!/usr/bin/env bash
 # Single source of truth for builder-base contents.
 #
 # Used by:
 #   - backend/Dockerfile.base-grpc-builder        (CI prebuilt-base source of truth)
 #   - backend/Dockerfile.llama-cpp                (builder-fromsource stage)
 #   - backend/Dockerfile.ik-llama-cpp             (builder-fromsource stage)
 #   - backend/Dockerfile.turboquant               (builder-fromsource stage)
 #
 # All four files invoke this script via
 #   RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
 #       --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
 #       bash /usr/local/sbin/install-base-deps
 #
 # so the prebuilt CI base image and the from-source local-dev path are
 # bit-equivalent by construction.
 #
 # Inputs (env, populated from Dockerfile ARG/ENV):
 #   BUILD_TYPE                ("cublas"|"l4t"|"hipblas"|"vulkan"|"sycl"|"clblas"|"")
 #   CUDA_MAJOR_VERSION        ("12" | "13" | "")
 #   CUDA_MINOR_VERSION        ("8" | "0" | "")
 #   TARGETARCH                ("amd64" | "arm64")
 #   UBUNTU_VERSION            ("2204" | "2404")
 #   SKIP_DRIVERS              ("false" | "true")
 #   CMAKE_FROM_SOURCE         ("false" | "true")
 #   CMAKE_VERSION             ("3.31.10")
 #   GRPC_VERSION              ("v1.65.0")
 #   GRPC_MAKEFLAGS            ("-j4 -Otarget")
 #   APT_MIRROR / APT_PORTS_MIRROR  (optional; consumed by /usr/local/sbin/apt-mirror)
 #   AMDGPU_TARGETS            (optional; only relevant for hipblas downstream)
 #
 # IMPORTANT: install logic is copied verbatim from the prior in-Dockerfile
 # RUN blocks. Do not paraphrase apt invocations / version pins / sed line
 # numbers / deb URLs — the bit-equivalence guarantee depends on it.
 set -eux
 # --- 0. apt mirror rewrite (no-op when APT_MIRROR / APT_PORTS_MIRROR unset) ---
 if [ -x /usr/local/sbin/apt-mirror ]; then
    APT_MIRROR="${APT_MIRROR:-}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR:-}" \
        sh /usr/local/sbin/apt-mirror
 fi
 export DEBIAN_FRONTEND=noninteractive
 export MAKEFLAGS="${GRPC_MAKEFLAGS:-}"
 # --- 1. Base apt build deps ---
 apt-get update
 apt-get install -y --no-install-recommends \
    build-essential \
    ccache git \
    ca-certificates \
    make \
    pkg-config libcurl4-openssl-dev \
    curl unzip \
    libssl-dev wget
 apt-get clean
 rm -rf /var/lib/apt/lists/*
 # --- 2. Vulkan SDK (BUILD_TYPE=vulkan) ---
 # NB: this block intentionally installs `cmake` via apt as part of the
 # Vulkan tooling — must run before the dedicated CMake step below.
 if [ "${BUILD_TYPE:-}" = "vulkan" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; then
    apt-get update
    apt-get install -y  --no-install-recommends \
        software-properties-common pciutils wget gpg-agent
    apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
        libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
        libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
        git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
        ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
        clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
    if [ "amd64" = "${TARGETARCH:-}" ]; then
        wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz"
        tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz
        rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz
        mkdir -p /opt/vulkan-sdk
        mv 1.4.335.0 /opt/vulkan-sdk/
        ( cd /opt/vulkan-sdk/1.4.335.0 && \
          ./vulkansdk --no-deps --maxjobs \
              vulkan-loader \
              vulkan-validationlayers \
              vulkan-extensionlayer \
              vulkan-tools \
              shaderc )
        cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/
        cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/
        cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/
        cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/
        rm -rf /opt/vulkan-sdk
    fi
    if [ "arm64" = "${TARGETARCH:-}" ]; then
        mkdir vulkan
        ( cd vulkan && \
          curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
          tar -xvf vulkan-sdk.tar.xz && \
          rm vulkan-sdk.tar.xz && \
          cd 1.4.335.0 && \
          cp -rfv aarch64/bin/* /usr/bin/ && \
          cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
          cp -rfv aarch64/include/* /usr/include/ && \
          cp -rfv aarch64/share/* /usr/share/ )
        rm -rf vulkan
    fi
    ldconfig
    apt-get clean
    rm -rf /var/lib/apt/lists/*
 fi
 # --- 3. CUDA toolkit (BUILD_TYPE=cublas|l4t) ---
 if { [ "${BUILD_TYPE:-}" = "cublas" ] || [ "${BUILD_TYPE:-}" = "l4t" ]; } && [ "${SKIP_DRIVERS:-false}" = "false" ]; then
    apt-get update
    apt-get install -y  --no-install-recommends \
        software-properties-common pciutils
    if [ "amd64" = "${TARGETARCH:-}" ]; then
        curl -O "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb"
    fi
    if [ "arm64" = "${TARGETARCH:-}" ]; then
        if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
            curl -O "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb"
        else
            curl -O "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb"
        fi
    fi
    dpkg -i cuda-keyring_1.1-1_all.deb
    rm -f cuda-keyring_1.1-1_all.deb
    apt-get update
    apt-get install -y --no-install-recommends \
        "cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
        "libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
        "libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
        "libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
        "libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
        "libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}"
    if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "${TARGETARCH:-}" ]; then
        apt-get install -y --no-install-recommends \
            "libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
            "libcudnn9-cuda-${CUDA_MAJOR_VERSION}" \
            "cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
            "libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}"
    fi
    apt-get clean
    rm -rf /var/lib/apt/lists/*
 fi
 # --- 4. cuDSS / NVPL on arm64 + cublas (legacy JetPack / Tegra) ---
 # https://github.com/NVIDIA/Isaac-GR00T/issues/343
 if [ "${BUILD_TYPE:-}" = "cublas" ] && [ "${TARGETARCH:-}" = "arm64" ]; then
    wget "https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb"
    dpkg -i "cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb"
    cp /var/cudss-local-tegra-repo-ubuntu"${UBUNTU_VERSION}"-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/
    apt-get update
    apt-get -y install cudss "cudss-cuda-${CUDA_MAJOR_VERSION}"
    wget "https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb"
    dpkg -i "nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb"
    cp /var/nvpl-local-repo-ubuntu"${UBUNTU_VERSION}"-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/
    apt-get update
    apt-get install -y nvpl
 fi
 # --- 5. clBLAS (BUILD_TYPE=clblas) ---
 # Present in variant Dockerfiles' from-source path but not in master's
 # Dockerfile.base-grpc-builder. No CI matrix entry currently uses this,
 # but keep parity so a future BUILD_TYPE=clblas build doesn't drift.
 if [ "${BUILD_TYPE:-}" = "clblas" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; then
    apt-get update
    apt-get install -y --no-install-recommends \
        libclblast-dev
    apt-get clean
    rm -rf /var/lib/apt/lists/*
 fi
 # --- 6. ROCm / HIP build deps (BUILD_TYPE=hipblas) ---
 if [ "${BUILD_TYPE:-}" = "hipblas" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; then
    apt-get update
    apt-get install -y --no-install-recommends \
        hipblas-dev \
        hipblaslt-dev \
        rocblas-dev
    apt-get clean
    rm -rf /var/lib/apt/lists/*
    # I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install,
    # which results in local-ai and others not being able to locate the libraries.
    # We run ldconfig ourselves to work around this packaging deficiency.
    ldconfig
    # Log which GPU architectures have rocBLAS kernel support
    echo "rocBLAS library data architectures:"
    (ls /opt/rocm*/lib/rocblas/library/Kernels* 2>/dev/null || ls /opt/rocm*/lib64/rocblas/library/Kernels* 2>/dev/null) | grep -oP 'gfx[0-9a-z+-]+' | sort -u || \
        echo "WARNING: No rocBLAS kernel data found"
 fi
 echo "TARGETARCH: ${TARGETARCH:-}"
 # --- 7. protoc (always) ---
 # The version in 22.04 is too old. We will create one as part of installing
 # the GRPC build below but that will also bring in a newer version of absl
 # which stablediffusion cannot compile with. This version of protoc is only
 # here so that we can generate the grpc code for the stablediffusion build.
 if [ "amd64" = "${TARGETARCH:-}" ]; then
    curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip
    unzip -j -d /usr/local/bin protoc.zip bin/protoc
    rm protoc.zip
 fi
 if [ "arm64" = "${TARGETARCH:-}" ]; then
    curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip
    unzip -j -d /usr/local/bin protoc.zip bin/protoc
    rm protoc.zip
 fi
 # --- 8. CMake (apt or compiled from source) ---
 # The version in 22.04 is too old. Vulkan path above already pulled cmake
 # via apt; the from-source branch here will install over it which is fine.
 if [ "${CMAKE_FROM_SOURCE:-false}" = "true" ]; then
    curl -L -s "https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz" -o cmake.tar.gz
    tar xvf cmake.tar.gz
    ( cd "cmake-${CMAKE_VERSION}" && ./configure && make && make install )
 else
    apt-get update
    apt-get install -y \
        cmake
    apt-get clean
    rm -rf /var/lib/apt/lists/*
 fi
 # --- 9. gRPC compile + install at /opt/grpc ---
 # We install GRPC to a different prefix here so that we can copy in only
 # the build artifacts later — saves several hundred MB on the final docker
 # image size vs copying in the entire GRPC source tree and running
 # `make install` in the target container.
 #
 # The TESTONLY abseil sed patch and /opt/grpc prefix are load-bearing —
 # downstream Dockerfiles `COPY` /opt/grpc to /usr/local (or rely on the
 # prebuilt base having it at /opt/grpc).
 mkdir -p /build
 cd /build
 git clone --recurse-submodules --jobs 4 -b "${GRPC_VERSION}" --depth 1 --shallow-submodules https://github.com/grpc/grpc
 mkdir -p /build/grpc/cmake/build
 cd /build/grpc/cmake/build
 sed -i "216i\\  TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt"
 cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../..
 make
 make install
 cd /
 rm -rf /build
--- a/.docker/llama-cpp-compile.sh
+++ b/.docker/llama-cpp-compile.sh
@@ -1,35 +0,0 @@
 #!/usr/bin/env bash
 # Shared compile logic for backend/Dockerfile.llama-cpp.
 # Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.
 set -euxo pipefail
 export CCACHE_DIR=/root/.ccache
 ccache --max-size=5G || true
 ccache -z || true
 export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"
 if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
  export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
  rm -rf /LocalAI/backend/cpp/llama-cpp-*-build
 fi
 if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
  cd /LocalAI/backend/cpp/llama-cpp
  make llama-cpp-fallback
  make llama-cpp-grpc
  make llama-cpp-rpc-server
 else
  cd /LocalAI/backend/cpp/llama-cpp
  make llama-cpp-avx
  make llama-cpp-avx2
  make llama-cpp-avx512
  make llama-cpp-fallback
  make llama-cpp-grpc
  make llama-cpp-rpc-server
 fi
 ccache -s || true
--- a/.docker/turboquant-compile.sh
+++ b/.docker/turboquant-compile.sh
@@ -1,35 +0,0 @@
 #!/usr/bin/env bash
 # Shared compile logic for backend/Dockerfile.turboquant.
 # Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.
 set -euxo pipefail
 export CCACHE_DIR=/root/.ccache
 ccache --max-size=5G || true
 ccache -z || true
 export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"
 if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
  export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
  rm -rf /LocalAI/backend/cpp/turboquant-*-build
 fi
 cd /LocalAI/backend/cpp/turboquant
 if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
  make turboquant-fallback
  make turboquant-grpc
  make turboquant-rpc-server
 else
  make turboquant-avx
  make turboquant-avx2
  make turboquant-avx512
  make turboquant-fallback
  make turboquant-grpc
  make turboquant-rpc-server
 fi
 ccache -s || true
--- a/.env
+++ b/.env
@@ -26,9 +26,6 @@
 ## Disables COMPEL (Diffusers)
 # COMPEL=0
 ## Disables SD_EMBED (Diffusers)
 # SD_EMBED=0
 ## Enable/Disable single backend (useful if only one GPU is available)
 # LOCALAI_SINGLE_ACTIVE_BACKEND=true
--- a/.github/actions/configure-apt-mirror/action.yml
+++ b/.github/actions/configure-apt-mirror/action.yml
@@ -1,100 +0,0 @@
 name: 'Configure apt mirror'
 description: |
  Reconfigure the GitHub Actions runner's Ubuntu apt sources to use an
  alternate mirror, and emit the effective URLs as outputs so callers can
  forward them as Docker build-args.
  Two mirror profiles depending on where the runner lives, because the
  best mirror differs by network:
    * github-hosted runners run on Azure, so they default to the
      Azure-hosted Ubuntu mirror (lowest latency, same VPC).
    * self-hosted runners (arc-runner-set, bigger-runner, ...) typically
      cannot route to azure.archive.ubuntu.com, so they default to the
      kernel.org mirror, which is publicly reachable from anywhere.
  Pass an empty string to either input to skip the rewrite for that
  profile and keep upstream archive.ubuntu.com / ports.ubuntu.com.
 inputs:
  github-hosted-mirror:
    description: 'archive/security mirror URL for github-hosted runners (empty = upstream)'
    required: false
    default: 'http://azure.archive.ubuntu.com'
  github-hosted-ports-mirror:
    description: 'ports.ubuntu.com mirror URL for github-hosted runners (empty = upstream)'
    required: false
    default: 'http://azure.ports.ubuntu.com'
  self-hosted-mirror:
    description: 'archive/security mirror URL for self-hosted runners (empty = upstream)'
    required: false
    # HTTP, not HTTPS: the bare ubuntu:24.04 builder image doesn't ship
    # ca-certificates, so the very first apt-get update over TLS would
    # fail with "No system certificates available" before it can install
    # anything. apt validates package integrity via GPG signatures, so
    # plain HTTP is safe for the archive itself.
    default: 'http://mirrors.edge.kernel.org'
  self-hosted-ports-mirror:
    description: 'ports.ubuntu.com mirror URL for self-hosted runners (empty = upstream)'
    required: false
    # mirrors.edge.kernel.org does NOT carry /ubuntu-ports/ — only the
    # main /ubuntu/ archive — so arm64 builds 404 there. Leave ports
    # upstream by default. The original DDoS was on archive.ubuntu.com
    # so ports.ubuntu.com remains the path of least surprise.
    default: ''
 outputs:
  effective-mirror:
    description: 'The mirror URL actually applied for this runner (or empty)'
    value: ${{ steps.pick.outputs.mirror }}
  effective-ports-mirror:
    description: 'The ports mirror URL actually applied for this runner (or empty)'
    value: ${{ steps.pick.outputs.ports-mirror }}
 runs:
  using: 'composite'
  steps:
    - name: Pick effective mirror for this runner
      id: pick
      shell: bash
      env:
        RUNNER_ENV: ${{ runner.environment }}
        GH_MIRROR: ${{ inputs.github-hosted-mirror }}
        GH_PORTS_MIRROR: ${{ inputs.github-hosted-ports-mirror }}
        SH_MIRROR: ${{ inputs.self-hosted-mirror }}
        SH_PORTS_MIRROR: ${{ inputs.self-hosted-ports-mirror }}
      run: |
        if [ "${RUNNER_ENV}" = "github-hosted" ]; then
          MIRROR="${GH_MIRROR}"
          PORTS_MIRROR="${GH_PORTS_MIRROR}"
        else
          MIRROR="${SH_MIRROR}"
          PORTS_MIRROR="${SH_PORTS_MIRROR}"
        fi
        echo "configure-apt-mirror: runner=${RUNNER_ENV} mirror='${MIRROR}' ports-mirror='${PORTS_MIRROR}'"
        echo "mirror=${MIRROR}" >> "$GITHUB_OUTPUT"
        echo "ports-mirror=${PORTS_MIRROR}" >> "$GITHUB_OUTPUT"
    - name: Rewrite apt sources
      if: steps.pick.outputs.mirror != '' || steps.pick.outputs.ports-mirror != ''
      shell: bash
      env:
        APT_MIRROR: ${{ steps.pick.outputs.mirror }}
        APT_PORTS_MIRROR: ${{ steps.pick.outputs.ports-mirror }}
      run: |
        set -e
        # Ubuntu 24.04 (noble) ships DEB822 sources at
        # /etc/apt/sources.list.d/ubuntu.sources; older releases use
        # /etc/apt/sources.list. Rewrite whichever exists.
        for f in /etc/apt/sources.list.d/ubuntu.sources /etc/apt/sources.list; do
          sudo test -f "$f" || continue
          if [ -n "${APT_MIRROR}" ]; then
            # Comma delimiter so the alternation pipe in the regex is not
            # interpreted as the s/// separator.
            sudo sed -i -E "s,https?://(archive\.ubuntu\.com|security\.ubuntu\.com),${APT_MIRROR},g" "$f"
          fi
          if [ -n "${APT_PORTS_MIRROR}" ]; then
            sudo sed -i -E "s,https?://ports\.ubuntu\.com,${APT_PORTS_MIRROR},g" "$f"
          fi
        done
        echo "Runner apt mirror configured (APT_MIRROR='${APT_MIRROR}', APT_PORTS_MIRROR='${APT_PORTS_MIRROR}')"
--- a/.github/actions/free-disk-space/action.yml
+++ b/.github/actions/free-disk-space/action.yml
@@ -1,65 +0,0 @@
 name: 'Free disk space on hosted runners'
 description: |
  Aggressively clean GitHub-hosted ubuntu-latest runners to reclaim ~6-10 GB
  of working space before docker buildx steps. Combines jlumbroso/free-disk-space
  with explicit apt purges of large packages we never use (dotnet, ghc, mono,
  android, jdk, ...).
  No-op on self-hosted runners; pass mode=skip to force-disable.
 inputs:
  mode:
    description: 'hosted (default — clean) or skip (no-op)'
    required: false
    default: 'hosted'
 runs:
  using: 'composite'
  steps:
    - name: Free Disk Space (Ubuntu)
      if: inputs.mode == 'hosted' && runner.environment == 'github-hosted'
      uses: jlumbroso/free-disk-space@main
      with:
        tool-cache: true
        android: true
        dotnet: true
        haskell: true
        large-packages: true
        docker-images: true
        swap-storage: true
    - name: Release space from worker
      if: inputs.mode == 'hosted' && runner.environment == 'github-hosted'
      shell: bash
      run: |
        echo "Listing top largest packages"
        pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
        head -n 30 <<< "${pkgs}"
        df -h
        sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
        sudo apt-get remove --auto-remove android-sdk-platform-tools snapd || true
        sudo apt-get purge --auto-remove android-sdk-platform-tools snapd || true
        sudo rm -rf /usr/local/lib/android
        sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
        sudo rm -rf /usr/share/dotnet
        sudo apt-get remove -y '^mono-.*' || true
        sudo apt-get remove -y '^ghc-.*' || true
        sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
        sudo apt-get remove -y 'php.*' || true
        sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
        sudo apt-get remove -y '^google-.*' || true
        sudo apt-get remove -y azure-cli || true
        sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
        sudo apt-get remove -y '^gfortran-.*' || true
        sudo apt-get remove -y microsoft-edge-stable || true
        sudo apt-get remove -y firefox || true
        sudo apt-get remove -y powershell || true
        sudo apt-get remove -y r-base-core || true
        sudo apt-get autoremove -y
        sudo apt-get clean
        sudo rm -rfv build || true
        sudo rm -rf /usr/share/dotnet || true
        sudo rm -rf /opt/ghc || true
        sudo rm -rf "/usr/local/share/boost" || true
        sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
        df -h
--- a/.github/actions/setup-build-disk/action.yml
+++ b/.github/actions/setup-build-disk/action.yml
@@ -1,59 +0,0 @@
 name: 'Set up build disk on hosted runners'
 description: |
  Relocate Docker's data-root to /mnt (which has ~75 GB free, vs ~20 GB
  on / after free-disk-space). Combined with the apt cleanup, gives
  ~100 GB working space for buildx — enough for ROCm dev image + vLLM
  torch install + flash-attn build.
  No-op on:
    - self-hosted runners (no /mnt expectation)
    - non-X64 runners (verify /mnt shape on ubuntu-24.04-arm separately
      before enabling there — see Task 3.2 in the migration plan)
    - mode=skip (force-disable from caller)
  Must run after free-disk-space (which removes large packages — would
  fail mid-uninstall if Docker were stopped) and before any Docker
  operation (setup-qemu, setup-buildx, login, build) so the relocated
  data-root catches all subsequent docker activity.
 inputs:
  mode:
    description: 'auto (default — relocate on hosted X64 only) or skip'
    required: false
    default: 'auto'
 runs:
  using: 'composite'
  steps:
    - name: Relocate Docker data-root to /mnt
      if: inputs.mode == 'auto' && runner.environment == 'github-hosted' && runner.arch == 'X64'
      shell: bash
      run: |
        set -euo pipefail
        echo "Before relocation:"
        df -h / /mnt || true
        sudo systemctl stop docker docker.socket
        sudo mkdir -p /mnt/docker-data /mnt/docker-tmp
        # buildx CLI runs as the unprivileged runner user and creates
        # config dirs under TMPDIR before binding them into the buildkit
        # container. /mnt is owned by root by default; mirror /tmp's
        # 1777 (world-writable + sticky) so non-root processes can write.
        sudo chmod 1777 /mnt/docker-tmp
        if [ -d /var/lib/docker ] && [ ! -L /var/lib/docker ]; then
          sudo rsync -a /var/lib/docker/ /mnt/docker-data/
          sudo rm -rf /var/lib/docker
          sudo ln -s /mnt/docker-data /var/lib/docker
        fi
        # daemon.json may not exist; merge data-root in or create minimal.
        if [ -f /etc/docker/daemon.json ]; then
          sudo jq '."data-root" = "/mnt/docker-data"' /etc/docker/daemon.json | sudo tee /etc/docker/daemon.json.new >/dev/null
          sudo mv /etc/docker/daemon.json.new /etc/docker/daemon.json
        else
          echo '{"data-root":"/mnt/docker-data"}' | sudo tee /etc/docker/daemon.json
        fi
        sudo systemctl start docker
        # Make TMPDIR persist for subsequent steps in the same job.
        echo "TMPDIR=/mnt/docker-tmp" >> "$GITHUB_ENV"
        echo "After relocation:"
        df -h / /mnt
        docker info | grep -i 'docker root dir' || true
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
--- a/.github/bump_vllm_wheel.sh
+++ b/.github/bump_vllm_wheel.sh
@@ -1,45 +0,0 @@
 #!/bin/bash
 # Bump the cublas13 vLLM wheel pin in requirements-cublas13-after.txt.
 #
 # vLLM's PyPI wheel is built against CUDA 12 so the cublas13 build pulls a
 # cu130-flavoured wheel from vLLM's per-tag index at
 # https://wheels.vllm.ai/<TAG>/cu130/. That URL segment is itself version-locked
 # (no /latest/ alias upstream), so bumping vLLM means rewriting both the URL
 # segment and the version constraint atomically. bump_deps.sh handles git-sha
 # vars in Makefiles; this script handles the two-value rewrite specific to the
 # vLLM requirements file.
 set -xe
 REPO=$1   # vllm-project/vllm
 FILE=$2   # backend/python/vllm/requirements-cublas13-after.txt
 VAR=$3    # VLLM_VERSION (used for output file names so the workflow can read them)
 if [ -z "$FILE" ] || [ -z "$REPO" ] || [ -z "$VAR" ]; then
    echo "usage: $0 <repo> <requirements-file> <var-name>" >&2
    exit 1
 fi
 # /releases/latest returns the most recent non-prerelease tag.
 LATEST_TAG=$(curl -sS -H "Accept: application/vnd.github+json" \
    "https://api.github.com/repos/$REPO/releases/latest" \
    | python3 -c "import json,sys; print(json.load(sys.stdin)['tag_name'])")
 # Strip leading 'v' (vLLM tags are 'v0.20.0', the URL/version use '0.20.0').
 NEW_VERSION="${LATEST_TAG#v}"
 set +e
 CURRENT_VERSION=$(grep -oE '^vllm==[0-9]+\.[0-9]+\.[0-9]+' "$FILE" | head -1 | cut -d= -f3)
 set -e
 # sed both lines unconditionally — peter-evans/create-pull-request opens no PR
 # when the working tree is clean, so a no-op rewrite is safe.
 sed -i "$FILE" \
    -e "s|wheels\.vllm\.ai/[^/]*/cu130|wheels.vllm.ai/$NEW_VERSION/cu130|g" \
    -e "s|^vllm==.*|vllm==$NEW_VERSION|"
 if [ -z "$CURRENT_VERSION" ]; then
    echo "Could not find vllm==X.Y.Z in $FILE."
    exit 0
 fi
 echo "Changes: https://github.com/$REPO/compare/v${CURRENT_VERSION}...${LATEST_TAG}" >> "${VAR}_message.txt"
 echo "${NEW_VERSION}" >> "${VAR}_commit.txt"
--- a/.github/gallery-agent/agent.go
+++ b/.github/gallery-agent/agent.go
@@ -0,0 +1,445 @@
 package main
 import (
 	"context"
 	"encoding/json"
 	"fmt"
 	"io"
 	"net/http"
 	"os"
 	"regexp"
 	"slices"
 	"strings"
 	"github.com/ghodss/yaml"
 	hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
 	cogito "github.com/mudler/cogito"
 	"github.com/mudler/cogito/structures"
 	"github.com/sashabaranov/go-openai/jsonschema"
 )
 var (
 	openAIModel      = os.Getenv("OPENAI_MODEL")
 	openAIKey        = os.Getenv("OPENAI_KEY")
 	openAIBaseURL    = os.Getenv("OPENAI_BASE_URL")
 	galleryIndexPath = os.Getenv("GALLERY_INDEX_PATH")
 	//defaultclient
 	llm = cogito.NewOpenAILLM(openAIModel, openAIKey, openAIBaseURL)
 )
 // cleanTextContent removes trailing spaces, tabs, and normalizes line endings
 // to prevent YAML linting issues like trailing spaces and multiple empty lines
 func cleanTextContent(text string) string {
 	lines := strings.Split(text, "\n")
 	var cleanedLines []string
 	var prevEmpty bool
 	for _, line := range lines {
 		// Remove all trailing whitespace (spaces, tabs, etc.)
 		trimmed := strings.TrimRight(line, " \t\r")
 		// Avoid multiple consecutive empty lines
 		if trimmed == "" {
 			if !prevEmpty {
 				cleanedLines = append(cleanedLines, "")
 			}
 			prevEmpty = true
 		} else {
 			cleanedLines = append(cleanedLines, trimmed)
 			prevEmpty = false
 		}
 	}
 	// Remove trailing empty lines from the result
 	result := strings.Join(cleanedLines, "\n")
 	return stripThinkingTags(strings.TrimRight(result, "\n"))
 }
 type galleryModel struct {
 	Name string   `yaml:"name"`
 	Urls []string `yaml:"urls"`
 }
 // isModelExisting checks if a specific model ID exists in the gallery using text search
 func isModelExisting(modelID string) (bool, error) {
 	indexPath := getGalleryIndexPath()
 	content, err := os.ReadFile(indexPath)
 	if err != nil {
 		return false, fmt.Errorf("failed to read %s: %w", indexPath, err)
 	}
 	var galleryModels []galleryModel
 	err = yaml.Unmarshal(content, &galleryModels)
 	if err != nil {
 		return false, fmt.Errorf("failed to unmarshal %s: %w", indexPath, err)
 	}
 	for _, galleryModel := range galleryModels {
 		if slices.Contains(galleryModel.Urls, modelID) {
 			return true, nil
 		}
 	}
 	return false, nil
 }
 // filterExistingModels removes models that already exist in the gallery
 func filterExistingModels(models []ProcessedModel) ([]ProcessedModel, error) {
 	var filteredModels []ProcessedModel
 	for _, model := range models {
 		exists, err := isModelExisting(model.ModelID)
 		if err != nil {
 			fmt.Printf("Error checking if model %s exists: %v, skipping\n", model.ModelID, err)
 			continue
 		}
 		if !exists {
 			filteredModels = append(filteredModels, model)
 		} else {
 			fmt.Printf("Skipping existing model: %s\n", model.ModelID)
 		}
 	}
 	fmt.Printf("Filtered out %d existing models, %d new models remaining\n",
 		len(models)-len(filteredModels), len(filteredModels))
 	return filteredModels, nil
 }
 // getGalleryIndexPath returns the gallery index file path, with a default fallback
 func getGalleryIndexPath() string {
 	if galleryIndexPath != "" {
 		return galleryIndexPath
 	}
 	return "gallery/index.yaml"
 }
 func stripThinkingTags(content string) string {
 	// Remove content between <thinking> and </thinking> (including multi-line)
 	content = regexp.MustCompile(`(?s)<thinking>.*?</thinking>`).ReplaceAllString(content, "")
 	// Remove content between <think> and </think> (including multi-line)
 	content = regexp.MustCompile(`(?s)<think>.*?</think>`).ReplaceAllString(content, "")
 	// Clean up any extra whitespace
 	content = strings.TrimSpace(content)
 	return content
 }
 func getRealReadme(ctx context.Context, repository string) (string, error) {
 	// Create a conversation fragment
 	fragment := cogito.NewEmptyFragment().
 		AddMessage("user",
 			`Your task is to get a clear description of a large language model from huggingface by using the provided tool. I will share with you a repository that might be quantized, and as such probably not by the original model author. We need to get the real  description of the model, and not the one that might be quantized. You will have to call the tool to get the readme more than once by figuring out from the quantized readme which is the base model readme. This is the repository: `+repository)
 	// Execute with tools
 	result, err := cogito.ExecuteTools(llm, fragment,
 		cogito.WithIterations(3),
 		cogito.WithMaxAttempts(3),
 		cogito.WithTools(&HFReadmeTool{client: hfapi.NewClient()}))
 	if err != nil {
 		return "", err
 	}
 	result = result.AddMessage("user", "Describe the model in a clear and concise way that can be shared in a model gallery.")
 	// Get a response
 	newFragment, err := llm.Ask(ctx, result)
 	if err != nil {
 		return "", err
 	}
 	content := newFragment.LastMessage().Content
 	return cleanTextContent(content), nil
 }
 func selectMostInterestingModels(ctx context.Context, searchResult *SearchResult) ([]ProcessedModel, error) {
 	if len(searchResult.Models) == 1 {
 		return searchResult.Models, nil
 	}
 	// Create a conversation fragment
 	fragment := cogito.NewEmptyFragment().
 		AddMessage("user",
 			`Your task is to analyze a list of AI models and select the most interesting ones for a model gallery. You will be given detailed information about multiple models including their metadata, file information, and README content.
 Consider the following criteria when selecting models:
 1. Model popularity (download count)
 2. Model recency (last modified date)
 3. Model completeness (has preferred model file, README, etc.)
 4. Model uniqueness (not duplicates or very similar models)
 5. Model quality (based on README content and description)
 6. Model utility (practical applications)
 You should select models that would be most valuable for users browsing a model gallery. Prioritize models that are:
 - Well-documented with clear READMEs
 - Recently updated
 - Popular (high download count)
 - Have the preferred quantization format available
 - Offer unique capabilities or are from reputable authors
 Return your analysis and selection reasoning.`)
 	// Add the search results as context
 	modelsInfo := fmt.Sprintf("Found %d models matching '%s' with quantization preference '%s':\n\n",
 		searchResult.TotalModelsFound, searchResult.SearchTerm, searchResult.Quantization)
 	for i, model := range searchResult.Models {
 		modelsInfo += fmt.Sprintf("Model %d:\n", i+1)
 		modelsInfo += fmt.Sprintf("  ID: %s\n", model.ModelID)
 		modelsInfo += fmt.Sprintf("  Author: %s\n", model.Author)
 		modelsInfo += fmt.Sprintf("  Downloads: %d\n", model.Downloads)
 		modelsInfo += fmt.Sprintf("  Last Modified: %s\n", model.LastModified)
 		modelsInfo += fmt.Sprintf("  Files: %d files\n", len(model.Files))
 		if model.PreferredModelFile != nil {
 			modelsInfo += fmt.Sprintf("  Preferred Model File: %s (%d bytes)\n",
 				model.PreferredModelFile.Path, model.PreferredModelFile.Size)
 		} else {
 			modelsInfo += "  No preferred model file found\n"
 		}
 		if model.ReadmeContent != "" {
 			modelsInfo += fmt.Sprintf("  README: %s\n", model.ReadmeContent)
 		}
 		if model.ProcessingError != "" {
 			modelsInfo += fmt.Sprintf("  Processing Error: %s\n", model.ProcessingError)
 		}
 		modelsInfo += "\n"
 	}
 	fragment = fragment.AddMessage("user", modelsInfo)
 	fragment = fragment.AddMessage("user", "Based on your analysis, select the top 5 most interesting models and provide a brief explanation for each selection. Also, create a filtered SearchResult with only the selected models. Return just a list of repositories IDs, you will later be asked to output it as a JSON array with the json tool.")
 	// Get a response
 	newFragment, err := llm.Ask(ctx, fragment)
 	if err != nil {
 		return nil, err
 	}
 	fmt.Println(newFragment.LastMessage().Content)
 	repositories := struct {
 		Repositories []string `json:"repositories"`
 	}{}
 	s := structures.Structure{
 		Schema: jsonschema.Definition{
 			Type:                 jsonschema.Object,
 			AdditionalProperties: false,
 			Properties: map[string]jsonschema.Definition{
 				"repositories": {
 					Type:        jsonschema.Array,
 					Items:       &jsonschema.Definition{Type: jsonschema.String},
 					Description: "The trending repositories IDs",
 				},
 			},
 			Required: []string{"repositories"},
 		},
 		Object: &repositories,
 	}
 	err = newFragment.ExtractStructure(ctx, llm, s)
 	if err != nil {
 		return nil, err
 	}
 	filteredModels := []ProcessedModel{}
 	for _, m := range searchResult.Models {
 		if slices.Contains(repositories.Repositories, m.ModelID) {
 			filteredModels = append(filteredModels, m)
 		}
 	}
 	return filteredModels, nil
 }
 // ModelMetadata represents extracted metadata from a model
 type ModelMetadata struct {
 	Tags    []string `json:"tags"`
 	License string   `json:"license"`
 }
 // extractModelMetadata extracts tags and license from model README and documentation
 func extractModelMetadata(ctx context.Context, model ProcessedModel) ([]string, string, error) {
 	// Create a conversation fragment
 	fragment := cogito.NewEmptyFragment().
 		AddMessage("user",
 			`Your task is to extract metadata from an AI model's README and documentation. You will be provided with:
 1. Model information (ID, author, description)
 2. README content
 You need to extract:
 1. **Tags**: An array of relevant tags that describe the model. Use common tags from the gallery such as:
   - llm, gguf, gpu, cpu, multimodal, image-to-text, text-to-text, text-to-speech, tts
   - thinking, reasoning, chat, instruction-tuned, code, vision
   - Model family names (e.g., llama, qwen, mistral, gemma) if applicable
   - Any other relevant descriptive tags
   Select 3-8 most relevant tags.
 2. **License**: The license identifier (e.g., "apache-2.0", "mit", "llama2", "gpl-3.0", "bsd", "cc-by-4.0").
   If no license is found, return an empty string.
 Return the extracted metadata in a structured format.`)
 	// Add model information
 	modelInfo := "Model Information:\n"
 	modelInfo += fmt.Sprintf("  ID: %s\n", model.ModelID)
 	modelInfo += fmt.Sprintf("  Author: %s\n", model.Author)
 	modelInfo += fmt.Sprintf("  Downloads: %d\n", model.Downloads)
 	if model.ReadmeContent != "" {
 		modelInfo += fmt.Sprintf("  README Content:\n%s\n", model.ReadmeContent)
 	} else if model.ReadmeContentPreview != "" {
 		modelInfo += fmt.Sprintf("  README Preview: %s\n", model.ReadmeContentPreview)
 	}
 	fragment = fragment.AddMessage("user", modelInfo)
 	fragment = fragment.AddMessage("user", "Extract the tags and license from the model information. Return the metadata as a JSON object with 'tags' (array of strings) and 'license' (string).")
 	// Get a response
 	newFragment, err := llm.Ask(ctx, fragment)
 	if err != nil {
 		return nil, "", err
 	}
 	// Extract structured metadata
 	metadata := ModelMetadata{}
 	s := structures.Structure{
 		Schema: jsonschema.Definition{
 			Type:                 jsonschema.Object,
 			AdditionalProperties: false,
 			Properties: map[string]jsonschema.Definition{
 				"tags": {
 					Type:        jsonschema.Array,
 					Items:       &jsonschema.Definition{Type: jsonschema.String},
 					Description: "Array of relevant tags describing the model",
 				},
 				"license": {
 					Type:        jsonschema.String,
 					Description: "License identifier (e.g., apache-2.0, mit, llama2). Empty string if not found.",
 				},
 			},
 			Required: []string{"tags", "license"},
 		},
 		Object: &metadata,
 	}
 	err = newFragment.ExtractStructure(ctx, llm, s)
 	if err != nil {
 		return nil, "", err
 	}
 	return metadata.Tags, metadata.License, nil
 }
 // extractIconFromReadme scans the README content for image URLs and returns the first suitable icon URL found
 func extractIconFromReadme(readmeContent string) string {
 	if readmeContent == "" {
 		return ""
 	}
 	// Regular expressions to match image URLs in various formats (case-insensitive)
 	// Match markdown image syntax: ![alt](url) - case insensitive extensions
 	markdownImageRegex := regexp.MustCompile(`(?i)!\[[^\]]*\]\(([^)]+\.(png|jpg|jpeg|svg|webp|gif))\)`)
 	// Match HTML img tags: <img src="url">
 	htmlImageRegex := regexp.MustCompile(`(?i)<img[^>]+src=["']([^"']+\.(png|jpg|jpeg|svg|webp|gif))["']`)
 	// Match plain URLs ending with image extensions
 	plainImageRegex := regexp.MustCompile(`(?i)https?://[^\s<>"']+\.(png|jpg|jpeg|svg|webp|gif)`)
 	// Try markdown format first
 	matches := markdownImageRegex.FindStringSubmatch(readmeContent)
 	if len(matches) > 1 && matches[1] != "" {
 		url := strings.TrimSpace(matches[1])
 		// Only accept absolute http(s) URLs; relative paths are skipped
 		if strings.HasPrefix(strings.ToLower(url), "http") {
 			return url
 		}
 	}
 	// Try HTML img tags
 	matches = htmlImageRegex.FindStringSubmatch(readmeContent)
 	if len(matches) > 1 && matches[1] != "" {
 		url := strings.TrimSpace(matches[1])
 		if strings.HasPrefix(strings.ToLower(url), "http") {
 			return url
 		}
 	}
 	// Try plain URLs
 	matches = plainImageRegex.FindStringSubmatch(readmeContent)
 	if len(matches) > 0 {
 		url := strings.TrimSpace(matches[0])
 		if strings.HasPrefix(strings.ToLower(url), "http") {
 			return url
 		}
 	}
 	return ""
 }
 // getHuggingFaceAvatarURL attempts to get the HuggingFace avatar URL for a user
 func getHuggingFaceAvatarURL(author string) string {
 	if author == "" {
 		return ""
 	}
 	// Try to fetch user info from HuggingFace API
 	// HuggingFace API endpoint: https://huggingface.co/api/users/{username}
 	baseURL := "https://huggingface.co"
 	userURL := fmt.Sprintf("%s/api/users/%s", baseURL, author)
 	req, err := http.NewRequest("GET", userURL, nil)
 	if err != nil {
 		return ""
 	}
 	client := &http.Client{}
 	resp, err := client.Do(req)
 	if err != nil {
 		return ""
 	}
 	defer resp.Body.Close()
 	if resp.StatusCode != http.StatusOK {
 		return ""
 	}
 	// Parse the response to get avatar URL
 	var userInfo map[string]interface{}
 	body, err := io.ReadAll(resp.Body)
 	if err != nil {
 		return ""
 	}
 	if err := json.Unmarshal(body, &userInfo); err != nil {
 		return ""
 	}
 	// Try to extract avatar URL from response
 	if avatar, ok := userInfo["avatarUrl"].(string); ok && avatar != "" {
 		return avatar
 	}
 	if avatar, ok := userInfo["avatar"].(string); ok && avatar != "" {
 		return avatar
 	}
 	return ""
 }
 // extractModelIcon extracts icon URL from README or falls back to HuggingFace avatar
 func extractModelIcon(model ProcessedModel) string {
 	// First, try to extract icon from README
 	if icon := extractIconFromReadme(model.ReadmeContent); icon != "" {
 		return icon
 	}
 	// Fallback: Try to get HuggingFace user avatar
 	if model.Author != "" {
 		if avatar := getHuggingFaceAvatarURL(model.Author); avatar != "" {
 			return avatar
 		}
 	}
 	return ""
 }
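Reviewer note: the lookup order in `extractIconFromReadme` (markdown image, then HTML `<img>`, then bare URL, each accepted only if it is an absolute http(s) URL) can be sketched as below. This is a minimal, standalone illustration that reuses the three patterns from the diff; `firstAbsoluteIcon` and `iconPatterns` are hypothetical names, not part of the gallery agent.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// iconPatterns lists the three patterns from the diff in priority order.
// group is the submatch index holding the URL (0 means the whole match).
// The variable name is hypothetical and only used in this sketch.
var iconPatterns = []struct {
	re    *regexp.Regexp
	group int
}{
	{regexp.MustCompile(`(?i)!\[[^\]]*\]\(([^)]+\.(png|jpg|jpeg|svg|webp|gif))\)`), 1},
	{regexp.MustCompile(`(?i)<img[^>]+src=["']([^"']+\.(png|jpg|jpeg|svg|webp|gif))["']`), 1},
	{regexp.MustCompile(`(?i)https?://[^\s<>"']+\.(png|jpg|jpeg|svg|webp|gif)`), 0},
}

// firstAbsoluteIcon (hypothetical helper) returns the first match, in
// priority order, that is an absolute http(s) URL. Like the original, it
// only inspects the first match of each pattern.
func firstAbsoluteIcon(readme string) string {
	for _, p := range iconPatterns {
		m := p.re.FindStringSubmatch(readme)
		if len(m) <= p.group {
			continue // no match for this pattern
		}
		u := strings.TrimSpace(m[p.group])
		if strings.HasPrefix(strings.ToLower(u), "http") {
			return u
		}
	}
	return ""
}

func main() {
	readme := "![logo](./logo.png)\n<img src=\"https://example.com/a.png\">"
	fmt.Println(firstAbsoluteIcon(readme))
}
```

Note that a relative markdown image does not stop the search: the function falls through to the HTML and plain-URL patterns, which is also how the code in the diff behaves.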
--- a/.github/gallery-agent/gallery.go
+++ b/.github/gallery-agent/gallery.go
@@ -7,8 +7,8 @@ import (
 	"os"
 	"strings"
 	"github.com/ghodss/yaml"
 	"github.com/mudler/LocalAI/core/gallery/importers"
 	"sigs.k8s.io/yaml"
 )
 func formatTextContent(text string) string {
@@ -79,20 +79,7 @@ func generateYAMLEntry(model ProcessedModel, quantization string) string {
 	description = cleanTextContent(description)
 	formattedDescription := formatTextContent(description)
-	// Strip name and description from config file since they are
+	configFile := formatTextContent(modelConfig.ConfigFile)
 	// already present at the gallery entry level and should not
 	// appear under overrides.
 	configFileContent := modelConfig.ConfigFile
 	var cfgMap map[string]any
 	if err := yaml.Unmarshal([]byte(configFileContent), &cfgMap); err == nil {
 		delete(cfgMap, "name")
 		delete(cfgMap, "description")
 		if cleaned, err := yaml.Marshal(cfgMap); err == nil {
 			configFileContent = string(cleaned)
 		}
 	}
 	configFile := formatTextContent(configFileContent)
 	filesYAML, _ := yaml.Marshal(modelConfig.Files)
--- a/.github/gallery-agent/helpers.go
+++ b/.github/gallery-agent/helpers.go
@@ -1,301 +0,0 @@
 package main
 import (
 	"encoding/json"
 	"fmt"
 	"io"
 	"net/http"
 	"os"
 	"regexp"
 	"strings"
 	hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
 	"sigs.k8s.io/yaml"
 )
 var galleryIndexPath = os.Getenv("GALLERY_INDEX_PATH")
 // getGalleryIndexPath returns the gallery index file path, with a default fallback
 func getGalleryIndexPath() string {
 	if galleryIndexPath != "" {
 		return galleryIndexPath
 	}
 	return "gallery/index.yaml"
 }
 type galleryModel struct {
 	Name string   `yaml:"name"`
 	Urls []string `yaml:"urls"`
 }
 // loadGalleryURLSet parses the gallery index (gallery/index.yaml by default)
 // once and returns the set of HuggingFace model URLs already present in the
 // gallery, plus any URLs listed in EXTRA_SKIP_URLS.
 func loadGalleryURLSet() (map[string]struct{}, error) {
 	indexPath := getGalleryIndexPath()
 	content, err := os.ReadFile(indexPath)
 	if err != nil {
 		return nil, fmt.Errorf("failed to read %s: %w", indexPath, err)
 	}
 	var galleryModels []galleryModel
 	if err := yaml.Unmarshal(content, &galleryModels); err != nil {
 		return nil, fmt.Errorf("failed to unmarshal %s: %w", indexPath, err)
 	}
 	set := make(map[string]struct{}, len(galleryModels))
 	for _, gm := range galleryModels {
 		for _, u := range gm.Urls {
 			set[u] = struct{}{}
 		}
 	}
 	// Also skip URLs already proposed in open (unmerged) gallery-agent PRs.
 	// The workflow injects these via EXTRA_SKIP_URLS so we don't keep
 	// re-proposing the same model every run while a PR is waiting to merge.
 	for _, line := range strings.FieldsFunc(os.Getenv("EXTRA_SKIP_URLS"), func(r rune) bool {
 		return r == '\n' || r == ',' || r == ' '
 	}) {
 		u := strings.TrimSpace(line)
 		if u != "" {
 			set[u] = struct{}{}
 		}
 	}
 	return set, nil
 }
 // modelAlreadyInGallery checks whether a HuggingFace model repo is already
 // referenced in the gallery URL set.
 func modelAlreadyInGallery(set map[string]struct{}, modelID string) bool {
 	_, ok := set["https://huggingface.co/"+modelID]
 	return ok
 }
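Reviewer note: the dedup logic above combines two sources into one set: URLs already in the gallery index and URLs passed through `EXTRA_SKIP_URLS`, which may be separated by newlines, commas, or spaces. A minimal sketch of that behaviour, assuming the index has already been parsed into a slice of URLs; `buildSkipSet` and `alreadyKnown` are hypothetical names used only here.

```go
package main

import (
	"fmt"
	"strings"
)

// buildSkipSet (hypothetical) merges gallery URLs with the extra skip list.
// extra mirrors the EXTRA_SKIP_URLS format: entries separated by newline,
// comma, or space.
func buildSkipSet(galleryURLs []string, extra string) map[string]struct{} {
	set := make(map[string]struct{}, len(galleryURLs))
	for _, u := range galleryURLs {
		set[u] = struct{}{}
	}
	fields := strings.FieldsFunc(extra, func(r rune) bool {
		return r == '\n' || r == ',' || r == ' '
	})
	for _, f := range fields {
		// TrimSpace also removes stray tabs or carriage returns.
		if u := strings.TrimSpace(f); u != "" {
			set[u] = struct{}{}
		}
	}
	return set
}

// alreadyKnown (hypothetical) checks a model ID against the set using the
// same URL shape as modelAlreadyInGallery.
func alreadyKnown(set map[string]struct{}, modelID string) bool {
	_, ok := set["https://huggingface.co/"+modelID]
	return ok
}

func main() {
	set := buildSkipSet(
		[]string{"https://huggingface.co/org/in-gallery"},
		"https://huggingface.co/a/b,\nhttps://huggingface.co/c/d",
	)
	fmt.Println(alreadyKnown(set, "a/b"), alreadyKnown(set, "x/y"))
}
```

The match is an exact string comparison, so a gallery entry that references the same repository with a trailing slash or a file path would not be detected by this check.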
 // baseModelFromTags returns the first `base_model:<repo>` value found in the
 // tag list, or "" if none is present. HuggingFace surfaces the base model
 // declared in the model card's YAML frontmatter as such a tag.
 func baseModelFromTags(tags []string) string {
 	for _, t := range tags {
 		if strings.HasPrefix(t, "base_model:") {
 			return strings.TrimPrefix(t, "base_model:")
 		}
 	}
 	return ""
 }
 // licenseFromTags returns the `license:<id>` value from the tag list, or "".
 func licenseFromTags(tags []string) string {
 	for _, t := range tags {
 		if strings.HasPrefix(t, "license:") {
 			return strings.TrimPrefix(t, "license:")
 		}
 	}
 	return ""
 }
 // curatedTags produces the gallery tag list from HuggingFace's raw tag set.
 // Always includes llm + gguf, then adds whitelisted family / capability
 // markers when they appear in the HF tag list.
 func curatedTags(hfTags []string) []string {
 	whitelist := []string{
 		"gpu", "cpu",
 		"llama", "mistral", "mixtral", "qwen", "qwen2", "qwen3",
 		"gemma", "gemma2", "gemma3", "phi", "phi3", "phi4",
 		"deepseek", "yi", "falcon", "command-r",
 		"vision", "multimodal", "code", "chat",
 		"instruction-tuned", "reasoning", "thinking",
 	}
 	seen := map[string]struct{}{}
 	out := []string{"llm", "gguf"}
 	seen["llm"] = struct{}{}
 	seen["gguf"] = struct{}{}
 	hfSet := map[string]struct{}{}
 	for _, t := range hfTags {
 		hfSet[strings.ToLower(t)] = struct{}{}
 	}
 	for _, w := range whitelist {
 		if _, ok := hfSet[w]; ok {
 			if _, dup := seen[w]; !dup {
 				out = append(out, w)
 				seen[w] = struct{}{}
 			}
 		}
 	}
 	return out
 }
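Reviewer note: `curatedTags` always emits `llm` and `gguf` first, then appends whitelisted tags in whitelist order (not in the order HuggingFace returned them), matching case-insensitively. The sketch below illustrates that with a deliberately shortened whitelist; `curate` and `sampleWhitelist` are hypothetical names and the whitelist is not the full list from the diff.

```go
package main

import (
	"fmt"
	"strings"
)

// sampleWhitelist is a shortened, illustrative whitelist (assumption:
// the real list in curatedTags is longer).
var sampleWhitelist = []string{"gpu", "cpu", "llama", "qwen3", "chat", "reasoning"}

// curate (hypothetical) mirrors the curatedTags approach: fixed base tags,
// then whitelist entries that appear in the HuggingFace tag list.
func curate(hfTags []string) []string {
	out := []string{"llm", "gguf"}
	seen := map[string]struct{}{"llm": {}, "gguf": {}}

	// Lower-case the HF tags once so lookups are case-insensitive.
	hfSet := make(map[string]struct{}, len(hfTags))
	for _, t := range hfTags {
		hfSet[strings.ToLower(t)] = struct{}{}
	}

	// Iterate the whitelist, so output order is stable across runs.
	for _, w := range sampleWhitelist {
		if _, ok := hfSet[w]; !ok {
			continue
		}
		if _, dup := seen[w]; dup {
			continue
		}
		out = append(out, w)
		seen[w] = struct{}{}
	}
	return out
}

func main() {
	fmt.Println(curate([]string{"chat", "Qwen3", "GGUF", "base_model:foo"}))
}
```

Iterating the whitelist rather than the input keeps the output deterministic, which makes generated gallery entries easier to diff between runs.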
 // resolveReadme fetches a description-quality README for a (possibly
 // quantized) repo: if a `base_model:` tag is present, fetch the base repo's
 // README; otherwise fall back to the repo's own README.
 func resolveReadme(client *hfapi.Client, modelID string, hfTags []string) (string, error) {
 	if base := baseModelFromTags(hfTags); base != "" && base != modelID {
 		if content, err := client.GetReadmeContent(base, "README.md"); err == nil && strings.TrimSpace(content) != "" {
 			return cleanTextContent(content), nil
 		}
 	}
 	content, err := client.GetReadmeContent(modelID, "README.md")
 	if err != nil {
 		return "", err
 	}
 	return cleanTextContent(content), nil
 }
 // extractDescription turns a raw HuggingFace README into a concise plain-text
 // description suitable for embedding in gallery/index.yaml: strips YAML
 // frontmatter, HTML tags/comments, markdown images, link URLs (keeping the
 // link text), markdown tables, and then truncates at a paragraph boundary
 // near 1200 bytes. Keep using the raw README for icon extraction and call
 // this only for the `description:` field.
 func extractDescription(readme string) string {
 	s := readme
 	// Strip leading YAML frontmatter: `---\n...\n---\n` at start of file.
 	if strings.HasPrefix(strings.TrimLeft(s, " \t\n"), "---") {
 		trimmed := strings.TrimLeft(s, " \t\n")
 		rest := strings.TrimPrefix(trimmed, "---")
 		if idx := strings.Index(rest, "\n---"); idx >= 0 {
 			after := rest[idx+len("\n---"):]
 			after = strings.TrimPrefix(after, "\n")
 			s = after
 		}
 	}
 	// Strip HTML comments and tags.
 	s = regexp.MustCompile(`(?s)<!--.*?-->`).ReplaceAllString(s, "")
 	s = regexp.MustCompile(`(?is)<[^>]+>`).ReplaceAllString(s, "")
 	// Strip markdown images entirely.
 	s = regexp.MustCompile(`!\[[^\]]*\]\([^)]*\)`).ReplaceAllString(s, "")
 	// Replace markdown links `[text](url)` with just `text`.
 	s = regexp.MustCompile(`\[([^\]]+)\]\([^)]+\)`).ReplaceAllString(s, "$1")
 	// Drop table lines and horizontal rules, and flatten all leading
 	// whitespace: generateYAMLEntry embeds this under a `description: |`
 	// literal block whose indentation is set by the first non-empty line.
 	// If any line has extra leading whitespace (e.g. from an indented
 	// `<p align="center">` block in the original README), YAML will pick
 	// that up as the block's indent and every later line at a smaller
 	// indent breaks the block scalar. Stripping leading whitespace here
 	// guarantees uniform 4-space indentation after formatTextContent runs.
 	var kept []string
 	for _, line := range strings.Split(s, "\n") {
 		t := strings.TrimLeft(line, " \t")
 		ts := strings.TrimSpace(t)
 		if strings.HasPrefix(ts, "|") {
 			continue
 		}
 		if strings.HasPrefix(ts, ":--") || strings.HasPrefix(ts, "---") || strings.HasPrefix(ts, "===") {
 			continue
 		}
 		kept = append(kept, t)
 	}
 	s = strings.Join(kept, "\n")
 	// Normalise whitespace and drop any leading blank lines so the literal
 	// block in YAML doesn't start with a blank first line (which would
 	// break the indentation detector the same way).
 	s = cleanTextContent(s)
 	s = strings.TrimLeft(s, " \t\n")
 	// Truncate at a paragraph boundary around maxLen chars.
 	const maxLen = 1200
 	if len(s) > maxLen {
 		cut := strings.LastIndex(s[:maxLen], "\n\n")
 		if cut < maxLen/3 {
 			cut = maxLen
 		}
 		s = strings.TrimRight(s[:cut], " \t\n") + "\n\n..."
 	}
 	return s
 }
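Reviewer note: the final truncation step in `extractDescription` cuts at the last blank line before the limit, unless that would keep less than a third of the budget, in which case it cuts hard at the limit. A standalone sketch of just that step; `truncateAtParagraph` is a hypothetical name, and the limit is a parameter here (the diff uses a constant of 1200).

```go
package main

import (
	"fmt"
	"strings"
)

// truncateAtParagraph (hypothetical) reproduces the truncation rule from
// extractDescription with a configurable limit. Lengths are in bytes.
func truncateAtParagraph(s string, maxLen int) string {
	if len(s) <= maxLen {
		return s
	}
	// Prefer the last paragraph break inside the budget.
	cut := strings.LastIndex(s[:maxLen], "\n\n")
	// If there is no break, or it comes too early, cut at the limit.
	if cut < maxLen/3 {
		cut = maxLen
	}
	return strings.TrimRight(s[:cut], " \t\n") + "\n\n..."
}

func main() {
	text := "aaaa\n\nbbbbbbbbbbbb"
	fmt.Printf("%q\n", truncateAtParagraph(text, 12))
}
```

Because the slice is byte-based, a hard cut at the limit can land inside a multi-byte UTF-8 character; READMEs with non-ASCII text may need a rune-aware cut if that matters for the gallery output.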
 // cleanTextContent removes trailing spaces/tabs and collapses multiple empty
 // lines so README content embeds cleanly into YAML without lint noise.
 func cleanTextContent(text string) string {
 	lines := strings.Split(text, "\n")
 	var cleaned []string
 	var prevEmpty bool
 	for _, line := range lines {
 		trimmed := strings.TrimRight(line, " \t\r")
 		if trimmed == "" {
 			if !prevEmpty {
 				cleaned = append(cleaned, "")
 			}
 			prevEmpty = true
 		} else {
 			cleaned = append(cleaned, trimmed)
 			prevEmpty = false
 		}
 	}
 	return strings.TrimRight(strings.Join(cleaned, "\n"), "\n")
 }
 // extractIconFromReadme scans README content for an image URL usable as a
 // gallery entry icon.
 func extractIconFromReadme(readmeContent string) string {
 	if readmeContent == "" {
 		return ""
 	}
 	markdownImageRegex := regexp.MustCompile(`(?i)!\[[^\]]*\]\(([^)]+\.(png|jpg|jpeg|svg|webp|gif))\)`)
 	htmlImageRegex := regexp.MustCompile(`(?i)<img[^>]+src=["']([^"']+\.(png|jpg|jpeg|svg|webp|gif))["']`)
 	plainImageRegex := regexp.MustCompile(`(?i)https?://[^\s<>"']+\.(png|jpg|jpeg|svg|webp|gif)`)
 	if m := markdownImageRegex.FindStringSubmatch(readmeContent); len(m) > 1 && strings.HasPrefix(strings.ToLower(m[1]), "http") {
 		return strings.TrimSpace(m[1])
 	}
 	if m := htmlImageRegex.FindStringSubmatch(readmeContent); len(m) > 1 && strings.HasPrefix(strings.ToLower(m[1]), "http") {
 		return strings.TrimSpace(m[1])
 	}
 	if m := plainImageRegex.FindStringSubmatch(readmeContent); len(m) > 0 && strings.HasPrefix(strings.ToLower(m[0]), "http") {
 		return strings.TrimSpace(m[0])
 	}
 	return ""
 }
 // getHuggingFaceAvatarURL returns the HF avatar URL for a user, or "".
 func getHuggingFaceAvatarURL(author string) string {
 	if author == "" {
 		return ""
 	}
 	userURL := fmt.Sprintf("https://huggingface.co/api/users/%s/overview", author)
 	resp, err := http.Get(userURL)
 	if err != nil {
 		return ""
 	}
 	defer resp.Body.Close()
 	if resp.StatusCode != http.StatusOK {
 		return ""
 	}
 	body, err := io.ReadAll(resp.Body)
 	if err != nil {
 		return ""
 	}
 	var info map[string]any
 	if err := json.Unmarshal(body, &info); err != nil {
 		return ""
 	}
 	if v, ok := info["avatarUrl"].(string); ok && v != "" {
 		return v
 	}
 	if v, ok := info["avatar"].(string); ok && v != "" {
 		return v
 	}
 	return ""
 }
 // extractModelIcon extracts an icon URL from the README, falling back to the
 // HuggingFace user avatar.
 func extractModelIcon(model ProcessedModel) string {
 	if icon := extractIconFromReadme(model.ReadmeContent); icon != "" {
 		return icon
 	}
 	if model.Author != "" {
 		if avatar := getHuggingFaceAvatarURL(model.Author); avatar != "" {
 			return avatar
 		}
 	}
 	return ""
 }
--- a/.github/gallery-agent/main.go
+++ b/.github/gallery-agent/main.go
@@ -6,6 +6,7 @@ import (
 	"fmt"
 	"os"
 	"strconv"
 	"strings"
 	"time"
 	hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
@@ -38,6 +39,16 @@ type ProcessedModel struct {
 	Icon                    string               `json:"icon,omitempty"`
 }
 // SearchResult represents the complete result of searching and processing models
 type SearchResult struct {
 	SearchTerm       string           `json:"search_term"`
 	Limit            int              `json:"limit"`
 	Quantization     string           `json:"quantization"`
 	TotalModelsFound int              `json:"total_models_found"`
 	Models           []ProcessedModel `json:"models"`
 	FormattedOutput  string           `json:"formatted_output"`
 }
 // AddedModelSummary represents a summary of models added to the gallery
 type AddedModelSummary struct {
 	SearchTerm     string   `json:"search_term"`
@@ -52,16 +63,19 @@ type AddedModelSummary struct {
 func main() {
 	startTime := time.Now()
-	// Synthetic mode for local testing
+	// Check for synthetic mode
-	if sm := os.Getenv("SYNTHETIC_MODE"); sm == "true" || sm == "1" {
+	syntheticMode := os.Getenv("SYNTHETIC_MODE")
 	if syntheticMode == "true" || syntheticMode == "1" {
 		fmt.Println("Running in SYNTHETIC MODE - generating random test data")
-		if err := runSyntheticMode(); err != nil {
+		err := runSyntheticMode()
 		if err != nil {
 			fmt.Fprintf(os.Stderr, "Error in synthetic mode: %v\n", err)
 			os.Exit(1)
 		}
 		return
 	}
 	// Get configuration from environment variables
 	searchTerm := os.Getenv("SEARCH_TERM")
 	if searchTerm == "" {
 		searchTerm = "GGUF"
@@ -69,7 +83,7 @@ func main() {
 	limitStr := os.Getenv("LIMIT")
 	if limitStr == "" {
-		limitStr = "15"
+		limitStr = "5"
 	}
 	limit, err := strconv.Atoi(limitStr)
 	if err != nil {
@@ -78,197 +92,287 @@ func main() {
 	}
 	quantization := os.Getenv("QUANTIZATION")
 	if quantization == "" {
 		quantization = "Q4_K_M"
 	}
-	maxModelsStr := os.Getenv("MAX_MODELS")
+	maxModels := os.Getenv("MAX_MODELS")
-	if maxModelsStr == "" {
+	if maxModels == "" {
-		maxModelsStr = "1"
+		maxModels = "1"
 	}
-	maxModels, err := strconv.Atoi(maxModelsStr)
+	maxModelsInt, err := strconv.Atoi(maxModels)
 	if err != nil {
 		fmt.Fprintf(os.Stderr, "Error parsing MAX_MODELS: %v\n", err)
 		os.Exit(1)
 	}
 	// Print configuration
 	fmt.Printf("Gallery Agent Configuration:\n")
 	fmt.Printf("  Search Term: %s\n", searchTerm)
 	fmt.Printf("  Limit: %d\n", limit)
 	fmt.Printf("  Quantization: %s\n", quantization)
-	fmt.Printf("  Max Models to Add: %d\n", maxModels)
+	fmt.Printf("  Max Models to Add: %d\n", maxModelsInt)
-	fmt.Printf("  Gallery Index Path: %s\n", getGalleryIndexPath())
+	fmt.Printf("  Gallery Index Path: %s\n", os.Getenv("GALLERY_INDEX_PATH"))
 	fmt.Println()
-	// Phase 1: load current gallery and query HuggingFace.
+	result, err := searchAndProcessModels(searchTerm, limit, quantization)
 	gallerySet, err := loadGalleryURLSet()
 	if err != nil {
-		fmt.Fprintf(os.Stderr, "Error loading gallery index: %v\n", err)
+		fmt.Fprintf(os.Stderr, "Error: %v\n", err)
 		os.Exit(1)
 	}
 	fmt.Printf("Loaded %d existing gallery entries\n", len(gallerySet))
-	client := hfapi.NewClient()
+	fmt.Println(result.FormattedOutput)
 	var models []ProcessedModel
-	fmt.Println("Searching for trending models on HuggingFace...")
+	if len(result.Models) > 1 {
-	rawModels, err := client.GetTrending(searchTerm, limit)
+		fmt.Println("More than one model found (", len(result.Models), "), using AI agent to select the most interesting models")
-	if err != nil {
+		for _, model := range result.Models {
-		fmt.Fprintf(os.Stderr, "Error fetching models: %v\n", err)
+			fmt.Println("Model: ", model.ModelID)
 		os.Exit(1)
 	}
 	fmt.Printf("Found %d trending models matching %q\n", len(rawModels), searchTerm)
 	totalFound := len(rawModels)
 	// Phase 2: drop anything already in the gallery *before* any expensive
 	// per-model work (GetModelDetails, README fetches, icon lookups).
 	fresh := rawModels[:0]
 	for _, m := range rawModels {
 		if modelAlreadyInGallery(gallerySet, m.ModelID) {
 			fmt.Printf("Skipping existing model: %s\n", m.ModelID)
 			continue
 		}
-		fresh = append(fresh, m)
+		// Use AI agent to select the most interesting models
 		fmt.Println("Using AI agent to select the most interesting models...")
 		models, err = selectMostInterestingModels(context.Background(), result)
 		if err != nil {
 			fmt.Fprintf(os.Stderr, "Error in model selection: %v\n", err)
 			// Continue with original result if selection fails
 			models = result.Models
 		}
 	} else if len(result.Models) == 1 {
 		models = result.Models
 		fmt.Println("Only one model found, using it directly")
 	}
 	fmt.Printf("%d candidates after gallery dedup\n", len(fresh))
-	// Phase 3: HuggingFace already returned these in trendingScore order —
+	fmt.Print(models)
-	// just cap to MAX_MODELS.
+
-	if len(fresh) > maxModels {
+	// Filter out models that already exist in the gallery
-		fresh = fresh[:maxModels]
+	fmt.Println("Filtering out existing models...")
 	models, err = filterExistingModels(models)
 	if err != nil {
 		fmt.Fprintf(os.Stderr, "Error filtering existing models: %v\n", err)
 		os.Exit(1)
 	}
-	if len(fresh) == 0 {
+
 	// Limit to maxModelsInt after filtering
 	if len(models) > maxModelsInt {
 		models = models[:maxModelsInt]
 	}
 	// Track added models for summary
 	var addedModelIDs []string
 	var addedModelURLs []string
 	// Generate YAML entries and append to gallery/index.yaml
 	if len(models) > 0 {
 		for _, model := range models {
 			addedModelIDs = append(addedModelIDs, model.ModelID)
 			// Generate Hugging Face URL for the model
 			modelURL := fmt.Sprintf("https://huggingface.co/%s", model.ModelID)
 			addedModelURLs = append(addedModelURLs, modelURL)
 		}
 		fmt.Println("Generating YAML entries for selected models...")
 		err = generateYAMLForModels(context.Background(), models, quantization)
 		if err != nil {
 			fmt.Fprintf(os.Stderr, "Error generating YAML entries: %v\n", err)
 			os.Exit(1)
 		}
 	} else {
 		fmt.Println("No new models to add to the gallery.")
 		writeSummary(AddedModelSummary{
 			SearchTerm:     searchTerm,
 			TotalFound:     totalFound,
 			ModelsAdded:    0,
 			Quantization:   quantization,
 			ProcessingTime: time.Since(startTime).String(),
 		})
 		return
 	}
-	// Phase 4: fetch details and build ProcessedModel entries for survivors.
+	// Create and write summary
-	var processed []ProcessedModel
+	processingTime := time.Since(startTime).String()
-	quantPrefs := []string{quantization, "Q4_K_M", "Q4_K_S", "Q3_K_M", "Q2_K", "Q8_0"}
+	summary := AddedModelSummary{
 	for _, m := range fresh {
 		fmt.Printf("Processing model: %s (downloads=%d)\n", m.ModelID, m.Downloads)
 		pm := ProcessedModel{
 			ModelID:                 m.ModelID,
 			Author:                  m.Author,
 			Downloads:               m.Downloads,
 			LastModified:            m.LastModified,
 			QuantizationPreferences: quantPrefs,
 		}
 		details, err := client.GetModelDetails(m.ModelID)
 		if err != nil {
 			fmt.Printf("  Error getting model details: %v (skipping)\n", err)
 			continue
 		}
 		preferred := hfapi.FindPreferredModelFile(details.Files, quantPrefs)
 		if preferred == nil {
 			fmt.Printf("  No GGUF file matching %v — skipping\n", quantPrefs)
 			continue
 		}
 		pm.Files = make([]ProcessedModelFile, len(details.Files))
 		for j, f := range details.Files {
 			fileType := "other"
 			if f.IsReadme {
 				fileType = "readme"
 			} else if f.Path == preferred.Path {
 				fileType = "model"
 			}
 			pm.Files[j] = ProcessedModelFile{
 				Path:     f.Path,
 				Size:     f.Size,
 				SHA256:   f.SHA256,
 				IsReadme: f.IsReadme,
 				FileType: fileType,
 			}
 			if f.Path == preferred.Path {
 				copyFile := pm.Files[j]
 				pm.PreferredModelFile = &copyFile
 			}
 			if f.IsReadme {
 				copyFile := pm.Files[j]
 				pm.ReadmeFile = &copyFile
 			}
 		}
 		// Deterministic README resolution: follow base_model tag if set.
 		// Keep the raw (HTML-bearing) README around while we extract the
 		// icon, then strip it down to a plain-text description for the
 		// `description:` YAML field.
 		readme, err := resolveReadme(client, m.ModelID, m.Tags)
 		if err != nil {
 			fmt.Printf("  Warning: failed to fetch README: %v\n", err)
 		}
 		pm.ReadmeContent = readme
 		pm.License = licenseFromTags(m.Tags)
 		pm.Tags = curatedTags(m.Tags)
 		pm.Icon = extractModelIcon(pm)
 		if pm.ReadmeContent != "" {
 			pm.ReadmeContent = extractDescription(pm.ReadmeContent)
 			pm.ReadmeContentPreview = truncateString(pm.ReadmeContent, 200)
 		}
 		fmt.Printf("  License: %s, Tags: %v, Icon: %s\n", pm.License, pm.Tags, pm.Icon)
 		processed = append(processed, pm)
 	}
 	if len(processed) == 0 {
 		fmt.Println("No processable models after detail fetch.")
 		writeSummary(AddedModelSummary{
 			SearchTerm:     searchTerm,
 			TotalFound:     totalFound,
 			ModelsAdded:    0,
 			Quantization:   quantization,
 			ProcessingTime: time.Since(startTime).String(),
 		})
 		return
 	}
 	// Phase 5: write YAML entries.
 	var addedIDs, addedURLs []string
 	for _, pm := range processed {
 		addedIDs = append(addedIDs, pm.ModelID)
 		addedURLs = append(addedURLs, "https://huggingface.co/"+pm.ModelID)
 	}
 	fmt.Println("Generating YAML entries for selected models...")
 	if err := generateYAMLForModels(context.Background(), processed, quantization); err != nil {
 		fmt.Fprintf(os.Stderr, "Error generating YAML entries: %v\n", err)
 		os.Exit(1)
 	}
 	writeSummary(AddedModelSummary{
 		SearchTerm:     searchTerm,
-		TotalFound:     totalFound,
+		TotalFound:     result.TotalModelsFound,
-		ModelsAdded:    len(addedIDs),
+		ModelsAdded:    len(addedModelIDs),
-		AddedModelIDs:  addedIDs,
+		AddedModelIDs:  addedModelIDs,
-		AddedModelURLs: addedURLs,
+		AddedModelURLs: addedModelURLs,
 		Quantization:   quantization,
-		ProcessingTime: time.Since(startTime).String(),
+		ProcessingTime: processingTime,
-	})
+	}
 }
-func writeSummary(summary AddedModelSummary) {
+	// Write summary to file
-	data, err := json.MarshalIndent(summary, "", "  ")
+	summaryData, err := json.MarshalIndent(summary, "", "  ")
 	if err != nil {
 		fmt.Fprintf(os.Stderr, "Error marshaling summary: %v\n", err)
-		return
+	} else {
 		err = os.WriteFile("gallery-agent-summary.json", summaryData, 0644)
 		if err != nil {
 			fmt.Fprintf(os.Stderr, "Error writing summary file: %v\n", err)
 		} else {
 			fmt.Printf("Summary written to gallery-agent-summary.json\n")
 		}
 	}
-	if err := os.WriteFile("gallery-agent-summary.json", data, 0644); err != nil {
+}
-		fmt.Fprintf(os.Stderr, "Error writing summary file: %v\n", err)
+
-		return
+func searchAndProcessModels(searchTerm string, limit int, quantization string) (*SearchResult, error) {
 	client := hfapi.NewClient()
 	var outputBuilder strings.Builder
 	fmt.Println("Searching for models...")
 	// Initialize the result struct
 	result := &SearchResult{
 		SearchTerm:   searchTerm,
 		Limit:        limit,
 		Quantization: quantization,
 		Models:       []ProcessedModel{},
 	}
-	fmt.Println("Summary written to gallery-agent-summary.json")
+
 	models, err := client.GetLatest(searchTerm, limit)
 	if err != nil {
 		return nil, fmt.Errorf("failed to fetch models: %w", err)
 	}
 	fmt.Println("Models found:", len(models))
 	result.TotalModelsFound = len(models)
 	if len(models) == 0 {
 		outputBuilder.WriteString("No models found.\n")
 		result.FormattedOutput = outputBuilder.String()
 		return result, nil
 	}
 	outputBuilder.WriteString(fmt.Sprintf("Found %d models matching '%s':\n\n", len(models), searchTerm))
 	// Process each model
 	for i, model := range models {
 		outputBuilder.WriteString(fmt.Sprintf("%d. Processing Model: %s\n", i+1, model.ModelID))
 		outputBuilder.WriteString(fmt.Sprintf("   Author: %s\n", model.Author))
 		outputBuilder.WriteString(fmt.Sprintf("   Downloads: %d\n", model.Downloads))
 		outputBuilder.WriteString(fmt.Sprintf("   Last Modified: %s\n", model.LastModified))
 		// Initialize processed model struct
 		processedModel := ProcessedModel{
 			ModelID:                 model.ModelID,
 			Author:                  model.Author,
 			Downloads:               model.Downloads,
 			LastModified:            model.LastModified,
 			QuantizationPreferences: []string{quantization, "Q4_K_M", "Q4_K_S", "Q3_K_M", "Q2_K"},
 		}
 		// Get detailed model information
 		details, err := client.GetModelDetails(model.ModelID)
 		if err != nil {
 			errorMsg := fmt.Sprintf("   Error getting model details: %v\n", err)
 			outputBuilder.WriteString(errorMsg)
 			processedModel.ProcessingError = err.Error()
 			result.Models = append(result.Models, processedModel)
 			continue
 		}
 		// Define quantization preferences (in order of preference)
 		quantizationPreferences := []string{quantization, "Q4_K_M", "Q4_K_S", "Q3_K_M", "Q2_K"}
 		// Find preferred model file
 		preferredModelFile := hfapi.FindPreferredModelFile(details.Files, quantizationPreferences)
 		// Process files
 		processedFiles := make([]ProcessedModelFile, len(details.Files))
 		for j, file := range details.Files {
 			fileType := "other"
 			if file.IsReadme {
 				fileType = "readme"
 			} else if preferredModelFile != nil && file.Path == preferredModelFile.Path {
 				fileType = "model"
 			}
 			processedFiles[j] = ProcessedModelFile{
 				Path:     file.Path,
 				Size:     file.Size,
 				SHA256:   file.SHA256,
 				IsReadme: file.IsReadme,
 				FileType: fileType,
 			}
 		}
 		processedModel.Files = processedFiles
 		// Set preferred model file
 		if preferredModelFile != nil {
 			for _, file := range processedFiles {
 				if file.Path == preferredModelFile.Path {
 					processedModel.PreferredModelFile = &file
 					break
 				}
 			}
 		}
 		// Print file information
 		outputBuilder.WriteString(fmt.Sprintf("   Files found: %d\n", len(details.Files)))
 		if preferredModelFile != nil {
 			outputBuilder.WriteString(fmt.Sprintf("   Preferred Model File: %s (SHA256: %s)\n",
 				preferredModelFile.Path,
 				preferredModelFile.SHA256))
 		} else {
 			outputBuilder.WriteString(fmt.Sprintf("   No model file found with quantization preferences: %v\n", quantizationPreferences))
 		}
 		if details.ReadmeFile != nil {
 			outputBuilder.WriteString(fmt.Sprintf("   README File: %s\n", details.ReadmeFile.Path))
 			// Find and set readme file
 			for _, file := range processedFiles {
 				if file.IsReadme {
 					processedModel.ReadmeFile = &file
 					break
 				}
 			}
 			fmt.Println("Getting real readme for", model.ModelID, "waiting...")
 			// Use agent to get the real readme and prepare the model description
 			readmeContent, err := getRealReadme(context.Background(), model.ModelID)
 			if err == nil {
 				processedModel.ReadmeContent = readmeContent
 				processedModel.ReadmeContentPreview = truncateString(readmeContent, 200)
 				outputBuilder.WriteString(fmt.Sprintf("   README Content Preview: %s\n",
 					processedModel.ReadmeContentPreview))
 			} else {
 				fmt.Printf("   Warning: Failed to get real readme: %v\n", err)
 			}
 			fmt.Println("Real readme got", readmeContent)
 			// Extract metadata (tags, license) from README using LLM
 			fmt.Println("Extracting metadata for", model.ModelID, "waiting...")
 			tags, license, err := extractModelMetadata(context.Background(), processedModel)
 			if err == nil {
 				processedModel.Tags = tags
 				processedModel.License = license
 				outputBuilder.WriteString(fmt.Sprintf("   Tags: %v\n", tags))
 				outputBuilder.WriteString(fmt.Sprintf("   License: %s\n", license))
 			} else {
 				fmt.Printf("   Warning: Failed to extract metadata: %v\n", err)
 			}
 			// Extract icon from README or use HuggingFace avatar
 			icon := extractModelIcon(processedModel)
 			if icon != "" {
 				processedModel.Icon = icon
 				outputBuilder.WriteString(fmt.Sprintf("   Icon: %s\n", icon))
 			}
 			// Get README content
 			// readmeContent, err := client.GetReadmeContent(model.ModelID, details.ReadmeFile.Path)
 			// if err == nil {
 			// 	processedModel.ReadmeContent = readmeContent
 			// 	processedModel.ReadmeContentPreview = truncateString(readmeContent, 200)
 			// 	outputBuilder.WriteString(fmt.Sprintf("   README Content Preview: %s\n",
 			// 		processedModel.ReadmeContentPreview))
 			// }
 		}
 		// Print all files with their checksums
 		outputBuilder.WriteString("   All Files:\n")
 		for _, file := range processedFiles {
 			outputBuilder.WriteString(fmt.Sprintf("     - %s (%s, %d bytes", file.Path, file.FileType, file.Size))
 			if file.SHA256 != "" {
 				outputBuilder.WriteString(fmt.Sprintf(", SHA256: %s", file.SHA256))
 			}
 			outputBuilder.WriteString(")\n")
 		}
 		outputBuilder.WriteString("\n")
 		result.Models = append(result.Models, processedModel)
 	}
 	result.FormattedOutput = outputBuilder.String()
 	return result, nil
 }
 func truncateString(s string, maxLen int) string {
@@ -277,4 +381,3 @@ func truncateString(s string, maxLen int) string {
 	}
 	return s[:maxLen] + "..."
 }
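Note on `truncateString`: the visible return statement slices by byte (`s[:maxLen]`), so a README preview can be cut in the middle of a multi-byte UTF-8 character. Below is a minimal sketch of a rune-aware variant. `truncateRunes` is a hypothetical name, not part of this change, and it assumes the elided guard in the original returns `s` unchanged when it already fits and that `maxLen` is non-negative.

```go
package main

import "fmt"

// truncateRunes is a hypothetical rune-aware counterpart to truncateString.
// It counts runes rather than bytes, so it never splits a UTF-8 sequence.
// Assumes maxLen >= 0.
func truncateRunes(s string, maxLen int) string {
	runes := []rune(s)
	if len(runes) <= maxLen {
		return s
	}
	return string(runes[:maxLen]) + "..."
}

func main() {
	// "é" and "ö" are two bytes each; byte slicing at 2 would split "é".
	fmt.Println(truncateRunes("héllo wörld", 4)) // prints "héll..."
}
```

The trade-off is one extra allocation for the rune slice, which is unlikely to matter for a 200-character preview.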
--- a/.github/gallery-agent/testing.go
+++ b/.github/gallery-agent/testing.go
@@ -3,7 +3,7 @@ package main
 import (
 	"context"
 	"fmt"
-	"math/rand/v2"
+	"math/rand"
 	"strings"
 	"time"
 )
@@ -13,11 +13,11 @@ func runSyntheticMode() error {
 	generator := NewSyntheticDataGenerator()
 	// Generate a random number of synthetic models (1-3)
-	numModels := generator.rand.IntN(3) + 1
+	numModels := generator.rand.Intn(3) + 1
 	fmt.Printf("Generating %d synthetic models for testing...\n", numModels)
 	var models []ProcessedModel
-	for range numModels {
+	for i := 0; i < numModels; i++ {
 		model := generator.GenerateProcessedModel()
 		models = append(models, model)
 		fmt.Printf("Generated synthetic model: %s\n", model.ModelID)
@@ -42,14 +42,14 @@ type SyntheticDataGenerator struct {
 // NewSyntheticDataGenerator creates a new synthetic data generator
 func NewSyntheticDataGenerator() *SyntheticDataGenerator {
 	return &SyntheticDataGenerator{
-		rand: rand.New(rand.NewPCG(uint64(time.Now().UnixNano()), 0)),
+		rand: rand.New(rand.NewSource(time.Now().UnixNano())),
 	}
 }
 // GenerateProcessedModelFile creates a synthetic ProcessedModelFile
 func (g *SyntheticDataGenerator) GenerateProcessedModelFile() ProcessedModelFile {
 	fileTypes := []string{"model", "readme", "other"}
-	fileType := fileTypes[g.rand.IntN(len(fileTypes))]
+	fileType := fileTypes[g.rand.Intn(len(fileTypes))]
 	var path string
 	var isReadme bool
@@ -68,7 +68,7 @@ func (g *SyntheticDataGenerator) GenerateProcessedModelFile() ProcessedModelFile
 	return ProcessedModelFile{
 		Path:     path,
-		Size:     int64(g.rand.IntN(1000000000) + 1000000), // 1MB to 1GB
+		Size:     int64(g.rand.Intn(1000000000) + 1000000), // 1MB to 1GB
 		SHA256:   g.randomSHA256(),
 		IsReadme: isReadme,
 		FileType: fileType,
@@ -80,19 +80,19 @@ func (g *SyntheticDataGenerator) GenerateProcessedModel() ProcessedModel {
 	authors := []string{"microsoft", "meta", "google", "openai", "anthropic", "mistralai", "huggingface"}
 	modelNames := []string{"llama", "gpt", "claude", "mistral", "gemma", "phi", "qwen", "codellama"}
-	author := authors[g.rand.IntN(len(authors))]
+	author := authors[g.rand.Intn(len(authors))]
-	modelName := modelNames[g.rand.IntN(len(modelNames))]
+	modelName := modelNames[g.rand.Intn(len(modelNames))]
 	modelID := fmt.Sprintf("%s/%s-%s", author, modelName, g.randomString(6))
 	// Generate files
-	numFiles := g.rand.IntN(5) + 2 // 2-6 files
+	numFiles := g.rand.Intn(5) + 2 // 2-6 files
 	files := make([]ProcessedModelFile, numFiles)
 	// Ensure at least one model file and one readme
 	hasModelFile := false
 	hasReadme := false
-	for i := range numFiles {
+	for i := 0; i < numFiles; i++ {
 		files[i] = g.GenerateProcessedModelFile()
 		if files[i].FileType == "model" {
 			hasModelFile = true
@@ -140,27 +140,27 @@ func (g *SyntheticDataGenerator) GenerateProcessedModel() ProcessedModel {
 	// Generate sample metadata
 	licenses := []string{"apache-2.0", "mit", "llama2", "gpl-3.0", "bsd", ""}
-	license := licenses[g.rand.IntN(len(licenses))]
+	license := licenses[g.rand.Intn(len(licenses))]
 	sampleTags := []string{"llm", "gguf", "gpu", "cpu", "text-to-text", "chat", "instruction-tuned"}
-	numTags := g.rand.IntN(4) + 3 // 3-6 tags
+	numTags := g.rand.Intn(4) + 3 // 3-6 tags
 	tags := make([]string, numTags)
-	for i := range numTags {
+	for i := 0; i < numTags; i++ {
-		tags[i] = sampleTags[g.rand.IntN(len(sampleTags))]
+		tags[i] = sampleTags[g.rand.Intn(len(sampleTags))]
 	}
 	// Remove duplicates
 	tags = g.removeDuplicates(tags)
 	// Optionally include icon (50% chance)
 	icon := ""
-	if g.rand.IntN(2) == 0 {
+	if g.rand.Intn(2) == 0 {
 		icon = fmt.Sprintf("https://cdn-avatars.huggingface.co/v1/production/uploads/%s.png", g.randomString(24))
 	}
 	return ProcessedModel{
 		ModelID:                 modelID,
 		Author:                  author,
-		Downloads:               g.rand.IntN(1000000) + 1000,
+		Downloads:               g.rand.Intn(1000000) + 1000,
 		LastModified:            g.randomDate(),
 		Files:                   files,
 		PreferredModelFile:      preferredModelFile,
@@ -180,7 +180,7 @@ func (g *SyntheticDataGenerator) randomString(length int) string {
 	const charset = "abcdefghijklmnopqrstuvwxyz0123456789"
 	b := make([]byte, length)
 	for i := range b {
-		b[i] = charset[g.rand.IntN(len(charset))]
+		b[i] = charset[g.rand.Intn(len(charset))]
 	}
 	return string(b)
 }
@@ -189,14 +189,14 @@ func (g *SyntheticDataGenerator) randomSHA256() string {
 	const charset = "0123456789abcdef"
 	b := make([]byte, 64)
 	for i := range b {
-		b[i] = charset[g.rand.IntN(len(charset))]
+		b[i] = charset[g.rand.Intn(len(charset))]
 	}
 	return string(b)
 }
 func (g *SyntheticDataGenerator) randomDate() string {
 	now := time.Now()
-	daysAgo := g.rand.IntN(365) // Random date within last year
+	daysAgo := g.rand.Intn(365) // Random date within last year
 	pastDate := now.AddDate(0, 0, -daysAgo)
 	return pastDate.Format("2006-01-02T15:04:05.000Z")
 }
@@ -220,5 +220,5 @@ func (g *SyntheticDataGenerator) generateReadmeContent(modelName, author string)
 		fmt.Sprintf("# %s Language Model\n\nDeveloped by %s, this model represents state-of-the-art performance in natural language understanding and generation.\n\n## Key Features\n\n- Multilingual support\n- Context-aware responses\n- Efficient memory usage\n- Fast inference speed\n\n## Applications\n\n- Chatbots and virtual assistants\n- Content generation\n- Code completion\n- Educational tools", strings.Title(modelName), author),
 	}
-	return templates[g.rand.IntN(len(templates))]
+	return templates[g.rand.Intn(len(templates))]
 }
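Note on the `math/rand` switch in `testing.go`: the hunks above replace the v2 calls (`IntN`, `NewPCG`) with their v1 counterparts (`Intn`, `NewSource`). A small sketch of the v1 pattern follows. `newGenerator` and `pick` are illustrative helpers, not functions from this repository. It also shows that a fixed seed gives a reproducible sequence, which may be useful if the synthetic mode ever needs deterministic output.

```go
package main

import (
	"fmt"
	"math/rand"
)

// newGenerator mirrors the v1 construction used in the diff:
// rand.New(rand.NewSource(seed)). The helper name is hypothetical.
func newGenerator(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// pick returns a pseudo-random element. items must be non-empty,
// because Intn panics when its argument is <= 0.
func pick(r *rand.Rand, items []string) string {
	return items[r.Intn(len(items))]
}

func main() {
	licenses := []string{"apache-2.0", "mit", "llama2"}
	a, b := newGenerator(42), newGenerator(42)
	// Two generators with the same seed produce the same sequence.
	fmt.Println(pick(a, licenses) == pick(b, licenses)) // prints "true"
}
```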
--- a/.github/gallery-agent/tools.go
+++ b/.github/gallery-agent/tools.go
@@ -0,0 +1,46 @@
 package main
 import (
 	"fmt"
 	hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
 	openai "github.com/sashabaranov/go-openai"
 	jsonschema "github.com/sashabaranov/go-openai/jsonschema"
 )
 // HFReadmeTool fetches a repository's README from Hugging Face.
 type HFReadmeTool struct {
 	client *hfapi.Client
 }
 func (s *HFReadmeTool) Execute(args map[string]any) (string, error) {
 	q, ok := args["repository"].(string)
 	if !ok {
 		return "", fmt.Errorf("missing or non-string \"repository\" argument")
 	}
 	readme, err := s.client.GetReadmeContent(q, "README.md")
 	if err != nil {
 		return "", err
 	}
 	return readme, nil
 }
 func (s *HFReadmeTool) Tool() openai.Tool {
 	return openai.Tool{
 		Type: openai.ToolTypeFunction,
 		Function: &openai.FunctionDefinition{
 			Name:        "hf_readme",
 			Description: "A tool to get the README content of a huggingface repository",
 			Parameters: jsonschema.Definition{
 				Type: jsonschema.Object,
 				Properties: map[string]jsonschema.Definition{
 					"repository": {
 						Type:        jsonschema.String,
 						Description: "The huggingface repository to get the README content of",
 					},
 				},
 				Required: []string{"repository"},
 			},
 		},
 	}
 }
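Note on `HFReadmeTool.Execute`: the tool receives its arguments as `map[string]any`, so a missing key and a wrong type both surface through the same failed type assertion. The sketch below separates the two cases so the caller (or the LLM driving the tool) gets a more specific error. `stringArg` is a hypothetical helper and is not part of `tools.go`.

```go
package main

import "fmt"

// stringArg extracts a required string argument from a tool-call argument
// map. It is a hypothetical helper that distinguishes "missing" from
// "wrong type" instead of collapsing both into one error.
func stringArg(args map[string]any, key string) (string, error) {
	v, ok := args[key]
	if !ok {
		return "", fmt.Errorf("missing required argument %q", key)
	}
	s, ok := v.(string)
	if !ok {
		return "", fmt.Errorf("argument %q must be a string, got %T", key, v)
	}
	return s, nil
}

func main() {
	repo, err := stringArg(map[string]any{"repository": "org/model"}, "repository")
	fmt.Println(repo, err)

	_, err = stringArg(map[string]any{"repository": 42}, "repository")
	fmt.Println(err)
}
```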
--- a/.github/scripts/anchor-digest-in-cache.sh
+++ b/.github/scripts/anchor-digest-in-cache.sh
@@ -1,46 +0,0 @@
 #!/usr/bin/env bash
 # Anchor a backend per-arch digest in quay.io/go-skynet/ci-cache so quay's
 # garbage collector won't reap the manifest before backend_merge.yml runs.
 #
 # Context: backend_build.yml pushes by canonical digest only
 # (push-by-digest=true). Unreferenced manifests on quay can be reaped within
 # ~1-2h, but backend-merge-jobs runs only after the *entire* per-arch build
 # matrix drains (max-parallel: 8 × dozens of entries → ~2h+). Without an
 # anchoring tag, the earliest digests are gone by the time `imagetools create`
 # tries to read them, producing "manifest not found" merge failures.
 #
 # We tag the digest under our internal ci-cache image; quay does not GC tagged
 # manifests. The user-facing manifest list still references the original
 # digest in local-ai-backends. backend_merge.yml deletes the anchor tag after
 # the user-facing manifest is published — see cleanup-keepalive-tags.sh.
 #
 # Required env:
 #   GITHUB_RUN_ID  - current workflow run id (set automatically by GHA)
 #   TAG_SUFFIX     - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
 #   PLATFORM_TAG   - amd64 / arm64 / single (single = singleton matrix entry)
 #   DIGEST         - canonical content digest from build step (sha256:...)
 #
 # Optional env:
 #   ANCHOR_IMAGE   - target image (default: quay.io/go-skynet/ci-cache)
 #   SOURCE_IMAGE   - source image (default: quay.io/go-skynet/local-ai-backends)
 #   GITHUB_STEP_SUMMARY - if set, an anchored-by line is appended to it
 set -euo pipefail
 : "${GITHUB_RUN_ID:?}"
 : "${TAG_SUFFIX:?}"
 : "${PLATFORM_TAG:?}"
 : "${DIGEST:?}"
 anchor_image="${ANCHOR_IMAGE:-quay.io/go-skynet/ci-cache}"
 source_image="${SOURCE_IMAGE:-quay.io/go-skynet/local-ai-backends}"
 tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${PLATFORM_TAG}"
 docker buildx imagetools create \
  -t "${anchor_image}:${tag}" \
  "${source_image}@${DIGEST}"
 echo "anchored ${DIGEST} as ${anchor_image}:${tag}"
 if [[ -n "${GITHUB_STEP_SUMMARY:-}" ]]; then
  echo "anchored \`${DIGEST}\` as \`${anchor_image}:${tag}\`" >> "${GITHUB_STEP_SUMMARY}"
 fi
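Note on `anchor-digest-in-cache.sh`: the anchor tag is a plain concatenation of the run id, the tag-suffix, and the platform tag, and the script then points that tag at the by-digest image. The following Go sketch restates that naming scheme so the format is easy to check. `keepaliveTag` and `imagetoolsArgs` are illustrative names; the sketch only builds the strings and does not invoke Docker.

```go
package main

import (
	"fmt"
	"strings"
)

// keepaliveTag mirrors the script's
// tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${PLATFORM_TAG}".
// TAG_SUFFIX already starts with "-", so none is added before it.
func keepaliveTag(runID, tagSuffix, platformTag string) string {
	return "keepalive-" + runID + tagSuffix + "-" + platformTag
}

// imagetoolsArgs returns the arguments the script passes to `docker`
// (hypothetical helper, for illustration only).
func imagetoolsArgs(anchorImage, sourceImage, tag, digest string) []string {
	return []string{
		"buildx", "imagetools", "create",
		"-t", anchorImage + ":" + tag,
		sourceImage + "@" + digest,
	}
}

func main() {
	tag := keepaliveTag("123456", "-gpu-nvidia-cuda-12-vllm", "amd64")
	args := imagetoolsArgs(
		"quay.io/go-skynet/ci-cache",
		"quay.io/go-skynet/local-ai-backends",
		tag, "sha256:abc", // placeholder digest, not a real one
	)
	fmt.Println("docker " + strings.Join(args, " "))
}
```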
--- a/.github/scripts/cleanup-keepalive-tags.sh
+++ b/.github/scripts/cleanup-keepalive-tags.sh
@@ -1,49 +0,0 @@
 #!/usr/bin/env bash
 # Best-effort cleanup of the keepalive anchor tags written by
 # anchor-digest-in-cache.sh. Called from backend_merge.yml after the
 # user-facing manifest list has been published.
 #
 # Quay's docker registry v2 doesn't allow tag deletes — only digest deletes.
 # The proper delete is the quay REST API, which requires an OAuth-scoped
 # token. We try QUAY_TOKEN as a bearer token: if the secret is an OAuth app
 # token (typical for service accounts) the delete succeeds; otherwise this
 # is a soft no-op and the tag persists until manually pruned.
 #
 # Cleanup failure MUST NOT fail the merge — the merge has already produced
 # the user-facing manifest list at this point and the keepalive tags are
 # pure overhead. We always exit 0.
 #
 # Required env:
 #   GITHUB_RUN_ID  - current workflow run id (set automatically by GHA)
 #   TAG_SUFFIX     - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
 #   QUAY_TOKEN     - bearer token for quay's REST API
 #
 # Optional env:
 #   QUAY_REPO      - target repo (default: go-skynet/ci-cache)
 #   PLATFORM_TAGS  - space-separated list of platform-tag values to try
 #                    (default: "amd64 arm64 single")
 #                    We don't know which platform-tag(s) exist for this
 #                    tag-suffix without an extra API call, so we just try
 #                    all three and ignore 404s for the ones that don't.
 set -uo pipefail
 : "${GITHUB_RUN_ID:?}"
 : "${TAG_SUFFIX:?}"
 : "${QUAY_TOKEN:?}"
 quay_repo="${QUAY_REPO:-go-skynet/ci-cache}"
 platform_tags="${PLATFORM_TAGS:-amd64 arm64 single}"
 for plat in $platform_tags; do
  tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${plat}"
  url="https://quay.io/api/v1/repository/${quay_repo}/tag/${tag}"
  http=$(curl -sS -o /dev/null -w '%{http_code}' \
    -X DELETE -H "Authorization: Bearer ${QUAY_TOKEN}" "$url" || echo "000")
  case "$http" in
    204|200) echo "deleted $tag" ;;
    404)     echo "not present: $tag" ;;
    401|403) echo "auth not OAuth-scoped (http $http) for $tag - skipping; orphan tag will persist" ;;
    *)       echo "unexpected http $http deleting $tag - skipping" ;;
  esac
 done
 exit 0
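Note on `cleanup-keepalive-tags.sh`: the `case` statement maps each HTTP status from the DELETE request to a log line and never fails the job. A rough Go restatement of that mapping is shown below. `classifyDelete` is a hypothetical function, and the returned strings paraphrase the script's messages rather than reproduce them.

```go
package main

import "fmt"

// classifyDelete mirrors the script's case statement: every outcome is
// informational, and none of them should abort the merge job.
func classifyDelete(status int) string {
	switch status {
	case 200, 204:
		return "deleted"
	case 404:
		return "not present"
	case 401, 403:
		return "auth not OAuth-scoped; orphan tag will persist"
	default:
		// Includes 0, the stand-in for the script's "000" when curl itself fails.
		return "unexpected status; skipping"
	}
}

func main() {
	for _, code := range []int{204, 404, 403, 0} {
		fmt.Printf("%d: %s\n", code, classifyDelete(code))
	}
}
```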
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
--- a/.github/workflows/backend_build.yml
+++ b/.github/workflows/backend_build.yml
@@ -24,17 +24,6 @@ on:
        description: 'Platforms'
        default: ''
        type: string
      platform-tag:
        description: |
          Short tag identifying the platform leg, e.g. "amd64" or "arm64".
          Used to scope the per-arch registry cache and the digest artifact name.
          Required for split-and-merge multi-arch builds; pass "amd64" for
          single-arch amd64 builds too. Optional (default '') during the
          migration to per-arch matrix expansion; will be flipped to
          required: true in Phase 6 once all callers pass an explicit value.
        required: false
        default: ''
        type: string
      tag-latest:
        description: 'Tag latest'
        default: ''
@@ -69,20 +58,6 @@ on:
        required: false
        default: '2204'
        type: string
      amdgpu-targets:
        description: 'AMD GPU targets for ROCm/HIP builds'
        required: false
        default: ''
        type: string
      builder-base-image:
        description: |
          Pre-built builder base image (e.g. quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64).
          When set, the variant Dockerfile uses its `builder-prebuilt` stage which FROMs this
          image directly instead of running its own gRPC stage + apt installs. Empty for
          backends whose Dockerfile doesn't support a prebuilt base.
        required: false
        default: ''
        type: string
    secrets:
      dockerUsername:
        required: false
@@ -100,27 +75,81 @@ jobs:
        quay_username: ${{ secrets.quayUsername }}
    steps:
      - name: Free Disk Space (Ubuntu)
        if: inputs.runs-on == 'ubuntu-latest'
        uses: jlumbroso/free-disk-space@main
        with:
          # this might remove tools that are actually needed,
          # if set to "true" but frees about 6 GB
          tool-cache: true
          # all of these default to true, but feel free to set to
          # "false" if necessary for your workflow
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          docker-images: true
          swap-storage: true
      - name: Force Install GIT latest
        run: |
          sudo apt-get update \
          && sudo apt-get install -y software-properties-common \
          && sudo apt-get update \
          && sudo add-apt-repository -y ppa:git-core/ppa \
          && sudo apt-get update \
          && sudo apt-get install -y git
      - name: Checkout
        uses: actions/checkout@v6
        with:
          submodules: true
-      - name: Configure apt mirror on runner
+      - name: Release space from worker
-        id: apt_mirror
+        if: inputs.runs-on == 'ubuntu-latest'
-        uses: ./.github/actions/configure-apt-mirror
+        run: |
-
+          echo "Listing top largest packages"
-      - name: Free disk space
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
-        uses: ./.github/actions/free-disk-space
+          head -n 30 <<< "${pkgs}"
-        with:
+          echo
-          mode: ${{ inputs.runs-on == 'ubuntu-latest' && 'hosted' || 'skip' }}
+          df -h
-
+          echo
-      - name: Set up build disk
+          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
-        uses: ./.github/actions/setup-build-disk
+          sudo apt-get remove --auto-remove android-sdk-platform-tools snapd || true
          sudo apt-get purge --auto-remove android-sdk-platform-tools snapd || true
          sudo rm -rf /usr/local/lib/android
          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
          sudo rm -rf /usr/share/dotnet
          sudo apt-get remove -y '^mono-.*' || true
          sudo apt-get remove -y '^ghc-.*' || true
          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
          sudo apt-get remove -y 'php.*' || true
          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
          sudo apt-get remove -y '^google-.*' || true
          sudo apt-get remove -y azure-cli || true
          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
          sudo apt-get remove -y '^gfortran-.*' || true
          sudo apt-get remove -y microsoft-edge-stable || true
          sudo apt-get remove -y firefox || true
          sudo apt-get remove -y powershell || true
          sudo apt-get remove -y r-base-core || true
          sudo apt-get autoremove -y
          sudo apt-get clean
          echo
          echo "Listing top largest packages"
          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
          head -n 30 <<< "${pkgs}"
          echo
          sudo rm -rfv build || true
          sudo rm -rf /usr/share/dotnet || true
          sudo rm -rf /opt/ghc || true
          sudo rm -rf "/usr/local/share/boost" || true
          sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
          df -h
      - name: Docker meta
        id: meta
        if: github.event_name != 'pull_request'
-        uses: docker/metadata-action@v6
+        uses: docker/metadata-action@v5
        with:
          images: |
            quay.io/go-skynet/local-ai-backends
@@ -136,7 +165,7 @@ jobs:
      - name: Docker meta for PR
        id: meta_pull_request
        if: github.event_name == 'pull_request'
-        uses: docker/metadata-action@v6
+        uses: docker/metadata-action@v5
        with:
          images: |
            quay.io/go-skynet/ci-tests
@@ -159,31 +188,21 @@ jobs:
      - name: Login to DockerHub
        if: github.event_name != 'pull_request'
-        uses: docker/login-action@v4
+        uses: docker/login-action@v3
        with:
          username: ${{ secrets.dockerUsername }}
          password: ${{ secrets.dockerPassword }}
      - name: Login to Quay.io
        if: ${{ env.quay_username != '' }}
-        uses: docker/login-action@v4
+        uses: docker/login-action@v3
        with:
          registry: quay.io
          username: ${{ secrets.quayUsername }}
          password: ${{ secrets.quayPassword }}
-      # Weekly cache-buster for the per-backend `make` step. Most Python
+      - name: Build and push
-      # backends list unpinned deps (torch, transformers, vllm, ...), so a
+        uses: docker/build-push-action@v6
      # warm cache freezes upstream versions indefinitely. Rolling this
      # weekly forces a re-resolve of the install layer at most once per
      # week, picking up newer wheels without a full cold rebuild.
      - name: Compute deps refresh key
        id: deps_refresh
        run: echo "key=$(date -u +%Y-W%V)" >> "$GITHUB_OUTPUT"
      - name: Build and push by digest
        id: build
        uses: docker/build-push-action@v7
        if: github.event_name != 'pull_request'
        with:
          builder: ${{ steps.buildx.outputs.name }}
@@ -195,67 +214,16 @@ jobs:
            BASE_IMAGE=${{ inputs.base-image }}
            BACKEND=${{ inputs.backend }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
            AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
            APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
            APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
            DEPS_REFRESH=${{ steps.deps_refresh.outputs.key }}
            BUILDER_BASE_IMAGE=${{ inputs.builder-base-image }}
            BUILDER_TARGET=${{ inputs.builder-base-image != '' && 'builder-prebuilt' || 'builder-fromsource' }}
          context: ${{ inputs.context }}
          file: ${{ inputs.dockerfile }}
-          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
+          cache-from: type=gha
          cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}-${{ inputs.platform-tag }},mode=max,ignore-error=true
          platforms: ${{ inputs.platforms }}
-          outputs: |
+          push: ${{ github.event_name != 'pull_request' }}
-            type=image,name=quay.io/go-skynet/local-ai-backends,push-by-digest=true,name-canonical=true,push=true
+          tags: ${{ steps.meta.outputs.tags }}
            type=image,name=localai/localai-backends,push-by-digest=true,name-canonical=true,push=true
          # Disable provenance: with mode=max (the default for push:true)
          # buildx bundles a per-registry attestation manifest into each
          # registry's manifest list, which makes the resulting list digest
          # diverge across registries. steps.build.outputs.digest then
          # only matches one of them, and the merge job's
          # `imagetools create <reg>@sha256:<digest>` lookup fails on the
          # other. Disabling provenance keeps the digest content-only and
          # identical across both registries — required for digest-based
          # cross-registry merge.
          provenance: false
          labels: ${{ steps.meta.outputs.labels }}
-      - name: Export digest
+      - name: Build and push (PR)
-        if: github.event_name != 'pull_request'
+        uses: docker/build-push-action@v6
        run: |
          mkdir -p /tmp/digests
          digest="${{ steps.build.outputs.digest }}"
          touch "/tmp/digests/${digest#sha256:}"
      # See .github/scripts/anchor-digest-in-cache.sh for why this is needed
      # and how it interacts with backend_merge.yml's cleanup step.
      - name: Anchor digest in ci-cache so quay GC won't reap before merge
        if: github.event_name != 'pull_request'
        env:
          TAG_SUFFIX: ${{ inputs.tag-suffix }}
          PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
          DIGEST: ${{ steps.build.outputs.digest }}
        run: .github/scripts/anchor-digest-in-cache.sh
      # Artifact name uses a `--` separator between tag-suffix and platform-tag
      # to avoid prefix collisions during the merge job's pattern-based download.
      # Tag-suffixes are not prefix-disjoint (e.g. -gpu-nvidia-cuda-12-vllm is a
      # prefix of -gpu-nvidia-cuda-12-vllm-omni); a single `-` separator plus the
      # merge-side `digests<tag-suffix>-*` glob would let one merge over-match
       # the other backend's artifacts. The `single` placeholder for an empty
       # platform-tag (single-arch entries) keeps the name from ending in a bare `--`.
      - name: Upload digest artifact
        if: github.event_name != 'pull_request'
        uses: actions/upload-artifact@v7
        with:
          name: digests${{ inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
          path: /tmp/digests/*
          if-no-files-found: error
          retention-days: 1
      - name: Build (PR)
        uses: docker/build-push-action@v7
        if: github.event_name == 'pull_request'
        with:
          builder: ${{ steps.buildx.outputs.name }}
@@ -267,15 +235,9 @@ jobs:
            BASE_IMAGE=${{ inputs.base-image }}
            BACKEND=${{ inputs.backend }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
            AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
            APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
            APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
            DEPS_REFRESH=${{ steps.deps_refresh.outputs.key }}
            BUILDER_BASE_IMAGE=${{ inputs.builder-base-image }}
            BUILDER_TARGET=${{ inputs.builder-base-image != '' && 'builder-prebuilt' || 'builder-fromsource' }}
          context: ${{ inputs.context }}
          file: ${{ inputs.dockerfile }}
-          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
+          cache-from: type=gha
          platforms: ${{ inputs.platforms }}
          push: ${{ env.quay_username != '' }}
          tags: ${{ steps.meta_pull_request.outputs.tags }}
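Note on the `--` separator described in the "Upload digest artifact" comment of `backend_build.yml`: the prefix collision can be reproduced with a glob match. The sketch below uses Go's `path.Match` as a stand-in for the artifact download pattern matching, which is an assumption (the actual action uses its own glob implementation), and it assumes the merge job's pattern has the form `digests<tag-suffix><sep>*`. `artifactName`, `downloadPattern`, and `matches` are illustrative helpers.

```go
package main

import (
	"fmt"
	"path"
)

// artifactName builds the digest artifact name with a given separator.
// An empty platform tag falls back to "single", as in the workflow.
func artifactName(tagSuffix, sep, platformTag string) string {
	if platformTag == "" {
		platformTag = "single"
	}
	return "digests" + tagSuffix + sep + platformTag
}

// downloadPattern is the glob a merge job for tagSuffix is assumed to use.
func downloadPattern(tagSuffix, sep string) string {
	return "digests" + tagSuffix + sep + "*"
}

// matches reports whether name matches the glob pattern.
func matches(pattern, name string) bool {
	ok, err := path.Match(pattern, name)
	return err == nil && ok
}

func main() {
	const vllm = "-gpu-nvidia-cuda-12-vllm"
	const omni = "-gpu-nvidia-cuda-12-vllm-omni"
	for _, sep := range []string{"-", "--"} {
		other := artifactName(omni, sep, "amd64")
		fmt.Printf("sep=%q vllm pattern matches omni artifact: %v\n",
			sep, matches(downloadPattern(vllm, sep), other))
	}
}
```

With a single `-`, the vllm pattern also picks up the vllm-omni artifact; with `--` it does not, because the character after the shared prefix is `o` rather than a second `-`.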
--- a/.github/workflows/backend_build_darwin.yml
+++ b/.github/workflows/backend_build_darwin.yml
@@ -48,13 +48,6 @@ jobs:
    strategy:
      matrix:
        go-version: ['${{ inputs.go-version }}']
    env:
      # Keep the brew Cellar stable across cache restores. Without these,
      # `brew install` would auto-update brew itself and re-link formulas,
      # mutating the very paths the cache just restored.
      HOMEBREW_NO_AUTO_UPDATE: '1'
      HOMEBREW_NO_INSTALL_CLEANUP: '1'
      HOMEBREW_NO_ANALYTICS: '1'
    steps:
      - name: Clone
        uses: actions/checkout@v6
@@ -65,192 +58,23 @@ jobs:
        uses: actions/setup-go@v5
        with:
          go-version: ${{ matrix.go-version }}
-          # Caches ~/go/pkg/mod and ~/Library/Caches/go-build keyed on go.sum.
+          cache: false
          # Shared across every darwin matrix entry — first job in a run warms
          # it, the rest hit warm.
          cache: true
      # You can test your matrix by printing the current Go version
      - name: Display Go version
        run: go version
      # ---- Homebrew cache ----
      # macOS runners have no Docker daemon, so the BuildKit registry cache used
      # for Linux backend images (see .agents/ci-caching.md) doesn't apply here.
      # We cache the brew downloads + Cellar entries for the formulas we install
      # below. Read on every run, write only on master/tag pushes — same policy
      # as the Linux registry cache.
      - name: Restore Homebrew cache
        id: brew-cache
        uses: actions/cache/restore@v4
        with:
          path: |
            ~/Library/Caches/Homebrew/downloads
            /opt/homebrew/Cellar/protobuf
            /opt/homebrew/Cellar/grpc
            /opt/homebrew/Cellar/protoc-gen-go
            /opt/homebrew/Cellar/protoc-gen-go-grpc
            /opt/homebrew/Cellar/libomp
            /opt/homebrew/Cellar/llvm
            /opt/homebrew/Cellar/ccache
            /opt/homebrew/Cellar/blake3
            /opt/homebrew/Cellar/fmt
            /opt/homebrew/Cellar/hiredis
            /opt/homebrew/Cellar/xxhash
            /opt/homebrew/Cellar/zstd
          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
      - name: Dependencies
        run: |
-          # ccache is always installed (used by the llama-cpp variant build) so
+          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm
          # the brew cache content stays stable across every backend in the
          # matrix — they all share one cache key.
          # blake3, fmt, hiredis, xxhash, zstd are ccache's runtime dylib deps.
          # Without explicitly installing them, a brew cache-hit run restores
          # ccache's Cellar dir but skips installing those transitive deps,
          # and ccache fails at runtime with `dyld: Library not loaded`.
          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd
          # Force-reinstall ccache so brew re-validates its full runtime-dep
          # closure on every run. This is the durable fix: when the upstream
          # ccache formula gains a new transitive dep (as it has multiple times
          # already), we don't have to chase missing dylibs one at a time.
          # The downloads cache makes the reinstall fast (~5s on a hit).
          brew reinstall ccache
          # Same pattern for grpc: its CMake config (used by the llama-cpp
          # `grpc-server` target) does find_package(absl). The cache restores
          # /opt/homebrew/Cellar/grpc so brew above no-ops the install, but
          # abseil isn't in our Cellar cache list and never gets installed
          # alongside, leaving grpc's CMake unable to resolve it. Reinstalling
          # grpc re-validates and pulls abseil in, mirroring the ccache fix.
          brew reinstall grpc
          # The brew cache restores the Cellar dirs but NOT the bin symlinks
          # at /opt/homebrew/bin/*. brew install above sees the Cellar present
          # and decides "already installed" without re-linking, so on a cache-
          # hit run the formulas aren't on PATH. Force-link them; --overwrite
          # tolerates pre-existing symlinks from earlier installs.
          brew link --overwrite protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd 2>/dev/null || true
      - name: Save Homebrew cache
        if: github.event_name != 'pull_request' && steps.brew-cache.outputs.cache-hit != 'true'
        uses: actions/cache/save@v4
        with:
          path: |
            ~/Library/Caches/Homebrew/downloads
            /opt/homebrew/Cellar/protobuf
            /opt/homebrew/Cellar/grpc
            /opt/homebrew/Cellar/protoc-gen-go
            /opt/homebrew/Cellar/protoc-gen-go-grpc
            /opt/homebrew/Cellar/libomp
            /opt/homebrew/Cellar/llvm
            /opt/homebrew/Cellar/ccache
            /opt/homebrew/Cellar/blake3
            /opt/homebrew/Cellar/fmt
            /opt/homebrew/Cellar/hiredis
            /opt/homebrew/Cellar/xxhash
            /opt/homebrew/Cellar/zstd
          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
      # ---- ccache for llama.cpp CMake builds ----
      # Three CMake variants (fallback, grpc, rpc-server) compile the same
      # llama.cpp source tree with overlapping flags — ccache dedupes object
      # files across them. Key on the pinned LLAMA_VERSION so a pin bump
       # invalidates cleanly; restore-keys fall back to the latest entry for the
       # same pin, so unchanged translation units stay warm even though the exact
       # key (which embeds the run id) never matches on restore.
      - name: Compute llama.cpp version
        if: inputs.backend == 'llama-cpp'
        id: llama-version
        run: |
          version=$(grep '^LLAMA_VERSION' backend/cpp/llama-cpp/Makefile | head -1 | cut -d= -f2 | cut -d'?' -f1 | tr -d ' ')
          echo "version=${version}" >> "$GITHUB_OUTPUT"
      - name: Restore ccache
        if: inputs.backend == 'llama-cpp'
        id: ccache-cache
        uses: actions/cache/restore@v4
        with:
          path: ~/Library/Caches/ccache
          key: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-${{ github.run_id }}
          restore-keys: |
            ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-
      - name: Configure ccache
        if: inputs.backend == 'llama-cpp'
        run: |
          mkdir -p "$HOME/Library/Caches/ccache"
          ccache -M 2G
          ccache -z
          # llama-cpp-darwin.sh reads CMAKE_ARGS / CCACHE_DIR from env.
          {
            echo "CMAKE_ARGS=${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache"
            echo "CCACHE_DIR=$HOME/Library/Caches/ccache"
          } >> "$GITHUB_ENV"
      # ---- Python wheel cache (uv + pip) ----
      # Mirrors the Linux DEPS_REFRESH cadence (see .agents/ci-caching.md): the
      # ISO-week segment of the cache key forces at most one cold rebuild per
      # backend per week, automatically picking up newer wheels for unpinned
      # deps (torch, mlx, diffusers, …). Restore-keys fall back to the most
      # recent build of the same backend so off-week PRs still hit warm.
      - name: Compute weekly cache bucket
        if: inputs.lang == 'python'
        id: weekly
        run: echo "bucket=$(date -u +%Y-W%V)" >> "$GITHUB_OUTPUT"
      - name: Restore Python wheel cache
        if: inputs.lang == 'python'
        id: pyenv-cache
        uses: actions/cache/restore@v4
        with:
          path: |
            ~/Library/Caches/pip
            ~/Library/Caches/uv
          key: pyenv-darwin-${{ inputs.backend }}-${{ steps.weekly.outputs.bucket }}-${{ hashFiles(format('backend/python/{0}/requirements*.txt', inputs.backend)) }}
          restore-keys: |
            pyenv-darwin-${{ inputs.backend }}-
      # llama-cpp on Darwin uses a bespoke build script (scripts/build/llama-cpp-darwin.sh)
      # that compiles three CMake variants from backend/cpp/llama-cpp and bundles dylibs
      # via otool — it doesn't fit the build-darwin-go-backend / build-darwin-python-backend
      # mold. Drive it via its dedicated `backends/llama-cpp-darwin` make target instead.
      - name: Build ${{ inputs.backend }}-darwin (llama-cpp)
        if: inputs.backend == 'llama-cpp'
        run: |
          make protogen-go
          make backends/llama-cpp-darwin
      - name: Build ds4 backend (Darwin Metal)
        if: inputs.backend == 'ds4'
        run: |
          make backends/ds4-darwin
      - name: Build ${{ inputs.backend }}-darwin
        if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4'
        run: |
          make protogen-go
          BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend
      - name: ccache stats
        if: inputs.backend == 'llama-cpp'
        run: ccache -s
      - name: Save ccache
        if: inputs.backend == 'llama-cpp' && github.event_name != 'pull_request'
        uses: actions/cache/save@v4
        with:
          path: ~/Library/Caches/ccache
          key: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-${{ github.run_id }}
      - name: Save Python wheel cache
        if: inputs.lang == 'python' && github.event_name != 'pull_request' && steps.pyenv-cache.outputs.cache-hit != 'true'
        uses: actions/cache/save@v4
        with:
          path: |
            ~/Library/Caches/pip
            ~/Library/Caches/uv
          key: pyenv-darwin-${{ inputs.backend }}-${{ steps.weekly.outputs.bucket }}-${{ hashFiles(format('backend/python/{0}/requirements*.txt', inputs.backend)) }}
      - name: Upload ${{ inputs.backend }}.tar
-        uses: actions/upload-artifact@v7
+        uses: actions/upload-artifact@v6
        with:
          name: ${{ inputs.backend }}-tar
          path: backend-images/${{ inputs.backend }}.tar
@@ -261,7 +85,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Download ${{ inputs.backend }}.tar
-        uses: actions/download-artifact@v8
+        uses: actions/download-artifact@v7
        with:
          name: ${{ inputs.backend }}-tar
          path: .
@@ -281,7 +105,7 @@ jobs:
      - name: Docker meta
        id: meta
-        uses: docker/metadata-action@v6
+        uses: docker/metadata-action@v5
        with:
          images: |
            localai/localai-backends
@@ -295,7 +119,7 @@ jobs:
      - name: Docker meta
        id: quaymeta
-        uses: docker/metadata-action@v6
+        uses: docker/metadata-action@v5
        with:
          images: |
            quay.io/go-skynet/local-ai-backends
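Note on the Python wheel cache key in `backend_build_darwin.yml` above: the key is built from the backend name, a weekly bucket, and a hash of the requirements files, while the restore key keeps only the backend prefix. The sketch below reproduces that shape in plain shell. The function names and the `abc123` hash are hypothetical placeholders, not part of the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring the pyenv-darwin-<backend>-<bucket>-<hash> key shape.

weekly_bucket() {
  # -u pins the clock to UTC; %V is the ISO-8601 week number (01..53).
  date -u +%Y-W%V
}

cache_key() {
  local backend="$1" req_hash="$2"
  printf 'pyenv-darwin-%s-%s-%s\n' "$backend" "$(weekly_bucket)" "$req_hash"
}

restore_prefix() {
  # Matches any earlier key for the same backend, whatever its week or hash.
  printf 'pyenv-darwin-%s-\n' "$1"
}

cache_key diffusers abc123
restore_prefix diffusers
```

One detail worth checking: `%Y` is the calendar year while `%V` is the ISO week number, so in the days around New Year the two can disagree. `%G` is the ISO week-based year that pairs with `%V`.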
--- a/.github/workflows/backend_merge.yml
+++ b/.github/workflows/backend_merge.yml
@@ -1,217 +0,0 @@
 ---
 name: 'merge backend manifest list (reusable)'
 # Reusable workflow that joins per-arch digest artifacts (uploaded by
 # backend_build.yml when called with platform-tag) into a single tagged
 # multi-arch manifest list. Called once per backend by backend.yml after
 # both per-arch build jobs succeed.
 on:
  workflow_call:
    inputs:
      tag-latest:
        description: 'Whether the manifest list should also be tagged latest (auto/false/true)'
        required: false
        type: string
        default: ''
      tag-suffix:
        description: 'Backend tag suffix (e.g. -cpu-faster-whisper). Used to compute the artifact pattern and the final tag suffix.'
        required: true
        type: string
    secrets:
      dockerUsername:
        required: false
      dockerPassword:
        required: false
      quayUsername:
        required: true
      quayPassword:
        required: true
 jobs:
  merge:
    runs-on: ubuntu-latest
    # id-token: write is required for keyless cosign — the workflow
    # exchanges the GitHub OIDC token for a short-lived Fulcio cert that
    # signs each pushed manifest. Without this permission the runner
    # cannot mint the token, and `cosign sign` fails with "no token".
    permissions:
      contents: read
      id-token: write
    env:
      quay_username: ${{ secrets.quayUsername }}
      # cosign v2.4.x still gates --registry-referrers-mode=oci-1-1 behind
      # this flag. Without it, signing fails with:
      #   invalid argument "oci-1-1" for "--registry-referrers-mode" flag:
      #   in order to use mode "oci-1-1", you must set COSIGN_EXPERIMENTAL=1
      COSIGN_EXPERIMENTAL: '1'
    steps:
      # Sparse checkout: the merge job needs `.github/scripts/` (for the
      # keepalive cleanup script) but none of the source tree.
      - name: Checkout (.github/scripts only)
        uses: actions/checkout@v6
        with:
          sparse-checkout: |
            .github/scripts
          sparse-checkout-cone-mode: false
      # `--` separator anchors the glob so we don't over-match sibling
      # backends whose tag-suffix happens to be a prefix of ours
      # (e.g. -cpu-vllm vs -cpu-vllm-omni). Must stay in sync with the
      # upload-artifact name in backend_build.yml.
      - name: Download digests
        uses: actions/download-artifact@v8
        with:
          pattern: digests${{ inputs.tag-suffix }}--*
          merge-multiple: true
          path: /tmp/digests
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@master
      # cosign signs each pushed manifest list with --recursive so the
      # index and every per-arch entry get an attached Sigstore bundle.
      # Recent cosign releases always emit the new bundle format, so
      # there's no extra CLI flag to opt into it.
      - name: Install cosign
        if: github.event_name != 'pull_request'
        uses: sigstore/cosign-installer@v3
        with:
          cosign-release: 'v2.4.1'
      - name: Login to DockerHub
        if: github.event_name != 'pull_request'
        uses: docker/login-action@v4
        with:
          username: ${{ secrets.dockerUsername }}
          password: ${{ secrets.dockerPassword }}
      - name: Login to Quay.io
        if: ${{ env.quay_username != '' }}
        uses: docker/login-action@v4
        with:
          registry: quay.io
          username: ${{ secrets.quayUsername }}
          password: ${{ secrets.quayPassword }}
      - name: Docker meta
        id: meta
        if: github.event_name != 'pull_request'
        uses: docker/metadata-action@v6
        with:
          images: |
            quay.io/go-skynet/local-ai-backends
            localai/localai-backends
          tags: |
            type=ref,event=branch
            type=semver,pattern={{raw}}
            type=sha
          flavor: |
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.tag-suffix }},onlatest=true
      # Source from ci-cache, not local-ai-backends.
      #
      # The build job pushes per-arch manifests to local-ai-backends with
      # push-by-digest=true (no tag), then anchors a tagged copy into
      # ci-cache so the manifest can be retrieved hours later when this
      # merge runs. Quay's manifest GC, however, is per-repository: the
      # anchor tag in ci-cache protects the manifest there, but the same
      # digest in local-ai-backends has no tag in *that* repo and gets
      # reaped independently. Sourcing local-ai-backends@<digest> here
      # then fails with "manifest not found" — exactly the regression
      # we hit on v4.2.2 (19/37 multiarch merges failed).
      #
      # ci-cache@<digest> resolves because we anchored it there. buildx
      # imagetools create copies the manifest into local-ai-backends
      # (cross-repo within the same registry, blobs already cross-mounted
      # from the original push so no transfer needed) and publishes the
      # manifest list with the user-facing tags. The resulting manifest
      # list is fully self-contained in local-ai-backends — child digests
      # only, no embedded references to ci-cache.
      - name: Create manifest list and push (quay)
        if: github.event_name != 'pull_request'
        working-directory: /tmp/digests
        run: |
          set -euo pipefail
          tags=$(jq -cr '
            .tags
            | map(select(startswith("quay.io/")))
            | map("-t " + .)
            | join(" ")
          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          if [ -z "$tags" ]; then
            echo "No quay.io tags from docker/metadata-action; skipping quay merge"
            exit 0
          fi
          # shellcheck disable=SC2086
          docker buildx imagetools create $tags \
            $(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
          # Resolve the manifest-list digest (any tag points at it) so
          # cosign can sign by digest. Signing by tag would leave the
          # signature orphaned the next time the tag moves.
          first_tag=$(jq -cr '
            .tags | map(select(startswith("quay.io/"))) | .[0]
          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
          # --recursive walks the list and signs every per-arch entry
          # too — clients that resolve a tag to a platform-specific
          # manifest before checking signatures need the per-arch
          # signatures, not just the list-level one.
          cosign sign --yes --recursive \
            --registry-referrers-mode=oci-1-1 \
            "quay.io/go-skynet/local-ai-backends@${digest}"
      - name: Create manifest list and push (dockerhub)
        if: github.event_name != 'pull_request'
        working-directory: /tmp/digests
        run: |
          set -euo pipefail
          tags=$(jq -cr '
            .tags
            | map(select(startswith("localai/")))
            | map("-t " + .)
            | join(" ")
          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          if [ -z "$tags" ]; then
            echo "No dockerhub tags from docker/metadata-action; skipping dockerhub merge"
            exit 0
          fi
          # shellcheck disable=SC2086
          docker buildx imagetools create $tags \
            $(printf 'localai/localai-backends@sha256:%s ' *)
          first_tag=$(jq -cr '
            .tags | map(select(startswith("localai/"))) | .[0]
          ' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
          cosign sign --yes --recursive \
            --registry-referrers-mode=oci-1-1 \
            "localai/localai-backends@${digest}"
      - name: Inspect manifest
        if: github.event_name != 'pull_request'
        run: |
          set -euo pipefail
          first_tag=$(jq -cr '.tags[0]' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          if [ -n "$first_tag" ] && [ "$first_tag" != "null" ]; then
            docker buildx imagetools inspect "$first_tag"
          fi
      # See .github/scripts/cleanup-keepalive-tags.sh for why this is
      # best-effort and what the failure modes are.
      - name: Cleanup keepalive tags in ci-cache
        if: github.event_name != 'pull_request' && success()
        env:
          TAG_SUFFIX: ${{ inputs.tag-suffix }}
          QUAY_TOKEN: ${{ secrets.quayPassword }}
        run: .github/scripts/cleanup-keepalive-tags.sh
      - name: Job summary
        if: github.event_name != 'pull_request'
        run: |
          set -euo pipefail
          echo "Merged manifest tags:" >> "$GITHUB_STEP_SUMMARY"
          jq -r '.tags[]' <<< "$DOCKER_METADATA_OUTPUT_JSON" | sed 's/^/- /' >> "$GITHUB_STEP_SUMMARY"
          echo >> "$GITHUB_STEP_SUMMARY"
          echo "Per-arch digests:" >> "$GITHUB_STEP_SUMMARY"
          ls -1 /tmp/digests | sed 's/^/- sha256:/' >> "$GITHUB_STEP_SUMMARY"
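Note on two idioms in `backend_merge.yml` above: the `--` separator in the artifact pattern, and the expansion of digest file names into image references. The sketch below illustrates both in plain bash. The artifact names (`digests-cpu-vllm--amd64` and so on) and the helper names are assumptions for illustration; the real upload name lives in `backend_build.yml`, and `actions/download-artifact` does its own glob matching, for which bash patterns are only a stand-in here.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; artifact names below are illustrative only.

matches_anchored() {
  # Pattern digests<suffix>--* : the literal "--" must follow the suffix.
  local name="$1" suffix="$2"
  [[ "$name" == "digests${suffix}--"* ]]
}

matches_loose() {
  # Pattern digests<suffix>* : also matches longer suffixes sharing a prefix.
  local name="$1" suffix="$2"
  [[ "$name" == "digests${suffix}"* ]]
}

digest_refs() {
  # Each file in the directory is named after a hex digest; emit one
  # <repo>@sha256:<digest> reference per file, space-separated.
  local repo="$1" dir="$2" f
  for f in "$dir"/*; do
    printf '%s@sha256:%s ' "$repo" "${f##*/}"
  done
}

for name in digests-cpu-vllm--amd64 digests-cpu-vllm-omni--amd64; do
  if matches_anchored "$name" "-cpu-vllm"; then echo "anchored matches:   $name"; fi
  if matches_loose "$name" "-cpu-vllm"; then echo "unanchored matches: $name"; fi
done
```

With the suffix `-cpu-vllm`, the anchored pattern accepts only the first name, while the unanchored one accepts both, which is the over-match the workflow comment warns about.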
--- a/.github/workflows/backend_pr.yml
+++ b/.github/workflows/backend_pr.yml
@@ -4,23 +4,17 @@ on:
  pull_request:
 concurrency:
-  group: ci-backends-pr-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
+  group: ci-backends-pr-${{ github.head_ref || github.ref }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+  cancel-in-progress: true
 jobs:
  generate-matrix:
    runs-on: ubuntu-latest
    outputs:
-      matrix-singlearch: ${{ steps.set-matrix.outputs['matrix-singlearch'] }}
+      matrix: ${{ steps.set-matrix.outputs.matrix }}
-      matrix-multiarch: ${{ steps.set-matrix.outputs['matrix-multiarch'] }}
+      matrix-darwin: ${{ steps.set-matrix.outputs.matrix-darwin }}
-      matrix-darwin: ${{ steps.set-matrix.outputs['matrix-darwin'] }}
+      has-backends: ${{ steps.set-matrix.outputs.has-backends }}
-      merge-matrix-multiarch: ${{ steps.set-matrix.outputs['merge-matrix-multiarch'] }}
+      has-backends-darwin: ${{ steps.set-matrix.outputs.has-backends-darwin }}
-      merge-matrix-singlearch: ${{ steps.set-matrix.outputs['merge-matrix-singlearch'] }}
-      has-backends-singlearch: ${{ steps.set-matrix.outputs['has-backends-singlearch'] }}
-      has-backends-multiarch: ${{ steps.set-matrix.outputs['has-backends-multiarch'] }}
-      has-backends-darwin: ${{ steps.set-matrix.outputs['has-backends-darwin'] }}
-      has-merges-multiarch: ${{ steps.set-matrix.outputs['has-merges-multiarch'] }}
-      has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6
@@ -33,9 +27,7 @@ jobs:
          bun add js-yaml
          bun add @octokit/core
-      # filters the matrix in backend.yml; splits into single-arch and
+      # filters the matrix in backend.yml
-      # multi-arch groups so backend-merge-jobs can `needs:` only the latter
-      # (matches backend.yml's structure).
      - name: Filter matrix for changed backends
        id: set-matrix
        env:
@@ -43,10 +35,10 @@ jobs:
          GITHUB_EVENT_PATH: ${{ github.event_path }}
        run: bun run scripts/changed-backends.js
-  backend-jobs-multiarch:
+  backend-jobs:
    needs: generate-matrix
    uses: ./.github/workflows/backend_build.yml
-    if: needs.generate-matrix.outputs['has-backends-multiarch'] == 'true'
+    if: needs.generate-matrix.outputs.has-backends == 'true'
    with:
      tag-latest: ${{ matrix.tag-latest }}
      tag-suffix: ${{ matrix.tag-suffix }}
@@ -54,83 +46,19 @@ jobs:
      cuda-major-version: ${{ matrix.cuda-major-version }}
      cuda-minor-version: ${{ matrix.cuda-minor-version }}
      platforms: ${{ matrix.platforms }}
-      platform-tag: ${{ matrix.platform-tag || '' }}
      runs-on: ${{ matrix.runs-on }}
-      builder-base-image: ${{ matrix.builder-base-image || '' }}
      base-image: ${{ matrix.base-image }}
      backend: ${{ matrix.backend }}
      dockerfile: ${{ matrix.dockerfile }}
      skip-drivers: ${{ matrix.skip-drivers }}
      context: ${{ matrix.context }}
      ubuntu-version: ${{ matrix.ubuntu-version }}
-      amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
    secrets:
      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    strategy:
      fail-fast: true
-      max-parallel: 8
+      matrix: ${{ fromJson(needs.generate-matrix.outputs.matrix) }}
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-multiarch']) }}
-  backend-jobs-singlearch:
-    needs: generate-matrix
-    uses: ./.github/workflows/backend_build.yml
-    if: needs.generate-matrix.outputs['has-backends-singlearch'] == 'true'
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-      build-type: ${{ matrix.build-type }}
-      cuda-major-version: ${{ matrix.cuda-major-version }}
-      cuda-minor-version: ${{ matrix.cuda-minor-version }}
-      platforms: ${{ matrix.platforms }}
-      platform-tag: ${{ matrix.platform-tag || '' }}
-      runs-on: ${{ matrix.runs-on }}
-      builder-base-image: ${{ matrix.builder-base-image || '' }}
-      base-image: ${{ matrix.base-image }}
-      backend: ${{ matrix.backend }}
-      dockerfile: ${{ matrix.dockerfile }}
-      skip-drivers: ${{ matrix.skip-drivers }}
-      context: ${{ matrix.context }}
-      ubuntu-version: ${{ matrix.ubuntu-version }}
-      amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: true
-      max-parallel: 8
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch']) }}
-  backend-merge-jobs-multiarch:
-    needs: [generate-matrix, backend-jobs-multiarch]
-    # backend_merge.yml's push-side steps are all gated on
-    # github.event_name != 'pull_request', so on a PR the merge job would
-    # do nothing. Skip it entirely to avoid spinning up an empty runner.
-    # !cancelled() lets the merge run even when a few build legs fail —
-    # see the matching note in backend.yml.
-    if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
-    uses: ./.github/workflows/backend_merge.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-multiarch']) }}
-  backend-merge-jobs-singlearch:
-    needs: [generate-matrix, backend-jobs-singlearch]
-    if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true' }}
-    uses: ./.github/workflows/backend_merge.yml
-    with:
-      tag-latest: ${{ matrix.tag-latest }}
-      tag-suffix: ${{ matrix.tag-suffix }}
-    secrets:
-      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
-    strategy:
-      fail-fast: false
-      matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch']) }}
  backend-jobs-darwin:
    needs: generate-matrix
    uses: ./.github/workflows/backend_build_darwin.yml
@@ -138,7 +66,7 @@ jobs:
    with:
      backend: ${{ matrix.backend }}
      build-type: ${{ matrix.build-type }}
-      go-version: "1.25.x"
+      go-version: "1.24.x"
      tag-suffix: ${{ matrix.tag-suffix }}
      lang: ${{ matrix.lang || 'python' }}
      use-pip: ${{ matrix.backend == 'diffusers' }}
--- a/.github/workflows/base-images.yml
+++ b/.github/workflows/base-images.yml
@@ -1,161 +0,0 @@
 ---
 name: 'build base-grpc images'
 # Builds + pushes pre-compiled builder base images that downstream
 # llama-cpp / ik-llama-cpp / turboquant variant Dockerfiles will FROM
 # (PR 2). Each base contains apt deps + protoc + cmake + gRPC at
 # /opt/grpc + (conditionally) CUDA / ROCm / Vulkan toolchains.
 #
 # Triggers:
 #   - schedule (Saturdays 05:00 UTC) - picks up Ubuntu/CUDA/ROCm
 #     security updates and re-runs ahead of the backend.yml weekly
 #     cron (Sundays 06:00 UTC).
 #   - workflow_dispatch - manual one-off rebuild.
 #   - push to master that touches Dockerfile.base-grpc-builder or
 #     this workflow itself - keeps bases in sync with their inputs.
 #
 # Bootstrap (one-time after this PR merges):
 #   gh workflow run base-images.yml --ref master
 # Wait ~30 min for all 9 matrix variants to push to
 # quay.io/go-skynet/ci-cache:base-grpc-* before merging PR 2.
 on:
  schedule:
    - cron: '0 5 * * 6'
  workflow_dispatch:
  push:
    branches: [master]
    paths:
      - 'backend/Dockerfile.base-grpc-builder'
      - '.github/workflows/base-images.yml'
      # The install logic and apt-mirror helper are bind-mounted into
      # Dockerfile.base-grpc-builder at build time — changes to either
      # affect the produced base images and must trigger a rebuild.
      - '.docker/install-base-deps.sh'
      - '.docker/apt-mirror.sh'
 concurrency:
  group: ci-base-images-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
 jobs:
  build:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ${{ matrix.runs-on }}
    strategy:
      fail-fast: false
      matrix:
        include:
          - tag: 'base-grpc-amd64'
            runs-on: 'ubuntu-latest'
            base-image: 'ubuntu:24.04'
            build-type: ''
            cuda-major-version: ''
            cuda-minor-version: ''
            ubuntu-version: '2404'
          - tag: 'base-grpc-arm64'
            runs-on: 'ubuntu-24.04-arm'
            base-image: 'ubuntu:24.04'
            build-type: ''
            cuda-major-version: ''
            cuda-minor-version: ''
            ubuntu-version: '2404'
          - tag: 'base-grpc-cuda-12-amd64'
            runs-on: 'ubuntu-latest'
            base-image: 'ubuntu:24.04'
            build-type: 'cublas'
            cuda-major-version: '12'
            cuda-minor-version: '8'
            ubuntu-version: '2404'
          - tag: 'base-grpc-cuda-13-amd64'
            runs-on: 'ubuntu-latest'
            base-image: 'ubuntu:22.04'
            build-type: 'cublas'
            cuda-major-version: '13'
            cuda-minor-version: '0'
            ubuntu-version: '2204'
          - tag: 'base-grpc-cuda-13-arm64'
            runs-on: 'ubuntu-24.04-arm'
            base-image: 'ubuntu:24.04'
            build-type: 'cublas'
            cuda-major-version: '13'
            cuda-minor-version: '0'
            ubuntu-version: '2404'
          - tag: 'base-grpc-rocm-amd64'
            runs-on: 'ubuntu-latest'
            base-image: 'rocm/dev-ubuntu-24.04:7.2.1'
            build-type: 'hipblas'
            cuda-major-version: ''
            cuda-minor-version: ''
            ubuntu-version: '2404'
          - tag: 'base-grpc-vulkan-amd64'
            runs-on: 'ubuntu-latest'
            base-image: 'ubuntu:24.04'
            build-type: 'vulkan'
            cuda-major-version: ''
            cuda-minor-version: ''
            ubuntu-version: '2404'
          - tag: 'base-grpc-vulkan-arm64'
            runs-on: 'ubuntu-24.04-arm'
            base-image: 'ubuntu:24.04'
            build-type: 'vulkan'
            cuda-major-version: ''
            cuda-minor-version: ''
            ubuntu-version: '2404'
          - tag: 'base-grpc-intel-amd64'
            runs-on: 'ubuntu-latest'
            base-image: 'intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04'
            build-type: 'sycl'
            cuda-major-version: ''
            cuda-minor-version: ''
            ubuntu-version: '2404'
          # Legacy JetPack r36.4.0 base for older Jetson devices (CUDA 12).
          # Distinct from base-grpc-cuda-13-arm64 (Ubuntu 24.04 + CUDA 13 sbsa)
          # which targets newer Jetsons. Some matrix entries
          # (-nvidia-l4t-arm64-llama-cpp / -turboquant) still build against
          # the JetPack image, so we need a matching base.
          - tag: 'base-grpc-l4t-cuda-12-arm64'
            runs-on: 'ubuntu-24.04-arm'
            base-image: 'nvcr.io/nvidia/l4t-jetpack:r36.4.0'
            build-type: 'l4t'
            cuda-major-version: '12'
            cuda-minor-version: '0'
            ubuntu-version: '2204'
            # JetPack r36.4.0 already ships CUDA preinstalled at /usr/local/cuda;
            # apt-installing cuda-nvcc-12-0 from the public repos fails because
            # those packages aren't published for the JetPack apt feed. Match
            # the original l4t matrix entry which set skip-drivers: 'true'.
            skip-drivers: 'true'
    steps:
      - uses: actions/checkout@v6
        with:
          submodules: false
      - name: Free disk space
        uses: ./.github/actions/free-disk-space
      - name: Set up build disk
        uses: ./.github/actions/setup-build-disk
      - uses: docker/setup-qemu-action@master
        with:
          platforms: all
      - uses: docker/setup-buildx-action@master
      - uses: docker/login-action@v4
        with:
          registry: quay.io
          username: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
          password: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
      - uses: docker/build-push-action@v7
        with:
          context: .
          file: ./backend/Dockerfile.base-grpc-builder
          build-args: |
            BASE_IMAGE=${{ matrix.base-image }}
            BUILD_TYPE=${{ matrix.build-type }}
            CUDA_MAJOR_VERSION=${{ matrix.cuda-major-version }}
            CUDA_MINOR_VERSION=${{ matrix.cuda-minor-version }}
            UBUNTU_VERSION=${{ matrix.ubuntu-version }}
            SKIP_DRIVERS=${{ matrix.skip-drivers || 'false' }}
          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-${{ matrix.tag }}
          cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache-${{ matrix.tag }},mode=max,ignore-error=true
          provenance: false
          tags: quay.io/go-skynet/ci-cache:${{ matrix.tag }}
          push: true
--- a/.github/workflows/build-test.yaml
+++ b/.github/workflows/build-test.yaml
@@ -37,7 +37,7 @@ jobs:
          make build-launcher-darwin
          ls -liah dist
      - name: Upload macOS launcher artifacts
-        uses: actions/upload-artifact@v7
+        uses: actions/upload-artifact@v6
        with:
          name: launcher-macos
          path: dist/
@@ -50,8 +50,6 @@ jobs:
        uses: actions/checkout@v6
        with:
          fetch-depth: 0
-      - name: Configure apt mirror on runner
-        uses: ./.github/actions/configure-apt-mirror
      - name: Set up Go
        uses: actions/setup-go@v5
        with:
@@ -62,7 +60,7 @@ jobs:
          sudo apt-get install golang gcc libgl1-mesa-dev xorg-dev libxkbcommon-dev
          make build-launcher-linux
      - name: Upload Linux launcher artifacts
-        uses: actions/upload-artifact@v7
+        uses: actions/upload-artifact@v6
        with:
          name: launcher-linux
          path: local-ai-launcher-linux.tar.xz
--- a/.github/workflows/bump-inference-defaults.yml
+++ b/.github/workflows/bump-inference-defaults.yml
@@ -1,48 +0,0 @@
 name: Bump inference defaults
 on:
  schedule:
    # Run daily at 06:00 UTC
    - cron: '0 6 * * *'
  workflow_dispatch: # Allow manual trigger
 permissions:
  contents: write
  pull-requests: write
 jobs:
  bump:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-go@v5
        with:
          go-version-file: go.mod
      - name: Re-fetch inference defaults
        run: make generate-force
      - name: Check for changes
        id: diff
        run: |
          if git diff --quiet core/config/inference_defaults.json; then
            echo "changed=false" >> "$GITHUB_OUTPUT"
          else
            echo "changed=true" >> "$GITHUB_OUTPUT"
          fi
      - name: Create Pull Request
        if: steps.diff.outputs.changed == 'true'
        uses: peter-evans/create-pull-request@v8
        with:
          commit-message: "chore: bump inference defaults from unsloth"
          title: "chore: bump inference defaults from unsloth"
          body: |
            Auto-generated update of `core/config/inference_defaults.json` from
            [unsloth's inference_defaults.json](https://github.com/unslothai/unsloth/blob/main/studio/backend/assets/configs/inference_defaults.json).
            This PR was created automatically by the `bump-inference-defaults` workflow.
          branch: chore/bump-inference-defaults
          delete-branch: true
          labels: automated
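Note on the "Check for changes" step in `bump-inference-defaults.yml` above: it relies on the exit status of `git diff --quiet`, which is 0 when the path has no unstaged changes and 1 when it does. The sketch below shows that behaviour in a throwaway repository. The `changed_flag` helper, the temporary repository, and the `defaults.json` file name are all hypothetical and exist only for the demonstration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print changed=true/false for one path in a repository.
changed_flag() {
  local repo="$1" path="$2"
  # --quiet suppresses output and sets the exit status: 0 = no diff, 1 = diff.
  if git -C "$repo" diff --quiet -- "$path"; then
    echo "changed=false"
  else
    echo "changed=true"
  fi
}

# Demonstration in a temporary repository.
demo=$(mktemp -d)
git -C "$demo" init -q
echo '{}' > "$demo/defaults.json"
git -C "$demo" add defaults.json
git -C "$demo" -c user.name=demo -c user.email=demo@example.test commit -q -m init

changed_flag "$demo" defaults.json      # file matches the commit
echo '{"a": 1}' > "$demo/defaults.json"
changed_flag "$demo" defaults.json      # file now differs from the commit
```

In the workflow the resulting line is appended to `$GITHUB_OUTPUT`, so the later "Create Pull Request" step can gate on `steps.diff.outputs.changed`.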
--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -5,7 +5,6 @@ on:
  workflow_dispatch:
 jobs:
  bump-backends:
-    if: github.repository == 'mudler/LocalAI'
    strategy:
      fail-fast: false
      matrix:
@@ -14,22 +13,14 @@ jobs:
            variable: "LLAMA_VERSION"
            branch: "master"
            file: "backend/cpp/llama-cpp/Makefile"
-          - repository: "ikawrakow/ik_llama.cpp"
-            variable: "IK_LLAMA_VERSION"
-            branch: "main"
-            file: "backend/cpp/ik-llama-cpp/Makefile"
-          - repository: "TheTom/llama-cpp-turboquant"
-            variable: "TURBOQUANT_VERSION"
-            branch: "feature/turboquant-kv-cache"
-            file: "backend/cpp/turboquant/Makefile"
-          - repository: "antirez/ds4"
-            variable: "DS4_VERSION"
-            branch: "main"
-            file: "backend/cpp/ds4/Makefile"
          - repository: "ggml-org/whisper.cpp"
            variable: "WHISPER_CPP_VERSION"
            branch: "master"
            file: "backend/go/whisper/Makefile"
+          - repository: "PABannier/bark.cpp"
+            variable: "BARKCPP_VERSION"
+            branch: "main"
+            file: "Makefile"
          - repository: "leejet/stable-diffusion.cpp"
            variable: "STABLEDIFFUSION_GGML_VERSION"
            branch: "master"
@@ -38,26 +29,6 @@ jobs:
            variable: "PIPER_VERSION"
            branch: "master"
            file: "backend/go/piper/Makefile"
-          - repository: "antirez/voxtral.c"
-            variable: "VOXTRAL_VERSION"
-            branch: "main"
-            file: "backend/go/voxtral/Makefile"
-          - repository: "ace-step/acestep.cpp"
-            variable: "ACESTEP_CPP_VERSION"
-            branch: "master"
-            file: "backend/go/acestep-cpp/Makefile"
-          - repository: "PABannier/sam3.cpp"
-            variable: "SAM3_VERSION"
-            branch: "main"
-            file: "backend/go/sam3-cpp/Makefile"
-          - repository: "predict-woo/qwen3-tts.cpp"
-            variable: "QWEN3TTS_CPP_VERSION"
-            branch: "main"
-            file: "backend/go/qwen3-tts-cpp/Makefile"
-          - repository: "localai-org/vibevoice.cpp"
-            variable: "VIBEVOICE_CPP_VERSION"
-            branch: "master"
-            file: "backend/go/vibevoice-cpp/Makefile"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
@@ -88,37 +59,5 @@ jobs:
          body: ${{ steps.bump.outputs.message }}
          signoff: true
-  bump-vllm-wheel:
+
-    # vLLM's cu130 wheel comes from a per-tag index URL (no /latest/ alias),
+
-    # so the cublas13 requirements file pins both a URL segment and a version
-    # constraint. bump_deps.sh handles git-sha-in-Makefile only — this job
-    # rewrites both values atomically when a new vLLM stable tag ships.
-    if: github.repository == 'mudler/LocalAI'
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v6
-      - name: Bump vLLM cu130 wheel pin 🔧
-        id: bump
-        run: |
-          bash .github/bump_vllm_wheel.sh vllm-project/vllm backend/python/vllm/requirements-cublas13-after.txt VLLM_VERSION
-          {
-            echo 'message<<EOF'
-            cat "VLLM_VERSION_message.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          {
-            echo 'commit<<EOF'
-            cat "VLLM_VERSION_commit.txt"
-            echo EOF
-          } >> "$GITHUB_OUTPUT"
-          rm -rfv VLLM_VERSION_message.txt VLLM_VERSION_commit.txt
-      - name: Create Pull Request
-        uses: peter-evans/create-pull-request@v8
-        with:
-          token: ${{ secrets.UPDATE_BOT_TOKEN }}
-          push-to-fork: ci-forks/LocalAI
-          commit-message: ':arrow_up: Update vllm-project/vllm cu130 wheel'
-          title: 'chore: :arrow_up: Update vllm-project/vllm cu130 wheel to `${{ steps.bump.outputs.commit }}`'
-          branch: "update/VLLM_VERSION"
-          body: ${{ steps.bump.outputs.message }}
-          signoff: true
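Note on the `message<<EOF` blocks in the `bump-vllm-wheel` job above: this is the multi-line form of a step output, where a `name<<DELIMITER` line opens the value and a line holding only the delimiter closes it. The sketch below writes such a record to an ordinary file. The `write_multiline_output` helper and the temporary file standing in for `$GITHUB_OUTPUT` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append one multi-line output record to a file.
write_multiline_output() {
  local name="$1" value="$2" out="$3"
  {
    echo "${name}<<EOF"       # opening line: output name plus delimiter
    printf '%s\n' "$value"    # the value, which may span several lines
    echo EOF                  # closing delimiter on its own line
  } >> "$out"
}

out_file=$(mktemp)            # stand-in for "$GITHUB_OUTPUT"
write_multiline_output message $'first line\nsecond line' "$out_file"
cat "$out_file"
```

A fixed `EOF` delimiter is fine as long as the value never contains a line that is exactly `EOF`; a randomly generated delimiter avoids that edge case.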
--- a/.github/workflows/bump_docs.yaml
+++ b/.github/workflows/bump_docs.yaml
@@ -5,7 +5,6 @@ on:
  workflow_dispatch:
 jobs:
  bump-docs:
-    if: github.repository == 'mudler/LocalAI'
    strategy:
      fail-fast: false
      matrix:
--- a/.github/workflows/checksum_checker.yaml
+++ b/.github/workflows/checksum_checker.yaml
@@ -5,12 +5,17 @@ on:
  workflow_dispatch:
 jobs:
  checksum_check:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
      - name: Force Install GIT latest
        run: |
          sudo apt-get update \
          && sudo apt-get install -y software-properties-common \
          && sudo apt-get update \
          && sudo add-apt-repository -y ppa:git-core/ppa \
          && sudo apt-get update \
          && sudo apt-get install -y git
      - uses: actions/checkout@v6
      - name: Configure apt mirror on runner
        uses: ./.github/actions/configure-apt-mirror
      - name: Install dependencies
        run: |
          sudo apt-get update
--- a/.github/workflows/disabled/dependabot_auto.yml
+++ b/.github/workflows/disabled/dependabot_auto.yml
@@ -9,8 +9,8 @@ permissions:
 jobs:
  dependabot:
    if: github.repository == 'mudler/LocalAI' && github.actor == 'dependabot[bot]'
    runs-on: ubuntu-latest
    if: ${{ github.actor == 'dependabot[bot]' }}
    steps:
      - name: Dependabot metadata
        id: metadata
--- a/.github/workflows/deploy-explorer.yaml
+++ b/.github/workflows/deploy-explorer.yaml
@@ -12,7 +12,6 @@ concurrency:
 jobs:
  build-linux:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
@@ -34,7 +33,7 @@ jobs:
        run: |
          CGO_ENABLED=0 make build
      - name: rm
-        uses: appleboy/ssh-action@v1.2.5
+        uses: appleboy/ssh-action@v1.2.4
        with:
            host: ${{ secrets.EXPLORER_SSH_HOST }}
            username: ${{ secrets.EXPLORER_SSH_USERNAME }}
@@ -54,7 +53,7 @@ jobs:
            rm: true
            target: ./local-ai
      - name: restarting
-        uses: appleboy/ssh-action@v1.2.5
+        uses: appleboy/ssh-action@v1.2.4
        with:
            host: ${{ secrets.EXPLORER_SSH_HOST }}
            username: ${{ secrets.EXPLORER_SSH_USERNAME }}
--- a/.github/workflows/gallery-agent.yaml
+++ b/.github/workflows/gallery-agent.yaml
@@ -2,7 +2,7 @@ name: Gallery Agent
 on:
  schedule:
-    - cron: '0 */12 * * *'  # Run every 12 hours
+    - cron: '0 */3 * * *'  # Run every 3 hours
  workflow_dispatch:
    inputs:
      search_term:
@@ -27,7 +27,6 @@ on:
        type: string
 jobs:
  gallery-agent:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
@@ -48,88 +47,21 @@ jobs:
          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
          PATH="$PATH:$HOME/go/bin" make protogen-go
-      - name: Process gallery-agent PR commands
+      - uses: mudler/localai-github-action@v1.1
-        env:
+        with:
-          GH_TOKEN: ${{ secrets.UPDATE_BOT_TOKEN }}
+          model: 'https://huggingface.co/bartowski/Qwen_Qwen3-1.7B-GGUF'
          REPO: ${{ github.repository }}
          SEARCH: 'gallery agent in:title'
        run: |
          # Walk gallery-agent PRs and act on maintainer comments:
          #   /gallery-agent blacklist → label `gallery-agent/blacklisted` + close (never repropose)
          #   /gallery-agent recreate  → close without label (next run may repropose)
          # Only comments from OWNER / MEMBER / COLLABORATOR are honored so
          # random users can't drive the bot.
          #
          # We scan both open PRs AND recently-closed PRs that don't already
          # carry the blacklist label. This covers the common flow where a
          # maintainer writes /gallery-agent blacklist and immediately clicks
          # Close — without this, the next scheduled run wouldn't see the
          # command (PR is already closed) and would repropose the model.
          gh label create gallery-agent/blacklisted \
            --repo "$REPO" --color ededed \
            --description "gallery-agent must not repropose this model" 2>/dev/null || true
          prs_open=$(gh pr list --repo "$REPO" --state open --search "$SEARCH" \
            --json number --jq '.[].number')
          # Closed PRs from the last 14 days that don't yet have the blacklist label.
          # Bounded window keeps the scan cheap while covering late-applied commands.
          since=$(date -u -d '14 days ago' +%Y-%m-%d)
          prs_closed=$(gh pr list --repo "$REPO" --state closed \
            --search "$SEARCH closed:>=$since -label:gallery-agent/blacklisted" \
            --json number --jq '.[].number')
          prs=$(printf '%s\n%s\n' "$prs_open" "$prs_closed" | sort -u | sed '/^$/d')
          for pr in $prs; do
            state=$(gh pr view "$pr" --repo "$REPO" --json state --jq '.state')
            cmds=$(gh pr view "$pr" --repo "$REPO" --json comments \
              --jq '.comments[] | select(.authorAssociation=="OWNER" or .authorAssociation=="MEMBER" or .authorAssociation=="COLLABORATOR") | .body')
            if echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+blacklist([[:space:]]|$)'; then
              echo "PR #$pr: blacklist command found (state=$state)"
              gh pr edit "$pr" --repo "$REPO" --add-label gallery-agent/blacklisted || true
              if [ "$state" = "OPEN" ]; then
                gh pr close "$pr" --repo "$REPO" --comment "Blacklisted via \`/gallery-agent blacklist\`. This model will not be reproposed." || true
              fi
            elif [ "$state" = "OPEN" ] && echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+recreate([[:space:]]|$)'; then
              echo "PR #$pr: recreate command found"
              gh pr close "$pr" --repo "$REPO" --comment "Closed via \`/gallery-agent recreate\`. The next scheduled run will propose this model again." || true
            fi
          done
      - name: Collect skip URLs for the gallery agent
        id: open_prs
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPO: ${{ github.repository }}
          SEARCH: 'gallery agent in:title'
        run: |
          # Skip set =
          #   URLs from any open gallery-agent PR (avoid duplicate PRs for the same model while one is pending)
          # + URLs from closed PRs carrying the `gallery-agent/blacklisted` label (hard blacklist)
          # Plain-closed PRs without the label are ignored — closing a PR is
          # not by itself a "never propose again" signal; maintainers must
          # opt in via the /gallery-agent blacklist comment command.
          urls_open=$(gh pr list --repo "$REPO" --state open --search "$SEARCH" \
            --json body --jq '[.[].body] | join("\n")' \
            | grep -oE 'https://huggingface\.co/[^ )]+' || true)
          urls_blacklist=$(gh pr list --repo "$REPO" --state closed --search "$SEARCH" \
            --label gallery-agent/blacklisted \
            --json body --jq '[.[].body] | join("\n")' \
            | grep -oE 'https://huggingface\.co/[^ )]+' || true)
          urls=$(printf '%s\n%s\n' "$urls_open" "$urls_blacklist" | sort -u | sed '/^$/d')
          echo "Skip URLs:"
          echo "$urls"
          {
            echo "urls<<EOF"
            echo "$urls"
            echo "EOF"
          } >> "$GITHUB_OUTPUT"
      - name: Run gallery agent
        env:
          #OPENAI_MODEL: ${{ secrets.OPENAI_MODEL }}
          OPENAI_MODE: Qwen_Qwen3-1.7B-GGUF
          OPENAI_BASE_URL: "http://localhost:8080"
          OPENAI_KEY: ${{ secrets.OPENAI_KEY }}
          #OPENAI_BASE_URL: ${{ secrets.OPENAI_BASE_URL }}
          SEARCH_TERM: ${{ github.event.inputs.search_term || 'GGUF' }}
          LIMIT: ${{ github.event.inputs.limit || '15' }}
          QUANTIZATION: ${{ github.event.inputs.quantization || 'Q4_K_M' }}
          MAX_MODELS: ${{ github.event.inputs.max_models || '1' }}
          EXTRA_SKIP_URLS: ${{ steps.open_prs.outputs.urls }}
        run: |
          export GALLERY_INDEX_PATH=$PWD/gallery/index.yaml
          go run ./.github/gallery-agent
@@ -191,21 +123,7 @@ jobs:
            **Added Models:**
            ${{ steps.read_summary.outputs.added_models || '- No models added' }}
-
+            
            ### Bot commands
            Maintainers (owner / member / collaborator) can control this PR
            by leaving a comment with one of:
            - `/gallery-agent recreate` — close this PR; the next scheduled
              run will propose this model again (useful if the entry needs
              to be regenerated with fresh metadata).
            - `/gallery-agent blacklist` — close this PR and permanently
              prevent the gallery agent from ever reproposing this model.
            Plain "Close" (without a command) is treated as a no-op: the
            model may be reproposed by a future run.
            **Workflow Details:**
            - Triggered by: `${{ github.event_name }}`
            - Run ID: `${{ github.run_id }}`
--- a/.github/workflows/generate_grpc_cache.yaml
+++ b/.github/workflows/generate_grpc_cache.yaml
@@ -0,0 +1,95 @@
 name: 'generate and publish GRPC docker caches'
 on:
  workflow_dispatch:
  schedule:
    # daily at midnight
    - cron: '0 0 * * *'
 concurrency:
  group: grpc-cache-${{ github.head_ref || github.ref }}-${{ github.repository }}
  cancel-in-progress: true
 jobs:
  generate_caches:
    strategy:
      matrix:
        include:
          - grpc-base-image: ubuntu:24.04
            runs-on: 'ubuntu-latest'
            platforms: 'linux/amd64,linux/arm64'
    runs-on: ${{matrix.runs-on}}
    steps:
      - name: Release space from worker
        if: matrix.runs-on == 'ubuntu-latest'
        run: |
          echo "Listing top largest packages"
          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
          head -n 30 <<< "${pkgs}"
          echo
          df -h
          echo
          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
          sudo apt-get remove --auto-remove android-sdk-platform-tools || true
          sudo apt-get purge --auto-remove android-sdk-platform-tools || true
          sudo rm -rf /usr/local/lib/android
          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
          sudo rm -rf /usr/share/dotnet
          sudo apt-get remove -y '^mono-.*' || true
          sudo apt-get remove -y '^ghc-.*' || true
          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
          sudo apt-get remove -y 'php.*' || true
          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
          sudo apt-get remove -y '^google-.*' || true
          sudo apt-get remove -y azure-cli || true
          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
          sudo apt-get remove -y '^gfortran-.*' || true
          sudo apt-get remove -y microsoft-edge-stable || true
          sudo apt-get remove -y firefox || true
          sudo apt-get remove -y powershell || true
          sudo apt-get remove -y r-base-core || true
          sudo apt-get autoremove -y
          sudo apt-get clean
          echo
          echo "Listing top largest packages"
          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
          head -n 30 <<< "${pkgs}"
          echo
          sudo rm -rfv build || true
          sudo rm -rf /usr/share/dotnet || true
          sudo rm -rf /opt/ghc || true
          sudo rm -rf "/usr/local/share/boost" || true
          sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
          df -h
      - name: Set up QEMU
        uses: docker/setup-qemu-action@master
        with:
          platforms: all
      - name: Set up Docker Buildx
        id: buildx
        uses: docker/setup-buildx-action@master
      - name: Checkout
        uses: actions/checkout@v6
      - name: Cache GRPC
        uses: docker/build-push-action@v6
        with:
          builder: ${{ steps.buildx.outputs.name }}
          # The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
          # This means that even the MAKEFLAGS have to be an EXACT match.
          # If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
          build-args: |
            GRPC_BASE_IMAGE=${{ matrix.grpc-base-image }}
            GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
            GRPC_VERSION=v1.65.0
          context: .
          file: ./Dockerfile
          cache-to: type=gha,ignore-error=true
          cache-from: type=gha
          target: grpc
          platforms: ${{ matrix.platforms }}
          push: false
--- a/.github/workflows/generate_intel_image.yaml
+++ b/.github/workflows/generate_intel_image.yaml
@@ -7,16 +7,15 @@ on:
      - master
 concurrency:
-  group: intel-cache-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
+  group: intel-cache-${{ github.head_ref || github.ref }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+  cancel-in-progress: true
 jobs:
  generate_caches:
    if: github.repository == 'mudler/LocalAI'
    strategy:
      matrix:
        include:
-          - base-image: intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04
+          - base-image: intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04
            runs-on: 'arc-runner-set'
            platforms: 'linux/amd64'
    runs-on: ${{matrix.runs-on}}
@@ -27,14 +26,14 @@ jobs:
          platforms: all
      - name: Login to DockerHub
        if: github.event_name != 'pull_request'
-        uses: docker/login-action@v4
+        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_PASSWORD }}
      - name: Login to quay
        if: github.event_name != 'pull_request'
-        uses: docker/login-action@v4
+        uses: docker/login-action@v3
        with:
          registry: quay.io
          username: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
@@ -47,7 +46,7 @@ jobs:
        uses: actions/checkout@v6
      - name: Cache Intel images
-        uses: docker/build-push-action@v7
+        uses: docker/build-push-action@v6
        with:
          builder: ${{ steps.buildx.outputs.name }}
          build-args: |
--- a/.github/workflows/gh-pages.yml
+++ b/.github/workflows/gh-pages.yml
@@ -1,75 +0,0 @@
 name: Deploy docs to GitHub Pages
 on:
  push:
    branches:
      - master
    paths:
      - 'docs/**'
      - 'gallery/**'
      - 'images/**'
      - '.github/ci/modelslist.go'
      - '.github/workflows/gh-pages.yml'
  workflow_dispatch:
 permissions:
  contents: read
  pages: write
  id-token: write
 concurrency:
  group: pages
  cancel-in-progress: false
 jobs:
  build:
    runs-on: ubuntu-latest
    env:
      HUGO_VERSION: "0.146.3"
    steps:
      - name: Checkout
        uses: actions/checkout@v6
        with:
          fetch-depth: 0  # needed for enableGitInfo
          submodules: true
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.22'
          cache: false
      - name: Setup Hugo
        uses: peaceiris/actions-hugo@v3
        with:
          hugo-version: ${{ env.HUGO_VERSION }}
          extended: true
      - name: Setup Pages
        id: pages
        uses: actions/configure-pages@v6
      - name: Generate gallery
        run: go run ./.github/ci/modelslist.go ./gallery/index.yaml > docs/static/gallery.html
      - name: Build site
        working-directory: docs
        run: |
          mkdir -p layouts/_default
          hugo --minify --baseURL "${{ steps.pages.outputs.base_url }}/"
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v5
        with:
          path: docs/public
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v5
--- a/.github/workflows/image-pr.yml
+++ b/.github/workflows/image-pr.yml
@@ -5,8 +5,8 @@
    pull_request:
  concurrency:
-    group: ci-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
+    group: ci-${{ github.head_ref || github.ref }}-${{ github.repository }}
-    cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+    cancel-in-progress: true
  jobs:
    image-build:
@@ -18,9 +18,9 @@
        cuda-major-version: ${{ matrix.cuda-major-version }}
        cuda-minor-version: ${{ matrix.cuda-minor-version }}
        platforms: ${{ matrix.platforms }}
        platform-tag: ${{ matrix.platform-tag || '' }}
        runs-on: ${{ matrix.runs-on }}
        base-image: ${{ matrix.base-image }}
        grpc-base-image: ${{ matrix.grpc-base-image }}
        makeflags: ${{ matrix.makeflags }}
        ubuntu-version: ${{ matrix.ubuntu-version }}
      secrets:
@@ -59,36 +59,28 @@
              platforms: 'linux/amd64'
              tag-latest: 'false'
              tag-suffix: '-hipblas'
-              base-image: "rocm/dev-ubuntu-24.04:7.2.1"
+              base-image: "rocm/dev-ubuntu-24.04:6.4.4"
              grpc-base-image: "ubuntu:24.04"
              runs-on: 'ubuntu-latest'
              makeflags: "--jobs=3 --output-sync=target"
              ubuntu-version: '2404'
            - build-type: 'sycl'
              platforms: 'linux/amd64'
              tag-latest: 'false'
-              base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
+              base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
              grpc-base-image: "ubuntu:24.04"
              tag-suffix: 'sycl'
              runs-on: 'ubuntu-latest'
              makeflags: "--jobs=3 --output-sync=target"
              ubuntu-version: '2404'
            - build-type: 'vulkan'
-              platforms: 'linux/amd64'
+              platforms: 'linux/amd64,linux/arm64'
              platform-tag: 'amd64'
              tag-latest: 'false'
              tag-suffix: '-vulkan-core'
              runs-on: 'ubuntu-latest'
              base-image: "ubuntu:24.04"
              makeflags: "--jobs=4 --output-sync=target"
              ubuntu-version: '2404'
            - build-type: 'vulkan'
              platforms: 'linux/arm64'
              platform-tag: 'arm64'
              tag-latest: 'false'
              tag-suffix: '-vulkan-core'
              runs-on: 'ubuntu-24.04-arm'
              base-image: "ubuntu:24.04"
              makeflags: "--jobs=4 --output-sync=target"
              ubuntu-version: '2404'
            - build-type: 'cublas'
              cuda-major-version: "13"
              cuda-minor-version: "0"
--- a/.github/workflows/image.yml
+++ b/.github/workflows/image.yml
@@ -9,12 +9,11 @@
        - '*'
  concurrency:
-    group: ci-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
+    group: ci-${{ github.head_ref || github.ref }}-${{ github.repository }}
-    cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+    cancel-in-progress: true
  jobs:
    hipblas-jobs:
      if: github.repository == 'mudler/LocalAI'
      uses: ./.github/workflows/image_build.yml
      with:
        tag-latest: ${{ matrix.tag-latest }}
@@ -25,6 +24,8 @@
        platforms: ${{ matrix.platforms }}
        runs-on: ${{ matrix.runs-on }}
        base-image: ${{ matrix.base-image }}
        grpc-base-image: ${{ matrix.grpc-base-image }}
        aio: ${{ matrix.aio }}
        makeflags: ${{ matrix.makeflags }}
        ubuntu-version: ${{ matrix.ubuntu-version }}
        ubuntu-codename: ${{ matrix.ubuntu-codename }}
@@ -40,14 +41,15 @@
              platforms: 'linux/amd64'
              tag-latest: 'auto'
              tag-suffix: '-gpu-hipblas'
-              base-image: "rocm/dev-ubuntu-24.04:7.2.1"
+              base-image: "rocm/dev-ubuntu-24.04:6.4.4"
              grpc-base-image: "ubuntu:24.04"
              runs-on: 'ubuntu-latest'
              makeflags: "--jobs=3 --output-sync=target"
              aio: "-aio-gpu-hipblas"
              ubuntu-version: '2404'
              ubuntu-codename: 'noble'
-
+  
    core-image-build:
      if: github.repository == 'mudler/LocalAI'
      uses: ./.github/workflows/image_build.yml
      with:
        tag-latest: ${{ matrix.tag-latest }}
@@ -56,9 +58,10 @@
        cuda-major-version: ${{ matrix.cuda-major-version }}
        cuda-minor-version: ${{ matrix.cuda-minor-version }}
        platforms: ${{ matrix.platforms }}
        platform-tag: ${{ matrix.platform-tag || '' }}
        runs-on: ${{ matrix.runs-on }}
        aio: ${{ matrix.aio }}
        base-image: ${{ matrix.base-image }}
        grpc-base-image: ${{ matrix.grpc-base-image }}
        makeflags: ${{ matrix.makeflags }}
        skip-drivers: ${{ matrix.skip-drivers }}
        ubuntu-version: ${{ matrix.ubuntu-version }}
@@ -73,23 +76,12 @@
        matrix:
          include:
            - build-type: ''
-              platforms: 'linux/amd64'
+              platforms: 'linux/amd64,linux/arm64'
              platform-tag: 'amd64'
              tag-latest: 'auto'
              tag-suffix: ''
              base-image: "ubuntu:24.04"
              runs-on: 'ubuntu-latest'
-              makeflags: "--jobs=4 --output-sync=target"
+              aio: "-aio-cpu"
              skip-drivers: 'false'
              ubuntu-version: '2404'
              ubuntu-codename: 'noble'
            - build-type: ''
              platforms: 'linux/arm64'
              platform-tag: 'arm64'
              tag-latest: 'auto'
              tag-suffix: ''
              base-image: "ubuntu:24.04"
              runs-on: 'ubuntu-24.04-arm'
              makeflags: "--jobs=4 --output-sync=target"
              skip-drivers: 'false'
              ubuntu-version: '2404'
@@ -104,6 +96,7 @@
              base-image: "ubuntu:24.04"
              skip-drivers: 'false'
              makeflags: "--jobs=4 --output-sync=target"
              aio: "-aio-gpu-nvidia-cuda-12"
              ubuntu-version: '2404'
              ubuntu-codename: 'noble'
            - build-type: 'cublas'
@@ -116,156 +109,33 @@
              base-image: "ubuntu:22.04"
              skip-drivers: 'false'
              makeflags: "--jobs=4 --output-sync=target"
              aio: "-aio-gpu-nvidia-cuda-13"
              ubuntu-version: '2404'
              ubuntu-codename: 'noble'
            - build-type: 'vulkan'
-              platforms: 'linux/amd64'
+              platforms: 'linux/amd64,linux/arm64'
              platform-tag: 'amd64'
              tag-latest: 'auto'
              tag-suffix: '-gpu-vulkan'
              runs-on: 'ubuntu-latest'
              base-image: "ubuntu:24.04"
              skip-drivers: 'false'
              makeflags: "--jobs=4 --output-sync=target"
-              ubuntu-version: '2404'
+              aio: "-aio-gpu-vulkan"
              ubuntu-codename: 'noble'
            - build-type: 'vulkan'
              platforms: 'linux/arm64'
              platform-tag: 'arm64'
              tag-latest: 'auto'
              tag-suffix: '-gpu-vulkan'
              runs-on: 'ubuntu-24.04-arm'
              base-image: "ubuntu:24.04"
              skip-drivers: 'false'
              makeflags: "--jobs=4 --output-sync=target"
              ubuntu-version: '2404'
              ubuntu-codename: 'noble'
            - build-type: 'intel'
              platforms: 'linux/amd64'
              tag-latest: 'auto'
-              base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
+              base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
              grpc-base-image: "ubuntu:24.04"
              tag-suffix: '-gpu-intel'
              runs-on: 'ubuntu-latest'
              makeflags: "--jobs=3 --output-sync=target"
              aio: "-aio-gpu-intel"
              ubuntu-version: '2404'
              ubuntu-codename: 'noble'
-
+  
    core-image-merge:
      # !cancelled(): without it, GHA's default `needs:` cascade skips the
      # merge whenever any matrix cell of the parent build fails or is
      # cancelled. Same fix as backend.yml's merge jobs — we still want to
      # publish the manifest list for tag-suffixes whose legs all succeeded.
      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: core-image-build
      uses: ./.github/workflows/image_merge.yml
      with:
        tag-latest: 'auto'
        tag-suffix: ''
      secrets:
        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    gpu-vulkan-image-merge:
      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: core-image-build
      uses: ./.github/workflows/image_merge.yml
      with:
        tag-latest: 'auto'
        tag-suffix: '-gpu-vulkan'
      secrets:
        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    # Single-arch server-image merges. Same conceptual fix as the backend
    # singletons in PR #9781: image_build.yml pushes by canonical digest
    # only, so without a downstream merge step there's no tag for consumers
    # (no :latest-gpu-nvidia-cuda-12, no :v<X>-gpu-nvidia-cuda-12, etc.).
    # Each merge job needs only its parent build matrix and is filtered by
    # tag-suffix in image_merge.yml's artifact-download pattern.
    gpu-nvidia-cuda-12-image-merge:
      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: core-image-build
      uses: ./.github/workflows/image_merge.yml
      with:
        tag-latest: 'auto'
        tag-suffix: '-gpu-nvidia-cuda-12'
      secrets:
        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    gpu-nvidia-cuda-13-image-merge:
      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: core-image-build
      uses: ./.github/workflows/image_merge.yml
      with:
        tag-latest: 'auto'
        tag-suffix: '-gpu-nvidia-cuda-13'
      secrets:
        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    gpu-intel-image-merge:
      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: core-image-build
      uses: ./.github/workflows/image_merge.yml
      with:
        tag-latest: 'auto'
        tag-suffix: '-gpu-intel'
      secrets:
        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    gpu-hipblas-image-merge:
      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: hipblas-jobs
      uses: ./.github/workflows/image_merge.yml
      with:
        tag-latest: 'auto'
        tag-suffix: '-gpu-hipblas'
      secrets:
        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    nvidia-l4t-arm64-image-merge:
      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: gh-runner
      uses: ./.github/workflows/image_merge.yml
      with:
        tag-latest: 'auto'
        tag-suffix: '-nvidia-l4t-arm64'
      secrets:
        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    nvidia-l4t-arm64-cuda-13-image-merge:
      if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
      needs: gh-runner
      uses: ./.github/workflows/image_merge.yml
      with:
        tag-latest: 'auto'
        tag-suffix: '-nvidia-l4t-arm64-cuda-13'
      secrets:
        dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
        dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
        quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
        quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    gh-runner:
      if: github.repository == 'mudler/LocalAI'
      uses: ./.github/workflows/image_build.yml
      with:
        tag-latest: ${{ matrix.tag-latest }}
@@ -275,7 +145,9 @@
        cuda-minor-version: ${{ matrix.cuda-minor-version }}
        platforms: ${{ matrix.platforms }}
        runs-on: ${{ matrix.runs-on }}
        aio: ${{ matrix.aio }}
        base-image: ${{ matrix.base-image }}
        grpc-base-image: ${{ matrix.grpc-base-image }}
        makeflags: ${{ matrix.makeflags }}
        skip-drivers: ${{ matrix.skip-drivers }}
        ubuntu-version: ${{ matrix.ubuntu-version }}
--- a/.github/workflows/image_build.yml
+++ b/.github/workflows/image_build.yml
@@ -8,6 +8,11 @@ on:
        description: 'Base image'
        required: true
        type: string
      grpc-base-image:
        description: 'GRPC Base image, must be a compatible image with base-image'
        required: false
        default: ''
        type: string
      build-type:
        description: 'Build type'
        default: ''
@@ -24,15 +29,6 @@ on:
        description: 'Platforms'
        default: ''
        type: string
      platform-tag:
        description: |
          Short tag identifying the platform leg, e.g. "amd64" or "arm64".
          Used to scope the per-arch registry cache and the digest artifact name.
          Optional during the migration; will be flipped to required: true once
          every caller passes an explicit value.
        required: false
        default: ''
        type: string
      tag-latest:
        description: 'Tag latest'
        default: ''
@@ -55,6 +51,11 @@ on:
        required: false
        default: '--jobs=4 --output-sync=target'
        type: string
      aio:
        description: 'AIO Image Name'
        required: false
        default: ''
        type: string
      ubuntu-version:
        description: 'Ubuntu version'
        required: false
@@ -79,25 +80,78 @@ jobs:
    runs-on: ${{ inputs.runs-on }}
    steps:
      - name: Free Disk Space (Ubuntu)
        if: inputs.runs-on == 'ubuntu-latest'
        uses: jlumbroso/free-disk-space@main
        with:
          # this might remove tools that are actually needed,
          # if set to "true" but frees about 6 GB
          tool-cache: true
          # all of these default to true, but feel free to set to
          # "false" if necessary for your workflow
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          docker-images: true
          swap-storage: true
      - name: Force Install GIT latest
        run: |
          sudo apt-get update \
          && sudo apt-get install -y software-properties-common \
          && sudo apt-get update \
          && sudo add-apt-repository -y ppa:git-core/ppa \
          && sudo apt-get update \
          && sudo apt-get install -y git
      - name: Checkout
        uses: actions/checkout@v6
-      - name: Configure apt mirror on runner
-        id: apt_mirror
-        uses: ./.github/actions/configure-apt-mirror
-
-      - name: Free disk space
-        uses: ./.github/actions/free-disk-space
-        with:
-          mode: ${{ inputs.runs-on == 'ubuntu-latest' && 'hosted' || 'skip' }}
-
-      - name: Set up build disk
-        uses: ./.github/actions/setup-build-disk
+      - name: Release space from worker
+        if: inputs.runs-on == 'ubuntu-latest'
+        run: |
+          echo "Listing top largest packages"
+          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
+          head -n 30 <<< "${pkgs}"
+          echo
+          df -h
+          echo
+          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
+          sudo apt-get remove --auto-remove android-sdk-platform-tools snapd || true
          sudo apt-get purge --auto-remove android-sdk-platform-tools snapd || true
          sudo rm -rf /usr/local/lib/android
          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
          sudo rm -rf /usr/share/dotnet
          sudo apt-get remove -y '^mono-.*' || true
          sudo apt-get remove -y '^ghc-.*' || true
          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
          sudo apt-get remove -y 'php.*' || true
          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
          sudo apt-get remove -y '^google-.*' || true
          sudo apt-get remove -y azure-cli || true
          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
          sudo apt-get remove -y '^gfortran-.*' || true
          sudo apt-get remove -y microsoft-edge-stable || true
          sudo apt-get remove -y firefox || true
          sudo apt-get remove -y powershell || true
          sudo apt-get remove -y r-base-core || true
          sudo apt-get autoremove -y
          sudo apt-get clean
          echo
          echo "Listing top largest packages"
          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
          head -n 30 <<< "${pkgs}"
          echo
          sudo rm -rfv build || true
          sudo rm -rf /usr/share/dotnet || true
          sudo rm -rf /opt/ghc || true
          sudo rm -rf "/usr/local/share/boost" || true
          sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
          df -h
      - name: Docker meta
        id: meta
        if: github.event_name != 'pull_request'
-        uses: docker/metadata-action@v6
+        uses: docker/metadata-action@v5
        with:
          images: |
            quay.io/go-skynet/local-ai
@@ -106,14 +160,13 @@ jobs:
            type=ref,event=branch
            type=semver,pattern={{raw}}
            type=sha
            type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
          flavor: |
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.tag-suffix }},onlatest=true
      - name: Docker meta for PR
        id: meta_pull_request
        if: github.event_name == 'pull_request'
-        uses: docker/metadata-action@v6
+        uses: docker/metadata-action@v5
        with:
          images: |
            quay.io/go-skynet/ci-tests
@@ -124,6 +177,34 @@ jobs:
          flavor: |
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.tag-suffix }}
      - name: Docker meta AIO (quay.io)
        if: inputs.aio != ''
        id: meta_aio
        uses: docker/metadata-action@v5
        with:
          images: |
            quay.io/go-skynet/local-ai
          tags: |
            type=ref,event=branch
            type=semver,pattern={{raw}}
          flavor: |
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.aio }},onlatest=true
      - name: Docker meta AIO (dockerhub)
        if: inputs.aio != ''
        id: meta_aio_dockerhub
        uses: docker/metadata-action@v5
        with:
          images: |
            localai/localai
          tags: |
            type=ref,event=branch
            type=semver,pattern={{raw}}
          flavor: |
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.aio }},onlatest=true
      - name: Set up QEMU
        uses: docker/setup-qemu-action@master
        with:
@@ -135,107 +216,112 @@ jobs:
      - name: Login to DockerHub
        if: github.event_name != 'pull_request'
-        uses: docker/login-action@v4
+        uses: docker/login-action@v3
        with:
          username: ${{ secrets.dockerUsername }}
          password: ${{ secrets.dockerPassword }}
      - name: Login to Quay.io
        if: github.event_name != 'pull_request'
-        uses: docker/login-action@v4
+        uses: docker/login-action@v3
        with:
          registry: quay.io
          username: ${{ secrets.quayUsername }}
          password: ${{ secrets.quayPassword }}
-      - name: Build and push by digest
-        id: build
+      - name: Build and push
+        uses: docker/build-push-action@v6
        uses: docker/build-push-action@v7
        if: github.event_name != 'pull_request'
        with:
          builder: ${{ steps.buildx.outputs.name }}
          # The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
          # This means that even the MAKEFLAGS have to be an EXACT match.
          # If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
          # This is why some build args like GRPC_VERSION and GRPC_MAKEFLAGS are hardcoded
          build-args: |
            BUILD_TYPE=${{ inputs.build-type }}
            CUDA_MAJOR_VERSION=${{ inputs.cuda-major-version }}
            CUDA_MINOR_VERSION=${{ inputs.cuda-minor-version }}
            BASE_IMAGE=${{ inputs.base-image }}
            GRPC_BASE_IMAGE=${{ inputs.grpc-base-image || inputs.base-image }}
            GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
            GRPC_VERSION=v1.65.0
            MAKEFLAGS=${{ inputs.makeflags }}
            SKIP_DRIVERS=${{ inputs.skip-drivers }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
            UBUNTU_CODENAME=${{ inputs.ubuntu-codename }}
            APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
            APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
          context: .
          file: ./Dockerfile
-          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
+          cache-from: type=gha
          cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}-${{ inputs.platform-tag }},mode=max,ignore-error=true
          platforms: ${{ inputs.platforms }}
-          outputs: |
-            type=image,name=quay.io/go-skynet/local-ai,push-by-digest=true,name-canonical=true,push=true
+          push: ${{ github.event_name != 'pull_request' }}
+          tags: ${{ steps.meta.outputs.tags }}
            type=image,name=localai/localai,push-by-digest=true,name-canonical=true,push=true
          # See backend_build.yml for the rationale — provenance=mode=max
          # diverges the manifest-list digest per registry, breaking the
          # downstream imagetools create lookup.
          provenance: false
          labels: ${{ steps.meta.outputs.labels }}
      - name: Export digest
        if: github.event_name != 'pull_request'
        run: |
          mkdir -p /tmp/digests
          digest="${{ steps.build.outputs.digest }}"
          touch "/tmp/digests/${digest#sha256:}"
      # See .github/scripts/anchor-digest-in-cache.sh for why this is needed
      # and how it interacts with image_merge.yml's cleanup step. Mirrors the
      # same anchor in backend_build.yml — quay's per-repo manifest GC reaps
      # untagged manifests in local-ai before the merge runs.
      - name: Anchor digest in ci-cache so quay GC won't reap before merge
        if: github.event_name != 'pull_request'
        env:
          TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
          PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
          DIGEST: ${{ steps.build.outputs.digest }}
          SOURCE_IMAGE: quay.io/go-skynet/local-ai
        run: .github/scripts/anchor-digest-in-cache.sh
      - name: Upload digest artifact
        if: github.event_name != 'pull_request'
        uses: actions/upload-artifact@v7
        with:
          # `--` separator + 'single' placeholder for empty platform-tag —
          # same pattern as backend_build.yml. Prevents prefix collisions
          # in the merge-side glob (e.g. -nvidia-l4t-arm64 is a prefix of
          # -nvidia-l4t-arm64-cuda-13).
          name: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
          path: /tmp/digests/*
          if-no-files-found: error
          retention-days: 1
 ## Start testing image
      - name: Build and push
-        uses: docker/build-push-action@v7
+        uses: docker/build-push-action@v6
        if: github.event_name == 'pull_request'
        with:
          builder: ${{ steps.buildx.outputs.name }}
          # The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
          # This means that even the MAKEFLAGS have to be an EXACT match.
          # If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
          # This is why some build args like GRPC_VERSION and GRPC_MAKEFLAGS are hardcoded
          build-args: |
            BUILD_TYPE=${{ inputs.build-type }}
            CUDA_MAJOR_VERSION=${{ inputs.cuda-major-version }}
            CUDA_MINOR_VERSION=${{ inputs.cuda-minor-version }}
            BASE_IMAGE=${{ inputs.base-image }}
            GRPC_BASE_IMAGE=${{ inputs.grpc-base-image || inputs.base-image }}
            GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
            GRPC_VERSION=v1.65.0
            MAKEFLAGS=${{ inputs.makeflags }}
            SKIP_DRIVERS=${{ inputs.skip-drivers }}
            UBUNTU_VERSION=${{ inputs.ubuntu-version }}
            UBUNTU_CODENAME=${{ inputs.ubuntu-codename }}
            APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
            APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
          context: .
          file: ./Dockerfile
-          cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
+          cache-from: type=gha
          platforms: ${{ inputs.platforms }}
          #push: true
          tags: ${{ steps.meta_pull_request.outputs.tags }}
          labels: ${{ steps.meta_pull_request.outputs.labels }}
 ## End testing image
      - name: Build and push AIO image
        if: inputs.aio != ''
        uses: docker/build-push-action@v6
        with:
          builder: ${{ steps.buildx.outputs.name }}
          build-args: |
            BASE_IMAGE=quay.io/go-skynet/local-ai:${{ steps.meta.outputs.version }}
            MAKEFLAGS=${{ inputs.makeflags }}
          context: .
          file: ./Dockerfile.aio
          platforms: ${{ inputs.platforms }}
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.meta_aio.outputs.tags }}
          labels: ${{ steps.meta_aio.outputs.labels }}
      - name: Build and push AIO image (dockerhub)
        if: inputs.aio != ''
        uses: docker/build-push-action@v6
        with:
          builder: ${{ steps.buildx.outputs.name }}
          build-args: |
            BASE_IMAGE=localai/localai:${{ steps.meta.outputs.version }}
            MAKEFLAGS=${{ inputs.makeflags }}
          context: .
          file: ./Dockerfile.aio
          platforms: ${{ inputs.platforms }}
          push: ${{ github.event_name != 'pull_request' }}
          tags: ${{ steps.meta_aio_dockerhub.outputs.tags }}
          labels: ${{ steps.meta_aio_dockerhub.outputs.labels }}
      - name: job summary
        run: |
          echo "Built image: ${{ steps.meta.outputs.labels }}" >> $GITHUB_STEP_SUMMARY
      - name: job summary (AIO)
        if: inputs.aio != ''
        run: |
          echo "Built image: ${{ steps.meta_aio.outputs.labels }}" >> $GITHUB_STEP_SUMMARY
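The "Export digest" and "Upload digest artifact" steps in the image_build.yml hunk above rely on two small shell and naming tricks: `${digest#sha256:}` turns the build digest into a bare file name, and the `--` separator in the artifact name keeps a glob for one tag suffix from matching a sibling suffix that merely shares its prefix. The following is a minimal sketch of both, not the workflow's own code. The digest value and the `arm64` platform tag are hypothetical, and plain shell globbing is used here as an approximation of the pattern matching that `actions/download-artifact` performs.

```shell
set -eu

# Hypothetical digest, standing in for steps.build.outputs.digest.
digest="sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"

workdir="$(mktemp -d)"
mkdir -p "$workdir/digests" "$workdir/artifacts"

# ${digest#sha256:} removes the leading "sha256:" prefix, so the empty
# marker file is named after the bare hex digest.
touch "$workdir/digests/${digest#sha256:}"
digest_file="$(ls -1 "$workdir/digests")"

# Two artifact names shaped like the upload step's `name:` expression,
# for sibling tag suffixes where one is a prefix of the other.
# The "arm64" platform tag is an assumption for illustration.
touch "$workdir/artifacts/digests-localai-nvidia-l4t-arm64--arm64"
touch "$workdir/artifacts/digests-localai-nvidia-l4t-arm64-cuda-13--arm64"

cd "$workdir/artifacts"

# A pattern ending in a single "-" also picks up the -cuda-13 sibling.
set -- digests-localai-nvidia-l4t-arm64-*
loose_count=$#

# Anchoring the pattern on "--" matches only the intended tag suffix.
set -- digests-localai-nvidia-l4t-arm64--*
anchored_count=$#

echo "marker=$digest_file loose=$loose_count anchored=$anchored_count"
```

The marker files carry no content; only their names matter, which is why the build side can create them with `touch` and the merge side can rebuild image references from a directory listing.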
--- a/.github/workflows/image_merge.yml
+++ b/.github/workflows/image_merge.yml
@@ -1,146 +0,0 @@
 ---
 name: 'merge LocalAI image manifest list (reusable)'
 # Reusable workflow that joins per-arch digest artifacts (uploaded by
 # image_build.yml when called with platform-tag) into a single tagged
 # multi-arch manifest list.
 on:
  workflow_call:
    inputs:
      tag-latest:
        description: 'Whether the manifest list should also be tagged latest (auto/false/true)'
        required: false
        type: string
        default: ''
      tag-suffix:
        description: 'Image tag suffix (empty for core image). Used in artifact pattern with a -core placeholder for empty.'
        required: true
        type: string
    secrets:
      dockerUsername:
        required: false
      dockerPassword:
        required: false
      quayUsername:
        required: true
      quayPassword:
        required: true
 jobs:
  merge:
    runs-on: ubuntu-latest
    env:
      quay_username: ${{ secrets.quayUsername }}
    steps:
      # Sparse checkout: needed for .github/scripts/ (the keepalive cleanup
      # script). Skips the rest of the source tree.
      - name: Checkout (.github/scripts only)
        uses: actions/checkout@v6
        with:
          sparse-checkout: |
            .github/scripts
          sparse-checkout-cone-mode: false
      - name: Download digests
        uses: actions/download-artifact@v8
        with:
          # `--` separator anchors the glob so we don't over-match sibling
          # tag-suffixes (e.g. -nvidia-l4t-arm64 vs -nvidia-l4t-arm64-cuda-13).
          # Must stay in sync with image_build.yml's upload-artifact name.
          pattern: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--*
          merge-multiple: true
          path: /tmp/digests
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@master
      - name: Login to DockerHub
        if: github.event_name != 'pull_request'
        uses: docker/login-action@v4
        with:
          username: ${{ secrets.dockerUsername }}
          password: ${{ secrets.dockerPassword }}
      - name: Login to Quay.io
        uses: docker/login-action@v4
        with:
          registry: quay.io
          username: ${{ secrets.quayUsername }}
          password: ${{ secrets.quayPassword }}
      - name: Docker meta
        id: meta
        uses: docker/metadata-action@v6
        with:
          images: |
            quay.io/go-skynet/local-ai
            localai/localai
          tags: |
            type=ref,event=branch
            type=semver,pattern={{raw}}
            type=sha
            type=raw,value={{branch}}-{{date 'X'}}-{{sha}},enable={{is_default_branch}}
          flavor: |
            latest=${{ inputs.tag-latest }}
            suffix=${{ inputs.tag-suffix }},onlatest=true
      # Source from ci-cache, not local-ai. See backend_merge.yml for the
      # detailed rationale — quay's manifest GC is per-repository, so the
      # untagged digest in local-ai gets reaped while the same content lives
      # tagged under ci-cache (anchored by image_build.yml). buildx imagetools
      # create copies the manifest into local-ai (blobs already cross-mounted)
      # and publishes the manifest list with user-facing tags. End state in
      # local-ai is self-contained; no embedded reference to ci-cache.
      - name: Create manifest list and push (quay)
        working-directory: /tmp/digests
        run: |
          set -euo pipefail
          tags=$(jq -cr '.tags | map(select(startswith("quay.io/"))) | map("-t " + .) | join(" ")' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          if [ -z "$tags" ]; then
            echo "No quay.io tags from docker/metadata-action; skipping quay merge"
          else
            # shellcheck disable=SC2086
            docker buildx imagetools create $tags \
              $(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
          fi
      - name: Create manifest list and push (dockerhub)
        if: github.event_name != 'pull_request'
        working-directory: /tmp/digests
        run: |
          set -euo pipefail
          tags=$(jq -cr '.tags | map(select(startswith("localai/"))) | map("-t " + .) | join(" ")' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          if [ -z "$tags" ]; then
            echo "No dockerhub tags from docker/metadata-action; skipping dockerhub merge"
          else
            # shellcheck disable=SC2086
            docker buildx imagetools create $tags \
              $(printf 'localai/localai@sha256:%s ' *)
          fi
      - name: Inspect manifest
        run: |
          set -euo pipefail
          first_tag=$(jq -cr '.tags[0]' <<< "$DOCKER_METADATA_OUTPUT_JSON")
          if [ -n "$first_tag" ] && [ "$first_tag" != "null" ]; then
            docker buildx imagetools inspect "$first_tag"
          fi
      # See .github/scripts/cleanup-keepalive-tags.sh for the best-effort
      # semantics — fails soft when the registry credential isn't OAuth-scoped.
      - name: Cleanup keepalive tags in ci-cache
        if: github.event_name != 'pull_request' && success()
        env:
          TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
          QUAY_TOKEN: ${{ secrets.quayPassword }}
        run: .github/scripts/cleanup-keepalive-tags.sh
      - name: Job summary
        run: |
          set -euo pipefail
          echo "Merged manifest tags:" >> "$GITHUB_STEP_SUMMARY"
          jq -r '.tags[]' <<< "$DOCKER_METADATA_OUTPUT_JSON" | sed 's/^/- /' >> "$GITHUB_STEP_SUMMARY"
          echo >> "$GITHUB_STEP_SUMMARY"
          echo "Per-arch digests:" >> "$GITHUB_STEP_SUMMARY"
          ls -1 /tmp/digests | sed 's/^/- sha256:/' >> "$GITHUB_STEP_SUMMARY"
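The two "Create manifest list and push" steps in image_merge.yml above build their source references with `printf '...@sha256:%s ' *`, run from the directory of downloaded digest marker files. A rough sketch of how that expansion behaves is shown here. The digest file names are shortened, hypothetical placeholders, and the empty-directory case is included only to illustrate why a missing artifact should fail earlier in the pipeline rather than reach this command.

```shell
set -eu

# Directory standing in for /tmp/digests, with hypothetical (shortened)
# digest marker files. Real names would be 64-character hex digests.
digests_dir="$(mktemp -d)"
cd "$digests_dir"
touch aaaa1111 bbbb2222

# printf reuses its format string for every argument, so the glob expands
# to one image reference per marker file, each followed by a space.
refs="$(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)"
echo "refs: $refs"

# In an empty directory the unmatched glob stays a literal "*", which
# would produce an invalid reference.
empty_dir="$(mktemp -d)"
cd "$empty_dir"
empty_refs="$(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)"
echo "empty: $empty_refs"
```

Because the result is a space-separated list, the workflow passes it unquoted on purpose, which is consistent with the `shellcheck disable=SC2086` comment next to the `imagetools create` call.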
--- a/.github/workflows/disabled/labeler.yml
+++ b/.github/workflows/disabled/labeler.yml
--- a/.github/workflows/lint.yml
+++ b/.github/workflows/lint.yml
@@ -1,48 +0,0 @@
 ---
 name: 'lint'
 on:
  pull_request:
    paths-ignore:
      - 'docs/**'
      - 'examples/**'
      - 'README.md'
      - '**/*.md'
  push:
    branches:
      - master
 concurrency:
  group: ci-lint-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
 jobs:
  golangci-lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
        with:
          # Full history so golangci-lint's new-from-merge-base can reach
          # origin/master and compute the diff against it.
          fetch-depth: 0
      - uses: actions/setup-go@v5
        with:
          go-version: '1.26.x'
          cache: false
      - name: install golangci-lint
        run: |
          curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh \
            | sh -s -- -b "$(go env GOPATH)/bin" v2.11.4
      - name: generate grpc proto sources
        # pkg/grpc/proto/*.go is generated, not checked in. Several packages
        # import it, so without this step typecheck fails project-wide.
        run: make protogen-go
      - name: stub react-ui dist for go:embed
        # core/http/app.go has //go:embed react-ui/dist/*; the glob needs at
        # least one non-hidden entry to satisfy typecheck. We don't run
        # `make react-ui` here because lint doesn't need the real bundle.
        run: |
          mkdir -p core/http/react-ui/dist
          touch core/http/react-ui/dist/index.html
      - name: lint
        run: make lint
--- a/.github/workflows/disabled/localaibot_automerge.yml
+++ b/.github/workflows/disabled/localaibot_automerge.yml
@@ -10,8 +10,8 @@ permissions:
  actions: write # to dispatch publish workflow
 jobs:
  dependabot:
    if: github.repository == 'mudler/LocalAI' && github.actor == 'localai-bot' && contains(github.event.pull_request.title, 'chore:')
    runs-on: ubuntu-latest
    if: ${{ github.actor == 'localai-bot' && !contains(github.event.pull_request.title, 'chore(model gallery):') }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6
--- a/.github/workflows/disabled/notify-models.yaml
+++ b/.github/workflows/disabled/notify-models.yaml
@@ -10,7 +10,7 @@ permissions:
 jobs:
  notify-discord:
-    if: github.repository == 'mudler/LocalAI' && (github.event.pull_request.merged == true) && (contains(github.event.pull_request.labels.*.name, 'area/ai-model'))
+    if: ${{ (github.event.pull_request.merged == true) && (contains(github.event.pull_request.labels.*.name, 'area/ai-model')) }}
    env:
        MODEL_NAME: gemma-3-12b-it-qat
    runs-on: ubuntu-latest
@@ -90,7 +90,7 @@ jobs:
        connect-timeout-seconds: 180
        limit-access-to-actor: true
  notify-twitter:
-    if: github.repository == 'mudler/LocalAI' && (github.event.pull_request.merged == true) && (contains(github.event.pull_request.labels.*.name, 'area/ai-model'))
+    if: ${{ (github.event.pull_request.merged == true) && (contains(github.event.pull_request.labels.*.name, 'area/ai-model')) }}
    env:
        MODEL_NAME: gemma-3-12b-it-qat
    runs-on: ubuntu-latest
--- a/.github/workflows/notify-releases.yaml
+++ b/.github/workflows/notify-releases.yaml
@@ -6,7 +6,6 @@ on:
 jobs:
  notify-discord:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    env:
        RELEASE_BODY: ${{ github.event.release.body }}
--- a/.github/workflows/disabled/prlint.yaml
+++ b/.github/workflows/disabled/prlint.yaml
--- a/.github/workflows/release.yaml
+++ b/.github/workflows/release.yaml
@@ -18,7 +18,7 @@ jobs:
        with:
          go-version: 1.23
      - name: Run GoReleaser
-        uses: goreleaser/goreleaser-action@v7
+        uses: goreleaser/goreleaser-action@v6
        with:
          version: v2.11.0
          args: release --clean
@@ -39,7 +39,7 @@ jobs:
        run: |
          make build-launcher-darwin
      - name: Upload DMG to Release
-        uses: softprops/action-gh-release@v3
+        uses: softprops/action-gh-release@v2
        with:
          files: ./dist/LocalAI.dmg
  launcher-build-linux:
@@ -49,8 +49,6 @@ jobs:
        uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - name: Configure apt mirror on runner
        uses: ./.github/actions/configure-apt-mirror
      - name: Set up Go
        uses: actions/setup-go@v5
        with:
@@ -61,6 +59,6 @@ jobs:
          sudo apt-get install golang gcc libgl1-mesa-dev xorg-dev libxkbcommon-dev
          make build-launcher-linux
      - name: Upload Linux launcher artifacts
-        uses: softprops/action-gh-release@v3
+        uses: softprops/action-gh-release@v2
        with:
          files: ./local-ai-launcher-linux.tar.xz
--- a/.github/workflows/stalebot.yml
+++ b/.github/workflows/stalebot.yml
@@ -8,10 +8,9 @@ on:
 jobs:
  stale:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/stale@b5d41d4e1d5dceea10e7104786b73624c18a190f # v9
+      - uses: actions/stale@997185467fa4f803885201cee163a9f38240193d # v9
        with:
          stale-issue-message: 'This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.'
          stale-pr-message: 'This PR is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 10 days.'
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
@@ -10,55 +10,10 @@ on:
      - '*'
 concurrency:
-  group: ci-tests-extra-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
+  group: ci-tests-extra-${{ github.head_ref || github.ref }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+  cancel-in-progress: true
 jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      run-all: ${{ steps.detect.outputs.run-all }}
      transformers: ${{ steps.detect.outputs.transformers }}
      rerankers: ${{ steps.detect.outputs.rerankers }}
      diffusers: ${{ steps.detect.outputs.diffusers }}
      coqui: ${{ steps.detect.outputs.coqui }}
      moonshine: ${{ steps.detect.outputs.moonshine }}
      pocket-tts: ${{ steps.detect.outputs.pocket-tts }}
      qwen-tts: ${{ steps.detect.outputs.qwen-tts }}
      qwen-asr: ${{ steps.detect.outputs.qwen-asr }}
      nemo: ${{ steps.detect.outputs.nemo }}
      voxcpm: ${{ steps.detect.outputs.voxcpm }}
      liquid-audio: ${{ steps.detect.outputs.liquid-audio }}
      llama-cpp-quantization: ${{ steps.detect.outputs.llama-cpp-quantization }}
      llama-cpp: ${{ steps.detect.outputs.llama-cpp }}
      ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
      turboquant: ${{ steps.detect.outputs.turboquant }}
      vllm: ${{ steps.detect.outputs.vllm }}
      sglang: ${{ steps.detect.outputs.sglang }}
      acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
      qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
      vibevoice-cpp: ${{ steps.detect.outputs.vibevoice-cpp }}
      localvqe: ${{ steps.detect.outputs.localvqe }}
      voxtral: ${{ steps.detect.outputs.voxtral }}
      kokoros: ${{ steps.detect.outputs.kokoros }}
      insightface: ${{ steps.detect.outputs.insightface }}
      speaker-recognition: ${{ steps.detect.outputs.speaker-recognition }}
      sherpa-onnx: ${{ steps.detect.outputs.sherpa-onnx }}
      whisper: ${{ steps.detect.outputs.whisper }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6
      - name: Setup Bun
        uses: oven-sh/setup-bun@v2
      - name: Install dependencies
        run: bun add js-yaml @octokit/core
      - name: Detect changed backends
        id: detect
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GITHUB_EVENT_PATH: ${{ github.event_path }}
        run: bun run scripts/changed-backends.js
  # Requires CUDA
  # tests-chatterbox-tts:
  #   runs-on: ubuntu-latest
@@ -82,8 +37,6 @@ jobs:
  #          make --jobs=5 --output-sync=target -C backend/python/chatterbox
  #          make --jobs=5 --output-sync=target -C backend/python/chatterbox test
  tests-transformers:
    needs: detect-changes
    if: needs.detect-changes.outputs.transformers == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
@@ -105,8 +58,6 @@ jobs:
           make --jobs=5 --output-sync=target -C backend/python/transformers
           make --jobs=5 --output-sync=target -C backend/python/transformers test
  tests-rerankers:
    needs: detect-changes
    if: needs.detect-changes.outputs.rerankers == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
@@ -129,8 +80,6 @@ jobs:
           make --jobs=5 --output-sync=target -C backend/python/rerankers test
  tests-diffusers:
    needs: detect-changes
    if: needs.detect-changes.outputs.diffusers == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
@@ -280,8 +229,6 @@ jobs:
  #          make --jobs=5 --output-sync=target -C backend/python/vllm test
  tests-coqui:
    needs: detect-changes
    if: needs.detect-changes.outputs.coqui == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
@@ -301,8 +248,6 @@ jobs:
          make --jobs=5 --output-sync=target -C backend/python/coqui
          make --jobs=5 --output-sync=target -C backend/python/coqui test
  tests-moonshine:
    needs: detect-changes
    if: needs.detect-changes.outputs.moonshine == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
@@ -322,8 +267,6 @@ jobs:
          make --jobs=5 --output-sync=target -C backend/python/moonshine
          make --jobs=5 --output-sync=target -C backend/python/moonshine test
  tests-pocket-tts:
    needs: detect-changes
    if: needs.detect-changes.outputs.pocket-tts == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
@@ -343,8 +286,6 @@ jobs:
          make --jobs=5 --output-sync=target -C backend/python/pocket-tts
          make --jobs=5 --output-sync=target -C backend/python/pocket-tts test
  tests-qwen-tts:
    needs: detect-changes
    if: needs.detect-changes.outputs.qwen-tts == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
@@ -363,31 +304,7 @@ jobs:
        run: |
          make --jobs=5 --output-sync=target -C backend/python/qwen-tts
          make --jobs=5 --output-sync=target -C backend/python/qwen-tts test
  # TODO: s2-pro model is too large to load on CPU-only CI runners — re-enable
  # when we have GPU runners or a smaller test model.
  # tests-fish-speech:
  #   runs-on: ubuntu-latest
  #   timeout-minutes: 45
  #   steps:
  #     - name: Clone
  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
  #       run: |
  #         sudo apt-get update
  #         sudo apt-get install -y build-essential ffmpeg portaudio19-dev
  #         sudo apt-get install -y ca-certificates cmake curl patch python3-pip
  #         # Install UV
  #         curl -LsSf https://astral.sh/uv/install.sh | sh
  #         pip install --user --no-cache-dir grpcio-tools==1.64.1
  #     - name: Test fish-speech
  #       run: |
  #         make --jobs=5 --output-sync=target -C backend/python/fish-speech
  #         make --jobs=5 --output-sync=target -C backend/python/fish-speech test
  tests-qwen-asr:
    needs: detect-changes
    if: needs.detect-changes.outputs.qwen-asr == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
@@ -406,30 +323,7 @@ jobs:
        run: |
          make --jobs=5 --output-sync=target -C backend/python/qwen-asr
          make --jobs=5 --output-sync=target -C backend/python/qwen-asr test
  tests-nemo:
    needs: detect-changes
    if: needs.detect-changes.outputs.nemo == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential ffmpeg sox
          sudo apt-get install -y ca-certificates cmake curl patch python3-pip
          # Install UV
          curl -LsSf https://astral.sh/uv/install.sh | sh
          pip install --user --no-cache-dir grpcio-tools==1.64.1
      - name: Test nemo
        run: |
          make --jobs=5 --output-sync=target -C backend/python/nemo
          make --jobs=5 --output-sync=target -C backend/python/nemo test
  tests-voxcpm:
    needs: detect-changes
    if: needs.detect-changes.outputs.voxcpm == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
@@ -448,616 +342,3 @@ jobs:
        run: |
          make --jobs=5 --output-sync=target -C backend/python/voxcpm
          make --jobs=5 --output-sync=target -C backend/python/voxcpm test
  # liquid-audio: LFM2.5-Audio any-to-any backend. The CI smoke test
  # exercises Health() and LoadModel(mode:finetune) — fine-tune mode
  # short-circuits before pulling weights (backend.py:192), so no
  # HuggingFace download or GPU is needed. The full-inference path is
  # gated on LIQUID_AUDIO_MODEL_ID, which we don't set here.
  tests-liquid-audio:
    needs: detect-changes
    if: needs.detect-changes.outputs.liquid-audio == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential ffmpeg
          sudo apt-get install -y ca-certificates cmake curl patch python3-pip
          # Install UV
          curl -LsSf https://astral.sh/uv/install.sh | sh
          pip install --user --no-cache-dir grpcio-tools==1.64.1
      - name: Test liquid-audio
        run: |
          make --jobs=5 --output-sync=target -C backend/python/liquid-audio
          make --jobs=5 --output-sync=target -C backend/python/liquid-audio test
  tests-llama-cpp-quantization:
    needs: detect-changes
    if: needs.detect-changes.outputs.llama-cpp-quantization == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential cmake curl git python3-pip
          # Install UV
          curl -LsSf https://astral.sh/uv/install.sh | sh
          pip install --user --no-cache-dir grpcio-tools==1.64.1
      - name: Build llama-quantize from llama.cpp
        run: |
          git clone --depth 1 https://github.com/ggml-org/llama.cpp.git /tmp/llama.cpp
          cmake -B /tmp/llama.cpp/build -S /tmp/llama.cpp -DGGML_NATIVE=OFF
          cmake --build /tmp/llama.cpp/build --target llama-quantize -j$(nproc)
          sudo cp /tmp/llama.cpp/build/bin/llama-quantize /usr/local/bin/
      - name: Install backend
        run: |
          make --jobs=5 --output-sync=target -C backend/python/llama-cpp-quantization
      - name: Test llama-cpp-quantization
        run: |
          make --jobs=5 --output-sync=target -C backend/python/llama-cpp-quantization test
  tests-llama-cpp-grpc:
    needs: detect-changes
    if: needs.detect-changes.outputs.llama-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.25.4'
      - name: Build llama-cpp backend image and run gRPC e2e tests
        run: |
          make test-extra-backend-llama-cpp
  tests-llama-cpp-grpc-transcription:
    needs: detect-changes
    if: needs.detect-changes.outputs.llama-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.25.4'
      - name: Build llama-cpp backend image and run audio transcription gRPC e2e tests
        run: |
          make test-extra-backend-llama-cpp-transcription
  # PR-acceptance smoke gate: always runs on every PR (no detect-changes gate, no
  # paths filter). Pulls the pre-built master CPU llama-cpp image from quay
  # instead of building from source, so the cost is a docker pull (~30s) plus the
  # short Qwen3-0.6B model download. Exercises the full gRPC surface — health,
  # load, predict, stream — plus the logprobs/logit_bias specs that moved out of
  # core/http/app_test.go. Anything heavier or per-backend is gated to the
  # detect-changes path-filter above.
  tests-llama-cpp-smoke:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.25.4'
      - name: Pull pre-built llama-cpp backend image
        run: docker pull quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp
      - name: Run e2e-backends smoke
        env:
          BACKEND_IMAGE: quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp
          BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias
        run: |
          make test-extra-backend
  # Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked LLM.
  # Builds the sherpa-onnx Docker image, extracts the rootfs so the e2e suite
  # can discover the backend binary + shared libs, downloads the three model
  # bundles (silero-vad, omnilingual-asr, vits-ljs) and drives the realtime
  # websocket spec end-to-end.
  tests-sherpa-onnx-realtime:
    needs: detect-changes
    if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.25.4'
      - name: Setup Node.js
        uses: actions/setup-node@v6
        with:
          node-version: '22'
      - name: Build sherpa-onnx backend image and run realtime e2e tests
        run: |
          make test-extra-e2e-realtime-sherpa
  # Streaming ASR via the sherpa-onnx online recognizer (zipformer
  # transducer). Exercises both AudioTranscription (buffered) and
  # AudioTranscriptionStream (real-time deltas) on the e2e-backends
  # harness.
  tests-sherpa-onnx-grpc-transcription:
    needs: detect-changes
    if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.25.4'
      - name: Build sherpa-onnx backend image and run streaming ASR gRPC e2e tests
        run: |
          make test-extra-backend-sherpa-onnx-transcription
  # End-to-end transcription via the e2e-backends gRPC harness against
  # the whisper.cpp backend. Drives AudioTranscription (offline) and
  # AudioTranscriptionStream (real, segment-callback-driven deltas) on
  # ggml-base.en + the JFK 11s clip.
  tests-whisper-grpc-transcription:
    needs: detect-changes
    if: needs.detect-changes.outputs.whisper == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.25.4'
      - name: Build whisper backend image and run transcription gRPC e2e tests
        run: |
          make test-extra-backend-whisper-transcription
  # VITS TTS via the sherpa-onnx backend. Drives both TTS (file write) and
  # TTSStream (PCM chunks) on the e2e-backends harness.
  tests-sherpa-onnx-grpc-tts:
    needs: detect-changes
    if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.25.4'
      - name: Build sherpa-onnx backend image and run TTS gRPC e2e tests
        run: |
          make test-extra-backend-sherpa-onnx-tts
  tests-ik-llama-cpp-grpc:
    needs: detect-changes
    if: needs.detect-changes.outputs.ik-llama-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.25.4'
      - name: Build ik-llama-cpp backend image and run gRPC e2e tests
        run: |
          make test-extra-backend-ik-llama-cpp
  tests-turboquant-grpc:
    needs: detect-changes
    if: needs.detect-changes.outputs.turboquant == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.25.4'
      # Exercises the turboquant (llama.cpp fork) backend with KV-cache
      # quantization enabled. The convenience target sets
      # BACKEND_TEST_CACHE_TYPE_K / _V=q8_0, which are plumbed into the
      # ModelOptions.CacheTypeKey/Value gRPC fields. LoadModel-success +
      # backend stdout/stderr (captured by the Ginkgo suite) prove the
      # cache-type config path reaches the fork's KV-cache init.
      - name: Build turboquant backend image and run gRPC e2e tests
        run: |
          make test-extra-backend-turboquant
  # tests-vllm-grpc is currently disabled in CI.
  #
  # The prebuilt vllm CPU wheel is compiled with AVX-512 VNNI/BF16
  # instructions, and neither ubuntu-latest nor the bigger-runner pool
  # offers a stable CPU baseline that supports them — runners come
  # back with different hardware between runs and SIGILL on import of
  # vllm.model_executor.models.registry. Compiling vllm from source
  # via FROM_SOURCE=true works on any CPU but takes 30-50 minutes per
  # run, which is too slow for a smoke test.
  #
  # The test itself (tests/e2e-backends + make test-extra-backend-vllm)
  # is fully working and validated locally on a host with the right
  # SIMD baseline. Run it manually with:
  #
  #   make test-extra-backend-vllm
  #
  # Re-enable this job once we have a self-hosted runner label with
  # guaranteed AVX-512 VNNI/BF16 support, or once the vllm project
  # publishes a CPU wheel with a wider baseline.
  #
  # tests-vllm-grpc:
  #   needs: detect-changes
  #   if: needs.detect-changes.outputs.vllm == 'true' || needs.detect-changes.outputs.run-all == 'true'
  #   runs-on: bigger-runner
  #   timeout-minutes: 90
  #   steps:
  #     - name: Clone
  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
  #       run: |
  #         sudo apt-get update
  #         sudo apt-get install -y --no-install-recommends \
  #             make build-essential curl unzip ca-certificates git tar
  #     - name: Setup Go
  #       uses: actions/setup-go@v5
  #       with:
  #         go-version: '1.25.4'
  #     - name: Free disk space
  #       run: |
  #         sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
  #         df -h
  #     - name: Build vllm (cpu) backend image and run gRPC e2e tests
  #       run: |
  #         make test-extra-backend-vllm
  # tests-sglang-grpc is currently disabled in CI for the same reason as
  # tests-vllm-grpc: sglang's CPU kernel (sgl-kernel) uses __m512 AVX-512
  # intrinsics unconditionally in shm.cpp, so the from-source build
  # requires `-march=sapphirerapids` (already set in install.sh) and the
  # resulting binary SIGILLs at import on CPUs without AVX-512 VNNI/BF16.
  # The ubuntu-latest runner pool does not guarantee that ISA baseline.
  #
  # The test itself (tests/e2e-backends + make test-extra-backend-sglang)
  # is fully working and validated locally on a host with the right
  # SIMD baseline. Run it manually with:
  #
  #   make test-extra-backend-sglang
  #
  # Re-enable this job once we have a self-hosted runner label with
  # guaranteed AVX-512 VNNI/BF16 support.
  #
  # tests-sglang-grpc:
  #   needs: detect-changes
  #   if: needs.detect-changes.outputs.sglang == 'true' || needs.detect-changes.outputs.run-all == 'true'
  #   runs-on: bigger-runner
  #   timeout-minutes: 90
  #   steps:
  #     - name: Clone
  #       uses: actions/checkout@v6
  #       with:
  #         submodules: true
  #     - name: Dependencies
  #       run: |
  #         sudo apt-get update
  #         sudo apt-get install -y --no-install-recommends \
  #             make build-essential curl unzip ca-certificates git tar
  #     - name: Setup Go
  #       uses: actions/setup-go@v5
  #       with:
  #         go-version: '1.25.4'
  #     - name: Free disk space
  #       run: |
  #         sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
  #         df -h
  #     - name: Build sglang (cpu) backend image and run gRPC e2e tests
  #       run: |
  #         make test-extra-backend-sglang
  tests-acestep-cpp:
    needs: detect-changes
    if: needs.detect-changes.outputs.acestep-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential cmake curl libopenblas-dev ffmpeg
      - name: Setup Go
        uses: actions/setup-go@v5
      - name: Display Go version
        run: go version
      - name: Proto Dependencies
        run: |
          # Install protoc
          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
          rm protoc.zip
          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
          PATH="$PATH:$HOME/go/bin" make protogen-go
      - name: Build acestep-cpp
        run: |
          make --jobs=5 --output-sync=target -C backend/go/acestep-cpp
      - name: Test acestep-cpp
        run: |
          make --jobs=5 --output-sync=target -C backend/go/acestep-cpp test
  tests-qwen3-tts-cpp:
    needs: detect-changes
    if: needs.detect-changes.outputs.qwen3-tts-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential cmake curl libopenblas-dev ffmpeg
      - name: Setup Go
        uses: actions/setup-go@v5
      - name: Display Go version
        run: go version
      - name: Proto Dependencies
        run: |
          # Install protoc
          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
          rm protoc.zip
          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
          PATH="$PATH:$HOME/go/bin" make protogen-go
      - name: Build qwen3-tts-cpp
        run: |
          make --jobs=5 --output-sync=target -C backend/go/qwen3-tts-cpp
      - name: Test qwen3-tts-cpp
        run: |
          make --jobs=5 --output-sync=target -C backend/go/qwen3-tts-cpp test
  # Per-backend smoke for vibevoice-cpp: builds the .so + Go binary and
  # runs `make -C backend/go/vibevoice-cpp test`. test.sh auto-downloads
  # the published mudler/vibevoice.cpp-models bundle (TTS Q8_0 + ASR Q4_K
  # + tokenizer + voice) and runs the closed-loop TTS → ASR Go test.
  tests-vibevoice-cpp:
    needs: detect-changes
    if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential cmake curl libopenblas-dev ffmpeg
      - name: Setup Go
        uses: actions/setup-go@v5
      - name: Display Go version
        run: go version
      - name: Proto Dependencies
        run: |
          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
          rm protoc.zip
          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
          PATH="$PATH:$HOME/go/bin" make protogen-go
      - name: Build vibevoice-cpp
        run: |
          make --jobs=5 --output-sync=target -C backend/go/vibevoice-cpp
      - name: Test vibevoice-cpp
        run: |
          make --jobs=5 --output-sync=target -C backend/go/vibevoice-cpp test
  # End-to-end TTS via the e2e-backends gRPC harness. Builds the
  # vibevoice-cpp Docker image and drives Backend/TTS against it with a
  # real LocalAI gRPC client.
  tests-vibevoice-cpp-grpc-tts:
    needs: detect-changes
    if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.25.4'
      - name: Build vibevoice-cpp backend image and run TTS gRPC e2e tests
        run: |
          make test-extra-backend-vibevoice-cpp-tts
  # End-to-end transcription via the e2e-backends gRPC harness. The
  # vibevoice ASR is a 7B-param model (Q4_K weights ~10 GB on disk)
  # and the JFK 30 s decode is too heavy for a free 4-core
  # ubuntu-latest pool runner - two CI attempts got SIGTERM'd during
  # LoadModel, before the test could even progress. Use the
  # self-hosted 'bigger-runner' label (same one the GPU image builds
  # in backend.yml use) and the documented dotnet/ghc/android cache
  # purge to clear ~10-20 GB of headroom for the model + Docker
  # image + working dir.
  tests-vibevoice-cpp-grpc-transcription:
    needs: detect-changes
    if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: bigger-runner
    timeout-minutes: 150
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y --no-install-recommends \
              make build-essential curl unzip ca-certificates git tar
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.25.4'
      - name: Free disk space
        run: |
          sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
          df -h
      - name: Build vibevoice-cpp backend image and run ASR gRPC e2e tests
        run: |
          make test-extra-backend-vibevoice-cpp-transcription
  # End-to-end audio transform via the e2e-backends gRPC harness. The
  # LocalVQE GGUF is small (~5 MB) and the model is real-time on CPU, so
  # the default ubuntu-latest pool is plenty.
  tests-localvqe-grpc-transform:
    needs: detect-changes
    if: needs.detect-changes.outputs.localvqe == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 60
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.25.4'
      - name: Build localvqe backend image and run audio_transform gRPC e2e tests
        run: |
          make test-extra-backend-localvqe-transform
  tests-voxtral:
    needs: detect-changes
    if: needs.detect-changes.outputs.voxtral == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential cmake curl libopenblas-dev ffmpeg
      - name: Setup Go
        uses: actions/setup-go@v5
      # You can test your matrix by printing the current Go version
      - name: Display Go version
        run: go version
      - name: Proto Dependencies
        run: |
          # Install protoc
          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
          rm protoc.zip
          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
          PATH="$PATH:$HOME/go/bin" make protogen-go
      - name: Build voxtral
        run: |
          make --jobs=5 --output-sync=target -C backend/go/voxtral
      - name: Test voxtral
        run: |
          make --jobs=5 --output-sync=target -C backend/go/voxtral test
  tests-kokoros:
    needs: detect-changes
    if: needs.detect-changes.outputs.kokoros == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential cmake pkg-config protobuf-compiler clang libclang-dev
          sudo apt-get install -y espeak-ng libespeak-ng-dev libsonic-dev libpcaudio-dev libopus-dev libssl-dev
          curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
          echo "$HOME/.cargo/bin" >> $GITHUB_PATH
      - name: Build kokoros
        run: |
          make -C backend/rust/kokoros kokoros-grpc
      - name: Test kokoros
        run: |
          make -C backend/rust/kokoros test
  tests-insightface-grpc:
    needs: detect-changes
    if: needs.detect-changes.outputs.insightface == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y --no-install-recommends \
              make build-essential curl unzip ca-certificates git tar
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.26.0'
      - name: Free disk space
        run: |
          sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
          df -h
      - name: Build insightface backend image and run both model configurations
        run: |
          make test-extra-backend-insightface-all
  tests-speaker-recognition-grpc:
    needs: detect-changes
    if: needs.detect-changes.outputs.speaker-recognition == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y --no-install-recommends \
              make build-essential curl ca-certificates git tar
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.26.0'
      - name: Free disk space
        run: |
          sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
          df -h
      - name: Build speaker-recognition backend image and run the ECAPA-TDNN configuration
        run: |
          make test-extra-backend-speaker-recognition-all
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -9,23 +9,70 @@ on:
    tags:
      - '*'
 env:
  GRPC_VERSION: v1.65.0
 concurrency:
-  group: ci-tests-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
+  group: ci-tests-${{ github.head_ref || github.ref }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+  cancel-in-progress: true
 jobs:
  tests-linux:
    runs-on: ubuntu-latest
    strategy:
      matrix:
-        go-version: ['1.26.x']
+        go-version: ['1.25.x']
    steps:
      - name: Free Disk Space (Ubuntu)
        uses: jlumbroso/free-disk-space@main
        with:
          # this might remove tools that are actually needed,
          # if set to "true" but frees about 6 GB
          tool-cache: true
          # all of these default to true, but feel free to set to
          # "false" if necessary for your workflow
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          docker-images: true
          swap-storage: true
      - name: Release space from worker
        run: |
          echo "Listing top largest packages"
          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
          head -n 30 <<< "${pkgs}"
          echo
          df -h
          echo
          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
          sudo apt-get remove --auto-remove android-sdk-platform-tools || true
          sudo apt-get purge --auto-remove android-sdk-platform-tools || true
          sudo rm -rf /usr/local/lib/android
          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
          sudo rm -rf /usr/share/dotnet
          sudo apt-get remove -y '^mono-.*' || true
          sudo apt-get remove -y '^ghc-.*' || true
          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
          sudo apt-get remove -y 'php.*' || true
          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
          sudo apt-get remove -y '^google-.*' || true
          sudo apt-get remove -y azure-cli || true
          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
          sudo apt-get remove -y '^gfortran-.*' || true
          sudo apt-get autoremove -y
          sudo apt-get clean
          echo
          echo "Listing top largest packages"
          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
          head -n 30 <<< "${pkgs}"
          echo
          sudo rm -rfv build || true
          df -h
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Free disk space
        uses: ./.github/actions/free-disk-space
      - name: Setup Go ${{ matrix.go-version }}
        uses: actions/setup-go@v5
        with:
@@ -46,16 +93,89 @@ jobs:
      - name: Dependencies
        run: |
          sudo apt-get update
-          sudo apt-get install curl ffmpeg libopus-dev
-      - name: Setup Node.js
-        uses: actions/setup-node@v6
-        with:
-          node-version: '22'
-      - name: Build React UI
-        run: make react-ui
+          sudo apt-get install build-essential ccache upx-ucl curl ffmpeg
+          sudo apt-get install -y libgmock-dev clang
+          # Install UV
+          curl -LsSf https://astral.sh/uv/install.sh | sh
+          sudo apt-get install -y ca-certificates cmake patch python3-pip unzip
+          sudo apt-get install -y libopencv-dev
+
          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
          rm protoc.zip
          curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
          sudo dpkg -i cuda-keyring_1.1-1_all.deb
          sudo apt-get update
          sudo apt-get install -y cuda-nvcc-${CUDA_VERSION} libcublas-dev-${CUDA_VERSION}
          export CUDACXX=/usr/local/cuda/bin/nvcc
          make -C backend/python/transformers
          make backends/huggingface backends/llama-cpp backends/local-store backends/silero-vad backends/piper backends/whisper backends/stablediffusion-ggml
        env:
          CUDA_VERSION: 12-4
      - name: Test
        run: |
-          PATH="$PATH:/root/go/bin" make --jobs 5 --output-sync=target test
+          PATH="$PATH:/root/go/bin" GO_TAGS="tts" make --jobs 5 --output-sync=target test
      - name: Setup tmate session if tests fail
        if: ${{ failure() }}
        uses: mxschmitt/action-tmate@v3.23
        with:
          detached: true
          connect-timeout-seconds: 180
          limit-access-to-actor: true
  tests-aio-container:
    runs-on: ubuntu-latest
    steps:
      - name: Release space from worker
        run: |
          echo "Listing top largest packages"
          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
          head -n 30 <<< "${pkgs}"
          echo
          df -h
          echo
          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
          sudo apt-get remove --auto-remove android-sdk-platform-tools || true
          sudo apt-get purge --auto-remove android-sdk-platform-tools || true
          sudo rm -rf /usr/local/lib/android
          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
          sudo rm -rf /usr/share/dotnet
          sudo apt-get remove -y '^mono-.*' || true
          sudo apt-get remove -y '^ghc-.*' || true
          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
          sudo apt-get remove -y 'php.*' || true
          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
          sudo apt-get remove -y '^google-.*' || true
          sudo apt-get remove -y azure-cli || true
          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
          sudo apt-get remove -y '^gfortran-.*' || true
          sudo apt-get autoremove -y
          sudo apt-get clean
          echo
          echo "Listing top largest packages"
          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
          head -n 30 <<< "${pkgs}"
          echo
          sudo rm -rfv build || true
          df -h
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
        run: |
          # Install protoc
          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
          rm protoc.zip
          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
          PATH="$PATH:$HOME/go/bin" make protogen-go
      - name: Test
        run: |
            PATH="$PATH:$HOME/go/bin" make backends/local-store backends/silero-vad backends/llama-cpp backends/whisper backends/piper backends/stablediffusion-ggml docker-build-aio e2e-aio
      - name: Setup tmate session if tests fail
        if: ${{ failure() }}
        uses: mxschmitt/action-tmate@v3.23
@@ -68,7 +188,7 @@ jobs:
    runs-on: macos-latest
    strategy:
      matrix:
-        go-version: ['1.26.x']
+        go-version: ['1.25.x']
    steps:
      - name: Clone
        uses: actions/checkout@v6
@@ -84,14 +204,12 @@ jobs:
        run: go version
      - name: Dependencies
        run: |
-          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm opus ffmpeg
+          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm
          pip install --user --no-cache-dir grpcio-tools grpcio
-      - name: Setup Node.js
+      - name: Build llama-cpp-darwin
-        uses: actions/setup-node@v6
+        run: |
-        with:
+          make protogen-go
-          node-version: '22'
+          make backends/llama-cpp-darwin
      - name: Build React UI
        run: make react-ui
      - name: Test
        run: |
          export C_INCLUDE_PATH=/usr/local/include
--- a/.github/workflows/tests-aio.yml
+++ b/.github/workflows/tests-aio.yml
@@ -1,86 +0,0 @@
 ---
 name: 'tests-aio'
 # Runs the all-in-one (AIO) Docker image with real backends + real models.
 # Heavy: builds llama-cpp/whisper/piper/silero-vad/stablediffusion-ggml/local-store
 # and exercises end-to-end inference inside the container. Moved out of test.yml
 # (which used to run on every PR) so PR CI no longer pays this cost.
 #
 # Triggers:
 #   - schedule (nightly @ 04:00 UTC) — catches packaging/image regressions within 24h
 #   - workflow_dispatch — manual run on-demand
 #   - push to master/tags — sanity check after merge / before release
 on:
  schedule:
    - cron: '0 4 * * *'
  workflow_dispatch:
  push:
    branches:
      - master
    tags:
      - '*'
 concurrency:
  group: ci-tests-aio-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
 jobs:
  tests-aio:
    runs-on: ubuntu-latest
    steps:
      - name: Release space from worker
        run: |
          echo "Listing top largest packages"
          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
          head -n 30 <<< "${pkgs}"
          echo
          df -h
          echo
          sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
          sudo apt-get remove --auto-remove android-sdk-platform-tools || true
          sudo apt-get purge --auto-remove android-sdk-platform-tools || true
          sudo rm -rf /usr/local/lib/android
          sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
          sudo rm -rf /usr/share/dotnet
          sudo apt-get remove -y '^mono-.*' || true
          sudo apt-get remove -y '^ghc-.*' || true
          sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
          sudo apt-get remove -y 'php.*' || true
          sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
          sudo apt-get remove -y '^google-.*' || true
          sudo apt-get remove -y azure-cli || true
          sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
          sudo apt-get remove -y '^gfortran-.*' || true
          sudo apt-get autoremove -y
          sudo apt-get clean
          echo
          echo "Listing top largest packages"
          pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
          head -n 30 <<< "${pkgs}"
          echo
          sudo rm -rfv build || true
          df -h
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Dependencies
        run: |
          # Install protoc
          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
          rm protoc.zip
          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
          PATH="$PATH:$HOME/go/bin" make protogen-go
      - name: Test
        run: |
            PATH="$PATH:$HOME/go/bin" make backends/local-store backends/silero-vad backends/llama-cpp backends/whisper backends/piper backends/stablediffusion-ggml docker-build-e2e e2e-aio
      - name: Setup tmate session if tests fail
        if: ${{ failure() }}
        uses: mxschmitt/action-tmate@v3.23
        with:
          detached: true
          connect-timeout-seconds: 180
          limit-access-to-actor: true
--- a/.github/workflows/tests-e2e.yml
+++ b/.github/workflows/tests-e2e.yml
@@ -10,8 +10,8 @@ on:
      - '*'
 concurrency:
-  group: ci-tests-e2e-backend-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
+  group: ci-tests-e2e-backend-${{ github.head_ref || github.ref }}-${{ github.repository }}
-  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
+  cancel-in-progress: true
 jobs:
  tests-e2e-backend:
@@ -24,8 +24,6 @@ jobs:
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Configure apt mirror on runner
        uses: ./.github/actions/configure-apt-mirror
      - name: Setup Go ${{ matrix.go-version }}
        uses: actions/setup-go@v5
        with:
@@ -45,13 +43,7 @@ jobs:
      - name: Dependencies
        run: |
          sudo apt-get update
-          sudo apt-get install -y build-essential libopus-dev
+          sudo apt-get install -y build-essential
      - name: Setup Node.js
        uses: actions/setup-node@v6
        with:
          node-version: '22'
      - name: Build React UI
        run: make react-ui
      - name: Test Backend E2E
        run: |
          PATH="$PATH:$HOME/go/bin" make build-mock-backend test-e2e
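The `+` side of the concurrency change in `tests-e2e.yml` keys the group on `github.head_ref || github.ref`. As a rough sketch of how that key resolves (the function name and the sample refs are illustrative assumptions, not part of the workflow): `github.head_ref` is only populated for pull request events, so pushes fall back to `github.ref`.

```python
def concurrency_group(head_ref: str, ref: str, repository: str) -> str:
    """Mimic the expression `${{ github.head_ref || github.ref }}`.

    GitHub's `a || b` yields `a` when it is truthy and `b` otherwise,
    which matches Python's `or` for strings (empty string is falsy).
    """
    return f"ci-tests-e2e-backend-{head_ref or ref}-{repository}"


# Pull request event: head_ref holds the source branch name.
pr_group = concurrency_group("feature/x", "refs/pull/12/merge", "mudler/LocalAI")

# Push event: head_ref is empty, so the full ref is used instead.
push_group = concurrency_group("", "refs/heads/master", "mudler/LocalAI")

print(pr_group)
print(push_group)
```

With `cancel-in-progress: true`, any two runs that resolve to the same group string cancel the older one, so under this scheme consecutive pushes to the same branch share a group.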
--- a/.github/workflows/tests-ui-e2e.yml
+++ b/.github/workflows/tests-ui-e2e.yml
@@ -1,74 +0,0 @@
 ---
 name: 'UI E2E Tests'
 on:
  pull_request:
    paths:
      - 'core/http/**'
      - 'tests/e2e-ui/**'
      - 'tests/e2e/mock-backend/**'
  push:
    branches:
      - master
 concurrency:
  group: ci-tests-ui-e2e-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
 jobs:
  tests-ui-e2e:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        go-version: ['1.26.x']
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Configure apt mirror on runner
        uses: ./.github/actions/configure-apt-mirror
      - name: Setup Go ${{ matrix.go-version }}
        uses: actions/setup-go@v5
        with:
          go-version: ${{ matrix.go-version }}
          cache: false
      - name: Setup Node.js
        uses: actions/setup-node@v6
        with:
          node-version: '22'
      - name: Proto Dependencies
        run: |
          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
          rm protoc.zip
          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
      - name: System Dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential libopus-dev
      - name: Build UI test server
        run: PATH="$PATH:$HOME/go/bin" make build-ui-test-server
      - name: Install Playwright
        working-directory: core/http/react-ui
        run: |
          npm install
          npx playwright install --with-deps chromium
      - name: Run Playwright tests
        working-directory: core/http/react-ui
        run: npx playwright test
      - name: Upload Playwright report
        if: ${{ failure() }}
        uses: actions/upload-artifact@v7
        with:
          name: playwright-report
          path: core/http/react-ui/playwright-report/
          retention-days: 7
      - name: Setup tmate session if tests fail
        if: ${{ failure() }}
        uses: mxschmitt/action-tmate@v3.23
        with:
          detached: true
          connect-timeout-seconds: 180
          limit-access-to-actor: true
--- a/.github/workflows/update_swagger.yaml
+++ b/.github/workflows/update_swagger.yaml
@@ -5,14 +5,11 @@ on:
  workflow_dispatch:
 jobs:
  swagger:
    if: github.repository == 'mudler/LocalAI'
    strategy:
      fail-fast: false
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - name: Configure apt mirror on runner
        uses: ./.github/actions/configure-apt-mirror
      - uses: actions/setup-go@v5
        with:
          go-version: 'stable'
--- a/.gitignore
+++ b/.gitignore
@@ -37,7 +37,7 @@ models/*
 test-models/
 test-dir/
 tests/e2e-aio/backends
-mock-backend
+tests/e2e-aio/models
 release/
@@ -65,18 +65,3 @@ docs/static/gallery.html
 # per-developer customization files for the development container
 .devcontainer/customization/*
 # React UI build artifacts (keep placeholder dist/index.html)
 core/http/react-ui/node_modules/
 core/http/react-ui/dist
 # Extracted backend binaries for container-based testing
 local-backends/
 # UI E2E test artifacts
 tests/e2e-ui/ui-test-server
 core/http/react-ui/playwright-report/
 core/http/react-ui/test-results/
 # Local worktrees
 .worktrees/
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,6 +1,3 @@
 [submodule "docs/themes/hugo-theme-relearn"]
 	path = docs/themes/hugo-theme-relearn
 	url = https://github.com/McShelby/hugo-theme-relearn.git
 [submodule "backend/rust/kokoros/sources/Kokoros"]
 	path = backend/rust/kokoros/sources/Kokoros
 	url = https://github.com/lucasjinreal/Kokoros
--- a/.golangci.yml
+++ b/.golangci.yml
@@ -1,97 +0,0 @@
 version: "2"
 # Only issues introduced relative to master are reported. Pre-existing issues
 # in the codebase do not fail the lint job; they're treated as a baseline that
 # can be cleaned up incrementally. New code (added lines on a branch) is held
 # to the full linter set. Locally, `make lint-all` overrides this and reports
 # every issue.
 issues:
  # origin/master because in shallow CI checkouts only the remote-tracking
  # branch exists; a bare 'master' ref isn't reachable locally.
  new-from-merge-base: origin/master
 linters:
  default: standard
  # staticcheck is noisy on this codebase (mostly QF style suggestions like
  # "could use tagged switch" or "unnecessary fmt.Sprintf"). Re-enable
  # selectively if a high-signal subset is identified.
  disable:
    - staticcheck
  enable:
    - forbidigo
  settings:
    forbidigo:
      forbid:
        - pattern: '^t\.Errorf$'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(...) instead of t.Errorf. See .agents/coding-style.md.'
        - pattern: '^t\.Error$'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(...) instead of t.Error. See .agents/coding-style.md.'
        - pattern: '^t\.Fatalf$'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(Succeed()) / Fail(...) instead of t.Fatalf. See .agents/coding-style.md.'
        - pattern: '^t\.Fatal$'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(Succeed()) / Fail(...) instead of t.Fatal. See .agents/coding-style.md.'
        - pattern: '^t\.Run$'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Describe/Context/It instead of t.Run. See .agents/coding-style.md.'
        - pattern: '^t\.Skip$'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Skip(...) instead of t.Skip. See .agents/coding-style.md.'
        - pattern: '^t\.Skipf$'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Skip(...) instead of t.Skipf. See .agents/coding-style.md.'
        - pattern: '^t\.SkipNow$'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Skip(...) instead of t.SkipNow. See .agents/coding-style.md.'
        - pattern: '^t\.Logf$'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use GinkgoWriter / fmt.Fprintf(GinkgoWriter, ...) instead of t.Logf. See .agents/coding-style.md.'
        - pattern: '^t\.Log$'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use GinkgoWriter / fmt.Fprintln(GinkgoWriter, ...) instead of t.Log. See .agents/coding-style.md.'
        - pattern: '^t\.Fail$'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.Fail. See .agents/coding-style.md.'
        - pattern: '^t\.FailNow$'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.FailNow. See .agents/coding-style.md.'
        # In-process config should flow through ApplicationConfig / kong-bound
        # CLI flags, not via os.Getenv. The CLI layer is the legitimate
        # env→struct boundary (kong's `env:"..."` tag); anything deeper that
        # reads env directly leaks process state into business logic and
        # makes flags impossible to test or override per-request. Backend
        # subprocesses, the system/capabilities probe, and a few places that
        # read non-LocalAI env vars (HOME, PATH, AUTH_TOKEN passed by parent)
        # are exempt — see linters.exclusions.rules below.
        - pattern: '^os\.(Getenv|LookupEnv|Environ)$'
          msg: 'Plumb config through ApplicationConfig (or the relevant CLI struct) instead of reading env directly. CLI entry points (core/cli/) bind env vars via kong''s `env:` tag — that is the only sanctioned env→struct boundary. See .agents/coding-style.md.'
  exclusions:
    paths:
      # Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
      - 'backend/go/whisper/sources'
      - 'docs/'
    rules:
      # CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
      # boundary, and a handful of subcommands legitimately propagate values
      # to spawned subprocesses (LLAMACPP_GRPC_SERVERS, MLX hostfile, ...).
      - path: ^core/cli/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      # Backend subprocesses are independent binaries with their own env
      # surface; they're not "in-process config" of the LocalAI server.
      - path: ^backend/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      # System capability probe reads HOME, PATH-style vars to discover
      # GPUs, default paths, etc. — not LocalAI config.
      - path: ^pkg/system/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      # gRPC server reads AUTH_TOKEN passed in by the parent process at spawn
      # time; model.Loader sets/inherits env to communicate with subprocesses.
      - path: ^pkg/grpc/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      - path: ^pkg/model/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      # Top-level main binaries (local-ai, launcher) are entry points.
      - path: ^cmd/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      # Tests legitimately read $HOME, $TMPDIR, and gating env vars
      # (LOCALAI_COSIGN_LIVE, etc.) to skip live-network specs.
      - path: _test\.go$
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
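To make the anchored `forbidigo` patterns and the path exclusions in `.golangci.yml` easier to reason about, here is a simplified sketch. It uses Python's `re` as a stand-in for Go's regexp engine (these particular patterns behave the same in both), copies only a subset of the patterns, and models the exclusion rules as a plain path check. The helper names and the sample file paths are hypothetical.

```python
import re

# Subset of the forbid patterns from the config above.
FORBIDDEN = [
    r"^t\.Errorf$",
    r"^t\.Error$",
    r"^t\.Run$",
    r"^os\.(Getenv|LookupEnv|Environ)$",
]

# Paths exempted from the os.* rule by linters.exclusions.rules.
ENV_EXEMPT_PATHS = [
    r"^core/cli/", r"^backend/", r"^pkg/system/",
    r"^pkg/grpc/", r"^pkg/model/", r"^cmd/", r"_test\.go$",
]


def forbidden_by(call: str) -> list[str]:
    """Return the patterns that match a qualified identifier such as 'os.Getenv'."""
    return [p for p in FORBIDDEN if re.search(p, call)]


def is_reported(call: str, path: str) -> bool:
    """Approximate whether a call in a given file would be flagged."""
    if not forbidden_by(call):
        return False
    exempt = any(re.search(p, path) for p in ENV_EXEMPT_PATHS)
    # Only the os.* rule has path exemptions; the t.* rules apply everywhere.
    return not (call.startswith("os.") and exempt)


print(forbidden_by("t.Error"))
print(is_reported("os.Getenv", "core/http/app.go"))
print(is_reported("os.Getenv", "core/cli/run.go"))
```

Because every pattern is anchored with `^` and `$`, `t.Error` and `t.Errorf` need separate entries, and unrelated calls such as `os.Setenv` are left alone.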
--- a/.goreleaser.yaml
+++ b/.goreleaser.yaml
@@ -2,7 +2,6 @@ version: 2
 before:
  hooks:
    - make protogen-go
    - make react-ui
    - go mod tidy
 dist: release
 source:
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,45 +1,290 @@
-# LocalAI Agent Instructions
+# Build and testing
-This file is the entry point for AI coding assistants (Claude Code, Cursor, Copilot, Codex, Aider, etc.) working on LocalAI. It is an index to detailed topic guides in the `.agents/` directory. Read the relevant file(s) for the task at hand — you don't need to load all of them.
+Building and testing the project depends on the components involved and the platform where development is taking place. Due to the amount of context required, it's usually best not to try building or testing the project unless the user requests it. If you must build the project, inspect the Makefile in the project root and the Makefiles of any backends that are affected by the changes you are making. In addition, the workflows in .github/workflows can be used as a reference when it is unclear how to build or test a component. The primary Makefile contains targets for building inside or outside Docker; if the user has not previously specified a preference, ask which they would like to use.
-Human contributors: see [CONTRIBUTING.md](CONTRIBUTING.md) for the development workflow.
+## Building a specified backend
-## Policy for AI-Assisted Contributions
+Let's say the user wants to build a particular backend for a given platform. For example, let's say they want to build coqui for ROCm/hipblas.
-LocalAI follows the Linux kernel project's [guidelines for AI coding assistants](https://docs.kernel.org/process/coding-assistants.html). Before submitting AI-assisted code, read [.agents/ai-coding-assistants.md](.agents/ai-coding-assistants.md). Key rules:
+- The Makefile has targets like `docker-build-coqui` created with `generate-docker-build-target` at the time of writing. Recently added backends may require a new target.
 - At a minimum we need to set the BUILD_TYPE, BASE_IMAGE build-args
  - Use .github/workflows/backend.yml as a reference; it lists the needed args in the `include` job strategy matrix
  - l4t and cublas also require the CUDA major and minor version
 - You can pretty print a command like `DOCKER_MAKEFLAGS=-j$(nproc --ignore=1) BUILD_TYPE=hipblas BASE_IMAGE=rocm/dev-ubuntu-24.04:6.4.4 make docker-build-coqui`
 - Unless the user specifies that they want you to run the command, just print it, because not all agent frontends handle long-running jobs well and the output may overflow your context
 - The user may say they want to build AMD or ROCm instead of hipblas, Intel instead of SYCL, or NVIDIA instead of l4t or cublas. Ask for confirmation if there is ambiguity.
 - Sometimes the user may need extra parameters to be added to `docker build` (e.g. `--platform` for cross-platform builds or `--progress` to view the full logs), in which case you can generate the `docker build` command directly.
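The bullets above can be sketched as a small helper that resolves the build type and composes the command as text. This is only an illustration: the function names and the vendor table are assumptions, and the Intel candidates are taken from the build types this guide lists for the workflow matrix. The command is returned as a string so it can be printed rather than executed.

```python
import shlex

# Vendor names a user might say, mapped to BUILD_TYPE values named in this
# guide. More than one candidate means the request is ambiguous.
VENDOR_BUILD_TYPES = {
    "amd": ["hipblas"],
    "rocm": ["hipblas"],
    "intel": ["intel", "sycl_f16", "sycl_f32"],
    "nvidia": ["cublas", "l4t"],
}


def resolve_build_type(requested: str) -> str:
    """Return a BUILD_TYPE, or raise if the user must be asked to confirm."""
    key = requested.strip().lower()
    candidates = VENDOR_BUILD_TYPES.get(key, [key])
    if len(candidates) != 1:
        raise ValueError(f"'{requested}' is ambiguous, ask the user: {candidates}")
    return candidates[0]


def docker_build_command(backend: str, build_type: str, base_image: str) -> str:
    """Compose the make invocation as text; it is printed, not run."""
    return (
        "DOCKER_MAKEFLAGS=-j$(nproc --ignore=1) "
        f"BUILD_TYPE={shlex.quote(build_type)} "
        f"BASE_IMAGE={shlex.quote(base_image)} "
        f"make docker-build-{backend}"
    )


command = docker_build_command(
    "coqui", resolve_build_type("ROCM"), "rocm/dev-ubuntu-24.04:6.4.4"
)
print(command)
```

For l4t and cublas builds the CUDA major and minor versions would also need to be passed; they are left out here because the exact variable names should be taken from the workflow matrix.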
- **No `Signed-off-by` from AI.** Only the human submitter may sign off on the Developer Certificate of Origin.
+## Adding a New Backend
 - **No `Co-Authored-By: <AI>` trailers.** The human contributor owns the change.
 - **Use an `Assisted-by:` trailer** to attribute AI involvement. Format: `Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]`.
 - **The human submitter is responsible** for reviewing, testing, and understanding every line of generated code.
-## Topics
+When adding a new backend to LocalAI, you need to update several files to ensure the backend is properly built, tested, and registered. Here's a step-by-step guide based on the pattern used for adding backends like `moonshine`:
-| File | When to read |
+### 1. Create Backend Directory Structure
 |------|-------------|
 | [.agents/ai-coding-assistants.md](.agents/ai-coding-assistants.md) | Policy for AI-assisted contributions — licensing, DCO, attribution |
 | [.agents/building-and-testing.md](.agents/building-and-testing.md) | Building the project, running tests, Docker builds for specific platforms |
 | [.agents/ci-caching.md](.agents/ci-caching.md) | CI build cache layout (registry-backed BuildKit cache on quay.io/go-skynet/ci-cache, per-arch keys), `DEPS_REFRESH` weekly cache-buster for unpinned Python deps, prebuilt `base-grpc-*` images for llama.cpp variants, per-arch native + manifest-merge pattern, `setup-build-disk` `/mnt` relocation, path filter on master push, manual eviction |
 | [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist, including importer integration (the `/import-model` dropdown is server-driven from `GET /backends/known`) |
 | [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
 | [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
 | [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
 | [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
 | [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
 | [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
 | [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
 | [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
 | [.agents/adding-gallery-models.md](.agents/adding-gallery-models.md) | Adding GGUF models from HuggingFace to the model gallery |
 | [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) | LocalAI Assistant chat modality — adding admin tools to the in-process MCP server, editing skill prompts, keeping REST + MCP + skills in sync |
 | [.agents/backend-signing.md](.agents/backend-signing.md) | Backend OCI image signing (keyless cosign + sigstore-go) — producer-side CI setup, consumer-side gallery `verification:` block, strict mode (`LOCALAI_REQUIRE_BACKEND_INTEGRITY`), revocation via `not_before` |
-## Quick Reference
+Create the backend directory under the appropriate location:
 - **Python backends**: `backend/python/<backend-name>/`
 - **Go backends**: `backend/go/<backend-name>/`
 - **C++ backends**: `backend/cpp/<backend-name>/`
- **Logging**: Use `github.com/mudler/xlog` (same API as slog)
+For Python backends, you'll typically need:
- **Go style**: Prefer `any` over `interface{}`
+- `backend.py` - Main gRPC server implementation
- **Comments**: Explain *why*, not *what*
+- `Makefile` - Build configuration
- **Docs**: Update `docs/content/` when adding features or changing config
+- `install.sh` - Installation script for dependencies
- **New API endpoints**: LocalAI advertises its capability surface in several independent places — swagger `@Tags`, `/api/instructions` registry, auth `RouteFeatureRegistry`, React UI `capabilities.js`, docs. Read [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) and follow its checklist — missing any surface means clients, admins, and the UI won't know the endpoint exists.
+- `protogen.sh` - Protocol buffer generation script
- **Admin endpoints → MCP tool**: every admin endpoint that an admin would manage conversationally (install/list/edit/toggle/upgrade) MUST also be exposed as an MCP tool in `pkg/mcp/localaitools/`. The LocalAI Assistant chat modality and the standalone `local-ai mcp-server` consume that package; drift between REST and MCP is a real risk. Read [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) — the `TestToolHTTPRouteMappingComplete` test fails until you wire the new tool and update the route map.
+- `requirements.txt` - Python dependencies
- **Build**: Inspect `Makefile` and `.github/workflows/` — ask the user before running long builds
+- `run.sh` - Runtime script
- **UI**: The active UI is the React app in `core/http/react-ui/`. The older Alpine.js/HTML UI in `core/http/static/` is pending deprecation — all new UI work goes in the React UI
+- `test.py` / `test.sh` - Test files
 ### 2. Add Build Configurations to `.github/workflows/backend.yml`
 Add build matrix entries for each platform/GPU type you want to support. Look at similar backends (e.g., `chatterbox`, `faster-whisper`) for reference.
 **Placement in file:**
 - CPU builds: Add after other CPU builds (e.g., after `cpu-chatterbox`)
 - CUDA 12 builds: Add after other CUDA 12 builds (e.g., after `gpu-nvidia-cuda-12-chatterbox`)
 - CUDA 13 builds: Add after other CUDA 13 builds (e.g., after `gpu-nvidia-cuda-13-chatterbox`)
 **Additional build types you may need:**
 - ROCm/HIP: Use `build-type: 'hipblas'` with `base-image: "rocm/dev-ubuntu-24.04:6.4.4"`
 - Intel/SYCL: Use `build-type: 'intel'` or `build-type: 'sycl_f16'`/`sycl_f32` with `base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"`
 - L4T (ARM): Use `build-type: 'l4t'` with `platforms: 'linux/arm64'` and `runs-on: 'ubuntu-24.04-arm'`
 ### 3. Add Backend Metadata to `backend/index.yaml`
 **Step 3a: Add Meta Definition**
 Add a YAML anchor definition in the `## metas` section (around line 2-300). Look for similar backends to use as a template, such as `diffusers` or `chatterbox`.
 **Step 3b: Add Image Entries**
 Add image entries at the end of the file, following the pattern of similar backends such as `diffusers` or `chatterbox`. Include both `latest` (production) and `master` (development) tags.
 ### 4. Update the Makefile
 The Makefile needs to be updated in several places to support building and testing the new backend:
 **Step 4a: Add to `.NOTPARALLEL`**
 Add `backends/<backend-name>` to the `.NOTPARALLEL` line (around line 2) to prevent parallel execution conflicts:
 ```makefile
 .NOTPARALLEL: ... backends/<backend-name>
 ```
 **Step 4b: Add to `prepare-test-extra`**
 Add the backend to the `prepare-test-extra` target (around line 312) to prepare it for testing:
 ```makefile
 prepare-test-extra: protogen-python
 	...
 	$(MAKE) -C backend/python/<backend-name>
 ```
 **Step 4c: Add to `test-extra`**
 Add the backend to the `test-extra` target (around line 319) to run its tests:
 ```makefile
 test-extra: prepare-test-extra
 	...
 	$(MAKE) -C backend/python/<backend-name> test
 ```
 **Step 4d: Add Backend Definition**
 Add a backend definition variable in the backend definitions section (around line 428-457). The format depends on the backend type:
 **For Python backends with root context** (like `faster-whisper`, `coqui`):
 ```makefile
 BACKEND_<BACKEND_NAME> = <backend-name>|python|.|false|true
 ```
 **For Python backends with `./backend` context** (like `chatterbox`, `moonshine`):
 ```makefile
 BACKEND_<BACKEND_NAME> = <backend-name>|python|./backend|false|true
 ```
 **For Go backends**:
 ```makefile
 BACKEND_<BACKEND_NAME> = <backend-name>|golang|.|false|true
 ```
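The definition strings above are pipe-delimited with five fields. The sketch below splits one into named parts so the context field can be checked against the workflow file. The first three field names follow the examples (name, language, build context); this guide does not say what the last two fields mean, so they are kept as opaque strings. The class and function names are hypothetical.

```python
from typing import NamedTuple


class BackendDefinition(NamedTuple):
    name: str
    language: str
    context: str
    field_4: str  # meaning not stated in this guide
    field_5: str  # meaning not stated in this guide


def parse_backend_definition(value: str) -> BackendDefinition:
    """Split a 'name|language|context|x|y' definition into its five fields."""
    fields = value.split("|")
    if len(fields) != 5:
        raise ValueError(f"expected 5 pipe-separated fields, got {len(fields)}")
    return BackendDefinition(*fields)


moonshine = parse_backend_definition("moonshine|python|./backend|false|true")
print(moonshine.name, moonshine.language, moonshine.context)
```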
 **Step 4e: Generate Docker Build Target**
 Add an eval call to generate the docker-build target (around line 480-501):
 ```makefile
 $(eval $(call generate-docker-build-target,$(BACKEND_<BACKEND_NAME>)))
 ```
 **Step 4f: Add to `docker-build-backends`**
 Add `docker-build-<backend-name>` to the `docker-build-backends` target (around line 507):
 ```makefile
 docker-build-backends: ... docker-build-<backend-name>
 ```
 **Determining the Context:**
 - If the backend is in `backend/python/<backend-name>/` and uses `./backend` as context in the workflow file, use `./backend` context
 - If the backend is in `backend/python/<backend-name>/` but uses `.` as context in the workflow file, use `.` context
 - Check similar backends to determine the correct context
 ### 5. Verification Checklist
 After adding a new backend, verify:
 - [ ] Backend directory structure is complete with all necessary files
 - [ ] Build configurations added to `.github/workflows/backend.yml` for all desired platforms
 - [ ] Meta definition added to `backend/index.yaml` in the `## metas` section
 - [ ] Image entries added to `backend/index.yaml` for all build variants (latest + development)
 - [ ] Tag suffixes match between workflow file and index.yaml
 - [ ] Makefile updated with all 6 required changes (`.NOTPARALLEL`, `prepare-test-extra`, `test-extra`, backend definition, docker-build target eval, `docker-build-backends`)
 - [ ] No YAML syntax errors (check with linter)
 - [ ] No Makefile syntax errors (check with linter)
 - [ ] Follows the same pattern as similar backends (e.g., if it's a transcription backend, follow `faster-whisper` pattern)
 ### 6. Example: Adding a Python Backend
 For reference, when `moonshine` was added:
 - **Files created**: `backend/python/moonshine/{backend.py, Makefile, install.sh, protogen.sh, requirements.txt, run.sh, test.py, test.sh}`
 - **Workflow entries**: 3 build configurations (CPU, CUDA 12, CUDA 13)
 - **Index entries**: 1 meta definition + 6 image entries (cpu, cuda12, cuda13 × latest/development)
 - **Makefile updates**: 
  - Added to `.NOTPARALLEL` line
  - Added to `prepare-test-extra` and `test-extra` targets
  - Added `BACKEND_MOONSHINE = moonshine|python|./backend|false|true`
  - Added eval for docker-build target generation
  - Added `docker-build-moonshine` to `docker-build-backends`
 # Coding style
 - The project has the following .editorconfig
 ```
 root = true
 [*]
 indent_style = space
 indent_size = 2
 end_of_line = lf
 charset = utf-8
 trim_trailing_whitespace = true
 insert_final_newline = true
 [*.go]
 indent_style = tab
 [Makefile]
 indent_style = tab
 [*.proto]
 indent_size = 2
 [*.py]
 indent_size = 4
 [*.js]
 indent_size = 2
 [*.yaml]
 indent_size = 2
 [*.md]
 trim_trailing_whitespace = false
 ```
 - Use comments sparingly to explain why code does something, not what it does. Comments are there to add context that would be difficult to deduce from reading the code.
 - Prefer modern Go e.g. use `any` not `interface{}`
 # Logging
 Use `github.com/mudler/xlog` for logging; it has the same API as slog.
 # llama.cpp Backend
 The llama.cpp backend (`backend/cpp/llama-cpp/grpc-server.cpp`) is a gRPC adaptation of the upstream HTTP server (`llama.cpp/tools/server/server.cpp`). It uses the same underlying server infrastructure from `llama.cpp/tools/server/server-context.cpp`.
 ## Building and Testing
 - Test llama.cpp backend compilation: `make backends/llama-cpp`
 - The backend is built as part of the main build process
 - Check `backend/cpp/llama-cpp/Makefile` for build configuration
 ## Architecture
 - **grpc-server.cpp**: gRPC server implementation, adapts HTTP server patterns to gRPC
 - Uses shared server infrastructure: `server-context.cpp`, `server-task.cpp`, `server-queue.cpp`, `server-common.cpp`
 - The gRPC server mirrors the HTTP server's functionality but uses gRPC instead of HTTP
 ## Common Issues When Updating llama.cpp
 When fixing compilation errors after upstream changes:
 1. Check how `server.cpp` (HTTP server) handles the same change
 2. Look for new public APIs or getter methods
 3. Store copies of needed data instead of accessing private members
 4. Update function calls to match new signatures
 5. Test with `make backends/llama-cpp`
 ## Key Differences from HTTP Server
 - gRPC uses `BackendServiceImpl` class with gRPC service methods
 - HTTP server uses `server_routes` with HTTP handlers
 - Both use the same `server_context` and task queue infrastructure
 - gRPC methods: `LoadModel`, `Predict`, `PredictStream`, `Embedding`, `Rerank`, `TokenizeString`, `GetMetrics`, `Health`
 ## Tool Call Parsing Maintenance
 When working on JSON/XML tool call parsing functionality, always check llama.cpp for the reference implementation and updates:
 ### Checking for XML Parsing Changes
 1. **Review XML Format Definitions**: Check `llama.cpp/common/chat-parser-xml-toolcall.h` for `xml_tool_call_format` struct changes
 2. **Review Parsing Logic**: Check `llama.cpp/common/chat-parser-xml-toolcall.cpp` for parsing algorithm updates
 3. **Review Format Presets**: Check `llama.cpp/common/chat-parser.cpp` for new XML format presets (search for `xml_tool_call_format form`)
 4. **Review Model Lists**: Check `llama.cpp/common/chat.h` for `COMMON_CHAT_FORMAT_*` enum values that use XML parsing:
   - `COMMON_CHAT_FORMAT_GLM_4_5`
   - `COMMON_CHAT_FORMAT_MINIMAX_M2`
   - `COMMON_CHAT_FORMAT_KIMI_K2`
   - `COMMON_CHAT_FORMAT_QWEN3_CODER_XML`
   - `COMMON_CHAT_FORMAT_APRIEL_1_5`
   - `COMMON_CHAT_FORMAT_XIAOMI_MIMO`
   - Any new formats added
 ### Model Configuration Options
 Always check `llama.cpp` for new model configuration options that should be supported in LocalAI:
 1. **Check Server Context**: Review `llama.cpp/tools/server/server-context.cpp` for new parameters
 2. **Check Chat Params**: Review `llama.cpp/common/chat.h` for `common_chat_params` struct changes
 3. **Check Server Options**: Review `llama.cpp/tools/server/server.cpp` for command-line argument changes
 4. **Examples of options to check**:
   - `ctx_shift` - Context shifting support
   - `parallel_tool_calls` - Parallel tool calling
   - `reasoning_format` - Reasoning format options
   - Any new flags or parameters
 ### Implementation Guidelines
 1. **Feature Parity**: Always aim for feature parity with llama.cpp's implementation
 2. **Test Coverage**: Add tests for new features matching llama.cpp's behavior
 3. **Documentation**: Update relevant documentation when adding new formats or options
 4. **Backward Compatibility**: Ensure changes don't break existing functionality
 ### Files to Monitor
 - `llama.cpp/common/chat-parser-xml-toolcall.h` - Format definitions
 - `llama.cpp/common/chat-parser-xml-toolcall.cpp` - Parsing logic
 - `llama.cpp/common/chat-parser.cpp` - Format presets and model-specific handlers
 - `llama.cpp/common/chat.h` - Format enums and parameter structures
 - `llama.cpp/tools/server/server-context.cpp` - Server configuration options
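One way to spot upstream changes in the files listed above is to diff two revisions of a local llama.cpp checkout. The following is a minimal sketch, not part of LocalAI's tooling: the function name `monitored_changes` and its arguments are hypothetical, and it assumes the checkout root corresponds to the `llama.cpp/` prefix used in the list.

```shell
#!/bin/sh
# Hypothetical helper: print the monitored files that differ between two refs.
# $1 = path to a llama.cpp checkout, $2 = old ref, $3 = new ref (all assumptions).
monitored_changes() {
    repo="$1"
    old_ref="$2"
    new_ref="$3"
    # --name-only prints only paths; the pathspec after "--" limits the diff
    # to the files this guide asks maintainers to monitor.
    git -C "$repo" diff --name-only "$old_ref" "$new_ref" -- \
        common/chat-parser-xml-toolcall.h \
        common/chat-parser-xml-toolcall.cpp \
        common/chat-parser.cpp \
        common/chat.h \
        tools/server/server-context.cpp
}
```

For example, `monitored_changes ./llama.cpp <old-commit> <new-commit>` prints one changed path per line, or nothing if none of the monitored files changed between the two commits.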
 # Documentation
 The project documentation is located in `docs/content`. When adding new features or changing existing functionality, it is crucial to update the documentation to reflect these changes. This helps users understand how to use the new capabilities and ensures the documentation stays relevant.
 - **Feature Documentation**: If you add a new feature (like a new backend or API endpoint), create a new markdown file in `docs/content/features/` explaining what it is, how to configure it, and how to use it.
 - **Configuration**: If you modify configuration options, update the relevant sections in `docs/content/`.
 - **Examples**: Providing concrete examples (like YAML configuration blocks) is highly encouraged to help users get started quickly.
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1 +0,0 @@
 AGENTS.md
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -7,13 +7,10 @@ Thank you for your interest in contributing to LocalAI! We appreciate your time
 - [Getting Started](#getting-started)
  - [Prerequisites](#prerequisites)
  - [Setting up the Development Environment](#setting-up-the-development-environment)
  - [Environment Variables](#environment-variables)
 - [Contributing](#contributing)
  - [Submitting an Issue](#submitting-an-issue)
  - [Development Workflow](#development-workflow)
  - [Creating a Pull Request (PR)](#creating-a-pull-request-pr)
 - [Coding Guidelines](#coding-guidelines)
 - [AI Coding Assistants](#ai-coding-assistants)
 - [Testing](#testing)
 - [Documentation](#documentation)
 - [Community and Communication](#community-and-communication)
@@ -22,122 +19,18 @@ Thank you for your interest in contributing to LocalAI! We appreciate your time
 ### Prerequisites
- **Go 1.21+** (the project currently uses Go 1.26 in `go.mod`, but 1.21 is the minimum supported version)
+- Golang [1.21]
-  - [Download Go](https://go.dev/dl/) or install via your package manager
+- Git
-  - macOS: `brew install go`
+- macOS/Linux
  - Ubuntu/Debian: follow the [official instructions](https://go.dev/doc/install) (the `apt` version is often outdated)
  - Verify: `go version`
 - **Git**
 - **GNU Make**
 - **GCC / C/C++ toolchain** (required for CGo and native backends)
 - **Protocol Buffers compiler** (`protoc`) — needed for gRPC code generation
-#### System dependencies by platform
+### Setting up the Development Environment and running localAI in the local environment
-<details>
+1. Clone the repository: `git clone https://github.com/go-skynet/LocalAI.git`
-<summary><strong>Ubuntu / Debian</strong></summary>
+2. Navigate to the project directory: `cd LocalAI`
-
+3. Install the required dependencies (see https://localai.io/basics/build/#build-localai-locally)
-```bash
+4. Build LocalAI: `make build`
-sudo apt-get update
+5. Run LocalAI: `./local-ai`
-sudo apt-get install -y build-essential gcc g++ cmake git wget \
+6. To build with live reload: `make build-dev`
  protobuf-compiler libprotobuf-dev pkg-config \
  libopencv-dev libgrpc-dev
 ```
 </details>
 <details>
 <summary><strong>CentOS / RHEL / Fedora</strong></summary>
 ```bash
 sudo dnf groupinstall -y "Development Tools"
 sudo dnf install -y cmake git wget protobuf-compiler protobuf-devel \
  opencv-devel grpc-devel
 ```
 </details>
 <details>
 <summary><strong>macOS</strong></summary>
 ```bash
 xcode-select --install
 brew install cmake git protobuf grpc opencv wget
 ```
 </details>
 <details>
 <summary><strong>Windows</strong></summary>
 Use [WSL 2](https://learn.microsoft.com/en-us/windows/wsl/install) with an Ubuntu distribution, then follow the Ubuntu instructions above.
 </details>
 ### Setting up the Development Environment
 1. **Clone the repository:**
   ```bash
   git clone https://github.com/mudler/LocalAI.git
   cd LocalAI
   ```
 2. **Build LocalAI:**
   ```bash
   make build
   ```
   This runs protobuf generation, installs Go tools, builds the React UI, and compiles the `local-ai` binary. Key build variables you can set:
   | Variable | Description | Example |
   |---|---|---|
   | `BUILD_TYPE` | GPU/accelerator type (`cublas`, `hipblas`, `intel`, ``) | `BUILD_TYPE=cublas make build` |
   | `GO_TAGS` | Additional Go build tags | `GO_TAGS=debug make build` |
   | `CUDA_MAJOR_VERSION` | CUDA major version (default: `13`) | `CUDA_MAJOR_VERSION=12` |
 3. **Run LocalAI:**
   ```bash
   ./local-ai
   ```
 4. **Development mode with live reload:**
   ```bash
   make build-dev
   ```
   This installs [`air`](https://github.com/air-verse/air) automatically and watches for file changes, rebuilding and restarting the server on each save.
 5. **Containerized build** (no local toolchain needed):
   ```bash
   make docker
   ```
   For GPU-specific Docker builds, see the `docker-build-*` targets in the Makefile and refer to [CLAUDE.md](CLAUDE.md) for detailed backend build instructions.
 ### Environment Variables
 LocalAI is configured primarily through environment variables (or equivalent CLI flags). The most useful ones for development are:
 | Variable | Description | Default |
 |---|---|---|
 | `LOCALAI_DEBUG` | Enable debug mode | `false` |
 | `LOCALAI_LOG_LEVEL` | Log verbosity (`error`, `warn`, `info`, `debug`, `trace`) | — |
 | `LOCALAI_LOG_FORMAT` | Log format (`default`, `text`, `json`) | `default` |
 | `LOCALAI_MODELS_PATH` | Path to model files | `./models` |
 | `LOCALAI_BACKENDS_PATH` | Path to backend binaries | `./backends` |
 | `LOCALAI_CONFIG_DIR` | Directory for dynamic config files (API keys, external backends) | `./configuration` |
 | `LOCALAI_THREADS` | Number of threads for inference | — |
 | `LOCALAI_ADDRESS` | Bind address for the API server | `:8080` |
 | `LOCALAI_API_KEY` | API key(s) for authentication | — |
 | `LOCALAI_CORS` | Enable CORS | `false` |
 | `LOCALAI_DISABLE_WEBUI` | Disable the web UI | `false` |
 See `core/cli/run.go` for the full list of supported environment variables.
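As an illustration of the table above, a development shell could export a few of these variables before starting the server. This is only a sketch: the function name `localai_dev_env` is hypothetical, and the chosen values (`true`, `debug`) are assumptions based on the descriptions in the table rather than verified defaults.

```shell
#!/bin/sh
# Hypothetical helper: export a development-friendly LocalAI configuration.
# $1 = models directory (optional, falls back to the documented default ./models).
localai_dev_env() {
    export LOCALAI_DEBUG=true                    # enable debug mode
    export LOCALAI_LOG_LEVEL=debug               # one of: error, warn, info, debug, trace
    export LOCALAI_MODELS_PATH="${1:-./models}"  # where model files are stored
    export LOCALAI_ADDRESS=":8080"               # bind address for the API server
}

localai_dev_env ./models
# ./local-ai   # start the server with these settings (requires a built binary)
```

Wrapping the exports in a function keeps the settings in one place and makes it easy to point a second shell at a different models directory.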
 ## Contributing
@@ -147,148 +40,43 @@ We welcome contributions from everyone! To get started, follow these steps:
 If you find a bug, have a feature request, or encounter any issues, please check the [issue tracker](https://github.com/go-skynet/LocalAI/issues) to see if a similar issue has already been reported. If not, feel free to [create a new issue](https://github.com/go-skynet/LocalAI/issues/new) and provide as much detail as possible.
-### Development Workflow
+### Creating a Pull Request (PR)
 #### Branch naming conventions
 Use a descriptive branch name that indicates the type and scope of the change:
 - `feature/<short-description>` — new functionality
 - `fix/<short-description>` — bug fixes
 - `docs/<short-description>` — documentation changes
 - `refactor/<short-description>` — code refactoring
 #### Commit messages
 - Use a short, imperative subject line (e.g., "feat: add whisper backend support", not "Added whisper backend support")
 - Keep the subject under 72 characters
 - Use the body to explain **why** the change was made when the subject alone is not sufficient
 - Use [conventional commits](https://www.conventionalcommits.org/en/v1.0.0/)
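The subject-line rules above can be checked mechanically before committing. The sketch below is an assumption-laden example, not a hook shipped with LocalAI: the function name `check_subject` is hypothetical, and the list of accepted type prefixes is an illustrative subset of commonly used conventional-commit types.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a commit subject follows the rules above.
check_subject() {
    subject="$1"
    # Keep the subject under 72 characters.
    [ "${#subject}" -lt 72 ] || return 1
    # Require a conventional-commit style prefix such as "feat:" or "fix(scope):".
    # The accepted types here are an assumed subset, adjust as needed.
    printf '%s\n' "$subject" |
        grep -Eq '^(feat|fix|docs|refactor|test|chore)(\([^)]+\))?!?: .+'
}
```

With this definition, `check_subject "feat: add whisper backend support"` succeeds, while `check_subject "Added whisper backend support"` fails because it lacks a type prefix.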
 #### Creating a Pull Request (PR)
 Before jumping into a PR for a massive feature or big change, it is preferred to discuss it first via an issue.
 1. Fork the repository.
-2. Create a new branch: `git checkout -b feature/my-change`
+2. Create a new branch with a descriptive name: `git checkout -b [branch name]`
-3. Make your changes, keeping commits focused and atomic.
+3. Make your changes and commit them.
-4. Run tests locally before pushing (see [Testing](#testing) below).
+4. Push the changes to your fork: `git push origin [branch name]`
-5. Push to your fork: `git push origin feature/my-change`
+5. Create a new pull request from your branch to the main project's `main` or `master` branch.
-6. Open a pull request against the `master` branch.
+6. Provide a clear description of your changes in the pull request.
-7. Fill in the PR description with:
+7. Make any requested changes during the review process.
-   - What the change does and why
+8. Once your PR is approved, it will be merged into the main project.
   - How it was tested
   - Any breaking changes or migration steps
 8. Respond to review feedback promptly. Push follow-up commits rather than force-pushing amended commits so reviewers can see incremental changes.
 9. Once approved, a maintainer will merge your PR.
 ## Coding Guidelines
-This project uses an [`.editorconfig`](.editorconfig) file to define formatting standards (indentation, line endings, charset, etc.). Please configure your editor to respect it.
+- No specific coding guidelines at the moment. Please make sure the code can be tested. The most popular lint tools like [`golangci-lint`](https://golangci-lint.run) can help you here.
 For AI-assisted development, see [`AGENTS.md`](AGENTS.md) (or the equivalent [`CLAUDE.md`](CLAUDE.md) symlink) for agent-specific guidelines including build instructions and backend architecture details. Contributions produced with AI assistance must follow the rules in the [AI Coding Assistants](#ai-coding-assistants) section below.
 ### General Principles
 - Write code that can be tested. All new features and bug fixes should include test coverage.
 - Use comments sparingly to explain **why** code does something, not **what** it does. Comments should add context that would be difficult to deduce from reading the code alone.
 - Keep changes focused. Avoid unrelated refactors, formatting changes, or feature additions in the same PR.
 ### Go Code
 - Prefer modern Go idioms — for example, use `any` instead of `interface{}`.
 - Use [`golangci-lint`](https://golangci-lint.run) to catch common issues before submitting a PR.
 - Use [`github.com/mudler/xlog`](https://github.com/mudler/xlog) for logging (same API as `slog`). Do not use `fmt.Println` or the standard `log` package for operational logging.
 - Use tab indentation for Go files (as defined in `.editorconfig`).
 ### Python Code
 - Use 4-space indentation (as defined in `.editorconfig`).
 - Include a `requirements.txt` for any new dependencies.
 ### Code Review
 - All contributions go through code review via pull requests.
 - Reviewers will check for correctness, test coverage, adherence to these guidelines, and clarity of intent.
 - Be responsive to review feedback and keep discussions constructive.
 ## AI Coding Assistants
 LocalAI follows the **same guidelines as the Linux kernel project** for AI-assisted contributions: <https://docs.kernel.org/process/coding-assistants.html>.
 The full policy for this repository lives in [`.agents/ai-coding-assistants.md`](.agents/ai-coding-assistants.md). Summary:
 - **AI agents MUST NOT add `Signed-off-by` tags.** Only humans can certify the Developer Certificate of Origin.
 - **AI agents MUST NOT add `Co-Authored-By` trailers** attributing themselves as co-authors.
 - **Attribute AI involvement with an `Assisted-by` trailer** in the commit message:
  ```
  Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]
  ```
  Example: `Assisted-by: Claude:claude-opus-4-7 golangci-lint`
  Basic development tools (git, go, make, editors) should not be listed.
 - **The human submitter is responsible** for reviewing, testing, and fully understanding every line of AI-generated code — including verifying that any referenced APIs, flags, or file paths actually exist in the tree.
 - Contributions must remain compatible with LocalAI's **MIT License**.
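To make the `Assisted-by` trailer format concrete, a commit message can be assembled with the trailer in its own final paragraph and piped to `git commit -F -`, which reads the message from standard input. This is a minimal sketch; the function name `assisted_commit_message` is hypothetical and not part of the repository.

```shell
#!/bin/sh
# Hypothetical helper: print a commit message with an Assisted-by trailer.
# $1 = subject line; remaining arguments = "AGENT_NAME:MODEL_VERSION [TOOL...]".
assisted_commit_message() {
    subject="$1"
    shift
    # A blank line separates the subject from the trailer block.
    printf '%s\n\nAssisted-by: %s\n' "$subject" "$*"
}

# Usage, inside a repository with staged changes:
# assisted_commit_message "feat: add whisper backend support" \
#     "Claude:claude-opus-4-7 golangci-lint" | git commit -F -
```

Because the helper only prints text, the human submitter can review the message before committing, and no `Signed-off-by` or `Co-Authored-By` line is added on the agent's behalf.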
 ## Testing
-All new features and bug fixes should include test coverage. The project uses [Ginkgo](https://onsi.github.io/ginkgo/) as its test framework.
+`make test` cannot currently handle all models. Please be sure to add a test case for any new feature or any part that was changed.
-### Running unit tests
+### Running AIO tests
-```bash
+All-In-One images have a set of tests that automatically verify that most of the endpoints work correctly. A typical flow is:
 make test
 ```
 This downloads test model fixtures, runs protobuf generation, and executes the full test suite including llama-gguf, TTS, and stable-diffusion tests. Note: some tests require model files to be downloaded, so the first run may take longer.
 To run tests for a specific package:
 ```bash
 go test ./core/config/...
 go test ./pkg/model/...
 ```
 To run a specific test by name using Ginkgo's `--focus` flag:
 ```bash
 go run github.com/onsi/ginkgo/v2/ginkgo --focus="should load a model" -v -r ./core/
 ```
 ### Running end-to-end tests
 The e2e tests run LocalAI in a Docker container and exercise the API:
 ```bash
 make test-e2e
 ```
 ### Running E2E container tests
 These tests build a standard LocalAI Docker image and run it with pre-configured model configs to verify that most endpoints work correctly:
 ```bash
 # Build the LocalAI docker image
-make docker-build-e2e
+make DOCKER_IMAGE=local-ai docker
-# Run the e2e tests (uses model configs from tests/e2e-aio/models/)
+# Build the corresponding AIO image
-make e2e-aio
+BASE_IMAGE=local-ai DOCKER_AIO_IMAGE=local-ai-aio:test make docker-aio
 ```
-### Testing backends
+# Run the AIO e2e tests
-
+LOCALAI_IMAGE_TAG=test LOCALAI_IMAGE=local-ai-aio make run-e2e-aio
 To prepare and test extra (Python) backends:
 ```bash
 make prepare-test-extra   # build Python backends for testing
 make test-extra           # run backend-specific tests
 ```
 ## Documentation
-We welcome contributions to the documentation. Please open a new PR or create a new issue. The documentation is available under `docs/` https://github.com/mudler/LocalAI/tree/master/docs
+We welcome contributions to the documentation; please open a new PR or create a new issue. The documentation is available under `docs/` https://github.com/mudler/LocalAI/tree/master/docs
 ### Gallery YAML Schema
--- a/47
+++ b/47
@@ -1,23 +1,16 @@
 ARG BASE_IMAGE=ubuntu:24.04
 ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
 ARG INTEL_BASE_IMAGE=${BASE_IMAGE}
 ARG UBUNTU_CODENAME=noble
 # Optional alternate Ubuntu apt mirror(s). Empty = use upstream.
 # See .docker/apt-mirror.sh for accepted values.
 ARG APT_MIRROR=""
 ARG APT_PORTS_MIRROR=""
 FROM ${BASE_IMAGE} AS requirements
 ARG APT_MIRROR
 ARG APT_PORTS_MIRROR
 ENV DEBIAN_FRONTEND=noninteractive
-RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
+RUN apt-get update && \
    APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
    apt-get update && \
    apt-get install -y --no-install-recommends \
        ca-certificates curl wget espeak-ng libgomp1 \
-        ffmpeg libopenblas0 libopenblas-dev libopus0 sox && \
+        ffmpeg libopenblas0 libopenblas-dev sox && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
@@ -156,7 +149,6 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
            hipblaslt-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
@@ -184,7 +176,7 @@ ENV PATH=/opt/rocm/bin:${PATH}
 # The requirements-core target is common to all images.  It should not be placed in requirements-core unless every single build will use it.
 FROM requirements-drivers AS build-requirements
-ARG GO_VERSION=1.26.0
+ARG GO_VERSION=1.25.4
 ARG CMAKE_VERSION=3.31.10
 ARG CMAKE_FROM_SOURCE=false
 ARG TARGETARCH
@@ -198,7 +190,6 @@ RUN apt-get update && \
        curl libssl-dev \
        git \
        git-lfs \
        libopus-dev pkg-config \
        unzip upx-ucl python3 python-is-python3 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
@@ -248,14 +239,10 @@ WORKDIR /build
 # This is a temporary workaround until Intel fixes their repository
 FROM ${INTEL_BASE_IMAGE} AS intel
 ARG UBUNTU_CODENAME=noble
 ARG APT_MIRROR
 ARG APT_PORTS_MIRROR
 RUN wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
 gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg
 RUN echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu ${UBUNTU_CODENAME}/lts/2350 unified" > /etc/apt/sources.list.d/intel-graphics.list
-RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
+RUN apt-get update && \
    APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
    apt-get update && \
    apt-get install -y --no-install-recommends \
        intel-oneapi-runtime-libs && \
    apt-get clean && \
@@ -268,7 +255,7 @@ RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mi
 FROM build-requirements AS builder-base
-ARG GO_TAGS="auth"
+ARG GO_TAGS=""
 ARG GRPC_BACKENDS
 ARG MAKEFLAGS
 ARG LD_FLAGS="-s -w"
@@ -304,17 +291,6 @@ EOT
 ###################################
 ###################################
 # Build React UI
 FROM node:26-slim AS react-ui-builder
 WORKDIR /app
 COPY core/http/react-ui/package*.json ./
 RUN npm install
 COPY core/http/react-ui/ ./
 RUN npm run build
 ###################################
 ###################################
 # Compile backends first in a separate stage
 FROM builder-base AS builder-backends
 ARG TARGETARCH
@@ -331,6 +307,7 @@ COPY ./.git ./.git
 # Some of the Go backends use libs from the main src, we could further optimize the caching by building the CPP backends before here
 COPY ./pkg/grpc ./pkg/grpc
 COPY ./pkg/utils ./pkg/utils
 COPY ./pkg/langchain ./pkg/langchain
 RUN ls -l ./
 RUN make protogen-go
@@ -343,9 +320,6 @@ WORKDIR /build
 COPY . .
 # Copy pre-built React UI
 COPY --from=react-ui-builder /app/dist ./core/http/react-ui/dist
 ## Build the binary
 ## If we're on arm64 AND using cublas/hipblas, skip some of the llama-compat backends to save space
 ## Otherwise just run the normal build
@@ -390,17 +364,14 @@ COPY ./entrypoint.sh .
 # Copy the binary
 COPY --from=builder /build/local-ai ./
 # Copy the opus shim if it was built
 RUN --mount=from=builder,src=/build/,dst=/mnt/build \
    if [ -f /mnt/build/libopusshim.so ]; then cp /mnt/build/libopusshim.so ./; fi
 # Make sure the models directory exists
-RUN mkdir -p /models /backends /data
+RUN mkdir -p /models /backends
 # Define the health check command
 HEALTHCHECK --interval=1m --timeout=10m --retries=10 \
  CMD curl -f ${HEALTHCHECK_ENDPOINT} || exit 1
-VOLUME /models /backends /configuration /data
+VOLUME /models /backends /configuration
 EXPOSE 8080
 ENTRYPOINT [ "/entrypoint.sh" ]
--- a/Dockerfile.aio
+++ b/Dockerfile.aio
@@ -0,0 +1,8 @@
 ARG BASE_IMAGE=ubuntu:24.04
 FROM ${BASE_IMAGE} 
 RUN apt-get update && apt-get install -y pciutils && apt-get clean
 COPY aio/ /aio
 ENTRYPOINT [ "/aio/entrypoint.sh" ]
--- a/812
+++ b/812
--- a/README.md
+++ b/README.md
@@ -5,14 +5,26 @@
 </h1>
 <p align="center">
 <a href="https://github.com/go-skynet/LocalAI/fork" target="blank">
 <img src="https://img.shields.io/github/forks/go-skynet/LocalAI?style=for-the-badge" alt="LocalAI forks"/>
 </a>
 <a href="https://github.com/go-skynet/LocalAI/stargazers" target="blank">
 <img src="https://img.shields.io/github/stars/go-skynet/LocalAI?style=for-the-badge" alt="LocalAI stars"/>
 </a>
 <a href="https://github.com/go-skynet/LocalAI/pulls" target="blank">
 <img src="https://img.shields.io/github/issues-pr/go-skynet/LocalAI?style=for-the-badge" alt="LocalAI pull-requests"/>
 </a>
 <a href='https://github.com/go-skynet/LocalAI/releases'>
 <img src='https://img.shields.io/github/release/go-skynet/LocalAI?&label=Latest&style=for-the-badge'>
 </a>
-<a href="LICENSE" target="blank">
+</p>
-<img src="https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge" alt="LocalAI License"/>
+
 <p align="center">
 <a href="https://hub.docker.com/r/localai/localai" target="blank">
 <img src="https://img.shields.io/badge/dockerhub-images-important.svg?logo=Docker" alt="LocalAI Docker hub"/>
 </a>
 <a href="https://quay.io/repository/go-skynet/local-ai?tab=tags&tag=latest" target="blank">
 <img src="https://img.shields.io/badge/quay.io-images-important.svg?" alt="LocalAI Quay.io"/>
 </a>
 </p>
@@ -29,186 +41,335 @@
 <a href="https://trendshift.io/repositories/5539" target="_blank"><img src="https://trendshift.io/api/badge/repositories/5539" alt="mudler%2FLocalAI | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 </p>
-**LocalAI** is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.
+> :bulb: Get help - [❓FAQ](https://localai.io/faq/) [💭Discussions](https://github.com/go-skynet/LocalAI/discussions) [:speech_balloon: Discord](https://discord.gg/uJAeKSAGDy) [:book: Documentation website](https://localai.io/)
 >
 > [💻 Quickstart](https://localai.io/basics/getting_started/) [🖼️ Models](https://models.localai.io/) [🚀 Roadmap](https://github.com/mudler/LocalAI/issues?q=is%3Aissue+is%3Aopen+label%3Aroadmap) [🛫 Examples](https://github.com/mudler/LocalAI-examples) Try on 
 [![Telegram](https://img.shields.io/badge/Telegram-2CA5E0?style=for-the-badge&logo=telegram&logoColor=white)](https://t.me/localaiofficial_bot)
- **Drop-in API compatibility** — OpenAI, Anthropic, ElevenLabs APIs
+[![tests](https://github.com/go-skynet/LocalAI/actions/workflows/test.yml/badge.svg)](https://github.com/go-skynet/LocalAI/actions/workflows/test.yml)[![Build and Release](https://github.com/go-skynet/LocalAI/actions/workflows/release.yaml/badge.svg)](https://github.com/go-skynet/LocalAI/actions/workflows/release.yaml)[![build container images](https://github.com/go-skynet/LocalAI/actions/workflows/image.yml/badge.svg)](https://github.com/go-skynet/LocalAI/actions/workflows/image.yml)[![Bump dependencies](https://github.com/go-skynet/LocalAI/actions/workflows/bump_deps.yaml/badge.svg)](https://github.com/go-skynet/LocalAI/actions/workflows/bump_deps.yaml)[![Artifact Hub](https://img.shields.io/endpoint?url=https://artifacthub.io/badge/repository/localai)](https://artifacthub.io/packages/search?repo=localai)
 - **36+ backends** — llama.cpp, vLLM, transformers, whisper, diffusers, MLX...
 - **Any hardware** — NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
 - **Multi-user ready** — API key auth, user quotas, role-based access
 - **Built-in AI agents** — autonomous agents with tool use, RAG, MCP, and skills
 - **Privacy-first** — your data never leaves your infrastructure
-Created by [Ettore Di Giacinto](https://github.com/mudler) and maintained by the [LocalAI team](#team).
+**LocalAI** is the free, open-source OpenAI alternative. LocalAI acts as a drop-in replacement REST API that is compatible with the OpenAI (as well as ElevenLabs, Anthropic, ...) API specifications for local AI inferencing. It allows you to run LLMs and generate images, audio, and more, locally or on-prem on consumer-grade hardware, supporting multiple model families. It does not require a GPU. It is created and maintained by [Ettore Di Giacinto](https://github.com/mudler).
 > [:book: Documentation](https://localai.io/) | [:speech_balloon: Discord](https://discord.gg/uJAeKSAGDy) | [💻 Quickstart](https://localai.io/basics/getting_started/) | [🖼️ Models](https://models.localai.io/) | [❓FAQ](https://localai.io/faq/)
-## Guided tour
+## Local Stack Family
-https://github.com/user-attachments/assets/08cbb692-57da-48f7-963d-2e7b43883c18
+Liking LocalAI? LocalAI is part of an integrated suite of AI infrastructure tools. You might also like:
-<details>
+- **[LocalAGI](https://github.com/mudler/LocalAGI)** - AI agent orchestration platform with OpenAI Responses API compatibility and advanced agentic capabilities
 - **[LocalRecall](https://github.com/mudler/LocalRecall)** - MCP/REST API knowledge base system providing persistent memory and storage for AI agents
 - 🆕 **[Cogito](https://github.com/mudler/cogito)** - Go library for building intelligent, co-operative agentic software and LLM-powered workflows, focusing on improving results for small, open source language models that scales to any LLM. Powers LocalAGI and LocalAI MCP/Agentic capabilities
 - 🆕 **[Wiz](https://github.com/mudler/wiz)** - Terminal-based AI agent accessible via Ctrl+Space keybinding. Portable, local-LLM friendly shell assistant with TUI/CLI modes, tool execution with approval, MCP protocol support, and multi-shell compatibility (zsh, bash, fish)
 - 🆕 **[SkillServer](https://github.com/mudler/skillserver)** - Simple, centralized skills database for AI agents via MCP. Manages skills as Markdown files with MCP server integration, web UI for editing, Git synchronization, and full-text search capabilities
 <summary>
 Click to see more!
 </summary>
-#### User and auth
+## Screenshots / Video
-https://github.com/user-attachments/assets/228fa9ad-81a3-4d43-bfb9-31557e14a36c
+### Youtube video
-#### Agents
+<h1 align="center">
  <br>
  <a href="https://www.youtube.com/watch?v=PDqYhB9nNHA" target="_blank"> <img width="300" src="https://img.youtube.com/vi/PDqYhB9nNHA/0.jpg"> </a><br>
 <br>
 </h1>
 https://github.com/user-attachments/assets/6270b331-e21d-4087-a540-6290006b381a
-#### Usage metrics per user
+### Screenshots
-https://github.com/user-attachments/assets/cbb03379-23b4-4e3d-bd26-d152f057007f
+| Talk Interface | Generate Audio |
 | --- | --- |
 | ![Screenshot 2025-03-31 at 12-01-36 LocalAI - Talk](./docs/assets/images/screenshots/screenshot_tts.png) | ![Screenshot 2025-03-31 at 12-01-29 LocalAI - Generate audio with voice-en-us-ryan-low](./docs/assets/images/screenshots/screenshot_tts.png) |
-#### Fine-tuning and Quantization
+| Models Overview | Generate Images |
 | --- | --- |
 | ![Screenshot 2025-03-31 at 12-01-20 LocalAI - Models](./docs/assets/images/screenshots/screenshot_gallery.png) | ![Screenshot 2025-03-31 at 12-31-41 LocalAI - Generate images with flux 1-dev](./docs/assets/images/screenshots/screenshot_image.png) |
-https://github.com/user-attachments/assets/5ba4ace9-d3df-4795-b7d4-b0b404ea71ee
+| Chat Interface | Home |
 | --- | --- |
 | ![Screenshot 2025-03-31 at 11-57-44 LocalAI - Chat with localai-functioncall-qwen2 5-7b-v0 5](./docs/assets/images/screenshots/screenshot_chat.png) | ![Screenshot 2025-03-31 at 11-57-23 LocalAI API - c2a39e3 (c2a39e3639227cfd94ffffe9f5691239acc275a8)](./docs/assets/images/screenshots/screenshot_home.png) |
-#### WebRTC
+| Login | Swarm |
 | --- | --- |
 |![Screenshot 2025-03-31 at 12-09-59 ](./docs/assets/images/screenshots/screenshot_login.png) | ![Screenshot 2025-03-31 at 12-10-39 LocalAI - P2P dashboard](./docs/assets/images/screenshots/screenshot_p2p.png) |
-https://github.com/user-attachments/assets/ed88e34c-fed3-4b83-8a67-4716a9feeb7b
+## 💻 Quickstart
-</details>
+> ⚠️ **Note:** The `install.sh` script is currently experiencing issues because of the heavy changes LocalAI is undergoing, and it may produce broken or misconfigured installations. Please use the Docker installation (see below) or a manual binary installation until [issue #8032](https://github.com/mudler/LocalAI/issues/8032) is resolved.
-## Quickstart
+Run the installer script:
-### macOS
+```bash
 # Basic installation
 curl https://localai.io/install.sh | sh
 ```
 For more installation options, see [Installer Options](https://localai.io/installation/).
 ### macOS Download:
 <a href="https://github.com/mudler/LocalAI/releases/latest/download/LocalAI.dmg">
  <img src="https://img.shields.io/badge/Download-macOS-blue?style=for-the-badge&logo=apple&logoColor=white" alt="Download LocalAI for macOS"/>
 </a>
-> **Note:** The DMG is not signed by Apple. After installing, run: `sudo xattr -d com.apple.quarantine /Applications/LocalAI.app`. See [#6268](https://github.com/mudler/LocalAI/issues/6268) for details.
+> Note: the DMGs are not signed by Apple and are therefore quarantined by macOS. See https://github.com/mudler/LocalAI/issues/6268 for a workaround; the fix is tracked here: https://github.com/mudler/LocalAI/issues/6244
 ### Containers (Docker, podman, ...)
-> Already ran LocalAI before? Use `docker start -i local-ai` to restart an existing container.
+> **💡 Docker Run vs Docker Start**
 > 
 > - `docker run` creates and starts a new container. If a container with the same name already exists, this command will fail.
 > - `docker start` starts an existing container that was previously created with `docker run`.
 > 
 > If you've already run LocalAI before and want to start it again, use: `docker start -i local-ai`
-#### CPU only:
+#### CPU only image:
 ```bash
 docker run -ti --name local-ai -p 8080:8080 localai/localai:latest
 ```
-#### NVIDIA GPU:
+#### NVIDIA GPU Images:
 ```bash
-# CUDA 13
+# CUDA 13.0
 docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-13
-# CUDA 12
+# CUDA 12.0
 docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-12
-# NVIDIA Jetson ARM64 (CUDA 12, for AGX Orin and similar)
+# NVIDIA Jetson (L4T) ARM64
 # CUDA 12 (for Nvidia AGX Orin and similar platforms)
 docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-nvidia-l4t-arm64
-# NVIDIA Jetson ARM64 (CUDA 13, for DGX Spark)
+# CUDA 13 (for Nvidia DGX Spark)
 docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-nvidia-l4t-arm64-cuda-13
 ```
-#### AMD GPU (ROCm):
+#### AMD GPU Images (ROCm):
 ```bash
 docker run -ti --name local-ai -p 8080:8080 --device=/dev/kfd --device=/dev/dri --group-add=video localai/localai:latest-gpu-hipblas
 ```
-#### Intel GPU (oneAPI):
+#### Intel GPU Images (oneAPI):
 ```bash
 docker run -ti --name local-ai -p 8080:8080 --device=/dev/dri/card1 --device=/dev/dri/renderD128 localai/localai:latest-gpu-intel
 ```
-#### Vulkan GPU:
+#### Vulkan GPU Images:
 ```bash
 docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-gpu-vulkan
 ```
-### Loading models
+#### AIO Images (pre-downloaded models):
 ```bash
-# From the model gallery (see available models with `local-ai models list` or at https://models.localai.io)
+# CPU version
 docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-aio-cpu
 # NVIDIA CUDA 13 version
 docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-aio-gpu-nvidia-cuda-13
 # NVIDIA CUDA 12 version
 docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-aio-gpu-nvidia-cuda-12
 # Intel GPU version
 docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-aio-gpu-intel
 # AMD GPU version
 docker run -ti --name local-ai -p 8080:8080 --device=/dev/kfd --device=/dev/dri --group-add=video localai/localai:latest-aio-gpu-hipblas
 ```
 For more information about the AIO images and pre-downloaded models, see [Container Documentation](https://localai.io/basics/container/).
 To load models:
 ```bash
 # From the model gallery (see available models with `local-ai models list`, in the WebUI from the model tab, or visiting https://models.localai.io)
 local-ai run llama-3.2-1b-instruct:q4_k_m
-# From Huggingface
+# Start LocalAI with the phi-2 model directly from huggingface
 local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
-# From the Ollama OCI registry
+# Install and run a model from the Ollama OCI registry
 local-ai run ollama://gemma:2b
-# From a YAML config
+# Run a model from a configuration file
 local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
-# From a standard OCI registry (e.g., Docker Hub)
+# Install and run a model from a standard OCI registry (e.g., Docker Hub)
 local-ai run oci://localai/phi-2:latest
 ```
-> **Automatic Backend Detection**: LocalAI automatically detects your GPU capabilities and downloads the appropriate backend. For advanced options, see [GPU Acceleration](https://localai.io/features/gpu-acceleration/).
+> ⚡ **Automatic Backend Detection**: When you install models from the gallery or YAML files, LocalAI automatically detects your system's GPU capabilities (NVIDIA, AMD, Intel) and downloads the appropriate backend. For advanced configuration options, see [GPU Acceleration](https://localai.io/features/gpu-acceleration/#automatic-backend-detection).
-For more details, see the [Getting Started guide](https://localai.io/basics/getting_started/).
+For more information, see [💻 Getting started](https://localai.io/basics/getting_started/index.html). If you are interested in our roadmap items and future enhancements, see the [issues labeled as Roadmap](https://github.com/mudler/LocalAI/issues?q=is%3Aissue+is%3Aopen+label%3Aroadmap).
-## Latest News
+## 📰 Latest project news
- **April 2026**: [Voice recognition](https://github.com/mudler/LocalAI/pull/9500), [Face recognition, identification & liveness detection](https://github.com/mudler/LocalAI/pull/9480), [Ollama API compatibility](https://github.com/mudler/LocalAI/pull/9284), [Video generation in stable-diffusion.ggml](https://github.com/mudler/LocalAI/pull/9420), [Backend versioning with auto-upgrade](https://github.com/mudler/LocalAI/pull/9315), [Pin models & load-on-demand toggle](https://github.com/mudler/LocalAI/pull/9309), [Universal model importer](https://github.com/mudler/LocalAI/pull/9466), new backends: [sglang](https://github.com/mudler/LocalAI/pull/9359), [ik-llama-cpp](https://github.com/mudler/LocalAI/pull/9326), [TurboQuant](https://github.com/mudler/LocalAI/pull/9355), [sam.cpp](https://github.com/mudler/LocalAI/pull/9288), [Kokoros](https://github.com/mudler/LocalAI/pull/9212), [qwen3tts.cpp](https://github.com/mudler/LocalAI/pull/9316), [tinygrad multimodal](https://github.com/mudler/LocalAI/pull/9364)
+- December 2025: [Dynamic Memory Resource reclaimer](https://github.com/mudler/LocalAI/pull/7583), [Automatic fitting of models to multiple GPUS(llama.cpp)](https://github.com/mudler/LocalAI/pull/7584), [Added Vibevoice backend](https://github.com/mudler/LocalAI/pull/7494)
- **March 2026**: [Agent management](https://github.com/mudler/LocalAI/pull/8820), [New React UI](https://github.com/mudler/LocalAI/pull/8772), [WebRTC](https://github.com/mudler/LocalAI/pull/8790), [MLX-distributed via P2P and RDMA](https://github.com/mudler/LocalAI/pull/8801), [MCP Apps, MCP Client-side](https://github.com/mudler/LocalAI/pull/8947)
+- November 2025: Major improvements to the UX. Among these: [Import models via URL](https://github.com/mudler/LocalAI/pull/7245) and [Multiple chats and history](https://github.com/mudler/LocalAI/pull/7325)
- **February 2026**: [Realtime API for audio-to-audio with tool calling](https://github.com/mudler/LocalAI/pull/6245), [ACE-Step 1.5 support](https://github.com/mudler/LocalAI/pull/8396)
+- October 2025: 🔌 [Model Context Protocol (MCP)](https://localai.io/docs/features/mcp/) support added for agentic capabilities with external tools
- **January 2026**: **LocalAI 3.10.0** — Anthropic API support, Open Responses API, video & image generation (LTX-2), unified GPU backends, tool streaming, Moonshine, Pocket-TTS. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v3.10.0)
+- September 2025: New Launcher application for macOS and Linux, extended support to many backends for Mac and Nvidia L4T devices. Models: Added MLX-Audio, WAN 2.2. WebUI improvements, and Python-based backends now ship portable Python environments.
- **December 2025**: [Dynamic Memory Resource reclaimer](https://github.com/mudler/LocalAI/pull/7583), [Automatic multi-GPU model fitting (llama.cpp)](https://github.com/mudler/LocalAI/pull/7584), [Vibevoice backend](https://github.com/mudler/LocalAI/pull/7494)
+- August 2025: MLX, MLX-VLM, Diffusers and llama.cpp are now supported on Mac M1/M2/M3+ chips ( with `development` suffix in the gallery ): https://github.com/mudler/LocalAI/pull/6049 https://github.com/mudler/LocalAI/pull/6119 https://github.com/mudler/LocalAI/pull/6121 https://github.com/mudler/LocalAI/pull/6060
- **November 2025**: [Import models via URL](https://github.com/mudler/LocalAI/pull/7245), [Multiple chats and history](https://github.com/mudler/LocalAI/pull/7325)
+- July/August 2025: 🔍 [Object Detection](https://localai.io/features/object-detection/) added to the API featuring [rf-detr](https://github.com/roboflow/rf-detr)
- **October 2025**: [Model Context Protocol (MCP)](https://localai.io/docs/features/mcp/) support for agentic capabilities
+- July 2025: All backends migrated outside of the main binary. LocalAI is now more lightweight, small, and automatically downloads the required backend to run the model. [Read the release notes](https://github.com/mudler/LocalAI/releases/tag/v3.2.0)
- **September 2025**: New Launcher for macOS and Linux, extended backend support for Mac and Nvidia L4T, MLX-Audio, WAN 2.2
+- June 2025: [Backend management](https://github.com/mudler/LocalAI/pull/5607) has been added. Attention: extras images are going to be deprecated from the next release! Read [the backend management PR](https://github.com/mudler/LocalAI/pull/5607).
- **August 2025**: MLX, MLX-VLM, Diffusers, llama.cpp now supported on Apple Silicon
+- May 2025: [Audio input](https://github.com/mudler/LocalAI/pull/5466) and [Reranking](https://github.com/mudler/LocalAI/pull/5396) in llama.cpp backend, [Realtime API](https://github.com/mudler/LocalAI/pull/5392), and support for Gemma, SmolVLM, and more multimodal models (available in the gallery).
- **July 2025**: All backends migrated outside the main binary — [lightweight, modular architecture](https://github.com/mudler/LocalAI/releases/tag/v3.2.0)
+- May 2025: Important: image name changes [See release](https://github.com/mudler/LocalAI/releases/tag/v2.29.0)
 - Apr 2025: Rebrand, WebUI enhancements
 - Apr 2025: [LocalAGI](https://github.com/mudler/LocalAGI) and [LocalRecall](https://github.com/mudler/LocalRecall) join the LocalAI family stack.
 - Apr 2025: WebUI overhaul, AIO images updates
 - Feb 2025: Backend cleanup, Breaking changes, new backends (kokoro, OutelTTS, faster-whisper), Nvidia L4T images
 - Jan 2025: LocalAI model release: https://huggingface.co/mudler/LocalAI-functioncall-phi-4-v0.3, SANA support in diffusers: https://github.com/mudler/LocalAI/pull/4603
 - Dec 2024: stablediffusion.cpp backend (ggml) added ( https://github.com/mudler/LocalAI/pull/4289 )
 - Nov 2024: Bark.cpp backend added ( https://github.com/mudler/LocalAI/pull/4287 )
 - Nov 2024: Voice activity detection models (**VAD**) added to the API: https://github.com/mudler/LocalAI/pull/4204
 - Oct 2024: examples moved to [LocalAI-examples](https://github.com/mudler/LocalAI-examples)
 - Aug 2024:  🆕 FLUX-1, [P2P Explorer](https://explorer.localai.io)
 - July 2024: 🔥🔥 🆕 P2P Dashboard, LocalAI Federated mode and AI Swarms: https://github.com/mudler/LocalAI/pull/2723. P2P Global community pools: https://github.com/mudler/LocalAI/issues/3113
 - May 2024: 🔥🔥 Decentralized P2P llama.cpp:  https://github.com/mudler/LocalAI/pull/2343 (peer2peer llama.cpp!) 👉 Docs  https://localai.io/features/distribute/
 - May 2024: 🔥🔥 Distributed inferencing: https://github.com/mudler/LocalAI/pull/2324
 - April 2024: Reranker API: https://github.com/mudler/LocalAI/pull/2121
-For older news and full release notes, see [GitHub Releases](https://github.com/mudler/LocalAI/releases) and the [News page](https://localai.io/basics/news/).
+Roadmap items: [List of issues](https://github.com/mudler/LocalAI/issues?q=is%3Aissue+is%3Aopen+label%3Aroadmap)
-## Features
+## 🚀 [Features](https://localai.io/features/)
- [Text generation](https://localai.io/features/text-generation/) (`llama.cpp`, `transformers`, `vllm` ... [and more](https://localai.io/model-compatibility/))
+- 🧩 [Backend Gallery](https://localai.io/backends/): Install/remove backends on the fly, powered by OCI images — fully customizable and API-driven.
- [Text to Audio](https://localai.io/features/text-to-audio/)
+- 📖 [Text generation with GPTs](https://localai.io/features/text-generation/) (`llama.cpp`, `transformers`, `vllm` ... [:book: and more](https://localai.io/model-compatibility/index.html#model-compatibility-table))
- [Audio to Text](https://localai.io/features/audio-to-text/)
+- 🗣 [Text to Audio](https://localai.io/features/text-to-audio/)
- [Image generation](https://localai.io/features/image-generation)
+- 🔈 [Audio to Text](https://localai.io/features/audio-to-text/) (Audio transcription with `whisper.cpp`)
- [OpenAI-compatible tools API](https://localai.io/features/openai-functions/)
+- 🎨 [Image generation](https://localai.io/features/image-generation)
- [Realtime API](https://localai.io/features/openai-realtime/) (Speech-to-speech)
+- 🔥 [OpenAI-alike tools API](https://localai.io/features/openai-functions/) 
- [Embeddings generation](https://localai.io/features/embeddings/)
+- ⚡ [Realtime API](https://localai.io/features/openai-realtime/) (Speech-to-speech) 
- [Constrained grammars](https://localai.io/features/constrained_grammars/)
+- 🧠 [Embeddings generation for vector databases](https://localai.io/features/embeddings/)
- [Download models from Huggingface](https://localai.io/models/)
+- ✍️ [Constrained grammars](https://localai.io/features/constrained_grammars/)
- [Vision API](https://localai.io/features/gpt-vision/)
+- 🖼️ [Download Models directly from Huggingface ](https://localai.io/models/)
- [Object Detection](https://localai.io/features/object-detection/)
+- 🥽 [Vision API](https://localai.io/features/gpt-vision/)
- [Reranker API](https://localai.io/features/reranker/)
+- 🔍 [Object Detection](https://localai.io/features/object-detection/)
- [P2P Inferencing](https://localai.io/features/distribute/)
+- 📈 [Reranker API](https://localai.io/features/reranker/)
- [Distributed Mode](https://localai.io/features/distributed-mode/) — Horizontal scaling with PostgreSQL + NATS
+- 🆕🖧 [P2P Inferencing](https://localai.io/features/distribute/)
- [Model Context Protocol (MCP)](https://localai.io/docs/features/mcp/)
+- 🆕🔌 [Model Context Protocol (MCP)](https://localai.io/docs/features/mcp/) - Agentic capabilities with external tools and [LocalAGI's Agentic capabilities](https://github.com/mudler/LocalAGI)
- [Built-in Agents](https://localai.io/features/agents/) — Autonomous AI agents with tool use, RAG, skills, SSE streaming, and [Agent Hub](https://agenthub.localai.io)
+- 🔊 Voice activity detection (Silero-VAD support)
- [Backend Gallery](https://localai.io/backends/) — Install/remove backends on the fly via OCI images
+- 🌍 Integrated WebUI!
 - Voice Activity Detection (Silero-VAD)
 - Integrated WebUI
-## Supported Backends & Acceleration
+## 🧩 Supported Backends & Acceleration
-LocalAI supports **36+ backends** including llama.cpp, vLLM, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).
+LocalAI supports a comprehensive range of AI backends with multiple acceleration options:
-See the full [Backend & Model Compatibility Table](https://localai.io/model-compatibility/) and [GPU Acceleration guide](https://localai.io/features/gpu-acceleration/).
+### Text Generation & Language Models
 | Backend | Description | Acceleration Support |
 |---------|-------------|---------------------|
 | **llama.cpp** | LLM inference in C/C++ | CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, CPU |
 | **vLLM** | Fast LLM inference with PagedAttention | CUDA 12/13, ROCm, Intel |
 | **transformers** | HuggingFace transformers framework | CUDA 12/13, ROCm, Intel, CPU |
 | **MLX** | Apple Silicon LLM inference | Metal (M1/M2/M3+) |
 | **MLX-VLM** | Apple Silicon Vision-Language Models | Metal (M1/M2/M3+) |
-## Resources
+### Audio & Speech Processing
 | Backend | Description | Acceleration Support |
 |---------|-------------|---------------------|
 | **whisper.cpp** | OpenAI Whisper in C/C++ | CUDA 12/13, ROCm, Intel SYCL, Vulkan, CPU |
 | **faster-whisper** | Fast Whisper with CTranslate2 | CUDA 12/13, ROCm, Intel, CPU |
 | **coqui** | Advanced TTS with 1100+ languages | CUDA 12/13, ROCm, Intel, CPU |
 | **kokoro** | Lightweight TTS model | CUDA 12/13, ROCm, Intel, CPU |
 | **chatterbox** | Production-grade TTS | CUDA 12/13, CPU |
 | **piper** | Fast neural TTS system | CPU |
 | **kitten-tts** | Kitten TTS models | CPU |
 | **silero-vad** | Voice Activity Detection | CPU |
 | **neutts** | Text-to-speech with voice cloning | CUDA 12/13, ROCm, CPU |
 | **vibevoice** | Real-time TTS with voice cloning | CUDA 12/13, ROCm, Intel, CPU |
 | **pocket-tts** | Lightweight CPU-based TTS | CUDA 12/13, ROCm, Intel, CPU |
 | **qwen-tts** | High-quality TTS with custom voice, voice design, and voice cloning | CUDA 12/13, ROCm, Intel, CPU |
- [Documentation](https://localai.io/)
+### Image & Video Generation
- [LLM fine-tuning guide](https://localai.io/docs/advanced/fine-tuning/)
+| Backend | Description | Acceleration Support |
- [Build from source](https://localai.io/basics/build/)
+|---------|-------------|---------------------|
- [Kubernetes installation](https://localai.io/basics/getting_started/#run-localai-in-kubernetes)
+| **stablediffusion.cpp** | Stable Diffusion in C/C++ | CUDA 12/13, Intel SYCL, Vulkan, CPU |
- [Integrations & community projects](https://localai.io/docs/integrations/)
+| **diffusers** | HuggingFace diffusion models | CUDA 12/13, ROCm, Intel, Metal, CPU |
 - [Installation video walkthrough](https://www.youtube.com/watch?v=cMVNnlqwfw4)
 - [Media & blog posts](https://localai.io/basics/news/#media-blogs-social)
 - [Examples](https://github.com/mudler/LocalAI-examples)
-## Team
+### Specialized AI Tasks
 | Backend | Description | Acceleration Support |
 |---------|-------------|---------------------|
 | **rfdetr** | Real-time object detection | CUDA 12/13, Intel, CPU |
 | **rerankers** | Document reranking API | CUDA 12/13, ROCm, Intel, CPU |
 | **local-store** | Vector database | CPU |
 | **huggingface** | HuggingFace API integration | API-based |
-LocalAI is maintained by a small team of humans, together with the wider community of contributors.
+### Hardware Acceleration Matrix
- **[Ettore Di Giacinto](https://github.com/mudler)** — original author and project lead
+| Acceleration Type | Supported Backends | Hardware Support |
- **[Richard Palethorpe](https://github.com/richiejp)** — maintainer
+|-------------------|-------------------|------------------|
 | **NVIDIA CUDA 12** | All CUDA-compatible backends | Nvidia hardware |
 | **NVIDIA CUDA 13** | All CUDA-compatible backends | Nvidia hardware |
 | **AMD ROCm** | llama.cpp, whisper, vllm, transformers, diffusers, rerankers, coqui, kokoro, neutts, vibevoice, pocket-tts, qwen-tts | AMD Graphics |
 | **Intel oneAPI** | llama.cpp, whisper, stablediffusion, vllm, transformers, diffusers, rfdetr, rerankers, coqui, kokoro, vibevoice, pocket-tts, qwen-tts | Intel Arc, Intel iGPUs |
 | **Apple Metal** | llama.cpp, whisper, diffusers, MLX, MLX-VLM | Apple M1/M2/M3+ |
 | **Vulkan** | llama.cpp, whisper, stablediffusion | Cross-platform GPUs |
 | **NVIDIA Jetson (CUDA 12)** | llama.cpp, whisper, stablediffusion, diffusers, rfdetr | ARM64 embedded AI (AGX Orin, etc.) |
 | **NVIDIA Jetson (CUDA 13)** | llama.cpp, whisper, stablediffusion, diffusers, rfdetr | ARM64 embedded AI (DGX Spark) |
 | **CPU Optimized** | All backends | AVX/AVX2/AVX512, quantization support |
-A huge thank you to everyone who contributes code, reviews PRs, files issues, and helps users in [Discord](https://discord.gg/uJAeKSAGDy) — LocalAI is a community-driven project and wouldn't exist without you. See the full [contributors list](https://github.com/mudler/LocalAI/graphs/contributors).
+### 🔗 Community and integrations
 Build and deploy custom containers:
 - https://github.com/sozercan/aikit
 WebUIs:
 - https://github.com/Jirubizu/localai-admin
 - https://github.com/go-skynet/LocalAI-frontend
 - QA-Pilot (an interactive chat project that leverages LocalAI LLMs for rapid understanding and navigation of GitHub code repositories) https://github.com/reid41/QA-Pilot
 Agentic Libraries:
 - https://github.com/mudler/cogito
 MCPs:
 - https://github.com/mudler/MCPs
 OS Assistant:
 - https://github.com/mudler/Keygeist - Keygeist is an AI-powered keyboard operator that listens for key combinations and responds with AI-generated text typed directly into your Linux box.
 Model galleries
 - https://github.com/go-skynet/model-gallery
 Voice:
 - https://github.com/richiejp/VoxInput
 Other:
 - Helm chart https://github.com/go-skynet/helm-charts
 - VSCode extension https://github.com/badgooooor/localai-vscode-plugin
 - Langchain: https://python.langchain.com/docs/integrations/providers/localai/
 - Terminal utility https://github.com/djcopley/ShellOracle
 - Local Smart assistant https://github.com/mudler/LocalAGI
 - Home Assistant https://github.com/sammcj/homeassistant-localai / https://github.com/drndos/hass-openai-custom-conversation / https://github.com/valentinfrlch/ha-gpt4vision
 - Discord bot https://github.com/mudler/LocalAGI/tree/main/examples/discord
 - Slack bot https://github.com/mudler/LocalAGI/tree/main/examples/slack
 - Shell-Pilot (interact with LLMs using LocalAI models via pure shell scripts on your Linux or macOS system) https://github.com/reid41/shell-pilot
 - Telegram bot https://github.com/mudler/LocalAI/tree/master/examples/telegram-bot
 - Another Telegram Bot https://github.com/JackBekket/Hellper
 - Auto-documentation https://github.com/JackBekket/Reflexia
 - GitHub bot that answers issues, using code and documentation as context https://github.com/JackBekket/GitHelper
 - Github Actions: https://github.com/marketplace/actions/start-localai
 - Examples: https://github.com/mudler/LocalAI/tree/master/examples/
 ### 🔗 Resources
 - [LLM finetuning guide](https://localai.io/docs/advanced/fine-tuning/)
 - [How to build locally](https://localai.io/basics/build/index.html)
 - [How to install in Kubernetes](https://localai.io/basics/getting_started/index.html#run-localai-in-kubernetes)
 - [Projects integrating LocalAI](https://localai.io/docs/integrations/)
 - [How tos section](https://io.midori-ai.xyz/howtos/) (curated by our community)
 ## :book: 🎥 [Media, Blogs, Social](https://localai.io/basics/news/#media-blogs-social)
 - [Run Visual studio code with LocalAI (SUSE)](https://www.suse.com/c/running-ai-locally/)
 - 🆕 [Run LocalAI on Jetson Nano Devkit](https://mudler.pm/posts/local-ai-jetson-nano-devkit/)
 - [Run LocalAI on AWS EKS with Pulumi](https://www.pulumi.com/blog/low-code-llm-apps-with-local-ai-flowise-and-pulumi/)
 - [Run LocalAI on AWS](https://staleks.hashnode.dev/installing-localai-on-aws-ec2-instance)
 - [Create a slackbot for teams and OSS projects that answer to documentation](https://mudler.pm/posts/smart-slackbot-for-teams/)
 - [LocalAI meets k8sgpt](https://www.youtube.com/watch?v=PKrDNuJ_dfE)
 - [Question Answering on Documents locally with LangChain, LocalAI, Chroma, and GPT4All](https://mudler.pm/posts/localai-question-answering/)
 - [Tutorial to use k8sgpt with LocalAI](https://medium.com/@tyler_97636/k8sgpt-localai-unlock-kubernetes-superpowers-for-free-584790de9b65)
 ## Citation
@@ -224,7 +385,7 @@ If you utilize this repository, data in a downstream project, please consider ci
  howpublished = {\url{https://github.com/go-skynet/LocalAI}},
 ```
-## Sponsors
+## ❤️ Sponsors
 > Do you find LocalAI useful?
@@ -243,19 +404,19 @@ A huge thank you to our generous sponsors who support this project covering CI e
 ### Individual sponsors
-A special thanks to individual sponsors, a full list is on [GitHub](https://github.com/sponsors/mudler) and [buymeacoffee](https://buymeacoffee.com/mudler). Special shout out to [drikster80](https://github.com/drikster80) for being generous. Thank you everyone!
+A special thanks to the individual sponsors who contributed to the project. A full list is on [GitHub](https://github.com/sponsors/mudler) and [buymeacoffee](https://buymeacoffee.com/mudler). A special shout-out goes to [drikster80](https://github.com/drikster80) for being generous. Thank you everyone!
-## Star history
+## 🌟 Star history
 [![LocalAI Star history Chart](https://api.star-history.com/svg?repos=go-skynet/LocalAI&type=Date)](https://star-history.com/#go-skynet/LocalAI&Date)
-## License
+## 📖 License
-LocalAI is a community-driven project created by [Ettore Di Giacinto](https://github.com/mudler/) and maintained by the [LocalAI team](#team).
+LocalAI is a community-driven project created by [Ettore Di Giacinto](https://github.com/mudler/).
 MIT - Author Ettore Di Giacinto <mudler@localai.io>
-## Acknowledgements
+## 🙇 Acknowledgements
 LocalAI couldn't have been built without the help of great software already available from the community. Thank you!
@@ -266,11 +427,10 @@ LocalAI couldn't have been built without the help of great software already avai
 - https://github.com/EdVince/Stable-Diffusion-NCNN
 - https://github.com/ggerganov/whisper.cpp
 - https://github.com/rhasspy/piper
 - [exo](https://github.com/exo-explore/exo) for the MLX distributed auto-parallel sharding implementation
-## Contributors
+## 🤗 Contributors
-This is a community project, a special thanks to our contributors!
+This is a community project, a special thanks to our contributors! 🤗
 <a href="https://github.com/go-skynet/LocalAI/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=go-skynet/LocalAI" />
 </a>
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -8,24 +8,10 @@ At LocalAI, we take the security of our software seriously. We understand the im
 We provide support and updates for certain versions of our software. The following table outlines which versions are currently supported with security updates:
-| Version Series | Support Level | Details |
+| Version | Supported          |
-| -------------- | ------------- | ------- |
+| ------- | ------------------ |
-| 3.x | :white_check_mark: Actively supported | Full security updates and bug fixes for the latest minor versions. |
+| > 2.0   | :white_check_mark: |
-| 2.x | :warning: Security fixes only | Critical security patches only, until **December 31, 2025**. |
+| < 2.0   | :x:                |
 | 1.x | :x: End-of-life (EOL) | No longer supported as of **January 1, 2024**. No security fixes will be provided. |
 ### What each support level means
 - **Actively supported (3.x):** Receives all security updates, bug fixes, and new features. Users should stay on the latest 3.x minor release for the best protection.
 - **Security fixes only (2.x):** Receives only critical security patches (e.g., remote code execution, authentication bypass, data exposure). No bug fixes or new features. Support ends December 31, 2025.
 - **End-of-life (1.x):** No updates of any kind. Users on 1.x are strongly encouraged to upgrade immediately, as known vulnerabilities will not be patched.
 ### Migrating from older versions
 If you are running an unsupported or soon-to-be-unsupported version, we recommend upgrading as soon as possible:
 - **From 1.x to 3.x:** Version 1.x reached end-of-life on January 1, 2024. Review the [release notes](https://github.com/mudler/LocalAI/releases) for breaking changes across major versions, and upgrade directly to the latest 3.x release.
 - **From 2.x to 3.x:** While 2.x still receives critical security patches until December 31, 2025, we recommend planning your migration to 3.x to benefit from ongoing improvements and full support.
 Please ensure that you are using a supported version to receive the latest security updates.
--- a/aio/cpu/README.md
+++ b/aio/cpu/README.md
@@ -0,0 +1,5 @@
 ## AIO CPU size
 Use this image for CPU-only setups.
 Please keep using only C++ backends so the base image is as small as possible (without CUDA, cuDNN, python, etc).
--- a/aio/cpu/embeddings.yaml
+++ b/aio/cpu/embeddings.yaml
@@ -0,0 +1,13 @@
 embeddings: true
 name: text-embedding-ada-002
 backend: llama-cpp
 parameters:
  model: huggingface://bartowski/granite-embedding-107m-multilingual-GGUF/granite-embedding-107m-multilingual-f16.gguf
 usage: |
    You can test this model with curl like this:
    curl http://localhost:8080/embeddings -X POST -H "Content-Type: application/json" -d '{
      "input": "Your text string goes here",
      "model": "text-embedding-ada-002"
    }'
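The usage text above can be wrapped in a small shell helper. This is a minimal sketch, not part of the repository: `embedding_payload`, `embed`, and the `LOCALAI_URL` variable are hypothetical names, and the payload builder inserts the text verbatim, so it assumes the input contains no double quotes or backslashes.

```shell
# Hypothetical helper: build the JSON body shown in the usage example.
# Assumption: the input text contains no double quotes or backslashes,
# because it is inserted verbatim without JSON escaping.
embedding_payload() {
    printf '{"input": "%s", "model": "%s"}\n' "$1" "${2:-text-embedding-ada-002}"
}

# Hypothetical wrapper around the curl call from the usage text.
# LOCALAI_URL is an assumed variable name; it defaults to the address used above.
embed() {
    curl -s "${LOCALAI_URL:-http://localhost:8080}/embeddings" \
        -X POST -H "Content-Type: application/json" \
        -d "$(embedding_payload "$1" "$2")"
}

# Print the request body without contacting the server.
embedding_payload "Your text string goes here"
```

With a LocalAI instance running, `embed "Your text string goes here"` would send the same request as the curl command in the usage text.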
--- a/tests/e2e-aio/models/image-gen.yaml
+++ b/tests/e2e-aio/models/image-gen.yaml
@@ -12,3 +12,12 @@ download_files:
 - filename: "stable-diffusion-v1-5-pruned-emaonly-Q4_0.gguf"
  sha256: "b8944e9fe0b69b36ae1b5bb0185b3a7b8ef14347fe0fa9af6c64c4829022261f"
  uri: "huggingface://second-state/stable-diffusion-v1-5-GGUF/stable-diffusion-v1-5-pruned-emaonly-Q4_0.gguf"
 usage: |
        curl http://localhost:8080/v1/images/generations \
          -H "Content-Type: application/json" \
          -d '{
            "prompt": "<positive prompt>|<negative prompt>",
            "step": 25,
            "size": "512x512"
          }'
--- a/aio/cpu/rerank.yaml
+++ b/aio/cpu/rerank.yaml
@@ -0,0 +1,33 @@
 name: jina-reranker-v1-base-en
 reranking: true
 f16: true
 parameters:
  model: jina-reranker-v1-tiny-en.f16.gguf
 backend: llama-cpp
 download_files:
  - filename: jina-reranker-v1-tiny-en.f16.gguf
    sha256: 5f696cf0d0f3d347c4a279eee8270e5918554cdac0ed1f632f2619e4e8341407
    uri: huggingface://mradermacher/jina-reranker-v1-tiny-en-GGUF/jina-reranker-v1-tiny-en.f16.gguf 
 usage: |
    You can test this model with curl like this:
    curl http://localhost:8080/v1/rerank \
      -H "Content-Type: application/json" \
      -d '{
      "model": "jina-reranker-v1-base-en",
      "query": "Organic skincare products for sensitive skin",
      "documents": [
        "Eco-friendly kitchenware for modern homes",
        "Biodegradable cleaning supplies for eco-conscious consumers",
        "Organic cotton baby clothes for sensitive skin",
        "Natural organic skincare range for sensitive skin",
        "Tech gadgets for smart homes: 2024 edition",
        "Sustainable gardening tools and compost solutions",
        "Sensitive skin-friendly facial cleansers and toners",
        "Organic food wraps and storage solutions",
        "All-natural pet food for dogs with allergies",
        "Yoga mats made from recycled materials"
      ],
      "top_n": 3
    }'
--- a/aio/cpu/speech-to-text.yaml
+++ b/aio/cpu/speech-to-text.yaml
@@ -0,0 +1,18 @@
 name: whisper-1
 backend: whisper
 parameters:
  model: ggml-whisper-base.bin
 usage: |
    ## example audio file
    wget --quiet --show-progress -O gb1.ogg https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg
    ## Send the example audio file to the transcriptions endpoint
    curl http://localhost:8080/v1/audio/transcriptions \
         -H "Content-Type: multipart/form-data" \
         -F file="@$PWD/gb1.ogg" -F model="whisper-1"
 download_files:
 - filename: "ggml-whisper-base.bin"
  sha256: "60ed5bc3dd14eea856493d334349b405782ddcaf0028d4b5df4088345fba2efe"
  uri: "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin"
--- a/aio/cpu/text-to-speech.yaml
+++ b/aio/cpu/text-to-speech.yaml
@@ -0,0 +1,15 @@
 name: tts-1
 download_files:
  - filename: voice-en-us-amy-low.tar.gz
    uri: https://github.com/rhasspy/piper/releases/download/v0.0.2/voice-en-us-amy-low.tar.gz
 backend: piper
 parameters:
  model: en-us-amy-low.onnx
 usage: |
    To test if this model works as expected, you can use the following curl command:
    curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
      "model":"tts-1",
      "input": "Hi, this is a test."
    }'
--- a/tests/e2e-aio/models/text-to-text.yaml
+++ b/tests/e2e-aio/models/text-to-text.yaml
@@ -55,4 +55,4 @@ template:
 download_files:
 - filename: Hermes-3-Llama-3.2-3B-Q4_K_M.gguf
  sha256: 2e220a14ba4328fee38cf36c2c068261560f999fadb5725ce5c6d977cb5126b5
-  uri: huggingface://bartowski/Hermes-3-Llama-3.2-3B-GGUF/Hermes-3-Llama-3.2-3B-Q4_K_M.gguf
+  uri: huggingface://bartowski/Hermes-3-Llama-3.2-3B-GGUF/Hermes-3-Llama-3.2-3B-Q4_K_M.gguf
--- a/tests/e2e-aio/models/vad.yaml
+++ b/tests/e2e-aio/models/vad.yaml
@@ -1,8 +1,8 @@
-backend: silero-vad
+backend: silero-vad
-name: silero-vad
+name: silero-vad
-parameters:
+parameters:
-  model: silero-vad.onnx
+  model: silero-vad.onnx
-download_files:
+download_files:
- filename: silero-vad.onnx
+- filename: silero-vad.onnx
-  uri: https://huggingface.co/onnx-community/silero-vad/resolve/main/onnx/model.onnx
+  uri: https://huggingface.co/onnx-community/silero-vad/resolve/main/onnx/model.onnx
-  sha256: a4a068cd6cf1ea8355b84327595838ca748ec29a25bc91fc82e6c299ccdc5808
+  sha256: a4a068cd6cf1ea8355b84327595838ca748ec29a25bc91fc82e6c299ccdc5808
--- a/tests/e2e-aio/models/vision.yaml
+++ b/tests/e2e-aio/models/vision.yaml
@@ -47,4 +47,4 @@ download_files:
  uri: huggingface://openbmb/MiniCPM-V-4_5-gguf/ggml-model-Q4_K_M.gguf
 - filename: minicpm-v-4_5-mmproj-f16.gguf
  uri: huggingface://openbmb/MiniCPM-V-4_5-gguf/mmproj-model-f16.gguf
-  sha256: 7a7225a32e8d453aaa3d22d8c579b5bf833c253f784cdb05c99c9a76fd616df8
+  sha256: 7a7225a32e8d453aaa3d22d8c579b5bf833c253f784cdb05c99c9a76fd616df8
--- a/aio/entrypoint.sh
+++ b/aio/entrypoint.sh
@@ -0,0 +1,138 @@
 #!/bin/bash
 echo "===> LocalAI All-in-One (AIO) container starting..."
 GPU_ACCELERATION=false
 GPU_VENDOR=""
 function check_intel() {
    if lspci | grep -E 'VGA|3D' | grep -iq intel; then
        echo "Intel GPU detected"
        if [ -d /opt/intel ]; then
            GPU_ACCELERATION=true
            GPU_VENDOR=intel
        else
            echo "Intel GPU detected, but Intel GPU drivers are not installed. GPU acceleration will not be available."
        fi
    fi
 }
 function check_nvidia_wsl() {
    if lspci | grep -E 'VGA|3D' | grep -iq "Microsoft Corporation Device 008e"; then
         # We make the assumption this WSL2 card is NVIDIA, then check for nvidia-smi
        # Make sure the container was run with `--gpus all` as the only required parameter
        echo "NVIDIA GPU detected via WSL2"
        # nvidia-smi should be installed in the container
        if nvidia-smi; then
            GPU_ACCELERATION=true
            GPU_VENDOR=nvidia
        else
            echo "NVIDIA GPU detected via WSL2, but nvidia-smi is not installed. GPU acceleration will not be available."
        fi
    fi
 }
 function check_amd() {
    if lspci | grep -E 'VGA|3D' | grep -iq amd; then
        echo "AMD GPU detected"
        # Check if ROCm is installed
        if [ -d /opt/rocm ]; then
            GPU_ACCELERATION=true
            GPU_VENDOR=amd
        else
            echo "AMD GPU detected, but ROCm is not installed. GPU acceleration will not be available."
        fi
    fi
 }
 function check_nvidia() {
    if lspci | grep -E 'VGA|3D' | grep -iq nvidia; then
        echo "NVIDIA GPU detected"
        # nvidia-smi should be installed in the container
        if nvidia-smi; then
            GPU_ACCELERATION=true
            GPU_VENDOR=nvidia
        else
            echo "NVIDIA GPU detected, but nvidia-smi is not installed. GPU acceleration will not be available."
        fi
    fi
 }
 function check_metal() {
    if system_profiler SPDisplaysDataType | grep -iq 'Metal'; then
        echo "Apple Metal supported GPU detected"
        GPU_ACCELERATION=true
        GPU_VENDOR=apple
    fi
 }
 function detect_gpu() {
    case "$(uname -s)" in
        Linux)
            check_nvidia
            check_amd
            check_intel
            check_nvidia_wsl
            ;;
        Darwin)
            check_metal
            ;;
    esac
 }
 function detect_gpu_size() {
    # Attempting to find GPU memory size for NVIDIA GPUs
    if [ "$GPU_ACCELERATION" = true ] && [ "$GPU_VENDOR" = "nvidia" ]; then
        echo "NVIDIA GPU detected. Attempting to find memory size..."
        # Using head -n 1 to get the total memory of the 1st NVIDIA GPU detected.
        # If handling multiple GPUs is required in the future, this is the place to do it
        nvidia_sm=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
        if [ -n "$nvidia_sm" ]; then
            echo "Total GPU Memory: $nvidia_sm MiB"
            # if bigger than 8GB, use 16GB
            #if [ "$nvidia_sm" -gt 8192 ]; then
            #    GPU_SIZE=gpu-16g
            #else
            GPU_SIZE=gpu-8g
            #fi
        else
            echo "Unable to determine NVIDIA GPU memory size. Defaulting to the gpu-8g profile."
            GPU_SIZE=gpu-8g
        fi
    elif [ "$GPU_ACCELERATION" = true ] && [ "$GPU_VENDOR" = "intel" ]; then
        GPU_SIZE=intel
    # Default to a generic GPU size until we implement GPU size detection for non NVIDIA GPUs
    elif [ "$GPU_ACCELERATION" = true ]; then
        echo "Non-NVIDIA GPU detected. Specific GPU memory size detection is not implemented."
        GPU_SIZE=gpu-8g
    # Default to cpu when GPU acceleration is not available
    else
        echo "GPU acceleration is not enabled or supported. Defaulting to CPU."
        GPU_SIZE=cpu
    fi
 }
 function check_vars() {
    if [ -z "$MODELS" ]; then
        echo "MODELS environment variable is not set. Please set it to a comma-separated list of model YAML files to load."
        exit 1
    fi
    if [ -z "$PROFILE" ]; then
        echo "PROFILE environment variable is not set. Please set it to one of the following: cpu, gpu-8g, gpu-16g, intel, apple"
        exit 1
    fi
 }
 detect_gpu
 detect_gpu_size
 PROFILE="${PROFILE:-$GPU_SIZE}" # default to the detected GPU_SIZE (cpu when no GPU is found)
 export MODELS="${MODELS:-/aio/${PROFILE}/embeddings.yaml,/aio/${PROFILE}/rerank.yaml,/aio/${PROFILE}/text-to-speech.yaml,/aio/${PROFILE}/image-gen.yaml,/aio/${PROFILE}/text-to-text.yaml,/aio/${PROFILE}/speech-to-text.yaml,/aio/${PROFILE}/vad.yaml,/aio/${PROFILE}/vision.yaml}"
 check_vars
 echo "===> Starting LocalAI[$PROFILE] with the following models: $MODELS"
 exec /entrypoint.sh "$@"
--- a/aio/gpu-8g/embeddings.yaml
+++ b/aio/gpu-8g/embeddings.yaml
@@ -0,0 +1,13 @@
 embeddings: true
 name: text-embedding-ada-002
 backend: llama-cpp
 parameters:
  model: huggingface://bartowski/granite-embedding-107m-multilingual-GGUF/granite-embedding-107m-multilingual-f16.gguf
 usage: |
    You can test this model with curl like this:
    curl http://localhost:8080/embeddings -X POST -H "Content-Type: application/json" -d '{
      "input": "Your text string goes here",
      "model": "text-embedding-ada-002"
    }'
--- a/aio/gpu-8g/image-gen.yaml
+++ b/aio/gpu-8g/image-gen.yaml
@@ -0,0 +1,25 @@
 name: stablediffusion
 parameters:
  model: DreamShaper_8_pruned.safetensors
 backend: diffusers
 step: 25
 f16: true
 diffusers:
  pipeline_type: StableDiffusionPipeline
  cuda: true
  enable_parameters: "negative_prompt,num_inference_steps"
  scheduler_type: "k_dpmpp_2m"
 download_files:
 - filename: DreamShaper_8_pruned.safetensors
  uri: huggingface://Lykon/DreamShaper/DreamShaper_8_pruned.safetensors
 usage: |
        curl http://localhost:8080/v1/images/generations \
          -H "Content-Type: application/json" \
          -d '{
            "prompt": "<positive prompt>|<negative prompt>",
            "step": 25,
            "size": "512x512"
          }'
--- a/aio/gpu-8g/rerank.yaml
+++ b/aio/gpu-8g/rerank.yaml
@@ -0,0 +1,33 @@
 name: jina-reranker-v1-base-en
 reranking: true
 f16: true
 parameters:
  model: jina-reranker-v1-tiny-en.f16.gguf
 backend: llama-cpp
 download_files:
  - filename: jina-reranker-v1-tiny-en.f16.gguf
    sha256: 5f696cf0d0f3d347c4a279eee8270e5918554cdac0ed1f632f2619e4e8341407
    uri: huggingface://mradermacher/jina-reranker-v1-tiny-en-GGUF/jina-reranker-v1-tiny-en.f16.gguf
 usage: |
    You can test this model with curl like this:
    curl http://localhost:8080/v1/rerank \
      -H "Content-Type: application/json" \
      -d '{
      "model": "jina-reranker-v1-base-en",
      "query": "Organic skincare products for sensitive skin",
      "documents": [
        "Eco-friendly kitchenware for modern homes",
        "Biodegradable cleaning supplies for eco-conscious consumers",
        "Organic cotton baby clothes for sensitive skin",
        "Natural organic skincare range for sensitive skin",
        "Tech gadgets for smart homes: 2024 edition",
        "Sustainable gardening tools and compost solutions",
        "Sensitive skin-friendly facial cleansers and toners",
        "Organic food wraps and storage solutions",
        "All-natural pet food for dogs with allergies",
        "Yoga mats made from recycled materials"
      ],
      "top_n": 3
    }'
--- a/aio/gpu-8g/speech-to-text.yaml
+++ b/aio/gpu-8g/speech-to-text.yaml
@@ -0,0 +1,18 @@
 name: whisper-1
 backend: whisper
 parameters:
  model: ggml-whisper-base.bin
 usage: |
    ## example audio file
    wget --quiet --show-progress -O gb1.ogg https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg
    ## Send the example audio file to the transcriptions endpoint
    curl http://localhost:8080/v1/audio/transcriptions \
         -H "Content-Type: multipart/form-data" \
         -F file="@$PWD/gb1.ogg" -F model="whisper-1"
 download_files:
 - filename: "ggml-whisper-base.bin"
  sha256: "60ed5bc3dd14eea856493d334349b405782ddcaf0028d4b5df4088345fba2efe"
  uri: "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin"
--- a/aio/gpu-8g/text-to-speech.yaml
+++ b/aio/gpu-8g/text-to-speech.yaml
@@ -0,0 +1,15 @@
 name: tts-1
 download_files:
  - filename: voice-en-us-amy-low.tar.gz
    uri: https://github.com/rhasspy/piper/releases/download/v0.0.2/voice-en-us-amy-low.tar.gz
 backend: piper
 parameters:
  model: en-us-amy-low.onnx
 usage: |
    To test if this model works as expected, you can use the following curl command:
    curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
      "model":"tts-1",
      "input": "Hi, this is a test."
    }'
--- a/aio/gpu-8g/text-to-text.yaml
+++ b/aio/gpu-8g/text-to-text.yaml
@@ -0,0 +1,54 @@
 context_size: 4096
 f16: true
 backend: llama-cpp
 function:
  capture_llm_results:
  - (?s)<Thought>(.*?)</Thought>
  grammar:
    properties_order: name,arguments
  json_regex_match:
  - (?s)<Output>(.*?)</Output>
  replace_llm_results:
  - key: (?s)<Thought>(.*?)</Thought>
    value: ""
 mmap: true
 name: gpt-4
 parameters:
  model: localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf
 stopwords:
 - <|im_end|>
 - <dummy32000>
 - </s>
 template:
  chat: |
    {{.Input -}}
    <|im_start|>assistant
  chat_message: |
    <|im_start|>{{ .RoleName }}
    {{ if .FunctionCall -}}
    Function call:
    {{ else if eq .RoleName "tool" -}}
    Function response:
    {{ end -}}
    {{ if .Content -}}
    {{.Content }}
    {{ end -}}
    {{ if .FunctionCall -}}
    {{toJson .FunctionCall}}
    {{ end -}}<|im_end|>
  completion: |
    {{.Input}}
  function: |
    <|im_start|>system
    You are an AI assistant that executes function calls, and these are the tools at your disposal:
    {{range .Functions}}
    {'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
    {{end}}
    <|im_end|>
    {{.Input -}}
    <|im_start|>assistant
 download_files:
 - filename: localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf
  sha256: 4e7b7fe1d54b881f1ef90799219dc6cc285d29db24f559c8998d1addb35713d4
  uri: huggingface://mudler/LocalAI-functioncall-qwen2.5-7b-v0.5-Q4_K_M-GGUF/localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf
--- a/aio/gpu-8g/vad.yaml
+++ b/aio/gpu-8g/vad.yaml
@@ -0,0 +1,8 @@
 backend: silero-vad
 name: silero-vad
 parameters:
  model: silero-vad.onnx
 download_files:
 - filename: silero-vad.onnx
  uri: https://huggingface.co/onnx-community/silero-vad/resolve/main/onnx/model.onnx
  sha256: a4a068cd6cf1ea8355b84327595838ca748ec29a25bc91fc82e6c299ccdc5808
--- a/aio/gpu-8g/vision.yaml
+++ b/aio/gpu-8g/vision.yaml
@@ -0,0 +1,50 @@
 context_size: 4096
 backend: llama-cpp
 f16: true
 mmap: true
 mmproj: minicpm-v-4_5-mmproj-f16.gguf
 name: gpt-4o
 parameters:
  model: minicpm-v-4_5-Q4_K_M.gguf
 stopwords:
 - <|im_end|>
 - <dummy32000>
 - </s>
 - <|endoftext|>
 template:
  chat: |
    {{.Input -}}
    <|im_start|>assistant
  chat_message: |
    <|im_start|>{{ .RoleName }}
    {{ if .FunctionCall -}}
    Function call:
    {{ else if eq .RoleName "tool" -}}
    Function response:
    {{ end -}}
    {{ if .Content -}}
    {{.Content }}
    {{ end -}}
    {{ if .FunctionCall -}}
    {{toJson .FunctionCall}}
    {{ end -}}<|im_end|>
  completion: |
    {{.Input}}
  function: |
    <|im_start|>system
    You are a function calling AI model. You are provided with functions to execute. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
    {{range .Functions}}
    {'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
    {{end}}
    For each function call return a json object with function name and arguments
    <|im_end|>
    {{.Input -}}
    <|im_start|>assistant
 download_files:
 - filename: minicpm-v-4_5-Q4_K_M.gguf
  sha256: c1c3c33100b15b4caf7319acce4e23c0eb0ce1cbd12f70e8d24f05aa67b7512f
  uri: huggingface://openbmb/MiniCPM-V-4_5-gguf/ggml-model-Q4_K_M.gguf
 - filename: minicpm-v-4_5-mmproj-f16.gguf
  uri: huggingface://openbmb/MiniCPM-V-4_5-gguf/mmproj-model-f16.gguf
  sha256: 7a7225a32e8d453aaa3d22d8c579b5bf833c253f784cdb05c99c9a76fd616df8