Commit Graph

67 Commits

Author SHA1 Message Date
Ettore Di Giacinto
20baec77ab feat(face-recognition): add insightface/onnx backend for 1:1 verify, 1:N identify, embedding, detection, analysis (#9480)
* feat(face-recognition): add insightface backend for 1:1 verify, 1:N identify, embedding, detection, analysis

Adds face recognition as a new first-class capability in LocalAI via the
`insightface` Python backend, with a pluggable two-engine design so
non-commercial (insightface model packs) and commercial-safe
(OpenCV Zoo YuNet + SFace) models share the same gRPC/HTTP surface.

New gRPC RPCs (backend/backend.proto):
  * FaceVerify(FaceVerifyRequest) returns FaceVerifyResponse
  * FaceAnalyze(FaceAnalyzeRequest) returns FaceAnalyzeResponse

Existing Embedding and Detect RPCs are reused (face image in
PredictOptions.Images / DetectOptions.src) for face embedding and
face detection respectively.

New HTTP endpoints under /v1/face/:
  * verify     — 1:1 image pair same-person decision
  * analyze    — per-face age + gender (emotion/race reserved)
  * register   — 1:N enrollment; stores embedding in vector store
  * identify   — 1:N recognition; detect → embed → StoresFind
  * forget     — remove a registered face by opaque ID

Service layer (core/services/facerecognition/) introduces a
`Registry` interface with one in-memory `storeRegistry` impl backed
by LocalAI's existing local-store gRPC vector backend. HTTP handlers
depend on the interface, not on StoresSet/StoresFind directly, so a
persistent PostgreSQL/pgvector implementation can be slotted in via a
single constructor change in core/application (TODO marker in the
package doc).

New usecase flag FLAG_FACE_RECOGNITION; insightface is also wired
into FLAG_DETECTION so /v1/detection works for face bounding boxes.

Gallery (backend/index.yaml) ships three entries:
  * insightface-buffalo-l   — SCRFD-10GF + ArcFace R50 + genderage
                              (~326MB pre-baked; non-commercial research use only)
  * insightface-opencv      — YuNet + SFace (~40MB pre-baked; Apache 2.0)
  * insightface-buffalo-s   — SCRFD-500MF + MBF (runtime download; non-commercial)

Python backend (backend/python/insightface/):
  * engines.py — FaceEngine protocol with InsightFaceEngine and
    OnnxDirectEngine; resolves model paths relative to the backend
    directory so the same gallery config works in docker-scratch and
    in the e2e-backends rootfs-extraction harness.
  * backend.py — gRPC servicer implementing Health, LoadModel, Status,
    Embedding, Detect, FaceVerify, FaceAnalyze.
  * install.sh — pre-bakes buffalo_l + OpenCV YuNet/SFace inside the
    backend directory so first-run is offline-clean (the final scratch
    image only preserves files under /<backend>/).
  * test.py — parametrized unit tests over both engines.

Tests:
  * Registry unit tests (go test -race ./core/services/facerecognition/...)
    — in-memory fake grpc.Backend, table-driven, covers register/
    identify/forget/error paths + concurrent access.
  * tests/e2e-backends/backend_test.go extended with face caps
    (face_detect, face_embed, face_verify, face_analyze); relative
    ordering + configurable verifyCeiling per engine.
  * Makefile targets: test-extra-backend-insightface-buffalo-l,
    -opencv, and the -all aggregate.
  * CI: .github/workflows/test-extra.yml gains tests-insightface-grpc,
    auto-triggered by changes under backend/python/insightface/.

Docs:
  * docs/content/features/face-recognition.md — feature page with
    license table, quickstart (defaults to the commercial-safe model),
    models matrix, API reference, 1:N workflow, storage caveats.
  * Cross-refs in object-detection.md, stores.md, embeddings.md, and
    whats-new.md.
  * Contributor README at backend/python/insightface/README.md.

Verified end-to-end:
  * buffalo_l: 6/6 specs (health, load, face_detect, face_embed,
    face_verify, face_analyze).
  * opencv: 5/5 specs (same minus face_analyze — SFace has no
    demographic head; correctly skipped via BACKEND_TEST_CAPS).

Assisted-by: Claude:claude-opus-4-7

* fix(face-recognition): move engine selection to model gallery, collapse backend entries

The previous commit put engine/model_pack options on backend gallery
entries (`backend/index.yaml`). That was wrong — `GalleryBackend`
(core/gallery/backend_types.go:32) has no `options` field, so the
YAML decoder silently dropped those keys and all three "different
insightface-*" backend entries resolved to the same container image
with no distinguishing configuration.

Correct split:

  * `backend/index.yaml` now has ONE `insightface` backend entry
    shipping the CPU + CUDA 12 container images. The Python backend
    bundles both the non-commercial insightface model packs
    (buffalo_l / buffalo_s) and the commercial-safe OpenCV Zoo
    weights (YuNet + SFace); the active engine is selected at
    LoadModel time via `options: ["engine:..."]`.

  * `gallery/index.yaml` gains three model entries —
    `insightface-buffalo-l`, `insightface-opencv`,
    `insightface-buffalo-s` — each setting the appropriate
    `overrides.backend` + `overrides.options` so installing one
    actually gives the user the intended engine. This matches how
    `rfdetr-base` lives in the model gallery against the `rfdetr`
    backend.

The earlier e2e tests passed despite this bug because the Makefile
targets pass `BACKEND_TEST_OPTIONS` directly to LoadModel via gRPC,
bypassing any gallery resolution entirely. No code changes needed.

Assisted-by: Claude:claude-opus-4-7

* feat(face-recognition): cover all supported models in the gallery + drop weight baking

Follows up on the model-gallery split: adds entries for every model
configuration either engine actually supports, and switches weight
delivery from image-baked to LocalAI's standard gallery mechanism.

Gallery now has seven `insightface-*` model entries (gallery/index.yaml):

  insightface (family)  — non-commercial research use
    • buffalo-l   (326MB)  — SCRFD-10GF + ResNet50 + genderage, default
    • buffalo-m   (313MB)  — SCRFD-2.5GF + ResNet50 + genderage
    • buffalo-s   (159MB)  — SCRFD-500MF + MBF + genderage
    • buffalo-sc  (16MB)   — SCRFD-500MF + MBF, recognition only
                             (no landmarks, no demographics — analyze
                             returns empty attributes)
    • antelopev2  (407MB)  — SCRFD-10GF + ResNet100@Glint360K + genderage

  OpenCV Zoo family — Apache 2.0 commercial-safe
    • opencv       — YuNet + SFace fp32 (~40MB)
    • opencv-int8  — YuNet + SFace int8 (~12MB, ~3x smaller, faster on CPU)

Model weights are no longer baked into the backend image. The image
now ships only the Python runtime + libraries (~275MB content size,
~1.18GB disk vs ~1.21GB when weights were baked). Weights flow through
LocalAI's gallery mechanism:

  * OpenCV variants list `files:` with ONNX URIs + SHA-256, so
    `local-ai models install insightface-opencv` pulls them into the
    models directory exactly like any other gallery-managed model.

  * insightface packs (upstream distributes .zip archives only, not
    individual ONNX files) auto-download on first LoadModel via
    FaceAnalysis' built-in machinery, rooted at the LocalAI models
    directory so they live alongside everything else — same pattern
    `rfdetr` uses with `inference.get_model()`.

Backend changes (backend/python/insightface/):

  * backend.py — LoadModel propagates `ModelOptions.ModelPath` (the
    LocalAI models directory) to engines via a `_model_dir` hint.
    This replaces the earlier ModelFile-dirname approach; ModelPath
    is the canonical "models directory" variable set by the Go loader
    (pkg/model/initializers.go:144) and is always populated.

  * engines.py::_resolve_model_path — picks up `model_dir` and searches
    it (plus basename-in-model-dir) before falling back to the dev
    script-dir. This is how OnnxDirectEngine finds gallery-downloaded
    YuNet/SFace files by filename only.

  * engines.py::_flatten_insightface_pack — new helper that works
    around an upstream packaging inconsistency: buffalo_l/s/sc zips
    expand flat, but buffalo_m and antelopev2 zips wrap their ONNX
    files in a redundant `<name>/` directory. insightface's own
    loader looks one level too shallow and fails. We call
    `ensure_available()` explicitly, flatten if nested, then hand to
    FaceAnalysis.

  * engines.py::InsightFaceEngine.prepare — root-resolution order now
    includes the `_model_dir` hint so packs download into the LocalAI
    models directory by default.

  * install.sh — no longer pre-downloads any weights. Everything is
    gallery-managed now.

  * smoke.py (new) — parametrized smoke test that iterates over every
    gallery configuration, simulating the LocalAI install flow
    (creates a models dir, fetches OpenCV files with checksum
    verification, lets insightface auto-download its packs), then
    runs detect + embed + verify (+ analyze where supported) through
    the in-process BackendServicer.

  * test.py — OnnxDirectEngineTest no longer hardcodes `/models/opencv/`
    paths; downloads ONNX files to a temp dir at setUpClass time and
    passes ModelPath accordingly.

Registry change (core/services/facerecognition/store_registry.go):

  * `dim=0` in NewStoreRegistry now means "accept whatever dimension
    arrives" — needed because the backend supports 512-d ArcFace/MBF
    and 128-d SFace via the same Registry. A non-zero dim still fails
    fast with ErrDimensionMismatch.

  * core/application plumbs `faceEmbeddingDim = 0`, explaining the
    rationale in the comment.

Backend gallery description updated to reflect that the image carries
no weights — it's just Python + engines.

Smoke-tested all 7 configurations against the rebuilt image (with the
flatten fix applied), exit 0:

    PASS: insightface-buffalo-l    faces=6 dim=512 same-dist=0.000
    PASS: insightface-buffalo-sc   faces=6 dim=512 same-dist=0.000
    PASS: insightface-buffalo-s    faces=6 dim=512 same-dist=0.000
    PASS: insightface-buffalo-m    faces=6 dim=512 same-dist=0.000
    PASS: insightface-antelopev2   faces=6 dim=512 same-dist=0.000
    PASS: insightface-opencv       faces=6 dim=128 same-dist=0.000
    PASS: insightface-opencv-int8  faces=6 dim=128 same-dist=0.000
    7/7 passed

Assisted-by: Claude:claude-opus-4-7

* fix(face-recognition): pre-fetch OpenCV ONNX for e2e target; drop stale pre-baked claim

CI regression from the previous commit: I moved OpenCV Zoo weight
delivery to LocalAI's gallery `files:` mechanism, but the
test-extra-backend-insightface-opencv target was still passing
relative paths `detector_onnx:models/opencv/yunet.onnx` in
BACKEND_TEST_OPTIONS. The e2e suite drives LoadModel directly over
gRPC without going through the gallery, so those relative paths
resolved to nothing and OpenCV's ONNXImporter failed:

    LoadModel failed: Failed to load face engine:
    OpenCV(4.13.0) ... Can't read ONNX file: models/opencv/yunet.onnx

Fix: add an `insightface-opencv-models` prerequisite target that
fetches the two ONNX files (YuNet + SFace) to a deterministic host
cache at /tmp/localai-insightface-opencv-cache/, verifies SHA-256,
and skips the download on re-runs. The opencv test target depends on
it and passes absolute paths in BACKEND_TEST_OPTIONS, so the backend
finds the files via its normal absolute-path resolution branch.

Also refresh the buffalo_l comment: it no longer says "pre-baked"
(nothing is — the pack auto-downloads from upstream's GitHub release
on first LoadModel, same as in CI).

Locally verified: `make test-extra-backend-insightface-opencv` passes
5/5 specs (health, load, face_detect, face_embed, face_verify).

Assisted-by: Claude:claude-opus-4-7

* feat(face-recognition): add POST /v1/face/embed + correct /v1/embeddings docs

The docs promised that /v1/embeddings returns face vectors when you
send an image data-URI. That was never true: /v1/embeddings is
OpenAI-compatible and text-only by contract — its handler goes
through `core/backend/embeddings.go::ModelEmbedding`, which sets
`predictOptions.Embeddings = s` (a string of TEXT to embed) and never
populates `predictOptions.Images[]`. The Python backend's Embedding
gRPC method does handle Images[] (that's how /v1/face/register reaches
it internally via `backend.FaceEmbed`), but the HTTP embeddings
endpoint wasn't wired to populate it.

Rather than overload /v1/embeddings with image-vs-text detection —
messy, and the endpoint is OpenAI-compatible by design — add a
dedicated /v1/face/embed endpoint that wraps `backend.FaceEmbed`
(already used internally by /v1/face/register and /v1/face/identify).

Matches LocalAI's convention of a dedicated path per non-standard flow
(/v1/rerank, /v1/detection, /v1/face/verify etc.).

Response:

    {
      "embedding": [<dim> floats, L2-normed],
      "dim": int,           // 512 for ArcFace R50 / MBF, 128 for SFace
      "model": "<name>"
    }

Live-tested on the opencv engine: returns a 128-d L2-normalized vector
(sum(x^2) = 1.0000). Sentinel in docs updated to note /v1/embeddings
is text-only and point image users at /v1/face/embed instead.

Assisted-by: Claude:claude-opus-4-7

* fix(http): map malformed image input + gRPC status codes to proper 4xx

Image-input failures on LocalAI's single-image endpoints (/v1/detection,
/v1/face/{verify,analyze,embed,register,identify}) have historically
returned 500 — even when the client was the one who sent garbage.
Classic example: you POST an "image" that isn't a URL, isn't a
data-URI, and isn't a valid JPEG/PNG — the server shouldn't claim
that's its fault.

Two helpers land in core/http/endpoints/localai/images.go and every
single-image handler is switched over:

  * decodeImageInput(s)
      Wraps utils.GetContentURIAsBase64 and turns any failure
      (invalid URL, not a data-URI, download error, etc.) into
      echo.NewHTTPError(400, "invalid image input: ...").

  * mapBackendError(err)
      Inspects the gRPC status on a backend call error and maps:
        INVALID_ARGUMENT     → 400 Bad Request
        NOT_FOUND            → 404 Not Found
        FAILED_PRECONDITION  → 412 Precondition Failed
        Unimplemented        → 501 Not Implemented
      All other codes fall through unchanged (still 500).

Before, my 1×1 PNG error-path test returned:
    HTTP 500 "rpc error: code = InvalidArgument desc = failed to decode one or both images"
After:
    HTTP 400 "failed to decode one or both images"

Scope-limited to the LocalAI single-image endpoints. The multi-modal
paths (middleware/request.go, openresponses/responses.go,
openai/realtime.go) intentionally log-and-skip individual media parts
when decoding fails — different design intent (graceful degradation
of a multi-part message), not a 400-worthy failure. Left untouched.

Live-verified: every error case in /tmp/face_errors.py now returns
4xx with a meaningful message; the "image with no face (1x1 PNG)"
case specifically went from 500 → 400.

Assisted-by: Claude:claude-opus-4-7

* refactor(face-recognition): insightface packs go through gallery files:, drop FaceAnalysis

Follows up on the discovery that LocalAI's gallery `files:` mechanism
handles archives (zip, tar.gz, …) via mholt/archiver/v3 — the rhasspy
piper voices use exactly this pattern. Insightface packs are zip
archives, so we can now deliver them the same way every other
gallery-managed model gets delivered: declaratively, checksum-verified,
through LocalAI's standard download+extract pipeline.

Two changes:

1. Gallery (gallery/index.yaml) — every insightface-* entry gains a
   `files:` list with the pack zip's URI + SHA-256. `local-ai models
   install insightface-buffalo-l` now fetches the zip, verifies the
   hash, and extracts it into the models directory. No more reliance
   on insightface's library-internal `ensure_available()` auto-download
   or its hardcoded `BASE_REPO_URL`.

2. InsightFaceEngine (backend/python/insightface/engines.py) — drops
   the FaceAnalysis wrapper and drives insightface's `model_zoo`
   directly. The ~50 lines FaceAnalysis provides — glob ONNX files,
   route each through `model_zoo.get_model()`, build a
   `{taskname: model}` dict, loop per-face at inference — are
   reimplemented in `InsightFaceEngine`. The actual inference classes
   (RetinaFace, ArcFaceONNX, Attribute, Landmark) are still
   insightface's — we only replicate the glue, so drift risk against
   upstream is minimal.

   Why drop FaceAnalysis: it hard-codes a `<root>/models/<name>/*.onnx`
   layout that doesn't match what LocalAI's zip extraction produces.
   LocalAI unpacks archives flat into `<models_dir>`. Upstream packs
   are inconsistent — buffalo_l/s/sc ship ONNX at the zip root (lands
   at `<models_dir>/*.onnx`), buffalo_m/antelopev2 wrap in a redundant
   `<name>/` dir (lands at `<models_dir>/<name>/*.onnx`). The new
   `_locate_insightface_pack` helper searches both locations plus
   legacy paths and returns whichever has ONNX files. Replaces the
   earlier `_flatten_insightface_pack` helper (which tried to fight
   FaceAnalysis's layout expectations; now we just find the files
   wherever they are).

Net effect for users: install once via LocalAI's managed flow,
weights live alongside every other model, progress shows in the
jobs endpoint, no first-load network call. Same API surface,
cleaner plumbing.

Assisted-by: Claude:claude-opus-4-7

* fix(face-recognition): CI's insightface e2e path needs the pack pre-fetched

The e2e suite drives LoadModel over gRPC without going through LocalAI's
gallery flow, so the engine's `_model_dir` option (normally populated
from ModelPath) is empty. Previously the insightface target relied on
FaceAnalysis auto-download to paper over this, but we dropped
FaceAnalysis in favor of direct model_zoo calls — so the buffalo_l
target started failing at LoadModel with "no insightface pack found".

Mirror the opencv target's pre-fetch pattern: download buffalo_sc.zip
(same SHA as the gallery entry), extract it on the host, and pass
`root:<dir>` so the engine locates the pack without needing
ModelPath. Switched to buffalo_sc (smallest pack, ~16MB) to keep CI
fast; it covers the same insightface engine code path as buffalo_l.

Face analyze cap dropped since buffalo_sc has no age/gender head.

Assisted-by: Claude:claude-opus-4-7[1m]

* feat(face-recognition): surface face-recognition in advertised feature maps

The six /v1/face/* endpoints were missing from every place LocalAI
advertises its feature surface to clients:

  * api_instructions — the machine-readable capability index at
    GET /api/instructions. Added `face-recognition` as a dedicated
    instruction area with an intro that calls out the in-memory
    registry caveat and the /v1/face/embed vs /v1/embeddings split.
  * auth/permissions — added FeatureFaceRecognition constant, routed
    all six face endpoints through it so admins can gate them per-user
    like any other API feature. Default ON (matches the other API
    features).
  * React UI capabilities — CAP_FACE_RECOGNITION symbol mapped to
    FLAG_FACE_RECOGNITION. Declared only for now; the Face page is a
    follow-up (noted in the plan).

Instruction count bumped 9 → 10; test updated.

Assisted-by: Claude:claude-opus-4-7[1m]

* docs(agents): capture advertising-surface steps in the endpoint guide

Before this change, adding a new /v1/* endpoint reliably missed one or
more of: the swagger @Tags annotation, the /api/instructions registry,
the auth RouteFeatureRegistry, and the React UI CAP_* symbol. The
endpoint would work but be invisible to API consumers, admins, and the
UI — and nothing in the existing docs said to look in those places.

Extend .agents/api-endpoints-and-auth.md with a new "Advertising
surfaces" section covering all four surfaces (swagger tags, /api/
instructions, capabilities.js, docs/), and expand the closing checklist
so it's impossible to ship a feature without visiting each one. Hoist a
one-liner reminder into AGENTS.md's Quick Reference so agents skim it
before diving in.

Assisted-by: Claude:claude-opus-4-7[1m]
2026-04-22 21:55:41 +02:00
Adira
607efe5a4c fix(backend-monitor): accept model as a query parameter (#9411)
The /backend/monitor endpoint is routed as GET but its handler bound the
model name from a request body, which is invalid per REST and breaks
Swagger UI and OpenAPI codegen tools that refuse to send bodies with GET.

Switch to reading ?model=<name> as a query parameter and update the
Swagger annotation, regenerated spec files, and documentation. The
handler still falls back to body binding when the query parameter is
absent, so existing clients sending {"model": "..."} continue to work.

Fixes #9207

Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
2026-04-21 22:06:35 +02:00
Ettore Di Giacinto
95efb8a562 feat(backend): add turboquant llama.cpp-fork backend (#9355)
* feat(backend): add turboquant llama.cpp-fork backend

turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch
feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme.
It ships as a first-class backend reusing backend/cpp/llama-cpp sources
via a thin wrapper Makefile: each variant target copies ../llama-cpp
into a sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server
with LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. No
duplication of grpc-server.cpp — upstream fixes flow through automatically.

Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm, SYCL
f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml,
adds a tests-turboquant-grpc e2e job driven by BACKEND_TEST_CACHE_TYPE_K/V=q8_0
to exercise the KV-cache config path (backend_test.go gains dedicated env
vars wired into ModelOptions.CacheTypeKey/Value — a generic improvement
usable by any llama.cpp-family backend), and registers a nightly auto-bump
PR in bump_deps.yaml tracking feature/turboquant-kv-cache.

scripts/changed-backends.js gets a special-case so edits to
backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since
the wrapper reuses those sources.

* feat(turboquant): carry upstream patches against fork API drift

turboquant branched from llama.cpp before upstream commit 66060008
("server: respect the ignore eos flag", #21203) which added the
`logit_bias_eog` field to `server_context_meta` and a matching
parameter to `server_task::params_from_json_cmpl`. The shared
backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so
building it against the fork unmodified fails.

Cherry-pick that commit as a patch file under
backend/cpp/turboquant/patches/ and apply it to the cloned fork
sources via a new apply-patches.sh hook called from the wrapper
Makefile. Simplifies the build flow too: instead of hopping through
llama-cpp's build-llama-cpp-grpc-server indirection, the wrapper now
drives the copied Makefile directly (clone -> patch -> build).

Drop the corresponding patch whenever the fork catches up with
upstream — the build fails fast if a patch stops applying, which
is the signal to retire it.

* docs: add turboquant backend section + clarify cache_type_k/v

Document the new turboquant (llama.cpp fork with TurboQuant KV-cache)
backend alongside the existing llama-cpp / ik-llama-cpp sections in
features/text-generation.md: when to pick it, how to install it from
the gallery, and a YAML example showing backend: turboquant together
with cache_type_k / cache_type_v.

Also expand the cache_type_k / cache_type_v table rows in
advanced/model-configuration.md to spell out the accepted llama.cpp
quantization values and note that these fields apply to all
llama.cpp-family backends, not just vLLM.

* feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion

The fork adds new GGML ops bringing GGML_OP_COUNT to 97, but
ggml/include/ggml-rpc.h static-asserts it equals 96, breaking
the GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server).
Carry a one-line patch that updates the expected count so the
assertion holds. Drop this patch whenever the fork fixes it upstream.

* feat(turboquant): allow turbo* KV-cache types and exercise them in e2e

The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own
allow-list of accepted KV-cache types (kv_cache_types[]) and rejects
anything outside it before the value reaches llama.cpp's parser. That
list only contains the standard llama.cpp types — turbo2/turbo3/turbo4
would throw "Unsupported cache type" at LoadModel time, meaning
nothing the LocalAI gRPC layer accepted was actually fork-specific.

Add a build-time augmentation step (patch-grpc-server.sh, called from
the turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0
into the allow-list of the *copied* grpc-server.cpp under
turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/
is never touched, so the stock llama-cpp build keeps compiling against
vanilla upstream which has no notion of those enum values.

Switch test-extra-backend-turboquant to set
BACKEND_TEST_CACHE_TYPE_K=turbo3 / _V=turbo3 so the e2e gRPC suite
actually runs the fork's TurboQuant KV-cache code paths (turbo3 also
auto-enables flash_attention in the fork). Picking q8_0 here would
only re-test the standard llama.cpp path that the upstream llama-cpp
backend already covers.

Refresh the docs (text-generation.md + model-configuration.md) to
list turbo2/turbo3/turbo4 explicitly and call out that you only get
the TurboQuant code path with this backend + a turbo* cache type.

* fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3

The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant)
does not install python3, so the python-based augmentation step
errored with `python3: command not found` at make time. Switch to
awk, which ships in coreutils and is already available everywhere
the rest of the wrapper Makefile runs.

* Apply suggestion from @mudler

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-15 01:25:04 +02:00
Ettore Di Giacinto
833b7e8557 chore(docs): update transcription endpoint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-14 14:14:54 +00:00
Ettore Di Giacinto
0e7c0adee4 docs: document tool calling on vLLM and MLX backends
openai-functions.md used to claim LocalAI tool calling worked only on
llama.cpp-compatible models. That was true when it was written; it's
not true now — vLLM (since PR #9328) and MLX/MLX-VLM both extract
structured tool calls from model output.

- openai-functions.md: new 'Supported backends' matrix covering
  llama.cpp, vllm, vllm-omni, mlx, mlx-vlm, with the key distinction
  that vllm needs an explicit tool_parser: option while mlx auto-
  detects from the chat template. Reasoning content (think tags) is
  extracted on the same set of backends. Added setup snippets for
  both the vllm and mlx paths, and noted the gallery importer
  pre-fills tool_parser:/reasoning_parser: for known families.
- compatibility-table.md: fix the stale 'Streaming: no' for vllm,
  vllm-omni, mlx, mlx-vlm (all four support streaming now). Add
  'Functions' to their capabilities. Also widen the MLX Acceleration
  column to reflect the CPU/CUDA/Jetson L4T backends that already
  exist in backend/index.yaml — 'Metal' on its own was misleading.
2026-04-13 16:58:55 +00:00
Ettore Di Giacinto
9ca03cf9cc feat(backends): add ik-llama-cpp (#9326)
* feat(backends): add ik-llama-cpp

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: add grpc e2e suite, hook to CI, update README

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Apply suggestion from @mudler

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

* Apply suggestion from @mudler

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-12 13:51:28 +02:00
Ettore Di Giacinto
151ad271f2 feat(rocm): bump to 7.x (#9323)
feat(rocm): bump to 7.2.1

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-12 08:51:30 +02:00
Ettore Di Giacinto
706cf5d43c feat(sam.cpp): add sam.cpp detection backend (#9288)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-09 21:49:11 +02:00
Ettore Di Giacinto
b0d9ce4905 Remove header from OpenAI Realtime API documentation
Removed the header from the Realtime API documentation.

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-09 09:00:28 +02:00
Richard Palethorpe
557d0f0f04 feat(api): Allow coding agents to interactively discover how to control and configure LocalAI (#9084)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-04 15:14:35 +02:00
Ettore Di Giacinto
11dc54bda9 fix(docs): commit distribution.md
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-03 10:14:13 +02:00
Ettore Di Giacinto
7e0b73deaa fix(docs): fix broken references to distributed mode
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-03 09:46:06 +02:00
Ettore Di Giacinto
8862e3ce60 feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186)
* always enable parallel requests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: move tests to ginkgo

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(smart router): order by available vram

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 08:28:56 +02:00
Ettore Di Giacinto
59108fbe32 feat: add distributed mode (#9124)
* feat: add distributed mode (experimental)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix data races, mutexes, transactions

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactorings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix events and tool stream in agent chat

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* use ginkgo

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(cron): compute correctly time boundaries avoiding re-triggering

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* enhancements, refactorings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* do not flood of healthy checks

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* do not list obvious backends as text backends

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* tests fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Drop redundant healthcheck

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* enhancements, refactorings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-30 00:47:27 +02:00
Richard Palethorpe
26384c5c70 fix(docs): Use notice instead of alert (#9134)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-25 13:55:48 +01:00
Ettore Di Giacinto
f7e8d9e791 feat(quantization): add quantization backend (#9096)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-22 00:56:34 +01:00
Ettore Di Giacinto
d9c1db2b87 feat: add (experimental) fine-tuning support with TRL (#9088)
* feat: add fine-tuning endpoint

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(experimental): add fine-tuning endpoint and TRL support

This changeset defines new GRPC signatues for Fine tuning backends, and
add TRL backend as initial fine-tuning engine. This implementation also
supports exporting to GGUF and automatically importing it to LocalAI
after fine-tuning.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* commit TRL backend, stop by killing process

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* move fine-tune to generic features

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* add evals, reorder menu

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Fix tests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-21 02:08:02 +01:00
Ettore Di Giacinto
aea21951a2 feat: add users and authentication support (#9061)
* feat(ui): add users and authentication support

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: allow the admin user to impersonificate users

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: ui improvements, disable 'Users' button in navbar when no auth is configured

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: add OIDC support

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: gate models

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: cache requests to optimize speed

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* small UI enhancements

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(ui): style improvements

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: cover other paths by auth

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: separate local auth, refactor

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* security hardening, approval mode

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: fix tests and expectations

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: update localagi/localrecall

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-19 21:40:51 +01:00
LocalAI [bot]
bbe9067227 docs: Add troubleshooting guide for embedding models (fixes #9064) (#9065)
docs: Add troubleshooting guide for embedding models (#9064)

- Add section on using gallery models for embeddings
- Document common issues with embedding model configuration
- Add troubleshooting guide for Qwen3 embedding models
- Include correct configuration examples for Qwen3-Embedding-4B
- Document context size limits and dimension parameters
- Add table of Qwen3 embedding model specifications

Fixes #9064

Signed-off-by: localai-bot <localai-bot@localai.io>
Co-authored-by: localai-bot <localai-bot@localai.io>
2026-03-19 19:41:12 +01:00
Ettore Di Giacinto
5affb747a9 chore: drop AIO images (#9004)
AIO images are behind, and takes effort to maintain these. Wizard and
installation of models have been semplified massively, so AIO images
lost their purpose.

This allows us to be more laser focused on main images and reliefes
stress from CI.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-14 17:49:36 +01:00
Richard Palethorpe
f9a850c02a feat(realtime): WebRTC support (#8790)
* feat(realtime): WebRTC support

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(tracing): Show full LLM opts and deltas

Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-13 21:37:15 +01:00
Ettore Di Giacinto
8818452d85 feat(ui): MCP Apps, mcp streaming and client-side support (#8947)
* Revert "fix: Add timeout-based wait for model deletion completion (#8756)"

This reverts commit 9e1b0d0c82.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: add mcp prompts and resources

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add client-side MCP

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): allow to authenticate MCP servers

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add MCP Apps

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: update AGENTS

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: allow to collapse navbar, save state in storage

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add MCP button also to home page

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(chat): populate string content

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-11 07:30:49 +01:00
Ettore Di Giacinto
85f3558d22 feat(ui): add canvas mode, support history in agent chat (#8927)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-09 23:42:47 +01:00
Ettore Di Giacinto
a026277ab9 feat(mlx-distributed): add new MLX-distributed backend (#8801)
* feat(mlx-distributed): add new MLX-distributed backend

Add new MLX distributed backend with support for both TCP and RDMA for
model sharding.

This implementation ties in the discovery implementation already in
place, and re-uses the same P2P mechanism for the TCP MLX-distributed
inferencing.

The Auto-parallel implementation is inspired by Exo's
ones (who have been added to acknowledgement for the great work!)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* expose a CLI to facilitate backend starting

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: make manual rank0 configurable via model configs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add missing features from mlx backend

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Apply suggestion from @mudler

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-03-09 17:29:32 +01:00
LocalAI [bot]
d200401e86 feat: Add --data-path CLI flag for persistent data separation (#8888)
feat: add --data-path CLI flag for persistent data separation

- Add LOCALAI_DATA_PATH environment variable and --data-path CLI flag
- Default data path: /data (separate from configuration directory)
- Automatic migration on startup: moves agent_tasks.json, agent_jobs.json, collections/, and assets/ from old config dir to new data path
- Backward compatible: preserves old behavior if LOCALAI_DATA_PATH is not set
- Agent state and job directories now use DataPath with proper fallback chain
- Update documentation with new flag and docker-compose example

This separates mutable persistent data (collectiondb, agents, assets, skills) from configuration files, enabling better volume mounting and data persistence in containerized deployments.

Signed-off-by: localai-bot <localai-bot@noreply.github.com>
Co-authored-by: localai-bot <localai-bot@noreply.github.com>
2026-03-09 14:11:15 +01:00
LocalAI [bot]
9297074caa docs: expand GPU acceleration guide with L4T, multi-GPU, monitoring, and troubleshooting (#8858)
- Expand multi-GPU section to cover llama.cpp (CUDA_VISIBLE_DEVICES,
  HIP_VISIBLE_DEVICES) in addition to diffusers
- Add NVIDIA L4T/Jetson section with quick start commands and cross-reference
  to the dedicated ARM64 page
- Add GPU monitoring section with vendor-specific tools (nvidia-smi, rocm-smi,
  intel_gpu_top)
- Add troubleshooting section covering common issues: GPU not detected, CPU
  fallback, OOM errors, unsupported ROCm targets, SYCL mmap hang
- Replace "under construction" warning with useful cross-references to related
  docs (container images, VRAM management)

Signed-off-by: localai-bot <localai-bot@users.noreply.github.com>
Co-authored-by: localai-bot <localai-bot@noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 21:59:57 +01:00
LocalAI [bot]
9090bca920 feat: Add documentation for undocumented API endpoints (#8852)
* feat: add documentation for undocumented API endpoints

Creates comprehensive documentation for 8 previously undocumented endpoints:
- Voice Activity Detection (/v1/vad)
- Video Generation (/video)
- Sound Generation (/v1/sound-generation)
- Backend Monitor (/backend/monitor, /backend/shutdown)
- Token Metrics (/tokenMetrics)
- P2P endpoints (/api/p2p/* - 5 sub-endpoints)
- System Info (/system, /version)

Each documentation file includes HTTP method, request/response schemas,
curl examples, sample JSON responses, and error codes.

* docs: remove token-metrics endpoint documentation per review feedback

The token-metrics endpoint is not wired into the HTTP router and
should not be documented per reviewer request.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: move system-info documentation to reference section

Per review feedback, system-info endpoint docs are better suited
for the reference section rather than features.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: localai-bot <localai-bot@noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 17:59:33 +01:00
Ettore Di Giacinto
d21369ad7b Update shell completion documentation URL
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-03-08 09:33:36 +01:00
LocalAI [bot]
efd402207c feat: Add shell completion support for bash, zsh, and fish (#8851)
feat: add shell completion support for bash, zsh, and fish

- Add core/cli/completion.go with dynamic completion script generation
- Add core/cli/completion_test.go with unit tests
- Modify cmd/local-ai/main.go to support completion command
- Modify core/cli/cli.go to add Completion subcommand
- Add docs/content/features/shell-completion.md with installation instructions

The completion scripts are generated dynamically from the Kong CLI model,
so they automatically include all commands, subcommands, and flags.

Co-authored-by: localai-bot <localai-bot@noreply.github.com>
2026-03-08 09:32:39 +01:00
Ettore Di Giacinto
ac48867b7d feat: add agentic management (#8820)
* feat: add standalone and agentic functionalities

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* expose agents via responses api

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-07 00:03:08 +01:00
LocalAI [bot]
ab315f2725 feat: Add LOCALAI_DISABLE_MCP environment variable to disable MCP support (#8816)
* feat: Add LOCALAI_DISABLE_MCP environment variable to disable MCP support

- Added DisableMCP field to RunCMD struct in core/cli/run.go
- Added LOCALAI_DISABLE_MCP environment variable support
- Added DisableMCP field to ApplicationConfig struct
- Added DisableMCP AppOption function
- Updated MCP endpoint routing to check appConfig.DisableMCP
- When LOCALAI_DISABLE_MCP is set to true/1/yes, MCP endpoints are not registered

When set, all MCP functionality is disabled and appropriate error messages
are returned to users.

Use Cases:
- Security-conscious deployments where MCP is not needed
- Reducing attack surface
- Compliance requirements that prohibit certain protocol support

Environment variable: LOCALAI_DISABLE_MCP=true

Signed-off-by: localai-bot <localai-bot@users.noreply.github.com>

* docs: Add documentation for LOCALAI_DISABLE_MCP environment variable

- Add section explaining how to disable MCP support using environment variable
- Document use cases for disabling MCP
- Provide examples for CLI and Docker usage

Signed-off-by: localai-bot <localai-bot@users.noreply.github.com>

---------

Signed-off-by: localai-bot <localai-bot@users.noreply.github.com>
Co-authored-by: localai-bot <localai-bot@users.noreply.github.com>
2026-03-06 20:44:03 +01:00
Ettore Di Giacinto
580517f9db feat: pass-by metadata to predict options (#8795)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-05 22:50:10 +01:00
Andres
454d8adc76 feat(qwen-tts): Support using multiple voices (#8757)
* Add support for multiple voice clones in Qwen TTS

Signed-off-by: Andres Smith <andressmithdev@pm.me>

* Add voice prompt caching and generation logs to see generation time

---------

Signed-off-by: Andres Smith <andressmithdev@pm.me>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-03-04 09:47:21 +01:00
Ettore Di Giacinto
983db7bedc feat(ui): add model size estimation (#8684)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-02-28 23:03:47 +01:00
Lukas Schaefer
ed0bfb8732 fix: rename json_verbose to verbose_json (#8627)
Signed-off-by: Lukas Schaefer <lukas@lschaefer.xyz>
2026-02-23 17:57:06 +00:00
LocalAI [bot]
559ab99890 docs: update diffusers multi-GPU documentation to mention tensor_parallel_size configuration (#8621)
* docs: update diffusers multi-GPU documentation to mention tensor_parallel_size configuration

* chore: revert backend/python/diffusers/README.md to original content

---------

Co-authored-by: Your Name <you@example.com>
2026-02-22 18:17:23 +01:00
Ettore Di Giacinto
b471619ad9 chore(deps): bump cogito and add new options to the agent config (#8601)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-02-18 22:10:26 +01:00
Ettore Di Giacinto
bf5a1dd840 feat(voxtral): add voxtral backend (#8451)
* feat(voxtral): add voxtral backend

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* simplify

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-02-09 09:12:05 +01:00
Varun Chawla
aa9ca401fa docs: update model gallery documentation to reference main repository (#8452)
Fixes #8212 - Updated the note about reporting broken models to
reference the main LocalAI repository instead of the outdated
separate gallery repository reference.
2026-02-08 22:14:23 +01:00
Richard Palethorpe
c1d0b10b14 chore(docs): Document using a local model gallery (#8426)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-02-06 10:28:41 +01:00
Ettore Di Giacinto
53276d28e7 feat(musicgen): add ace-step and UI interface (#8396)
* feat(musicgen): add ace-step and UI interface

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Correctly handle model dir

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Drop auto-download

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add to models, fixup UIs icons

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Update docs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* l4t13 is incompatbile

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* avoid pinning version for cuda12

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Drop l4t12

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-02-05 12:04:53 +01:00
Andres
b6459ddd57 feat(api): Add transcribe response format request parameter & adjust STT backends (#8318)
* WIP response format implementation for audio transcriptions

(cherry picked from commit e271dd764bbc13846accf3beb8b6522153aa276f)
Signed-off-by: Andres Smith <andressmithdev@pm.me>

* Rework transcript response_format and add more formats

(cherry picked from commit 6a93a8f63e2ee5726bca2980b0c9cf4ef8b7aeb8)
Signed-off-by: Andres Smith <andressmithdev@pm.me>

* Add test and replace go-openai package with official openai go client

(cherry picked from commit f25d1a04e46526429c89db4c739e1e65942ca893)
Signed-off-by: Andres Smith <andressmithdev@pm.me>

* Fix faster-whisper backend and refactor transcription formatting to also work on CLI

Signed-off-by: Andres Smith <andressmithdev@pm.me>
(cherry picked from commit 69a93977d5e113eb7172bd85a0f918592d3d2168)
Signed-off-by: Andres Smith <andressmithdev@pm.me>

---------

Signed-off-by: Andres Smith <andressmithdev@pm.me>
Co-authored-by: nanoandrew4 <nanoandrew4@gmail.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-02-01 17:33:17 +01:00
Ettore Di Giacinto
68dd9765a0 feat(tts): add support for streaming mode (#8291)
* feat(tts): add support for streaming mode

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Send first audio, make sure it's 16

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-01-30 11:58:01 +01:00
Richard Palethorpe
dd8e74a486 feat(realtime): Add audio conversations (#6245)
* feat(realtime): Add audio conversations

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(realtime): Vendor the updated API and modify for server side

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(realtime): Update to the GA realtime API

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore: Document realtime API and add docs to AGENTS.md

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat: Filter reasoning from spoken output

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(realtime): Send delta and done events for tool calls and audio transcripts

Ensure that content is sent in both deltas and done events for function call arguments and audio transcripts. This fixes compatibility with clients that rely on delta events for parsing.

💘 Generated with Crush

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(realtime): Improve tool call handling and error reporting

- Refactor Model interface to accept []types.ToolUnion and *types.ToolChoiceUnion
  instead of JSON strings, eliminating unnecessary marshal/unmarshal cycles
- Fix Parameters field handling: support both map[string]any and JSON string formats
- Add PredictConfig() method to Model interface for accessing model configuration
- Add comprehensive debug logging for tool call parsing and function config
- Add missing return statement after prediction error (critical bug fix)
- Add warning logs for NoAction function argument parsing failures
- Improve error visibility throughout generateResponse function

💘 Generated with Crush

Assisted-by: Claude Sonnet 4.5 via Crush <crush@charm.land>
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-01-29 08:44:53 +01:00
Ettore Di Giacinto
6804ce1c39 chore(docs): change MEMORY_FILE_PATH to MEMORY_INDEX_PATH
Updated MEMORY_FILE_PATH to MEMORY_INDEX_PATH in memory configuration.

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-01-25 22:14:11 +01:00
Ettore Di Giacinto
26a374b717 chore: drop bark which is unmaintained (#8207)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-01-25 09:26:40 +01:00
Ettore Di Giacinto
05904c77f5 chore(exllama): drop backend now almost deprecated (#8186)
exllama2 development has stalled and only old architectures are
supported. exllamav3 is still in development, meanwhile cleaning up
exllama2 from the gallery.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-01-24 08:57:37 +01:00
Ettore Di Giacinto
923ebbb344 feat(qwen-tts): add Qwen-tts backend (#8163)
* feat(qwen-tts): add Qwen-tts backend

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Update intel deps

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Drop flash-attn for cuda13

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-01-23 15:18:41 +01:00
Ettore Di Giacinto
4bf2f8bbd8 chore(docs): update docs with Anthropic API and openresponses
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-01-20 09:25:24 +01:00
Ettore Di Giacinto
a6ff354c86 feat(tts): add pocket-tts backend (#8018)
* feat(pocket-tts): add new backend

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add to the gallery

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Update docs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-01-13 23:35:19 +01:00