Compare commits

147 Commits

Author SHA1 Message Date
Ettore Di Giacinto
50580a84ae fix(ci): switch apt mirror per runner — azure on github-hosted, kernel.org on self-hosted
Self-hosted runners (arc-runner-set, bigger-runner) cannot reach
azure.archive.ubuntu.com — they live in different networks (e.g. our
arc-runner-set Kubernetes cluster) where Azure's mirror IP is not
routable. Symptom: "Connection failed [IP: 51.11.236.225 80]" with each
Ign:/Err: cycle taking 60s, hanging the build for ~16 minutes before
exit 100.

Pick the mirror based on `runner.environment`:

  * github-hosted (ubuntu-latest, ubuntu-24.04-arm) → Azure
    (http://azure.archive.ubuntu.com / http://azure.ports.ubuntu.com)
    — same VPC as the runner.
  * self-hosted (arc-runner-set, bigger-runner)    → kernel.org
    (https://mirrors.edge.kernel.org for both archive and ports)
    — publicly reachable from any network.
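
The per-runner choice above can be modeled as a small lookup. This is only an illustrative Python sketch: the real logic lives in the composite action, and the function name and return shape here are hypothetical. The mirror URLs and output names are the ones quoted in this message.

```python
# Mirror base URLs as named in the commit message.
AZURE = ("http://azure.archive.ubuntu.com", "http://azure.ports.ubuntu.com")
KERNEL_ORG = ("https://mirrors.edge.kernel.org", "https://mirrors.edge.kernel.org")


def pick_apt_mirrors(runner_environment: str) -> dict:
    """Hypothetical helper: map `runner.environment` to the two mirror outputs."""
    # GitHub-hosted runners use Azure; anything else is treated as self-hosted
    # and gets the publicly reachable kernel.org mirror.
    archive, ports = AZURE if runner_environment == "github-hosted" else KERNEL_ORG
    return {"effective-mirror": archive, "effective-ports-mirror": ports}
```

Treating every value other than `github-hosted` as self-hosted is an assumption of this sketch; it keeps an unexpected value on the mirror that is reachable from any network.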

The choice now lives in one place: the .github/actions/configure-apt-mirror
composite action exposes `effective-mirror` / `effective-ports-mirror`
outputs so the reusable workflows can forward the same value as Docker
build-args without duplicating the per-runner-environment branch.

The now-redundant `apt-mirror` / `apt-ports-mirror` workflow inputs on
image_build.yml and backend_build.yml are dropped — defaults live in the
composite action and are visible there.

Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-03 22:59:26 +00:00
Ettore Di Giacinto
8edac61e57 feat(ci): allow routing apt traffic through an alternate Ubuntu mirror (#9650)
* feat(ci): allow routing apt traffic through an alternate Ubuntu mirror

Adds opt-in APT_MIRROR / APT_PORTS_MIRROR knobs to all Dockerfiles, the
Makefile, and CI workflows so we can fail over to a non-canonical Ubuntu
mirror when archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com
are degraded (recently observed: multi-day DDoS against the default pool).

Defaults are empty everywhere — behavior is unchanged unless a mirror is
configured. To enable in CI, set the repo-level GitHub Actions variables
APT_MIRROR (and APT_PORTS_MIRROR for arm64 builds). Locally:
    make docker APT_MIRROR=http://azure.archive.ubuntu.com

A small POSIX-sh helper in .docker/apt-mirror.sh rewrites both DEB822
(/etc/apt/sources.list.d/ubuntu.sources, Ubuntu 24.04+) and the legacy
/etc/apt/sources.list before the first apt-get update. Dockerfile stages
load it via RUN --mount=type=bind, so there is no extra layer and no
cache invalidation when the script is unchanged. Reusable workflows also
rewrite the runner's own /etc/apt sources before any sudo apt-get call.

Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(apt-mirror): default to the Azure mirror, visible in the workflow source

Bakes Azure (http://azure.archive.ubuntu.com / http://azure.ports.ubuntu.com)
in as the default for both Docker builds and runner-side apt — rather than
hiding the URL behind a GitHub Actions repo variable that's not visible
from the source tree.

A new composite action at .github/actions/configure-apt-mirror is the
single source of truth for runner-side rewrites. Five standalone
workflows (build-test, release, tests-e2e, tests-ui-e2e, update_swagger)
just `uses: ./.github/actions/configure-apt-mirror`.

Three workflows (image_build, backend_build, checksum_checker) keep an
inline bash rewrite, because they install/upgrade git via apt *before*
the checkout step (so the local composite action isn't loadable yet).
The Azure URL is visible in those files too.

The `apt-mirror` / `apt-ports-mirror` inputs of the reusable workflows
keep their now-Azure defaults — they still feed the Docker build-args
block in addition to the inline runner-side rewrite. Callers (image.yml,
image-pr.yml, backend.yml, backend_pr.yml) drop the previous
`vars.APT_MIRROR` plumbing and rely on those defaults.

Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(apt-mirror): drop Force Install GIT, consolidate on the composite action

The PPA git upgrade ran add-apt-repository ppa:git-core/ppa, which talks
to api.launchpad.net — also part of Canonical's infrastructure and
currently returning HTTP 504. The Azure mirror only covers
archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com, not PPAs.

The system git that ubuntu-latest already ships is sufficient for
actions/checkout and the build pipeline, so just drop the upgrade. With
that gone, the apt-before-checkout constraint disappears too — all three
holdouts (image_build, backend_build, checksum_checker) can now switch
to ./.github/actions/configure-apt-mirror like the other five.

Net: 0 inline apt-mirror blocks, all 8 workflows route through the
composite action.

Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-03 23:50:13 +02:00
Tai An
0b024f0886 chore(model gallery): add chroma1-hd diffusers model (#9646)
Resolves https://github.com/mudler/LocalAI/issues/9604

Adds Chroma1-HD (lodestones/Chroma1-HD), an 8.9B-parameter
text-to-image model derived from FLUX.1-schnell, served via the
upstream-diffusers ChromaPipeline. Inference defaults follow the
model card recommendations: 40 steps, CFG 3.0, bfloat16.

Assisted-by: claude-code:opus-4.7
2026-05-03 09:06:31 +02:00
Ettore Di Giacinto
a6121e240e docs: credit the LocalAI maintainers team
Update README and docs to attribute maintenance to the LocalAI team
(Ettore Di Giacinto and Richard Palethorpe) and drop the autonomous
AI dev team section.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Edit] [Bash]
2026-05-02 23:37:04 +00:00
Ettore Di Giacinto
87cf736068 feat(react-ui): add multilingual (i18n) support (#9642)
Adds end-to-end internationalization to the React UI with five seed
languages (English, Italian, Spanish, German, Simplified Chinese) and
a sidebar-footer language switcher next to the existing theme toggle.

Library: react-i18next + i18next + i18next-http-backend +
i18next-browser-languagedetector. The detector caches the user's
choice in localStorage (key `localai-language`, mirroring the existing
`localai-theme` convention) and updates the `<html lang>` attribute on
change. fallbackLng is `en`, so any missing translation in another
locale falls back transparently.

Translation files live under `public/locales/<lng>/<ns>.json`. They
ride along with the existing `//go:embed react-ui/dist/*` directive,
but the previous SPA route in core/http/app.go only exposed
`/assets/*` from the embedded React build. This commit generalizes
the asset handler into a `serveReactSubdir(subdir)` helper and adds a
matching `/locales/*` route so i18next-http-backend can fetch the
JSONs at runtime. The http-backend `loadPath` is built via the
existing `apiUrl()` helper so instances served under a sub-path (e.g.
`<base href="/ui/">`) resolve correctly.

Namespaces (13): common, nav, errors, auth, home, models, importModel,
chat, agents, skills, collections, media, admin. Translated UI surfaces
include the sidebar/header/footer chrome, login + account flows, the
Home dashboard (incl. the manage-by-chat assistant CTA), the model
gallery + import flow, the chat experience (Chat.jsx + ChatsMenu),
agents/skills/collections list pages, the studio media tabs (Image,
Video, TTS), and the admin page-headers (Settings incl. its section
nav, Manage, Backends, Traces, Nodes, P2P, Users, Usage). Shared
components (ConfirmDialog, Toast) take their default labels from the
common namespace so callers don't need to pass strings explicitly.

Tooling for incremental adoption is included:
  - `i18next-parser.config.js` + `npm run i18n:extract` to sweep `t()`
    keys into the JSON skeletons.
  - `scripts/translate-locales.mjs` (one-off helper) to bootstrap
    non-English locales from English source via OpenAI or Anthropic
    APIs, with --copy mode as a placeholder fallback. Idempotent;
    preserves existing translations unless --overwrite is passed.

Larger config-driven pages (ModelEditor, Settings deep field forms,
AgentChat/AgentCreate, SkillEdit, CollectionDetails, Talk, Sound,
biometrics, FineTune/Quantize, Users modals, Nodes/P2P install
pickers, BackendLogs, Traces deep filters, Explorer) intentionally
keep their inner content untranslated for now — they fall back to
English via fallbackLng so functionality is unaffected, and the
extracted-strings pattern + the bootstrap script make follow-up
extraction straightforward.

The initial Suspense fallback at the root in main.jsx covers the
first JSON fetch on cold load. A simple `.app-boot-spinner` styled
in App.css provides a non-empty paint while the first namespace
loads.

Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-02 22:42:08 +02:00
LocalAI [bot]
1ad5b5907d feat(swagger): update swagger (#9643)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-02 22:33:47 +02:00
Russell Sim
18e039f305 fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds (#9626)
* fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds

399c1dec wired amdgpu-targets through the backend_build workflow_call
interface, intending the input's default value to cover matrix entries
that don't specify targets. However, GitHub Actions only applies a
workflow_call input default when the caller omits the input entirely.
When backend.yml passes `amdgpu-targets: ${{ matrix.amdgpu-targets }}`
and the matrix entry has no amdgpu-targets key, the expression evaluates
to an empty string, which is treated as an explicit value — bypassing
the default. The result is Docker receiving AMDGPU_TARGETS="" which in
turn causes Make's ?= default to be skipped (since the variable is
already set in the environment, even to empty), and cmake gets
-DAMDGPU_TARGETS= with no targets, so the HIP backend compiles for an
indeterminate target rather than the intended GPU list.

Fix this at three levels:

1. backend.yml: use a || fallback in the expression so that an undefined
   matrix.amdgpu-targets never reaches the reusable workflow as an empty
   string. The target list is the canonical default and lives here.

2. backend_build.yml: remove the now-misleading default value from the
   input declaration. The default never fired due to the above bug, so
   keeping it implied a guarantee that didn't exist.

3. backend/cpp/llama-cpp/Makefile: add an explicit $(error ...) guard
   after the ?= assignment so that if AMDGPU_TARGETS is empty (whether
   from environment or any future CI wiring mistake) the build fails
   immediately with a clear message rather than silently producing a
   binary compiled for an unknown GPU target.
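
The "default only applies when the input is omitted" trap has a direct analogue in most languages. The Python sketch below is an analogy, not the workflow code: the dict stands in for the workflow_call inputs, the function names are hypothetical, and `DEFAULT_TARGETS` is a placeholder rather than the real target list.

```python
DEFAULT_TARGETS = "<default-gpu-target-list>"  # placeholder, not the real list


def input_default_only(inputs: dict) -> str:
    # Mirrors the buggy wiring: the default fires only when the key is absent,
    # so an explicit empty string passes straight through.
    return inputs.get("amdgpu-targets", DEFAULT_TARGETS)


def with_fallback(inputs: dict) -> str:
    # Mirrors the `||` fallback: an empty or missing value falls back too.
    return inputs.get("amdgpu-targets") or DEFAULT_TARGETS


def require_targets(value: str) -> str:
    # Mirrors the Makefile guard: fail loudly instead of building for nothing.
    if not value:
        raise ValueError("AMDGPU_TARGETS is empty")
    return value
```

With `{"amdgpu-targets": ""}`, the first function returns the empty string while the second returns the default, which is the difference the fix relies on.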

Assisted-by: Claude Code:claude-sonnet-4-6
Signed-off-by: Russell Sim <rsl@simopolis.xyz>

* fix(build): plumb AMDGPU_TARGETS through to Docker builds

The docker-build-backend Makefile macro and Dockerfile.golang did not
pass AMDGPU_TARGETS to the inner make invocation, so hipblas builds
always used the backend Makefile's hardcoded default GPU targets
regardless of what was specified via environment or CI inputs.

Signed-off-by: Russell Sim <rsl@simopolis.xyz>

---------

Signed-off-by: Russell Sim <rsl@simopolis.xyz>
2026-05-02 15:53:14 +02:00
Ettore Di Giacinto
b1a99436c7 feat(branding): admin-configurable instance name, tagline, and assets (#9635)
Adds a whitelabeling feature so an operator can replace the LocalAI
instance name, tagline, square logo, horizontal logo, and favicon from
the admin Settings page. Defaults fall back to the bundled assets so
existing installs are unaffected.

The public GET /api/branding endpoint is reachable pre-auth so the
login screen can render the configured branding before sign-in.
Mutating routes (POST/DELETE /api/branding/asset/:kind) remain
admin-only. Text fields (instance_name, instance_tagline) ride the
existing /api/settings flow; binary assets get a dedicated multipart
upload route that persists files under DynamicConfigsDir/branding/.

To prevent the Settings page's stale local state from clobbering an
upload on save, UpdateSettingsEndpoint preserves whatever the on-disk
asset filename fields are regardless of the body — /api/branding/asset/*
are the sole writers for those fields.

The MCP catalog gains get_branding and set_branding tools (text fields
only; file upload stays UI-only) plus a configure_branding skill prompt.

While wiring this up, the same restart-loss class of bug surfaced for
several existing fields whose RuntimeSettings entries were never read
by the startup loader. Fix loadRuntimeSettingsFromFile() to load:

  - branding (instance_name, instance_tagline, *_file basenames)
  - auto_upgrade_backends, prefer_development_backends
  - localai_assistant_enabled
  - open_responses_store_ttl
  - the 7 existing AgentPool fields (enabled, default/embedding model,
    chunking sizes, enable_logs, collection_db_path)

Also exposes 3 new AgentPool runtime settings (vector_engine,
database_url, agent_hub_url) via /api/settings + the Settings UI, with
the same load-on-startup wiring. The file watcher's manual-edit path
is intentionally not changed — the in-process API endpoints already
update appConfig directly, so the watcher is redundant for supported
flows and a separate refactor for everything else.

15 TDD specs cover the loader behaviour (1 branding + 11 adjacent + 3
new agent-pool); 2 specs cover the persistence helpers and the
clobber-prevention contract.


Assisted-by: claude-code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-02 15:51:36 +02:00
Ettore Di Giacinto
7325046650 fix(diffusers): drop compel from requirements to unblock pip resolver (#9632)
compel 2.3.1 (latest, Nov 2025) declares transformers~=4.25 in its
metadata, i.e. >=4.25,<5.0. After transformers 5.0 (2026-01-26) and
huggingface-hub 1.0 (2025-10-27) shipped, the weekly DEPS_REFRESH
cache rotation in CI started seeing the new majors and pip's resolver
went into multi-hour backtracking storms walking every transformers
4.x candidate against every accelerate/hf-hub/tokenizers combination
to find a set compel would accept. The 2026-04-29 backend-build for
the diffusers backend (darwin-mps + l4t + cublas13-turboquant matrix
cells) hit the GitHub Actions 6h job timeout still inside pip
install — the build itself never started.

compel is the only hard upper bound on transformers in this stack
(diffusers, accelerate, peft, optimum-quanto are all flexible), and
upstream support for transformers 5 is still in flight: damian0815/
compel#129 ("Modernize Compel for Transformers 5") and #128 ("Bump
transformers version to >5.0") are both open as of today.

backend.py only constructs Compel() when COMPEL=1 is set in the env
(default off), so make compel a true optional extra:

  - Wrap the top-level `from compel import ...` in try/except
    ImportError, mirroring the existing sd_embed pattern.
  - Auto-disable COMPEL with a warning when the module isn't
    installed, instead of crashing on module load.
  - Drop compel from all eight requirements-*.txt variants so the
    resolver no longer has to satisfy its transformers cap.
  - Leave a TODO in backend.py and in each requirements file
    pointing at the upstream PR/issue, so the dependency can be
    reinstated once compel supports transformers >= 5.

Users who rely on weighted-prompt embeddings can opt in with a
manual `pip install compel` alongside COMPEL=1; the warning emitted
on startup tells them how.
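
A minimal sketch of the optional-import pattern described above. It assumes only what this message states (a guarded `from compel import ...` and a `COMPEL=1` opt-in); the helper name `compel_enabled` and the warning text are hypothetical, not the contents of backend.py.

```python
import os
import warnings

try:
    from compel import Compel  # optional: no longer listed in requirements
    COMPEL_AVAILABLE = True
except ImportError:
    Compel = None
    COMPEL_AVAILABLE = False


def compel_enabled(environ=None, available=None) -> bool:
    """Hypothetical helper: honour COMPEL=1 only when compel is importable."""
    environ = os.environ if environ is None else environ
    available = COMPEL_AVAILABLE if available is None else available
    if environ.get("COMPEL", "0") != "1":
        return False  # default off
    if not available:
        warnings.warn(
            "COMPEL=1 is set but compel is not installed; "
            "run `pip install compel` to enable it. Continuing without it."
        )
        return False
    return True
```

The `environ` and `available` parameters exist only so the decision can be exercised without touching the real environment or installing compel.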

Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit WebFetch]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-01 14:45:14 +02:00
Ettore Di Giacinto
8452068f43 feat(importers): whisper.cpp HF repos pick a quant + nest under whisper/models (#9630)
The WhisperImporter's Import() switch ordered LooksLikeURL ahead of the
HuggingFace branch, so any https://huggingface.co/<owner>/<repo> URI
(e.g. LocalAI-io/whisper-large-v3-it-yodas-only-ggml) hijacked the URL
path. FilenameFromUrl returned the repo slug, the gallery entry pointed
at the HTML repo page, the SHA256 was empty, and the HF file listing
was effectively dead code for HTTPS imports. The HF branch only fired
for huggingface://owner/repo and hf://owner/repo references.

Gate the URL case on a "ggml-*.bin" basename signal — mirroring how
the llama-cpp importer gates on ".gguf" — so direct file URLs still
take the URL path while HF repo URLs fall through to the HF branch.
There the file listing is actually consulted: every ggml-*.bin entry
is collected and one is picked by the new preferences.quantizations
preference (default q5_0; comma-separated for fallback ordering).

Pin the chosen file under whisper/models/<name>/<file> so a single
repo can ship q4_0/q5_0/q8_0 side-by-side without colliding on disk,
matching the llama-cpp/models/<name>/ layout. The fallback when no
preference matches is the last available ggml file, mirroring
llama-cpp's pickPreferredGroup behaviour.
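
The selection rule can be sketched as follows. The importer itself is Go; this Python version is an illustration under assumptions: the function names are hypothetical, and matching a preference by substring against the file's basename is a guess at how the quantization is recognized.

```python
import posixpath
from fnmatch import fnmatchcase


def pick_ggml_file(files, quantizations="q5_0"):
    """Pick one ggml-*.bin file by comma-separated preference order."""
    # Keep only whisper.cpp model files; everything else is ignored.
    candidates = [f for f in files
                  if fnmatchcase(posixpath.basename(f), "ggml-*.bin")]
    if not candidates:
        return None
    for pref in (p.strip().lower() for p in quantizations.split(",")):
        if not pref:
            continue
        for f in candidates:
            if pref in posixpath.basename(f).lower():
                return f
    # No preference matched: fall back to the last available ggml file.
    return candidates[-1]


def install_path(name, file):
    """Nest the chosen file under whisper/models/<name>/<file>."""
    return posixpath.join("whisper", "models", name, posixpath.basename(file))
```

For a listing containing q4_0, q5_0, and q8_0 files, the default preference selects the q5_0 file, and `quantizations="q6_k,q8_0"` falls through to the q8_0 file.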

Tests: replace the previous probe spec with positive assertions
against LocalAI-io/whisper-large-v3-it-yodas-only-ggml (default →
ggml-model-q5_0.bin, quantizations=q4_0 → ggml-model-q4_0.bin) plus
two offline specs that build a fake hfapi.ModelDetails to cover the
fallback rule and non-ggml filtering without touching the network.


Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit WebFetch]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-01 12:03:07 +02:00
ER-EPR
0b0078047f Add tags to qwen3-vl-reranker and Qwen3-VL-Embedding to the gallery (#9628)
* Add tags to Qwen3-VL-Reranker models

Added tags for reranker models in index.yaml.

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>

* Add Qwen3-VL-Embedding models to gallery

Added Qwen3-VL-Embedding-8B and Qwen3-VL-Embedding-2B models with detailed descriptions and file references.

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>

* Update index.yaml

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>

---------

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>
2026-05-01 10:56:58 +02:00
Tai An
80961d2da6 feat(backends/python): use tempfile.gettempdir() instead of hardcoded /tmp (#9629)
Closes #9601

Makes the temporary scratch paths in vllm, vllm-omni, tinygrad, and pocket-tts
backends configurable via the standard TMPDIR env var, instead of always writing
to /tmp. This is a one-line change per call site that calls tempfile.gettempdir()
for the directory and keeps the same filename suffix.

Users who run on systems with a small root partition (or want to relocate scratch
files to a larger volume) can now redirect these by setting TMPDIR
(e.g. TMPDIR=/data/tmp), without affecting the existing LOCALAI_GENERATED_CONTENT_PATH
or LOCALAI_UPLOAD_PATH options that already cover other temp paths.
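
A minimal sketch of the per-call-site change, assuming a helper named `scratch_path` (hypothetical; the backends inline the call). `tempfile.gettempdir()` consults `TMPDIR` first and falls back to platform defaults.

```python
import os
import tempfile


def scratch_path(filename: str) -> str:
    """Build a scratch file path under the configured temp directory."""
    # Before: os.path.join("/tmp", filename)
    return os.path.join(tempfile.gettempdir(), filename)
```

Note that `tempfile.gettempdir()` caches its result in `tempfile.tempdir` after the first call, so `TMPDIR` has to be set before the backend process starts for it to take effect.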

Files touched:
- backend/python/vllm/backend.py        (1 site: video base64 scratch)
- backend/python/tinygrad/backend.py    (1 site: image fallback dst)
- backend/python/pocket-tts/backend.py  (1 site: tts wav fallback dst)
- backend/python/vllm-omni/backend.py   (2 sites: video + audio scratch)
2026-05-01 10:56:24 +02:00
LocalAI [bot]
9c4c3f9d8f chore: ⬆️ Update ggml-org/llama.cpp to beb42fffa45eded44804a1fd4916146222371581 (#9624)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-01 02:02:56 +02:00
LocalAI [bot]
273416f54b chore: ⬆️ Update ikawrakow/ik_llama.cpp to a8aecbf15933295af96504f9a693998322185b5c (#9625)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-01 02:02:29 +02:00
Ettore Di Giacinto
c02a50f2ab feat(llama-cpp): bump to d775992 and adapt to spec params refactor (#9618)
Bumps backend/cpp/llama-cpp/Makefile LLAMA_VERSION from 665abc6 to
d775992, picking up upstream PR ggml-org/llama.cpp#22397 which splits
common_params_speculative into nested draft / ngram_simple / ngram_mod
sub-structs. Renames every grpc-server.cpp reference to match:

  speculative.mparams_dft.path  -> speculative.draft.mparams.path
  speculative.{n_max,n_min}     -> speculative.draft.{n_max,n_min}
  speculative.{p_min,p_split}   -> speculative.draft.{p_min,p_split}
  speculative.{n_gpu_layers,n_ctx} -> speculative.draft.{n_gpu_layers,n_ctx}
  speculative.ngram_size_n      -> speculative.ngram_simple.size_n
  speculative.ngram_size_m      -> speculative.ngram_simple.size_m
  speculative.ngram_min_hits    -> speculative.ngram_simple.min_hits

The "speculative.n_max" JSON key sent to the upstream server stays
unchanged — server-task.cpp still reads it and routes the value into
draft.n_max internally.

The turboquant fork (TheTom/llama-cpp-turboquant @ 11a241d) branched
before #22397 and still exposes the flat layout. Since turboquant
reuses the shared backend/cpp/llama-cpp/grpc-server.cpp, extend
patch-grpc-server.sh with an idempotent sed block that reverts the
ten field references back to the legacy flat names on the build copy
only — the original under backend/cpp/llama-cpp/ stays compiling
against vanilla upstream. Drop the block once the fork rebases.

ik-llama-cpp has its own grpc-server.cpp with no speculative refs
(0/2661 lines), so it is unaffected.

Validated locally with `make docker-build-llama-cpp` (avx, avx2,
avx512, fallback, grpc + rpc-server all built; image exported).


Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-30 08:44:43 +02:00
LocalAI [bot]
76971fb2aa chore: ⬆️ Update leejet/stable-diffusion.cpp to 3d6064b37ef4607917f8acf2ca8c8906d5087413 (#9617)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-30 08:43:42 +02:00
LocalAI [bot]
ebd9fcbe20 chore(model gallery): 🤖 add 1 new models via gallery agent (#9615)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-29 22:33:41 +02:00
Ettore Di Giacinto
091eda8d70 feat: react chat redesign (#9616)
* feat(react-ui): redesign chat — popover history, focus on send, density pass

Replace the persistent 260px conversation sidebar with a Cmd/Ctrl+K
popover (ChatsMenu) so the conversation owns the page. Once a chat has
at least one message we auto-collapse the global app rail and fade
non-essential header chrome; Esc gives the user back the full chrome
for the rest of the session.

Move Canvas mode and the MCP dropdown into the input wrapper as mode
chips — they describe what's armed for the next message and now live
where the user composes. The chat header drops to Chats · title ·
ModelSelector · overflow · settings, and an overflow menu carries
admin-only Manage mode along with Info / Edit / Export / Clear.

Density pass: tighter header (40px), smaller avatars with the assistant
left-border accent doing the work, 88% bubble width, modern
field-sizing on the textarea, 32px send/stop buttons.

Empty state now surfaces a Recent strip (top 4 non-empty chats) and a
Cmd+K hint, replacing the discoverability the persistent sidebar used
to provide.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

* feat(react-ui): chat input chips, slimmer menu, focus mode polish

Move Canvas mode and the MCP dropdown into the input wrapper as compact
mode chips — they describe what's armed for the next message and now
sit where the user composes. The MCP popover flips upward when anchored
to the input row so it stays on-screen.

Eliminate the chat header overflow ("…") menu entirely; relocate each
item to its semantic home so users don't have to remember a
miscellany drawer:

- Manage mode toggle → top of the Settings drawer, alongside the
  other sticky chat knobs. The shield next to the title still
  signals state at a glance.
- Model info / Edit config → small admin-only "ⓘ" button next to the
  ModelSelector; the existing model-info panel now hosts the Edit
  config link.
- Export as Markdown → per-row hover action in ChatsMenu, so it works
  for any chat (not just the active one).
- Clear chat history → destructive button at the bottom of the
  Settings drawer.

Make the Sidebar listen to its own `sidebar-collapse` event so the
chat's focus mode actually shrinks the rail (it previously only
flipped the layout class, leaving the sidebar element at full width
and overlapping the chat). Drop the focus-mode toast — the visual
shift is enough; the toast was noise.

Define `--color-text-tertiary` in both themes; without it metadata
text (recent strip timestamps and a few other sites) was inheriting
the platform default, which read as black on the dark surface.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

* fix(model/log-store): close merged channel exactly once; clean up Remove

Two latent races in BackendLogStore.Subscribe could panic under load
(distributed e2e test triggered "send on closed channel" at
backend_log_store.go:288):

1. The aggregated path closed the merged channel `ch` from two
   places — the fan-in waiter goroutine (after all source channels
   drained) and unsubscribe(). When unsubscribe ran while a fan-in
   goroutine was mid-flight on `ch <- line`, the close beat the send
   and the runtime panicked. Now `ch` is closed by exactly one
   goroutine: the waiter that observes all fan-in goroutines finish.
   unsubscribe() only closes the per-buffer source channels — the
   for-range in each fan-in goroutine then exits naturally and the
   waiter takes care of the merged close.

2. Remove() closed every subscriber channel but didn't delete the
   entries from the subscribers map, so a concurrent unsubscribe()
   would call close() again on the already-closed channel
   ("close of closed channel"). Clear the map entry while closing.

Add a regression test that hammers AppendLine concurrently with
Subscribe + unsubscribe + Remove; the race detector catches both
classes of regression.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

* test(model/log-store): port backend log store tests to ginkgo

Bring backend_log_store_test.go in line with the rest of pkg/model
(loader_test, watchdog_test, store_test): same external test package
(`model_test`), same ginkgo + gomega imports, same Describe/It
nesting around the public API. Behaviour is unchanged — the four
existing scenarios plus the unsubscribe race regression all run as
specs under the existing `TestModel` suite.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-29 22:33:26 +02:00
Ettore Di Giacinto
fe6eb57082 feat(vibevoice-cpp): add purego TTS+ASR backend (#9610)
* feat(vibevoice-cpp): add purego TTS+ASR backend

Wire up Microsoft VibeVoice via the vibevoice.cpp C ABI as a new
purego-based Go backend that serves both Backend.TTS and
Backend.AudioTranscription from a single gRPC binary. Mirrors the
qwen3-tts-cpp / sherpa-onnx pattern so the variant matrix
(cpu/cuda12/cuda13/metal/rocm/sycl-f16/f32/vulkan/l4t) and the
e2e-backends gRPC harness reuse existing infrastructure.

- backend/go/vibevoice-cpp/ - Makefile, CMakeLists, purego shim, gRPC
  Backend with model-dir auto-detection, closed-loop TTS->ASR smoke test
- backend/index.yaml - &vibevoicecpp meta + 18 image entries
- Makefile - .NOTPARALLEL, BACKEND_VIBEVOICE_CPP, docker-build wiring,
  test-extra-backend-vibevoice-cpp-{tts,transcription} e2e wrappers
- .github/workflows/backend.yml - matrix entries for all variants
- .github/workflows/test-extra.yml - per-backend smoke + 2 gRPC e2e jobs

* feat(vibevoice-cpp): drop hardcoded glob detection, add gallery entries

Refactor backend Load() to follow the standard Options[] convention
used by sherpa-onnx and the rest of the multi-role backends:
ModelFile is the primary gguf, supplementary paths come through
opts.Options[] as key=value (or key:value for Make-target compat),
resolved against opts.ModelPath. type=asr/tts decides the role of
ModelFile when neither tts_model nor asr_model is set explicitly.

Add gallery/index.yaml entries:
- vibevoice-cpp     - realtime 0.5B Q8_0 TTS + tokenizer + Carter voice
- vibevoice-cpp-asr - long-form ASR Q8_0 + tokenizer

Both pull from huggingface://mudler/vibevoice.cpp-models with sha256
verification. parameters.model + Options[] paths are siblings under
{models_dir} per the qwen3-tts-cpp convention.

Update Makefile e2e wrappers to pass BACKEND_TEST_OPTIONS comma+colon
style, and tighten the per-backend Go closed-loop test to use the
explicit Options API.

* fix(vibevoice-cpp): force whole-archive link so vv_capi_* exports survive

libvibevoice is a STATIC archive linked into the MODULE library.
Without --whole-archive (or -force_load on Apple, /WHOLEARCHIVE on
MSVC), the linker garbage-collects symbols not referenced from this
translation unit - which means dlopen+RegisterLibFunc panics with
'undefined symbol: vv_capi_load' at backend startup, since purego
looks them up by name and our cpp/govibevoicecpp.cpp doesn't call
them directly.

* test(vibevoice-cpp): rewrite suite with Ginkgo v2

Match the convention used by backend/go/sherpa-onnx/backend_test.go.
The suite now covers backend semantics that don't need purego (Locking,
empty-ModelFile rejection, TTS/ASR-without-loaded-model errors) on top
of the gRPC lifecycle specs (Health, Load, closed-loop TTS->ASR).
Model-dependent specs Skip() when VIBEVOICE_MODEL_DIR is unset, so
`go test ./backend/go/vibevoice-cpp/` is green on a clean checkout
and runs the heavyweight closed-loop spec when test.sh has staged
the bundle.

* fix(vibevoice-cpp): implement TTSStream + AudioTranscriptionStream

The gRPC server's stream handlers (pkg/grpc/server.go) spawn a
goroutine that ranges over a chan; the only thing closing that chan
is the backend's own *Stream method. With the default Base stub
returning 'unimplemented' and never touching the chan, the server
goroutine hangs forever and the client hits DeadlineExceeded - which
is exactly what the e2e harness saw in the test-extra-backend-vibevoice-cpp-tts
matrix run.

TTSStream synthesizes via vv_capi_tts to a tempfile, then emits a
streaming WAV header (chunk sizes 0xFFFFFFFF so HTTP clients can
start playback before the full PCM lands) followed by the PCM body
in 64 KB slices. The header + >=2 PCM frames satisfy the harness's
'expected >=2 chunks' assertion and give a real progressive stream.
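
The streaming-header idea can be sketched in a few lines. The backend is Go; this Python version is illustrative only. The 0xFFFFFFFF size fields and 64 KB slices come from this message, while the function names and the audio format defaults (24 kHz, mono, 16-bit PCM) are assumptions.

```python
import struct

STREAMING_SIZE = 0xFFFFFFFF  # "unknown length" marker in both size fields
CHUNK = 64 * 1024            # PCM slice size


def wav_stream_header(sample_rate=24000, channels=1, bits=16) -> bytes:
    """Build a 44-byte PCM WAV header whose sizes are left open-ended."""
    block_align = channels * bits // 8
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", STREAMING_SIZE, b"WAVE",
        b"fmt ", 16, 1, channels, sample_rate,
        sample_rate * block_align, block_align, bits,
        b"data", STREAMING_SIZE,
    )


def stream_wav(pcm: bytes, chunk: int = CHUNK, **fmt):
    """Yield the header first, then the PCM body in fixed-size slices."""
    yield wav_stream_header(**fmt)
    for i in range(0, len(pcm), chunk):
        yield pcm[i:i + chunk]
```

Because the header goes out before any PCM, a client can begin decoding as soon as the first slice arrives.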

AudioTranscriptionStream runs the offline transcription, emits each
segment as a delta, and closes with a final_result whose Text equals
the concatenated deltas (the harness asserts those match).

Two new Ginkgo specs guard the close-channel-on-error path so the
deadline-exceeded regression can't come back silently.

* fix(vibevoice-cpp): silence errcheck on cleanup paths

Lint flagged six unchecked Close()/Remove()/RemoveAll() calls along
purely-cleanup deferred paths. Wrap each in '_ = ...' (or a closure
for defers that take args) - matches what the rest of the LocalAI
backend/go/* tree already does for these callsites.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(vibevoice-cpp): closed-loop slot fill + modelRoot-relative path resolution

Two bugs the test-extra-backend-vibevoice-cpp-* CI matrix surfaced:

1. Closed-loop Load with ModelFile=tts.gguf + Options[asr_model=...] left
   v.ttsModel empty, because the default-fill block only ran when BOTH
   slots were empty. vv_capi_load then got tts="" + a voice and the
   C side rejected it with rc=-3 'TTS model required to load a voice'.
   Fix: ModelFile fills the *primary* role-slot (decided by 'type=' in
   Options, defaulting to tts) independently of the secondary, so
   ModelFile + asr_model resolves to both.

2. resolvePath stat'd CWD before falling back to relTo. With LocalAI
   launched from a directory that happens to contain a same-named
   file, supplementary Options[] paths could leak away from the
   models dir. Drop the CWD probe entirely - relative paths now
   *always* join onto opts.ModelPath (the gallery convention).

New Ginkgo coverage:
  * 'ModelFile slot resolution' (4 specs) - asr_model+ModelFile, type=asr,
    explicit tts_model override, key:value variant.
  * 'resolvePath (relative-to-modelRoot)' (5 specs) - join, abs passthrough,
    empty input, empty relTo, and the CWD-trap regression test.
  * 'Load resolves relative Options paths against opts.ModelPath' - end-
    to-end gallery layout round-trip.

Verified locally: 19/19 specs pass (with model bundle, including the
closed-loop TTS->ASR; without bundle, 17 pass + 2 model-dependent skip).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(vibevoice-cpp): use gallery convention in closed-loop spec

The 'loads the realtime TTS model' / closed-loop specs were passing
already-prefixed paths into Options[]:

    Options: ['tokenizer=' + filepath.Join(modelDir, 'tokenizer.gguf')]

Combined with no ModelPath set on the request, the backend's
modelRoot fell back to filepath.Dir(ModelFile) = modelDir, then
resolvePath joined the prefixed Options path on top of it -
producing 'vibevoice-models/vibevoice-models/tokenizer.gguf' when
the CI's VIBEVOICE_MODEL_DIR is the relative './vibevoice-models'.

The fix is to mirror the gallery contract LocalAI core actually
sends in production: ModelPath is the models root (absolute),
ModelFile is a name *under* it, every Options[] path is relative
to ModelPath. Uses filepath.Base() to get bare filenames.

Verified locally with both VIBEVOICE_MODEL_DIR=/tmp/vv-bundle (abs)
and VIBEVOICE_MODEL_DIR=vibevoice-models (the relative shape that
broke CI). Both: 19/19 specs pass, ~55-60s.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(vibevoice-cpp): switch ASR to Q4_K + bump transcription timeout

The Q8_0 ASR gguf is ~14 GB - too big to fit alongside the runner
image, the docker build cache, and the test artifacts on a free
ubuntu-latest GHA runner; 'test-extra-backend-vibevoice-cpp-transcription'
was getting SIGTERM'd at 90 min before the model could finish loading.

Switch to Q4_K (~10 GB on disk, slightly faster CPU decode) for:
  * the e2e harness Make target
  * the gallery 'vibevoice-cpp-asr' entry (parameters + files block)
  * the per-backend test.sh auto-download list

Bump tests-vibevoice-cpp-grpc-transcription's timeout-minutes from
90 to 150 - even with Q4_K, the 30 s JFK clip on a CPU runner needs
runway above the previous 90 min cap.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(vibevoice-cpp): drop transcription gRPC e2e job - too heavy for free runners

The vibevoice ASR is a 7B-parameter model. Even on Q4_K (~10 GB on
disk) a single 30 s transcription saturates the per-test 30 min
timeout in the e2e-backends harness on a 4-core ubuntu-latest, and
the 10 GB download + Docker layer + working space leaves no headroom
on the runner's free disk. Two attempts in CI got SIGTERM'd at the
LoadModel boundary - the bottleneck isn't tunable from the workflow
side without a paid-tier runner.

The per-backend tests-vibevoice-cpp job already runs the same
AudioTranscription path via a closed-loop TTS->ASR Ginkgo spec - same
gRPC contract, same model, single process - so the standalone
tests-vibevoice-cpp-grpc-transcription job was redundant on top of
the disk/CPU pressure.

The Makefile target test-extra-backend-vibevoice-cpp-transcription
stays for local invocation on workstations that can afford it -
useful when developing the streaming codepaths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(vibevoice-cpp): restore transcription gRPC e2e on bigger-runner

Switch tests-vibevoice-cpp-grpc-transcription from ubuntu-latest to
the self-hosted 'bigger-runner' label that GPU image builds in
backend.yml use, plus the documented Free-disk-space prep step (purge
dotnet / ghc / android / CodeQL caches) the disabled vllm/sglang
entries in this file describe. That gives the 7B-param Q4_K ASR
model the disk + CPU runway it needs.

Keep timeout-minutes: 150 - even on a beefier runner the 30 s JFK
decode plus 10 GB download has to fit comfortably.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(vibevoice-cpp): apt-get install make on bigger-runner before transcription e2e

bigger-runner is a self-hosted bare runner without the standard
ubuntu image's preinstalled build tools, so the previous job died at
the very first command with 'make: command not found' (exit 127).
Add the Dependencies step that the disabled vllm/sglang entries in
this file already document - apt-get installs make + build-essential
+ curl + unzip + ca-certificates + git + tar before the make target
runs. Mirrors how every other 'runs-on: bigger-runner' entry in
backend.yml prepares the runner.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-29 22:22:14 +02:00
LocalAI [bot]
13fe37df89 chore(model gallery): 🤖 add 1 new models via gallery agent (#9611)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-29 09:06:22 +02:00
Richard Palethorpe
4916f8c880 feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563)
* feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map

LocalAI's vLLM backend wraps a small typed subset of vLLM's
AsyncEngineArgs (quantization, tensor_parallel_size, dtype, etc.).
Anything outside that subset -- pipeline/data/expert parallelism,
speculative_config, kv_transfer_config, all2all_backend, prefix
caching, chunked prefill, etc. -- requires a new protobuf field, a
Go struct field, an options.go line, and a backend.py mapping per
feature. That cadence is the bottleneck on shipping vLLM's
production feature set.

Add a generic `engine_args:` map on the model YAML that is
JSON-serialised into a new ModelOptions.EngineArgs proto field and
applied verbatim to AsyncEngineArgs at LoadModel time. Validation
is done by the Python backend via dataclasses.fields(); unknown
keys fail with the closest valid name as a hint.
dataclasses.replace() is used so vLLM's __post_init__ re-runs and
auto-converts dict values into nested config dataclasses
(CompilationConfig, AttentionConfig, ...). speculative_config and
kv_transfer_config flow through as dicts; vLLM converts them at
engine init.

Operators can now write:

  engine_args:
    data_parallel_size: 8
    enable_expert_parallel: true
    all2all_backend: deepep_low_latency
    speculative_config:
      method: deepseek_mtp
      num_speculative_tokens: 3
    kv_cache_dtype: fp8

without further proto/Go/Python plumbing per field.

Production defaults seeded by hooks_vllm.go: enable_prefix_caching
and enable_chunked_prefill default to true unless explicitly set.

Existing typed YAML fields (gpu_memory_utilization,
tensor_parallel_size, etc.) remain for back-compat; engine_args
overrides them when both are set.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(vllm): pin cublas13 to vLLM 0.20.0 cu130 wheel

vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't
load on a cu130 host. Switch the cublas13 build to vLLM's per-tag cu130
simple-index (https://wheels.vllm.ai/0.20.0/cu130/) and pin
vllm==0.20.0. The cu130-flavoured wheel ships libcudart.so.13 and
includes the DFlash speculative-decoding method that landed in 0.20.0.

cublas13 install gets --index-strategy=unsafe-best-match so uv consults
both the cu130 index and PyPI when resolving — PyPI also publishes
vllm==0.20.0, but with cu12 binaries that error at import time.

Verified: Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash loads and serves chat
completions on RTX 5070 Ti (sm_120, cu130).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* ci(vllm): bot job to bump cublas13 vLLM wheel pin

vLLM's cu130 wheel index URL is itself version-locked
(wheels.vllm.ai/<TAG>/cu130/, no /latest/ alias upstream), so a vLLM
bump means rewriting two values atomically — the URL segment and the
version constraint. bump_deps.sh handles git-sha-in-Makefile only;
add a sibling bump_vllm_wheel.sh and a matching workflow job that
mirrors the existing matrix's PR-creation pattern.

The bumper queries /releases/latest (which excludes prereleases),
strips the leading 'v', and seds both lines unconditionally. When the
file is already on the latest tag the rewrite is a no-op and
peter-evans/create-pull-request opens no PR.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* docs(vllm): document engine_args and speculative decoding

The new engine_args: map plumbs arbitrary AsyncEngineArgs through to
vLLM, but the public docs only covered the basic typed fields. Add a
short subsection in the vLLM section explaining the typed/generic
split and showing a worked DFlash speculative-decoding config, with
pointers to vLLM's SpeculativeConfig reference and z-lab's drafter
collection.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-29 00:49:28 +02:00
LocalAI [bot]
55afda22e3 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 453a027c17e4d63a7f16b871197a396240a65138 (#9608)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-29 00:18:19 +02:00
LocalAI [bot]
1fe3558ec6 feat(swagger): update swagger (#9607)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-29 00:18:02 +02:00
Ettore Di Giacinto
e370318bd7 fix(vllm): seed pybind11 for fastsafetensors build under --no-build-isolation
fastsafetensors==0.3 (transitive dep of vllm) imports pybind11 in
setup.py without declaring it in build-system.requires. With
--no-build-isolation it has to already exist in the venv, otherwise the
wheel build fails with ModuleNotFoundError on arm64 L4T CUDA 13 (and
any other profile that picks up vllm 0.20.0).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-28 20:08:26 +00:00
Richard Palethorpe
4443250756 chore: add golangci-lint with new-from-merge-base baseline (#9603)
* chore: add golangci-lint with new-from-merge-base baseline

Configure golangci-lint v2 with the standard linter set (errcheck, govet,
ineffassign, unused) plus forbidigo, which enforces the Ginkgo/Gomega-only
test convention from .agents/coding-style.md by rejecting stdlib testing
calls (t.Errorf, t.Fatalf, t.Run, ...). staticcheck is disabled — the
codebase has many pre-existing QF-style suggestions not worth gating on.

issues.new-from-merge-base = master makes the lint job a gate for new
issues only; the ~1300 pre-existing baseline stays visible via
'make lint-all' for incremental cleanup. CI runs 'make lint'.

Backends needing C/C++ headers we don't install in the lint runner are
excluded via a deny list in the Makefile (backend/go/{piper,silero-vad,
llm}, cmd/launcher). Discovery still flows through 'go list ./...', so
new packages are scanned automatically.

To make backend/go/{sam3-cpp,stablediffusion-ggml,whisper} typecheckable,
move their .cpp/.h sources into cpp/ subdirs (matching qwen3-tts-cpp /
acestep-cpp). Without this 'go list' rejects the package because Go does
not allow .cpp alongside .go without cgo.

Fix two real bugs found by lint in tests/integration/ (run only via
'make test-stores', not default CI): a stale zerolog reference left over
from the slog migration (c37785b7) and an unused 'os' import.

Assisted-by: Claude Code:Opus 4.7 (1M) [Bash] [Read] [Edit] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* ci(lint): generate proto sources and fetch full history

The lint job was failing for two reasons:

- pkg/grpc/proto/*.go is generated, not checked in. Several packages
  import it, so without 'make protogen-go' typecheck fails project-wide
  with "no required module provides package github.com/mudler/LocalAI/
  pkg/grpc/proto".

- golangci-lint's new-from-merge-base needs to git-merge-base the PR
  against master, but actions/checkout's default shallow clone doesn't
  fetch master. fetch-depth: 0 brings full history; the config now
  references origin/master (the remote-tracking branch that survives
  the shallow checkout) instead of bare master (which doesn't exist
  locally after checkout).

Assisted-by: Claude Code:Opus 4.7 (1M) [Bash] [Read] [Edit] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* ci(lint): stub react-ui/dist for go:embed glob

core/http/app.go has //go:embed react-ui/dist/*. The glob must match at
least one non-hidden entry or typecheck fails the whole core/http
package. We don't need the real React bundle to lint Go code, so just
touch an empty index.html to satisfy the embed.

Assisted-by: Claude Code:Opus 4.7 (1M) [Bash] [Read] [Edit] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-28 22:07:44 +02:00
Ettore Di Giacinto
bcef72b9c1 feat: localai assistant chat modality (#9602)
* fix(tests): inline model_test fixtures after tests/models_fixtures removal

The previous reorg removed tests/models_fixtures/ but core/config/model_test.go
still read CONFIG_FILE/MODELS_PATH env vars pointing into that directory, so
`make test` failed with "open : no such file or directory" on the readConfigFile
spec (the suite ran with --fail-fast and bailed before openresponses_test).

Inline the YAMLs (config/embeddings/grpc/rwkv/whisper) directly into the test
file, materialise them into a per-test tmpdir via BeforeEach, and drop the
env-var lookups. The test no longer depends on Makefile plumbing.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Edit] [Write] [Bash]

* refactor(modeladmin): extract model-admin helpers into a service package

Lift the bodies of EditModelEndpoint, PatchConfigEndpoint,
ToggleStateModelEndpoint, TogglePinnedModelEndpoint and
VRAMEstimateEndpoint into core/services/modeladmin so the same logic can
be called by non-HTTP clients (notably the in-process MCP server that
backs the LocalAI Assistant chat modality, landing in a follow-up commit).

The HTTP handlers shrink to thin shells that parse echo inputs, call the
matching helper, map typed errors (ErrNotFound, ErrConflict,
ErrPathNotTrusted, ErrBadAction, ...) to the existing HTTP status codes,
and render the existing response shapes. No REST-surface behaviour change;
the existing localai endpoint tests cover the regression net.

Adds focused unit tests for each helper against tmp-dir-backed
ModelConfigLoader fixtures (deep-merge patch, rename + conflict, path
separator guard, toggle/pin enable/disable, sync callback).

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(assistant): LocalAI Assistant chat modality with in-memory MCP server

Adds a chat modality, admin-only, that wires the chat session to an
in-memory MCP server exposing LocalAI's own admin/management surface as
tools. An admin can install models, manage backends, edit configs and
check status by chatting; the LLM calls tools like gallery_search,
install_model, import_model_uri, list_installed_models, edit_model_config
and surfaces the results.

Same Go package powers two modes:

  pkg/mcp/localaitools/

    NewServer(client, opts) builds an MCP server that registers the
    19-tool admin catalog. The LocalAIClient interface has two impls:

    - inproc.Client — calls services directly (no HTTP loopback,
      no synthetic admin API key). Used in-process by the chat handler.
    - httpapi.Client — calls the LocalAI REST API. Used by the new
      `local-ai mcp-server --target=…` subcommand to control a remote
      LocalAI from a stdio MCP host.

    Tools and their embedded skill prompts are agnostic to which client
    backs them. Skill prompts are markdown files under prompts/, embedded
    via go:embed and assembled into the system prompt at server init.

Wiring:

  - core/http/endpoints/mcp/localai_assistant.go — process-wide holder
    that spins up the in-memory MCP server once at Application start
    using paired net.Pipe transports, then reuses LocalToolExecutor
    (no fork) for every chat request that opts in.

  - core/http/endpoints/openai/chat.go — small branch ahead of the
    existing MCP block: when metadata.localai_assistant=true,
    defense-in-depth admin check + executor swap + system-prompt
    injection. All downstream tool dispatch is unchanged.

  - core/http/auth/{permissions,features}.go — adds
    FeatureLocalAIAssistant; gating happens at the chat handler entry
    plus admin-only `/api/settings`.

  - core/cli/{run.go,cli.go,mcp_server.go} —
    LOCALAI_DISABLE_ASSISTANT flag (runtime-toggleable via Settings, no
    restart), plus `local-ai mcp-server` stdio subcommand.

  - core/config/runtime_settings.go — `localai_assistant_enabled`
    runtime setting; the chat handler reads `DisableLocalAIAssistant`
    live at request entry.

UI:

  - Home.jsx — prominent self-explanatory CTA card on first run
    ("Manage LocalAI by chatting"); collapses to a compact
    "Manage by chat" button in the quick-links row once used,
    persisted via localStorage.
  - Chat.jsx — admin-only "Manage" toggle in the chat header,
    "Manage mode" badge, dedicated empty-state copy, starter chips.
  - Settings.jsx — "LocalAI Assistant" section with the runtime
    enable toggle.
  - useChat.js — `localaiAssistant` flag on the chat schema; injects
    `metadata.localai_assistant=true` on requests when active.

Distributed mode: the in-memory MCP server lives only on the head node;
inproc.Client wraps already-distributed-aware services so installs
propagate to workers via the existing GalleryService machinery.

Documentation: `.agents/localai-assistant-mcp.md` is the contributor
contract — when adding an admin REST endpoint, also add a LocalAIClient
method, an inproc + httpapi impl, a tool registration, and a skill
prompt update; the AGENTS.md index links to it.

Out of scope (follow-ups): per-tool RBAC granularity for non-admin
read-only access; streaming mcp_tool_progress for long installs;
React Vitest rig for the UI changes.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(assistant): extract tool/capability/MiB/server-name constants

The MCP tool surface, capability tag set, server-name default, and the
chat-handler metadata key were repeated as bare string literals across
seven files. Renaming any one required hand-editing every call site and
risked code/test/prompt drift.

This pulls them into typed constants:

- pkg/mcp/localaitools/tools.go — Tool* constants for the 19 MCP tools,
  plus DefaultServerName.
- pkg/mcp/localaitools/capability.go — typed Capability + constants for
  the capability tag set the LLM passes to list_installed_models. The
  type rides through LocalAIClient.ListInstalledModels and replaces the
  triplet of "embed"/"embedding"/"embeddings" with the single
  CapabilityEmbeddings.
- pkg/mcp/localaitools/inproc/client.go — bytesPerMiB constant for the
  VRAMEstimate byte→MB conversion.
- core/http/endpoints/mcp/tools.go — MetadataKeyLocalAIAssistant for the
  "localai_assistant" request-metadata key consumed by the chat handler.

Tool registrations, the test catalog, the dispatch table, the validation
fixtures, and the fake/stub clients all reference the constants. The
embedded skill prompts under prompts/ keep their bare strings (go:embed
markdown can't import Go constants); the existing
TestPromptsContainSafetyAnchors guards the alignment.

No behaviour change. All tests pass with -race.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(modeladmin): typed Action for ToggleState/TogglePinned

The toggle/pin verbs were bare strings everywhere — handler signatures,
service implementations, MCP tool args, the fake/stub clients, the
inproc and httpapi LocalAIClient impls, plus 4 test files. A typo in
any caller silently fell through to the runtime "must be 'enable' or
'disable'" check.

Introduce core/services/modeladmin.Action (string alias) with
ActionEnable, ActionDisable, ActionPin, ActionUnpin and a small Valid
helper. The compiler now catches mismatches at every boundary; renames
ripple through one source of truth.
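
For illustration, a typed action along these lines might look as follows. This is a sketch, not the modeladmin source: it uses a defined string type, makes Valid a method, and assumes the string values for pin and unpin (only 'enable' and 'disable' appear in this message). toggleState is a hypothetical caller.

```go
package main

import "fmt"

// Action is a defined string type, so a plain string variable can no longer
// be passed where an Action is expected without an explicit conversion.
type Action string

const (
	ActionEnable  Action = "enable"
	ActionDisable Action = "disable"
	ActionPin     Action = "pin"   // value assumed for illustration
	ActionUnpin   Action = "unpin" // value assumed for illustration
)

// Valid reports whether a is one of the known actions. It still matters for
// values converted from untrusted input, such as a URL path segment.
func (a Action) Valid() bool {
	switch a {
	case ActionEnable, ActionDisable, ActionPin, ActionUnpin:
		return true
	}
	return false
}

// toggleState is a hypothetical service method taking the typed action.
func toggleState(model string, a Action) error {
	if !a.Valid() {
		return fmt.Errorf("invalid action %q for model %q", a, model)
	}
	return nil
}

func main() {
	fmt.Println(toggleState("example-model", ActionEnable))
	fmt.Println(toggleState("example-model", Action("enabel")))
}
```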

LocalAIClient.ToggleModelState/Pinned signatures change to take
modeladmin.Action. The package is brand-new and unreleased so this is
a free public-API tightening.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(assistant): respect ctx cancellation on gallery channel sends

InstallModel, DeleteModel, ImportModelURI, InstallBackend and
UpgradeBackend all pushed onto galleryop channels with bare sends. If the
worker was paused or the buffer full, the chat-handler goroutine blocked
forever — the LLM kept polling and the request leaked.

Wrap the five sends in a sendModelOp/sendBackendOp helper that selects
on ctx.Done() so a cancelled chat completion surfaces context.Canceled
back to the LLM instead of hanging.

Adds inproc/client_test.go with a pre-cancelled-ctx regression test on
InstallModel; the helpers are shared so the same guarantee covers the
other four call sites.

Assisted-by: Claude:claude-opus-4-7 [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(assistant): graceful shutdown for in-memory holder and stdio CLI

Two related leaks:

- Application.start() built the LocalAIAssistantHolder but never wired
  Close() into the graceful-termination chain — the in-memory MCP
  transport pair stayed alive until process exit, and the goroutines
  behind net.Pipe() didn't drain. Hook into the existing
  signals.RegisterGracefulTerminationHandler chain (same pattern as
  core/http/endpoints/mcp/tools.go:770).

- core/cli/mcp_server.go ran srv.Run with context.Background(); a
  Ctrl-C from the host (Claude Desktop, mcphost, npx inspector) or a
  SIGTERM from process supervision left the stdio loop reading from a
  closed pipe. Switch to signal.NotifyContext to surface the signal
  through ctx and let srv.Run drain.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(assistant): typed HTTPError + propagate prompt walk error

The httpapi client detected "no such job" by substring-matching on the
error string ("404", "could not find") — brittle to status-code
formatting changes and to LocalAI fixing /models/jobs/:uuid to return a
proper 404. Replace with a typed *HTTPError whose Is() method honours
errors.Is(err, ErrHTTPNotFound). The 500-with-"could not find" branch
stays as a transitional fallback documented in Is().

Same change covers ListNodes' 404 fallback for the /api/nodes endpoint.

Adds httptest tests for both 404 and the legacy 500 path, plus a
direct errors.Is exposure test so external callers (the standalone
stdio CLI host) can match without re-string-parsing.

Also tightens prompts.SystemPrompt: panic when fs.WalkDir on the
embedded FS fails. The only realistic cause is a build-time //go:embed
misconfiguration; serving an empty system prompt to the LLM is much
worse than crashing init. TestSystemPromptIncludesAllEmbeddedFiles
catches regressions in CI.

Assisted-by: Claude:claude-opus-4-7 [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(modeladmin): atomic writes for model config files

The five sites that wrote model YAML used os.WriteFile, which opens
with O_TRUNC|O_WRONLY|O_CREATE. A crash mid-write left the destination
truncated and the model unloadable until manual repair. Pre-existing
behaviour inherited from the original endpoint handlers — fix once now
that there's a single helper.

Adds writeFileAtomic: writes to a sibling temp file, chmods, syncs via
Close(), then os.Rename. Same-directory temp keeps the rename atomic on
the same filesystem; cleanup runs on every error path so stray temps
don't accumulate. No new dependency.

Applied to:
- ConfigService.PatchConfig
- ConfigService.EditYAML (both rename and in-place branches)
- mutateYAMLBoolFlag (drives ToggleState + TogglePinned)

atomic_test.go covers the happy path plus a read-only-dir failure case
that asserts the original file is preserved (skipped on Windows where
the chmod trick is POSIX-specific).

Assisted-by: Claude:claude-opus-4-7 [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(assistant): prune dead code, mark stub, document conventions

Three small cleanups landing together:

- Drop the unused errNotImplemented sentinel from inproc/client.go.
  All five methods that used to return it are wired to modeladmin
  helpers since the Phase B commit; the package var is dead.

- Annotate httpapi.Client.GetModelConfig as a known stub. LocalAI's
  /models/edit/:name returns rendered HTML, not JSON, so the standalone
  CLI's get_model_config tool surfaces a clear error to the LLM. A
  future JSON-only /api/models/config-yaml/:name endpoint is tracked in
  the agent contract; FIXME points at it.

- Extend `.agents/localai-assistant-mcp.md` with a "Code conventions"
  section that documents the audit-driven rules: tool/Capability/Action
  constants, errors.Is over substring matching, ctx-aware channel
  sends, atomic writes, and graceful shutdown. Refresh the file map so
  it lists tools.go and capability.go and drops the removed
  tools_bootstrap.go.

The tools_models.go diff is a comment-only change explaining why the
ModelName empty-string check stays at the tool layer (consistency
across LocalAIClient implementations, since the SDK schema validator
only enforces presence, not non-empty).

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(assistant): convert test files to ginkgo + gomega

The repo convention (per core/http/endpoints/localai/*_test.go,
core/gallery/**, etc.) is Ginkgo v2 with Gomega assertions. The tests I
introduced for the assistant feature used vanilla testing.T, which made
them stand out and stripped the BDD structure the rest of the suite
relies on.

Convert every test file in the assistant scope to Ginkgo:

  pkg/mcp/localaitools/
    dto_test.go            — Describe("DTOs round-trip through JSON")
    prompts_test.go        — Describe("SystemPrompt assembler")
    server_test.go         — Describe("Server tool catalog"),
                              Describe("Tool dispatch"),
                              Describe("Tool error surfacing"),
                              Describe("Argument validation"),
                              Describe("Concurrent tool calls")
    parity_test.go         — Describe("LocalAIClient parity"),
                              hosts the suite's single RunSpecs (the file
                              is package localaitools_test so it can
                              import httpapi without an import cycle;
                              Ginkgo aggregates Describes from both the
                              internal and external test packages into
                              one run).
    httpapi/client_test.go — Describe("httpapi.Client against the
                              LocalAI admin REST surface"),
                              Describe("ErrHTTPNotFound"),
                              Describe("Bearer token")
    inproc/client_test.go  — Describe("inproc.Client cancellation")

  core/services/modeladmin/
    config_test.go         — Describe("ConfigService") with sub-Describes
                              for GetConfig, PatchConfig, EditYAML
    state_test.go          — Describe("ConfigService.ToggleState")
    pinned_test.go         — Describe("ConfigService.TogglePinned")
    atomic_test.go         — Describe("writeFileAtomic")

  core/http/endpoints/mcp/
    localai_assistant_test.go — Describe("LocalAIAssistantHolder")

Each package gets a `*_suite_test.go` with the standard
`RegisterFailHandler(Fail) + RunSpecs(t, "...")` boilerplate. Helpers
that previously took *testing.T (newTestService, writeModelYAML,
readMap, sortedStrings, sortGalleries, etc.) drop the *T receiver and
use Gomega Expectations directly. tmp dirs come from GinkgoT().TempDir().

No semantic change to test coverage — every original assertion has a
direct Gomega counterpart. All suites pass with -race.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test+docs(assistant): drift detector for Tool ↔ REST route mapping

Honest gap from the audit: the parity_test.go suite only checks four
methods, and uses the same httpapi.Client for both sides — it asserts
stability of the DTO shapes, not equivalence between in-process and
HTTP. If a contributor adds an admin REST endpoint without an MCP tool,
or a tool without a matching httpapi route, both surfaces silently
diverge.

Add a coverage test plus stronger docs:

- pkg/mcp/localaitools/coverage_test.go introduces a hand-maintained
  toolToHTTPRoute map: every Tool* constant must list the REST endpoint
  the httpapi.Client hits (or "(none)" with a documented reason). Two
  Ginkgo specs assert the map and the published catalog stay in sync —
  one fails when a Tool is added without a route entry, the other fails
  when a route entry references a tool that no longer exists. Verified
  by removing the ToolDeleteModel entry locally; the test fired with a
  clear message pointing the contributor at the file.

  Deliberate non-test: we don't enumerate live admin REST routes from
  here. Walking the route registry requires booting Application;
  parsing core/http/routes/localai.go is brittle. The "new admin REST
  endpoint → MCP tool" direction stays a PR checklist item — see below.

- AGENTS.md gets a new Quick Reference bullet that calls out the rule
  and points at the test by name.

- .agents/api-endpoints-and-auth.md tightens the existing "Companion:
  MCP admin tool surface" subsection from "if useful, consider..." to
  "MUST be considered, with three concrete outcomes (tool added,
  deliberately skipped with documented reason, or forgot — which
  breaks the contract)". Adds a checklist item at the bottom of the
  file's authoritative checklist.
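
The two-way check that coverage_test.go performs can be sketched as a set comparison. This is an illustration only: the function name drift is hypothetical, the tool names are the ones quoted earlier in this log, and the route strings are placeholders rather than LocalAI's real REST paths.

```go
package main

import (
	"fmt"
	"sort"
)

// drift compares the published tool catalog with a hand-maintained
// tool-to-route map. missing lists tools with no route entry; stale lists
// route entries whose tool is no longer in the catalog.
func drift(catalog []string, routes map[string]string) (missing, stale []string) {
	known := make(map[string]bool, len(catalog))
	for _, tool := range catalog {
		known[tool] = true
		if _, ok := routes[tool]; !ok {
			missing = append(missing, tool)
		}
	}
	for tool := range routes {
		if !known[tool] {
			stale = append(stale, tool)
		}
	}
	sort.Strings(missing) // map iteration order is random; keep output stable
	sort.Strings(stale)
	return missing, stale
}

func main() {
	catalog := []string{"gallery_search", "install_model", "list_installed_models"}
	routes := map[string]string{
		"gallery_search": "GET /placeholder/search",   // placeholder route
		"install_model":  "POST /placeholder/install", // placeholder route
		"removed_tool":   "(none)",                    // entry for a tool that no longer exists
	}
	missing, stale := drift(catalog, routes)
	fmt.Println(missing, stale)
}
```

Each non-empty slice corresponds to one of the two failing specs described above.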

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(assistant): drop duplicate DTOs, surface canonical types

Audit feedback: localaitools/dto.go reinvented several types that already
existed in the codebase. Replace the duplicates with the canonical types
so the LLM-visible wire format stays aligned with the rest of LocalAI by
construction (no parallel structs to keep in sync).

Removed (and the canonical type now used by the LocalAIClient interface):

  localaitools.Gallery          → config.Gallery
  localaitools.GalleryModelHit  → gallery.Metadata
  localaitools.VRAMEstimate     → vram.EstimateResult

Tightened scope:

  localaitools.Backend          → kept, but reduced to {Name, Installed}.
                                  ListKnownBackends now returns
                                  []schema.KnownBackend (the canonical
                                  type already used by REST /backends/known).

Kept with documented rationale:

  localaitools.JobStatus       — galleryop.OpStatus has Error error which
                                 marshals to "{}". JobStatus is the
                                 JSON-friendly mirror.
  localaitools.Node            — nodes.BackendNode carries gorm internals
                                 + token hash; we expose only the
                                 LLM-relevant fields.
  ImportModelURIRequest/Response — schema.ImportModelRequest and
                                   GalleryResponse are wire-shaped, mine
                                   are LLM-shaped (BackendPreference flat,
                                   AmbiguousBackend exposed).

Side wins:

  - Drop bytesPerMiB; vram.EstimateResult already carries human-readable
    display strings (size_display, vram_display) the LLM uses directly.
  - Drop the handler-private vramEstimateRequest in
    core/http/endpoints/localai/vram.go and bind directly into
    modeladmin.VRAMRequest (now JSON-tagged).

Both clients pass through these types now where possible (e.g.
ListGalleries in inproc.Client is a one-liner returning
AppConfig.Galleries; httpapi.Client.GallerySearch decodes straight into
[]gallery.Metadata).

All tests green with -race.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(assistant): extract REST route paths into named constants

httpapi.Client had 18 bare-string path sites scattered across methods.
Pull them into pkg/mcp/localaitools/httpapi/routes.go: static paths as
package-private constants, dynamic paths as small builders that handle
url.PathEscape on segment values.

No behaviour change. Drops the now-unused net/url import from client.go
since path escaping moved into routes.go alongside the path it applies to.

Local-only by design: the server-side registrations in
core/http/routes/localai.go remain bare strings. Sharing constants across
the pkg/ ↔ core/ boundary would invert the layering today; the existing
Tool↔REST drift-detector in coverage_test.go is the safety net for that
direction.
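
As an illustration of the constant-plus-builder split, here is a short
sketch. The path value and the routeModel name are assumptions made
for the example, not the contents of routes.go:

```go
package main

import (
	"fmt"
	"net/url"
)

// Static path kept as a package-private constant (hypothetical value).
const routeModels = "/hypothetical/models"

// routeModel builds a dynamic path and escapes the segment value so a
// name containing "/" or spaces cannot change the route shape.
func routeModel(name string) string {
	return routeModels + "/" + url.PathEscape(name)
}

func main() {
	fmt.Println(routeModel("org/model name"))
}
```

Keeping the escape next to the path it applies to means callers cannot
forget it.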

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* docs(assistant): align with shipped UI and dropped bootstrap env vars

The LocalAI Assistant doc still described the older iteration:

- The in-chat toggle was renamed from "Admin" to "Manage" (the badge is
  now "Manage mode" and the home page exposes a "Manage by chat" CTA).
- LOCALAI_ASSISTANT_BOOTSTRAP_MODEL / --localai-assistant-bootstrap-model
  and the bootstrap_default_model tool were removed — admins pick a model
  from the existing selector instead, no env-var configuration required.
- The shipped tool catalog includes import_model_uri, which the doc
  didn't mention; bootstrap_default_model appeared but no longer exists.
- The Settings → LocalAI Assistant runtime toggle wasn't mentioned as the
  preferred way to disable without restart.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-28 19:29:27 +02:00
Ettore Di Giacinto
142919fc79 fix(tests): inline model_test fixtures after tests/models_fixtures removal
The previous reorg removed tests/models_fixtures/ but core/config/model_test.go
still read CONFIG_FILE/MODELS_PATH env vars pointing into that directory, so
`make test` failed with "open : no such file or directory" on the readConfigFile
spec (the suite ran with --fail-fast and bailed before openresponses_test).

Inline the YAMLs (config/embeddings/grpc/rwkv/whisper) directly into the test
file, materialise them into a per-test tmpdir via BeforeEach, and drop the
env-var lookups. The test no longer depends on Makefile plumbing.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Edit] [Write] [Bash]
2026-04-28 12:58:49 +00:00
dependabot[bot]
439471baec chore(deps): bump packaging from 24.1 to 26.2 in /backend/python/coqui (#9594)
Bumps [packaging](https://github.com/pypa/packaging) from 24.1 to 26.2.
- [Release notes](https://github.com/pypa/packaging/releases)
- [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pypa/packaging/compare/24.1...26.2)

---
updated-dependencies:
- dependency-name: packaging
  dependency-version: '26.2'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-28 08:44:53 +02:00
dependabot[bot]
eff4be6794 chore(deps): bump github.com/onsi/ginkgo/v2 from 2.28.1 to 2.28.2 (#9593)
Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.28.1 to 2.28.2.
- [Release notes](https://github.com/onsi/ginkgo/releases)
- [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md)
- [Commits](https://github.com/onsi/ginkgo/compare/v2.28.1...v2.28.2)

---
updated-dependencies:
- dependency-name: github.com/onsi/ginkgo/v2
  dependency-version: 2.28.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-28 08:44:41 +02:00
dependabot[bot]
f1ec30d646 chore(deps): bump github.com/testcontainers/testcontainers-go/modules/postgres from 0.41.0 to 0.42.0 (#9591)
chore(deps): bump github.com/testcontainers/testcontainers-go/modules/postgres

Bumps [github.com/testcontainers/testcontainers-go/modules/postgres](https://github.com/testcontainers/testcontainers-go) from 0.41.0 to 0.42.0.
- [Release notes](https://github.com/testcontainers/testcontainers-go/releases)
- [Commits](https://github.com/testcontainers/testcontainers-go/compare/v0.41.0...v0.42.0)

---
updated-dependencies:
- dependency-name: github.com/testcontainers/testcontainers-go/modules/postgres
  dependency-version: 0.42.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-28 08:44:27 +02:00
LocalAI [bot]
f3500223d7 chore: ⬆️ Update leejet/stable-diffusion.cpp to a81677f59c92d90343aebca51dfed7decf0a0cb0 (#9586)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-28 08:44:10 +02:00
LocalAI [bot]
b69bacfcdc chore: ⬆️ Update ikawrakow/ik_llama.cpp to d6f3e4e28fbf75e6181e6ea32e734de9ce9304fd (#9585)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-28 08:43:51 +02:00
LocalAI [bot]
8e50066fa2 chore: ⬆️ Update ggml-org/llama.cpp to 665abc609740d397d30c0d8ef4157dbf900bd1a3 (#9584)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-28 08:43:33 +02:00
Ettore Di Giacinto
a0317d9926 refactor(tests): split app_test.go, move real-backend coverage to e2e-backends
core/http/app_test.go had grown to 1495 lines exercising three concerns at
once: HTTP-layer integration, real-backend inference (llama-gguf, tts,
stablediffusion, transformers embeddings, whisper), and service logic that
already has unit-level coverage. Each PR paid for 6 backend builds plus
real-model downloads to satisfy a single suite.

Reorg per layer:

- app_test.go (1495 -> 1003 lines) drives the mock-backend binary only.
  Kept: auth, routing, gallery API, file:// import, /system, agent-jobs
  HTTP plumbing, config-file model loading. Deleted real-inference specs
  (llama-gguf chat, ggml completions/streaming, logprobs, logit_bias,
  transcription, embeddings, External-gRPC, Stores duplicate, Model gallery
  Context). Lifted Agent Jobs out of the deleted Stores Context.
- tests/e2e-backends/backend_test.go gains logprobs, logit_bias, and
  no-first-token-dup specs (the latter folded into PredictStream). Two
  new caps gate them so non-LLM backends opt out.
- tests/e2e-aio/e2e_test.go gains a streaming smoke under Context("text")
  to catch container-level streaming regressions.
- tests/models_fixtures/ removed; all fixtures referenced testmodel.ggml.
  app_test.go now writes per-Context inline mock-model YAMLs.

CI:

- test.yml + tests-e2e.yml gain paths-ignore (docs/, examples/, *.md,
  backend/) so docs and backend-only PRs skip them. test.yml drops the
  6-backend Build step plus TRANSFORMER_BACKEND/GO_TAGS=tts; tests-apple
  drops the llama-cpp-darwin build.
- New tests-aio.yml runs the AIO container nightly + on workflow_dispatch
  + master/tags. The tests-e2e-container job moved out of test.yml so PRs
  no longer pay AIO cost.
- New tests-llama-cpp-smoke job in test-extra.yml runs on every PR with
  no detect-changes gate; pulls quay.io/go-skynet/local-ai-backends:
  master-cpu-llama-cpp (no build on PR) and exercises predict/stream/
  logprobs/logit_bias against Qwen3-0.6B. This is the PR-acceptance
  real-backend gate after AIO moved to nightly. The path-gated heavy
  test-extra-backend-llama-cpp wrapper appends the same caps so it
  exercises the moved specs when the backend actually changes.

Makefile:

- Deleted test-models/testmodel.ggml (the wget chain), test-llama-gguf,
  test-tts, test-stablediffusion, test-realtime-models. test target
  drops --label-filter, HUGGINGFACE_GRPC, TRANSFORMER_BACKEND, TEST_DIR,
  FIXTURES, CONFIG_FILE, MODELS_PATH, BACKENDS_PATH; depends on
  build-mock-backend. test-stores keeps a focused entry point and depends
  on backends/local-store. clean-tests also clears the mock-backend
  binary.

Net per typical Go-side PR: ~25min (6 backend builds + tests + AIO) +
~8min e2e drops to ~5min mock-backend test + ~8min e2e + ~5-10min
llama-cpp-smoke (image pulled). Docs and backend-only PRs skip the
always-on workflows entirely.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Edit] [Write] [Bash]
2026-04-27 23:09:20 +00:00
Ettore Di Giacinto
3948b580d2 fix(distributed): worker stopBackend/isRunning resolve bare modelID to replica keys
PR #9583 changed the supervisor's process map key from `modelID` to
`modelID#replicaIndex`, but the NATS lifecycle handlers kept passing
the bare modelID:

* `backend.stop` (subscribeLifecycleEvents): `s.stopBackend(req.Backend)`
  → `s.processes["Qwen3.6-..."]` missed (actual key is "...#0") →
  silent no-op. Admin "Unload model" clicks released VRAM via
  model.unload but left the gRPC process alive on its old port.
  Subsequent chats hit installBackend, found the leftover process,
  reused its address — and the UI reported "no models loaded" while
  the model kept responding.

* `backend.delete` (subscribeLifecycleEvents): same map miss in
  `isRunning(req.Backend)` and `s.stopBackend(req.Backend)` — admin
  "Delete backend" deleted the binary while the process was still
  serving traffic.

Add `resolveProcessKeys(id)`: exact match if `id` is a full processKey
(stopAllBackends iterates the map and passes its own keys);
prefix-match if `id` is bare (NATS handlers); empty if `id` contains
`#` but doesn't match (no spurious fallback when the caller was
explicit). stopBackend and isRunning now call it; stopBackend gets a
new stopBackendExact helper for per-key cleanup.
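
The three cases can be sketched as follows. This is an illustration
under assumptions: the process map is simplified to a set of keys and
the function takes it as a parameter, which may differ from the real
method on the supervisor:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// resolveProcessKeys maps an ID to the process keys it refers to:
// an exact key matches itself, a bare model ID matches every replica
// key "<id>#<n>", and an explicit "#" key that does not exist matches
// nothing.
func resolveProcessKeys(processes map[string]struct{}, id string) []string {
	if _, ok := processes[id]; ok {
		return []string{id}
	}
	if strings.Contains(id, "#") {
		return nil // caller was explicit; do not fall back to a prefix match
	}
	var keys []string
	prefix := id + "#"
	for k := range processes {
		if strings.HasPrefix(k, prefix) {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys) // deterministic order for callers and tests
	return keys
}

func main() {
	procs := map[string]struct{}{"qwen#0": {}, "qwen#1": {}, "qwen-large#0": {}}
	fmt.Println(resolveProcessKeys(procs, "qwen"))
}
```

Matching on the prefix `id + "#"` rather than on `id` alone keeps
"qwen" from also selecting "qwen-large#0".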

TDD: regression test fails without the fix (resolveProcessKeys
doesn't exist; map lookup by bare name returns nothing). Tests pass
post-fix.

Reproduced live: registry row count was 0 for the model the user
"Unloaded", chat still served by the leftover worker process.
SmartRouter behavior is correct in itself — it falls through to
scheduleAndLoad when no row exists; the bug was that the leftover
process corrupted the install path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [Bash]
2026-04-27 21:43:15 +00:00
LocalAI [bot]
5efbe8405f feat(swagger): update swagger (#9587)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-27 23:28:03 +02:00
Ettore Di Giacinto
ea1df8945b fix(distributed): preserve UI-added node labels across worker re-register
The register endpoint called SetNodeLabels(req.Labels) — replace-all
semantics — so every worker re-register wiped every label not in the
worker's body. The bug existed since labels were introduced in
PR #9186 (Mar 31), but only triggered for workers that supplied
labels via --node-labels.

PR #9583 (the multi-replica refactor) added an auto-mirrored
`node.replica-slots` label to every worker's registration body, which
made `len(req.Labels) > 0` always true — turning a latent edge-case
bug into a universal one. Operators reported "labels assigned to
node do not persist": labels survived until the next worker restart,
then disappeared.

Fix: iterate req.Labels and call SetNodeLabel (upsert) for each
instead of SetNodeLabels (delete-then-recreate). Worker-managed
labels still refresh on re-register; UI-added labels survive.

Trade-off: an operator who removes a label from --node-labels won't
have it auto-removed from the DB on next register — they can clean it
via the UI. Acceptable, since the alternative (current behavior)
silently destroys operator state.
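
The difference between the two semantics, reduced to plain maps. The
helper names are hypothetical and the store is an in-memory map rather
than the database:

```go
package main

import "fmt"

// replaceAll models replace-all semantics: the stored set becomes exactly
// the incoming set, so labels added elsewhere are lost.
func replaceAll(incoming map[string]string) map[string]string {
	next := make(map[string]string, len(incoming))
	for k, v := range incoming {
		next[k] = v
	}
	return next
}

// upsertEach models a per-label upsert loop: incoming labels are created
// or refreshed, and labels absent from the request are left alone.
func upsertEach(stored, incoming map[string]string) {
	for k, v := range incoming {
		stored[k] = v
	}
}

func main() {
	// "rack" stands for a UI-added label; the other is worker-managed.
	stored := map[string]string{"rack": "a1", "node.replica-slots": "1"}
	upsertEach(stored, map[string]string{"node.replica-slots": "2"})
	fmt.Println(len(stored), stored["rack"], stored["node.replica-slots"])
}
```

The sketch also shows the trade-off noted above: with upsert, nothing
ever removes a key that the worker stops sending.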

Regression test added first (TDD): RegisterNodeEndpoint registers a
node, the test simulates a UI add via SetNodeLabel, then re-registers
with a different worker label set and asserts that the UI-added label
survives. Test fails against the broken code, passes against the fix.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [Bash]
2026-04-27 21:24:50 +00:00
Ettore Di Giacinto
3280b9a287 fix(distributed): per-replica backend logs (store aggregation + UI)
The multi-replica refactor (PR #9583) changed the worker's process key
from `modelID` to `modelID#replicaIndex`, but the BackendLogStore kept
the bare-modelID lookup. Result: every distributed deployment lost
backend logs in the Nodes UI — single-replica too, since even the
default capacity of 1 produces a `#0` suffix.

Two changes wired together:

* pkg/model: BackendLogStore.GetLines/Subscribe now treat a modelID
  without `#` as a model prefix and merge across all `modelID#N` replica
  buffers (timestamp-sorted for GetLines; fan-in for Subscribe). Calls
  with a full `modelID#N` key resolve exactly. ListModels strips
  replica suffixes and deduplicates so the listing surfaces one entry
  per loaded model.

* react-ui: per-replica log streams as the default. Loaded Models
  table disambiguates each row with a `rep N` pill (only when the node
  hosts >1 replica of a model). Each row's "View logs" link routes to
  the per-replica process key so operators see only that replica's
  output. The logs page renders the replica context as a chip in the
  title and surfaces a segmented control — `Replica 0 / 1 / … / All
  merged` — when the model has multiple replicas; the merged segment
  uses the bare-modelID URL (delegating to the store's prefix
  aggregation) for the side-by-side comparison case. Single-replica
  deployments see no extra UI.

Tests added first (TDD): the regression set in
backend_log_store_test.go reproduces the bug at the exact failure
point — GetLines/ListModels/Subscribe assertions all fail against the
broken code, all pass against the fix. TestSubscribe_PerReplicaFilter
pins the exact-key path so a future change can't silently break it.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [Skill:critique] [Skill:audit] [Skill:polish] [Skill:distill]
2026-04-27 20:55:24 +00:00
Ettore Di Giacinto
375bf1929d fix(ui): hide meta-dev backends in System → Backends Development toggle
The Manage view's flagsFor() short-circuited on b.IsMeta and returned
dev=false for every meta backend, so meta-dev entries
(e.g. llama-cpp-development, whisper-development, insightface-development)
leaked through the Development toggle in distributed mode and stayed
visible whether the toggle was on or off. The count chip even
under-reported because those rows were excluded from it.

Drop the IsMeta short-circuit and trust gallery enrichment for both
flags. Production metas (llama-cpp) are tagged isAlias=false /
isDevelopment=false in the gallery so they still pass both toggles;
meta-dev entries carry isDevelopment=true and now correctly hide
alongside concrete dev variants.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-27 20:38:20 +00:00
Ettore Di Giacinto
9a7f5e68bd ci(darwin): add native caches to backend_build_darwin
macOS runners can't use the registry-backed BuildKit cache (no Docker
daemon), so every darwin matrix run was paying full cost for brew
installs, Go module downloads, llama.cpp recompiles and Python wheel
resolution.

Wires actions/cache@v4 into the reusable workflow for four caches:

- Go modules + build cache (setup-go cache: true), shared across matrix
- Homebrew downloads + selected /opt/homebrew/Cellar entries, with
  HOMEBREW_NO_AUTO_UPDATE so restored Cellar paths stay stable
- ccache for the llama-cpp CMake variants, keyed on the pinned
  LLAMA_VERSION; CMAKE_*_COMPILER_LAUNCHER is exported via GITHUB_ENV
  so backend/cpp/llama-cpp/Makefile picks it up without script changes
- Python uv + pip wheel cache, keyed by backend + ISO week — same
  one-cold-rebuild-per-week cadence as the Linux DEPS_REFRESH

Read/write semantics match the existing BuildKit policy: every run
restores, only master/tag pushes save, so PRs can't pollute master's
warm cache.

Documents the new caches and the macOS-specific constraints in
.agents/ci-caching.md.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
2026-04-27 20:17:36 +00:00
Ettore Di Giacinto
6b63b47f61 feat(distributed): support multiple replicas of one model on the same node (#9583)
* feat(distributed): support multiple replicas of one model on the same node

The distributed scheduler implicitly assumed `(node_id, model_name)` was
unique, but the schema didn't enforce it and the worker keyed all gRPC
processes by model name alone. With `MinReplicas=2` against a single
worker, the reconciler "scaled up" every 30s but the registry never
advanced past 1 row — the worker re-loaded the model in-place every tick
until VRAM fragmented and the gRPC process died.

This change introduces multi-replica-per-node as a first-class concept,
with capacity-aware scheduling, a circuit breaker, and VRAM
soft-reservation. Operators can declare per-node capacity via the worker
flag `--max-replicas-per-model` (mirrored as auto-label
`node.replica-slots=N`) or override per-node from the UI.

* Schema: BackendNode gains MaxReplicasPerModel (default 1) and
  ReservedVRAM. NodeModel gains ReplicaIndex (composite with node_id +
  model_name). ModelSchedulingConfig gains UnsatisfiableUntil/Ticks for
  the reconciler circuit breaker.

* Registry: replica_index threaded through SetNodeModel, RemoveNodeModel,
  IncrementInFlight, DecrementInFlight, TouchNodeModel, GetNodeModel,
  SetNodeModelLoadInfo and the InFlightTrackingClient. New helpers:
  CountReplicasOnNode, NextFreeReplicaIndex (with ErrNoFreeSlot),
  RemoveAllNodeModelReplicas, FindNodesWithFreeSlot,
  ClusterCapacityForModel, ReserveVRAM/ReleaseVRAM (atomic UPDATE with
  ErrInsufficientVRAM), and the unsatisfiable-flag CRUD.

* Worker: processKey now `<modelID>#<replicaIndex>` so concurrent loads
  of the same model land on distinct ports. Adds CLI flag
  --max-replicas-per-model (env LOCALAI_MAX_REPLICAS_PER_MODEL, default 1)
  and emits the auto-label.

* Router: scheduleNewModel filters candidates by free slot, allocates the
  replica index, and soft-reserves VRAM before installing the backend.
  evictLRUAndFreeNode now deletes the targeted row by ID instead of all
  replicas of the model on the node — fixes a latent bug where evicting
  one replica orphaned its siblings.

* Reconciler: caps scale-up at ClusterCapacityForModel so a misconfig
  (MinReplicas > capacity) doesn't loop forever. After 3 consecutive
  ticks of capacity==0 it sets UnsatisfiableUntil for a 5m cooldown and
  emits a warning. ClearAllUnsatisfiable fires from Register,
  ApproveNode, SetNodeLabel(s), RemoveNodeLabel and
  UpdateMaxReplicasPerModel so a new node joining or label changes wake
  the reconciler immediately. scaleDownIdle removes highest-replica-index
  first to keep slots compact.

* Heartbeat resets reserved_vram to 0 — worker is the source of truth
  for actual free VRAM; the reservation is only for the in-tick race
  window between two scheduling decisions.

* Probe path (reconciler.probeLoadedModels and health.doCheckAll) now
  pass the row's replica_index to RemoveNodeModel so an unreachable
  replica doesn't orphan healthy siblings.

* Admin override: PUT /api/nodes/:id/max-replicas-per-model sets a
  sticky override (preserved across worker re-registration). DELETE
  clears the override so the worker's flag applies again on next
  register. Required because Kong defaults the worker flag to 1, so
  every worker restart would have silently reverted the UI value.

* React UI: always-visible slot badge on the node row (muted at default
  1, accented when >1); inline editor in the expanded drawer with
  pencil-to-edit, Save/Cancel, Esc/Enter, "(override)" indicator when
  the value is admin-set, and a "Reset" button to hand control back to
  the worker. Soft confirm when shrinking the cap below the count of
  loaded replicas. Scheduling rules table gets an "Unsatisfiable until
  HH:MM" status badge surfacing the cooldown.

* node.replica-slots filtered out of the labels strip on the row to
  avoid duplicating the slot badge.

23 new Ginkgo specs (registry, reconciler, inflight, health) cover:
multi-replica row independence, RemoveNodeModel of one replica
preserving siblings, NextFreeReplicaIndex slot allocation including
ErrNoFreeSlot, capacity-gated scale-up with circuit breaker tripping
and recovery on Register, scaleDownIdle ordering, ClusterCapacity
math, ReserveVRAM admission gating, Heartbeat reset, override survival
across worker re-registration, and ResetMaxReplicasPerModel handing
control back. Plus 8 stdlib tests for the worker processKey / CLI /
auto-label.

Closes the flap reproduced on Qwen3.6-35B against the nvidia-thor
worker (single 128 GiB node, MinReplicas=2): the reconciler now caps
the scale-up at the cluster's actual capacity instead of looping.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Read] [Edit] [Bash] [Skill:critique] [Skill:audit] [Skill:polish] [Skill:golang-testing]

* refactor(react-ui/nodes): tighten capacity editor copy + adopt ActionMenu for row actions

* Capacity editor hint trimmed from operator-doc-style ("Sourced from
  the worker's `--max-replicas-per-model` flag. Changing it here makes it
  a sticky admin override that survives worker restarts." → "Saved
  values stick across worker restarts.") and the override-state copy
  similarly compressed. The full mechanic is no longer needed in the UI
  — the override pill carries the meaning and the docs cover the rest.

* Node row actions migrated from an inline cluster of icon buttons
  (Drain / Resume / Trash) to the kebab ActionMenu used by /manage for
  per-row model actions, so dense Nodes tables stay clean. Approve
  stays as a prominent primary button — it's a stateful admission gate,
  not a routine action, and elevating it matches how /manage surfaces
  install-time decisions outside the menu.

* The expanded drawer's Labels section now filters node.replica-slots
  out of the editable label list. The label is owned by the Capacity
  editor above; surfacing it again as an editable label invited
  confusion (the Capacity save would clobber any direct edit).

Both backend and agent workers benefit — they share the row rendering
path, so the action menu and label filter apply to both.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] [Skill:critique] [Skill:audit] [Skill:polish]

* fix(react-ui/nodes): suppress slot badge on agent workers

Agent workers don't load models, so the per-node replica capacity is
inapplicable to them. Showing "1× slots" on agent rows was a tiny
inconsistency from the unified rendering path — gate the badge on
node_type !== 'agent' so it only appears on backend workers.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp]

* refactor(react-ui/nodes): distill expanded drawer + restyle scheduling form

The expanded node drawer used to stack five panels — slot badge,
filled capacity box, Loaded Models h4+empty-state, Installed Backends
h4+empty-state, Labels h4+chips+form — making routine inspections feel
like a control panel. The scheduling rule form wrapped its mode toggle
as two 50%-width filled buttons that competed visually with the actual
primary action.

* Drawer: collapse three rarely-touched config zones (Capacity,
  Backends, Labels) into one `<details>` "Manage" disclosure (closed by
  default) with small uppercase eyebrow labels for each zone instead of
  parallel h4 sub-headings. Loaded Models stays as the at-a-glance
  headline with a single-line empty hint instead of a boxed empty state.
  CapacityEditor renders flat (no filled background) — the Manage
  disclosure provides framing.

* Scheduling form: replace the chunky 50%-width button-tabs with the
  project's existing `.segmented` control (icon + label, sized to
  content). Mode hint becomes a single tied line below. Fields stack
  vertically with helper text under inputs and a hairline divider above
  the right-aligned Save / Cancel.

The empty drawer collapses from ~5 stacked sections (~280px tall) to
two lines (~80px). The scheduling form now reads as a designed dialog
instead of raw building blocks. Both surfaces now match the typographic
density and weight of the rest of the admin pages.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] [Skill:distill] [Skill:audit] [Skill:polish]

* feat(react-ui/nodes): replace scheduling form's model picker with searchable combobox

The native <select> made operators scroll through every gallery entry to
find a model name. The project already has SearchableModelSelect (used
in Studio/Talk/etc.) which combines free-text search with the gallery
list and accepts typed model names that aren't installed yet — useful
for pre-staging a scheduling rule before the node it'll run on has
finished bootstrapping.

Also drops the now-unused useModels import (the combobox manages the
gallery hook internally).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit]

* refactor(react-ui/nodes): consolidate key/value chip editor + add replica preset chips

The Nodes page was rendering the same key=value chip pattern in two
places with subtly different markup: the Labels editor in the expanded
drawer and (post-distill) the Node Selector input in the scheduling
form. The form's input was also a comma-separated string that operators
were getting wrong.

* Extract <KeyValueChips> as a fully controlled chip-builder. Parent
  owns the map and decides what onAdd/onRemove does — form state for the
  scheduling form, API calls for the live drawer Labels editor. Same
  visuals everywhere; one component to change when polish needs apply.

* Replace the comma-separated Node Selector text input with KeyValueChips.
  Operators were copying syntax from docs and missing commas; the chip
  vocabulary makes the key=value structure self-documenting.

* Add <ReplicaInput>: numeric input + quick-pick preset chips for Min/Max
  replicas. Picked over a slider because replica counts are exact specs
  derived from VRAM math (operator decision, not a fuzzy estimate). The
  chips give one-click access to common values (1/2/3/4 for Min,
  0=no-limit/2/4/8 for Max) without the slider's special-value problem
  (MaxReplicas=0 is categorical, not a position on a continuum).

* Drop the now-unused labelInputs state in the Nodes page (the inline
  label editor's per-node draft state lived there and is now owned by
  KeyValueChips).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [Skill:distill]

* test: fix CI fallout from multi-replica refactor (e2e/distributed + playwright)

Two breakages caught by CI that didn't surface in the local run:

* tests/e2e/distributed/*.go — multiple files used the pre-PR2 registry
  signatures for SetNodeModel / IncrementInFlight / DecrementInFlight /
  RemoveNodeModel / TouchNodeModel / GetNodeModel / SetNodeModelLoadInfo
  and one stale adapter.InstallBackend call in node_lifecycle_test.go.
  All updated to pass replicaIndex=0 — these tests don't exercise
  multi-replica behavior, they just need to compile against the new
  signatures. The chip-builder tests in core/services/nodes/ already
  cover the multi-replica logic.

* core/http/react-ui/e2e/nodes-per-node-backend-actions.spec.js — the
  drawer's distill refactor moved Backends inside a "Manage" <details>
  disclosure that's collapsed by default. The test helper expanded the
  node row but never opened Manage, so the per-node backend table was
  never in the DOM. Helper now clicks `.node-manage > summary` after
  expanding the row.

All 100 playwright tests pass locally; tests/e2e/distributed compiles
clean.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [Bash]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-27 21:20:05 +02:00
Ettore Di Giacinto
f4036fa83f ci(python-backends): add weekly DEPS_REFRESH cache-buster
The shared backend/Dockerfile.python ends in:
    RUN cd /${BACKEND} && PORTABLE_PYTHON=true make
which `pip install`s each backend's requirements*.txt. A scan of all 34
Python backends shows every single one ships at least some unpinned deps
(torch, transformers, vllm, diffusers, ...). With the registry cache now
enabled, that `make` layer's BuildKit hash depends only on Dockerfile
instructions + COPYed source — not on what pip resolves at runtime — so
a warm cache would freeze upstream versions indefinitely.

DEPS_REFRESH is an ARG declared right before that RUN. backend_build.yml
computes `date -u +%Y-W%V` (ISO week, e.g. `2026-W17`) and passes it as
a build-arg, so the install layer invalidates at most once per week and
re-resolves PyPI / nightly indexes. Within a week, builds stay warm.

Only Dockerfile.python is affected: Go (go.sum) and Rust (Cargo.lock)
already lock their deps, and the C++ backends pull gRPC at a pinned tag
and llama.cpp at a pinned commit.

Add .agents/ci-caching.md documenting the cache layout
(quay.io/go-skynet/ci-cache:cache<tag-suffix>), read/write semantics
(master writes, PRs read-only), DEPS_REFRESH semantics, and how to
manually evict tags. Index it from AGENTS.md (CLAUDE.md is a symlink).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7-1m
2026-04-27 14:21:11 +00:00
Ettore Di Giacinto
3810fe1a1e fix(distributed): worker container healthcheck always unhealthy
The Dockerfile's HEALTHCHECK probes http://localhost:8080/readyz, which
is the OpenAI API server port. When the same image runs as a worker, it
listens on the gRPC base port (50051) and serves HTTP file transfers on
the base port minus one (50050). Nothing listens on 8080, so Docker
always reports the container as unhealthy.

Add unauthenticated /readyz and /healthz endpoints to the worker's HTTP
file transfer server, and override HEALTHCHECK_ENDPOINT for worker-1 in
the distributed compose file. Disable the healthcheck for agent-worker
since it is NATS-only and exposes no HTTP server.
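
The probe endpoints could look roughly like the sketch below. It assumes a plain `net/http` mux; `newProbeMux` and `probe` are hypothetical helpers, and the real worker's routing and response bodies may differ.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newProbeMux registers unauthenticated /readyz and /healthz probes.
// Hypothetical helper: the real worker attaches its probes to the
// HTTP file transfer server instead of a standalone mux.
func newProbeMux() *http.ServeMux {
	mux := http.NewServeMux()
	ok := func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	}
	mux.HandleFunc("/readyz", ok)
	mux.HandleFunc("/healthz", ok)
	return mux
}

// probe exercises the mux in-process and returns the HTTP status code.
func probe(path string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	newProbeMux().ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(probe("/readyz"), probe("/healthz"), probe("/other"))
}
```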

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-27 13:51:57 +00:00
Ettore Di Giacinto
bdfa5e934a ci: switch image/backend build cache to a dedicated registry image
- Switch cache-from/cache-to in backend_build.yml and image_build.yml
  from the unused gha cache to type=registry pointing at
  quay.io/go-skynet/ci-cache:cache<tag-suffix>, mode=max with
  ignore-error=true. Master/tag builds populate their own
  per-matrix-entry cache; PR builds read-only.
- Drop the broken generate_grpc_cache.yaml cron. It targeted a `grpc`
  Dockerfile stage that was removed by b1fc5acd in July 2025, has been
  failing every night since, and never populated the gha cache. The new
  registry-cache scheme is self-warming, so no separate populator is
  needed.
- Remove the dead GRPC_VERSION / GRPC_BASE_IMAGE / GRPC_MAKEFLAGS
  build-args from image_build.yml and the orphan ARG GRPC_BASE_IMAGE in
  the root Dockerfile (the root Dockerfile no longer compiles gRPC; the
  source build now lives in backend/Dockerfile.{llama-cpp,
  ik-llama-cpp, turboquant} only and uses its own ARG defaults).
- Drop the unused grpc-base-image input from image_build.yml plus the
  matrix passthroughs in image.yml / image-pr.yml.
- Drop the unused GRPC_VERSION env in test.yml.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7-1m
2026-04-27 13:13:04 +00:00
Richard Palethorpe
deca6dbdad feat: Log backend exit code (#9581)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-27 14:19:18 +02:00
Ettore Di Giacinto
60549a8a60 feat(react-ui): page-width archetype system + mobile/tablet nav polish
Replace the universal max-width:1200px cap on .page with a four-tier
archetype system (narrow 760, medium 1080, default 1600, wide unbounded)
selected per page based on what its UX actually wants. Data/table pages
fill ultrawide displays; forms cap at reading width; tabbed feature
surfaces breathe.

Mobile/tablet:
- New 640/1024 breakpoint split. Tablets (640-1023) get a persistent
  52px icon rail; below 640 keeps the slide-off drawer.
- Drawer polish: body-scroll lock, Escape to close, focus moves into
  the drawer on open and back to the hamburger on close, aria-hidden
  + inert on main while open.
- Mobile top bar carries hamburger + theme toggle + account avatar
  (44x44 touch targets) so theme/account aren't trapped in the drawer.
- Page-level reflow on phones: page-header column-stacks, filter chips
  scroll horizontally, tables go edge-to-edge, OperationsBar overflows
  rather than wrapping. Honors prefers-reduced-motion.

Manage > Models: drop the toggle column; Enable/Disable joins the
per-row Actions menu alongside Stop/Pin/Edit/Logs/Delete for
consistency with the other action verbs.

Page-width tokens live in theme.css so future tuning is one line.
Removes 7 inline maxWidth workarounds from page roots.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude Code:claude-opus-4-7 [Edit] [Bash]
2026-04-27 11:51:29 +00:00
Ettore Di Giacinto
54728e292f feat(react-ui): split Manage backends toggle into Variants and Development
Meta backends are now always shown — they're the entries operators
configure against — and two independent toggles govern the noise around
them. "Variants" hides platform-specific concrete builds that a meta
backend aliases on the host (e.g. llama-cpp-cuda12-12.4). "Development"
hides pre-release `-development` builds. Each toggle shows the count of
items currently hidden in its category. The legacy `bm` URL flag is
honored on read so existing deep-links resolve to the same view they
used to.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-27 08:23:53 +00:00
Tai An
86fd62233f fix(gallery): correct Qwen3.5 typo in qwen3.5-27b-claude-4.6 model override (closes #9362) (#9580)
The overrides.parameters.model field referenced 'Qwen3.-27B-Claude-...' (missing the '5'), so model loads failed because the configured filename did not match the file actually downloaded by the entry's files: list ('Qwen3.5-27B-Claude-...').

Aligns the override filename with the files: entries and with the upstream HF repo (mradermacher/Qwen3.5-27B-...).
2026-04-27 09:24:00 +02:00
Alex Brick
41ed8ced70 [intel GPU support] Use latest oneapi-basekit image for Intel images to support b70 (in more places this time) (#9578)
Update additional intel base images
2026-04-27 09:18:57 +02:00
LocalAI [bot]
05e94bd9e7 chore: ⬆️ Update ggml-org/llama.cpp to f53577432541bb9edc1588c4ef45c66bf07e4468 (#9577)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-27 08:57:24 +02:00
Ettore Di Giacinto
8d124d080f feat(gallery): add whisper-development umbrella stanza
Mirrors the whisper capabilities map with -development variants so
clients can pull the master-tagged whisper.cpp backend via a single
platform-resolved name, matching the existing faster-whisper-development
and whisperx-development entries.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-26 23:04:27 +00:00
Ettore Di Giacinto
2da1a4d230 feat(distributed): per-node backend installation from the gallery
In distributed mode the Backends gallery used to fan every install out to
every worker — fine for auto-resolving (meta) backends like llama-cpp where
each node picks its own variant, but wrong for hardware-specific builds
like cpu-llama-cpp that would silently land on every GPU node.

Adds a node-targeted install path through the existing
POST /api/nodes/:id/backends/install plumbing, with two entry points:

- Backends gallery row gets a split-button in distributed mode. For
  auto-resolving backends the primary action stays "Install on all
  nodes" and the chevron menu opens the picker. For hardware-specific
  backends the primary action goes straight to the picker, with no
  fan-out path on the row.
- Nodes-page drawer gets a "+ Add backend" button that navigates to
  /app/backends?target=<node-id>; the gallery scopes itself to that node
  (banner, single per-row install button, Reinstall/Remove for already-
  installed). One gallery, two scopes — no second UI to maintain.

The picker (new NodeInstallPicker) shows a 3-state suitability column
(Compatible / Override / Installed), an auto-expanding variant override
disclosure that fires when selected nodes have no working GPU, parallel
per-node installs with inline status and Retry-failed-nodes, and a
mismatch confirm that names the consequence on the button itself.

A 409 fan-out guard on /api/backends/apply protects CLI/Terraform/script
users from the same footgun: hardware-specific installs in distributed
mode now return code "concrete_backend_requires_target" with a human-
readable error and a meta_alternative pointer.
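
A rough sketch of such a guard, under stated assumptions: only the 409 status, the `concrete_backend_requires_target` code and the `meta_alternative` field come from the text above. The struct layout, the `guardFanOut` signature and the error wording are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// conflictBody is an assumed response shape, not the real API type.
type conflictBody struct {
	Code            string `json:"code"`
	Error           string `json:"error"`
	MetaAlternative string `json:"meta_alternative,omitempty"`
}

// guardFanOut rejects untargeted installs of hardware-specific
// (concrete) backends in distributed mode. It reports whether the
// request was rejected.
func guardFanOut(w http.ResponseWriter, distributed, concrete bool, backend, metaAlt string) bool {
	if !distributed || !concrete {
		return false
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusConflict)
	_ = json.NewEncoder(w).Encode(conflictBody{
		Code:            "concrete_backend_requires_target",
		Error:           fmt.Sprintf("backend %q is hardware-specific; pick a target node", backend),
		MetaAlternative: metaAlt,
	})
	return true
}

// status runs the guard against an in-memory recorder.
func status(distributed, concrete bool) int {
	rec := httptest.NewRecorder()
	guardFanOut(rec, distributed, concrete, "cpu-llama-cpp", "llama-cpp")
	return rec.Code
}

func main() {
	fmt.Println(status(true, true), status(true, false), status(false, true))
}
```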

The gallery list payload now surfaces capabilities, metaBackendFor and
per-row nodes (NodeBackendRef) so the picker and the new Nodes column
have everything they need without re-walking the gallery client-side.

GODEBUG=netdns=go is set on the compose services because the cgo DNS
resolver follows the container's nsswitch.conf to host systemd-resolved
(127.0.0.53), unreachable from inside the container; the pure-Go
resolver reads /etc/resolv.conf directly and uses Docker's embedded DNS.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude Code:claude-opus-4-7[1m] [Edit] [Bash] [Read] [Write]
2026-04-26 22:05:18 +00:00
Ettore Di Giacinto
988430c850 test(react-ui): drive Manage page Backend logs link via the new kebab menu
Manage page row actions moved into ActionMenu in b336d9c6, so the
inline `<a title="Backend logs">` the e2e specs were asserting on no
longer exists. Open the row's kebab and assert against the menuitem.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7
2026-04-26 20:51:01 +00:00
Ettore Di Giacinto
b336d9c626 feat(react-ui): polish Manage page with kebab menus and gallery rows
Bring the System / Manage page up to the visual standard of the Install
gallery so installed models and backends stop reading like a debug dump.

- Unified ResourceRow anatomy (icon, name+description, badges, status,
  expandable detail) shared across both tabs.
- Gallery enrichment cross-references installed names against the gallery
  list endpoints to surface icons, descriptions, license, tags, and links
  with a graceful "no description" fallback for custom imports.
- Header summary with four StatCards (Models / Backends / Running /
  Updates) — clickable to switch tab + pre-set filter.
- Backends meta + development entries hidden by default; "Show meta &
  development" paired toggle in the FilterBar with hidden-count hint.
- Kebab (three-dot) ActionMenu replaces the inline button cluster on
  every row; restrained until hover, keyboard-navigable, danger items
  separated by a divider.
- Backend "Version" cell now falls back to short digest, OCI tag, or
  ocifile basename when no semver is set, instead of showing "—" for
  every OCI install. Detail panel exposes full Source URI + Digest.
- Drop redundant column headers ("Actions", "On") — kebabs and toggles
  carry their own affordance; screen readers still get a label.
- Inline System / User / Meta / Dev badges next to the backend name so
  the dedicated Type column doesn't reserve space for "USER" repeated.
- Tightened the spacing between the System Resources card and the
  StatCards so they no longer crowd the RAM bar.

Extracted StatCard and GalleryLoader from Nodes.jsx and Models.jsx into
shared components so the visual language is one source of truth.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude Code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
2026-04-26 20:33:49 +00:00
Ettore Di Giacinto
f384c64a91 fix(model-loader): also skip .ckpt, .zip, and .tag files when scanning models
The local model directory scan treats every non-skipped file as a model
config candidate. Sidecar artifacts that ship alongside checkpoints
(checkpoint blobs, downloaded archives, ggml-style tag files) were
slipping through and showing up as bogus models in the listing. Add
their extensions to the suffix-skip list.
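
The suffix check can be pictured as below. This is a sketch, not the loader's code: `skipSuffixes` holds only the three extensions named in this commit, and `isModelCandidate` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// skipSuffixes is an illustrative subset: only .ckpt, .zip and .tag
// are taken from the commit message; the real list is longer.
var skipSuffixes = []string{".ckpt", ".zip", ".tag"}

// isModelCandidate reports whether a file name should be considered
// a model config candidate during the directory scan.
func isModelCandidate(name string) bool {
	for _, suffix := range skipSuffixes {
		if strings.HasSuffix(name, suffix) {
			return false
		}
	}
	return true
}

func main() {
	for _, name := range []string{"sd15.ckpt", "weights.zip", "ggml.tag", "llama.yaml"} {
		fmt.Println(name, isModelCandidate(name))
	}
}
```

The comparison here is case-sensitive; whether the real scan normalizes case is not stated in the commit.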

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-04-26 19:37:53 +00:00
Ettore Di Giacinto
e9d8e92988 fix(react-ui): don't yank chat scroll to bottom while user is reading
The chat and agent-chat pages auto-scrolled to the bottom on every
streamed token. If the user scrolled up to re-read part of a response,
the next chunk pulled them back down — making long replies unreadable
while streaming.

Track a stickToBottomRef on each scroll event: if the user is within
80px of the bottom we keep auto-scrolling, otherwise we leave them
where they are. On chat switch we snap back to the bottom and re-pin.

Same fix applied to both Chat.jsx and AgentChat.jsx since they share
the same streaming pattern.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-04-26 19:35:39 +00:00
Ettore Di Giacinto
5b0196c7d0 fix(whisper): scrub invalid UTF-8 from segment text before protobuf marshal
whisper.cpp can emit bytes that are not valid UTF-8 — typically a
multibyte codepoint split across token boundaries. protobuf string
fields reject those at marshal time, which would surface as a
transcription failure. Run strings.ToValidUTF8 on the segment text before it leaves
the cgo boundary so the bad byte gets replaced with U+FFFD.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-04-26 19:35:39 +00:00
Ettore Di Giacinto
c8d63a1003 fix(react-ui): stop Manage page from blanking on auto-refresh; show real model use cases
- useModels.refetch now runs silently — distributed-mode 10s auto-refresh
  no longer flips loading=true and replaces the table with a spinner card.
- Manage Use Cases column derives badges from each model's actual
  capabilities (Chat / Image / TTS / Embeddings / etc.) instead of
  hardcoding a "Chat" link for every row.
- FilterBar right slot is right-aligned via margin-left:auto so the
  Update button lives at the end of the row, not next to the chips.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-04-26 19:35:39 +00:00
LocalAI [bot]
d9cb0d6133 chore: ⬆️ Update ggml-org/llama.cpp to dcad77cc3b0865153f486327064fb0320a57a476 (#9572)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-26 12:38:35 +02:00
LocalAI [bot]
f5c268deac chore: ⬆️ Update TheTom/llama-cpp-turboquant to 11a241d0db78a68e0a5b99fe6f36de6683100f6a (#9571)
⬆️ Update TheTom/llama-cpp-turboquant

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-26 12:38:25 +02:00
Tai An
8931a2ad31 fix(gallery): normalize inconsistent tag casing/plurals across gallery models (#9574)
- embeddings → embedding (6 models): aligns with the WebUI filter button
  defined in core/http/views/models.html ({ term: 'embedding', ... }), so
  models like nomic-embed-text-v1.5 now appear under the Embedding filter
- TTS → tts (5 models), ASR → asr (2 models): lowercase, per existing
  convention used by 161+ models
- CPU/Cpu → cpu (17 models), GPU → gpu (17 models): lowercase, per existing
  convention used by 666+ models
- dedupe duplicate tag entries on 3 models that already had repeated tags
  (gpt-oss-20b had gguf x2; arcee-ai/AFM-4.5B had gpu x2; one Qwen model
  had default x2)

Closes #9247
2026-04-26 08:33:38 +02:00
Ettore Di Giacinto
e16e758dff ci(backends): build cpu-whisperx and cpu-faster-whisper for linux/arm64 (#9573)
Extend the existing CPU build matrix entries to produce a multi-arch
manifest (linux/amd64,linux/arm64) at the same image tags. arm64
Linux hosts without an NVIDIA GPU report the "default" capability,
which already maps to cpu-whisperx / cpu-faster-whisper in
backend/index.yaml -- so the manifest list lets Docker pull the right
variant without any gallery changes.

Both stacks install cleanly under aarch64: torch (2.4.1/2.8.0),
faster-whisper, ctranslate2, whisperx, opencv-python and the
remaining deps all ship manylinux2014_aarch64 wheels, so no source
builds run under QEMU emulation.

Follows the same pattern already used by cpu-llama-cpp-quantization.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-26 08:30:03 +02:00
LocalAI [bot]
1c45227346 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 3a945af45d45936341a45bbf7deda56776a4af26 (#9570)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-26 08:26:37 +02:00
Ettore Di Giacinto
fbe4f0a99b fix(docs): replace Docsy alert shortcode with Relearn notice
The docs site uses the hugo-theme-relearn theme, which provides
`notice` instead of Docsy's `alert`. The face-recognition,
voice-recognition, and stores feature pages used `{{% alert %}}`,
breaking `hugo build` with "template for shortcode \"alert\" not
found".

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-25 21:04:31 +00:00
Ettore Di Giacinto
d733c9cd13 fix(mlx-vlm): pin upstream to v0.4.4 to unblock CUDA builds (#9568)
Blaizzy/mlx-vlm git HEAD bumped its constraint to mlx>=0.31.2, but
mlx-cuda-12 and mlx-cuda-13 are only published up to 0.31.1 on PyPI.
Since mlx[cudaXX]==0.31.2 forces a sibling wheel that doesn't exist,
pip backtracks through every older mlx[cudaXX], none of which satisfy
mlx>=0.31.2, producing ResolutionImpossible.

Pin all variants to the v0.4.4 tag (mlx>=0.30.0), which resolves
cleanly against mlx[cuda13]==0.31.1. cpu/mps weren't broken yet but
are pinned for consistency.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-25 22:06:01 +02:00
Ettore Di Giacinto
703b4fcae8 Change cron schedule to run every 12 hours
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-25 18:38:28 +02:00
Richard Palethorpe
73aacad2f9 fix(vllm): drop flash-attn wheel to avoid torch 2.10 ABI mismatch (#9557)
The pinned flash-attn 2.8.3+cu12torch2.7 wheel breaks at import time
once vllm 0.19.1 upgrades torch to its hard-pinned 2.10.0:

  ImportError: .../flash_attn_2_cuda...so: undefined symbol:
  _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib

That C10 CUDA symbol is libtorch-version-specific. Dao-AILab has not yet
published flash-attn wheels for torch 2.10 -- the latest release (2.8.3)
tops out at torch 2.8 -- so any wheel pinned here is silently ABI-broken
the moment vllm completes its install.

vllm 0.19.1 lists flashinfer-python==0.6.6 as a hard dep, which already
covers the attention path. The only other use of flash-attn in vllm is
the rotary apply_rotary import in
vllm/model_executor/layers/rotary_embedding/common.py, which is guarded
by find_spec("flash_attn") and falls back cleanly when absent.

Also unpin torch in requirements-cublas12.txt: the 2.7.0 pin only
existed to give the flash-attn wheel a matching torch to link against.
With flash-attn gone, vllm's own torch==2.10.0 dep is the binding
constraint regardless of what we put here.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-25 15:38:13 +00:00
LocalAI [bot]
806ea24ff4 chore: ⬆️ Update TheTom/llama-cpp-turboquant to 67559e580b10e4e47e9a6fd6218873997976886d (#9497)
⬆️ Update TheTom/llama-cpp-turboquant

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-25 14:03:46 +02:00
LocalAI [bot]
385de3705e chore(model gallery): 🤖 add 1 new models via gallery agent (#9558)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-25 14:03:15 +02:00
Ettore Di Giacinto
21eace40ec feat(llama-cpp): expose split_mode option for multi-GPU placement (#9560)
Adds split_mode (alias sm) to the llama.cpp backend options allowlist,
accepting none|layer|row|tensor. The tensor value targets the experimental
backend-agnostic tensor parallelism from ggml-org/llama.cpp#19378 and
requires a llama.cpp build that includes that PR, FlashAttention enabled,
KV-cache quantization disabled, and a manually set context size.


Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-25 14:02:57 +02:00
Ettore Di Giacinto
24505e57f5 feat(backends): add CUDA 13 + L4T arm64 CUDA 13 variants for vllm/vllm-omni/sglang (#9553)
* feat(backends): add CUDA 13 + L4T arm64 CUDA 13 variants for vllm/vllm-omni/sglang

Adds new build profiles mirroring the diffusers/ace-step pattern so vLLM
serving (and SGLang on arm64) can be deployed on CUDA 13 hosts and
JetPack 7 boards:

- vllm: cublas13 (PyPI cu130 channel) + l4t13 (jetson-ai-lab SBSA cu130
  prebuilt vllm + flash-attn).
- vllm-omni: cublas13 + l4t13. Floats vllm version on cu13 since vllm
  0.19+ ships cu130 wheels by default and vllm-omni tracks vllm master;
  cu12 path keeps the 0.14.0 pin to avoid disturbing existing images.
- sglang: l4t13 arm64 only — uses the prebuilt sglang wheel from the
  jetson-ai-lab SBSA cu130 index, so no source build is needed.
  Cublas13 sglang on x86_64 is intentionally deferred.

CI matrix gains five new images (-gpu-nvidia-cuda-13-vllm{,-omni},
-nvidia-l4t-cuda-13-arm64-{vllm,vllm-omni,sglang}); backend/index.yaml
gains the matching capability keys (nvidia-cuda-13, nvidia-l4t-cuda-13)
and latest/development merge entries.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]

* fix(backends): use unsafe-best-match index strategy on l4t13 builds

The jetson-ai-lab SBSA cu130 index lists transitive deps (decord, etc.)
at limited versions / older Python ABIs. uv defaults to the first index
that contains a package and refuses to fall through to PyPI, so sglang
l4t13 build fails resolving decord. Mirror the existing cpu sglang
profile by setting --index-strategy=unsafe-best-match on l4t13 across
the three backends, and apply it to the explicit vllm install line in
vllm-omni's install.sh (which doesn't honor EXTRA_PIP_INSTALL_FLAGS).

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]

* fix(sglang): drop [all] extras on l4t13, floor version at 0.5.0

The [all] extra brings in outlines→decord, and decord has no aarch64
cp312 wheel on PyPI nor the jetson-ai-lab index (only legacy cp35-cp37
tags). With unsafe-best-match enabled, uv backtracked through sglang
versions trying to satisfy decord and silently landed on
sglang==0.1.16, an ancient version with an entirely different dep
tree (cloudpickle/outlines 0.0.44, etc.).

Drop [all] so decord is no longer required, and floor sglang at 0.5.0
to prevent any future resolver misfire from degrading the version
again.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-25 12:26:29 +02:00
LocalAI [bot]
d09706dc60 chore(model gallery): 🤖 add 1 new models via gallery agent (#9555)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-25 09:00:37 +02:00
LocalAI [bot]
08e393f7db chore: ⬆️ Update ikawrakow/ik_llama.cpp to cb58a561f0c49f68b6d125cdfda037ed80433821 (#9549)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-25 08:59:48 +02:00
LocalAI [bot]
47cc3dc8d7 chore: ⬆️ Update ggml-org/llama.cpp to 361fe72acb7b9bd79059cc177cbeda99b35b5db9 (#9548)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-25 08:58:27 +02:00
Ettore Di Giacinto
83b384de97 feat: surface distributed backend management errors (#9552)
* fix(distributed): surface per-node backend op errors to OpStatus

DistributedBackendManager.{Install,Upgrade,Delete}Backend discarded the
per-node BackendOpResult from enqueueAndDrainBackendOp with `_, err :=`.
When workers replied Success=false (e.g. an OCI image with no arm64
variant on a Jetson host), the per-node Error string was recorded in
result.Nodes[].Error but never reached the toplevel return value, so
OpStatus.Error stayed empty and the UI reported the install as
"completed" while the backend was nowhere on the cluster.

Add BackendOpResult.Err() that aggregates per-node Status=="error"
entries into a single error. Queued nodes (waiting for reconciler retry)
are deliberately not treated as failures. Wire the three callers and
DeleteBackendDetailed to call result.Err() so reply.Success=false
finally reaches OpStatus.Error → /api/backends/job/:uid → the UI.
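
The aggregation might be sketched as follows. The struct layouts and status strings other than "error" are assumptions made for illustration, and `errors.Join` (Go 1.20+) is one possible way to combine the per-node errors, not necessarily what the change uses.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeResult is an assumed per-node shape for this sketch.
type NodeResult struct {
	Node   string
	Status string // "error" is from the commit; other values are assumed
	Error  string
}

// BackendOpResult collects the per-node outcomes of one operation.
type BackendOpResult struct {
	Nodes []NodeResult
}

// Err folds every node with Status == "error" into a single error.
// Queued nodes are not failures, so they are skipped. It returns nil
// when no node reported an error.
func (r BackendOpResult) Err() error {
	var errs []error
	for _, n := range r.Nodes {
		if n.Status == "error" {
			errs = append(errs, fmt.Errorf("node %s: %s", n.Node, n.Error))
		}
	}
	return errors.Join(errs...)
}

func main() {
	res := BackendOpResult{Nodes: []NodeResult{
		{Node: "a", Status: "success"},
		{Node: "b", Status: "error", Error: "no arm64 variant"},
		{Node: "c", Status: "queued"},
	}}
	fmt.Println(res.Err())
}
```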

The Delete closures had a related bug: they discarded the reply with
`_` and only checked the NATS round-trip error, so reply.Success=false
was a silent success even with the new aggregation. Check both.

Standalone mode (LocalBackendManager) already surfaces gallery errors
correctly through the same OpStatus.Error path; no change needed there.

Tests: 9 new Ginkgo specs covering all-success / all-fail with distinct
errors / mixed / all-queued / no-nodes for Install, Upgrade, Delete.

Assisted-by: Claude:claude-opus-4-7 [Bash] [Edit] [Read] [Write]

* feat(react-ui): per-node backend delete + clearer upgrade affordance

The Nodes page exposed a per-node "reinstall" button (fa-sync-alt,
tooltip "Reinstall backend") but no per-node delete, even though the
Go side has had POST /api/nodes/:id/backends/delete →
RemoteUnloaderAdapter.DeleteBackend → NATS-to-specific-node wired up
for a while. Sync icons read as "refresh data" — the action is
functionally an upgrade (re-pulls the gallery image), so the affordance
was misleading.

Per-node backend row now renders two icon buttons:

- Upgrade: btn-secondary btn-sm + fa-arrow-up, tooltip "Upgrade backend
  on this node". Names both action and scope to differentiate from the
  cluster-wide upgrade on the Backends page.
- Delete: btn-danger-ghost btn-sm + fa-trash, tooltip "Delete backend
  from this node". Matches the node-level destructive style at the row
  action column rather than the solid btn-danger of primary destructive
  pages, since this is a secondary action inside a busy row.

Delete goes through the existing ConfirmDialog (danger=true) with copy
that names the backend and the node explicitly — it's a non-recoverable
op on a specific scope. Reuses nodesApi.deleteBackend(id, backend) which
already existed in the API client.

Tests: 4 new Playwright specs covering upgrade clarity (icon + tooltip),
delete button presence, confirm dialog flow with POST body assertion,
and cancel-doesn't-POST.

Assisted-by: Claude:claude-opus-4-7 [Bash] [Edit] [Read] [Write]
2026-04-25 08:57:59 +02:00
Ettore Di Giacinto
487e3fd2a4 feat(react-ui): editorial refresh with Nord palette and polished primitives (#9550)
* feat(react-ui): editorial refresh with Nord palette and polished primitives

Replaces the cool gray-blue theme with a deep Nord-inspired palette:
frost-cyan accent (#88c0d0) on deep blue-black surfaces (#13171f /
#1a1f2a / #242a36), snow-storm text scale, aurora status colours.

- Typography: Geist Variable + Geist Mono Variable (Google Fonts) with
  ss01/ss03/cv11 stylistic alternates; strengthened h1-h6 hierarchy;
  editorial negative tracking.
- Primitives: buttons gain depth (inset highlight + hover lift +
  brightness filter); inputs become sunken wells with sage-swap-to-frost
  focus rings; cards hover-lift and gain an .card--accent left-rail
  variant; badges become mono caps rectangles with tabular-nums.
- Chrome: sidebar active state is now an inset left rail + tint
  (no border-left); modals get popIn animation and proper shadow lift;
  toasts carry an inset accent bar + slide-in instead of tinted fills;
  operations bar breathes on active installs.
- Empty states: editorial pattern (eyebrow rule, large mono title,
  52ch lede) that inherits gracefully even without page JSX edits.
- Chat: assistant bubbles drop the gray-nested-in-gray card for a
  transparent pull-quote with a left border; user bubbles soften from
  loud accent fill to a subtle frost tint.
- Motion: custom spring easing cubic-bezier(0.22,1,0.36,1), 180ms
  standard; breathing/pulse/popIn keyframes; global prefers-reduced-
  motion honoring.
- Radii tightened to 3/5/8/10px; warm-shadow tokens redone for cool
  depth; ::selection, :focus-visible, kbd globals added.
- Migrated hardcoded 'JetBrains Mono' CSS literals to var(--font-mono)
  so the Geist Mono swap lands everywhere.

Scope is intentionally tokens + primitives only. Page JSX and the
~1,800 inline style={{…}} instances are untouched and flagged as
follow-ups.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write]

* feat(react-ui): complete-coverage pass — migrate inline styles to tokens

Follows up the editorial/Nord token refresh with a mechanical sweep of
page JSX and shared components so nothing bypasses the design system.

- Font family: replaced 80+ 'JetBrains Mono' / 'Space Grotesk' inline
  literals (and the string-CSS variants in CollectionDetails and
  AgentStatus) with var(--font-mono) / var(--font-sans). SVG <text>
  nodes that used the attribute form were switched to style={{ }} so
  the CSS variable resolves.
- Radii: every unquoted numeric borderRadius (2/3/4/10) is now a
  var(--radius-*) token; 50% and 999px kept as computed shapes.
- Spacing: clean-token gaps and margins (4/8/16px) moved to
  var(--spacing-xs/sm/md); padding: '4px 8px' and '8px 16px' lifted
  into token pairs. Micro-values (2/6/10/12px) left inline where no
  token maps cleanly.
- Colors: Talk.jsx button/canvas-surface hardcodes moved to
  var(--color-*); FineTune.jsx chart series colours now use the
  --color-data-* Nord palette (cyan/red/purple/orange instead of
  tailwind hex); AgentStatus tool-call icon and error tag hex swapped
  for var(--color-warning) / var(--color-text-inverse).
- CodeMirror editor (utils/cmTheme.js): both themes rebased on Nord —
  polar-night surfaces and aurora syntax highlighting (dark), snow-
  storm surfaces with darkened aurora (light). Caret/selection/active
  line/search now frost-cyan tinted instead of legacy indigo/purple.

Legitimately dynamic styles (computed widths, per-row colours, canvas
2D context fill/stroke for waveform and spectrogram drawing) remain
inline — they can't be expressed as CSS tokens.

29 files, +237/-237 — identity preserved, semantics re-anchored to
the token system.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write]
2026-04-24 23:35:59 +02:00
dependabot[bot]
9ab3496de2 chore(deps): bump rustls-webpki from 0.103.10 to 0.103.13 in /backend/rust/kokoros in the cargo group across 1 directory (#9546)
chore(deps): bump rustls-webpki

Bumps the cargo group with 1 update in the /backend/rust/kokoros directory: [rustls-webpki](https://github.com/rustls/webpki).


Updates `rustls-webpki` from 0.103.10 to 0.103.13
- [Release notes](https://github.com/rustls/webpki/releases)
- [Commits](https://github.com/rustls/webpki/compare/v/0.103.10...v/0.103.13)

---
updated-dependencies:
- dependency-name: rustls-webpki
  dependency-version: 0.103.13
  dependency-type: indirect
  dependency-group: cargo
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-24 22:02:58 +02:00
dependabot[bot]
c4511be33a chore(deps): bump postcss from 8.5.8 to 8.5.10 in /core/http/react-ui in the npm_and_yarn group across 1 directory (#9544)
chore(deps): bump postcss

Bumps the npm_and_yarn group with 1 update in the /core/http/react-ui directory: [postcss](https://github.com/postcss/postcss).


Updates `postcss` from 8.5.8 to 8.5.10
- [Release notes](https://github.com/postcss/postcss/releases)
- [Changelog](https://github.com/postcss/postcss/blob/main/CHANGELOG.md)
- [Commits](https://github.com/postcss/postcss/compare/8.5.8...8.5.10)

---
updated-dependencies:
- dependency-name: postcss
  dependency-version: 8.5.10
  dependency-type: indirect
  dependency-group: npm_and_yarn
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-24 22:02:41 +02:00
Ettore Di Giacinto
551ebdb57a fix(distributed): correct VRAM/RAM reporting on NVIDIA unified-memory hosts (#9545)
Workers on NVIDIA unified-memory hardware (DGX Spark / GB10, Jetson AGX Thor,
Jetson Orin/Xavier/Nano) were reporting `available_vram=0` back to the frontend,
so the Nodes UI showed the node as fully used even when most of the unified
memory was actually free.

Three causes addressed:

* `isTegraDevice` only matched `/sys/devices/soc0/family == "Tegra"`. DGX Spark
  (SBSA) reports JEDEC codes there instead — `jep106:0426` for the NVIDIA
  manufacturer — so the Tegra/unified-memory fallback never ran. Renamed to
  `isNVIDIAIntegratedGPU` and extended to also match `jep106:0426[:*]` via
  `/sys/devices/soc0/soc_id`.

* The unified-iGPU code defaulted the device name to `"NVIDIA Jetson"` when
  `/proc/device-tree/model` was missing. That's what happens for Thor inside a
  docker container, and always on DGX Spark. New `nvidiaIntegratedGPUName`
  resolves via dt-model → `/sys/devices/soc0/machine` → `soc_id` lookup
  (`jep106:0426:8901` → `"NVIDIA GB10"`) so the Nodes UI labels the box
  correctly.

* Worker heartbeat sent `available_vram=0` (or total-as-available) when VRAM
  usage was momentarily unknown — e.g. when `nvidia-smi` intermittently failed
  with `waitid: no child processes` under containers without `--init`. Each
  such heartbeat overwrote the DB and made the UI flip to "fully used".
  `heartbeatBody` now omits `available_vram` in that case so the DB keeps its
  last good value.
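
The first cause above comes down to a prefix match on the `soc_id` string. A minimal sketch of that check, assuming the value has already been read from `/sys/devices/soc0/soc_id`; the helper name `matchesNVIDIASoCID` is hypothetical and is not the code from this commit:

```go
package main

import (
	"fmt"
	"strings"
)

// nvidiaJEP106 is the JEDEC manufacturer code quoted in the commit message.
const nvidiaJEP106 = "jep106:0426"

// matchesNVIDIASoCID is a hypothetical helper: it reports whether a soc_id
// value is either the bare NVIDIA manufacturer code or that code followed
// by ":<part>". Surrounding whitespace (sysfs values end in "\n") is ignored.
func matchesNVIDIASoCID(socID string) bool {
	id := strings.TrimSpace(socID)
	return id == nvidiaJEP106 || strings.HasPrefix(id, nvidiaJEP106+":")
}

func main() {
	for _, id := range []string{"jep106:0426:8901\n", "jep106:0426", "jep106:04260", "Tegra"} {
		fmt.Printf("%q -> %v\n", id, matchesNVIDIASoCID(id))
	}
}
```

Matching on the code plus a `:` separator, rather than on a bare prefix, keeps a longer manufacturer code such as `jep106:04260` from being accepted by accident.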

Also updates the commented GPU blocks in both compose files with
`NVIDIA_DRIVER_CAPABILITIES=compute,utility`, `capabilities: [gpu, utility]`,
and `init: true`, and documents the requirement in the distributed-mode and
nvidia-l4t pages. Without `utility`, NVML/`nvidia-smi` are absent inside the
container, which is what put the DGX Spark worker into the buggy fallback in
the first place.

Detection verified on live hardware (dgx.casa / GB10 and 192.168.68.23 / Thor)
by running a cross-compiled probe of the new helpers on both host and inside
the worker container.

Assisted-by: Claude:opus-4.7 [Claude Code]
2026-04-24 22:02:23 +02:00
Andreas Egli
1d0de757c3 fix: add hipblaslt library (#9541)
Signed-off-by: Andreas Egli <github@kharan.ch>
2026-04-24 18:50:03 +02:00
Alex Brick
e5337039b0 [intel GPU support] Use latest oneapi-basekit image for Intel images to support b70 (#9543)
* Use latest oneapi-basekit image for Intel images

The current `localai/localai:master-gpu-intel` images don't work with the Intel Arc Pro B70. Updating the base_image to 2025.3.2 fixes it.

Signed-off-by: Alex Brick <3220905+arbrick@users.noreply.github.com>

* Update github workflow base image

---------

Signed-off-by: Alex Brick <3220905+arbrick@users.noreply.github.com>
2026-04-24 18:29:10 +02:00
LocalAI [bot]
1c9592c77f chore: ⬆️ Update leejet/stable-diffusion.cpp to b8bdffc19962be7e5a84bfefeb2e31bd885b571a (#9521)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-24 15:15:15 +02:00
Richard Palethorpe
3db60b57e6 fix(realtime): consume ChatDeltas when C++ autoparser clears Response (#9538)
The llama.cpp C++-side chat autoparser clears Reply.Message and delivers
parsed content/reasoning/tool-calls via Reply.chat_deltas. chat.go handles
this (non-SSE path uses ToolCallsFromChatDeltas/ContentFromChatDeltas/
ReasoningFromChatDeltas), but realtime.go only read pred.Response, so any
model routed through the autoparser (Qwen2.5/3 and friends) produced a
silent reply: backend emitted N tokens, the session surface saw zero.

Mirror the non-SSE chat path in realtime's triggerResponse: when deltas
carry tool calls or content, use them directly; otherwise fall back to
the existing raw-text parsing.

Assisted-by: claude-opus-4-7-1M [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-24 14:41:38 +02:00
Richard Palethorpe
13734ae9fa feat: Add Sherpa ONNX backend for ASR and TTS (#8523)
feat(backend): Add Sherpa ONNX backend and Omnilingual ASR

Adds a new Go backend wrapping sherpa-onnx via purego (no cgo). Same
approach as opus/stablediffusion-ggml/whisper — a thin C shim
(csrc/shim.c + shim.h → libsherpa-shim.so) wraps the bits purego
can't reach directly: nested struct config writes, result-struct field
reads, and the streaming TTS callback trampoline. The Go side uses
opaque uintptr handles and purego.NewCallback for the TTS callback.

Supports:
- VAD via sherpa-onnx's Silero VAD
- Offline ASR: Whisper, Paraformer, SenseVoice, Omnilingual CTC
- Online/streaming ASR: zipformer transducer with endpoint detection
  (AudioTranscriptionStream emits delta events during decode)
- Offline TTS: VITS (LJS, etc.)
- Streaming TTS: sherpa-onnx's callback API → PCM chunks on a channel,
  prefixed by a streaming WAV header

Gallery entries: omnilingual-0.3b-ctc-q8-sherpa (1600-language offline
ASR), streaming-zipformer-en-sherpa (low-latency streaming ASR),
silero-vad-sherpa, vits-ljs-sherpa.

E2E coverage: tests/e2e-backends for offline + streaming ASR,
tests/e2e for the full realtime pipeline (VAD + STT + TTS).

Assisted-by: claude-opus-4-7-1M [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-24 14:40:06 +02:00
Ettore Di Giacinto
c0920f3273 fix(ik-llama-cpp): patch clip.cpp for new ggml_quantize_chunk signature (#9531)
Bumps ik_llama.cpp pin to 16996aeab7. Upstream 286ce32...16996ae adds a
trailing `const struct quantize_user_data *` parameter to
`ggml_quantize_chunk` (PR ikawrakow/ik_llama.cpp#1677) but leaves
`examples/llava/clip.cpp` unchanged because their build has moved to
`examples/mtmd/`. LocalAI's prepare.sh still copies from
`examples/llava/`, so the dead 7-arg call reaches the grpc-server
compile and fails. Patch the call site to pass `nullptr` for the new
param.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash]
2026-04-24 13:07:26 +02:00
LocalAI [bot]
7c1934b183 chore: ⬆️ Update ggml-org/llama.cpp to 187a45637054881ecacf17f8e2f6f8f2ba7df1c7 (#9520)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-24 09:17:06 +02:00
Tai An
5e062b4d1f fix: use SetFunctionCallNameString when forcing a specific tool (3 sites) (#9526)
* fix(anthropic): use SetFunctionCallNameString for specific tool forcing

* fix(openai/realtime): use SetFunctionCallNameString for specific tool forcing

* fix(openresponses): use SetFunctionCallNameString for specific tool forcing
2026-04-24 09:06:42 +02:00
Ettore Di Giacinto
4906cbad04 feat: add biometrics UI (#9524)
* feat(react-ui): add Face & Voice Recognition pages

Expose the face and voice biometrics endpoints
(/v1/face/*, /v1/voice/*) through the React UI. Each page has four
tabs driving the six endpoints per modality: Analyze (demographics
with bounding boxes / waveform segments), Compare (verify with a
match gauge and live threshold slider), Enrollment (register /
identify / forget with a top-K matches view), Embedding (raw
vector inspector with sparkline + copy).

MediaInput supports file upload plus live capture: webcam
snap-to-canvas for face, MediaRecorder -> AudioContext ->
16-bit PCM mono WAV transcode for voice (libsndfile on the
backend only handles WAV/FLAC/OGG natively).

Sidebar gets a new Biometrics section feature-gated on
face_recognition / voice_recognition; routes are wrapped in
<RequireFeature>. No new dependencies -- Font Awesome icons
picked from the Free set.

Assisted-by: Claude:Opus 4.7

* fix(localai): accept data URI prefixes with codec/charset params

Browser MediaRecorder produces data URIs like
  data:audio/webm;codecs=opus;base64,...
so the pre-';base64,' section can carry multiple parameter
segments. The `^data:([^;]+);base64,` regex in pkg/utils/base64.go
and core/http/endpoints/localai/audio.go only matched exactly one
segment, so recordings straight from the React UI's live-capture
tab failed the strip and then tripped the base64 decoder on the
leading 'data:' literal, surfacing as
  "invalid audio base64: illegal base64 data at input byte 4"

Widened both regexes to `^data:[^,]+?;base64,` so any number of
';param=value' segments between the mime type and ';base64,' are
tolerated. Added a regression test covering the MediaRecorder
shape.
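
The difference between the two patterns quoted above can be reproduced in isolation. This is a sketch only: the two regexes are copied from this message, while `stripDataURIPrefix` is a hypothetical helper and not the function in pkg/utils/base64.go.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Old pattern: exactly one segment between "data:" and ";base64,".
	oldPrefix = regexp.MustCompile(`^data:([^;]+);base64,`)
	// Widened pattern: any number of ";param=value" segments.
	newPrefix = regexp.MustCompile(`^data:[^,]+?;base64,`)
)

// stripDataURIPrefix is a hypothetical helper that removes a leading
// data-URI header when re matches, and returns s unchanged otherwise.
func stripDataURIPrefix(re *regexp.Regexp, s string) string {
	return re.ReplaceAllString(s, "")
}

func main() {
	recorded := "data:audio/webm;codecs=opus;base64,AAAA"
	fmt.Println("old:", stripDataURIPrefix(oldPrefix, recorded))
	fmt.Println("new:", stripDataURIPrefix(newPrefix, recorded))
}
```

With the old pattern the MediaRecorder-shaped input comes back untouched, which is how the literal `data:` prefix ended up in front of the base64 decoder.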

Assisted-by: Claude:Opus 4.7

* fix(insightface): scope pack ONNX loading to known manifests

LocalAI's gallery extracts buffalo_* zips flat into the models
directory, which inevitably mixes with ONNX files from other
backends (opencv face engine, MiniFASNet antispoof, WeSpeaker
voice embedding) and older buffalo pack installs. Feeding those
foreign files into insightface's model_zoo.get_model() blows up
inside the router -- it assumes a 4-D NCHW input and indexes
`input_shape[2]` on tensors that aren't shaped like a face model,
raising IndexError mid-load and leaving the backend unusable.

The router's dispatch isn't amenable to per-file try/except alone
(first-file-wins picks det_10g.onnx from buffalo_l even when the
user asked for buffalo_sc -- alphabetical order happens to favour
the wrong pack). Instead, ship an explicit manifest of the
upstream v0.7 pack contents and scope the glob to that when the
requested pack is known. The manifest is small and stable; future
packs can be added alongside or fall through to the tolerance
loop, which also swallows any remaining IndexError / ValueError
from foreign files with a clear `[insightface] skipped` stderr
line for diagnostics.

Assisted-by: Claude:Opus 4.7

* fix(speaker-recognition): extract FBank features for rank-3 ONNX encoders

Pre-exported speaker-encoder ONNX graphs come in two shapes:

  rank-2  [batch, samples]           -- some 3D-Speaker exports,
                                        take raw waveform directly.
  rank-3  [batch, frames, n_mels]    -- WeSpeaker and most Kaldi-
                                        lineage encoders, expect
                                        pre-computed Kaldi FBank.

OnnxDirectEngine unconditionally fed `audio.reshape(1, -1)` --
correct for rank-2, IndexError-on-input_shape[3] on rank-3, which
surfaced to the UI as
  "Invalid rank for input: feats Got: 2 Expected: 3"

Detect the input rank at session init and run Kaldi FBank
(80-dim, 25ms/10ms frames, dither=0.0, per-utterance CMN) before
the forward pass when rank>=3. All knobs are configurable via
backend options for encoders that deviate from defaults.

torchaudio.compliance.kaldi is already in the backend's
requirements (SpeechBrain pulls torchaudio in), so no new
dependency.

Assisted-by: Claude:Opus 4.7

* fix(biometrics): isolate face and voice vector stores

Face (ArcFace, 512-D) and voice (ECAPA-TDNN 192-D / WeSpeaker
256-D) biometric embeddings were colliding inside a single
in-memory local-store instance. Enrolling one after the other
failed with
  "Try to add key with length N when existing length is M"
because local-store correctly refuses to mix dimensions in one
keyspace.

The registries were constructed with `storeName=""`, which in
StoreBackend() is just a WithModel() call. But ModelLoader's
cache is keyed on `modelID`, not `model` -- so both registries
collapsed to the same `modelID=""` slot and reused the same
backend process despite looking isolated on paper.

Three complementary fixes:

  1. application.go -- give each registry a distinct default
     namespace ("localai-face-biometrics" /
     "localai-voice-biometrics"). The comment claimed
     isolation, now it's actually enforced.

  2. stores.go -- pass the storeName as both WithModelID and
     WithModel so the ModelLoader cache key separates
     namespaces and the loader spawns distinct processes.

  3. local-store/store.go -- drop the Load() `opts.Model != ""`
     guard. It was there to prevent generic model-loading loops
     from picking up local-store by accident, but that auto-load
     path is being retired; the guard now just blocks legitimate
     namespace isolation. opts.Model is treated as a tag; the
     per-tuple process isolation upstream handles discrimination.

Assisted-by: Claude:Opus 4.7

* fix(gallery): stale-file cleanup and upgrade-tmp directory safety

Two related robustness fixes for backend install/upgrade:

pkg/downloader/uri.go
  OCI downloads passed through
      if filepath.Ext(filePath) != "" ...
          filePath = filepath.Dir(filePath)
  which was intended to redirect file-shaped download targets
  into their parent directory for OCI extraction. The heuristic
  misfires on directory-shaped paths with a dot-suffix --
  gallery.UpgradeBackend uses
      tmpPath = "<backendsPath>/<name>.upgrade-tmp"
  and Go's filepath.Ext treats ".upgrade-tmp" as an extension.
  The rewrite landed the extraction at "<backendsPath>/", which
  then **overwrote the real install** (backends/<name>/) with a
  flat-layout file and left a stray run.sh at the top level. The
  tmp dir itself stayed empty, so the validation step that
  checked "<tmpPath>/run.sh" predictably failed with
      "upgrade validation failed: run.sh not found in new backend"
  Every manual upgrade silently corrupted the backends tree this
  way. Guard the rewrite behind "target isn't already an existing
  directory" -- InstallBackend / UpgradeBackend both pre-create
  the target as a directory, so they get the correct behaviour;
  existing file-path callers with a genuine dot-extension still
  get the parent redirect.
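
  The misfire and the guard can be sketched as follows. This is an
  illustration under assumptions, not the code in pkg/downloader/uri.go:
  `extractionDir` and `demo` are hypothetical names, and the backend name
  `mybackend` is made up.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// extractionDir is a hypothetical helper. An existing directory is used
// as-is; otherwise a path with a dot-suffix is treated as file-shaped and
// redirected to its parent.
func extractionDir(target string) string {
	if info, err := os.Stat(target); err == nil && info.IsDir() {
		return target // already a directory: extract into it
	}
	if filepath.Ext(target) != "" {
		return filepath.Dir(target) // file-shaped path: use the parent
	}
	return target
}

// demo returns the temp base dir and the directory chosen for
// "<base>/mybackend.upgrade-tmp" before and after that directory exists.
func demo() (base, before, after string, err error) {
	base, err = os.MkdirTemp("", "backends")
	if err != nil {
		return
	}
	defer os.RemoveAll(base)

	tmp := filepath.Join(base, "mybackend.upgrade-tmp")
	before = extractionDir(tmp) // not created yet: Ext() triggers the redirect
	if err = os.Mkdir(tmp, 0o755); err != nil {
		return
	}
	after = extractionDir(tmp) // pre-created: the guard keeps the tmp dir
	return
}

func main() {
	fmt.Println(filepath.Ext("backends/mybackend.upgrade-tmp"))
	base, before, after, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("redirected to parent before mkdir:", before == base)
	fmt.Println("kept tmp dir after mkdir:", after != base)
}
```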

core/gallery/backends.go
  InstallBackend's MkdirAll returned ENOTDIR when something at
  the target path was already a file (legacy dev builds dropped
  golang backend binaries directly at `<backendsPath>/<name>`
  instead of nesting them under their own subdir). That
  permanently blocked reinstall and upgrade for anyone carrying
  that state, since every retry hit the same error. Detect a
  pre-existing non-directory, warn, and remove it before the
  MkdirAll so the fresh install can write the correct nested
  layout with metadata.json + run.sh.

Assisted-by: Claude:Opus 4.7

* fix(galleryop): refresh upgrade cache after backend ops

UpgradeChecker caches the last upgrade-check result and only
refreshes on the 6-hour tick or after an auto-upgrade cycle.
Manual upgrades (POST /api/backends/upgrade/:name) go through
the async galleryop worker, which completes the upgrade
correctly but never tells UpgradeChecker to re-check -- so
/api/backends/upgrades continued to list a just-upgraded backend
as upgradeable, indistinguishable from a failed upgrade, for up
to six hours.

Add an optional `OnBackendOpCompleted func()` hook on
GalleryService that fires after every successful install /
upgrade / delete on the backend channel (async, so a slow
callback doesn't stall the queue). startup.go wires it to
UpgradeChecker.TriggerCheck after both services exist. Result:
the upgrade banner clears within milliseconds of the worker
finishing.
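
A minimal sketch of the hook shape, assuming a stripped-down service type: only the field name `OnBackendOpCompleted` comes from this message, while `galleryService` and `finishOp` are hypothetical stand-ins for the real worker code.

```go
package main

import "fmt"

// galleryService is a hypothetical stand-in for the real GalleryService.
type galleryService struct {
	// OnBackendOpCompleted is optional; nil means "no hook registered".
	OnBackendOpCompleted func()
}

// finishOp fires the hook in its own goroutine after a successful
// operation, so a slow callback cannot stall the worker loop.
func (g *galleryService) finishOp(opErr error) {
	if opErr != nil || g.OnBackendOpCompleted == nil {
		return
	}
	go g.OnBackendOpCompleted()
}

func main() {
	refreshed := make(chan struct{})
	svc := &galleryService{OnBackendOpCompleted: func() { close(refreshed) }}

	svc.finishOp(nil) // successful op: hook runs asynchronously
	<-refreshed
	fmt.Println("upgrade cache refresh triggered")
}
```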

Assisted-by: Claude:Opus 4.7

* build: prepend GOPATH/bin to PATH for protogen-go

install-go-tools runs `go install` for protoc-gen-go and
protoc-gen-go-grpc, which writes them into `go env GOPATH`/bin.
That directory isn't on every dev's PATH, and protoc resolves
its code-gen plugins via PATH, so the immediately-following
protoc invocation fails with
  "protoc-gen-go: program not found"
which in turn blocks `make build` and any
`make backends/%` target that depends on build.

Prepend `go env GOPATH`/bin to PATH for the protoc invocation
so the freshly-installed plugins are found without requiring a
shell-profile change.

Assisted-by: Claude:Opus 4.7

* refactor(ui-api): non-blocking backend upgrade handler with opcache

POST /api/backends/upgrade/:name used to send the ManagementOp
directly onto the unbuffered BackendGalleryChannel, which blocked
the HTTP request whenever the galleryop worker was busy with a
prior operation. The op also didn't show up in /api/operations,
so the Backends UI couldn't reflect upgrade progress on the
affected row.

Register the op in opcache immediately, wrap it in a cancellable
context, store the cancellation function on the GalleryService,
and push onto the channel from a goroutine so the handler
returns right away. Response gains a `jobID` field and a
`message` string so clients have a consistent handle regardless
of whether the op is queued or running.
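
The pattern described above can be sketched as follows, under stated assumptions: `managementOp` and `submit` are hypothetical names, and opcache registration is left out. The point is only that the send onto an unbuffered channel moves into a goroutine guarded by a cancellable context.

```go
package main

import (
	"context"
	"fmt"
)

// managementOp is a hypothetical, minimal operation record.
type managementOp struct{ ID string }

// submit returns immediately with a job ID and a cancel function. The send
// onto the unbuffered channel happens in a goroutine, and is abandoned if
// the context is cancelled before a worker receives the op.
func submit(ch chan<- managementOp, op managementOp) (string, context.CancelFunc) {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		select {
		case ch <- op:
		case <-ctx.Done():
		}
	}()
	return op.ID, cancel
}

func main() {
	ch := make(chan managementOp) // unbuffered, like the channel described above
	jobID, cancel := submit(ch, managementOp{ID: "upgrade-demo"})
	defer cancel()
	fmt.Println("handler returned, jobID:", jobID)

	got := <-ch // the worker picks the op up later
	fmt.Println("worker picked up:", got.ID)
}
```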

Pairs with the OnBackendOpCompleted hook added in the galleryop
commit — together the UI sees the upgrade start, watches
progress via /api/operations, and drops the "upgradeable" flag
the moment the worker finishes.

Assisted-by: Claude:Opus 4.7
2026-04-24 08:50:34 +02:00
LocalAI [bot]
c755cd5ab5 feat(swagger): update swagger (#9518)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-23 23:26:50 +02:00
LocalAI [bot]
0fb04f7ac3 chore(model-gallery): ⬆️ update checksum (#9522)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-23 23:26:27 +02:00
Ettore Di Giacinto
d9d7b5c29b docs(readme): add April 2026 highlights to Latest News
Assisted-by: Claude-Code:claude-opus-4-7
2026-04-23 20:47:06 +00:00
walcz-de
f877942d97 fix(openresponses): parse OpenAI-spec nested tool_choice + use correct setter (#9509)
Two bugs in MergeOpenResponsesConfig (/v1/responses + WebSocket, *not*
/v1/chat/completions — that has a separate, working path via Tool
unmarshal + SetFunctionCallNameString):

1. **Shape mismatch.** OpenAI's specific-function tool_choice nests the
   name under "function":
       {"type": "function", "function": {"name": "my_function"}}
   The legacy flat shape was:
       {"type": "function", "name": "my_function"}
   Only the flat shape was handled. OpenAI-compliant clients that reach
   /v1/responses (openai-python with the Responses API, Stainless-generated
   SDKs, …) silently failed to force the function.

2. **Wrong setter.** The code called SetFunctionCallString(name), which
   writes the mode field (functionCallString: "none"/"auto"/"required").
   The specific-function name lives in a separate field
   (functionCallNameString), read by ShouldCallSpecificFunction and
   FunctionToCall. Net effect: a correctly-formed tool_choice never
   engaged grammar-based forcing.

The fix preserves backward compatibility by accepting both shapes
(nested preferred, flat as fallback) and routes to the correct setter.
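
A sketch of the "nested preferred, flat as fallback" rule, assuming the tool_choice value arrives as raw JSON. `functionNameFromToolChoice` is a hypothetical helper and not the code in MergeOpenResponsesConfig:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// functionNameFromToolChoice is a hypothetical helper. It returns the forced
// function name from either the nested or the flat shape, preferring nested.
func functionNameFromToolChoice(raw []byte) (string, bool) {
	var tc struct {
		Type     string `json:"type"`
		Name     string `json:"name"` // legacy flat shape
		Function *struct {
			Name string `json:"name"`
		} `json:"function"` // OpenAI-spec nested shape
	}
	// String modes ("auto", "required", ...) fail to unmarshal into the
	// struct and are reported as "no specific function".
	if err := json.Unmarshal(raw, &tc); err != nil || tc.Type != "function" {
		return "", false
	}
	if tc.Function != nil && tc.Function.Name != "" {
		return tc.Function.Name, true
	}
	if tc.Name != "" {
		return tc.Name, true
	}
	return "", false
}

func main() {
	for _, in := range []string{
		`{"type":"function","function":{"name":"get_weather"}}`,
		`{"type":"function","name":"get_weather"}`,
		`"required"`,
	} {
		name, ok := functionNameFromToolChoice([]byte(in))
		fmt.Printf("%s -> %q %v\n", in, name, ok)
	}
}
```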

Note: The same "wrong setter" pattern appears at three other sites —
anthropic/messages.go:883, openai/realtime_model.go:171, and
openresponses/responses.go:776 — and /v1/chat/completions has its own
issue parsing tool_choice="required" as a string (json.Unmarshal on a
raw string fails silently). Those are filed as a tracking issue rather
than bundled here to keep this PR focused.

## Test plan
9 new Ginkgo specs under "MergeOpenResponsesConfig tool_choice parsing":
  - string modes: "required" / "auto" / "none"
  - OpenAI-spec nested shape: {type:function, function:{name}}
  - Legacy Anthropic-compat flat shape: {type:function, name}
  - Shape-preference: nested wins over flat when both present
  - Malformed: missing type, wrong type, missing name, empty name, nil

$ go test ./core/http/middleware/ -count=1 -run TestMiddleware
  Ran 28 of 28 Specs in 0.003 seconds -- PASS

## Repro (against /v1/responses)

    curl -N http://localai/v1/responses \
         -H 'Content-Type: application/json' \
         -d '{"model":"qwen3.6-35b-a3b-apex",
              "input":"Weather in Berlin?",
              "tools":[{"type":"function","name":"get_weather",
                        "parameters":{"type":"object",
                          "properties":{"city":{"type":"string"}},
                          "required":["city"]}}],
              "tool_choice":{"type":"function",
                             "function":{"name":"get_weather"}}}'

Before: grammar-based forcing silently inactive; model free-texts.
After : grammar forces get_weather invocation; output contains
        tool_calls with function:{name:"get_weather", arguments:{...}}.
2026-04-23 18:30:05 +02:00
Ettore Di Giacinto
f5eb13d3c2 feat(insightface): add antispoofing (liveness) detection (#9515)
* feat(insightface): add antispoofing (liveness) detection

Light up the anti_spoofing flag that was parked during the first pass.
Both FaceVerify and FaceAnalyze now run the Silent-Face MiniFASNetV2 +
MiniFASNetV1SE ensemble (~4 MB, Apache 2.0, CPU <10ms) when the flag is
set. Failed liveness on either image vetoes FaceVerify regardless of
embedding similarity. Every insightface* gallery entry now ships the
MiniFASNet ONNX weights so existing packs light up after reinstall.

Setting the flag against a model without the MiniFASNet files returns
FAILED_PRECONDITION (HTTP 412) with a clear install message — no
silent is_real=false.

FaceVerifyResponse gained per-image img{1,2}_is_real and
img{1,2}_antispoof_score (proto fields 9-12); FaceAnalysis's existing
is_real/antispoof_score fields are now populated. Schema fields are
pointers so they are fully absent from the JSON response when
anti_spoofing was not requested — avoids collapsing "not checked" with
"checked and fake" under Go's omitempty on bool.

Validated end-to-end over HTTP against a local install:
- verify + anti_spoofing, both real -> verified=true, score ~0.76
- verify + anti_spoofing, img2 spoof -> verified=false, img2_is_real=false
- analyze + anti_spoofing -> is_real and score per face
- flag against model without MiniFASNet -> HTTP 412 fail-loud
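
The pointer-typed schema fields mentioned above rely on how encoding/json treats `omitempty`: a nil pointer is dropped, while a pointer to `false` is still emitted. A small sketch, assuming a cut-down response type; `verifyResponse` and `encode` are hypothetical, and only the JSON field name follows the img{1,2}_is_real naming from this message:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// verifyResponse is a hypothetical, cut-down response schema.
type verifyResponse struct {
	Verified   bool  `json:"verified"`
	Img1IsReal *bool `json:"img1_is_real,omitempty"` // nil means "not checked"
}

// encode marshals r to a JSON string for display.
func encode(r verifyResponse) string {
	b, err := json.Marshal(r)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	spoof := false
	// anti_spoofing not requested: the field is absent.
	fmt.Println(encode(verifyResponse{Verified: true}))
	// anti_spoofing requested and failed: an explicit false is emitted.
	fmt.Println(encode(verifyResponse{Verified: false, Img1IsReal: &spoof}))
}
```

With a plain `bool` and `omitempty`, the second case would also drop the field, which is the ambiguity the pointer avoids.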

Assisted-by: Claude:claude-opus-4-7 go vet

* test(insightface): wire test target into test-extra

The root Makefile's `test-extra` already runs
`$(MAKE) -C backend/python/insightface test`, but the backend's
Makefile never defined the target — so the command silently errored
and the suite was never executed in CI. Adding the two-line target
(matching ace-step/Makefile) hooks `test.sh` → `runUnittests` →
`python -m unittest test.py`, which discovers both the pre-existing
engine classes (InsightFaceEngineTest, OnnxDirectEngineTest) and the
new AntispoofingTest. Each class skips gracefully when its weights
can't be downloaded from a network-restricted runner.

Assisted-by: Claude:claude-opus-4-7

* test(insightface): exercise antispoofing in e2e-backends (both paths)

Add a `face_antispoof` capability to the Ginkgo e2e suite and extend
the existing FaceVerify + FaceAnalyze specs with liveness assertions
covering BOTH paths:

  real fixture -> is_real=true, score>0, verified stays true
  spoof fixture -> is_real=false, verified vetoed to false

The spoof fixture is upstream's own `image_F2.jpg` (via the yakhyo
mirror) — verified locally against the MiniFASNetV2+V1SE ensemble to
classify as is_real=false with score ~0.013. That makes the assertion
deterministic across CI runs; synthetic/derived spoofs fool the model
unpredictably and would be flaky.

Makefile wires it up end-to-end:
- New INSIGHTFACE_ANTISPOOF_* cache dir + two ONNX downloads with
  pinned SHAs, matching the gallery entries.
- insightface-antispoof-models target shared by both backend configs.
- FACE_SPOOF_IMAGE_URL passed via BACKEND_TEST_FACE_SPOOF_IMAGE_URL.
- Both e2e targets (buffalo-sc + opencv) now:
  * depend on insightface-antispoof-models
  * pass antispoof_v2_onnx / antispoof_v1se_onnx in BACKEND_TEST_OPTIONS
  * include face_antispoof in BACKEND_TEST_CAPS

backend_test.go adds the new capability constant and a faceSpoofFile
fixture resolved the same way as faceFile1/2/3. Spoof assertions are
gated on both capFaceAntispoof AND faceSpoofFile being set, so a test
config that omits the spoof fixture degrades gracefully to "real path
only" instead of failing.

Assisted-by: Claude:claude-opus-4-7 go vet
2026-04-23 18:28:15 +02:00
Ettore Di Giacinto
c1f923b2bc fix(importer): emit all shards for multi-part GGUF models (#9513)
The llama-cpp HuggingFace importer iterated files one at a time and
kept overwriting `lastGGUFFile`, so sharded repos such as
`unsloth/Kimi-K2.6-GGUF` (14 `Q8_K_XL` parts) produced a gallery entry
pointing only at the final shard — useless to llama.cpp's split loader,
which needs shard 1 to discover the set.

Group shards up front via new helpers in `pkg/huggingface-api`
(`SplitShardSuffix`, `ShardGroup`, `GroupShards`). The llama-cpp
importer now picks a group (preferred quant, then last-group fallback)
and emits every shard, with `Model:` pointing at shard 1.
`FindPreferredModelFile` returns shard 1 of the first matching group so
the gallery agent's preview stays coherent for sharded repos.
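
The grouping idea can be sketched as below, assuming shards follow the `<base>-NNNNN-of-NNNNN.gguf` naming used by llama.cpp's split tooling. `splitShard` and `groupShards` are hypothetical names chosen to avoid implying the real signatures in `pkg/huggingface-api`, and the file names are invented.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
)

// shardRe matches "<base>-00001-of-00014.gguf" style names.
var shardRe = regexp.MustCompile(`^(.+)-(\d{5})-of-(\d{5})\.gguf$`)

// splitShard (hypothetical) extracts the base name, shard index and total.
func splitShard(name string) (base string, index, total int, ok bool) {
	m := shardRe.FindStringSubmatch(name)
	if m == nil {
		return "", 0, 0, false
	}
	index, _ = strconv.Atoi(m[2])
	total, _ = strconv.Atoi(m[3])
	return m[1], index, total, true
}

// groupShards (hypothetical) buckets sharded files by base name. Indices
// are zero-padded, so a lexical sort puts shard 1 first in each group.
func groupShards(files []string) map[string][]string {
	groups := map[string][]string{}
	for _, f := range files {
		if base, _, _, ok := splitShard(f); ok {
			groups[base] = append(groups[base], f)
		}
	}
	for _, g := range groups {
		sort.Strings(g)
	}
	return groups
}

func main() {
	files := []string{
		"model-Q8-00002-of-00002.gguf",
		"model-Q8-00001-of-00002.gguf",
		"model-Q4.gguf", // unsharded: not part of any group
	}
	for base, shards := range groupShards(files) {
		fmt.Println(base, "-> first shard:", shards[0], "of", len(shards))
	}
}
```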

Adds unit coverage for the HuggingFace branch of the importer (which
had none), plus shard-detection tests in the hfapi package.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash]
2026-04-23 15:00:02 +02:00
Ettore Di Giacinto
ed648b3b4e fix(llama-cpp): include server-chat.cpp in grpc-server translation unit (#9511)
* fix(llama-cpp): include server-chat.cpp in grpc-server translation unit

Upstream llama.cpp refactor (ggml-org/llama.cpp#20690) moved the
OAI/Anthropic/Responses and transcription conversion helpers out of
server-common.cpp into a new server-chat.cpp, and server-task.cpp and
server-context.cpp now call those symbols (convert_transcriptions_to_chatcmpl,
server_chat_convert_responses_to_chatcmpl, server_chat_convert_anthropic_to_oai,
server_chat_msg_diff_to_json_oaicompat) via server-chat.h.

grpc-server.cpp builds as a single translation unit by #include-ing the
upstream .cpp files directly. Without including server-chat.cpp, the
declarations are satisfied at compile time via server-chat.h but the
link step fails with undefined references once LLAMA_VERSION crosses
the refactor commit (134d6e54).

Guard the include with __has_include so the same source stays buildable
on older LLAMA_VERSION pins that predate the refactor (where prepare.sh
won't copy server-chat.cpp into tools/grpc-server/).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(llama-cpp): bump LLAMA_VERSION to 0d0764dfd

Bump to ggml-org/llama.cpp@0d0764dfd2.
Paired with the preceding grpc-server server-chat.cpp include so the
refactor at 134d6e54 links cleanly. Supersedes PR #9494.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-23 14:59:39 +02:00
Ettore Di Giacinto
3ce5248126 Update expected length of instructions in test
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-23 14:58:57 +02:00
Ettore Di Giacinto
04f1a0285d fix(ik-llama-cpp): adapt to common_grammar struct in sampling.h (#9512)
Upstream ik_llama.cpp commit e0596bf6 ("Autoparser") changed
common_params_sampling::grammar from std::string to a common_grammar
struct (type + grammar), which broke our two direct accesses:

 - JSON ingest fed the field through json_value<common_grammar>(...),
   for which nlohmann has no from_json adapter.
 - JSON export emitted the struct directly, for which nlohmann has no
   to_json adapter.

Wrap the incoming JSON string in common_grammar{COMMON_GRAMMAR_TYPE_USER, ...}
and serialize via the inner .grammar member, mirroring upstream's
examples/server/server-context.cpp.

Also bump IK_LLAMA_VERSION to 286ce324baed17c95faec77792eaa6bdb1c7a5f5
so the local-ai side lines up with the dependency bump in #9496.

Assisted-by: Claude-Code:claude-opus-4-7
2026-04-23 13:45:06 +02:00
Ettore Di Giacinto
181ebb6df4 feat: voice recognition (#9500)
* feat(voice-recognition): add /v1/voice/{verify,analyze,embed} + speaker-recognition backend

Audio analog to face recognition. Adds three gRPC RPCs
(VoiceVerify / VoiceAnalyze / VoiceEmbed), their Go service and HTTP
layers, a new FLAG_SPEAKER_RECOGNITION capability flag, and a Python
backend scaffold under backend/python/speaker-recognition/ wrapping
SpeechBrain ECAPA-TDNN with a parallel OnnxDirectEngine for
WeSpeaker / 3D-Speaker ONNX exports.

The kokoros Rust backend gets matching unimplemented trait stubs —
tonic's async_trait has no defaults, so adding an RPC without Rust
stubs breaks the build (same regression fixed by eb01c772 for face).

Swagger, /api/instructions, and the auth RouteFeatureRegistry /
APIFeatures list are updated so the endpoints surface everywhere a
client or admin UI looks.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): add 1:N identify + register/forget endpoints

Mirrors the face-recognition register/identify/forget surface. New
package core/services/voicerecognition/ carries a Registry interface
and a local-store-backed implementation (same in-memory vector-store
plumbing facerecognition uses, separate instance so the embedding
spaces stay isolated).

Handlers under /v1/voice/{register,identify,forget} reuse
backend.VoiceEmbed to compute the probe vector, then delegate the
nearest-neighbour search to the registry. Default cosine-distance
threshold is tuned for ECAPA-TDNN on VoxCeleb (0.25, EER ~1.9%).
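
For reference, a sketch of a threshold check of this kind. It assumes cosine distance is defined as 1 minus cosine similarity and that a match means distance at or below the threshold; both are assumptions, as are the helper name and the toy vectors, and none of this is taken from the registry code.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// defaultThreshold is the ECAPA-TDNN default quoted in the commit message.
const defaultThreshold = 0.25

// cosineDistance (hypothetical) returns 1 - cos(a, b).
func cosineDistance(a, b []float64) (float64, error) {
	if len(a) == 0 || len(a) != len(b) {
		return 0, errors.New("embeddings must be non-empty and the same length")
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0, errors.New("zero-norm embedding")
	}
	return 1 - dot/math.Sqrt(na*nb), nil
}

func main() {
	probe := []float64{0.2, 0.9, 0.1}
	enrolled := []float64{0.25, 0.85, 0.05}
	d, err := cosineDistance(probe, enrolled)
	if err != nil {
		panic(err)
	}
	fmt.Printf("distance=%.4f match=%v\n", d, d <= defaultThreshold)
}
```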

As with the face registry, the current backing is in-memory only — a
pgvector implementation is a future constructor-level swap.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): gallery, docs, CI and e2e coverage

- backend/index.yaml: speaker-recognition backend entry + CPU and
  CUDA-12 image variants (plus matching development variants).
- gallery/index.yaml: speechbrain-ecapa-tdnn (default) and
  wespeaker-resnet34 model entries. The WeSpeaker SHA-256 is a
  deliberate placeholder — the HF URI must be curl'd and its hash
  filled in before the entry installs.
- docs/content/features/voice-recognition.md: API reference + quickstart,
  mirrors the face-recognition docs.
- React UI: CAP_SPEAKER_RECOGNITION flag export (consumers follow face's
  precedent — no dedicated tab yet).
- tests/e2e-backends: voice_embed / voice_verify / voice_analyze specs.
  Helper resolveFaceFixture is reused as-is — the only thing face/voice
  share is "download a file into workDir", so no need for a new helper.
- Makefile: docker-build-speaker-recognition + test-extra-backend-
  speaker-recognition-{ecapa,all} targets. Audio fixtures default to
  VCTK p225/p226 samples from HuggingFace.
- CI: test-extra.yml grows a tests-speaker-recognition-grpc job
  mirroring insightface. backend.yml matrix gains CPU + CUDA-12 image
  build entries — scripts/changed-backends.js auto-picks these up.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): wire a working /v1/voice/analyze head

Adds AnalysisHead: a lazy-loading age / gender / emotion inference
wrapper that plugs into both SpeechBrainEngine and OnnxDirectEngine.

Defaults to two open-licence HuggingFace checkpoints:
  - audeering/wav2vec2-large-robust-24-ft-age-gender (Apache 2.0) —
    age regression + 3-way gender (female / male / child).
  - superb/wav2vec2-base-superb-er (Apache 2.0) — 4-way emotion.

Both are optional and degrade gracefully when transformers or the
model can't be loaded — the engine raises NotImplementedError so the
gRPC layer returns 501 instead of a generic 500.

Emotion classes pass through from the model (neutral/happy/angry/sad
on the default checkpoint); the e2e test now accepts any non-empty
dominant gender so custom age_gender_model overrides don't fail it.

Adds transformers to the backend's CPU and CUDA-12 requirements.

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): pin real WeSpeaker ResNet34 ONNX SHA-256

Replaces the placeholder hash in gallery/index.yaml with the actual
SHA-256 (7bb2f06e…) of the upstream
Wespeaker/wespeaker-voxceleb-resnet34-LM ONNX at ~25MB. `local-ai
models install wespeaker-resnet34` now succeeds.

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): soundfile loader + honest analyze default

Two issues surfaced on first end-to-end smoke with the actual backend
image:

1. torchaudio.load in torchaudio 2.8+ requires the torchcodec package
   for audio decoding. Switch SpeechBrainEngine._load_waveform to the
   already-present soundfile (listed in requirements.txt) plus a numpy
   linear resample to 16kHz. Drops a heavy ffmpeg-linked dep and the
   codepath we never exercise (torchaudio's ffmpeg backend).

2. The AnalysisHead was defaulting to audeering/wav2vec2-large-robust-
   24-ft-age-gender, but AutoModelForAudioClassification silently
   mangles that checkpoint — it reports the age head weights as
   UNEXPECTED and re-initialises the classifier head with random
   values, so the "gender" output is noise and there is no age output
   at all. Make age/gender opt-in instead (empty default; users wire
   a cleanly-loadable Wav2Vec2ForSequenceClassification checkpoint via
   age_gender_model: option). Emotion keeps its working Superb default.
   Also broaden _infer_age_gender's tensor-shape handling and catch
   runtime exceptions so a dodgy age/gender head never takes down the
   whole analyze call.

Docs and README updated to match the new policy.

Verified with the branch-scoped gallery on localhost:
- voice/embed    → 192-d ECAPA-TDNN vector
- voice/verify   → same-clip dist≈6e-08 verified=true; cross-speaker
                   dist 0.76–0.99 verified=false (as expected)
- voice/register/identify/forget → round-trip works, 404 on unknown id
- voice/analyze  → emotion populated, age/gender omitted (opt-in)

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): real CI audio fixtures + fixture-agnostic verify spec

Two issues surfaced after CI actually ran the speaker-recognition e2e
target (I'd curl-tested against a running server but hadn't run the
make target locally):

1. The default BACKEND_TEST_VOICE_AUDIO_* URLs pointed at
   huggingface.co/datasets/CSTR-Edinburgh/vctk paths that return 404
   (the dataset is gated). Swap them for the speechbrain test samples
   served from github.com/speechbrain/speechbrain/raw/develop/ —
   public, no auth, correct 16kHz mono format.

2. The VoiceVerify spec required d(file1,file2) < 0.4, assuming
   file1/file2 were same-speaker. The speechbrain samples are three
   different speakers (example1/2/5), and there is no easy un-gated
   source of true same-speaker audio pairs (VoxCeleb/VCTK/LibriSpeech
   are all license- or size-gated for CI use). Replace the ceiling
   check with a relative-ordering assertion: d(pair) > d(same-clip)
   for both file2 and file3 — that's enough to prove the embeddings
   encode speaker info, and it works with any three non-identical
   clips. Actual speaker ordering d(1,2) vs d(1,3) is logged but not
   asserted.
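
The relative-ordering assertion from item 2 could look roughly like the Go sketch below. Cosine distance is an assumption here (the message does not name the metric), and the three-element vectors are toy stand-ins for real embeddings; `cosineDistance` and `relativeOrderingHolds` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// cosineDistance returns 1 - cos(a, b). The metric is an assumption for this
// sketch; the commit message does not say which distance the backend reports.
func cosineDistance(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
}

// relativeOrderingHolds reports whether both cross-clip distances are strictly
// larger than the same-clip distance d(e1, e1). It deliberately does not
// compare d(e1, e2) against d(e1, e3).
func relativeOrderingHolds(e1, e2, e3 []float64) bool {
	same := cosineDistance(e1, e1)
	return cosineDistance(e1, e2) > same && cosineDistance(e1, e3) > same
}

func main() {
	// Toy embeddings standing in for three non-identical clips.
	e1 := []float64{1, 0, 0}
	e2 := []float64{0, 1, 0}
	e3 := []float64{1, 1, 0}
	fmt.Println(relativeOrderingHolds(e1, e2, e3))
}
```

The check needs no absolute threshold, which is why it works with any three non-identical clips.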

Local run: 4/4 voice specs pass (Health, LoadModel, VoiceEmbed,
VoiceVerify) on the built backend image. 12 non-voice specs skipped
as expected.

Assisted-by: Claude:claude-opus-4-7

* fix(ci): checkout with submodules in the reusable backend_build workflow

The kokoros Rust backend build fails with

    failed to read .../sources/Kokoros/kokoros/Cargo.toml: No such file

because the reusable backend_build.yml workflow's actions/checkout
step was missing `submodules: true`. Dockerfile.rust does `COPY .
/LocalAI`, and without the submodule files the subsequent `cargo
build` can't find the vendored Kokoros crate.

The bug pre-dates this PR — scripts/changed-backends.js only triggers
the kokoros image job when something under backend/rust/kokoros or
the shared proto changes, so master had been coasting past it. The
voice-recognition proto addition re-broke it.

Other checkouts in backend.yml (llama-cpp-darwin) and test-extra.yml
(insightface, kokoros, speaker-recognition) already pass
`submodules: true`; this brings the shared backend image builder in
line.

Assisted-by: Claude:claude-opus-4-7
2026-04-23 12:07:14 +02:00
LocalAI [bot]
1c59165d63 chore(model gallery): 🤖 add 1 new models via gallery agent (#9505)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-23 09:32:44 +02:00
LocalAI [bot]
eb00d9b178 chore: ⬆️ Update leejet/stable-diffusion.cpp to c97702e1057c2fe13a7074cd9069cb9dd6edc1bf (#9495)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-23 09:32:21 +02:00
LocalAI [bot]
2068b6f43c feat(swagger): update swagger (#9498)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-22 22:51:39 +02:00
Ettore Di Giacinto
eb01c77214 fix(kokoros): implement face_verify and face_analyze trait stubs (#9499)
The backend.proto was updated to add FaceVerify and FaceAnalyze RPCs
(face detection support), but the Rust KokorosService was never updated
to match the regenerated tonic trait, breaking compilation with E0046:

    not all trait items implemented, missing: `face_verify`, `face_analyze`

Stubs both methods as unimplemented, matching the pattern used for the
other RPCs Kokoros does not support.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-04-22 22:51:18 +02:00
Richard Palethorpe
bb4fda6f0e chore(agents): Update the backend creation instructions to include Rust and extra tests (#9490)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-22 22:43:01 +02:00
Ettore Di Giacinto
f0c92610a1 feat(importer): expand importer flow to almost all backends (#9466)
* docs(agents): require importer integration when adding backends

Document the importer registry workflow so contributors know that adding
a new backend also requires updating the /import-model dropdown source:
either a new importer in core/gallery/importers/, extending an existing
one for drop-in replacements, or the pref-only slice for backends with
no reliable auto-detect signal. Always covered by a table-driven test.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for Batch 0 primitives

Introduce failing tests that drive Batch 0 of the importer expansion:

- pkg/huggingface-api: assert GetModelDetails populates PipelineTag and
  LibraryName from /api/models/{repo}, and that a failing metadata
  endpoint still returns file details (best-effort fetch).
- core/gallery/importers/helpers_test.go: new table-driven coverage for
  HasFile, HasExtension, HasONNX, HasONNXConfigPair, HasGGMLFile.
- core/gallery/importers/importers_test.go: assert ErrAmbiguousImport
  sentinel exists and round-trips through errors.Is.
- core/gallery/importers/local_test.go: extend with detection cases for
  ggml-*.bin (whisper), silero_vad.onnx (silero-vad), and the piper
  .onnx + .onnx.json pair.
- core/http/endpoints/localai/import_model_test.go: assert
  ImportModelURIEndpoint returns HTTP 400 with a structured
  {error, detail, hint} body when ErrAmbiguousImport surfaces.

All tests fail in the expected places (missing fields, missing
helpers, missing sentinel, endpoint still wraps as 500).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): Batch 0 foundation — helpers, sentinel, local detection

Implements the Batch 0 primitives that subsequent importer batches build on:

- pkg/huggingface-api: ModelDetails gains PipelineTag and LibraryName.
  GetModelDetails now layers a best-effort GET /api/models/{repo} fetch
  on top of ListFiles — a metadata outage leaves the fields empty but
  still returns full file details. Uses a dedicated response struct
  because the single-model endpoint uses snake_case keys while the list
  endpoint historically returned camelCase.

- core/gallery/importers/helpers.go: generic HasFile, HasExtension,
  HasONNX, HasONNXConfigPair, HasGGMLFile helpers working on
  []hfapi.ModelFile so per-backend importers can detect artefact
  patterns without duplicating string wrangling.

- core/gallery/importers/importers.go: adds the ErrAmbiguousImport
  sentinel. DiscoverModelConfig now returns it (wrapped with
  fmt.Errorf("%w: ...")) when no importer matched AND the HF
  pipeline_tag falls in a whitelist of narrow modalities (ASR, TTS,
  sentence-similarity, text-classification, object-detection). The
  whitelist is intentionally narrow — unknown tags keep the previous
  "no importer matched" behaviour to avoid blocking rare repos.

- core/gallery/importers/local.go: three new local-path detections,
  inserted before the existing merged-transformers branch:
    * ggml-*.bin → whisper
    * silero*.onnx → silero-vad
    * *.onnx + *.onnx.json pair → piper

- core/http/endpoints/localai/import_model.go: ImportModelURIEndpoint
  surfaces ErrAmbiguousImport as HTTP 400 with
  {error, detail, hint} JSON, preserving existing behaviour for
  unrelated errors.
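
A minimal sketch of what the file-pattern helpers listed above might look like. `ModelFile` is a hypothetical stand-in for `hfapi.ModelFile` that models only the path, and the signatures are assumptions; the real helpers may differ.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// ModelFile is a hypothetical stand-in for hfapi.ModelFile.
type ModelFile struct {
	Path string
}

// HasFile reports whether any file has exactly this base name.
func HasFile(files []ModelFile, name string) bool {
	for _, f := range files {
		if path.Base(f.Path) == name {
			return true
		}
	}
	return false
}

// HasExtension reports whether any file ends with ext (case-insensitive).
func HasExtension(files []ModelFile, ext string) bool {
	for _, f := range files {
		if strings.HasSuffix(strings.ToLower(f.Path), strings.ToLower(ext)) {
			return true
		}
	}
	return false
}

// HasGGMLFile matches the whisper.cpp "ggml-*.bin" naming convention.
func HasGGMLFile(files []ModelFile) bool {
	for _, f := range files {
		base := path.Base(f.Path)
		if strings.HasPrefix(base, "ggml-") && strings.HasSuffix(base, ".bin") {
			return true
		}
	}
	return false
}

// HasONNXConfigPair reports whether some X.onnx ships next to X.onnx.json.
func HasONNXConfigPair(files []ModelFile) bool {
	seen := make(map[string]bool, len(files))
	for _, f := range files {
		seen[f.Path] = true
	}
	for _, f := range files {
		if strings.HasSuffix(f.Path, ".onnx") && seen[f.Path+".json"] {
			return true
		}
	}
	return false
}

func main() {
	piper := []ModelFile{{Path: "voice.onnx"}, {Path: "voice.onnx.json"}}
	fmt.Println(HasONNXConfigPair(piper), HasGGMLFile(piper))
}
```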

Green tests:
  go test ./core/gallery/importers/... ./pkg/huggingface-api/... \
          ./core/http/endpoints/localai/...

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(importers): red tests for KnownBackend endpoint and importer metadata

Add failing tests that drive Batch UI-Dropdown:

- importers_test.go: assert importers expose Name/Modality/AutoDetects
  and that LlamaCPPImporter advertises drop-in replacements via a new
  AdditionalBackendsProvider interface. A Registry() accessor is also
  expected.

- backend_test.go (new): assert GET /backends/known returns
  []schema.KnownBackend, covers every importer, exposes drop-in
  llama-cpp replacements, includes curated pref-only backends, has no
  duplicates, and is sorted by Modality+Name.

These tests fail at compile time against master; they are intentionally
red so the follow-up green commit is reviewable.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery): add /backends/known endpoint for importer-aware backend list

Extend the Importer interface with Name/Modality/AutoDetects so the
import system can self-describe its registry, and introduce the
AdditionalBackendsProvider interface so importers can advertise drop-in
replacements (llama-cpp advertises ik-llama-cpp and turboquant).

Expose the new GET /backends/known endpoint that merges:

- the importer registry (auto-detect supported),
- drop-in replacements hosted by importers (preference-only),
- a curated knownPrefOnlyBackends slice for backends with no dedicated
  importer (sglang, tinygrad, trl, mlx-vlm, whisperx, kokoros, Qwen TTS
  variants, sam3-cpp) — kept at the top of backend.go so contributors
  adding a new pref-only backend have one obvious place to edit,
- backends installed on disk but unknown to the importer (marked
  AutoDetect=false, empty Modality).

The endpoint deliberately does NOT filter by gallery membership or host
capability (unlike /backends/available): LocalAI may auto-install a
backend that is not yet present, so the import form dropdown must show
everything the importer knows about.

Response is deduplicated (importer wins over pref-only) and sorted by
Modality+Name for deterministic output.

Registered in core/http/routes/localai.go next to /backends/available
under the same admin middleware.
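
The dedup-and-sort step described above might be sketched as follows. `KnownBackend` here is a trimmed, hypothetical version of `schema.KnownBackend`, `mergeKnown` is an illustrative name, and the modality strings are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// KnownBackend is a trimmed, hypothetical version of schema.KnownBackend.
type KnownBackend struct {
	Name       string
	Modality   string
	AutoDetect bool
}

// mergeKnown deduplicates by name (importer entries overwrite pref-only ones)
// and sorts by Modality, then Name, so the output is deterministic.
func mergeKnown(importers, prefOnly []KnownBackend) []KnownBackend {
	byName := make(map[string]KnownBackend)
	for _, b := range prefOnly {
		byName[b.Name] = b
	}
	for _, b := range importers {
		byName[b.Name] = b // importer wins over pref-only
	}
	out := make([]KnownBackend, 0, len(byName))
	for _, b := range byName {
		out = append(out, b)
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Modality != out[j].Modality {
			return out[i].Modality < out[j].Modality
		}
		return out[i].Name < out[j].Name
	})
	return out
}

func main() {
	importers := []KnownBackend{{Name: "whisper", Modality: "asr", AutoDetect: true}}
	prefOnly := []KnownBackend{{Name: "whisper"}, {Name: "kokoros", Modality: "tts"}}
	for _, b := range mergeKnown(importers, prefOnly) {
		fmt.Printf("%s %s autodetect=%v\n", b.Modality, b.Name, b.AutoDetect)
	}
}
```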

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(ui): source import form backend dropdown from /backends/known

Replace the hard-coded BACKENDS constant in ImportModel.jsx with a
live fetch of /backends/known on mount. Users now see every backend
the importer layer knows about (including preference-only entries)
grouped by modality, not a stale subset.

Changes:

- config.js: add backendsKnown endpoint constant next to
  backendsAvailable.
- api.js: add backendsApi.listKnown() wrapper.
- ImportModel.jsx: remove BACKENDS constant, fetch the list via
  useEffect, and derive grouped options via buildBackendOptions.
  Preference-only entries render with a " (preference-only)" suffix.
  Loading state disables the dropdown with a "Loading backends…"
  placeholder; on fetch failure the form falls back to auto-detect
  only and surfaces a non-blocking toast.
- SearchableSelect.jsx: accept items flagged isHeader=true and render
  them as non-selectable section dividers. Keyboard navigation skips
  headers and search queries hide them so filtered output stays
  relevant.

Vitest is not set up in this project (devDependencies ship Playwright
only). Per the brief's guard-rail, no frontend test framework is
introduced; coverage is provided by the Go handler tests that assert
the /backends/known contract consumed by the React form.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for whisper importer

Asserts detection on ggerganov/whisper.cpp (via ggml-*.bin filename),
the preferences.backend=whisper override path for arbitrary URIs,
and the Importer interface metadata (name/modality/autodetect).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add whisper importer

Recognises whisper.cpp GGML models by the "ggml-*.bin" filename
convention (direct URL or HF repo member) and by the explicit
preferences.backend="whisper" override. Emits backend: whisper with
the transcript use-case. Registered before llama-cpp so the narrow
filename signal wins before any generic GGUF match is attempted.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for moonshine importer

Asserts detection on UsefulSensors/moonshine-tiny via owner + ONNX
files, the preferences.backend=moonshine override for arbitrary URIs,
and the Importer interface metadata (name/modality/autodetect).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add moonshine importer

Matches UsefulSensors-owned HF repos whose artefacts or metadata
identify them as ASR: on-disk .onnx files (the canonical Moonshine
packaging) OR pipeline_tag=automatic-speech-recognition (covers
transformers/safetensors-only sibling repos). preferences.backend=
moonshine overrides detection. Test uses the live moonshine-tiny
repo because the canonical UsefulSensors/moonshine repo currently
hits a recursive-subfolder bug in pkg/huggingface-api ListFiles.

Registered after WhisperImporter but before LlamaCPPImporter and
TransformersImporter so the narrower owner+ASR signal wins before
the generic tokenizer.json check routes the repo to transformers.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for nemo importer

Asserts detection on nvidia/parakeet-tdt-0.6b-v3 via owner + .nemo
file, the preferences.backend=nemo override for arbitrary URIs, and
the Importer interface metadata (name/modality/autodetect).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add nemo importer

Matches nvidia-owned HF repos that ship a .nemo checkpoint archive,
the canonical NeMo ASR packaging. preferences.backend=nemo forces
detection. Registered between moonshine and llama-cpp so the narrow
owner + extension signal wins before any downstream generic matcher.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for faster-whisper importer

Asserts detection on Systran/faster-whisper-large-v3 (owner +
model.bin + config.json + ASR pipeline), the preferences.backend=
faster-whisper override for arbitrary URIs, and the Importer
interface metadata.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add faster-whisper importer

Recognises CTranslate2-packaged whisper checkpoints distributed for
the faster-whisper runtime: model.bin + config.json + ASR
pipeline_tag, narrowed to Systran-owned repos or repo names
containing "faster-whisper" to avoid falsely claiming vanilla
OpenAI whisper HF repos. preferences.backend=faster-whisper
overrides detection. Registered before llama-cpp and transformers
so the narrow signal wins before tokenizer.json routes the repo to
the generic transformers importer.
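
The detection rule above could be expressed along these lines. `repoInfo` and `looksLikeFasterWhisper` are hypothetical names for this sketch, and the case-insensitive owner comparison is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// repoInfo is a hypothetical summary of what the importer knows about a repo.
type repoInfo struct {
	Owner       string
	Name        string
	PipelineTag string
	Files       []string
}

// hasFile reports whether the file list contains the exact name.
func hasFile(files []string, name string) bool {
	for _, f := range files {
		if f == name {
			return true
		}
	}
	return false
}

// looksLikeFasterWhisper requires the CTranslate2 layout plus the ASR tag,
// then narrows by owner or repo name so vanilla whisper repos are not claimed.
func looksLikeFasterWhisper(r repoInfo) bool {
	if !hasFile(r.Files, "model.bin") || !hasFile(r.Files, "config.json") {
		return false
	}
	if r.PipelineTag != "automatic-speech-recognition" {
		return false
	}
	return strings.EqualFold(r.Owner, "Systran") ||
		strings.Contains(strings.ToLower(r.Name), "faster-whisper")
}

func main() {
	r := repoInfo{
		Owner:       "Systran",
		Name:        "faster-whisper-large-v3",
		PipelineTag: "automatic-speech-recognition",
		Files:       []string{"model.bin", "config.json"},
	}
	fmt.Println(looksLikeFasterWhisper(r))
}
```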

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for qwen-asr importer

Asserts detection on Qwen/Qwen3-ASR-1.7B via owner + ASR substring
in the repo name, the preferences.backend=qwen-asr override for
arbitrary URIs, and the Importer interface metadata.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add qwen-asr importer

Matches Qwen-owned HF repos whose name contains "ASR"
(case-insensitive), routing them to the qwen-asr backend rather
than the generic transformers/vllm path. The substring check scans
the repo portion only so the owner field cannot leak a false match.
preferences.backend=qwen-asr forces detection. Registered before
llama-cpp and transformers so the narrow owner+name signal wins.
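
As a rough illustration of the owner-plus-name rule, assuming the importer sees an `owner/repo` identifier (`looksLikeQwenASR` is a hypothetical name, and the exact owner comparison is an assumption):

```go
package main

import (
	"fmt"
	"strings"
)

// looksLikeQwenASR splits "owner/repo" and scans only the repo portion for
// "asr", so nothing in the owner field can satisfy the substring check.
func looksLikeQwenASR(repoID string) bool {
	owner, name, ok := strings.Cut(repoID, "/")
	if !ok {
		return false
	}
	return owner == "Qwen" && strings.Contains(strings.ToLower(name), "asr")
}

func main() {
	fmt.Println(looksLikeQwenASR("Qwen/Qwen3-ASR-1.7B"))
	fmt.Println(looksLikeQwenASR("Qwen/Qwen3-8B"))
}
```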

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): ASR ambiguity surfaces ErrAmbiguousImport

Locks in the behaviour added in Batch 0: an HF repo whose pipeline_tag
marks it as automatic-speech-recognition but whose artefacts match no
ASR importer (and no generic importer) must fail with
ErrAmbiguousImport so callers know to pass preferences.backend rather
than silently guess. pyannote/voice-activity-detection is the fixture
— its file list is only config.yaml + README, leaving every importer's
artefact check negative.
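
The sentinel-wrapping behaviour this spec locks in can be sketched as below. `noMatchError` and `isAmbiguous` are hypothetical helpers; the whitelist entries follow the Batch 0 description, spelled as Hugging Face pipeline tags.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAmbiguousImport is the sentinel callers test for with errors.Is.
var ErrAmbiguousImport = errors.New("ambiguous import")

// ambiguousModalities is the narrow whitelist of pipeline tags; unknown tags
// keep the plain "no importer matched" behaviour.
var ambiguousModalities = map[string]bool{
	"automatic-speech-recognition": true,
	"text-to-speech":               true,
	"sentence-similarity":          true,
	"text-classification":          true,
	"object-detection":             true,
}

// noMatchError is a hypothetical helper invoked once every importer declined.
func noMatchError(uri, pipelineTag string) error {
	if ambiguousModalities[pipelineTag] {
		return fmt.Errorf("%w: %s has pipeline_tag %q", ErrAmbiguousImport, uri, pipelineTag)
	}
	return fmt.Errorf("no importer matched %s", uri)
}

// isAmbiguous shows the errors.Is round-trip through the %w wrapping.
func isAmbiguous(err error) bool {
	return errors.Is(err, ErrAmbiguousImport)
}

func main() {
	err := noMatchError("pyannote/voice-activity-detection", "automatic-speech-recognition")
	fmt.Println(isAmbiguous(err), err)
}
```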

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for piper importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add piper importer

Detects piper TTS voices by the canonical <voice>.onnx + <voice>.onnx.json
pair packaging (via HasONNXConfigPair). Narrow enough to skip generic
ONNX repos used by other backends (Moonshine ASR, sentence-transformers).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for bark importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add bark importer

Detects Suno's Bark TTS checkpoints by HF owner "suno" + repo name
prefix "bark". Adds HFOwnerRepoFromURI() helper so importers can fall
back to URI parsing when pkg/huggingface-api's recursive tree listing
errors on repos with nested subdirectories (suno/bark ships a
speaker_embeddings/v2 subtree that trips a pre-existing path-doubling
bug in the listFilesInPath recursion).
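
A URI-parsing fallback of this kind might look like the sketch below. Only the helper name comes from this commit; the signature and the accepted URI prefixes are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// hfPrefixes lists URI forms this sketch understands. The set is an
// assumption; the real helper may accept other forms.
var hfPrefixes = []string{"https://huggingface.co/", "huggingface://", "hf://"}

// HFOwnerRepoFromURI extracts owner and repo without contacting the HF API,
// so it still works when the recursive tree listing errors out.
func HFOwnerRepoFromURI(uri string) (owner, repo string, ok bool) {
	for _, p := range hfPrefixes {
		if !strings.HasPrefix(uri, p) {
			continue
		}
		parts := strings.Split(strings.TrimPrefix(uri, p), "/")
		if len(parts) < 2 || parts[0] == "" || parts[1] == "" {
			return "", "", false
		}
		return parts[0], parts[1], true
	}
	return "", "", false
}

func main() {
	owner, repo, ok := HFOwnerRepoFromURI("https://huggingface.co/suno/bark")
	fmt.Println(owner, repo, ok)
}
```

With owner and repo in hand, an importer can still apply an owner or name-prefix rule even though no file listing is available.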

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for fish-speech importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add fish-speech importer

Detects Fish Audio TTS releases by HF owner "fishaudio" with a URI-based
fallback for repos whose tree recursion trips the pre-existing hfapi
path-doubling bug.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for outetts importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add outetts importer

Detects OuteAI's OuteTTS releases by HF owner "OuteAI" or a case-
insensitive "OuteTTS" substring in the repo name, with a URI-based
fallback for recursion-bugged repos.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for voxcpm importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add voxcpm importer

Detects OpenBMB's VoxCPM TTS family by repo-name substring (community
mirrors re-host the weights under many owners — mlx-community,
bluryar, callgg, etc).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for kokoro importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add kokoro importer

Detects hexgrad's Kokoro TTS by the "Kokoro" repo-name substring paired
with a PyTorch .pth/.pt checkpoint — the pairing excludes ONNX-only
mirrors (handled by the pref-only `kokoros` Rust runtime) and GGUF
mirrors (handled by llama-cpp).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for kitten-tts importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add kitten-tts importer

Detects KittenML's kitten-tts releases by owner or "kitten-tts" repo-name
substring, with URI-parsing fallback.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for neutts importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add neutts importer

Detects Neuphonic's NeuTTS releases by owner "neuphonic" or "neutts"
repo-name substring, with URI-parsing fallback.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for chatterbox importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add chatterbox importer

Detects Resemble AI's Chatterbox TTS by owner "ResembleAI" or
"chatterbox" repo-name substring, with URI-parsing fallback.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for vibevoice importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add vibevoice importer

Detects Microsoft's VibeVoice TTS by "vibevoice" repo-name substring
(case-insensitive) so community mirrors still route here.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for coqui importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add coqui importer

Detects Coqui AI's TTS releases (XTTS-v2, YourTTS, …) by the
authoritative `coqui` HF owner, with URI-parsing fallback.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): TTS ambiguity surfaces ErrAmbiguousImport

Adds a Ginkgo spec that imports nari-labs/Dia-1.6B — a real HF repo
carrying pipeline_tag="text-to-speech" whose artefacts (*.pth, one
safetensors shard, preprocessor_config.json, config.json) match none of
the Batch-2 TTS importers nor the generic text/image importers — and
asserts DiscoverModelConfig wraps ErrAmbiguousImport via errors.Is.

Also pivots the endpoint-level ambiguity fixture from hexgrad/Kokoro-82M
to nari-labs/Dia-1.6B. Batch 2 added a dedicated kokoro importer that
now claims the original fixture; Dia remains genuinely unclaimed and
so exercises the same ambiguity code path at the HTTP layer.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for stablediffusion-ggml importer

Covers HF repo detection (city96/FLUX.1-dev-gguf), raw .gguf URL matching on
filename arch tokens, preference override, and Importer interface metadata.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add stablediffusion-ggml importer

Detects GGUF-packed Stable Diffusion and FLUX checkpoints (leejet owner,
city96 FLUX mirrors, second-state SD dumps, raw .gguf URLs with arch
tokens) and routes them to the stablediffusion-ggml backend. Registered
BEFORE LlamaCPPImporter so .gguf image checkpoints are not stolen by
llama-cpp's generic .gguf match. Reuses HFOwnerRepoFromURI for the
hfapi-recursion-bug fallback. preferences.backend overrides detection.
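
Why registration order matters can be shown with a first-match-wins registry. This is a simplified sketch: the `importer` struct, `firstMatch`, and the arch tokens ("flux", "stable-diffusion") are hypothetical and not taken from the real importer.

```go
package main

import (
	"fmt"
	"strings"
)

// importer is a hypothetical, reduced form of the Importer interface.
type importer struct {
	name  string
	match func(uri string) bool
}

// isGGUF is the generic signal llama-cpp relies on in this sketch.
func isGGUF(uri string) bool {
	return strings.HasSuffix(strings.ToLower(uri), ".gguf")
}

// hasImageArchToken uses illustrative tokens only.
func hasImageArchToken(uri string) bool {
	u := strings.ToLower(uri)
	return strings.Contains(u, "flux") || strings.Contains(u, "stable-diffusion")
}

// registry returns the importers with the narrow matcher ahead of the generic one.
func registry() []importer {
	return []importer{
		{name: "stablediffusion-ggml", match: func(u string) bool { return isGGUF(u) && hasImageArchToken(u) }},
		{name: "llama-cpp", match: isGGUF},
	}
}

// firstMatch returns the first importer that claims the URI, or "".
func firstMatch(reg []importer, uri string) string {
	for _, imp := range reg {
		if imp.match(uri) {
			return imp.name
		}
	}
	return ""
}

func main() {
	fmt.Println(firstMatch(registry(), "https://example.com/flux1-dev-Q4_0.gguf"))
	fmt.Println(firstMatch(registry(), "https://example.com/llama-3-8b.Q4_K_M.gguf"))
}
```

Reversing the slice would hand every `.gguf` URI to llama-cpp, which is the failure mode the ordering avoids.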

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for ace-step importer

Covers HF repo-name detection (ACE-Step/ACE-Step-v1-3.5B), preference
override, and Importer interface metadata.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add ace-step importer

Routes ACE-Step music generation checkpoints (ACE-Step/ACE-Step-v1-3.5B,
ACE-Step/Ace-Step1.5, community mirrors) to the ace-step backend.
Matching is case-insensitive on the "ace-step" repo-name substring and
owner, with an HFOwnerRepoFromURI fallback for the hfapi recursion bug.
KnownUsecaseStrings mirrors the gallery's ace-step-turbo entry
(sound_generation, tts). preferences.backend overrides.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): surface ErrAmbiguousImport on text-to-image misses

Adds text-to-image to ambiguousModalities whitelist and covers the
h94/IP-Adapter-FaceID case — pipeline_tag=text-to-image but ships only
.bin/.safetensors so diffusers, stablediffusion-ggml, llama-cpp,
transformers, vllm, mlx, and ace-step all miss. DiscoverModelConfig now
surfaces ErrAmbiguousImport for that shape instead of the opaque
"no importer matched" error.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for vllm-omni importer

Introduces the test surface for the forthcoming VLLMOmniImporter:
detection via preferences.backend, Qwen owner + Omni repo token,
URI-only fallback, negative cases (plain Qwen, random OmniX repo), and
Import() emitting backend: vllm-omni with chat + multimodal usecases.

Includes a registration-order assertion via DiscoverModelConfig to pin
the requirement that vllm-omni wins over vllm for Qwen Omni repos
(tokenizer files are usually present too).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add vllm-omni importer

Adds VLLMOmniImporter for Qwen Omni-style multimodal checkpoints
(Qwen3-Omni, Qwen2.5-Omni, …). Detection is narrow: HF owner "Qwen"
combined with "omni" in the repo name, or a repo name matching the
-Omni-/Omni- naming pattern. preferences.backend="vllm-omni" always
wins; HFOwnerRepoFromURI provides a URI-only fallback for the hfapi
recursion-bug edge case.

Emitted YAML sets backend: vllm-omni and known_usecases: [chat,
multimodal], matching the gallery/index.yaml vllm-omni entries. The
importer is registered ahead of VLLMImporter so Qwen Omni repos —
which also carry tokenizer files — route to vllm-omni rather than the
plain vllm backend.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for llama-cpp drop-in preferences

Pins the expected drop-in replacement behaviour: preferences.backend
of ik-llama-cpp or turboquant must swap the emitted YAML backend
field while keeping the llama-cpp file layout identical. Also covers
the unknown-backend case (must stay llama-cpp) and re-asserts
AdditionalBackends() returns the two curated entries with non-empty
descriptions.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): llama-cpp honours ik-llama-cpp and turboquant drop-in preferences

preferences.backend set to ik-llama-cpp or turboquant now swaps the
emitted YAML backend field while leaving the file layout, model path,
mmproj handling and everything else in the llama-cpp Import pipeline
untouched. Unknown values are ignored and fall back to backend:
llama-cpp so arbitrary input can't leak into the config.

Aligns the AdditionalBackends() descriptions with the user-facing
naming conventions surfaced via /backends/known. No changes to the
pref-only curated list in endpoints/localai/backend.go: the two
drop-in names have always lived on the importer side via
AdditionalBackends.
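
The drop-in behaviour amounts to an allow-list lookup. A minimal sketch, with `resolveBackend` as a hypothetical name:

```go
package main

import "fmt"

// dropIns holds the two curated drop-in replacements named in this commit.
var dropIns = map[string]bool{
	"ik-llama-cpp": true,
	"turboquant":   true,
}

// resolveBackend returns the backend name to emit in the YAML. Anything not
// on the allow-list falls back to llama-cpp, so arbitrary preference input
// never reaches the generated config.
func resolveBackend(preferred string) string {
	if dropIns[preferred] {
		return preferred
	}
	return "llama-cpp"
}

func main() {
	fmt.Println(resolveBackend("turboquant"))
	fmt.Println(resolveBackend("something-else"))
}
```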

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for silero-vad importer

Add the SileroVADImporter test fixtures covering metadata, preference
overrides, snakers4 + onnx detection, silero_vad.onnx canonical filename,
URI fallback, and live HF discovery. Implementation follows in the next
commit.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add silero-vad importer

Recognise the Silero VAD ONNX packaging: the canonical silero_vad.onnx
filename or any ONNX file under the snakers4 owner. Emits a
backend: silero-vad config with the vad known_usecase, and attaches the
canonical file entry when present so the weights download on import.

Registered before the generic importers so the unique-filename signal
takes precedence over any downstream tokenizer-based matcher.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for rerankers importer

Cover the RerankersImporter contract: interface metadata, preference
override, cross-encoder owner detection, case-insensitive 'reranker'
substring match (BAAI/bge-reranker, Alibaba-NLP/gte-reranker), URI
fallback, and the full-discovery ordering check that a BAAI reranker
repo must route to the rerankers importer rather than transformers.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add rerankers importer

Recognise reranker repositories — cross-encoder owner or any repo whose
name contains 'reranker' (case-insensitive). Emits backend: rerankers
with reranking: true and the rerank known_usecase.

Registered ahead of sentencetransformers and transformers so reranker
repos that happen to ship tokenizer.json or modules.json still route
here.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for sentencetransformers importer

Cover the SentenceTransformersImporter contract: interface metadata,
preference override, modules.json marker file, sentence_bert_config.json
marker file, sentence-transformers owner, URI fallback, and the
full-discovery ordering check that ensures a sentence-transformers HF
URI routes here rather than transformers.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add sentencetransformers importer

Recognise sentence-transformers embedding repos by modules.json,
sentence_bert_config.json, or the sentence-transformers owner. Emits
backend: sentencetransformers with embeddings: true and the embeddings
known_usecase.

Registered ahead of transformers so ST repos that carry tokenizer.json
still route here.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for rfdetr importer

Cover the RFDetrImporter contract: interface metadata, preference
override, case-insensitive rf-detr and rfdetr substring matches, URI
fallback, and negative cases. Implementation follows in the next
commit.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add rfdetr importer

Recognise RF-DETR object-detection repositories by a case-insensitive
'rf-detr' / 'rfdetr' substring in the repo name. Emits backend: rfdetr
with the detection known_usecase.

Registered ahead of transformers so RF-DETR repos with tokenizer
artefacts still route here.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): surface ErrAmbiguousImport on sentence-similarity misses

Add an ambiguity fixture covering the embeddings/rerankers modality.
Qdrant/bm25 carries pipeline_tag=sentence-similarity but ships only
config.json + stopword .txt files — none of the Batch 5 importers
(silero-vad, rerankers, sentencetransformers, rfdetr) or the generic
vllm/transformers/llama-cpp/mlx/diffusers importers match. Because the
modality is in the ambiguous whitelist, DiscoverModelConfig must
surface ErrAmbiguousImport.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(localai/backend): red tests for KnownBackend.Installed flag

Extend the /backends/known suite with three failing cases that pin down
the forthcoming Installed field: JSON field presence on every entry,
flipping to true when an importer-registered backend is also present on
disk (and staying false for non-installed pref-only entries), and
surfacing system-only backends with empty modality and AutoDetect=false.

A small writeFakeSystemBackend helper plants a run.sh under the backends
dir so gallery.ListSystemBackends recognises the fixture.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(schema,localai/backend): add Installed flag to KnownBackend

Add an Installed bool to schema.KnownBackend and populate it from the
/backends/known handler so the React import form can warn users that
picking a not-yet-installed backend will trigger an automatic download
on submit.

Computation: after merging the importer registry, the entries from the
additional-backends provider, and the curated pref-only slice, the handler
walks gallery.ListSystemBackends(systemState) and either flips the existing
map entry's Installed flag to true (preserving modality / autodetect /
description metadata) or inserts a bare {Installed:true} entry for
system-only backends the importer layer doesn't know about.
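
The merge step can be sketched as follows. This is an illustration in Python, not the handler's code; the dictionary shape and the `mark_installed` name are assumptions made for the example:

```python
from typing import Dict, Iterable


def mark_installed(known: Dict[str, dict], system_backends: Iterable[str]) -> Dict[str, dict]:
    """Return a copy of `known` with an `installed` flag on every entry.

    Entries already known keep their metadata and only flip the flag.
    Backends found on disk but unknown to the importer layer get a bare entry.
    """
    merged = {name: {**entry, "installed": False} for name, entry in known.items()}
    for name in system_backends:
        if name in merged:
            merged[name]["installed"] = True
        else:
            merged[name] = {"installed": True}
    return merged
```

Copying the entries first keeps the importer registry untouched, so the flag reflects the disk state at request time only.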

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(localai/import_model): structured ambiguous-import response

Add red tests covering the extended ambiguity shape the React import
form needs:

- ImportModelURIEndpoint must return an HTTP 400 body that exposes the
  detected `modality` (normalised to the importer modality key, e.g.
  "tts" for pipeline_tag=text-to-speech) and a list of `candidates`
  (backend names filtered by modality, excluding text-LLM backends).
- The importers package must surface a typed AmbiguousImportError so
  HTTP consumers can read Modality + Candidates without parsing the
  error string. errors.Is against the existing sentinel keeps working.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(localai/import_model): structured ambiguity response with modality + candidates

DiscoverModelConfig now returns a typed AmbiguousImportError that
carries the importer modality key, candidate backend names, the
original URI, and the raw HF pipeline_tag. Its Is() preserves
errors.Is(err, ErrAmbiguousImport) for legacy callers.

The importer modality is pre-mapped from the HF pipeline_tag
(automatic-speech-recognition → asr, text-to-speech → tts, etc) via
PipelineTagToModality — surfaced as an exported helper so downstream
consumers can avoid duplicating the table. CandidatesForModality
filters the default importer registry plus AdditionalBackendsProvider
drop-ins by modality, sorts deterministically, and is the single
source of truth used by ImportModelURIEndpoint.

ImportModelURIEndpoint now returns HTTP 400 with
  { error, detail, modality, candidates, hint }
when ambiguity fires, letting the React form render a modality-scoped
picker inline instead of a generic toast.
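
A rough Python sketch of the mapping and the response shape. Only the two pipeline tags quoted above and the `ambiguous import` error string come from this series; the `detail` and `hint` wording and the backend names in the usage are hypothetical:

```python
from typing import Dict, Iterable, List

# Only the two mappings named in the commit message; the real table is larger.
PIPELINE_TAG_TO_MODALITY: Dict[str, str] = {
    "automatic-speech-recognition": "asr",
    "text-to-speech": "tts",
}


def candidates_for_modality(modality: str, backends: Iterable[dict]) -> List[str]:
    """Return sorted, de-duplicated backend names registered for one modality."""
    return sorted({b["name"] for b in backends if b["modality"] == modality})


def ambiguity_body(pipeline_tag: str, backends: Iterable[dict]) -> dict:
    """Build the structured body returned with HTTP 400 on an ambiguous import."""
    modality = PIPELINE_TAG_TO_MODALITY.get(pipeline_tag, "")
    return {
        "error": "ambiguous import",
        "detail": f"no importer matched pipeline_tag={pipeline_tag}",  # hypothetical wording
        "modality": modality,
        "candidates": candidates_for_modality(modality, backends),
        "hint": "pick a backend and resubmit",  # hypothetical wording
    }
```

Sorting the candidate names keeps the response stable across requests, which is what lets UI tests assert on chip order.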

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(ui/import): manual pick badge + tooltip

Red Playwright coverage for the preference-only → manual pick rename:

- The Backend dropdown renders a "manual pick" badge on every option
  whose KnownBackend.auto_detect is false.
- The badge carries a title attribute with hover-tooltip copy that
  explains auto-detect won't route to this backend.
- Auto-detectable backends must NOT carry the badge.
- The legacy " (preference-only)" suffix is gone from every label.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* ui(import): replace preference-only suffix with manual pick badge

SearchableSelect option rows now support an optional badge field — a
muted pill rendered to the right of the label with an optional title
attribute for native hover tooltips. The badge is plain text, so screen
readers read it alongside the option name.

buildBackendOptions in ImportModel stops appending " (preference-only)"
to the label and instead sets badge="manual pick" plus a descriptive
tooltip on every option whose auto_detect is false. The Backend help
text explains what "manual pick" means so users aren't left wondering
about the badge.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(ui/import): inline ambiguity picker

Red Playwright coverage for Batch A2 — when the server returns a 400
ambiguity body, the form must render an inline alert instead of a
toast, expose one clickable chip per candidate backend, and support
both auto-resubmit on pick and silent dismiss.

- Mocks /api/models/import-uri with the structured ambiguity body
  (error, detail, modality, candidates, hint).
- On first click of Import, the alert is visible, carries
  modality-specific copy, and shows a chip per candidate.
- Clicking a chip clears the alert, sets the Backend dropdown, and
  triggers a second POST to /api/models/import-uri.
- Dismissing the alert leaves the Backend dropdown on Auto-detect —
  no implicit backend assignment.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(ui/import): inline ambiguity alert with candidate chips

Adds AmbiguityAlert — a soft, info-coloured card rendered above the URI
input when the server returns a structured 400 with { modality,
candidates }. Message is modality-aware (tts/asr/embeddings/image/
reranker/detection get purpose-written copy, everything else falls back
to a generic template). Each candidate is a clickable chip that shows a
download icon when /backends/known marks the backend as not yet
installed, so users aren't surprised by an implicit install.

ImportModel wires the alert to handleSimpleImport's error path:
- api.handleResponse now attaches { status, body } to the thrown Error
  so pages can pattern-match on structured responses instead of string
  error messages.
- handleSimpleImport detects `status === 400 && body.error === 'ambiguous
  import'` and flips into the inline-picker mode instead of toasting.
- Clicking a chip sets prefs.backend and auto-resubmits (passing the
  picked backend as an override so setPrefs's asynchrony doesn't leak
  a stale value).
- Dismissing clears the alert; changing the URI or the backend also
  clears it so a stale alert never sticks around.

Test fixtures mock GET /backends/known + POST /models/import-uri so the
Playwright specs don't depend on real network reachability.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(ui/import): auto-install warning

Red Playwright coverage for Batch A3 — when the user picks a backend
whose KnownBackend.installed is false, the form must render a muted
inline note under the Backend dropdown warning that submitting will
download the backend first. Picking an installed backend or leaving
Auto-detect selected must keep the note hidden.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(ui/import): auto-install warning under backend dropdown

When the user picks a backend whose KnownBackend.installed is false,
render a muted inline note under the Backend dropdown's help text
warning that submitting will download the backend first. The note
lives inside the same form-group so it lines up with the existing
hint text; it's hidden when Auto-detect is selected (the selected
backend is unknowable at that point) or when the chosen backend is
already on disk.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* ui(import): drop redundant section header, adjust icons, rename HF shortcut

- Remove the "Import from URI" card-level <h2> — the page title already
  says "Import New Model" one row up, so the secondary header was
  duplicating information.
- Swap the fa-star on "Common Preferences" for fa-sliders (stars imply
  favourites/ratings; this is just a preferences block) and move the
  Custom Preferences fa-sliders-h to fa-plus-circle so the two blocks
  read as distinct rather than as two sliders.
- Rename the HF shortcut from "Search GGUF on HF" → "Browse models on
  HF" and drop the `search=gguf` filter on the linked URL. The import
  form now supports ~40 backends; hard-coding GGUF in the copy no
  longer matches the form's actual reach.
- Pure polish — no behaviour change, covered by the existing Batch A
  Playwright suite.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(ui/import): batch B — simple/power switch, options, tabs, dialog

Adds a failing Playwright suite covering the full Batch B surface ahead
of implementation:

- B1: SimplePowerSwitch segmented control renders, toggles, persists to
  localStorage across reloads.
- B2: Simple-mode Options disclosure is collapsed by default; expanding
  exposes only Backend, Model Name, Description (no quantizations,
  mmproj, model type, or custom prefs).
- B3: Power mode has Preferences and YAML tabs with a persistent
  selection across reloads; URI/name/description typed in Simple carry
  over to Power; YAML tab swaps the primary action to Create.
- B4: Switching Power -> Simple with a custom preference set triggers
  the 3-button confirmation dialog (Keep / Discard / Cancel) with the
  documented semantics.

Tests fail against master — implementation lands in the following
commits.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(ui/import): add SimplePowerSwitch segmented control

Replaces the previous "Advanced Mode / Simple Mode" toggle button in the
page header with a two-segment control that flips between Simple and
Power. The control reuses the existing .segmented CSS shared with the
Sound page for visual consistency.

Mode state is persisted to localStorage under `import-form-mode` so
reloads land on the same view (default: simple). The boolean alias
`isAdvancedMode` is retained internally to minimise diff — subsequent
commits reshape the Simple and Power surfaces independently.

Closes B1 from the Batch B Playwright suite.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(ui/import): simple mode collapsible options, power tabs, switch dialog

Completes the Batch B surface in a single structural pass so Simple and
Power mode can evolve independently:

Simple mode
  - URI input + Ambiguity alert + Import button, plus a collapsible
    "Options" disclosure that exposes ONLY Backend, Model Name,
    Description. Quantizations / MMProj / Model Type / Diffusers fields
    / Custom Preferences are no longer rendered in Simple mode.

Power mode
  - In-page segmented "Preferences · YAML" tab strip. Active tab
    persists to localStorage under `import-form-power-tab`.
  - Preferences tab = the full existing preferences + custom prefs
    panel (no progressive disclosure yet — that's Batch D).
  - YAML tab = the existing CodeEditor. Primary button reads "Create"
    here, "Import Model" everywhere else.

Switch dialog
  - Power -> Simple with non-default prefs (advanced pref keys set,
    any custom-pref key non-empty, or YAML edited away from the
    template) opens a 3-button dialog: Keep & switch / Discard &
    switch / Cancel.
  - Keep preserves all state. Discard resets prefs + customPrefs + YAML
    to defaults. Cancel leaves the user in Power mode.

Page subtitle reflects the current surface (Simple, Power/Preferences,
Power/YAML). Estimate banner renders everywhere except Power/YAML.

Closes B2/B3/B4 from the Batch B Playwright suite.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(ui/import): expand Options disclosure in Batch A tests

Batch B hid the Backend dropdown behind a collapsible Options disclosure
in Simple mode. The Batch A tests that exercise the dropdown directly
(manual-pick badge, ambiguity chip sets the selected backend, auto-
install warning) now click the disclosure toggle before asserting on
dropdown contents. Test intent is unchanged.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* ui(import): strip decorative icons from field labels

The preference panel had 12 Font Awesome icons decorating field labels
(Backend, Model Name, Description, Quantizations, MMProj Quantizations,
Model Type, Pipeline Type, Scheduler Type, Enable Parameters, Embeddings,
CUDA, plus fa-link on Model URI). Every label screamed equally, flattening
the visual hierarchy.

Remove them. Keep icons where they carry meaning: page-level section
headers, URI format guide entries, primary buttons, the Simple-mode
Options disclosure, the ambiguity alert's fa-lightbulb, the auto-install
note's fa-download, and the Estimated-requirements banner's
fa-memory / fa-microchip / fa-download.

No new behaviour, no layout / spacing changes beyond removing the
orphaned icon margin. Playwright suite green.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(ui/import): progressive disclosure of preference fields

Cover the Batch D visibility matrix for Power > Preferences: Quantizations,
MMProj Quantizations, and Model Type each render only for the backends that
can consume them, stay visible when the backend is unset, and preserve any
value the user already typed when toggled off and back on. Also pin the
shrunk Description textarea at rows=2.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(ui/import): progressive disclosure + shorter description textarea

Gate Quantizations, MMProj Quantizations, and Model Type in the Power >
Preferences tab so each field only renders for the backends that can
actually consume it. Backend unset keeps everything visible. Hidden
fields' state is preserved (the JSX wrapper is guarded, not the
underlying prefs state) so users flipping backends back and forth don't
lose input.

Also shrink the Description textarea from rows=3 to rows=2 — it's
shared between Simple Options and Power Preferences so the change
applies to both.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(ui/import): enter-to-submit in Simple mode

Red test for Batch F3 — pressing Enter in the URI input must POST
/models/import-uri, and Enter in the Description textarea must insert
a newline without submitting the form.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(ui/import): enter-to-submit in Simple mode

Wrap the Simple-mode URI input + ambiguity alert + Options disclosure
in a <form> whose onSubmit calls handleSimpleImport. Pressing Enter in
the URI input (or any Simple-mode text input) now submits the import
without having to move the mouse to the header button. The Description
textarea keeps its native behaviour — Enter inserts a newline.

A hidden submit button is included because the visible Import button
lives outside the form in the page header; some browsers only fire
implicit Enter-submit when the form contains a submit-capable element.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* ui(import,SearchableSelect,components): aria-hidden on decorative icons

Every Font Awesome icon in the import form is decorative — its meaning
is already conveyed by adjacent visible text. Adding aria-hidden="true"
prevents screen readers from announcing the icon's Unicode code point as
content. Covers ImportModel.jsx (all remaining <i> glyphs) and
SearchableSelect.jsx (the trigger chevron).

AmbiguityAlert and SimplePowerSwitch already set aria-hidden on their
icons when the components landed in Batches A and B — no change needed
there.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* ui(SearchableSelect): responsive dropdown maxHeight + hover focus guard

F2 — replace fixed pixel heights with min(pixel, vh) so the dropdown
and its inner scroll region don't overflow short viewports. Outer
container: 260px -> min(260px, 60vh); inner listbox: 200px ->
min(200px, 50vh). Tall viewports still get the original pixel caps.
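
As a worked example of the arithmetic, the following Python helper computes what `min(<px>, <vh>)` resolves to for a given viewport height. The function name is made up for illustration; the browser does this calculation itself:

```python
def effective_max_height(cap_px: float, vh_percent: float, viewport_height_px: float) -> float:
    """Resolve CSS min(cap_px, vh_percent vh) for a viewport of the given height."""
    return min(cap_px, viewport_height_px * vh_percent / 100.0)
```

On a 1000px-tall viewport the outer container keeps its 260px cap, while on a 300px-tall viewport it shrinks to 180px (60vh) and the inner listbox to 150px (50vh).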

F5 — short-circuit onMouseEnter when the hovered row is already the
focused row. Avoids queueing a setFocusIndex call (and a render) for
every mousemove inside the same item — the state would be identical.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* ui(import): aria-label on custom preference rows

The Key / Value inputs and trash button in each Custom Preferences row
previously relied on placeholder text alone. Placeholders are not
accessible names — they vanish on input and screen readers do not
announce them consistently. Add row-indexed aria-labels so assistive
tech can distinguish "Preference key for row 1" from "row 2", and give
the trash button an explicit "Remove this preference" label.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(ui/import): modality chip row

Red tests for Batch E — a horizontal modality chip row that filters the
Backend dropdown by modality. Covers visibility in Simple-mode Options
and Power/Preferences (and absence in Power/YAML), filter behaviour,
mismatched-backend clearing with toast, ambiguity-alert auto-selection,
and radiogroup keyboard navigation.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(ui/import): add ModalityChips component + filter integration

Horizontal chip row (Any, Text, Speech, TTS, Image, Embeddings,
Rerankers, Detection, VAD) filters the Backend dropdown options to the
selected modality. Default is Any — no filter, current behaviour.

- New ModalityChips component (radiogroup pattern, roving tabindex,
  arrow-key navigation, Home/End).
- buildBackendOptions now accepts an optional modalityFilter so grouped
  output is narrowed before rendering.
- Chips render inside Simple-mode Options disclosure and Power >
  Preferences tab. Power > YAML stays unaffected.
- Switching the filter drops a mismatched backend selection and
  surfaces a toast so the auto-clear is visible.
- Ambiguity alerts auto-activate the matching chip so users see only
  relevant backends even if they dismiss the alert.

Tightens the Batch E tests' option-matching to the label <span> so the
"↵" keybind hint on the focused row doesn't break accessible-name
lookups.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* fix(ui/import): rename Power to Advanced + stop URI-formats toggle from submitting form

The "Supported URI Formats" disclosure button inside the Simple-mode form
lacked an explicit type attribute, so it defaulted to type="submit". Every
click triggered the form's onSubmit and surfaced the empty-URI validation
toast ("Please enter a model URI"). Marking it type="button" lets it
behave as a pure toggle.

While here, rename the user-visible "Power" label to "Advanced" in the
mode switch (button text + tooltip) and the Power-mode tab's aria-label,
matching the term users actually expect. The internal mode key stays
'power' so tests, localStorage, and data-testid selectors are untouched.

Assisted-by: Claude:claude-opus-4-7

* fix(system): fall back to cpu when meta backend lacks default capability

Meta backends like vllm and sglang enumerate concrete variants for
nvidia/amd/intel/cpu but omit a default: catch-all entry. On a no-GPU
host the reported capability is "default", and the previous Capability()
returned "default" unconditionally on a miss. IsCompatibleWith then saw
no "default" key and filtered the meta backend out of AvailableBackends.
The import flow's auto-install step then failed with "no backend found
with name <meta>", contradicting the UI's promise that the backend would
be downloaded on demand.

Try the explicit "default" key first, then fall back to "cpu" before
giving up. vllm now resolves to cpu-vllm on CPU-only Linux without
touching the gallery YAML.
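
The lookup order can be sketched like this. It is a Python illustration of the fallback chain only; the capability map and the `cuda12-vllm` variant name are hypothetical, while `cpu-vllm` is the variant named above:

```python
from typing import Dict, Optional


def resolve_variant(capabilities: Dict[str, str], reported: str) -> Optional[str]:
    """Pick a concrete backend for the reported capability.

    Tries the reported key, then an explicit "default" entry, then "cpu".
    Returns None when nothing matches, meaning the meta backend is incompatible.
    """
    for key in (reported, "default", "cpu"):
        if key in capabilities:
            return capabilities[key]
    return None
```

A meta backend that lists only `nvidia` and `cpu` variants now resolves to its `cpu` entry on a host that reports `default`, instead of being filtered out.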

Assisted-by: Claude:claude-opus-4-7
2026-04-22 22:42:37 +02:00
orbisai0security
bbeacf140d fix: remove unsafe sprintf() in grpc-server.cpp (#9486)
fix: V-001 security vulnerability

Automated security fix generated by Orbis Security AI
2026-04-22 21:57:29 +02:00
LocalAI [bot]
6820ec468f chore(model gallery): 🤖 add 1 new models via gallery agent (#9491)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-22 21:56:11 +02:00
Ettore Di Giacinto
20baec77ab feat(face-recognition): add insightface/onnx backend for 1:1 verify, 1:N identify, embedding, detection, analysis (#9480)
* feat(face-recognition): add insightface backend for 1:1 verify, 1:N identify, embedding, detection, analysis

Adds face recognition as a new first-class capability in LocalAI via the
`insightface` Python backend, with a pluggable two-engine design so
non-commercial (insightface model packs) and commercial-safe
(OpenCV Zoo YuNet + SFace) models share the same gRPC/HTTP surface.

New gRPC RPCs (backend/backend.proto):
  * FaceVerify(FaceVerifyRequest) returns FaceVerifyResponse
  * FaceAnalyze(FaceAnalyzeRequest) returns FaceAnalyzeResponse

Existing Embedding and Detect RPCs are reused (face image in
PredictOptions.Images / DetectOptions.src) for face embedding and
face detection respectively.

New HTTP endpoints under /v1/face/:
  * verify     — 1:1 image pair same-person decision
  * analyze    — per-face age + gender (emotion/race reserved)
  * register   — 1:N enrollment; stores embedding in vector store
  * identify   — 1:N recognition; detect → embed → StoresFind
  * forget     — remove a registered face by opaque ID

Service layer (core/services/facerecognition/) introduces a
`Registry` interface with one in-memory `storeRegistry` impl backed
by LocalAI's existing local-store gRPC vector backend. HTTP handlers
depend on the interface, not on StoresSet/StoresFind directly, so a
persistent PostgreSQL/pgvector implementation can be slotted in via a
single constructor change in core/application (TODO marker in the
package doc).

New usecase flag FLAG_FACE_RECOGNITION; insightface is also wired
into FLAG_DETECTION so /v1/detection works for face bounding boxes.

Gallery (backend/index.yaml) ships three entries:
  * insightface-buffalo-l   — SCRFD-10GF + ArcFace R50 + genderage
                              (~326MB pre-baked; non-commercial research use only)
  * insightface-opencv      — YuNet + SFace (~40MB pre-baked; Apache 2.0)
  * insightface-buffalo-s   — SCRFD-500MF + MBF (runtime download; non-commercial)

Python backend (backend/python/insightface/):
  * engines.py — FaceEngine protocol with InsightFaceEngine and
    OnnxDirectEngine; resolves model paths relative to the backend
    directory so the same gallery config works in docker-scratch and
    in the e2e-backends rootfs-extraction harness.
  * backend.py — gRPC servicer implementing Health, LoadModel, Status,
    Embedding, Detect, FaceVerify, FaceAnalyze.
  * install.sh — pre-bakes buffalo_l + OpenCV YuNet/SFace inside the
    backend directory so first-run is offline-clean (the final scratch
    image only preserves files under /<backend>/).
  * test.py — parametrized unit tests over both engines.

Tests:
  * Registry unit tests (go test -race ./core/services/facerecognition/...)
    — in-memory fake grpc.Backend, table-driven, covers register/
    identify/forget/error paths + concurrent access.
  * tests/e2e-backends/backend_test.go extended with face caps
    (face_detect, face_embed, face_verify, face_analyze); relative
    ordering + configurable verifyCeiling per engine.
  * Makefile targets: test-extra-backend-insightface-buffalo-l,
    -opencv, and the -all aggregate.
  * CI: .github/workflows/test-extra.yml gains tests-insightface-grpc,
    auto-triggered by changes under backend/python/insightface/.

Docs:
  * docs/content/features/face-recognition.md — feature page with
    license table, quickstart (defaults to the commercial-safe model),
    models matrix, API reference, 1:N workflow, storage caveats.
  * Cross-refs in object-detection.md, stores.md, embeddings.md, and
    whats-new.md.
  * Contributor README at backend/python/insightface/README.md.

Verified end-to-end:
  * buffalo_l: 6/6 specs (health, load, face_detect, face_embed,
    face_verify, face_analyze).
  * opencv: 5/5 specs (same minus face_analyze — SFace has no
    demographic head; correctly skipped via BACKEND_TEST_CAPS).

Assisted-by: Claude:claude-opus-4-7

* fix(face-recognition): move engine selection to model gallery, collapse backend entries

The previous commit put engine/model_pack options on backend gallery
entries (`backend/index.yaml`). That was wrong — `GalleryBackend`
(core/gallery/backend_types.go:32) has no `options` field, so the
YAML decoder silently dropped those keys and all three "different
insightface-*" backend entries resolved to the same container image
with no distinguishing configuration.

Correct split:

  * `backend/index.yaml` now has ONE `insightface` backend entry
    shipping the CPU + CUDA 12 container images. The Python backend
    bundles both the non-commercial insightface model packs
    (buffalo_l / buffalo_s) and the commercial-safe OpenCV Zoo
    weights (YuNet + SFace); the active engine is selected at
    LoadModel time via `options: ["engine:..."]`.

  * `gallery/index.yaml` gains three model entries —
    `insightface-buffalo-l`, `insightface-opencv`,
    `insightface-buffalo-s` — each setting the appropriate
    `overrides.backend` + `overrides.options` so installing one
    actually gives the user the intended engine. This matches how
    `rfdetr-base` lives in the model gallery against the `rfdetr`
    backend.
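
For illustration, `key:value` option strings of this kind could be parsed as below. This is a sketch that assumes the first colon separates key from value; the `parse_options` name and the engine value in the usage are hypothetical:

```python
from typing import Dict, Iterable


def parse_options(options: Iterable[str]) -> Dict[str, str]:
    """Split "key:value" strings on the first colon only.

    Splitting once keeps values that contain colons themselves, such as URLs.
    Items without a colon are ignored.
    """
    parsed: Dict[str, str] = {}
    for item in options:
        key, sep, value = item.partition(":")
        if sep:
            parsed[key.strip()] = value.strip()
    return parsed
```

A model entry whose overrides carry `options: ["engine:<name>"]` would then surface as `{"engine": "<name>"}` at LoadModel time.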

The earlier e2e tests passed despite this bug because the Makefile
targets pass `BACKEND_TEST_OPTIONS` directly to LoadModel via gRPC,
bypassing any gallery resolution entirely. No code changes needed.

Assisted-by: Claude:claude-opus-4-7

* feat(face-recognition): cover all supported models in the gallery + drop weight baking

Follows up on the model-gallery split: adds entries for every model
configuration either engine actually supports, and switches weight
delivery from image-baked to LocalAI's standard gallery mechanism.

Gallery now has seven `insightface-*` model entries (gallery/index.yaml):

  insightface (family)  — non-commercial research use
    • buffalo-l   (326MB)  — SCRFD-10GF + ResNet50 + genderage, default
    • buffalo-m   (313MB)  — SCRFD-2.5GF + ResNet50 + genderage
    • buffalo-s   (159MB)  — SCRFD-500MF + MBF + genderage
    • buffalo-sc  (16MB)   — SCRFD-500MF + MBF, recognition only
                             (no landmarks, no demographics — analyze
                             returns empty attributes)
    • antelopev2  (407MB)  — SCRFD-10GF + ResNet100@Glint360K + genderage

  OpenCV Zoo family — Apache 2.0 commercial-safe
    • opencv       — YuNet + SFace fp32 (~40MB)
    • opencv-int8  — YuNet + SFace int8 (~12MB, ~3x smaller, faster on CPU)

Model weights are no longer baked into the backend image. The image
now ships only the Python runtime + libraries (~275MB content size,
~1.18GB disk vs ~1.21GB when weights were baked). Weights flow through
LocalAI's gallery mechanism:

  * OpenCV variants list `files:` with ONNX URIs + SHA-256, so
    `local-ai models install insightface-opencv` pulls them into the
    models directory exactly like any other gallery-managed model.

  * insightface packs (upstream distributes .zip archives only, not
    individual ONNX files) auto-download on first LoadModel via
    FaceAnalysis' built-in machinery, rooted at the LocalAI models
    directory so they live alongside everything else — same pattern
    `rfdetr` uses with `inference.get_model()`.

Backend changes (backend/python/insightface/):

  * backend.py — LoadModel propagates `ModelOptions.ModelPath` (the
    LocalAI models directory) to engines via a `_model_dir` hint.
    This replaces the earlier ModelFile-dirname approach; ModelPath
    is the canonical "models directory" variable set by the Go loader
    (pkg/model/initializers.go:144) and is always populated.

  * engines.py::_resolve_model_path — picks up `model_dir` and searches
    it (plus basename-in-model-dir) before falling back to the dev
    script-dir. This is how OnnxDirectEngine finds gallery-downloaded
    YuNet/SFace files by filename only.

  * engines.py::_flatten_insightface_pack — new helper that works
    around an upstream packaging inconsistency: buffalo_l/s/sc zips
    expand flat, but buffalo_m and antelopev2 zips wrap their ONNX
    files in a redundant `<name>/` directory. insightface's own
    loader looks one level too shallow and fails. We call
    `ensure_available()` explicitly, flatten if nested, then hand to
    FaceAnalysis.

  * engines.py::InsightFaceEngine.prepare — root-resolution order now
    includes the `_model_dir` hint so packs download into the LocalAI
    models directory by default.

  * install.sh — no longer pre-downloads any weights. Everything is
    gallery-managed now.

  * smoke.py (new) — parametrized smoke test that iterates over every
    gallery configuration, simulating the LocalAI install flow
    (creates a models dir, fetches OpenCV files with checksum
    verification, lets insightface auto-download its packs), then
    runs detect + embed + verify (+ analyze where supported) through
    the in-process BackendServicer.

  * test.py — OnnxDirectEngineTest no longer hardcodes `/models/opencv/`
    paths; downloads ONNX files to a temp dir at setUpClass time and
    passes ModelPath accordingly.
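
The flattening step in `_flatten_insightface_pack` can be approximated with the sketch below. It is not the backend's code: it assumes the redundant directory carries the same name as the pack directory, and `flatten_pack` is a name chosen for the example:

```python
import shutil
from pathlib import Path


def flatten_pack(pack_dir: Path) -> bool:
    """Move files out of a redundant <pack>/<pack>/ directory into <pack>/.

    Returns True when a nested directory was found and flattened,
    False when the pack was already flat.
    """
    nested = pack_dir / pack_dir.name
    if not nested.is_dir():
        return False
    for item in nested.iterdir():
        shutil.move(str(item), str(pack_dir / item.name))
    nested.rmdir()  # the nested directory is empty at this point
    return True
```

The function is idempotent, so calling it on every load is safe: a second call on an already-flat pack returns False and changes nothing.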

Registry change (core/services/facerecognition/store_registry.go):

  * `dim=0` in NewStoreRegistry now means "accept whatever dimension
    arrives" — needed because the backend supports 512-d ArcFace/MBF
    and 128-d SFace via the same Registry. A non-zero dim still fails
    fast with ErrDimensionMismatch when an embedding's length differs.

  * core/application plumbs `faceEmbeddingDim = 0`, explaining the
    rationale in the comment.
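
The `dim=0` rule can be sketched as follows. This Python class only mirrors the described behaviour; the class, method, and exception names are illustrative and do not match the Go types:

```python
from typing import Dict, List, Sequence


class DimensionMismatch(ValueError):
    """Raised when an embedding does not match the registry's fixed dimension."""


class FaceRegistry:
    def __init__(self, dim: int = 0) -> None:
        # dim == 0 means "accept whatever dimension arrives".
        self.dim = dim
        self.entries: Dict[str, List[float]] = {}

    def register(self, face_id: str, embedding: Sequence[float]) -> None:
        if self.dim != 0 and len(embedding) != self.dim:
            raise DimensionMismatch(f"expected {self.dim}, got {len(embedding)}")
        self.entries[face_id] = list(embedding)
```

With `dim=0`, a 512-d and a 128-d vector can both be stored through the same registry instance. A registry built with `dim=512` rejects the 128-d vector immediately.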

Backend gallery description updated to reflect that the image carries
no weights — it's just Python + engines.

Smoke-tested all 7 configurations against the rebuilt image (with the
flatten fix applied), exit 0:

    PASS: insightface-buffalo-l    faces=6 dim=512 same-dist=0.000
    PASS: insightface-buffalo-sc   faces=6 dim=512 same-dist=0.000
    PASS: insightface-buffalo-s    faces=6 dim=512 same-dist=0.000
    PASS: insightface-buffalo-m    faces=6 dim=512 same-dist=0.000
    PASS: insightface-antelopev2   faces=6 dim=512 same-dist=0.000
    PASS: insightface-opencv       faces=6 dim=128 same-dist=0.000
    PASS: insightface-opencv-int8  faces=6 dim=128 same-dist=0.000
    7/7 passed

Assisted-by: Claude:claude-opus-4-7

* fix(face-recognition): pre-fetch OpenCV ONNX for e2e target; drop stale pre-baked claim

CI regression from the previous commit: I moved OpenCV Zoo weight
delivery to LocalAI's gallery `files:` mechanism, but the
test-extra-backend-insightface-opencv target was still passing
relative paths `detector_onnx:models/opencv/yunet.onnx` in
BACKEND_TEST_OPTIONS. The e2e suite drives LoadModel directly over
gRPC without going through the gallery, so those relative paths
resolved to nothing and OpenCV's ONNXImporter failed:

    LoadModel failed: Failed to load face engine:
    OpenCV(4.13.0) ... Can't read ONNX file: models/opencv/yunet.onnx

Fix: add an `insightface-opencv-models` prerequisite target that
fetches the two ONNX files (YuNet + SFace) to a deterministic host
cache at /tmp/localai-insightface-opencv-cache/, verifies SHA-256,
and skips the download on re-runs. The opencv test target depends on
it and passes absolute paths in BACKEND_TEST_OPTIONS, so the backend
finds the files via its normal absolute-path resolution branch.
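
The "verify SHA-256, skip the download on re-runs" check amounts to something like the sketch below. It is written in Python rather than as a Makefile recipe, and the helper names are assumptions for illustration:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path: Path, expected_sha256: str) -> bool:
    """True when the cached file is missing or its checksum does not match."""
    if not path.is_file():
        return True
    return sha256_of(path) != expected_sha256.lower()
```

A cached file with a matching digest is reused as-is; a missing or corrupted file triggers a fresh fetch before the test target runs.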

Also refresh the buffalo_l comment: it no longer says "pre-baked"
(nothing is — the pack auto-downloads from upstream's GitHub release
on first LoadModel, same as in CI).

Locally verified: `make test-extra-backend-insightface-opencv` passes
5/5 specs (health, load, face_detect, face_embed, face_verify).

Assisted-by: Claude:claude-opus-4-7

* feat(face-recognition): add POST /v1/face/embed + correct /v1/embeddings docs

The docs promised that /v1/embeddings returns face vectors when you
send an image data-URI. That was never true: /v1/embeddings is
OpenAI-compatible and text-only by contract — its handler goes
through `core/backend/embeddings.go::ModelEmbedding`, which sets
`predictOptions.Embeddings = s` (a string of TEXT to embed) and never
populates `predictOptions.Images[]`. The Python backend's Embedding
gRPC method does handle Images[] (that's how /v1/face/register reaches
it internally via `backend.FaceEmbed`), but the HTTP embeddings
endpoint wasn't wired to populate it.

Rather than overload /v1/embeddings with image-vs-text detection —
messy, and the endpoint is OpenAI-compatible by design — add a
dedicated /v1/face/embed endpoint that wraps `backend.FaceEmbed`
(already used internally by /v1/face/register and /v1/face/identify).

Matches LocalAI's convention of a dedicated path per non-standard flow
(/v1/rerank, /v1/detection, /v1/face/verify etc.).

Response:

    {
      "embedding": [<dim> floats, L2-normed],
      "dim": int,           // 512 for ArcFace R50 / MBF, 128 for SFace
      "model": "<name>"
    }

Live-tested on the opencv engine: returns a 128-d L2-normalized vector
(sum(x^2) = 1.0000). The docs are updated to note that /v1/embeddings
is text-only and to point image users at /v1/face/embed instead.
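
To make the `sum(x^2) = 1` check concrete, here is a small Python sketch that L2-normalises a vector and wraps it in the response shape shown above. The function names are hypothetical and the real endpoint gets its vector from the backend:

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def embed_response(vec: Sequence[float], model: str) -> Dict[str, object]:
    """Build a body with the embedding, its dimension, and the model name."""
    unit = l2_normalize(vec)
    return {"embedding": unit, "dim": len(unit), "model": model}
```

Because the vectors are unit length, cosine similarity between two embeddings reduces to a plain dot product.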

Assisted-by: Claude:claude-opus-4-7

* fix(http): map malformed image input + gRPC status codes to proper 4xx

Image-input failures on LocalAI's single-image endpoints (/v1/detection,
/v1/face/{verify,analyze,embed,register,identify}) have historically
returned 500 — even when the client was the one who sent garbage.
Classic example: you POST an "image" that isn't a URL, isn't a
data-URI, and isn't a valid JPEG/PNG — the server shouldn't claim
that's its fault.

Two helpers land in core/http/endpoints/localai/images.go and every
single-image handler is switched over:

  * decodeImageInput(s)
      Wraps utils.GetContentURIAsBase64 and turns any failure
      (invalid URL, not a data-URI, download error, etc.) into
      echo.NewHTTPError(400, "invalid image input: ...").

  * mapBackendError(err)
      Inspects the gRPC status on a backend call error and maps:
        INVALID_ARGUMENT     → 400 Bad Request
        NOT_FOUND            → 404 Not Found
        FAILED_PRECONDITION  → 412 Precondition Failed
        UNIMPLEMENTED        → 501 Not Implemented
      All other codes fall through unchanged (still 500).

Before, my 1×1 PNG error-path test returned:
    HTTP 500 "rpc error: code = InvalidArgument desc = failed to decode one or both images"
After:
    HTTP 400 "failed to decode one or both images"

Scope-limited to the LocalAI single-image endpoints. The multi-modal
paths (middleware/request.go, openresponses/responses.go,
openai/realtime.go) intentionally log-and-skip individual media parts
when decoding fails — different design intent (graceful degradation
of a multi-part message), not a 400-worthy failure. Left untouched.

Live-verified: every error case in /tmp/face_errors.py now returns
4xx with a meaningful message; the "image with no face (1x1 PNG)"
case specifically went from 500 → 400.

Assisted-by: Claude:claude-opus-4-7

* refactor(face-recognition): insightface packs go through gallery files:, drop FaceAnalysis

Follows up on the discovery that LocalAI's gallery `files:` mechanism
handles archives (zip, tar.gz, …) via mholt/archiver/v3 — the rhasspy
piper voices use exactly this pattern. Insightface packs are zip
archives, so we can now deliver them the same way every other
gallery-managed model gets delivered: declaratively, checksum-verified,
through LocalAI's standard download+extract pipeline.

Two changes:

1. Gallery (gallery/index.yaml) — every insightface-* entry gains a
   `files:` list with the pack zip's URI + SHA-256. `local-ai models
   install insightface-buffalo-l` now fetches the zip, verifies the
   hash, and extracts it into the models directory. No more reliance
   on insightface's library-internal `ensure_available()` auto-download
   or its hardcoded `BASE_REPO_URL`.

2. InsightFaceEngine (backend/python/insightface/engines.py) — drops
   the FaceAnalysis wrapper and drives insightface's `model_zoo`
   directly. The ~50 lines FaceAnalysis provides — glob ONNX files,
   route each through `model_zoo.get_model()`, build a
   `{taskname: model}` dict, loop per-face at inference — are
   reimplemented in `InsightFaceEngine`. The actual inference classes
   (RetinaFace, ArcFaceONNX, Attribute, Landmark) are still
   insightface's — we only replicate the glue, so drift risk against
   upstream is minimal.

   Why drop FaceAnalysis: it hard-codes a `<root>/models/<name>/*.onnx`
   layout that doesn't match what LocalAI's zip extraction produces.
   LocalAI unpacks archives flat into `<models_dir>`. Upstream packs
   are inconsistent — buffalo_l/s/sc ship ONNX at the zip root (lands
   at `<models_dir>/*.onnx`), buffalo_m/antelopev2 wrap in a redundant
   `<name>/` dir (lands at `<models_dir>/<name>/*.onnx`). The new
   `_locate_insightface_pack` helper searches both locations plus
   legacy paths and returns whichever has ONNX files. Replaces the
   earlier `_flatten_insightface_pack` helper (which tried to fight
   FaceAnalysis's layout expectations; now we just find the files
   wherever they are).

Net effect for users: install once via LocalAI's managed flow,
weights live alongside every other model, progress shows in the
jobs endpoint, no first-load network call. Same API surface,
cleaner plumbing.

Assisted-by: Claude:claude-opus-4-7

* fix(face-recognition): CI's insightface e2e path needs the pack pre-fetched

The e2e suite drives LoadModel over gRPC without going through LocalAI's
gallery flow, so the engine's `_model_dir` option (normally populated
from ModelPath) is empty. Previously the insightface target relied on
FaceAnalysis auto-download to paper over this, but we dropped
FaceAnalysis in favor of direct model_zoo calls — so the buffalo_l
target started failing at LoadModel with "no insightface pack found".

Mirror the opencv target's pre-fetch pattern: download buffalo_sc.zip
(same SHA as the gallery entry), extract it on the host, and pass
`root:<dir>` so the engine locates the pack without needing
ModelPath. Switched to buffalo_sc (smallest pack, ~16MB) to keep CI
fast; it covers the same insightface engine code path as buffalo_l.

The face-analyze capability is dropped from the target because
buffalo_sc has no age/gender head.

Assisted-by: Claude:claude-opus-4-7[1m]

* feat(face-recognition): surface face-recognition in advertised feature maps

The six /v1/face/* endpoints were missing from every place LocalAI
advertises its feature surface to clients:

  * api_instructions — the machine-readable capability index at
    GET /api/instructions. Added `face-recognition` as a dedicated
    instruction area with an intro that calls out the in-memory
    registry caveat and the /v1/face/embed vs /v1/embeddings split.
  * auth/permissions — added FeatureFaceRecognition constant, routed
    all six face endpoints through it so admins can gate them per-user
    like any other API feature. Default ON (matches the other API
    features).
  * React UI capabilities — CAP_FACE_RECOGNITION symbol mapped to
    FLAG_FACE_RECOGNITION. Declared only for now; the Face page is a
    follow-up (noted in the plan).

Instruction count bumped 9 → 10; test updated.

Assisted-by: Claude:claude-opus-4-7[1m]

* docs(agents): capture advertising-surface steps in the endpoint guide

Before this change, adding a new /v1/* endpoint reliably missed one or
more of: the swagger @Tags annotation, the /api/instructions registry,
the auth RouteFeatureRegistry, and the React UI CAP_* symbol. The
endpoint would work but be invisible to API consumers, admins, and the
UI — and nothing in the existing docs said to look in those places.

Extend .agents/api-endpoints-and-auth.md with a new "Advertising
surfaces" section covering all four surfaces (swagger tags, /api/
instructions, capabilities.js, docs/), and expand the closing checklist
so it's impossible to ship a feature without visiting each one. Hoist a
one-liner reminder into AGENTS.md's Quick Reference so agents skim it
before diving in.

Assisted-by: Claude:claude-opus-4-7[1m]
2026-04-22 21:55:41 +02:00
Richard Palethorpe
d16f19f1eb fix(kokoros): Build and publish the backend images from CI/CD (#9487)
* fix(kokoros): Build and publish the backend images from CI/CD

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* Delete .claude/agents

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

* Delete .claude/commands

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

* Delete .claude/settings.json

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

* Delete .claude/skills

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-22 13:19:55 +02:00
LocalAI [bot]
cd7b035716 chore: ⬆️ Update ggml-org/llama.cpp to 5a4cd6741fc33227cdacb329f355ab21f8481de2 (#9479)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-22 08:58:19 +02:00
LocalAI [bot]
0f3bb2d647 chore(model gallery): 🤖 add 1 new model via gallery agent (#9481)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-22 08:22:05 +02:00
Adira
607efe5a4c fix(backend-monitor): accept model as a query parameter (#9411)
The /backend/monitor endpoint is routed as GET but its handler bound the
model name from a request body. HTTP defines no semantics for a GET body,
and Swagger UI and OpenAPI codegen tools refuse to send one.

Switch to reading ?model=<name> as a query parameter and update the
Swagger annotation, regenerated spec files, and documentation. The
handler still falls back to body binding when the query parameter is
absent, so existing clients sending {"model": "..."} continue to work.

Fixes #9207

Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
2026-04-21 22:06:35 +02:00
Ettore Di Giacinto
7d8c1d5e45 fix(streaming): dedupe content, recover reasoning, unique tool_call IDs in deferred flush (#9470)
* fix(streaming): dedupe content, recover reasoning, unique tool IDs

When tool calls are discovered only during final parsing (after the
streaming token callback returns), processTools' default switch branch
used to emit the full accumulated content alongside the tool_call args
chunk. Clients that accumulate delta.content per the OpenAI streaming
contract end up showing every narration line twice. Three related bugs
in the same flush path:

1. Content duplication: the args chunk carried Content:textContentToReturn
   even though the text had already been streamed token-by-token via
   the token callback, so delta.content was both the running total and
   bundled with tool_calls in one delta (two spec violations).
2. Reasoning drop: when the C++ autoparser surfaces reasoning only as
   a final aggregate (no incremental tokens), the callback never emits
   it and the flush branch didn't either, silently losing it.
3. tool_call ID collision: empty ss.ID fell back to the request id, so
   multiple empty-ID calls in the same turn all shared the same id,
   breaking tool_result matching by tool_call_id.

Extracted the block into buildDeferredToolCallChunks (pure function,
unit-testable) and added 19 Ginkgo specs covering streamed vs.
not-streamed content/reasoning, single vs. multi call, and
incremental-vs-deferred emission. Every case asserts the invariant
that no delta carries both non-empty Content/Reasoning and non-empty
ToolCalls.

Fix summary:
- emit reasoning in its own leading chunk when !reasoningAlreadyStreamed
- emit role+content in their own chunks when !contentAlreadyStreamed
- drop Content from the tool_call args chunk
- fall back to fmt.Sprintf("%s-%d", id, i) for empty ss.ID so calls stay
  uniquely addressable
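
The ID fallback in the last bullet can be sketched as below. Only the
`fmt.Sprintf("%s-%d", id, i)` format comes from this commit;
`toolCallIDs` and its inputs are hypothetical and exist for illustration.

```go
package main

import "fmt"

// toolCallIDs returns one ID per tool call. A call that arrived with an
// empty ID gets "<requestID>-<index>", so several empty-ID calls in the
// same turn no longer collapse onto the bare request ID.
func toolCallIDs(requestID string, ids []string) []string {
	out := make([]string, len(ids))
	for i, id := range ids {
		if id == "" {
			id = fmt.Sprintf("%s-%d", requestID, i)
		}
		out[i] = id
	}
	return out
}

func main() {
	// Two calls without an ID and one with a backend-supplied ID.
	fmt.Println(toolCallIDs("chatcmpl-123", []string{"", "call_abc", ""}))
}
```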

Reproduced live against qwen3.6-35b-a3b-apex served by LocalAI with
the C++ autoparser; the full-content replay chunk that preceded each
tool_calls block is gone after the fix.

Assisted-by: Claude:claude-opus-4-7 go vet

* fix(streaming): dedupe reasoning in the noActionToRun final chunk

extractor.Reasoning() returns only the Go-side extractor's lastReasoning
accumulator (pkg/reasoning/extractor.go:129). ChatDelta reasoning
coming through ProcessChatDeltaReasoning lives in a separate
accumulator (cdLastStrippedReasoning) that Reasoning() does not
expose. The "reasoning != \"\" && extractor.Reasoning() == \"\"" guard
therefore fires exactly when the autoparser streamed reasoning
incrementally via the callback — producing a duplicate final delivery.

Replace both guard sites in the noActionToRun branch with the
sentReasoning flag introduced in the previous commit. Extract the
closing-chunk logic into buildNoActionFinalChunks so the refactor is
testable; the helper mirrors buildDeferredToolCallChunks.

Add Ginkgo coverage for both the content-streamed and
content-not-streamed paths: reasoning is dropped when it was streamed,
delivered once when it arrived only as a final aggregate, and omitted
when empty. Metadata invariants carried over from the sibling helper.

Assisted-by: Claude:claude-opus-4-7 go vet

* fix(streaming): detect noActionToRun anywhere in functionResults

The previous condition only looked at functionResults[0].Name, which
misbehaved when a real tool call followed a noAction sentinel — the
noAction shadowed the real call and the whole turn was treated as a
question to answer, silently dropping the tool call. The mirror case,
[realCall, noActionCall], fell into the default branch and emitted the
noAction entry as if it were a real tool_call.

Replace with hasRealCall, which scans the slice and returns true as
soon as it finds a non-noAction entry. noActionToRun now matches the
semantic intent: "every entry is the noAction sentinel (or the slice
is empty)".
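
A sketch of the scan, under stated assumptions: `funcResult` is a
trimmed-down stand-in for the parsed result type, and the signature and
sentinel name below are hypothetical.

```go
package main

import "fmt"

// funcResult is a hypothetical, minimal stand-in for a parsed function result.
type funcResult struct {
	Name string
}

// hasRealCall reports whether any entry is something other than the
// noAction sentinel. It returns as soon as it finds one.
func hasRealCall(results []funcResult, noActionName string) bool {
	for _, r := range results {
		if r.Name != noActionName {
			return true
		}
	}
	return false
}

func main() {
	const noAction = "answer" // assumed sentinel name, for illustration only

	mixed := []funcResult{{Name: noAction}, {Name: "get_weather"}}
	onlySentinel := []funcResult{{Name: noAction}}

	// noActionToRun is simply the negation of hasRealCall.
	fmt.Println(!hasRealCall(mixed, noAction))
	fmt.Println(!hasRealCall(onlySentinel, noAction))
	fmt.Println(!hasRealCall(nil, noAction))
}
```

Because the check no longer depends on position, both orderings
([noAction, realCall] and [realCall, noAction]) are treated as turns
that carry a real tool call.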

Note: this does not change incremental emission, where noAction
entries may still be forwarded as tool_call chunks by the XML/JSON
iterative parsers. That is a separate layer (functions.Parse*) and
addressing it requires threading noAction through the parser APIs —
out of scope for this change.

Assisted-by: Claude:claude-opus-4-7 go vet
2026-04-21 21:59:33 +02:00
leinasi2014
d18d434bb2 Respect explicit reasoning config during GGUF thinking probe (#9463)
Signed-off-by: leinasi2014 <leinasi2014@gmail.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-21 21:53:10 +02:00
Ettore Di Giacinto
39573ecd2a chore(whisperx): drop ROCm/hipblas build target (#9474)
whisperx has no upstream AMD GPU support and its core transcription path
(faster-whisper -> ctranslate2) falls back to CPU on AMD since the PyPI
ctranslate2 is CUDA-only. The torch rocm wheels would accelerate only the
alignment/diarization stages, producing a misleadingly half-working image.

Drop the hipblas variant rather than shipping a partially accelerated build
users can't distinguish from the real thing. AMD hosts now fall through
the capability map to cpu-whisperx / cpu-whisperx-development.

Also removes the now-dangling rocm-whisperx assertion from
pkg/system/capabilities_test.go and the ROCm mention from the whisperx
row in docs/content/reference/compatibility-table.md.

Assisted-by: Claude Code:claude-opus-4-7
2026-04-21 21:50:18 +02:00
Ettore Di Giacinto
a7dbb2a83d fix(gallery-agent): process blacklist command on recently-closed PRs (#9473)
The command-processing step only walked open PRs, so when a maintainer
wrote `/gallery-agent blacklist` and immediately closed the PR, the
next scheduled run missed the command, the `gallery-agent/blacklisted`
label was never applied, and the skip-URL step (which only pulls URLs
from closed PRs carrying that label) re-proposed the model on the next
cron.

Also scan closed gallery-agent PRs from the last 14 days that don't
already carry the blacklist label, and apply the label retroactively
when the command is present. Close/recreate actions still only run on
open PRs.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-21 16:29:13 +02:00
dependabot[bot]
3ad9b16c29 chore(deps): bump github.com/coreos/go-oidc/v3 from 3.17.0 to 3.18.0 (#9455)
Bumps [github.com/coreos/go-oidc/v3](https://github.com/coreos/go-oidc) from 3.17.0 to 3.18.0.
- [Release notes](https://github.com/coreos/go-oidc/releases)
- [Commits](https://github.com/coreos/go-oidc/compare/v3.17.0...v3.18.0)

---
updated-dependencies:
- dependency-name: github.com/coreos/go-oidc/v3
  dependency-version: 3.18.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-21 15:31:02 +02:00
dependabot[bot]
c806d5ab73 chore(deps): bump github.com/aws/aws-sdk-go-v2/config from 1.32.14 to 1.32.16 (#9456)
chore(deps): bump github.com/aws/aws-sdk-go-v2/config

Bumps [github.com/aws/aws-sdk-go-v2/config](https://github.com/aws/aws-sdk-go-v2) from 1.32.14 to 1.32.16.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/config/v1.32.14...config/v1.32.16)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/config
  dependency-version: 1.32.16
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-21 15:30:22 +02:00
LocalAI [bot]
47efaf5b43 Fix: Add model parameter to neutts-air gallery definition (#8793)
fix: Add model parameter to neutts-air gallery definition

The neutts-air model entry was missing the 'model' parameter in its
configuration, which caused LocalAI to fail with an 'Unrecognized model'
error when trying to use it. This change adds the required model parameter
pointing to the HuggingFace repository (neuphonic/neutts-air) so the backend
can properly load the model.

Fixes #8792

Signed-off-by: localai-bot <localai-bot@example.com>
Co-authored-by: localai-bot <localai-bot@example.com>
2026-04-21 11:56:00 +02:00
LocalAI [bot]
315b634a91 feat: improve CLI error messages with actionable guidance (#8880)
- transcript.go: Model not found error now suggests available models commands
- util.go: GGUF error explains format and how to get models
- worker_p2p.go: Token error explains purpose and how to obtain one
- run.go: Startup failure includes troubleshooting steps and docs link
- model_config_loader.go: Config validation errors include file path and guidance

Refs: H2 - UX Review Issue

Signed-off-by: localai-bot <localai-bot@noreply.github.com>
Co-authored-by: localai-bot <localai-bot@noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-21 11:53:26 +02:00
dependabot[bot]
6b245299d7 chore(deps): bump github.com/modelcontextprotocol/go-sdk from 1.4.1 to 1.5.0 (#9454)
chore(deps): bump github.com/modelcontextprotocol/go-sdk

Bumps [github.com/modelcontextprotocol/go-sdk](https://github.com/modelcontextprotocol/go-sdk) from 1.4.1 to 1.5.0.
- [Release notes](https://github.com/modelcontextprotocol/go-sdk/releases)
- [Commits](https://github.com/modelcontextprotocol/go-sdk/compare/v1.4.1...v1.5.0)

---
updated-dependencies:
- dependency-name: github.com/modelcontextprotocol/go-sdk
  dependency-version: 1.5.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-21 11:43:00 +02:00
dependabot[bot]
677c0315c1 chore(deps): bump github.com/containerd/containerd from 1.7.30 to 1.7.31 (#9453)
Bumps [github.com/containerd/containerd](https://github.com/containerd/containerd) from 1.7.30 to 1.7.31.
- [Release notes](https://github.com/containerd/containerd/releases)
- [Changelog](https://github.com/containerd/containerd/blob/main/RELEASES.md)
- [Commits](https://github.com/containerd/containerd/compare/v1.7.30...v1.7.31)

---
updated-dependencies:
- dependency-name: github.com/containerd/containerd
  dependency-version: 1.7.31
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-21 11:42:43 +02:00
dependabot[bot]
478522ce4d chore(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.97.1 to 1.99.1 (#9452)
chore(deps): bump github.com/aws/aws-sdk-go-v2/service/s3

Bumps [github.com/aws/aws-sdk-go-v2/service/s3](https://github.com/aws/aws-sdk-go-v2) from 1.97.1 to 1.99.1.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/service/s3/v1.97.1...service/s3/v1.99.1)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/service/s3
  dependency-version: 1.99.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-21 11:42:27 +02:00
Ettore Di Giacinto
c54897ad44 fix(tests): update InstallBackend call sites for new URI/Name/Alias params (#9467)
Commit 02bb715c (#9446) added uri, name, alias parameters to
RemoteUnloaderAdapter.InstallBackend but missed the e2e test call
sites, breaking the distributed test build. Pass empty strings to
match the pattern used by the other non-URI call sites.

Assisted-by: Claude Code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-21 11:41:38 +02:00
LocalAI [bot]
8bb1e8f21f chore: ⬆️ Update ggml-org/llama.cpp to cf8b0dbda9ac0eac30ee33f87bc6702ead1c4664 (#9448)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-21 11:15:45 +02:00
LocalAI [bot]
cd94a0b61a chore: ⬆️ Update ggml-org/whisper.cpp to fc674574ca27cac59a15e5b22a09b9d9ad62aafe (#9450)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-21 11:09:05 +02:00
LocalAI [bot]
047bc48fa9 chore(model gallery): 🤖 add 1 new model via gallery agent (#9464)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-21 11:07:07 +02:00
sec171
01bd8ae5d0 [gallery] Fix duplicate sha256 keys in Wan models (#9461)
Fix duplicate sha256 keys in wan models gallery

The wan models previously defined the `sha256` key twice in their files lists,
which triggered strict mapping key checks in the YAML parser and resulted
in unmarshal errors that crashed the `/api/models` loading. This removes
the redundant trailing `sha256` keys from the Wan model definitions.

Assisted-by: Antigravity:Gemini-3.1-Pro-High [multi_replace_file_content, run_command]

Signed-off-by: Alex <codecrusher24@gmail.com>
2026-04-21 11:06:36 +02:00
LocalAI [bot]
d9808769be chore(model-gallery): ⬆️ update checksum (#9451)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-21 00:07:58 +02:00
LocalAI [bot]
5973c0a9df chore: ⬆️ Update ikawrakow/ik_llama.cpp to d4824131580b94ffa7b0e91c955e2b237c2fe16e (#9447)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-21 00:07:19 +02:00
leinasi2014
486b5e25a3 fix(config): ignore yaml backup files in model loader (#9443)
Only load files whose real extension is .yaml or .yml so backup files like model.yaml.bak do not override active configs. Add a regression test covering plain and timestamped backup files.
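
One way to express the "real extension" test is with `filepath.Ext`,
which returns only the final suffix of a file name. `isModelConfigFile`
is a hypothetical name, and this is a sketch rather than the loader's
actual code.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// isModelConfigFile reports whether name ends in .yaml or .yml.
// filepath.Ext looks at the last dot only, so "model.yaml.bak" yields
// ".bak" and is rejected.
func isModelConfigFile(name string) bool {
	switch filepath.Ext(name) {
	case ".yaml", ".yml":
		return true
	}
	return false
}

func main() {
	for _, name := range []string{
		"model.yaml",
		"model.yml",
		"model.yaml.bak",
		"model.yaml.20260420",
	} {
		fmt.Printf("%-22s %v\n", name, isModelConfigFile(name))
	}
}
```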

Assisted-by: Codex:gpt-5.4 docker

Signed-off-by: leinasi2014 <leinasi2014@gmail.com>
2026-04-20 23:41:39 +02:00
Russell Sim
c66c41e8d7 fix(ci): wire AMDGPU_TARGETS through backend build workflow (#9445)
Commit 8839a71c exposed AMDGPU_TARGETS as an ARG/ENV in
Dockerfile.llama-cpp so GPU targets could be overridden, but never
wired the value through the CI workflow inputs. Without it, Docker
receives AMDGPU_TARGETS="" which overrides the Makefile's ?= default,
causing all hipblas builds to compile only for gfx906 regardless of
the target list in the Makefile.

Add amdgpu-targets as a workflow_call input with the same default list
as the Makefile, and pass it as AMDGPU_TARGETS in the build-args of
both the push and PR build steps.

Assisted-by: Claude Code:claude-sonnet-4-6

Signed-off-by: Russell Sim <rsl@simopolis.xyz>
2026-04-20 23:41:19 +02:00
Russell Sim
02bb715c0a fix(distributed): pass ExternalURI through NATS backend install (#9446)
When installing a backend with a custom OCI URI in distributed mode,
the URI was captured in ManagementOp.ExternalURI by the HTTP handler
but never forwarded to workers. BackendInstallRequest had no URI field,
so workers fell through to the gallery lookup and failed with
"no backend found with name <custom-name>".

Add URI/Name/Alias fields to BackendInstallRequest and thread them from
ManagementOp through DistributedBackendManager.InstallBackend() and the
RemoteUnloaderAdapter. On the worker side, route to InstallExternalBackend
when URI is set instead of InstallBackendFromGallery. Update all
remaining InstallBackend call sites (UpgradeBackend, reconciler
pending-op drain, router auto-install) to pass empty strings for the
new params.

Assisted-by: Claude Code:claude-sonnet-4-6

Signed-off-by: Russell Sim <rsl@simopolis.xyz>
2026-04-20 23:39:35 +02:00
Ettore Di Giacinto
8ab56e2ad3 feat(gallery): add wan i2v 720p (#9457)
feat(gallery): add Wan 2.1 I2V 14B 720P + pin all wan ggufs by sha256

Adds a new entry for the native-720p image-to-video sibling of the
480p I2V model (wan-2.1-i2v-14b-480p-ggml). The 720p I2V model is
trained purely as image-to-video — no first-last-frame interpolation
path — so motion is freer than repurposing the FLF2V 720P variant as
an i2v. Shares the same VAE, umt5_xxl text encoder, and clip_vision_h
auxiliary files as the existing 480p I2V and 720p FLF2V entries, so
no new aux downloads are introduced.

Also pins the main diffusion gguf by sha256 for the new entry and for
the three existing wan entries that were previously missing a hash
(wan-2.1-t2v-1.3b-ggml, wan-2.1-i2v-14b-480p-ggml,
wan-2.1-flf2v-14b-720p-ggml). Hashes were fetched from HuggingFace's
x-linked-etag header per .agents/adding-gallery-models.md.

Assisted-by: Claude:claude-opus-4-7
2026-04-20 23:34:11 +02:00
pjbrzozowski
ecf85fde9e fix(api): remove duplicate /api/traces endpoint that broke React UI (#9427)
The API Traces tab in /app/traces always showed (0) traces despite requests
being recorded.

The /api/traces endpoint was registered in both localai.go and ui_api.go.
The ui_api.go version wrapped the response as {"traces": [...]} instead of
the flat []APIExchange array that both the React UI (Traces.jsx) and the
legacy Alpine.js UI (traces.html) expect. Because Echo matched the ui_api.go
handler, Array.isArray(apiData) always returned false, making the API Traces
tab permanently empty.

Remove the duplicate endpoints from ui_api.go so only the correct flat-array
version in localai.go is served.

Also use mime.ParseMediaType for the Content-Type check in the trace
middleware so requests with parameters (e.g. application/json; charset=utf-8)
are still traced.
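
A sketch of the Content-Type check using the standard library's
`mime.ParseMediaType`. `isJSONContentType` is a hypothetical helper, and
the assumption that the middleware traces only application/json is taken
from the example in this message, not from the code.

```go
package main

import (
	"fmt"
	"mime"
)

// isJSONContentType reports whether the header value names the
// application/json media type, ignoring parameters such as charset.
func isJSONContentType(header string) bool {
	mediaType, _, err := mime.ParseMediaType(header)
	if err != nil {
		// Empty or malformed header: treat as not JSON.
		return false
	}
	return mediaType == "application/json"
}

func main() {
	fmt.Println(isJSONContentType("application/json"))
	fmt.Println(isJSONContentType("application/json; charset=utf-8"))
	fmt.Println(isJSONContentType("multipart/form-data; boundary=x"))
}
```

A plain string comparison against "application/json" would reject the
second header, which is the failure mode this commit describes.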

Signed-off-by: Pawel Brzozowski <paul@ontux.net>
Co-authored-by: Pawel Brzozowski <paul@ontux.net>
2026-04-20 18:44:49 +02:00
Sai Asish Y
6480715a16 fix(settings): strip env-supplied ApiKeys from the request before persisting (#9438)
GET /api/settings returns settings.ApiKeys as the merged env+runtime list
via ApplicationConfig.ToRuntimeSettings(). The WebUI displays that list and
round-trips it back on POST /api/settings unchanged.

UpdateSettingsEndpoint was then doing:

    appConfig.ApiKeys = append(envKeys, runtimeKeys...)

where runtimeKeys already contained envKeys (because the UI got them from
the merged GET). Every save therefore duplicated the env keys on top of
the previous merge, and also wrote the duplicates to runtime_settings.json
so the duplication survived restarts and compounded with each save. This
is the user-visible behaviour in #9071: the Web UI shows the keys
twice / three times after consecutive saves.

Before we marshal the settings to disk or call ApplyRuntimeSettings, drop
any incoming key that already appears in startupConfig.ApiKeys. The file
on disk now stores only the genuinely runtime-added keys; the subsequent
append(envKeys, runtimeKeys...) produces one copy of each env key, as
intended. Behaviour is unchanged for users who never had env keys set.
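
The filtering step can be sketched as below. `stripEnvKeys` is a
hypothetical name and the key values are made up; only the
append(envKeys, runtimeKeys...) merge comes from this message.

```go
package main

import "fmt"

// stripEnvKeys returns the incoming keys that are not already supplied by
// the environment, preserving their order.
func stripEnvKeys(incoming, envKeys []string) []string {
	env := make(map[string]struct{}, len(envKeys))
	for _, k := range envKeys {
		env[k] = struct{}{}
	}
	runtimeKeys := make([]string, 0, len(incoming))
	for _, k := range incoming {
		if _, fromEnv := env[k]; !fromEnv {
			runtimeKeys = append(runtimeKeys, k)
		}
	}
	return runtimeKeys
}

func main() {
	envKeys := []string{"env-key"}
	// What the UI posts back: the merged list it received from GET.
	incoming := []string{"env-key", "runtime-key"}

	runtimeKeys := stripEnvKeys(incoming, envKeys)
	// Copy envKeys before appending so the merge never writes into its
	// backing array.
	merged := append(append([]string{}, envKeys...), runtimeKeys...)

	fmt.Println("persisted:", runtimeKeys)
	fmt.Println("effective:", merged)
}
```

Repeating the save with `merged` as the next `incoming` list yields the
same two results, so the list no longer grows with each round trip.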

Fixes #9071

Co-authored-by: SAY-5 <SAY-5@users.noreply.github.com>
2026-04-20 10:36:54 +02:00
Ettore Di Giacinto
f683231811 feat(gallery): add Wan 2.1 FLF2V 14B 720P (#9440)
First-last-frame-to-video variant of the 14B Wan family. Accepts a
start and end reference image and — unlike the pure i2v path — runs
both through clip_vision, so the final frame lands on the end image
both in pixel and semantic space. Right pick for seamless loops
(start_image == end_image) and narrative A→B cuts.

Shares the same VAE, umt5_xxl text encoder, and clip_vision_h as the
I2V 14B entry. Options block mirrors i2v's full-list-in-override
style so the template merge doesn't drop fields.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 10:34:36 +02:00
LocalAI [bot]
960757f0e8 chore(model gallery): 🤖 add 1 new model via gallery agent (#9436)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-20 08:48:47 +02:00
Ettore Di Giacinto
865fd552f5 docs(agents): adopt kernel's AI coding assistants policy
Align LocalAI with the Linux kernel project's policy for AI-assisted
contributions (https://docs.kernel.org/process/coding-assistants.html).

- Add .agents/ai-coding-assistants.md with the full policy adapted to
  LocalAI's MIT license: no Signed-off-by or Co-Authored-By from AI,
  attribute AI involvement via an Assisted-by: trailer, human submitter
  owns the contribution.
- Surface the rules at the entry points: AGENTS.md (and its CLAUDE.md
  symlink) and CONTRIBUTING.md.
- Publish a user-facing reference page at
  docs/content/reference/ai-coding-assistants.md and link it from the
  references index.

Assisted-by: Claude:claude-opus-4-7
2026-04-19 22:50:54 +00:00
LocalAI [bot]
cb77a5a4b9 chore(model gallery): 🤖 add 1 new model via gallery agent (#9425)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-20 00:42:44 +02:00
Ettore Di Giacinto
60633c4dd5 fix(stable-diffusion.ggml): force mp4 container in ffmpeg mux (#9435)
gen_video's ffmpeg subprocess was relying on the filename extension to
choose the output container. Distributed LocalAI hands the backend a
staging path (e.g. /staging/localai-output-NNN.tmp) that is renamed to
.mp4 only after the backend returns, so ffmpeg saw a .tmp extension and
bailed with "Unable to choose an output format". Inference had already
completed and the frames were piped in, producing the cryptic
"video inference failed (code 1)" at the API layer.

Pass -f mp4 explicitly so the container is selected by flag instead of
by filename suffix.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 00:41:54 +02:00
Ettore Di Giacinto
9e44944cc1 fix(i2v): Add new options to the model configuration
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-20 00:27:05 +02:00
Ettore Di Giacinto
372eb08dcf fix(gallery): allow uninstalling orphaned meta backends + force reinstall (#9434)
Fix two interrelated bugs that together made a meta backend impossible
to uninstall once its concrete backend had been removed from disk
(partial install, earlier crash, manual cleanup).

1. DeleteBackendFromSystem returned "meta backend %q not found" and
   bailed out early when the concrete directory didn't exist,
   preventing the orphaned meta dir from ever being removed. Treat a
   missing concrete as idempotent success — log a warning and continue
   to remove the orphan meta.

2. InstallBackendFromGallery's "already installed, skip" short-circuit
   only checked that the name was known (`backends.Exists(name)`); an
   orphaned meta whose RunFile points at a missing concrete still
   satisfies that check, so every reinstall returned nil without doing
   anything. Afterwards the worker's findBackend returned empty and we
   kept looping with "backend %q not found after install attempt".
   Require the entry to be actually runnable (run.sh stat-able, not a
   directory) before skipping.

New helper isBackendRunnable centralises the runnability test so both
the install guard and future callers stay in sync. Tests cover the
orphaned-meta delete path and the non-runnable short-circuit case.
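
The runnability test described above could look roughly like the following sketch. The signature is an assumption (the real `isBackendRunnable` may take a backend struct and resolve the path from its RunFile), and `demo` exists only to exercise the three cases:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// isBackendRunnable reports whether dir contains a run.sh that can be
// stat-ed and is not itself a directory. Hypothetical signature, for
// illustration only.
func isBackendRunnable(dir string) bool {
	info, err := os.Stat(filepath.Join(dir, "run.sh"))
	if err != nil {
		return false // missing or unreadable: not runnable
	}
	return !info.IsDir()
}

// demo builds three throwaway backend directories and returns the result for
// each: a healthy backend, an orphan without run.sh, and one where run.sh is
// a directory.
func demo() (healthy, orphan, dirAsRun bool) {
	root, err := os.MkdirTemp("", "backends")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	ok := filepath.Join(root, "ok")
	missing := filepath.Join(root, "missing")
	bad := filepath.Join(root, "bad")
	for _, d := range []string{ok, missing, filepath.Join(bad, "run.sh")} {
		if err := os.MkdirAll(d, 0o755); err != nil {
			panic(err)
		}
	}
	if err := os.WriteFile(filepath.Join(ok, "run.sh"), []byte("#!/bin/sh\n"), 0o755); err != nil {
		panic(err)
	}
	return isBackendRunnable(ok), isBackendRunnable(missing), isBackendRunnable(bad)
}

func main() {
	fmt.Println(demo())
}
```

With a guard like this, an install request for an orphaned meta falls through to a real reinstall instead of returning early.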
2026-04-20 00:10:19 +02:00
LocalAI [bot]
28091d626e chore: ⬆️ Update ikawrakow/ik_llama.cpp to 00ba208a5c036eee72d4a631b4f57c126095cb03 (#9430)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-20 00:01:48 +02:00
LocalAI [bot]
cae79d9107 feat(swagger): update swagger (#9431)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-19 23:39:50 +02:00
LocalAI [bot]
babbbc6ec8 chore: ⬆️ Update ggml-org/llama.cpp to 4eac5b45095a4e8a1ff1cce4f6d030e0872fb4ad (#9429)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-19 23:39:19 +02:00
LocalAI [bot]
3804497186 chore: ⬆️ Update leejet/stable-diffusion.cpp to 44cca3d626d301e2215d5e243277e8f0e65bfa78 (#9428)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-19 23:39:07 +02:00
Ettore Di Giacinto
fda1c553a1 fix(distributed): stop queue loops on agent nodes + dead-letter cap (#9433)
pending_backend_ops rows targeting agent-type workers looped forever:
the reconciler fan-out hit a NATS subject the worker doesn't subscribe
to, returned ErrNoResponders, we marked the node unhealthy, and the
health monitor flipped it back to healthy on the next heartbeat. Next
tick, same row, same failure.

Three related fixes:

1. enqueueAndDrainBackendOp skips nodes whose NodeType != backend.
   Agent workers handle agent NATS subjects, not backend.install /
   delete / list, so enqueueing for them guarantees an infinite retry
   loop. Silent skip is correct — they aren't consumers of these ops.

2. Reconciler drain mirrors enqueueAndDrainBackendOp's behavior on
   nats.ErrNoResponders: mark the node unhealthy before recording the
   failure, so subsequent ListDuePendingBackendOps (filters by
   status=healthy) stops picking the row until the node actually
   recovers. Matches the synchronous fan-out path.

3. Dead-letter cap at maxPendingBackendOpAttempts (10). After ~1h of
   exponential backoff the row is a poison message; further retries
   just thrash NATS. Row is deleted and logged at ERROR so it stays
   visible without being retried forever.
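
A rough sketch of fixes 1 and 3, under stated assumptions: the node-type strings, the `decision` type, and the exact comparison against the cap (`>=`) are illustrative, and only the constant name and its value of 10 come from the message above.

```go
package main

import "fmt"

// maxPendingBackendOpAttempts mirrors the cap named in the commit message.
const maxPendingBackendOpAttempts = 10

// Hypothetical node types; the real registry's constants may differ.
const (
	nodeTypeBackend = "backend"
	nodeTypeAgent   = "agent"
)

// decision is an illustrative type describing what happens to a queue row.
type decision string

const (
	retryLater decision = "retry"       // keep the row and back off
	deadLetter decision = "dead-letter" // delete the row and log at ERROR
)

// shouldEnqueue implements fix 1: only backend-type nodes consume
// backend.install / delete / list, so other node types are silently skipped.
func shouldEnqueue(nodeType string) bool {
	return nodeType == nodeTypeBackend
}

// afterFailure implements fix 3: once a row has failed the maximum number of
// times it is treated as a poison message instead of being retried again.
func afterFailure(attempts int) decision {
	if attempts >= maxPendingBackendOpAttempts {
		return deadLetter
	}
	return retryLater
}

func main() {
	fmt.Println(shouldEnqueue(nodeTypeAgent), afterFailure(3), afterFailure(10))
}
```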

Plus a one-shot startup cleanup in NewNodeRegistry: drop queue rows
that target agent-type nodes, non-existent nodes, or carry an empty
backend name. Guarded by the same schema-migration advisory lock so
only one instance performs it. The guards above prevent new rows of
this shape; this closes the migration gap for existing ones.

Tests: the prune migration (valid row stays, agent + empty-name rows
drop) on top of existing upsert / backoff coverage.
2026-04-19 23:38:43 +02:00
615 changed files with 55406 additions and 5175 deletions


@@ -8,6 +8,7 @@ Create the backend directory under the appropriate location:
- **Python backends**: `backend/python/<backend-name>/`
- **Go backends**: `backend/go/<backend-name>/`
- **C++ backends**: `backend/cpp/<backend-name>/`
- **Rust backends**: `backend/rust/<backend-name>/`
For Python backends, you'll typically need:
- `backend.py` - Main gRPC server implementation
@@ -18,9 +19,22 @@ For Python backends, you'll typically need:
- `run.sh` - Runtime script
- `test.py` / `test.sh` - Test files
For Rust backends, you'll typically need (see `backend/rust/kokoros/` as a reference):
- `Cargo.toml` - Crate manifest; depend on the upstream project as a submodule under `sources/`
- `build.rs` - Invokes `tonic_build` to generate gRPC stubs from `backend/backend.proto` (use the `BACKEND_PROTO_PATH` env var so the Makefile can inject the canonical copy)
- `src/` - The gRPC server implementation (implement `Backend` via `tonic`)
- `Makefile` - Copies `backend.proto` into the crate, runs `cargo build --release`, then `package.sh`
- `package.sh` - Uses `ldd` to bundle the binary's dynamic deps and `ld.so` into `package/lib/`
- `run.sh` - Sets `LD_LIBRARY_PATH`/`SSL_CERT_DIR` and execs the binary via the bundled `lib/ld.so`
- `sources/<UpstreamProject>/` - Git submodule with the upstream Rust crate
## 2. Add Build Configurations to `.github/workflows/backend.yml`
Add build matrix entries for each platform/GPU type you want to support. Look at similar backends (e.g., `chatterbox`, `faster-whisper`) for reference.
Add build matrix entries for each platform/GPU type you want to support. Look at similar backends for reference — `chatterbox`/`faster-whisper` for Python, `piper`/`silero-vad` for Go, `kokoros` for Rust.
**Without an entry here no image is ever built or pushed, and the gallery entry in `backend/index.yaml` will point at a tag that does not exist.** The `dockerfile:` field must point at `./backend/Dockerfile.<lang>` matching the language bucket from step 1 (e.g. `Dockerfile.python`, `Dockerfile.golang`, `Dockerfile.rust`). The `tag-suffix` must match the `uri:` in the corresponding `backend/index.yaml` image entry exactly.
If you add a new language bucket, `scripts/changed-backends.js` also needs a branch in `inferBackendPath` so PR change-detection routes file edits correctly.
**Placement in file:**
- CPU builds: Add after other CPU builds (e.g., after `cpu-chatterbox`)
@@ -29,7 +43,7 @@ Add build matrix entries for each platform/GPU type you want to support. Look at
**Additional build types you may need:**
- ROCm/HIP: Use `build-type: 'hipblas'` with `base-image: "rocm/dev-ubuntu-24.04:7.2.1"`
- Intel/SYCL: Use `build-type: 'intel'` or `build-type: 'sycl_f16'`/`sycl_f32` with `base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"`
- Intel/SYCL: Use `build-type: 'intel'` or `build-type: 'sycl_f16'`/`sycl_f32` with `base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"`
- L4T (ARM): Use `build-type: 'l4t'` with `platforms: 'linux/arm64'` and `runs-on: 'ubuntu-24.04-arm'`
## 3. Add Backend Metadata to `backend/index.yaml`
@@ -56,24 +70,28 @@ Add `backends/<backend-name>` to the `.NOTPARALLEL` line (around line 2) to prev
**Step 4b: Add to `prepare-test-extra`**
Add the backend to the `prepare-test-extra` target (around line 312) to prepare it for testing:
Add the backend to the `prepare-test-extra` target to prepare it for testing. Use the path matching your language bucket (`backend/python/`, `backend/go/`, `backend/rust/`, …):
```makefile
prepare-test-extra: protogen-python
...
$(MAKE) -C backend/python/<backend-name>
$(MAKE) -C backend/<lang>/<backend-name>
```
For Rust backends the target is usually the crate build target itself (e.g. `$(MAKE) -C backend/rust/<backend-name> <backend-name>-grpc`) so the binary is in place before `test` runs.
**Step 4c: Add to `test-extra`**
Add the backend to the `test-extra` target (around line 319) to run its tests:
Add the backend to the `test-extra` target to run its tests — applies to Go and Rust backends too, not only Python:
```makefile
test-extra: prepare-test-extra
...
$(MAKE) -C backend/python/<backend-name> test
$(MAKE) -C backend/<lang>/<backend-name> test
```
Each backend's own `Makefile` should define a `test` target so this line works regardless of language. Integration tests that need large model downloads should be gated behind an env var (see `backend/rust/kokoros/`'s `KOKOROS_MODEL_PATH` pattern) so CI only runs unit tests.
**Step 4d: Add Backend Definition**
Add a backend definition variable in the backend definitions section (around line 428-457). The format depends on the backend type:
@@ -93,6 +111,13 @@ BACKEND_<BACKEND_NAME> = <backend-name>|python|./backend|false|true
BACKEND_<BACKEND_NAME> = <backend-name>|golang|.|false|true
```
**For Rust backends**:
```makefile
BACKEND_<BACKEND_NAME> = <backend-name>|rust|.|false|true
```
The language field (`python`/`golang`/`rust`/…) must match a `backend/Dockerfile.<lang>` file.
**Step 4e: Generate Docker Build Target**
Add an eval call to generate the docker-build target (around line 480-501):
@@ -153,6 +178,29 @@ ls /tmp/check # expect the bundled .so files + symlinks
Then boot it inside a fresh `ubuntu:24.04` (which intentionally does *not* have the lib installed) to confirm it actually loads from the backend dir.
## Importer integration
When you add a new backend, you MUST also make it importable via the model import form (`/import-model`). The import form dropdown is sourced dynamically from `GET /backends/known` — it reads the importer registry at `core/gallery/importers/importers.go`, so the steps below are the ONLY way to make your backend show up.
Required steps:
1. **If your backend has unambiguous detection signals** (unique file extension, HF `pipeline_tag`, unique repo name pattern, unique artefact like `modules.json`):
- Create an importer file at `core/gallery/importers/<backend>.go` following the Match/Import pattern in `llama-cpp.go`.
- Register it in `importers.go:defaultImporters` in **specificity order** — more specific detectors must appear BEFORE more generic ones (e.g. `sentencetransformers` before `transformers`, `stablediffusion-ggml` before `llama-cpp`, `vllm-omni` before `vllm`). First match wins.
2. **If your backend is a drop-in replacement** (same artefacts as another backend, e.g. `ik-llama-cpp` and `turboquant` both consume GGUF the same way `llama-cpp` does):
- Do NOT create a new importer. Extend the existing importer's `Import()` to swap the emitted `backend:` field when `preferences.backend` matches. See `llama-cpp.go` for the pattern.
3. **If your backend has no reliable auto-detect signal** (preference-only — e.g. `sglang`, `tinygrad`, `whisperx`):
- Do NOT create an importer. Instead add the backend name to the curated pref-only slice in `core/http/endpoints/localai/backend.go` that feeds `/backends/known`. A single line addition.
4. **Always** add a table-driven test in `core/gallery/importers/importers_test.go` (Ginkgo/Gomega):
- Use a real public HuggingFace repo URI as the test fixture (existing tests already hit the live HF API — follow that pattern).
- Cover detection (auto-match without preferences), preference-override (explicit `backend:` in preferences wins), and — if the backend's modality has a common `pipeline_tag` but ambiguous artefacts — an ambiguity test asserting `errors.Is(err, importers.ErrAmbiguousImport)`.
Rules of thumb:
- When in doubt, lean pref-only. A wrong auto-detect is worse than a forced preference.
- Never silently emit a modality mismatch (e.g. emit `llama-cpp` for a TTS repo because `.gguf` is present). Return `ErrAmbiguousImport` instead.
- Registration order is the single most common source of bugs. Check by running `go test ./core/gallery/importers/...` — the existing suite will fail if you've shadowed a pre-existing detector.
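The "first match wins" rule can be illustrated with a small, self-contained sketch. Everything here is a simplified stand-in: the `importer` struct, the `detect` function, `errNoMatch`, and the file-name signals (`config.json` for `transformers` in particular) are assumptions for illustration, not the actual registry in `importers.go`.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errNoMatch is a hypothetical error for "no importer recognised the repo".
var errNoMatch = errors.New("no importer matched")

// importer is a simplified stand-in for the Match/Import pattern.
type importer struct {
	backend string
	match   func(files []string) bool
}

func hasFile(files []string, name string) bool {
	for _, f := range files {
		if f == name {
			return true
		}
	}
	return false
}

func hasSuffix(files []string, suffix string) bool {
	for _, f := range files {
		if strings.HasSuffix(f, suffix) {
			return true
		}
	}
	return false
}

// defaultImporters is ordered from most to least specific. Swapping the first
// two entries would let "transformers" shadow "sentencetransformers".
var defaultImporters = []importer{
	{"sentencetransformers", func(f []string) bool { return hasFile(f, "modules.json") }},
	{"transformers", func(f []string) bool { return hasFile(f, "config.json") }},
	{"llama-cpp", func(f []string) bool { return hasSuffix(f, ".gguf") }},
}

// detect returns the explicit preference if one is set, otherwise the first
// importer whose match function accepts the file list.
func detect(files []string, preferred string) (string, error) {
	if preferred != "" {
		return preferred, nil
	}
	for _, imp := range defaultImporters {
		if imp.match(files) {
			return imp.backend, nil
		}
	}
	return "", errNoMatch
}

func main() {
	backend, err := detect([]string{"modules.json", "config.json"}, "")
	fmt.Println(backend, err)
}
```

A repo that carries both `modules.json` and `config.json` resolves to `sentencetransformers` only because that entry is registered first, which is the shadowing risk the last rule of thumb warns about.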
## 6. Example: Adding a Python Backend
For reference, when `moonshine` was added:


@@ -0,0 +1,101 @@
# AI Coding Assistants
This document provides guidance for AI tools and developers using AI
assistance when contributing to LocalAI.
**LocalAI follows the same guidelines as the Linux kernel project for
AI-assisted contributions.** See the upstream policy here:
<https://docs.kernel.org/process/coding-assistants.html>
The rules below mirror that policy, adapted to LocalAI's license and
project layout. If anything is unclear, the kernel document is the
authoritative reference for intent.
AI tools helping with LocalAI development should follow the standard
project development process:
- [CONTRIBUTING.md](../CONTRIBUTING.md) — development workflow, commit
conventions, and PR guidelines
- [.agents/coding-style.md](coding-style.md) — code style, editorconfig,
logging, and documentation conventions
- [.agents/building-and-testing.md](building-and-testing.md) — build and
test procedures
## Licensing and Legal Requirements
All contributions must comply with LocalAI's licensing requirements:
- LocalAI is licensed under the **MIT License** — see the [LICENSE](../LICENSE)
file
- New source files should use the SPDX license identifier `MIT` where
applicable to the file type
- Contributions must be compatible with the MIT License and must not
introduce code under incompatible licenses (e.g., GPL) without an
explicit discussion with maintainers
## Signed-off-by and Developer Certificate of Origin
**AI agents MUST NOT add `Signed-off-by` tags.** Only humans can legally
certify the Developer Certificate of Origin (DCO). The human submitter
is responsible for:
- Reviewing all AI-generated code
- Ensuring compliance with licensing requirements
- Adding their own `Signed-off-by` tag (when the project requires DCO)
to certify the contribution
- Taking full responsibility for the contribution
AI agents MUST NOT add `Co-Authored-By` trailers for themselves either.
A human reviewer owns the contribution; the AI's involvement is recorded
via `Assisted-by` (see below).
## Attribution
When AI tools contribute to LocalAI development, proper attribution helps
track the evolving role of AI in the development process. Contributions
should include an `Assisted-by` tag in the commit message trailer in the
following format:
```
Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]
```
Where:
- `AGENT_NAME` — name of the AI tool or framework (e.g., `Claude`,
`Copilot`, `Cursor`)
- `MODEL_VERSION` — specific model version used (e.g.,
`claude-opus-4-7`, `gpt-5`)
- `[TOOL1] [TOOL2]` — optional specialized analysis tools invoked by the
agent (e.g., `golangci-lint`, `staticcheck`, `go vet`)
Basic development tools (git, go, make, editors) should **not** be listed.
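As an illustration of the format only, a small Go helper could assemble the trailer. The function name `assistedBy` is hypothetical and nothing in the project requires generating the trailer programmatically:

```go
package main

import (
	"fmt"
	"strings"
)

// assistedBy formats an Assisted-by trailer as AGENT_NAME:MODEL_VERSION
// followed by any optional analysis tools, separated by single spaces.
// Hypothetical helper, shown only to make the format concrete.
func assistedBy(agent, model string, tools ...string) string {
	trailer := fmt.Sprintf("Assisted-by: %s:%s", agent, model)
	if len(tools) > 0 {
		trailer += " " + strings.Join(tools, " ")
	}
	return trailer
}

func main() {
	fmt.Println(assistedBy("Claude", "claude-opus-4-7", "golangci-lint"))
}
```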
### Example
```
fix(llama-cpp): handle empty tool call arguments
Previously the parser panicked when the model returned a tool call with
an empty arguments object. Fall back to an empty JSON object in that
case so downstream consumers receive a valid payload.
Assisted-by: Claude:claude-opus-4-7 golangci-lint
Signed-off-by: Jane Developer <jane@example.com>
```
## Scope and Responsibility
Using an AI assistant does not reduce the contributor's responsibility.
The human submitter must:
- Understand every line that lands in the PR
- Verify that generated code compiles, passes tests, and follows the
project style
- Confirm that any referenced APIs, flags, or file paths actually exist
in the current tree (AI models may hallucinate identifiers)
- Not submit AI output verbatim without review
Reviewers may ask for clarification on any change regardless of how it
was produced. "An AI wrote it" is not an acceptable answer to a design
question.


@@ -2,6 +2,8 @@
This guide covers how to add new API endpoints and properly integrate them with the auth/permissions system.
> **Before you ship a new endpoint or capability surface**, re-read the [checklist at the bottom of this file](#checklist). LocalAI advertises its feature surface in several independent places — miss any one of them and clients/admins/UI won't know the endpoint exists.
## Architecture overview
Authentication and authorization flow through three layers:
@@ -234,6 +236,66 @@ Use these HTTP status codes:
If your endpoint should be tracked for usage (token counts, request counts), add the `usageMiddleware` to its middleware chain. See `core/http/middleware/usage.go` and how it's applied in `routes/openai.go`.
## Advertising surfaces — where to register a new capability
Beyond routing and auth, LocalAI publishes its capability surface in **four independent places**. When you add an endpoint — especially one introducing a net-new capability like a new media type or a new auth-gated feature — you must update every relevant surface. These aren't optional: missing them means the endpoint works but is invisible to clients, admins, and the UI.
### 1. Swagger `@Tags` annotation (mandatory)
Every handler needs a swagger block so the endpoint appears in `/swagger/index.html` and in the `/api/instructions` output. The `@Tags` value is what groups the endpoint into a capability area:
```go
// MyEndpoint does X.
// @Summary Do X.
// @Tags my-capability
// @Param request body schema.MyRequest true "payload"
// @Success 200 {object} schema.MyResponse "Response"
// @Router /v1/my-endpoint [post]
func MyEndpoint(...) echo.HandlerFunc { ... }
```
Use an existing tag when the endpoint extends an existing area (e.g. `audio`, `images`, `face-recognition`). Create a new tag only when the endpoint introduces a genuinely new capability surface — and in that case, also register it in step 2.
After adding endpoints, regenerate the embedded spec so the runtime serves it:
```bash
make protogen-go # ensures gRPC codegen is fresh first
make swagger # regenerates swagger/swagger.json
```
### 2. `/api/instructions` registry (for new capability areas)
`core/http/endpoints/localai/api_instructions.go` defines `instructionDefs` — a lightweight, machine-readable index of capability areas that groups swagger endpoints by tag. It's the primary discovery surface for agents and SDKs ("what can this server do?").
**When to update:** only when adding a new capability area (a new swagger tag). Existing-tag additions automatically surface without any change here.
Add an entry to `instructionDefs`:
```go
{
Name: "my-capability", // URL segment at /api/instructions/my-capability
Description: "Short sentence describing the capability",
Tags: []string{"my-capability"}, // must match swagger @Tags
Intro: "Optional gotcha/context that isn't in the swagger descriptions (caveats, defaults, cross-references to other endpoints).",
},
```
Also bump the expected-length count in `api_instructions_test.go` and add the name to the `ContainElements` assertion.
### 3. `capabilities.js` symbol (for new model-config FLAG_* flags)
If your feature needs a new `FLAG_*` usecase flag in `core/config/model_config.go` (so users can filter gallery models by it, and so `/v1/models` surfaces it), also declare the matching symbol in `core/http/react-ui/src/utils/capabilities.js`:
```js
export const CAP_MY_CAPABILITY = 'FLAG_MY_CAPABILITY'
```
React pages that want to filter the ModelSelector by capability import this symbol. Declare it even if you're not building the UI page yet — the declaration keeps the Go/JS vocabularies in sync.
### 4. `docs/content/` (user-facing documentation)
A new capability deserves its own page under `docs/content/features/`, plus cross-links from related features and an entry in `docs/content/whats-new.md`. See the pattern used by `face-recognition.md` / `object-detection.md`.
## Path protection rules
The global auth middleware classifies paths as API paths or non-API paths:
@@ -248,12 +310,36 @@ If you add endpoints under a new top-level path prefix, add it to `isAPIPath()`
When adding a new endpoint:
**Routing & auth**
- [ ] Handler in `core/http/endpoints/`
- [ ] Route registered in appropriate `core/http/routes/` file
- [ ] Auth level chosen: public / standard / admin / feature-gated
- [ ] If feature-gated: constant in `permissions.go`, metadata in `features.go`, middleware in `app.go`
- [ ] Entry added to `RouteFeatureRegistry` in `core/http/auth/features.go` (one row per route/method — all /v1/* routes gate through this, not per-route middleware)
- [ ] If new feature: constant in `permissions.go`, added to the right slice (`APIFeatures` default-ON / `AgentFeatures` default-OFF), metadata in `features.go` `*FeatureMetas()`
- [ ] If feature uses group middleware: wired in `core/http/app.go` and passed to the route registration function
- [ ] If new path prefix: added to `isAPIPath()` in `middleware.go`
- [ ] If OpenAI-compatible: entry in `RouteFeatureRegistry`
- [ ] If token-counting: `usageMiddleware` added to middleware chain
- [ ] Error responses use `schema.ErrorResponse` format
**Advertising surfaces (easy to miss — see the [Advertising surfaces](#advertising-surfaces--where-to-register-a-new-capability) section)**
- [ ] Swagger block on the handler: `@Summary`, `@Tags`, `@Param`, `@Success`, `@Router`
- [ ] If new capability area (new swagger tag): entry in `instructionDefs` in `core/http/endpoints/localai/api_instructions.go` + test count bumped in `api_instructions_test.go`
- [ ] If new `FLAG_*` usecase flag: matching `CAP_*` symbol exported from `core/http/react-ui/src/utils/capabilities.js`
- [ ] `docs/content/features/<feature>.md` created; cross-links from related feature pages; entry in `docs/content/whats-new.md`
**Quality**
- [ ] Error responses use `schema.ErrorResponse` format (or `echo.NewHTTPError` with a mapped gRPC status — see the `mapBackendError` helper in `core/http/endpoints/localai/images.go`)
- [ ] Tests cover both authenticated and unauthenticated access
- [ ] Swagger regenerated (`make swagger`) if you changed any `@Router`/`@Tags`/`@Param` annotation
## Companion: MCP admin tool surface
**Required for admin endpoints.** Every new admin endpoint MUST be considered for the MCP admin tool surface — the REST API and the MCP tool catalog can drift silently otherwise, and both the LocalAI Assistant chat modality and the standalone `local-ai mcp-server` rely on `pkg/mcp/localaitools/` to mirror REST.
Two outcomes are acceptable; one is not:
- **Tool added.** The new endpoint is something an admin would manage conversationally (install, list, edit, toggle, upgrade). Follow the full checklist in [.agents/localai-assistant-mcp.md](localai-assistant-mcp.md): add a `LocalAIClient` interface method, implement it in both `inproc` and `httpapi`, register the tool with a `Tool*` constant, update the skill prompts, **and add the route to `toolToHTTPRoute` in `pkg/mcp/localaitools/coverage_test.go`**.
- **Tool deliberately skipped.** The endpoint is internal/diagnostic and adding a chat path would be misleading. Document the decision in the PR description; no code action.
- **Forgot.** This breaks the contract. The `TestToolHTTPRouteMappingComplete` test in `pkg/mcp/localaitools` is a partial guard (it checks every `Tool*` has a route mapping), but it does NOT detect new REST endpoints without a tool — that's still a process check on the PR author.
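The partial guard described in the last bullet can be pictured as the following sketch. The function name, tool names, and route strings are hypothetical, and the real test in `coverage_test.go` may be structured differently. Note what it cannot see: a REST endpoint that has no tool at all never appears in `tools`, so it is never reported.

```go
package main

import (
	"fmt"
	"sort"
)

// missingRouteMappings returns every tool name that has no entry in
// toolToHTTPRoute. It only iterates over known tools, so a new REST endpoint
// without a tool is invisible to this check.
func missingRouteMappings(tools []string, toolToHTTPRoute map[string]string) []string {
	var missing []string
	for _, name := range tools {
		if _, ok := toolToHTTPRoute[name]; !ok {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Hypothetical tool names and routes, for illustration only.
	tools := []string{"list_backends", "install_model"}
	routes := map[string]string{"list_backends": "GET /example/backends"}
	fmt.Println(missingRouteMappings(tools, routes))
}
```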
**Add to the bottom of the checklist above**:
- [ ] If admin: decided whether MCP coverage is needed; if yes, tool registered + map updated; if no, skip-reason in PR description.

111
.agents/ci-caching.md Normal file

@@ -0,0 +1,111 @@
# CI Build Caching
Container builds — both the root LocalAI image (`Dockerfile`) and the per-backend images (`backend/Dockerfile.*`) — share a registry-backed BuildKit cache. This file explains how that cache is laid out, what invalidates it, and how to bypass it.
## Cache layout
- **Cache registry**: `quay.io/go-skynet/ci-cache`
- **One tag per matrix entry**, derived from the existing `tag-suffix`:
- Backend builds (`backend_build.yml`): `cache<tag-suffix>`
- e.g. `cache-gpu-nvidia-cuda-12-llama-cpp`, `cache-cpu-vllm`, `cache-nvidia-l4t-cuda-13-arm64-vllm`
- Root image builds (`image_build.yml`): `cache-localai<tag-suffix>`
- e.g. `cache-localai-gpu-nvidia-cuda-12`, `cache-localai-gpu-vulkan`
- Each tag stores a multi-arch BuildKit cache manifest (`mode=max`), so every intermediate stage is re-usable, not just the final image.
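The naming scheme above amounts to simple string concatenation. A sketch in Go, assuming (as in the examples) that every `tag-suffix` already starts with a hyphen; the function names are illustrative:

```go
package main

import "fmt"

// cacheRepo is the cache registry named in this document.
const cacheRepo = "quay.io/go-skynet/ci-cache"

// backendCacheRef returns the cache reference for a backend build:
// cache<tag-suffix>.
func backendCacheRef(tagSuffix string) string {
	return cacheRepo + ":cache" + tagSuffix
}

// imageCacheRef returns the cache reference for a root image build:
// cache-localai<tag-suffix>.
func imageCacheRef(tagSuffix string) string {
	return cacheRepo + ":cache-localai" + tagSuffix
}

func main() {
	fmt.Println(backendCacheRef("-gpu-nvidia-cuda-12-llama-cpp"))
	fmt.Println(imageCacheRef("-gpu-vulkan"))
}
```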
## Read/write semantics
| Trigger | `cache-from` | `cache-to` |
|---|---|---|
| `push` to `master` / tag | yes | yes (`mode=max,ignore-error=true`) |
| `pull_request` | yes | **no** |
PR builds read master's warm cache but never write — this prevents PRs from polluting the shared cache with their experimental state. After merge, the master build for that matrix entry refreshes the cache.
`ignore-error=true` on the write side means a transient quay push failure does not fail the build; the next master push retries.
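The table can be read as a small decision function. The sketch below is an illustration in Go, not the workflow's actual expression syntax. The function name is hypothetical, while the option strings follow the `type=registry` cache options this document already mentions:

```go
package main

import "fmt"

// cacheFlags returns the cache-from and cache-to values for a build.
// Every trigger reads from the registry cache. Only non-PR triggers write,
// and the write side tolerates push failures via ignore-error=true.
func cacheFlags(eventName, ref string) (cacheFrom, cacheTo string) {
	cacheFrom = "type=registry,ref=" + ref
	if eventName != "pull_request" {
		cacheTo = "type=registry,ref=" + ref + ",mode=max,ignore-error=true"
	}
	return cacheFrom, cacheTo
}

func main() {
	ref := "quay.io/go-skynet/ci-cache:cache-cpu-vllm"
	for _, event := range []string{"push", "pull_request"} {
		from, to := cacheFlags(event, ref)
		fmt.Printf("%s: cache-from=%q cache-to=%q\n", event, from, to)
	}
}
```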
## Self-warming, no separate populator
There is no cron job that pre-warms the cache. The production builds *are* the populator. The first master build of a given matrix entry pays the cold cost; subsequent same-entry master builds reuse everything that hasn't changed (apt installs, gRPC compile in `Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}`, Python wheel installs, etc.).
Historically there was a `generate_grpc_cache.yaml` cron that targeted a `grpc` stage in the root Dockerfile. That stage was removed in July 2025 and the cron silently failed every night for 9 months without writing anything. It was deleted along with the registry-cache rollout.
## The `DEPS_REFRESH` cache-buster (Python backends)
Every Python backend goes through the shared `backend/Dockerfile.python`, which ends with:
```dockerfile
ARG DEPS_REFRESH=initial
RUN cd /${BACKEND} && PORTABLE_PYTHON=true make
```
Most Python backends ship `requirements*.txt` files that **do not pin every transitive dep** (`torch`, `transformers`, `vllm`, `diffusers`, etc. are listed without a `==` pin, or with `>=` lower bounds only). With a warm BuildKit cache, the `make` layer hashes only on Dockerfile instructions + COPYed source — not on what `pip install` resolves at runtime. So a warm cache would ship the *first* version of `vllm` ever cached and never pick up upstream releases.
`DEPS_REFRESH` defends against that:
- `backend_build.yml` computes `date -u +%Y-W%V` (ISO week, e.g. `2026-W17`) before each build and passes it as a build-arg.
- The `RUN ... make` layer's BuildKit hash now includes that string, so the layer invalidates **at most once per week**, automatically picking up newer wheels.
- Within a week, builds stay warm.
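To make the weekly bucket concrete, here is a Go sketch that reproduces the `date -u +%Y-W%V` string. The function names are illustrative and the workflow itself uses the shell command. Note that `%Y` is the calendar year while `%V` is the ISO week number, so the sketch deliberately pairs `Year()` with the week from `ISOWeek()`:

```go
package main

import (
	"fmt"
	"time"
)

// depsRefresh mirrors `date -u +%Y-W%V`: the calendar year followed by the
// zero-padded ISO 8601 week number, both evaluated in UTC.
func depsRefresh(t time.Time) string {
	u := t.UTC()
	_, week := u.ISOWeek()
	return fmt.Sprintf("%d-W%02d", u.Year(), week)
}

// depsRefreshOn is a convenience wrapper for a UTC calendar date.
func depsRefreshOn(year, month, day int) string {
	return depsRefresh(time.Date(year, time.Month(month), day, 0, 0, 0, 0, time.UTC))
}

func main() {
	fmt.Println(depsRefresh(time.Now()))
	fmt.Println(depsRefreshOn(2026, 4, 20)) // 2026-W17
}
```

Every build from Monday 2026-04-20 through Sunday 2026-04-26 gets the same value, so the `make` layer is rebuilt at most once in that window.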
This applies only to `Dockerfile.python` because:
- Go (`Dockerfile.golang`) pins versions in `go.mod` / `go.sum`.
- Rust (`Dockerfile.rust`) pins via `Cargo.lock`.
- C++ backends (`Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}`) clone gRPC at a pinned tag (`v1.65.0`) and llama.cpp at a pinned commit; their inputs don't drift between rebuilds.
### Adjusting the cadence
If you need a faster refresh (e.g. while debugging an upstream flake), bump the format to daily (`+%Y-%m-%d`) or hourly (`+%Y-%m-%d-%H`). If you need a one-shot rebuild for a specific backend without changing the schedule, append a marker to the tag-suffix in the matrix or temporarily delete that backend's cache tag in quay.
## Manually evicting cache
To force a fully cold build for one backend or the whole image:
```bash
# Delete a single tag (requires quay credentials with admin on the repo)
curl -X DELETE \
-H "Authorization: Bearer ${QUAY_TOKEN}" \
https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/cache-gpu-nvidia-cuda-12-vllm
# List all tags
curl -s -H "Authorization: Bearer ${QUAY_TOKEN}" \
"https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/?limit=100" | jq '.tags[].name'
```
Eviction is rarely needed in normal operation: `DEPS_REFRESH` handles weekly drift, source changes invalidate naturally, and the per-entry cache tag (derived from `tag-suffix`) keeps the cache scoped per matrix entry, so a stale tag never bleeds into a different build.
## What the cache **does not** cover
- The "Free Disk Space" / "Release space from worker" steps run on every job — these reclaim ~6 GB on `ubuntu-latest` runners. They are runner-state cleanup, not Docker, and BuildKit caches don't apply.
- Intermediate artifacts of `Build and push (PR)` are not pushed anywhere — PRs only build for verification.
- Darwin builds (see below) — macOS runners have no Docker daemon, so the registry-backed BuildKit cache cannot apply.
## Darwin native caches
`backend_build_darwin.yml` runs natively on `macOS-14` GitHub-hosted runners — there is no Docker, no BuildKit, no cross-job registry cache. Instead, the reusable workflow uses `actions/cache@v4` for four native caches that mirror the spirit of the Linux cache (warm by default, weekly refresh for unpinned Python deps, PRs read-only).
| Cache | Path(s) | Key | Scope |
|---|---|---|---|
| Go modules + build | `~/go/pkg/mod`, `~/Library/Caches/go-build` | `go.sum` (managed by `actions/setup-go@v5` `cache: true`) | All darwin jobs |
| Homebrew | `~/Library/Caches/Homebrew/downloads`, selected `/opt/homebrew/Cellar/*` | hash of `backend_build_darwin.yml` | All darwin jobs |
| ccache (llama.cpp CMake) | `~/Library/Caches/ccache` | pinned `LLAMA_VERSION` from `backend/cpp/llama-cpp/Makefile` | `inputs.backend == 'llama-cpp'` only |
| Python wheels (uv + pip) | `~/Library/Caches/pip`, `~/Library/Caches/uv` | `inputs.backend` + ISO week (`+%Y-W%V`) + hash of that backend's `requirements*.txt` | `inputs.lang == 'python'` only |
Read/write semantics match the BuildKit cache: `actions/cache/restore` runs every time, `actions/cache/save` is gated on `github.event_name != 'pull_request'`. PRs read master's warm cache but never write back.
The Python wheel cache uses the same ISO-week cache-buster as the Linux `DEPS_REFRESH` build-arg — same problem (unpinned `torch`/`mlx`/`diffusers`/`transformers` resolve to fresh wheels weekly), same ~one-cold-rebuild-per-week solution.
The brew Cellar cache requires `HOMEBREW_NO_AUTO_UPDATE=1` and `HOMEBREW_NO_INSTALL_CLEANUP=1` (set as job-level env). Without those, `brew install` would mutate the very directories that were just restored, defeating the cache.
For ccache, the workflow exports `CMAKE_ARGS=… -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache` via `$GITHUB_ENV` before running `make build-darwin-go-backend`. The Makefile in `backend/cpp/llama-cpp/` already forwards `CMAKE_ARGS` through to each variant build (`fallback`, `grpc`, `rpc-server`), so no script changes are needed. The three variants share most translation units, so ccache deduplicates object files across them.
### Cache budget on Darwin
GitHub Actions caches are limited to 10 GB per repo. Steady-state worst case: ~800 MB Go cache + ~2 GB brew Cellar + up to 2 GB ccache + ~1.5 GB × 5 python backends. If the cap is hit, prefer collapsing the per-backend Python keys into a shared `pyenv-darwin-shared-<week>` key (accepts more cross-backend churn for a smaller footprint) before reducing other caches.
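Adding up the estimates above is a useful sanity check. The sketch below treats 1 GB as 1000 MB for a rough figure; the constant names are illustrative and the sizes are the approximations quoted in this section:

```go
package main

import "fmt"

// Approximate steady-state sizes in MB, taken from the estimates above.
const (
	capMB              = 10_000 // GitHub Actions cache limit per repo
	goCacheMB          = 800
	brewCellarMB       = 2_000
	ccacheMB           = 2_000
	pythonPerBackendMB = 1_500
	pythonBackends     = 5
)

// worstCaseMB sums every cache at its estimated upper bound.
func worstCaseMB() int {
	return goCacheMB + brewCellarMB + ccacheMB + pythonPerBackendMB*pythonBackends
}

func main() {
	total := worstCaseMB()
	fmt.Printf("worst case: %d MB, cap: %d MB, over by: %d MB\n", total, capMB, total-capMB)
}
```

Under these assumptions the worst case comes to about 12.3 GB, above the 10 GB cap, with the per-backend Python caches contributing 7.5 GB of it. That is why collapsing the Python keys is the first lever to pull.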
## Touching the cache pipeline
When changing `image_build.yml`, `backend_build.yml`, or any of the `backend/Dockerfile.*` files:
1. **Don't drop `DEPS_REFRESH=...` from the build-args** without a replacement strategy (lockfiles, pinned requirements). Otherwise master will silently freeze on whichever versions were cached at the time.
2. **Keep `tag-suffix` unique per matrix entry** — it's the cache namespace. Two matrix entries sharing a tag-suffix would clobber each other's cache.
3. **Keep `cache-to` gated on `github.event_name != 'pull_request'`** — PRs must not write.
4. **Keep `ignore-error=true` on `cache-to`** — quay registry hiccups must not fail builds.


@@ -42,6 +42,14 @@ trim_trailing_whitespace = false
Use `github.com/mudler/xlog` for logging, which has the same API as `slog`.
## Go tests
All Go tests — including backend tests — must use [Ginkgo](https://onsi.github.io/ginkgo/) (v2) with Gomega matchers, not the stdlib `testing` package with `t.Run` / `t.Errorf`. A test file should register a suite with `RegisterFailHandler(Fail)` in a `TestXxx(t *testing.T)` bootstrap and use `Describe`/`Context`/`It` blocks for the actual cases. Look at any existing `*_test.go` under `core/` or `pkg/` for a template.
Do not mix styles within a package. If you are extending tests in a package that already uses Ginkgo, keep using Ginkgo. If you find stdlib-style Go tests in the tree, treat them as tech debt to be migrated rather than as a pattern to follow.
This is enforced by `golangci-lint` via the `forbidigo` linter (see `.golangci.yml`); calls like `t.Errorf` / `t.Fatalf` / `t.Run` / `t.Skip` / `t.Logf` are flagged. Run `make lint` locally before submitting; the same check runs in CI (`.github/workflows/lint.yml`).
## Documentation
The project documentation is located in `docs/content`. When adding new features or changing existing functionality, it is crucial to update the documentation to reflect these changes. This helps users understand how to use the new capabilities and ensures the documentation stays relevant.


@@ -0,0 +1,97 @@
# LocalAI Assistant — admin MCP server
This document is the contract for **anyone** (human or AI agent) touching LocalAI's admin REST surface, the in-process MCP server that wraps it, or the embedded skill prompts that teach the assistant how to use it. Read this before adding/removing/renaming admin endpoints, MCP tools, or skill recipes.
## What this feature is
`pkg/mcp/localaitools/` is a public Go package that exposes LocalAI's admin/management surface as an MCP server. It is used in two ways:
1. **In-process**: when an admin opens a chat with `metadata.localai_assistant=true`, the chat handler injects the in-memory MCP server (paired `net.Pipe()` transport, no HTTP loopback) so the LLM can install models, manage backends and edit configs by chatting.
2. **Standalone**: the `local-ai mcp-server --target=…` subcommand serves the same MCP server over stdio, talking HTTP to a remote LocalAI instance.
The two modes share **all** tool definitions and skill prompts. They differ only in their `LocalAIClient` implementation (`inproc/` calls services directly; `httpapi/` calls REST).
## The three things you must keep in sync
When you change LocalAI's admin surface, three layers must stay aligned:
1. **REST endpoint** in `core/http/endpoints/localai/*.go`.
2. **MCP tool registration** in `pkg/mcp/localaitools/tools_*.go`, plus a method on `LocalAIClient` (in `client.go`) and implementations in both `inproc/client.go` **and** `httpapi/client.go`.
3. **Skill prompt** under `pkg/mcp/localaitools/prompts/skills/*.md` — the markdown that teaches the LLM how to use the new tool. If the new tool fits an existing recipe, update that recipe; otherwise add a new file.
If you ship a REST endpoint without (2) and (3), conversational admins won't see the feature.
## Checklist for adding a new admin endpoint
- [ ] REST endpoint exists in `core/http/endpoints/localai/*.go` and is gated by `auth.RequireAdmin()` in `core/http/routes/localai.go`.
- [ ] `LocalAIClient` interface in `pkg/mcp/localaitools/client.go` has a method covering the new operation.
- [ ] DTOs added/updated in `pkg/mcp/localaitools/dto.go` (JSON-tagged; never expose raw service types).
- [ ] `inproc/client.go` implements the new method by calling the service directly (not via HTTP loopback).
- [ ] `httpapi/client.go` implements the new method by calling the REST endpoint.
- [ ] Tool registration added in the appropriate `pkg/mcp/localaitools/tools_*.go`. Mutating tools must reference safety rule 1 in the description.
- [ ] If the tool is mutating, ensure `Options{DisableMutating: true}` skips it (mirror the pattern in `tools_models.go`).
- [ ] Skill prompt added or updated under `pkg/mcp/localaitools/prompts/skills/`. The prompt must instruct the LLM when to call the tool, what to ask the user first, and what to do on error.
- [ ] Tests:
  - `pkg/mcp/localaitools/server_test.go` adds the tool name to `expectedFullCatalog` and `expectedReadOnlyCatalog` (if read-only).
  - Tool dispatch is added to `TestEachToolDispatchesToClient`.
  - `pkg/mcp/localaitools/httpapi/client_test.go` covers the new HTTP path.
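To make the `DisableMutating` item concrete, here is a minimal sketch of a catalog that skips mutating tools in read-only mode. Only the `Options{DisableMutating: true}` shape and the two tool names come from this document; the `tool` struct, the `catalog` function and the assumption that `install_model` is mutating while `gallery_search` is read-only are illustrative, and the real pattern lives in `tools_models.go`.

```go
package main

import "fmt"

// Options mirrors the shape referenced in the checklist.
type Options struct{ DisableMutating bool }

// tool is a hypothetical registration record.
type tool struct {
	name     string
	mutating bool
}

// catalog returns the tool names that would be registered for opts.
// Mutating tools are skipped when DisableMutating is set.
func catalog(all []tool, opts Options) []string {
	var names []string
	for _, t := range all {
		if t.mutating && opts.DisableMutating {
			continue
		}
		names = append(names, t.name)
	}
	return names
}

func main() {
	all := []tool{
		{name: "gallery_search", mutating: false},
		{name: "install_model", mutating: true},
	}
	fmt.Println("full:     ", catalog(all, Options{}))
	fmt.Println("read-only:", catalog(all, Options{DisableMutating: true}))
}
```

The two printed lists correspond to what the checklist calls `expectedFullCatalog` and `expectedReadOnlyCatalog`: a new mutating tool should appear only in the first.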
## Adding a new skill recipe (no new tool)
Sometimes you want to teach the LLM a new pattern that uses existing tools. Drop a markdown file under `pkg/mcp/localaitools/prompts/skills/<verb>_<noun>.md`. The file is automatically embedded by `//go:embed` and assembled into the system prompt in lexicographic order. No Go changes needed.
Conventions:
- Filename: `<verb>_<noun>.md` (e.g. `install_chat_model.md`, `upgrade_backend.md`).
- First line: `# Skill: <Title Case description>`.
- Number the steps. Reference exact tool names in backticks.
- If the skill mutates state, remind the LLM to confirm with the user.
## Code conventions
These rules guard against the magic-literal drift that surfaced in the first audit. Do not re-introduce bare strings.
- **Tool names** always come from the `Tool*` constants in `pkg/mcp/localaitools/tools.go`. Tool registrations, the test catalog (`server_test.go`'s `expectedFullCatalog` / `expectedReadOnlyCatalog`), and dispatch tables reference the constants. The embedded skill prompts under `prompts/` keep bare strings — that's the one allowed exception, and `TestPromptsContainSafetyAnchors` enforces alignment.
- **Toggle/pin actions** use the `modeladmin.Action` type (`pkg/mcp/localaitools` and `core/services/modeladmin`). Use `ActionEnable`/`ActionDisable`/`ActionPin`/`ActionUnpin`; never bare `"enable"`/`"pin"` strings.
- **Capability tags** for `list_installed_models` use the `localaitools.Capability` type (`capability.go`). The `LocalAIClient.ListInstalledModels` interface takes a typed `Capability`, and the `inproc` switch only accepts canonical values (`"embed"`/`"embedding"` are not aliases — only `CapabilityEmbeddings`).
- **HTTP error checks** in `httpapi.Client` use `errors.Is(err, ErrHTTPNotFound)`, not substring matches on `err.Error()`. The typed `*HTTPError` carries `StatusCode` and `Body`; add new sentinel errors as needed rather than re-introducing string matching.
- **Channel sends** to `GalleryService.ModelGalleryChannel` / `BackendGalleryChannel` from inproc clients MUST select on `ctx.Done()` so a cancelled chat completion releases the goroutine. See `inproc.sendModelOp` / `sendBackendOp`.
- **Disk writes** of model config YAML go through `modeladmin.writeFileAtomic` (temp file + `os.Rename`). A plain `os.WriteFile` truncates the target before writing, so a crash mid-write leaves a truncated, corrupt model config.
- **MCP server lifecycle**: every initialised holder MUST register `Close()` with `signals.RegisterGracefulTerminationHandler`. The standalone `mcp-server` CLI uses `signal.NotifyContext` to honour SIGINT/SIGTERM.
## File map (where to look)
```
pkg/mcp/localaitools/
client.go # LocalAIClient interface + DTO registry
dto.go # JSON-tagged DTOs shared by both client impls
server.go # NewServer(client, opts) — registers tools
tools.go # Tool* name constants (single source of truth)
capability.go # Capability type + constants
tools_models.go # gallery_search, install_model, import_model_uri, ...
tools_backends.go
tools_config.go
tools_system.go
tools_state.go
prompts.go # //go:embed loader + SystemPrompt(opts)
prompts/00_role.md
prompts/10_safety.md # SAFETY RULES — change with care
prompts/20_tools.md # curated tool catalog with one-liners
prompts/skills/*.md
inproc/client.go # in-process LocalAIClient (services-direct)
httpapi/client.go # REST LocalAIClient (for standalone CLI / remote)
core/http/endpoints/mcp/
localai_assistant.go # process-wide holder + LocalToolExecutor
core/cli/mcp_server.go # local-ai mcp-server subcommand
```
## Why two clients
The in-process MCP server runs inside the same LocalAI binary that serves chat. Going over HTTP loopback would (a) require minting a synthetic admin API key for the server to authenticate against itself, (b) double-marshal every tool dispatch, and (c) lose access to in-process channels (e.g. `GalleryService.ModelGalleryChannel` for streaming install progress). So in-process uses `inproc.Client`. The standalone stdio CLI talks to a *remote* LocalAI; HTTP is the only option, so it uses `httpapi.Client`. Both implement the same `LocalAIClient` interface, and the parity test in `pkg/mcp/localaitools/parity_test.go` (when present) keeps their output equivalent.
## Why prompt-enforced confirmation, not code gates
This is a deliberate KISS choice. Every mutating tool is covered by a safety rule (`prompts/10_safety.md` rule 1) that requires the LLM to summarise the action and wait for explicit user confirmation before calling it. There is no `plan_*`/`apply_*` two-step in code. If you add a mutating tool, do **not** add per-tool confirmation logic in Go. Instead, list the new tool name in `prompts/10_safety.md` so the LLM knows it falls under the confirmation rule.
## Distributed mode
The in-memory MCP server runs only on the head node (where the chat handler runs). `inproc.Client` wraps services that are already distributed-aware (`GalleryService` coordinates with workers; `ListNodes` reads the NATS-populated registry). No NATS routing of MCP tools — the admin surface lives on the head, period.

.docker/apt-mirror.sh Executable file

@@ -0,0 +1,39 @@
#!/bin/sh
# Reconfigure Ubuntu apt sources to point at an alternate mirror.
#
# Used by Dockerfiles via `RUN --mount=type=bind,source=.docker/apt-mirror.sh,...`
# and by CI workflows on the runner to mitigate outages of the default
# archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com pool.
#
# Inputs (env):
# APT_MIRROR Replacement for archive.ubuntu.com and security.ubuntu.com
# (e.g. "http://azure.archive.ubuntu.com" or
# "https://mirrors.edge.kernel.org").
# Leave empty to keep upstream. The trailing "/ubuntu/..."
# path is preserved by the rewrite.
# APT_PORTS_MIRROR Replacement for ports.ubuntu.com (arm64/ppc64el/...).
# Leave empty to keep upstream.
#
# Both default to empty, in which case the script is a no-op.
set -e
if [ -z "${APT_MIRROR}" ] && [ -z "${APT_PORTS_MIRROR}" ]; then
    exit 0
fi
# Ubuntu 24.04 (noble) ships DEB822 sources at /etc/apt/sources.list.d/ubuntu.sources;
# older releases use /etc/apt/sources.list. We rewrite whichever exists.
for f in /etc/apt/sources.list.d/ubuntu.sources /etc/apt/sources.list; do
    [ -f "$f" ] || continue
    if [ -n "${APT_MIRROR}" ]; then
        # Use a comma delimiter so the alternation pipe in the regex
        # is not interpreted as the s/// separator.
        sed -i -E "s,https?://(archive\.ubuntu\.com|security\.ubuntu\.com),${APT_MIRROR},g" "$f"
    fi
    if [ -n "${APT_PORTS_MIRROR}" ]; then
        sed -i -E "s,https?://ports\.ubuntu\.com,${APT_PORTS_MIRROR},g" "$f"
    fi
done
echo "apt-mirror: rewrote sources (APT_MIRROR='${APT_MIRROR}', APT_PORTS_MIRROR='${APT_PORTS_MIRROR}')"
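For readers who want to see what the two `sed` expressions do to a sources file, the rewrite can be reproduced as a pure function. This is an illustrative Go sketch, not part of the change: `rewriteSources` is a hypothetical name and the sample lines in `main` are assumed, typical apt entries.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Same patterns as the sed expressions in the script.
	archiveRe = regexp.MustCompile(`https?://(archive\.ubuntu\.com|security\.ubuntu\.com)`)
	portsRe   = regexp.MustCompile(`https?://ports\.ubuntu\.com`)
)

// rewriteSources swaps only the scheme and host; whatever follows (for
// example "/ubuntu") is left in place. An empty mirror leaves the matching
// hosts untouched.
func rewriteSources(src, mirror, portsMirror string) string {
	if mirror != "" {
		src = archiveRe.ReplaceAllLiteralString(src, mirror)
	}
	if portsMirror != "" {
		src = portsRe.ReplaceAllLiteralString(src, portsMirror)
	}
	return src
}

func main() {
	src := "deb http://archive.ubuntu.com/ubuntu noble main\n" +
		"deb http://security.ubuntu.com/ubuntu noble-security main\n"
	fmt.Print(rewriteSources(src, "https://mirrors.edge.kernel.org", ""))
}
```

One difference from the script is worth noting: the replacement here is inserted literally, whereas in `sed` a mirror URL containing `&` or a comma would be interpreted by the `s,,,` command.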


@@ -0,0 +1,91 @@
name: 'Configure apt mirror'
description: |
  Reconfigure the GitHub Actions runner's Ubuntu apt sources to use an
  alternate mirror, and emit the effective URLs as outputs so callers can
  forward them as Docker build-args.
  Two mirror profiles depending on where the runner lives, because the
  best mirror differs by network:
    * github-hosted runners run on Azure, so they default to the
      Azure-hosted Ubuntu mirror (lowest latency, same VPC).
    * self-hosted runners (arc-runner-set, bigger-runner, ...) typically
      cannot route to azure.archive.ubuntu.com, so they default to the
      kernel.org mirror, which is publicly reachable from anywhere.
  Pass an empty string to either input to skip the rewrite for that
  profile and keep upstream archive.ubuntu.com / ports.ubuntu.com.
inputs:
  github-hosted-mirror:
    description: 'archive/security mirror URL for github-hosted runners (empty = upstream)'
    required: false
    default: 'http://azure.archive.ubuntu.com'
  github-hosted-ports-mirror:
    description: 'ports.ubuntu.com mirror URL for github-hosted runners (empty = upstream)'
    required: false
    default: 'http://azure.ports.ubuntu.com'
  self-hosted-mirror:
    description: 'archive/security mirror URL for self-hosted runners (empty = upstream)'
    required: false
    default: 'https://mirrors.edge.kernel.org'
  self-hosted-ports-mirror:
    description: 'ports.ubuntu.com mirror URL for self-hosted runners (empty = upstream)'
    required: false
    default: 'https://mirrors.edge.kernel.org'
outputs:
  effective-mirror:
    description: 'The mirror URL actually applied for this runner (or empty)'
    value: ${{ steps.pick.outputs.mirror }}
  effective-ports-mirror:
    description: 'The ports mirror URL actually applied for this runner (or empty)'
    value: ${{ steps.pick.outputs.ports-mirror }}
runs:
  using: 'composite'
  steps:
    - name: Pick effective mirror for this runner
      id: pick
      shell: bash
      env:
        RUNNER_ENV: ${{ runner.environment }}
        GH_MIRROR: ${{ inputs.github-hosted-mirror }}
        GH_PORTS_MIRROR: ${{ inputs.github-hosted-ports-mirror }}
        SH_MIRROR: ${{ inputs.self-hosted-mirror }}
        SH_PORTS_MIRROR: ${{ inputs.self-hosted-ports-mirror }}
      run: |
        if [ "${RUNNER_ENV}" = "github-hosted" ]; then
          MIRROR="${GH_MIRROR}"
          PORTS_MIRROR="${GH_PORTS_MIRROR}"
        else
          MIRROR="${SH_MIRROR}"
          PORTS_MIRROR="${SH_PORTS_MIRROR}"
        fi
        echo "configure-apt-mirror: runner=${RUNNER_ENV} mirror='${MIRROR}' ports-mirror='${PORTS_MIRROR}'"
        echo "mirror=${MIRROR}" >> "$GITHUB_OUTPUT"
        echo "ports-mirror=${PORTS_MIRROR}" >> "$GITHUB_OUTPUT"
    - name: Rewrite apt sources
      if: steps.pick.outputs.mirror != '' || steps.pick.outputs.ports-mirror != ''
      shell: bash
      env:
        APT_MIRROR: ${{ steps.pick.outputs.mirror }}
        APT_PORTS_MIRROR: ${{ steps.pick.outputs.ports-mirror }}
      run: |
        set -e
        # Ubuntu 24.04 (noble) ships DEB822 sources at
        # /etc/apt/sources.list.d/ubuntu.sources; older releases use
        # /etc/apt/sources.list. Rewrite whichever exists.
        for f in /etc/apt/sources.list.d/ubuntu.sources /etc/apt/sources.list; do
          sudo test -f "$f" || continue
          if [ -n "${APT_MIRROR}" ]; then
            # Comma delimiter so the alternation pipe in the regex is not
            # interpreted as the s/// separator.
            sudo sed -i -E "s,https?://(archive\.ubuntu\.com|security\.ubuntu\.com),${APT_MIRROR},g" "$f"
          fi
          if [ -n "${APT_PORTS_MIRROR}" ]; then
            sudo sed -i -E "s,https?://ports\.ubuntu\.com,${APT_PORTS_MIRROR},g" "$f"
          fi
        done
        echo "Runner apt mirror configured (APT_MIRROR='${APT_MIRROR}', APT_PORTS_MIRROR='${APT_PORTS_MIRROR}')"
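The selection made by the `pick` step above reduces to a two-way branch on `runner.environment`. The sketch below restates it in Go purely for illustration: the `mirrors` type and `pickMirrors` function are hypothetical, while the default URLs are the ones declared in the action's inputs.

```go
package main

import "fmt"

// mirrors pairs the archive/security mirror with the ports mirror.
type mirrors struct{ Archive, Ports string }

// pickMirrors returns the github-hosted profile only for the exact value
// "github-hosted"; anything else, including an empty string, falls through
// to the self-hosted profile, like the else branch in the step.
func pickMirrors(runnerEnv string, githubHosted, selfHosted mirrors) mirrors {
	if runnerEnv == "github-hosted" {
		return githubHosted
	}
	return selfHosted
}

func main() {
	gh := mirrors{Archive: "http://azure.archive.ubuntu.com", Ports: "http://azure.ports.ubuntu.com"}
	sh := mirrors{Archive: "https://mirrors.edge.kernel.org", Ports: "https://mirrors.edge.kernel.org"}
	for _, env := range []string{"github-hosted", "self-hosted"} {
		fmt.Printf("%s -> %+v\n", env, pickMirrors(env, gh, sh))
	}
}
```

Defaulting the unknown case to the publicly reachable mirror is the conservative choice: a runner that cannot be classified still gets a mirror it can reach.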

.github/bump_vllm_wheel.sh vendored Executable file

@@ -0,0 +1,45 @@
#!/bin/bash
# Bump the cublas13 vLLM wheel pin in requirements-cublas13-after.txt.
#
# vLLM's PyPI wheel is built against CUDA 12 so the cublas13 build pulls a
# cu130-flavoured wheel from vLLM's per-tag index at
# https://wheels.vllm.ai/<TAG>/cu130/. That URL segment is itself version-locked
# (no /latest/ alias upstream), so bumping vLLM means rewriting both the URL
# segment and the version constraint atomically. bump_deps.sh handles git-sha
# vars in Makefiles; this script handles the two-value rewrite specific to the
# vLLM requirements file.
set -xe
REPO=$1 # vllm-project/vllm
FILE=$2 # backend/python/vllm/requirements-cublas13-after.txt
VAR=$3 # VLLM_VERSION (used for output file names so the workflow can read them)
if [ -z "$FILE" ] || [ -z "$REPO" ] || [ -z "$VAR" ]; then
    echo "usage: $0 <repo> <requirements-file> <var-name>" >&2
    exit 1
fi
# /releases/latest returns the most recent non-prerelease tag.
LATEST_TAG=$(curl -sS -H "Accept: application/vnd.github+json" \
    "https://api.github.com/repos/$REPO/releases/latest" \
    | python3 -c "import json,sys; print(json.load(sys.stdin)['tag_name'])")
# Strip leading 'v' (vLLM tags are 'v0.20.0', the URL/version use '0.20.0').
NEW_VERSION="${LATEST_TAG#v}"
set +e
CURRENT_VERSION=$(grep -oE '^vllm==[0-9]+\.[0-9]+\.[0-9]+' "$FILE" | head -1 | cut -d= -f3)
set -e
# sed both lines unconditionally — peter-evans/create-pull-request opens no PR
# when the working tree is clean, so a no-op rewrite is safe.
sed -i "$FILE" \
    -e "s|wheels\.vllm\.ai/[^/]*/cu130|wheels.vllm.ai/$NEW_VERSION/cu130|g" \
    -e "s|^vllm==.*|vllm==$NEW_VERSION|"
if [ -z "$CURRENT_VERSION" ]; then
    echo "Could not find vllm==X.Y.Z in $FILE."
    exit 0
fi
echo "Changes: https://github.com/$REPO/compare/v${CURRENT_VERSION}...${LATEST_TAG}" >> "${VAR}_message.txt"
echo "${NEW_VERSION}" >> "${VAR}_commit.txt"
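The two-value rewrite performed by this script can be restated as a pure function, which makes the "both values change together" property easy to check. This is a hedged Go sketch, not part of the change: `bumpPins` is a hypothetical name, and the sample requirements text (including the `0.19.0` starting version and the `--extra-index-url` line) is assumed rather than copied from the real file.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Same patterns as the two sed expressions in the script.
	urlRe = regexp.MustCompile(`wheels\.vllm\.ai/[^/]*/cu130`)
	pinRe = regexp.MustCompile(`(?m)^vllm==.*$`)
)

// bumpPins rewrites the wheel index URL segment and the version pin in one
// pass. tag is a release tag such as "v0.20.0"; the leading "v" is stripped.
func bumpPins(requirements, tag string) string {
	version := strings.TrimPrefix(tag, "v")
	out := urlRe.ReplaceAllLiteralString(requirements, "wheels.vllm.ai/"+version+"/cu130")
	return pinRe.ReplaceAllLiteralString(out, "vllm=="+version)
}

func main() {
	// Assumed file layout, for illustration only.
	before := "--extra-index-url https://wheels.vllm.ai/0.19.0/cu130/\nvllm==0.19.0\n"
	fmt.Print(bumpPins(before, "v0.20.0"))
}
```

Running the function on its own output with the same tag returns the text unchanged, which matches the script's reliance on a no-op rewrite producing a clean working tree.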


@@ -30,6 +30,7 @@ jobs:
skip-drivers: ${{ matrix.skip-drivers }}
context: ${{ matrix.context }}
ubuntu-version: ${{ matrix.ubuntu-version }}
amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
@@ -140,7 +141,7 @@ jobs:
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-whisperx'
runs-on: 'ubuntu-latest'
@@ -153,7 +154,7 @@ jobs:
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-faster-whisper'
runs-on: 'ubuntu-latest'
@@ -697,6 +698,19 @@ jobs:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-vibevoice-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "vibevoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -710,6 +724,32 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-insightface'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "insightface"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-speaker-recognition'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "speaker-recognition"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -893,6 +933,32 @@ jobs:
backend: "turboquant"
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-vllm'
runs-on: 'arc-runner-set'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "vllm"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-vllm-omni'
runs-on: 'arc-runner-set'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "vllm-omni"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1049,6 +1115,45 @@ jobs:
backend: "diffusers"
dockerfile: "./backend/Dockerfile.python"
context: "./"
- build-type: 'l4t'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-vllm'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
ubuntu-version: '2404'
backend: "vllm"
dockerfile: "./backend/Dockerfile.python"
context: "./"
- build-type: 'l4t'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-vllm-omni'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
ubuntu-version: '2404'
backend: "vllm-omni"
dockerfile: "./backend/Dockerfile.python"
context: "./"
- build-type: 'l4t'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-sglang'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
ubuntu-version: '2404'
backend: "sglang"
dockerfile: "./backend/Dockerfile.python"
context: "./"
- build-type: 'l4t'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1348,6 +1453,19 @@ jobs:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-vibevoice-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "vibevoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1374,6 +1492,19 @@ jobs:
backend: "qwen3-tts-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-vibevoice-cpp'
base-image: "ubuntu:24.04"
ubuntu-version: '2404'
runs-on: 'ubuntu-24.04-arm'
backend: "vibevoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1623,19 +1754,6 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-whisperx'
runs-on: 'bigger-runner'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "whisperx"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
@@ -1657,7 +1775,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-intel-rerankers'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "rerankers"
dockerfile: "./backend/Dockerfile.python"
@@ -1670,7 +1788,7 @@ jobs:
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f32-llama-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "llama-cpp"
dockerfile: "./backend/Dockerfile.llama-cpp"
@@ -2554,6 +2672,85 @@ jobs:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# vibevoice-cpp
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-vibevoice-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "vibevoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f32-vibevoice-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "vibevoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f16-vibevoice-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "vibevoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-vibevoice-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "vibevoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-vibevoice-cpp'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
runs-on: 'ubuntu-24.04-arm'
backend: "vibevoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2204'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-vibevoice-cpp'
base-image: "rocm/dev-ubuntu-24.04:6.4.4"
runs-on: 'ubuntu-latest'
skip-drivers: 'false'
backend: "vibevoice-cpp"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# voxtral
- build-type: ''
cuda-major-version: ""
@@ -2596,6 +2793,20 @@ jobs:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# kokoros (Rust TTS)
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-kokoros'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "kokoros"
dockerfile: "./backend/Dockerfile.rust"
context: "./"
ubuntu-version: '2404'
# local-store
- build-type: ''
cuda-major-version: ""
@@ -2624,6 +2835,34 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
# insightface (face recognition)
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-insightface'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "insightface"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
# speaker-recognition (voice/speaker biometrics)
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-speaker-recognition'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "speaker-recognition"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'intel'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2821,6 +3060,49 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
# sherpa-onnx CPU
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-sherpa-onnx'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "sherpa-onnx"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# sherpa-onnx CUDA 12
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-sherpa-onnx'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "sherpa-onnx"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# sherpa-onnx CUDA 13 — requires onnxruntime 1.24.x+ for the
# gpu_cuda13 tarball; sherpa-onnx SHERPA_COMMIT pins to v1.12.39.
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-sherpa-onnx'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "sherpa-onnx"
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
backend-jobs-darwin:
uses: ./.github/workflows/backend_build_darwin.yml
strategy:
@@ -2863,6 +3145,10 @@ jobs:
tag-suffix: "-metal-darwin-arm64-qwen3-tts-cpp"
build-type: "metal"
lang: "go"
- backend: "vibevoice-cpp"
tag-suffix: "-metal-darwin-arm64-vibevoice-cpp"
build-type: "metal"
lang: "go"
- backend: "voxtral"
tag-suffix: "-metal-darwin-arm64-voxtral"
build-type: "metal"


@@ -58,6 +58,11 @@ on:
required: false
default: '2204'
type: string
amdgpu-targets:
description: 'AMD GPU targets for ROCm/HIP builds'
required: false
default: ''
type: string
secrets:
dockerUsername:
required: false
@@ -75,6 +80,14 @@ jobs:
quay_username: ${{ secrets.quayUsername }}
steps:
- name: Checkout
uses: actions/checkout@v6
with:
submodules: true
- name: Configure apt mirror on runner
id: apt_mirror
uses: ./.github/actions/configure-apt-mirror
- name: Free Disk Space (Ubuntu)
if: inputs.runs-on == 'ubuntu-latest'
@@ -92,18 +105,6 @@ jobs:
docker-images: true
swap-storage: true
- name: Force Install GIT latest
run: |
sudo apt-get update \
&& sudo apt-get install -y software-properties-common \
&& sudo apt-get update \
&& sudo add-apt-repository -y ppa:git-core/ppa \
&& sudo apt-get update \
&& sudo apt-get install -y git
- name: Checkout
uses: actions/checkout@v6
- name: Release space from worker
if: inputs.runs-on == 'ubuntu-latest'
run: |
@@ -201,6 +202,15 @@ jobs:
username: ${{ secrets.quayUsername }}
password: ${{ secrets.quayPassword }}
# Weekly cache-buster for the per-backend `make` step. Most Python
# backends list unpinned deps (torch, transformers, vllm, ...), so a
# warm cache freezes upstream versions indefinitely. Rolling this
# weekly forces a re-resolve of the install layer at most once per
# week, picking up newer wheels without a full cold rebuild.
- name: Compute deps refresh key
id: deps_refresh
run: echo "key=$(date -u +%Y-W%V)" >> "$GITHUB_OUTPUT"
- name: Build and push
uses: docker/build-push-action@v7
if: github.event_name != 'pull_request'
@@ -214,9 +224,14 @@ jobs:
BASE_IMAGE=${{ inputs.base-image }}
BACKEND=${{ inputs.backend }}
UBUNTU_VERSION=${{ inputs.ubuntu-version }}
AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
DEPS_REFRESH=${{ steps.deps_refresh.outputs.key }}
context: ${{ inputs.context }}
file: ${{ inputs.dockerfile }}
cache-from: type=gha
cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}
cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }},mode=max,ignore-error=true
platforms: ${{ inputs.platforms }}
push: ${{ github.event_name != 'pull_request' }}
tags: ${{ steps.meta.outputs.tags }}
@@ -235,9 +250,13 @@ jobs:
BASE_IMAGE=${{ inputs.base-image }}
BACKEND=${{ inputs.backend }}
UBUNTU_VERSION=${{ inputs.ubuntu-version }}
AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
DEPS_REFRESH=${{ steps.deps_refresh.outputs.key }}
context: ${{ inputs.context }}
file: ${{ inputs.dockerfile }}
cache-from: type=gha
cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}
platforms: ${{ inputs.platforms }}
push: ${{ env.quay_username != '' }}
tags: ${{ steps.meta_pull_request.outputs.tags }}


@@ -48,6 +48,13 @@ jobs:
strategy:
matrix:
go-version: ['${{ inputs.go-version }}']
env:
# Keep the brew Cellar stable across cache restores. Without these,
# `brew install` would auto-update brew itself and re-link formulas,
# mutating the very paths the cache just restored.
HOMEBREW_NO_AUTO_UPDATE: '1'
HOMEBREW_NO_INSTALL_CLEANUP: '1'
HOMEBREW_NO_ANALYTICS: '1'
steps:
- name: Clone
uses: actions/checkout@v6
@@ -58,21 +65,141 @@ jobs:
uses: actions/setup-go@v5
with:
go-version: ${{ matrix.go-version }}
cache: false
# Caches ~/go/pkg/mod and ~/Library/Caches/go-build keyed on go.sum.
# Shared across every darwin matrix entry — first job in a run warms
# it, the rest hit warm.
cache: true
# You can test your matrix by printing the current Go version
- name: Display Go version
run: go version
# ---- Homebrew cache ----
# macOS runners have no Docker daemon, so the BuildKit registry cache used
# for Linux backend images (see .agents/ci-caching.md) doesn't apply here.
# We cache the brew downloads + Cellar entries for the formulas we install
# below. Read on every run, write only on master/tag pushes — same policy
# as the Linux registry cache.
- name: Restore Homebrew cache
id: brew-cache
uses: actions/cache/restore@v4
with:
path: |
~/Library/Caches/Homebrew/downloads
/opt/homebrew/Cellar/protobuf
/opt/homebrew/Cellar/grpc
/opt/homebrew/Cellar/protoc-gen-go
/opt/homebrew/Cellar/protoc-gen-go-grpc
/opt/homebrew/Cellar/libomp
/opt/homebrew/Cellar/llvm
/opt/homebrew/Cellar/ccache
key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
- name: Dependencies
run: |
brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm
# ccache is always installed (used by the llama-cpp variant build) so
# the brew cache content stays stable across every backend in the
# matrix — they all share one cache key.
brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache
- name: Save Homebrew cache
if: github.event_name != 'pull_request' && steps.brew-cache.outputs.cache-hit != 'true'
uses: actions/cache/save@v4
with:
path: |
~/Library/Caches/Homebrew/downloads
/opt/homebrew/Cellar/protobuf
/opt/homebrew/Cellar/grpc
/opt/homebrew/Cellar/protoc-gen-go
/opt/homebrew/Cellar/protoc-gen-go-grpc
/opt/homebrew/Cellar/libomp
/opt/homebrew/Cellar/llvm
/opt/homebrew/Cellar/ccache
key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
# ---- ccache for llama.cpp CMake builds ----
# Three CMake variants (fallback, grpc, rpc-server) compile the same
# llama.cpp source tree with overlapping flags — ccache dedupes object
# files across them. Key on the pinned LLAMA_VERSION so a pin bump
# invalidates cleanly; restore-keys fall back to the latest entry for the
# same pin, so unchanged translation units stay warm even though every run
# saves under a new run_id-suffixed key.
- name: Compute llama.cpp version
if: inputs.backend == 'llama-cpp'
id: llama-version
run: |
version=$(grep '^LLAMA_VERSION' backend/cpp/llama-cpp/Makefile | head -1 | cut -d= -f2 | cut -d'?' -f1 | tr -d ' ')
echo "version=${version}" >> "$GITHUB_OUTPUT"
- name: Restore ccache
if: inputs.backend == 'llama-cpp'
id: ccache-cache
uses: actions/cache/restore@v4
with:
path: ~/Library/Caches/ccache
key: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-${{ github.run_id }}
restore-keys: |
ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-
- name: Configure ccache
if: inputs.backend == 'llama-cpp'
run: |
mkdir -p "$HOME/Library/Caches/ccache"
ccache -M 2G
ccache -z
# llama-cpp-darwin.sh reads CMAKE_ARGS / CCACHE_DIR from env.
{
echo "CMAKE_ARGS=${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache"
echo "CCACHE_DIR=$HOME/Library/Caches/ccache"
} >> "$GITHUB_ENV"
# ---- Python wheel cache (uv + pip) ----
# Mirrors the Linux DEPS_REFRESH cadence (see .agents/ci-caching.md): the
# ISO-week segment of the cache key forces at most one cold rebuild per
# backend per week, automatically picking up newer wheels for unpinned
# deps (torch, mlx, diffusers, …). Restore-keys fall back to the most
# recent build of the same backend so off-week PRs still hit warm.
- name: Compute weekly cache bucket
if: inputs.lang == 'python'
id: weekly
run: echo "bucket=$(date -u +%Y-W%V)" >> "$GITHUB_OUTPUT"
- name: Restore Python wheel cache
if: inputs.lang == 'python'
id: pyenv-cache
uses: actions/cache/restore@v4
with:
path: |
~/Library/Caches/pip
~/Library/Caches/uv
key: pyenv-darwin-${{ inputs.backend }}-${{ steps.weekly.outputs.bucket }}-${{ hashFiles(format('backend/python/{0}/requirements*.txt', inputs.backend)) }}
restore-keys: |
pyenv-darwin-${{ inputs.backend }}-
- name: Build ${{ inputs.backend }}-darwin
run: |
make protogen-go
BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend
- name: ccache stats
if: inputs.backend == 'llama-cpp'
run: ccache -s
- name: Save ccache
if: inputs.backend == 'llama-cpp' && github.event_name != 'pull_request'
uses: actions/cache/save@v4
with:
path: ~/Library/Caches/ccache
key: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-${{ github.run_id }}
- name: Save Python wheel cache
if: inputs.lang == 'python' && github.event_name != 'pull_request' && steps.pyenv-cache.outputs.cache-hit != 'true'
uses: actions/cache/save@v4
with:
path: |
~/Library/Caches/pip
~/Library/Caches/uv
key: pyenv-darwin-${{ inputs.backend }}-${{ steps.weekly.outputs.bucket }}-${{ hashFiles(format('backend/python/{0}/requirements*.txt', inputs.backend)) }}
- name: Upload ${{ inputs.backend }}.tar
uses: actions/upload-artifact@v7
with:
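The ccache key scheme in the hunk above can be tried outside CI. The following is a minimal sketch, not the workflow's own code: the helper names `extract_llama_version`, `ccache_restore_prefix` and `ccache_save_key` are hypothetical, and the sample pin, architecture and run id are invented. Only the grep/cut pipeline and the key layout are taken from the diff.

```shell
# Hypothetical helper: the same pipeline as the "Compute llama.cpp version"
# step, reading Makefile text from stdin instead of a file.
extract_llama_version() {
  grep '^LLAMA_VERSION' | head -1 | cut -d= -f2 | cut -d'?' -f1 | tr -d ' '
}

# Hypothetical helper: the restore-keys prefix shared by every run on one pin.
ccache_restore_prefix() {
  printf 'ccache-llama-%s-%s-' "$1" "$2"
}

# Hypothetical helper: the exact key; the run id makes it unique per run.
ccache_save_key() {
  printf '%s%s' "$(ccache_restore_prefix "$1" "$2")" "$3"
}

# Sample values below are invented for illustration.
version=$(printf 'LLAMA_VERSION?=0123abc\n' | extract_llama_version)
echo "version:        ${version}"
echo "restore prefix: $(ccache_restore_prefix ARM64 "$version")"
echo "save key:       $(ccache_save_key ARM64 "$version" 987654321)"
```

Because the exact key ends in the run id, a restore can only match through the prefix. That would explain why the "Save ccache" step does not check `cache-hit`, unlike the Homebrew and Python wheel save steps.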


@@ -53,6 +53,7 @@ jobs:
skip-drivers: ${{ matrix.skip-drivers }}
context: ${{ matrix.context }}
ubuntu-version: ${{ matrix.ubuntu-version }}
amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
secrets:
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}


@@ -50,6 +50,8 @@ jobs:
uses: actions/checkout@v6
with:
fetch-depth: 0
- name: Configure apt mirror on runner
uses: ./.github/actions/configure-apt-mirror
- name: Set up Go
uses: actions/setup-go@v5
with:


@@ -80,5 +80,37 @@ jobs:
body: ${{ steps.bump.outputs.message }}
signoff: true
bump-vllm-wheel:
# vLLM's cu130 wheel comes from a per-tag index URL (no /latest/ alias),
# so the cublas13 requirements file pins both a URL segment and a version
# constraint. bump_deps.sh handles git-sha-in-Makefile only — this job
# rewrites both values atomically when a new vLLM stable tag ships.
if: github.repository == 'mudler/LocalAI'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
- name: Bump vLLM cu130 wheel pin 🔧
id: bump
run: |
bash .github/bump_vllm_wheel.sh vllm-project/vllm backend/python/vllm/requirements-cublas13-after.txt VLLM_VERSION
{
echo 'message<<EOF'
cat "VLLM_VERSION_message.txt"
echo EOF
} >> "$GITHUB_OUTPUT"
{
echo 'commit<<EOF'
cat "VLLM_VERSION_commit.txt"
echo EOF
} >> "$GITHUB_OUTPUT"
rm -rfv VLLM_VERSION_message.txt VLLM_VERSION_commit.txt
- name: Create Pull Request
uses: peter-evans/create-pull-request@v8
with:
token: ${{ secrets.UPDATE_BOT_TOKEN }}
push-to-fork: ci-forks/LocalAI
commit-message: ':arrow_up: Update vllm-project/vllm cu130 wheel'
title: 'chore: :arrow_up: Update vllm-project/vllm cu130 wheel to `${{ steps.bump.outputs.commit }}`'
branch: "update/VLLM_VERSION"
body: ${{ steps.bump.outputs.message }}
signoff: true
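The bump step above passes two multi-line values to later steps through the `name<<DELIMITER` form of `$GITHUB_OUTPUT`. Below is a minimal local sketch of that write, assuming a temporary file stands in for `$GITHUB_OUTPUT`. The helper name `write_multiline_output` and the sample message are hypothetical.

```shell
# Hypothetical helper: append the contents of VALUE_FILE as a multi-line
# output called NAME to OUT, using the same EOF delimiter as the workflow.
write_multiline_output() {
  local name="$1" value_file="$2" out="$3"
  {
    echo "${name}<<EOF"
    cat "$value_file"
    echo EOF
  } >> "$out"
}

# A temporary file stands in for $GITHUB_OUTPUT; the message is invented.
tmpdir=$(mktemp -d)
printf 'line one\nline two\n' > "$tmpdir/message.txt"
write_multiline_output message "$tmpdir/message.txt" "$tmpdir/github_output"
cat "$tmpdir/github_output"
```

One caveat with a fixed delimiter: if the value ever contains a line consisting only of `EOF`, the output is cut short at that line. A randomly generated delimiter avoids this.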


@@ -8,15 +8,9 @@ jobs:
if: github.repository == 'mudler/LocalAI'
runs-on: ubuntu-latest
steps:
- name: Force Install GIT latest
run: |
sudo apt-get update \
&& sudo apt-get install -y software-properties-common \
&& sudo apt-get update \
&& sudo add-apt-repository -y ppa:git-core/ppa \
&& sudo apt-get update \
&& sudo apt-get install -y git
- uses: actions/checkout@v6
- name: Configure apt mirror on runner
uses: ./.github/actions/configure-apt-mirror
- name: Install dependencies
run: |
sudo apt-get update


@@ -2,7 +2,7 @@ name: Gallery Agent
on:
schedule:
- cron: '0 */3 * * *' # Run every 4 hours
- cron: '0 */12 * * *' # Run every 12 hours
workflow_dispatch:
inputs:
search_term:
@@ -54,24 +54,41 @@ jobs:
REPO: ${{ github.repository }}
SEARCH: 'gallery agent in:title'
run: |
# Walk open gallery-agent PRs and act on maintainer comments:
# Walk gallery-agent PRs and act on maintainer comments:
# /gallery-agent blacklist → label `gallery-agent/blacklisted` + close (never repropose)
# /gallery-agent recreate → close without label (next run may repropose)
# Only comments from OWNER / MEMBER / COLLABORATOR are honored so
# random users can't drive the bot.
#
# We scan both open PRs AND recently-closed PRs that don't already
# carry the blacklist label. This covers the common flow where a
# maintainer writes /gallery-agent blacklist and immediately clicks
# Close — without this, the next scheduled run wouldn't see the
# command (PR is already closed) and would repropose the model.
gh label create gallery-agent/blacklisted \
--repo "$REPO" --color ededed \
--description "gallery-agent must not repropose this model" 2>/dev/null || true
prs=$(gh pr list --repo "$REPO" --state open --search "$SEARCH" --json number --jq '.[].number')
prs_open=$(gh pr list --repo "$REPO" --state open --search "$SEARCH" \
--json number --jq '.[].number')
# Closed PRs from the last 14 days that don't yet have the blacklist label.
# Bounded window keeps the scan cheap while covering late-applied commands.
since=$(date -u -d '14 days ago' +%Y-%m-%d)
prs_closed=$(gh pr list --repo "$REPO" --state closed \
--search "$SEARCH closed:>=$since -label:gallery-agent/blacklisted" \
--json number --jq '.[].number')
prs=$(printf '%s\n%s\n' "$prs_open" "$prs_closed" | sort -u | sed '/^$/d')
for pr in $prs; do
state=$(gh pr view "$pr" --repo "$REPO" --json state --jq '.state')
cmds=$(gh pr view "$pr" --repo "$REPO" --json comments \
--jq '.comments[] | select(.authorAssociation=="OWNER" or .authorAssociation=="MEMBER" or .authorAssociation=="COLLABORATOR") | .body')
if echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+blacklist([[:space:]]|$)'; then
echo "PR #$pr: blacklist command found"
echo "PR #$pr: blacklist command found (state=$state)"
gh pr edit "$pr" --repo "$REPO" --add-label gallery-agent/blacklisted || true
gh pr close "$pr" --repo "$REPO" --comment "Blacklisted via \`/gallery-agent blacklist\`. This model will not be reproposed." || true
elif echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+recreate([[:space:]]|$)'; then
if [ "$state" = "OPEN" ]; then
gh pr close "$pr" --repo "$REPO" --comment "Blacklisted via \`/gallery-agent blacklist\`. This model will not be reproposed." || true
fi
elif [ "$state" = "OPEN" ] && echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+recreate([[:space:]]|$)'; then
echo "PR #$pr: recreate command found"
gh pr close "$pr" --repo "$REPO" --comment "Closed via \`/gallery-agent recreate\`. The next scheduled run will propose this model again." || true
fi
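The command matching and PR-list merging in the script above can be exercised without calling `gh`. This is a minimal sketch under that assumption: the helper names `classify_command` and `merge_pr_lists` are hypothetical, and the sample comments and PR numbers are invented. The two regular expressions and the `sort -u | sed` pipeline are copied from the diff.

```shell
# Hypothetical helper: read comment bodies on stdin and print which command,
# if any, they contain. Blacklist wins over recreate, as in the workflow.
classify_command() {
  local cmds
  cmds=$(cat)
  if echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+blacklist([[:space:]]|$)'; then
    echo blacklist
  elif echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+recreate([[:space:]]|$)'; then
    echo recreate
  else
    echo none
  fi
}

# Hypothetical helper: merge two newline-separated PR number lists, dropping
# duplicates and the empty lines left by an empty list.
merge_pr_lists() {
  printf '%s\n%s\n' "$1" "$2" | sort -u | sed '/^$/d'
}

# Sample inputs are invented for illustration.
echo '/gallery-agent blacklist' | classify_command
echo 'please /gallery-agent   recreate' | classify_command
echo '/gallery-agent blacklisted' | classify_command
merge_pr_lists "$(printf '12\n7')" "$(printf '7\n30')"
```

The word-boundary groups on either side of the command are what stop `blacklisted`, or a command glued to preceding text, from matching. Note also that `date -u -d '14 days ago'` in the workflow is GNU `date` syntax, which is available on `ubuntu-latest` but not on macOS runners.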


@@ -1,96 +0,0 @@
name: 'generate and publish GRPC docker caches'
on:
workflow_dispatch:
schedule:
# daily at midnight
- cron: '0 0 * * *'
concurrency:
group: grpc-cache-${{ github.head_ref || github.ref }}-${{ github.repository }}
cancel-in-progress: true
jobs:
generate_caches:
if: github.repository == 'mudler/LocalAI'
strategy:
matrix:
include:
- grpc-base-image: ubuntu:24.04
runs-on: 'ubuntu-latest'
platforms: 'linux/amd64,linux/arm64'
runs-on: ${{matrix.runs-on}}
steps:
- name: Release space from worker
if: matrix.runs-on == 'ubuntu-latest'
run: |
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
df -h
echo
sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
sudo apt-get remove --auto-remove android-sdk-platform-tools || true
sudo apt-get purge --auto-remove android-sdk-platform-tools || true
sudo rm -rf /usr/local/lib/android
sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
sudo rm -rf /usr/share/dotnet
sudo apt-get remove -y '^mono-.*' || true
sudo apt-get remove -y '^ghc-.*' || true
sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
sudo apt-get remove -y 'php.*' || true
sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
sudo apt-get remove -y '^google-.*' || true
sudo apt-get remove -y azure-cli || true
sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
sudo apt-get remove -y '^gfortran-.*' || true
sudo apt-get remove -y microsoft-edge-stable || true
sudo apt-get remove -y firefox || true
sudo apt-get remove -y powershell || true
sudo apt-get remove -y r-base-core || true
sudo apt-get autoremove -y
sudo apt-get clean
echo
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
sudo rm -rfv build || true
sudo rm -rf /usr/share/dotnet || true
sudo rm -rf /opt/ghc || true
sudo rm -rf "/usr/local/share/boost" || true
sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
df -h
- name: Set up QEMU
uses: docker/setup-qemu-action@master
with:
platforms: all
- name: Set up Docker Buildx
id: buildx
uses: docker/setup-buildx-action@master
- name: Checkout
uses: actions/checkout@v6
- name: Cache GRPC
uses: docker/build-push-action@v7
with:
builder: ${{ steps.buildx.outputs.name }}
# The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
# This means that even the MAKEFLAGS have to be an EXACT match.
# If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
build-args: |
GRPC_BASE_IMAGE=${{ matrix.grpc-base-image }}
GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
GRPC_VERSION=v1.65.0
context: .
file: ./Dockerfile
cache-to: type=gha,ignore-error=true
cache-from: type=gha
target: grpc
platforms: ${{ matrix.platforms }}
push: false


@@ -16,7 +16,7 @@ jobs:
strategy:
matrix:
include:
- base-image: intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04
- base-image: intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04
runs-on: 'arc-runner-set'
platforms: 'linux/amd64'
runs-on: ${{matrix.runs-on}}


@@ -20,7 +20,6 @@
platforms: ${{ matrix.platforms }}
runs-on: ${{ matrix.runs-on }}
base-image: ${{ matrix.base-image }}
grpc-base-image: ${{ matrix.grpc-base-image }}
makeflags: ${{ matrix.makeflags }}
ubuntu-version: ${{ matrix.ubuntu-version }}
secrets:
@@ -60,15 +59,13 @@
tag-latest: 'false'
tag-suffix: '-hipblas'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
grpc-base-image: "ubuntu:24.04"
runs-on: 'ubuntu-latest'
makeflags: "--jobs=3 --output-sync=target"
ubuntu-version: '2404'
- build-type: 'sycl'
platforms: 'linux/amd64'
tag-latest: 'false'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
grpc-base-image: "ubuntu:24.04"
base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
tag-suffix: 'sycl'
runs-on: 'ubuntu-latest'
makeflags: "--jobs=3 --output-sync=target"


@@ -25,7 +25,6 @@
platforms: ${{ matrix.platforms }}
runs-on: ${{ matrix.runs-on }}
base-image: ${{ matrix.base-image }}
grpc-base-image: ${{ matrix.grpc-base-image }}
makeflags: ${{ matrix.makeflags }}
ubuntu-version: ${{ matrix.ubuntu-version }}
ubuntu-codename: ${{ matrix.ubuntu-codename }}
@@ -42,12 +41,11 @@
tag-latest: 'auto'
tag-suffix: '-gpu-hipblas'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
grpc-base-image: "ubuntu:24.04"
runs-on: 'ubuntu-latest'
makeflags: "--jobs=3 --output-sync=target"
ubuntu-version: '2404'
ubuntu-codename: 'noble'
core-image-build:
if: github.repository == 'mudler/LocalAI'
uses: ./.github/workflows/image_build.yml
@@ -60,7 +58,6 @@
platforms: ${{ matrix.platforms }}
runs-on: ${{ matrix.runs-on }}
base-image: ${{ matrix.base-image }}
grpc-base-image: ${{ matrix.grpc-base-image }}
makeflags: ${{ matrix.makeflags }}
skip-drivers: ${{ matrix.skip-drivers }}
ubuntu-version: ${{ matrix.ubuntu-version }}
@@ -121,8 +118,7 @@
- build-type: 'intel'
platforms: 'linux/amd64'
tag-latest: 'auto'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
grpc-base-image: "ubuntu:24.04"
base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
tag-suffix: '-gpu-intel'
runs-on: 'ubuntu-latest'
makeflags: "--jobs=3 --output-sync=target"
@@ -141,7 +137,6 @@
platforms: ${{ matrix.platforms }}
runs-on: ${{ matrix.runs-on }}
base-image: ${{ matrix.base-image }}
grpc-base-image: ${{ matrix.grpc-base-image }}
makeflags: ${{ matrix.makeflags }}
skip-drivers: ${{ matrix.skip-drivers }}
ubuntu-version: ${{ matrix.ubuntu-version }}


@@ -8,11 +8,6 @@ on:
description: 'Base image'
required: true
type: string
grpc-base-image:
description: 'GRPC Base image, must be a compatible image with base-image'
required: false
default: ''
type: string
build-type:
description: 'Build type'
default: ''
@@ -75,6 +70,13 @@ jobs:
runs-on: ${{ inputs.runs-on }}
steps:
- name: Checkout
uses: actions/checkout@v6
- name: Configure apt mirror on runner
id: apt_mirror
uses: ./.github/actions/configure-apt-mirror
- name: Free Disk Space (Ubuntu)
if: inputs.runs-on == 'ubuntu-latest'
uses: jlumbroso/free-disk-space@main
@@ -90,16 +92,6 @@ jobs:
large-packages: true
docker-images: true
swap-storage: true
- name: Force Install GIT latest
run: |
sudo apt-get update \
&& sudo apt-get install -y software-properties-common \
&& sudo apt-get update \
&& sudo add-apt-repository -y ppa:git-core/ppa \
&& sudo apt-get update \
&& sudo apt-get install -y git
- name: Checkout
uses: actions/checkout@v6
- name: Release space from worker
if: inputs.runs-on == 'ubuntu-latest'
@@ -201,25 +193,21 @@ jobs:
if: github.event_name != 'pull_request'
with:
builder: ${{ steps.buildx.outputs.name }}
# The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
# This means that even the MAKEFLAGS have to be an EXACT match.
# If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
# This is why some build args like GRPC_VERSION and MAKEFLAGS are hardcoded
build-args: |
BUILD_TYPE=${{ inputs.build-type }}
CUDA_MAJOR_VERSION=${{ inputs.cuda-major-version }}
CUDA_MINOR_VERSION=${{ inputs.cuda-minor-version }}
BASE_IMAGE=${{ inputs.base-image }}
GRPC_BASE_IMAGE=${{ inputs.grpc-base-image || inputs.base-image }}
GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
GRPC_VERSION=v1.65.0
MAKEFLAGS=${{ inputs.makeflags }}
SKIP_DRIVERS=${{ inputs.skip-drivers }}
UBUNTU_VERSION=${{ inputs.ubuntu-version }}
UBUNTU_CODENAME=${{ inputs.ubuntu-codename }}
APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
context: .
file: ./Dockerfile
cache-from: type=gha
cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}
cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }},mode=max,ignore-error=true
platforms: ${{ inputs.platforms }}
push: ${{ github.event_name != 'pull_request' }}
tags: ${{ steps.meta.outputs.tags }}
@@ -230,25 +218,20 @@ jobs:
if: github.event_name == 'pull_request'
with:
builder: ${{ steps.buildx.outputs.name }}
# The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
# This means that even the MAKEFLAGS have to be an EXACT match.
# If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
# This is why some build args like GRPC_VERSION and MAKEFLAGS are hardcoded
build-args: |
BUILD_TYPE=${{ inputs.build-type }}
CUDA_MAJOR_VERSION=${{ inputs.cuda-major-version }}
CUDA_MINOR_VERSION=${{ inputs.cuda-minor-version }}
BASE_IMAGE=${{ inputs.base-image }}
GRPC_BASE_IMAGE=${{ inputs.grpc-base-image || inputs.base-image }}
GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
GRPC_VERSION=v1.65.0
MAKEFLAGS=${{ inputs.makeflags }}
SKIP_DRIVERS=${{ inputs.skip-drivers }}
UBUNTU_VERSION=${{ inputs.ubuntu-version }}
UBUNTU_CODENAME=${{ inputs.ubuntu-codename }}
APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
context: .
file: ./Dockerfile
cache-from: type=gha
cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}
platforms: ${{ inputs.platforms }}
#push: true
tags: ${{ steps.meta_pull_request.outputs.tags }}

48
.github/workflows/lint.yml vendored Normal file

@@ -0,0 +1,48 @@
---
name: 'lint'
on:
pull_request:
paths-ignore:
- 'docs/**'
- 'examples/**'
- 'README.md'
- '**/*.md'
push:
branches:
- master
concurrency:
group: ci-lint-${{ github.head_ref || github.ref }}-${{ github.repository }}
cancel-in-progress: true
jobs:
golangci-lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
# Full history so golangci-lint's new-from-merge-base can reach
# origin/master and compute the diff against it.
fetch-depth: 0
- uses: actions/setup-go@v5
with:
go-version: '1.26.x'
cache: false
- name: install golangci-lint
run: |
curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh \
| sh -s -- -b "$(go env GOPATH)/bin" v2.11.4
- name: generate grpc proto sources
# pkg/grpc/proto/*.go is generated, not checked in. Several packages
# import it, so without this step typecheck fails project-wide.
run: make protogen-go
- name: stub react-ui dist for go:embed
# core/http/app.go has //go:embed react-ui/dist/*; the glob needs at
# least one non-hidden entry to satisfy typecheck. We don't run
# `make react-ui` here because lint doesn't need the real bundle.
run: |
mkdir -p core/http/react-ui/dist
touch core/http/react-ui/dist/index.html
- name: lint
run: make lint


@@ -49,6 +49,8 @@ jobs:
uses: actions/checkout@v6
with:
fetch-depth: 0
- name: Configure apt mirror on runner
uses: ./.github/actions/configure-apt-mirror
- name: Set up Go
uses: actions/setup-go@v5
with:


@@ -36,8 +36,12 @@ jobs:
sglang: ${{ steps.detect.outputs.sglang }}
acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
vibevoice-cpp: ${{ steps.detect.outputs.vibevoice-cpp }}
voxtral: ${{ steps.detect.outputs.voxtral }}
kokoros: ${{ steps.detect.outputs.kokoros }}
insightface: ${{ steps.detect.outputs.insightface }}
speaker-recognition: ${{ steps.detect.outputs.speaker-recognition }}
sherpa-onnx: ${{ steps.detect.outputs.sherpa-onnx }}
steps:
- name: Checkout repository
uses: actions/checkout@v6
@@ -504,6 +508,99 @@ jobs:
- name: Build llama-cpp backend image and run audio transcription gRPC e2e tests
run: |
make test-extra-backend-llama-cpp-transcription
# PR-acceptance smoke gate: always runs on every PR (no detect-changes gate, no
# paths filter). Pulls the pre-built master CPU llama-cpp image from quay
# instead of building from source, so the cost is a docker pull (~30s) plus the
# short Qwen3-0.6B model download. Exercises the full gRPC surface — health,
# load, predict, stream — plus the logprobs/logit_bias specs that moved out of
# core/http/app_test.go. Anything heavier or per-backend is gated to the
# detect-changes path-filter above.
tests-llama-cpp-smoke:
runs-on: ubuntu-latest
timeout-minutes: 20
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Pull pre-built llama-cpp backend image
run: docker pull quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp
- name: Run e2e-backends smoke
env:
BACKEND_IMAGE: quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp
BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias
run: |
make test-extra-backend
# Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked LLM.
# Builds the sherpa-onnx Docker image, extracts the rootfs so the e2e suite
# can discover the backend binary + shared libs, downloads the three model
# bundles (silero-vad, omnilingual-asr, vits-ljs) and drives the realtime
# websocket spec end-to-end.
tests-sherpa-onnx-realtime:
needs: detect-changes
if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Setup Node.js
uses: actions/setup-node@v6
with:
node-version: '22'
- name: Build sherpa-onnx backend image and run realtime e2e tests
run: |
make test-extra-e2e-realtime-sherpa
# Streaming ASR via the sherpa-onnx online recognizer (zipformer
# transducer). Exercises both AudioTranscription (buffered) and
# AudioTranscriptionStream (real-time deltas) on the e2e-backends
# harness.
tests-sherpa-onnx-grpc-transcription:
needs: detect-changes
if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Build sherpa-onnx backend image and run streaming ASR gRPC e2e tests
run: |
make test-extra-backend-sherpa-onnx-transcription
# VITS TTS via the sherpa-onnx backend. Drives both TTS (file write) and
# TTSStream (PCM chunks) on the e2e-backends harness.
tests-sherpa-onnx-grpc-tts:
needs: detect-changes
if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Build sherpa-onnx backend image and run TTS gRPC e2e tests
run: |
make test-extra-backend-sherpa-onnx-tts
tests-ik-llama-cpp-grpc:
needs: detect-changes
if: needs.detect-changes.outputs.ik-llama-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
@@ -696,6 +793,97 @@ jobs:
- name: Test qwen3-tts-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/qwen3-tts-cpp test
# Per-backend smoke for vibevoice-cpp: builds the .so + Go binary and
# runs `make -C backend/go/vibevoice-cpp test`. test.sh auto-downloads
# the published mudler/vibevoice.cpp-models bundle (TTS Q8_0 + ASR Q4_K
# + tokenizer + voice) and runs the closed-loop TTS → ASR Go test.
tests-vibevoice-cpp:
needs: detect-changes
if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y build-essential cmake curl libopenblas-dev ffmpeg
- name: Setup Go
uses: actions/setup-go@v5
- name: Display Go version
run: go version
- name: Proto Dependencies
run: |
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
PATH="$PATH:$HOME/go/bin" make protogen-go
- name: Build vibevoice-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/vibevoice-cpp
- name: Test vibevoice-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/vibevoice-cpp test
# End-to-end TTS via the e2e-backends gRPC harness. Builds the
# vibevoice-cpp Docker image and drives Backend/TTS against it with a
# real LocalAI gRPC client.
tests-vibevoice-cpp-grpc-tts:
needs: detect-changes
if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Build vibevoice-cpp backend image and run TTS gRPC e2e tests
run: |
make test-extra-backend-vibevoice-cpp-tts
# End-to-end transcription via the e2e-backends gRPC harness. The
# vibevoice ASR is a 7B-param model (Q4_K weights ~10 GB on disk)
# and the JFK 30 s decode is too heavy for a free 4-core
# ubuntu-latest pool runner - two CI attempts got SIGTERM'd during
# LoadModel, before the test could even progress. Use the
# self-hosted 'bigger-runner' label (same one the GPU image builds
# in backend.yml use) and the documented dotnet/ghc/android cache
# purge to clear ~10-20 GB of headroom for the model + Docker
# image + working dir.
tests-vibevoice-cpp-grpc-transcription:
needs: detect-changes
if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: bigger-runner
timeout-minutes: 150
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
make build-essential curl unzip ca-certificates git tar
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Free disk space
run: |
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
df -h
- name: Build vibevoice-cpp backend image and run ASR gRPC e2e tests
run: |
make test-extra-backend-vibevoice-cpp-transcription
tests-voxtral:
needs: detect-changes
if: needs.detect-changes.outputs.voxtral == 'true' || needs.detect-changes.outputs.run-all == 'true'
@@ -751,3 +939,55 @@ jobs:
- name: Test kokoros
run: |
make -C backend/rust/kokoros test
tests-insightface-grpc:
needs: detect-changes
if: needs.detect-changes.outputs.insightface == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
make build-essential curl unzip ca-certificates git tar
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.26.0'
- name: Free disk space
run: |
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
df -h
- name: Build insightface backend image and run both model configurations
run: |
make test-extra-backend-insightface-all
tests-speaker-recognition-grpc:
needs: detect-changes
if: needs.detect-changes.outputs.speaker-recognition == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
make build-essential curl ca-certificates git tar
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.26.0'
- name: Free disk space
run: |
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
df -h
- name: Build speaker-recognition backend image and run the ECAPA-TDNN configuration
run: |
make test-extra-backend-speaker-recognition-all


@@ -3,15 +3,18 @@ name: 'tests'
on:
pull_request:
paths-ignore:
- 'docs/**'
- 'examples/**'
- 'README.md'
- '**/*.md'
- 'backend/**'
push:
branches:
- master
tags:
- '*'
env:
GRPC_VERSION: v1.65.0
concurrency:
group: ci-tests-${{ github.head_ref || github.ref }}-${{ github.repository }}
cancel-in-progress: true
@@ -100,73 +103,9 @@ jobs:
node-version: '22'
- name: Build React UI
run: make react-ui
- name: Build backends
run: |
make backends/transformers
mkdir external && mv backends/transformers external/transformers
make backends/llama-cpp backends/local-store backends/silero-vad backends/piper backends/whisper backends/stablediffusion-ggml
- name: Test
run: |
TRANSFORMER_BACKEND=$PWD/external/transformers/run.sh PATH="$PATH:/root/go/bin" GO_TAGS="tts" make --jobs 5 --output-sync=target test
- name: Setup tmate session if tests fail
if: ${{ failure() }}
uses: mxschmitt/action-tmate@v3.23
with:
detached: true
connect-timeout-seconds: 180
limit-access-to-actor: true
tests-e2e-container:
runs-on: ubuntu-latest
steps:
- name: Release space from worker
run: |
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
df -h
echo
sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
sudo apt-get remove --auto-remove android-sdk-platform-tools || true
sudo apt-get purge --auto-remove android-sdk-platform-tools || true
sudo rm -rf /usr/local/lib/android
sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
sudo rm -rf /usr/share/dotnet
sudo apt-get remove -y '^mono-.*' || true
sudo apt-get remove -y '^ghc-.*' || true
sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
sudo apt-get remove -y 'php.*' || true
sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
sudo apt-get remove -y '^google-.*' || true
sudo apt-get remove -y azure-cli || true
sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
sudo apt-get remove -y '^gfortran-.*' || true
sudo apt-get autoremove -y
sudo apt-get clean
echo
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
sudo rm -rfv build || true
df -h
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
# Install protoc
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
PATH="$PATH:$HOME/go/bin" make protogen-go
- name: Test
run: |
PATH="$PATH:$HOME/go/bin" make backends/local-store backends/silero-vad backends/llama-cpp backends/whisper backends/piper backends/stablediffusion-ggml docker-build-e2e e2e-aio
PATH="$PATH:/root/go/bin" make --jobs 5 --output-sync=target test
- name: Setup tmate session if tests fail
if: ${{ failure() }}
uses: mxschmitt/action-tmate@v3.23
@@ -195,7 +134,7 @@ jobs:
run: go version
- name: Dependencies
run: |
brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm opus
brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm opus ffmpeg
pip install --user --no-cache-dir grpcio-tools grpcio
- name: Setup Node.js
uses: actions/setup-node@v6
@@ -203,10 +142,6 @@ jobs:
node-version: '22'
- name: Build React UI
run: make react-ui
- name: Build llama-cpp-darwin
run: |
make protogen-go
make backends/llama-cpp-darwin
- name: Test
run: |
export C_INCLUDE_PATH=/usr/local/include

86
.github/workflows/tests-aio.yml vendored Normal file

@@ -0,0 +1,86 @@
---
name: 'tests-aio'
# Runs the all-in-one (AIO) Docker image with real backends + real models.
# Heavy: builds llama-cpp/whisper/piper/silero-vad/stablediffusion-ggml/local-store
# and exercises end-to-end inference inside the container. Moved out of test.yml
# (which used to run on every PR) so PR CI no longer pays this cost.
#
# Triggers:
# - schedule (nightly @ 04:00 UTC) — catches packaging/image regressions within 24h
# - workflow_dispatch — manual run on-demand
# - push to master/tags — sanity check after merge / before release
on:
schedule:
- cron: '0 4 * * *'
workflow_dispatch:
push:
branches:
- master
tags:
- '*'
concurrency:
group: ci-tests-aio-${{ github.head_ref || github.ref }}-${{ github.repository }}
cancel-in-progress: true
jobs:
tests-aio:
runs-on: ubuntu-latest
steps:
- name: Release space from worker
run: |
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
df -h
echo
sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
sudo apt-get remove --auto-remove android-sdk-platform-tools || true
sudo apt-get purge --auto-remove android-sdk-platform-tools || true
sudo rm -rf /usr/local/lib/android
sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
sudo rm -rf /usr/share/dotnet
sudo apt-get remove -y '^mono-.*' || true
sudo apt-get remove -y '^ghc-.*' || true
sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
sudo apt-get remove -y 'php.*' || true
sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
sudo apt-get remove -y '^google-.*' || true
sudo apt-get remove -y azure-cli || true
sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
sudo apt-get remove -y '^gfortran-.*' || true
sudo apt-get autoremove -y
sudo apt-get clean
echo
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
sudo rm -rfv build || true
df -h
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
# Install protoc
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
PATH="$PATH:$HOME/go/bin" make protogen-go
- name: Test
run: |
PATH="$PATH:$HOME/go/bin" make backends/local-store backends/silero-vad backends/llama-cpp backends/whisper backends/piper backends/stablediffusion-ggml docker-build-e2e e2e-aio
- name: Setup tmate session if tests fail
if: ${{ failure() }}
uses: mxschmitt/action-tmate@v3.23
with:
detached: true
connect-timeout-seconds: 180
limit-access-to-actor: true
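The package listing in the "Release space from worker" step can be tried on sample data. The sketch below is an illustration only, not part of the workflow: `largest_installed` and `sample_rows` are hypothetical helper names, and the rows are made-up stand-ins for real `dpkg-query` output in the `${Installed-Size}\t${Package}\t${Status}` format.

```shell
# Hypothetical helper: keep rows whose last field is "installed",
# print size and package name, and sort by size, largest first.
largest_installed() {
  awk '$NF == "installed" {print $1 "\t" $2}' | sort -nr
}

# Made-up sample rows; a real run would pipe dpkg-query output instead.
sample_rows() {
  printf '2048\tllvm-18\tinstall ok installed\n'
  printf '512\told-pkg\tdeinstall ok config-files\n'
  printf '90000\tdotnet-sdk-8.0\tinstall ok installed\n'
}

sample_rows | largest_installed
```

Packages that were removed but still have config files on disk report a status ending in `config-files`, so the `$NF == "installed"` test drops them from the listing.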


@@ -3,6 +3,12 @@ name: 'E2E Backend Tests'
on:
pull_request:
paths-ignore:
- 'docs/**'
- 'examples/**'
- 'README.md'
- '**/*.md'
- 'backend/**'
push:
branches:
- master
@@ -24,6 +30,8 @@ jobs:
uses: actions/checkout@v6
with:
submodules: true
- name: Configure apt mirror on runner
uses: ./.github/actions/configure-apt-mirror
- name: Setup Go ${{ matrix.go-version }}
uses: actions/setup-go@v5
with:


@@ -26,6 +26,8 @@ jobs:
uses: actions/checkout@v6
with:
submodules: true
- name: Configure apt mirror on runner
uses: ./.github/actions/configure-apt-mirror
- name: Setup Go ${{ matrix.go-version }}
uses: actions/setup-go@v5
with:


@@ -11,6 +11,8 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
- name: Configure apt mirror on runner
uses: ./.github/actions/configure-apt-mirror
- uses: actions/setup-go@v5
with:
go-version: 'stable'

53
.golangci.yml Normal file

@@ -0,0 +1,53 @@
version: "2"
# Only issues introduced relative to master are reported. Pre-existing issues
# in the codebase do not fail the lint job; they're treated as a baseline that
# can be cleaned up incrementally. New code (added lines on a branch) is held
# to the full linter set. Locally, `make lint-all` overrides this and reports
# every issue.
issues:
# origin/master because in shallow CI checkouts only the remote-tracking
# branch exists; a bare 'master' ref isn't reachable locally.
new-from-merge-base: origin/master
linters:
default: standard
# staticcheck is noisy on this codebase (mostly QF style suggestions like
# "could use tagged switch" or "unnecessary fmt.Sprintf"). Re-enable
# selectively if a high-signal subset is identified.
disable:
- staticcheck
enable:
- forbidigo
settings:
forbidigo:
forbid:
- pattern: '^t\.Errorf$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(...) instead of t.Errorf. See .agents/coding-style.md.'
- pattern: '^t\.Error$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(...) instead of t.Error. See .agents/coding-style.md.'
- pattern: '^t\.Fatalf$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(Succeed()) / Fail(...) instead of t.Fatalf. See .agents/coding-style.md.'
- pattern: '^t\.Fatal$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(Succeed()) / Fail(...) instead of t.Fatal. See .agents/coding-style.md.'
- pattern: '^t\.Run$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Describe/Context/It instead of t.Run. See .agents/coding-style.md.'
- pattern: '^t\.Skip$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Skip(...) instead of t.Skip. See .agents/coding-style.md.'
- pattern: '^t\.Skipf$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Skip(...) instead of t.Skipf. See .agents/coding-style.md.'
- pattern: '^t\.SkipNow$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Skip(...) instead of t.SkipNow. See .agents/coding-style.md.'
- pattern: '^t\.Logf$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use GinkgoWriter / fmt.Fprintf(GinkgoWriter, ...) instead of t.Logf. See .agents/coding-style.md.'
- pattern: '^t\.Log$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use GinkgoWriter / fmt.Fprintln(GinkgoWriter, ...) instead of t.Log. See .agents/coding-style.md.'
- pattern: '^t\.Fail$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.Fail. See .agents/coding-style.md.'
- pattern: '^t\.FailNow$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.FailNow. See .agents/coding-style.md.'
exclusions:
paths:
# Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
- 'backend/go/whisper/sources'
- 'docs/'
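The `^...$` anchors in the forbidigo patterns above tie each rule to one exact call, which is why `t.Error` and `t.Errorf` need separate entries. A minimal sketch of that behavior, assuming forbidigo compares each pattern against the call's identifier text as written in the source (for example `t.Errorf`). `matches_forbidden` is a hypothetical helper, and `grep -E` stands in for Go's regexp engine, which should treat these simple patterns the same way.

```shell
# Hypothetical helper: succeed when identifier text $1 matches pattern $2.
matches_forbidden() {
  printf '%s\n' "$1" | grep -Eq "$2"
}

# Check a few identifier spellings against the t.Errorf rule only.
for ident in 't.Errorf' 't.Error' 'tb.Errorf' 't.Errorfoo'; do
  if matches_forbidden "$ident" '^t\.Errorf$'; then
    echo "$ident: forbidden"
  else
    echo "$ident: allowed"
  fi
done
```

Under the same assumption, a helper that receives its `testing.TB` under another name (such as `tb`) would not be caught by these rules.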


@@ -1,13 +1,26 @@
# LocalAI Agent Instructions
This file is an index to detailed topic guides in the `.agents/` directory. Read the relevant file(s) for the task at hand — you don't need to load all of them.
This file is the entry point for AI coding assistants (Claude Code, Cursor, Copilot, Codex, Aider, etc.) working on LocalAI. It is an index to detailed topic guides in the `.agents/` directory. Read the relevant file(s) for the task at hand — you don't need to load all of them.
Human contributors: see [CONTRIBUTING.md](CONTRIBUTING.md) for the development workflow.
## Policy for AI-Assisted Contributions
LocalAI follows the Linux kernel project's [guidelines for AI coding assistants](https://docs.kernel.org/process/coding-assistants.html). Before submitting AI-assisted code, read [.agents/ai-coding-assistants.md](.agents/ai-coding-assistants.md). Key rules:
- **No `Signed-off-by` from AI.** Only the human submitter may sign off on the Developer Certificate of Origin.
- **No `Co-Authored-By: <AI>` trailers.** The human contributor owns the change.
- **Use an `Assisted-by:` trailer** to attribute AI involvement. Format: `Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]`.
- **The human submitter is responsible** for reviewing, testing, and understanding every line of generated code.
## Topics
| File | When to read |
|------|-------------|
| [.agents/ai-coding-assistants.md](.agents/ai-coding-assistants.md) | Policy for AI-assisted contributions — licensing, DCO, attribution |
| [.agents/building-and-testing.md](.agents/building-and-testing.md) | Building the project, running tests, Docker builds for specific platforms |
| [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist |
| [.agents/ci-caching.md](.agents/ci-caching.md) | CI build cache layout (registry-backed BuildKit cache on quay.io/go-skynet/ci-cache), `DEPS_REFRESH` weekly cache-buster for unpinned Python deps, manual eviction |
| [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist, including importer integration (the `/import-model` dropdown is server-driven from `GET /backends/known`) |
| [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
| [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
| [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
@@ -15,6 +28,7 @@ This file is an index to detailed topic guides in the `.agents/` directory. Read
| [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
| [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
| [.agents/adding-gallery-models.md](.agents/adding-gallery-models.md) | Adding GGUF models from HuggingFace to the model gallery |
| [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) | LocalAI Assistant chat modality — adding admin tools to the in-process MCP server, editing skill prompts, keeping REST + MCP + skills in sync |
## Quick Reference
@@ -22,5 +36,7 @@ This file is an index to detailed topic guides in the `.agents/` directory. Read
- **Go style**: Prefer `any` over `interface{}`
- **Comments**: Explain *why*, not *what*
- **Docs**: Update `docs/content/` when adding features or changing config
- **New API endpoints**: LocalAI advertises its capability surface in several independent places — swagger `@Tags`, `/api/instructions` registry, auth `RouteFeatureRegistry`, React UI `capabilities.js`, docs. Read [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) and follow its checklist — missing any surface means clients, admins, and the UI won't know the endpoint exists.
- **Admin endpoints → MCP tool**: every admin endpoint that an admin would manage conversationally (install/list/edit/toggle/upgrade) MUST also be exposed as an MCP tool in `pkg/mcp/localaitools/`. The LocalAI Assistant chat modality and the standalone `local-ai mcp-server` consume that package; drift between REST and MCP is a real risk. Read [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) — the `TestToolHTTPRouteMappingComplete` test fails until you wire the new tool and update the route map.
- **Build**: Inspect `Makefile` and `.github/workflows/` — ask the user before running long builds
- **UI**: The active UI is the React app in `core/http/react-ui/`. The older Alpine.js/HTML UI in `core/http/static/` is pending deprecation — all new UI work goes in the React UI


@@ -13,6 +13,7 @@ Thank you for your interest in contributing to LocalAI! We appreciate your time
- [Development Workflow](#development-workflow)
- [Creating a Pull Request (PR)](#creating-a-pull-request-pr)
- [Coding Guidelines](#coding-guidelines)
- [AI Coding Assistants](#ai-coding-assistants)
- [Testing](#testing)
- [Documentation](#documentation)
- [Community and Communication](#community-and-communication)
@@ -185,7 +186,7 @@ Before jumping into a PR for a massive feature or big change, it is preferred to
This project uses an [`.editorconfig`](.editorconfig) file to define formatting standards (indentation, line endings, charset, etc.). Please configure your editor to respect it.
For AI-assisted development, see [`CLAUDE.md`](CLAUDE.md) for agent-specific guidelines including build instructions and backend architecture details.
For AI-assisted development, see [`AGENTS.md`](AGENTS.md) (or the equivalent [`CLAUDE.md`](CLAUDE.md) symlink) for agent-specific guidelines including build instructions and backend architecture details. Contributions produced with AI assistance must follow the rules in the [AI Coding Assistants](#ai-coding-assistants) section below.
### General Principles
@@ -211,6 +212,26 @@ For AI-assisted development, see [`CLAUDE.md`](CLAUDE.md) for agent-specific gui
- Reviewers will check for correctness, test coverage, adherence to these guidelines, and clarity of intent.
- Be responsive to review feedback and keep discussions constructive.
## AI Coding Assistants
LocalAI follows the **same guidelines as the Linux kernel project** for AI-assisted contributions: <https://docs.kernel.org/process/coding-assistants.html>.
The full policy for this repository lives in [`.agents/ai-coding-assistants.md`](.agents/ai-coding-assistants.md). Summary:
- **AI agents MUST NOT add `Signed-off-by` tags.** Only humans can certify the Developer Certificate of Origin.
- **AI agents MUST NOT add `Co-Authored-By` trailers** attributing themselves as co-authors.
- **Attribute AI involvement with an `Assisted-by` trailer** in the commit message:
```
Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]
```
Example: `Assisted-by: Claude:claude-opus-4-7 golangci-lint`
Basic development tools (git, go, make, editors) should not be listed.
- **The human submitter is responsible** for reviewing, testing, and fully understanding every line of AI-generated code — including verifying that any referenced APIs, flags, or file paths actually exist in the tree.
- Contributions must remain compatible with LocalAI's **MIT License**.
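One way to add the trailer without hand-editing the commit message is `git interpret-trailers`. This is a minimal sketch rather than a required workflow: `add_assisted_by` is a hypothetical helper name, the commit message is made up, and the agent string is the example given in the list above.

```shell
# Hypothetical helper: read a commit message on stdin and print it
# with an Assisted-by trailer appended. $1 is the agent description.
add_assisted_by() {
  git interpret-trailers --trailer "Assisted-by: $1"
}

# Made-up commit message used only for illustration.
printf 'fix(ci): example subject\n\nExample body.\n' |
  add_assisted_by 'Claude:claude-opus-4-7 golangci-lint'
```

Recent Git versions also accept the same string directly as `git commit --trailer "Assisted-by: ..."`.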
## Testing
All new features and bug fixes should include test coverage. The project uses [Ginkgo](https://onsi.github.io/ginkgo/) as its test framework.


@@ -1,13 +1,20 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
ARG INTEL_BASE_IMAGE=${BASE_IMAGE}
ARG UBUNTU_CODENAME=noble
# Optional alternate Ubuntu apt mirror(s). Empty = use upstream.
# See .docker/apt-mirror.sh for accepted values.
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
FROM ${BASE_IMAGE} AS requirements
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
apt-get update && \
apt-get install -y --no-install-recommends \
ca-certificates curl wget espeak-ng libgomp1 \
ffmpeg libopenblas0 libopenblas-dev libopus0 sox && \
@@ -149,6 +156,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
hipblas-dev \
hipblaslt-dev \
rocblas-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
@@ -240,10 +248,14 @@ WORKDIR /build
# This is a temporary workaround until Intel fixes their repository
FROM ${INTEL_BASE_IMAGE} AS intel
ARG UBUNTU_CODENAME=noble
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
RUN wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg
RUN echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu ${UBUNTU_CODENAME}/lts/2350 unified" > /etc/apt/sources.list.d/intel-graphics.list
RUN apt-get update && \
RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
apt-get update && \
apt-get install -y --no-install-recommends \
intel-oneapi-runtime-libs && \
apt-get clean && \

388
Makefile

@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/tinygrad backends/sherpa-onnx
GOCMD=go
GOTEST=$(GOCMD) test
@@ -10,6 +10,13 @@ LAUNCHER_BINARY_NAME=local-ai-launcher
UBUNTU_VERSION?=2404
UBUNTU_CODENAME?=noble
# Optional Ubuntu apt mirror overrides forwarded to docker builds.
# Empty = use upstream archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com.
# Set e.g. APT_MIRROR=http://azure.archive.ubuntu.com to route apt traffic
# during outages of the default Ubuntu pool.
APT_MIRROR?=
APT_PORTS_MIRROR?=
GORELEASER?=
export BUILD_TYPE?=
@@ -65,7 +72,7 @@ endif
TEST_PATHS?=./api/... ./pkg/... ./core/...
.PHONY: all test build vendor
.PHONY: all test build vendor lint lint-all
all: help
@@ -85,6 +92,7 @@ clean: ## Remove build related file
clean-tests:
rm -rf test-models
rm -rf test-dir
rm -f tests/e2e/mock-backend/mock-backend
## Install Go tools
install-go-tools:
@@ -143,32 +151,56 @@ osx-signed: build
run: ## run local-ai
CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./
test-models/testmodel.ggml:
mkdir -p test-models
mkdir -p test-dir
wget -q https://huggingface.co/mradermacher/gpt2-alpaca-gpt4-GGUF/resolve/main/gpt2-alpaca-gpt4.Q4_K_M.gguf -O test-models/testmodel.ggml
wget -q https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin -O test-models/whisper-en
wget -q https://cdn.openai.com/whisper/draft-20220913a/micro-machines.wav -O test-dir/audio.wav
cp tests/models_fixtures/* test-models
prepare-test: protogen-go
cp tests/models_fixtures/* test-models
prepare-test: protogen-go build-mock-backend
########################################################
## Tests
########################################################
## Test targets
test: test-models/testmodel.ggml protogen-go
## After the test-suite reorg (see plans/test-reorg) the default `make test`
## no longer downloads multi-GB GGUF/whisper fixtures or builds llama-cpp /
## transformers / piper / whisper / stablediffusion-ggml. core/http/app_test.go
## now drives the mock-backend binary built by build-mock-backend; real-backend
## inference moved into tests/e2e-backends/ (per-backend, path-filtered) and
## tests/e2e-aio/ (nightly).
test: prepare-test
@echo 'Running tests'
export GO_TAGS="debug"
$(MAKE) prepare-test
OPUS_SHIM_LIBRARY=$(abspath ./pkg/opus/shim/libopusshim.so) \
HUGGINGFACE_GRPC=$(abspath ./)/backend/python/transformers/run.sh TEST_DIR=$(abspath ./)/test-dir/ FIXTURES=$(abspath ./)/tests/fixtures CONFIG_FILE=$(abspath ./)/test-models/config.yaml MODELS_PATH=$(abspath ./)/test-models BACKENDS_PATH=$(abspath ./)/backends \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="!llama-gguf" --flake-attempts $(TEST_FLAKES) --fail-fast -v -r $(TEST_PATHS)
$(MAKE) test-llama-gguf
$(MAKE) test-tts
$(MAKE) test-stablediffusion
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) --fail-fast -v -r $(TEST_PATHS)
########################################################
## Lint
########################################################
## Runs golangci-lint with config from .golangci.yml. Includes the standard
## linter set plus forbidigo, which enforces the Ginkgo/Gomega-only test
## convention documented in .agents/coding-style.md.
##
## LINT_EXCLUDE_DIRS_RE matches directories whose Go packages can't typecheck
## without C/C++ headers we don't install in the lint runner (cgo wrappers
## around llama.cpp, piper/spdlog, silero-vad/onnxruntime, and Fyne/OpenGL for
## the launcher). Their compile-time correctness is enforced by their own
## build pipelines. Keep this as a deny list — `go list ./...` discovers
## everything else automatically, so new packages are scanned by default.
LINT_EXCLUDE_DIRS_RE=/(backend/go/(piper|silero-vad|llm)|cmd/launcher)(/|$$)
lint:
@command -v golangci-lint >/dev/null 2>&1 || { \
echo 'golangci-lint not installed. Install: go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@latest'; \
exit 1; \
}
golangci-lint run $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')
## Like `lint` but reports every issue, including the pre-existing baseline
## that `lint` ignores via .golangci.yml's new-from-merge-base. Use this to
## see what's available to clean up.
lint-all:
@command -v golangci-lint >/dev/null 2>&1 || { \
echo 'golangci-lint not installed. Install: go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@latest'; \
exit 1; \
}
golangci-lint run --new=false --new-from-merge-base= --new-from-rev= $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')
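The deny-list filter shared by `lint` and `lint-all` can be checked on sample paths. A sketch under stated assumptions: `filter_lint_dirs` is a hypothetical helper, the `/repo/...` paths are made up, and the regex is `LINT_EXCLUDE_DIRS_RE` with Make's `$$` already collapsed to the single `$` the shell sees.

```shell
# Same expression as LINT_EXCLUDE_DIRS_RE, as the shell receives it.
LINT_EXCLUDE_DIRS_RE='/(backend/go/(piper|silero-vad|llm)|cmd/launcher)(/|$)'

# Hypothetical helper: drop directories matched by the deny list.
filter_lint_dirs() {
  grep -vE "$LINT_EXCLUDE_DIRS_RE"
}

# Made-up directory list standing in for `go list -e -f '{{.Dir}}' ./...`.
printf '%s\n' \
  /repo/core/http \
  /repo/backend/go/piper \
  /repo/backend/go/llm/sub \
  /repo/backend/go/pipermail \
  /repo/cmd/launcher \
  /repo/cmd/launcher-tools |
  filter_lint_dirs
```

The trailing `(/|$)` group is what keeps look-alike names such as `pipermail` or `launcher-tools` in the lint set: a denied directory name must be followed by a slash or the end of the path.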
########################################################
## E2E AIO tests (uses standard image with pre-configured models)
@@ -184,6 +216,8 @@ docker-build-e2e:
--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
--build-arg APT_MIRROR=$(APT_MIRROR) \
--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
--build-arg GO_TAGS="$(GO_TAGS)" \
-t local-ai:tests -f Dockerfile .
@@ -211,6 +245,8 @@ prepare-e2e:
--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
--build-arg APT_MIRROR=$(APT_MIRROR) \
--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
--build-arg GO_TAGS="$(GO_TAGS)" \
--build-arg MAKEFLAGS="$(DOCKER_MAKEFLAGS)" \
-t localai-tests .
@@ -235,20 +271,12 @@ teardown-e2e:
## Integration and unit tests
########################################################
test-llama-gguf: prepare-test
TEST_DIR=$(abspath ./)/test-dir/ FIXTURES=$(abspath ./)/tests/fixtures CONFIG_FILE=$(abspath ./)/test-models/config.yaml MODELS_PATH=$(abspath ./)/test-models BACKENDS_PATH=$(abspath ./)/backends \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="llama-gguf" --flake-attempts $(TEST_FLAKES) -v -r $(TEST_PATHS)
test-tts: prepare-test
TEST_DIR=$(abspath ./)/test-dir/ FIXTURES=$(abspath ./)/tests/fixtures CONFIG_FILE=$(abspath ./)/test-models/config.yaml MODELS_PATH=$(abspath ./)/test-models BACKENDS_PATH=$(abspath ./)/backends \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="tts" --flake-attempts $(TEST_FLAKES) -v -r $(TEST_PATHS)
test-stablediffusion: prepare-test
TEST_DIR=$(abspath ./)/test-dir/ FIXTURES=$(abspath ./)/tests/fixtures CONFIG_FILE=$(abspath ./)/test-models/config.yaml MODELS_PATH=$(abspath ./)/test-models BACKENDS_PATH=$(abspath ./)/backends \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="stablediffusion" --flake-attempts $(TEST_FLAKES) -v -r $(TEST_PATHS)
test-stores:
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="stores" --flake-attempts $(TEST_FLAKES) -v -r tests/integration
## Storage / vector-store integration. Requires the local-store backend to
## be available — we build it on demand and pass its location via
## BACKENDS_PATH (the model loader looks there for the gRPC binary).
test-stores: backends/local-store
BACKENDS_PATH=$(abspath ./)/backends \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r tests/integration
test-opus:
@echo 'Running opus backend tests'
@@ -260,6 +288,8 @@ test-opus-docker:
docker build --target builder \
--build-arg BUILD_TYPE=$(or $(BUILD_TYPE),) \
--build-arg BASE_IMAGE=$(or $(BASE_IMAGE),ubuntu:24.04) \
--build-arg APT_MIRROR=$(APT_MIRROR) \
--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
--build-arg BACKEND=opus \
-t localai-opus-test -f backend/Dockerfile.golang .
docker run --rm localai-opus-test \
@@ -269,23 +299,13 @@ test-realtime: build-mock-backend
@echo 'Running realtime e2e tests (mock backend)'
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="Realtime && !real-models" --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e
# Real-model realtime tests. Set REALTIME_TEST_MODEL to use your own pipeline,
# or leave unset to auto-build one from the component env vars below.
# Container-based real-model realtime testing. Build env vars / pipeline
# definition kept here so test-realtime-models-docker can drive a fully wired
# pipeline (VAD + STT + LLM + TTS) from inside a containerised runner.
REALTIME_VAD?=silero-vad-ggml
REALTIME_STT?=whisper-1
REALTIME_LLM?=qwen3-0.6b
REALTIME_TTS?=tts-1
REALTIME_BACKENDS_PATH?=$(abspath ./)/backends
test-realtime-models: build-mock-backend
@echo 'Running realtime e2e tests (real models)'
REALTIME_TEST_MODEL=$${REALTIME_TEST_MODEL:-realtime-test-pipeline} \
REALTIME_VAD=$(REALTIME_VAD) \
REALTIME_STT=$(REALTIME_STT) \
REALTIME_LLM=$(REALTIME_LLM) \
REALTIME_TTS=$(REALTIME_TTS) \
REALTIME_BACKENDS_PATH=$(REALTIME_BACKENDS_PATH) \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="Realtime" --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e
# --- Container-based real-model testing ---
@@ -311,6 +331,8 @@ test-realtime-models-docker: build-mock-backend
--build-arg BUILD_TYPE=$(or $(BUILD_TYPE),cublas) \
--build-arg CUDA_MAJOR_VERSION=$(or $(CUDA_MAJOR_VERSION),13) \
--build-arg CUDA_MINOR_VERSION=$(or $(CUDA_MINOR_VERSION),0) \
--build-arg APT_MIRROR=$(APT_MIRROR) \
--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
-t localai-test-runner .
docker run --rm \
$(REALTIME_DOCKER_FLAGS) \
@@ -394,7 +416,13 @@ protoc:
.PHONY: protogen-go
protogen-go: protoc install-go-tools
mkdir -p pkg/grpc/proto
./protoc --experimental_allow_proto3_optional -Ibackend/ --go_out=pkg/grpc/proto/ --go_opt=paths=source_relative --go-grpc_out=pkg/grpc/proto/ --go-grpc_opt=paths=source_relative \
# install-go-tools writes protoc-gen-go and protoc-gen-go-grpc into
# $(shell go env GOPATH)/bin, which isn't on every dev's PATH. protoc
# resolves its code-gen plugins via PATH, so without this prefix the
# generate step fails with "protoc-gen-go: program not found". Prepend
# GOPATH/bin so the freshly-installed plugins win without requiring a
# shell-profile change.
PATH="$$(go env GOPATH)/bin:$$PATH" ./protoc --experimental_allow_proto3_optional -Ibackend/ --go_out=pkg/grpc/proto/ --go_opt=paths=source_relative --go-grpc_out=pkg/grpc/proto/ --go-grpc_opt=paths=source_relative \
backend/backend.proto
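The per-command `PATH` prefix described in the protogen-go comment can be demonstrated without protoc installed. A sketch with hypothetical names: `resolve_with_prefix` is a helper written for this example, and `protoc-gen-demo` is a stand-in plugin created in a temporary directory.

```shell
# Hypothetical helper: look up command $2 with directory $1 prepended
# to PATH for that single lookup; the calling shell's PATH is untouched.
resolve_with_prefix() {
  PATH="$1:$PATH" sh -c 'command -v "$1"' sh "$2"
}

# Create a stand-in plugin in a throwaway directory.
plugin_dir=$(mktemp -d)
printf '#!/bin/sh\necho demo-plugin\n' > "$plugin_dir/protoc-gen-demo"
chmod +x "$plugin_dir/protoc-gen-demo"

# Found through the prefix...
resolve_with_prefix "$plugin_dir" protoc-gen-demo
# ...but still absent from the shell's own PATH.
command -v protoc-gen-demo || echo 'not on the shell PATH'

rm -rf "$plugin_dir"
```

Scoping the assignment to one command is the same design choice the recipe makes: the plugins resolve for that invocation only, with no change to the developer's shell profile.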
core/config/inference_defaults.json: ## Fetch inference defaults from unsloth (only if missing)
@@ -434,6 +462,8 @@ prepare-test-extra: protogen-python
$(MAKE) -C backend/python/ace-step
$(MAKE) -C backend/python/trl
$(MAKE) -C backend/python/tinygrad
$(MAKE) -C backend/python/insightface
$(MAKE) -C backend/python/speaker-recognition
$(MAKE) -C backend/rust/kokoros kokoros-grpc
test-extra: prepare-test-extra
@@ -457,6 +487,8 @@ test-extra: prepare-test-extra
$(MAKE) -C backend/python/ace-step test
$(MAKE) -C backend/python/trl test
$(MAKE) -C backend/python/tinygrad test
$(MAKE) -C backend/python/insightface test
$(MAKE) -C backend/python/speaker-recognition test
$(MAKE) -C backend/rust/kokoros test
##
@@ -507,11 +539,20 @@ test-extra-backend: protogen-go
BACKEND_TEST_TOOL_NAME="$$BACKEND_TEST_TOOL_NAME" \
BACKEND_TEST_CACHE_TYPE_K="$$BACKEND_TEST_CACHE_TYPE_K" \
BACKEND_TEST_CACHE_TYPE_V="$$BACKEND_TEST_CACHE_TYPE_V" \
BACKEND_TEST_FACE_IMAGE_1_URL="$$BACKEND_TEST_FACE_IMAGE_1_URL" \
BACKEND_TEST_FACE_IMAGE_1_FILE="$$BACKEND_TEST_FACE_IMAGE_1_FILE" \
BACKEND_TEST_FACE_IMAGE_2_URL="$$BACKEND_TEST_FACE_IMAGE_2_URL" \
BACKEND_TEST_FACE_IMAGE_2_FILE="$$BACKEND_TEST_FACE_IMAGE_2_FILE" \
BACKEND_TEST_FACE_IMAGE_3_URL="$$BACKEND_TEST_FACE_IMAGE_3_URL" \
BACKEND_TEST_FACE_IMAGE_3_FILE="$$BACKEND_TEST_FACE_IMAGE_3_FILE" \
BACKEND_TEST_VERIFY_DISTANCE_CEILING="$$BACKEND_TEST_VERIFY_DISTANCE_CEILING" \
go test -v -timeout 30m ./tests/e2e-backends/...
## Convenience wrappers: build the image, then exercise it.
test-extra-backend-llama-cpp: docker-build-llama-cpp
BACKEND_IMAGE=local-ai-backend:llama-cpp $(MAKE) test-extra-backend
BACKEND_IMAGE=local-ai-backend:llama-cpp \
BACKEND_TEST_CAPS=health,load,predict,stream,logprobs,logit_bias \
$(MAKE) test-extra-backend
test-extra-backend-ik-llama-cpp: docker-build-ik-llama-cpp
BACKEND_IMAGE=local-ai-backend:ik-llama-cpp $(MAKE) test-extra-backend
@@ -603,6 +644,236 @@ test-extra-backend-tinygrad-all: \
test-extra-backend-tinygrad-sd \
test-extra-backend-tinygrad-whisper
## insightface — face recognition.
##
## Face fixtures default to the sample images shipped in the
## deepinsight/insightface repository (MIT-licensed). For offline/local
## runs override with BACKEND_TEST_FACE_IMAGE_{1,2,3}_FILE pointing at
## local paths.
FACE_IMAGE_1_URL ?= https://github.com/deepinsight/insightface/raw/master/python-package/insightface/data/images/t1.jpg
FACE_IMAGE_2_URL ?= https://github.com/deepinsight/insightface/raw/master/python-package/insightface/data/images/t1.jpg
FACE_IMAGE_3_URL ?= https://github.com/deepinsight/insightface/raw/master/python-package/insightface/data/images/mask_white.jpg
## Known spoof fixture used by the face_antispoof e2e cap. This is
## upstream's own `image_F2.jpg` (Silent-Face repo, via yakhyo mirror)
## — verified to classify as is_real=false with score < 0.05 on the
## MiniFASNetV2 + MiniFASNetV1SE ensemble.
FACE_SPOOF_IMAGE_URL ?= https://github.com/yakhyo/face-anti-spoofing/raw/main/assets/image_F2.jpg
## Host-side cache for the OpenCV Zoo face ONNX files used by the
## opencv e2e target. The backend image no longer bakes model weights —
## gallery installs bring them via `files:` — but the e2e suite drives
## LoadModel over gRPC directly without going through the gallery. We
## pre-download the ONNX files to a stable host path and pass absolute
## paths in BACKEND_TEST_OPTIONS; `make` skips the downloads when the
## SHA-256 already matches.
INSIGHTFACE_OPENCV_DIR := /tmp/localai-insightface-opencv-cache
INSIGHTFACE_OPENCV_YUNET_URL := https://github.com/opencv/opencv_zoo/raw/main/models/face_detection_yunet/face_detection_yunet_2023mar.onnx
INSIGHTFACE_OPENCV_SFACE_URL := https://github.com/opencv/opencv_zoo/raw/main/models/face_recognition_sface/face_recognition_sface_2021dec.onnx
INSIGHTFACE_OPENCV_YUNET_SHA := 8f2383e4dd3cfbb4553ea8718107fc0423210dc964f9f4280604804ed2552fa4
INSIGHTFACE_OPENCV_SFACE_SHA := 0ba9fbfa01b5270c96627c4ef784da859931e02f04419c829e83484087c34e79
## buffalo_sc (insightface) — pack zip + SHA-256 mirrors the gallery
## entry so the e2e target matches exactly what `local-ai models install
## insightface-buffalo-sc` would have fetched. Smallest insightface pack
## (~16MB) — keeps CI fast while still covering the insightface engine
## code path end-to-end.
INSIGHTFACE_BUFFALO_SC_DIR := /tmp/localai-insightface-buffalo-sc-cache
INSIGHTFACE_BUFFALO_SC_URL := https://github.com/deepinsight/insightface/releases/download/v0.7/buffalo_sc.zip
INSIGHTFACE_BUFFALO_SC_SHA := 57d31b56b6ffa911c8a73cfc1707c73cab76efe7f13b675a05223bf42de47c72
## Silent-Face antispoofing (MiniFASNetV2 + MiniFASNetV1SE) — shared
## between the buffalo_sc and opencv e2e targets. Both ONNX files are
## ~1.7MB, Apache 2.0. URLs + SHAs mirror the gallery entries.
INSIGHTFACE_ANTISPOOF_DIR := /tmp/localai-insightface-antispoof-cache
INSIGHTFACE_ANTISPOOF_V2_URL := https://github.com/yakhyo/face-anti-spoofing/releases/download/weights/MiniFASNetV2.onnx
INSIGHTFACE_ANTISPOOF_V2_SHA := b32929adc2d9c34b9486f8c4c7bc97c1b69bc0ea9befefc380e4faae4e463907
INSIGHTFACE_ANTISPOOF_V1SE_URL := https://github.com/yakhyo/face-anti-spoofing/releases/download/weights/MiniFASNetV1SE.onnx
INSIGHTFACE_ANTISPOOF_V1SE_SHA := ebab7f90c7833fbccd46d3a555410e78d969db5438e169b6524be444862b3676
.PHONY: insightface-opencv-models
insightface-opencv-models:
@mkdir -p $(INSIGHTFACE_OPENCV_DIR)
@if [ "$$(sha256sum $(INSIGHTFACE_OPENCV_DIR)/yunet.onnx 2>/dev/null | awk '{print $$1}')" != "$(INSIGHTFACE_OPENCV_YUNET_SHA)" ]; then \
echo "Fetching YuNet..."; \
curl -fsSL -o $(INSIGHTFACE_OPENCV_DIR)/yunet.onnx $(INSIGHTFACE_OPENCV_YUNET_URL); \
echo "$(INSIGHTFACE_OPENCV_YUNET_SHA) $(INSIGHTFACE_OPENCV_DIR)/yunet.onnx" | sha256sum -c; \
fi
@if [ "$$(sha256sum $(INSIGHTFACE_OPENCV_DIR)/sface.onnx 2>/dev/null | awk '{print $$1}')" != "$(INSIGHTFACE_OPENCV_SFACE_SHA)" ]; then \
echo "Fetching SFace..."; \
curl -fsSL -o $(INSIGHTFACE_OPENCV_DIR)/sface.onnx $(INSIGHTFACE_OPENCV_SFACE_URL); \
echo "$(INSIGHTFACE_OPENCV_SFACE_SHA) $(INSIGHTFACE_OPENCV_DIR)/sface.onnx" | sha256sum -c; \
fi
.PHONY: insightface-antispoof-models
insightface-antispoof-models:
@mkdir -p $(INSIGHTFACE_ANTISPOOF_DIR)
@if [ "$$(sha256sum $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV2.onnx 2>/dev/null | awk '{print $$1}')" != "$(INSIGHTFACE_ANTISPOOF_V2_SHA)" ]; then \
echo "Fetching MiniFASNetV2..."; \
curl -fsSL -o $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV2.onnx $(INSIGHTFACE_ANTISPOOF_V2_URL); \
echo "$(INSIGHTFACE_ANTISPOOF_V2_SHA) $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV2.onnx" | sha256sum -c; \
fi
@if [ "$$(sha256sum $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV1SE.onnx 2>/dev/null | awk '{print $$1}')" != "$(INSIGHTFACE_ANTISPOOF_V1SE_SHA)" ]; then \
echo "Fetching MiniFASNetV1SE..."; \
curl -fsSL -o $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV1SE.onnx $(INSIGHTFACE_ANTISPOOF_V1SE_URL); \
echo "$(INSIGHTFACE_ANTISPOOF_V1SE_SHA) $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV1SE.onnx" | sha256sum -c; \
fi
.PHONY: insightface-buffalo-sc-models
insightface-buffalo-sc-models:
@mkdir -p $(INSIGHTFACE_BUFFALO_SC_DIR)
@if [ "$$(sha256sum $(INSIGHTFACE_BUFFALO_SC_DIR)/buffalo_sc.zip 2>/dev/null | awk '{print $$1}')" != "$(INSIGHTFACE_BUFFALO_SC_SHA)" ]; then \
echo "Fetching buffalo_sc..."; \
curl -fsSL -o $(INSIGHTFACE_BUFFALO_SC_DIR)/buffalo_sc.zip $(INSIGHTFACE_BUFFALO_SC_URL); \
echo "$(INSIGHTFACE_BUFFALO_SC_SHA) $(INSIGHTFACE_BUFFALO_SC_DIR)/buffalo_sc.zip" | sha256sum -c; \
rm -f $(INSIGHTFACE_BUFFALO_SC_DIR)/*.onnx; \
fi
@if [ ! -f "$(INSIGHTFACE_BUFFALO_SC_DIR)/det_500m.onnx" ]; then \
echo "Extracting buffalo_sc..."; \
unzip -o -q $(INSIGHTFACE_BUFFALO_SC_DIR)/buffalo_sc.zip -d $(INSIGHTFACE_BUFFALO_SC_DIR); \
fi
## buffalo_sc — smallest insightface pack (SCRFD-500MF detector + MBF
## recognizer, ~16MB). Exercises the insightface engine code path
## (model_zoo-backed inference) without the ~326MB buffalo_l download.
## No age/gender/landmark heads — face_analyze is dropped from caps.
## The pack is pre-fetched on the host and passed as `root:<dir>` since
## the e2e suite drives LoadModel directly without going through
## LocalAI's gallery flow (which is what would normally populate
## ModelPath and in turn the engine's `_model_dir` option).
test-extra-backend-insightface-buffalo-sc: docker-build-insightface insightface-buffalo-sc-models insightface-antispoof-models
BACKEND_IMAGE=local-ai-backend:insightface \
BACKEND_TEST_MODEL_NAME=insightface-buffalo-sc \
BACKEND_TEST_OPTIONS=engine:insightface,model_pack:buffalo_sc,root:$(INSIGHTFACE_BUFFALO_SC_DIR),antispoof_v2_onnx:$(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV2.onnx,antispoof_v1se_onnx:$(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV1SE.onnx \
BACKEND_TEST_CAPS=health,load,face_detect,face_embed,face_verify,face_antispoof \
BACKEND_TEST_FACE_IMAGE_1_URL=$(FACE_IMAGE_1_URL) \
BACKEND_TEST_FACE_IMAGE_2_URL=$(FACE_IMAGE_2_URL) \
BACKEND_TEST_FACE_IMAGE_3_URL=$(FACE_IMAGE_3_URL) \
BACKEND_TEST_FACE_SPOOF_IMAGE_URL=$(FACE_SPOOF_IMAGE_URL) \
BACKEND_TEST_VERIFY_DISTANCE_CEILING=0.55 \
$(MAKE) test-extra-backend
## OpenCV Zoo YuNet + SFace — Apache 2.0, commercial-safe. face_analyze
## cap is dropped (SFace has no demographic head). The ONNX files are
## pre-fetched on the host via the insightface-opencv-models target and
## passed as absolute paths, since the e2e suite drives LoadModel
## directly without going through LocalAI's gallery flow.
test-extra-backend-insightface-opencv: docker-build-insightface insightface-opencv-models insightface-antispoof-models
BACKEND_IMAGE=local-ai-backend:insightface \
BACKEND_TEST_MODEL_NAME=insightface-opencv \
BACKEND_TEST_OPTIONS=engine:onnx_direct,detector_onnx:$(INSIGHTFACE_OPENCV_DIR)/yunet.onnx,recognizer_onnx:$(INSIGHTFACE_OPENCV_DIR)/sface.onnx,antispoof_v2_onnx:$(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV2.onnx,antispoof_v1se_onnx:$(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV1SE.onnx \
BACKEND_TEST_CAPS=health,load,face_detect,face_embed,face_verify,face_antispoof \
BACKEND_TEST_FACE_IMAGE_1_URL=$(FACE_IMAGE_1_URL) \
BACKEND_TEST_FACE_IMAGE_2_URL=$(FACE_IMAGE_2_URL) \
BACKEND_TEST_FACE_IMAGE_3_URL=$(FACE_IMAGE_3_URL) \
BACKEND_TEST_FACE_SPOOF_IMAGE_URL=$(FACE_SPOOF_IMAGE_URL) \
BACKEND_TEST_VERIFY_DISTANCE_CEILING=0.55 \
$(MAKE) test-extra-backend
## Aggregate — runs both face-recognition model configurations so CI
## catches regressions across engines together.
test-extra-backend-insightface-all: \
test-extra-backend-insightface-buffalo-sc \
test-extra-backend-insightface-opencv
## speaker-recognition — voice (speaker) biometrics.
##
## Audio fixtures default to the speechbrain test samples served
## straight from their GitHub repo — public, no auth needed, and they
## ship as 16kHz mono WAV/FLAC which is exactly what the engine wants.
## example{1,2,5} are three different speakers; the suite treats
## example1 as the "same-clip twin" probe (verify(clip, clip) must
## return distance≈0) and the other two as cross-speaker ceilings.
## Override with BACKEND_TEST_VOICE_AUDIO_{1,2,3}_FILE for offline runs.
VOICE_AUDIO_1_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example1.wav
VOICE_AUDIO_2_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example2.flac
VOICE_AUDIO_3_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example5.wav
## ECAPA-TDNN via SpeechBrain — default CI configuration. Auto-downloads
## the checkpoint from HuggingFace on first LoadModel (bundled in the
## backend image pip install). 192-d embeddings, cosine-distance based.
## The e2e suite drives LoadModel directly so we don't rely on LocalAI's
## gallery flow here.
test-extra-backend-speaker-recognition-ecapa: docker-build-speaker-recognition
BACKEND_IMAGE=local-ai-backend:speaker-recognition \
BACKEND_TEST_MODEL_NAME=speechbrain/spkrec-ecapa-voxceleb \
BACKEND_TEST_OPTIONS=engine:speechbrain,source:speechbrain/spkrec-ecapa-voxceleb \
BACKEND_TEST_CAPS=health,load,voice_embed,voice_verify \
BACKEND_TEST_VOICE_AUDIO_1_URL=$(VOICE_AUDIO_1_URL) \
BACKEND_TEST_VOICE_AUDIO_2_URL=$(VOICE_AUDIO_2_URL) \
BACKEND_TEST_VOICE_AUDIO_3_URL=$(VOICE_AUDIO_3_URL) \
BACKEND_TEST_VOICE_VERIFY_DISTANCE_CEILING=0.4 \
$(MAKE) test-extra-backend
## Aggregate — today there's only one voice config; the target exists
## so the CI workflow matches the insightface-all naming convention and
## can grow to include WeSpeaker / 3D-Speaker later.
test-extra-backend-speaker-recognition-all: \
test-extra-backend-speaker-recognition-ecapa
## Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked
## LLM. Extracts the sherpa-onnx Docker image rootfs, downloads the three
## gallery-referenced model bundles (silero-vad, omnilingual-asr, vits-ljs),
## writes the corresponding model config YAMLs, and runs the realtime
## websocket spec in tests/e2e with REALTIME_* env vars wiring the sherpa
## slots into the pipeline. The LLM slot stays on the in-repo mock-backend
## registered unconditionally by tests/e2e/e2e_suite_test.go. See
## tests/e2e/run-realtime-sherpa.sh for the full orchestration.
test-extra-e2e-realtime-sherpa: build-mock-backend docker-build-sherpa-onnx protogen-go react-ui
bash tests/e2e/run-realtime-sherpa.sh
## Streaming ASR via the sherpa-onnx online recognizer. Uses the streaming
## zipformer English model (encoder/decoder/joiner int8 + tokens) from the
## sherpa-onnx gallery entry. Drives both AudioTranscription and
## AudioTranscriptionStream via the e2e-backends gRPC harness; streaming
## emits real partial deltas during decode. Each file is renamed on download
## to the shape sherpa-onnx's online loader expects (encoder.int8.onnx etc.).
test-extra-backend-sherpa-onnx-transcription: docker-build-sherpa-onnx
BACKEND_IMAGE=local-ai-backend:sherpa-onnx \
BACKEND_TEST_MODEL_URL='https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/encoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx#encoder.int8.onnx' \
BACKEND_TEST_EXTRA_FILES='https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/decoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx#decoder.int8.onnx|https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/joiner-epoch-99-avg-1-chunk-16-left-128.int8.onnx#joiner.int8.onnx|https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/tokens.txt' \
BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
BACKEND_TEST_CAPS=health,load,transcription \
BACKEND_TEST_OPTIONS=subtype=online \
$(MAKE) test-extra-backend
## VITS TTS via the sherpa-onnx backend. Pulls the individual files from
## HuggingFace (the vits-ljs release tarball lives on the k2-fsa GitHub
## but is also mirrored as discrete files on HF). Exercises both
## TTS (write-to-file) and TTSStream (PCM chunks + WAV header) via the
## e2e-backends gRPC harness.
test-extra-backend-sherpa-onnx-tts: docker-build-sherpa-onnx
BACKEND_IMAGE=local-ai-backend:sherpa-onnx \
BACKEND_TEST_MODEL_URL='https://huggingface.co/csukuangfj/vits-ljs/resolve/main/vits-ljs.onnx#vits-ljs.onnx' \
BACKEND_TEST_EXTRA_FILES='https://huggingface.co/csukuangfj/vits-ljs/resolve/main/tokens.txt|https://huggingface.co/csukuangfj/vits-ljs/resolve/main/lexicon.txt' \
BACKEND_TEST_CAPS=health,load,tts \
$(MAKE) test-extra-backend
## VibeVoice TTS via the vibevoice-cpp backend. ModelFile is the
## realtime gguf; the supplementary tokenizer + voice prompt land
## alongside it under the harness's models dir and are wired through
## via the standard Options[] convention (tokenizer=, voice=).
test-extra-backend-vibevoice-cpp-tts: docker-build-vibevoice-cpp
BACKEND_IMAGE=local-ai-backend:vibevoice-cpp \
BACKEND_TEST_MODEL_URL='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/vibevoice-realtime-0.5B-q8_0.gguf#vibevoice-realtime-0.5B-q8_0.gguf' \
BACKEND_TEST_EXTRA_FILES='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/tokenizer.gguf#tokenizer.gguf|https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/voice-en-Carter_man.gguf#voice-en-Carter_man.gguf' \
BACKEND_TEST_OPTIONS=tokenizer:tokenizer.gguf,voice:voice-en-Carter_man.gguf \
BACKEND_TEST_CAPS=health,load,tts \
$(MAKE) test-extra-backend
## VibeVoice ASR (long-form, with diarization). type=asr tells the
## backend's Load() to slot ModelFile into the asr_model role; the
## tokenizer is supplied via Options[]. Uses the Q4_K quant (~10 GB)
## rather than Q8_0 (~14 GB) so the bundle fits inside ubuntu-latest's
## post-image disk budget.
test-extra-backend-vibevoice-cpp-transcription: docker-build-vibevoice-cpp
BACKEND_IMAGE=local-ai-backend:vibevoice-cpp \
BACKEND_TEST_MODEL_URL='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/vibevoice-asr-q4_k.gguf#vibevoice-asr-q4_k.gguf' \
BACKEND_TEST_EXTRA_FILES='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/tokenizer.gguf#tokenizer.gguf' \
BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
BACKEND_TEST_OPTIONS=type:asr,tokenizer:tokenizer.gguf \
BACKEND_TEST_CAPS=health,load,transcription \
$(MAKE) test-extra-backend
## sglang mirrors the vllm setup: HuggingFace model id, same tiny Qwen,
## tool-call extraction via sglang's native qwen parser. CPU builds use
## sglang's upstream pyproject_cpu.toml recipe (see backend/python/sglang/install.sh).
@@ -645,6 +916,8 @@ docker:
--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
--build-arg APT_MIRROR=$(APT_MIRROR) \
--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
-t $(DOCKER_IMAGE) .
docker-cuda12:
@@ -658,11 +931,13 @@ docker-cuda12:
--build-arg BUILD_TYPE=$(BUILD_TYPE) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
--build-arg APT_MIRROR=$(APT_MIRROR) \
--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
-t $(DOCKER_IMAGE)-cuda-12 .
docker-image-intel:
docker build \
--build-arg BASE_IMAGE=intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04 \
--build-arg BASE_IMAGE=intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04 \
--build-arg IMAGE_TYPE=$(IMAGE_TYPE) \
--build-arg GO_TAGS="$(GO_TAGS)" \
--build-arg MAKEFLAGS="$(DOCKER_MAKEFLAGS)" \
@@ -671,6 +946,8 @@ docker-image-intel:
--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
--build-arg APT_MIRROR=$(APT_MIRROR) \
--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
-t $(DOCKER_IMAGE) .
########################################################
@@ -739,7 +1016,9 @@ BACKEND_WHISPER = whisper|golang|.|false|true
BACKEND_VOXTRAL = voxtral|golang|.|false|true
BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
BACKEND_VIBEVOICE_CPP = vibevoice-cpp|golang|.|false|true
BACKEND_OPUS = opus|golang|.|false|true
BACKEND_SHERPA_ONNX = sherpa-onnx|golang|.|false|true
# Python backends with root context
BACKEND_RERANKERS = rerankers|python|.|false|true
@@ -748,6 +1027,8 @@ BACKEND_OUTETTS = outetts|python|.|false|true
BACKEND_FASTER_WHISPER = faster-whisper|python|.|false|true
BACKEND_COQUI = coqui|python|.|false|true
BACKEND_RFDETR = rfdetr|python|.|false|true
BACKEND_INSIGHTFACE = insightface|python|.|false|true
BACKEND_SPEAKER_RECOGNITION = speaker-recognition|python|.|false|true
BACKEND_KITTEN_TTS = kitten-tts|python|.|false|true
BACKEND_NEUTTS = neutts|python|.|false|true
BACKEND_KOKORO = kokoro|python|.|false|true
@@ -790,7 +1071,10 @@ define docker-build-backend
--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
--build-arg APT_MIRROR=$(APT_MIRROR) \
--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
$(if $(FROM_SOURCE),--build-arg FROM_SOURCE=$(FROM_SOURCE)) \
$(if $(AMDGPU_TARGETS),--build-arg AMDGPU_TARGETS=$(AMDGPU_TARGETS)) \
$(if $(filter true,$(5)),--build-arg BACKEND=$(1)) \
-t local-ai-backend:$(1) -f backend/Dockerfile.$(2) $(3)
endef
@@ -819,6 +1103,8 @@ $(eval $(call generate-docker-build-target,$(BACKEND_OUTETTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_FASTER_WHISPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_COQUI)))
$(eval $(call generate-docker-build-target,$(BACKEND_RFDETR)))
$(eval $(call generate-docker-build-target,$(BACKEND_INSIGHTFACE)))
$(eval $(call generate-docker-build-target,$(BACKEND_SPEAKER_RECOGNITION)))
$(eval $(call generate-docker-build-target,$(BACKEND_KITTEN_TTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_NEUTTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_KOKORO)))
@@ -840,6 +1126,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_WHISPERX)))
$(eval $(call generate-docker-build-target,$(BACKEND_ACE_STEP)))
$(eval $(call generate-docker-build-target,$(BACKEND_ACESTEP_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_QWEN3_TTS_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX_VLM)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX_DISTRIBUTED)))
@@ -848,12 +1135,13 @@ $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_QUANTIZATION)))
$(eval $(call generate-docker-build-target,$(BACKEND_TINYGRAD)))
$(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
$(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
# Pattern rule for docker-save targets
docker-save-%: backend-images
docker save local-ai-backend:$* -o backend-images/$*.tar
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
########################################################
### Mock Backend for E2E Tests
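
The `insightface-opencv-models`, `insightface-antispoof-models` and `insightface-buffalo-sc-models` targets in the Makefile hunk above all repeat one pattern: hash the cached file, download only on a mismatch, then verify the fresh download. A minimal sketch of that pattern as a standalone shell function follows. The name `fetch_verified` and its argument order are hypothetical and do not appear in the repository; the sketch assumes GNU `sha256sum`, `awk` and `curl` are on the PATH.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Makefile): download URL to DEST
# unless DEST already has the expected SHA-256 digest.
fetch_verified() {
    url=$1 dest=$2 want=$3

    # Hash whatever is cached; a missing file yields an empty string.
    have=$(sha256sum "$dest" 2>/dev/null | awk '{print $1}')
    if [ "$have" = "$want" ]; then
        echo "cached: $dest"
        return 0
    fi

    mkdir -p "$(dirname "$dest")"
    echo "fetching: $dest"
    curl -fsSL -o "$dest" "$url" || return 1

    # "<digest><two spaces><path>" is the canonical sha256sum check-line
    # format; sha256sum -c exits non-zero when the digest does not match.
    echo "$want  $dest" | sha256sum -c - >/dev/null
}

# Example, reusing the values from the Makefile variables:
# fetch_verified "$INSIGHTFACE_OPENCV_YUNET_URL" \
#     /tmp/localai-insightface-opencv-cache/yunet.onnx \
#     "$INSIGHTFACE_OPENCV_YUNET_SHA"
```

Because the function returns the exit status of `sha256sum -c`, a corrupted or substituted download makes the caller fail, which matches what the Makefile recipes do when the check line is the last command before `fi`.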

@@ -38,7 +38,7 @@
- **Built-in AI agents** — autonomous agents with tool use, RAG, MCP, and skills
- **Privacy-first** — your data never leaves your infrastructure
Created and maintained by [Ettore Di Giacinto](https://github.com/mudler).
Created by [Ettore Di Giacinto](https://github.com/mudler) and maintained by the [LocalAI team](#team).
> [:book: Documentation](https://localai.io/) | [:speech_balloon: Discord](https://discord.gg/uJAeKSAGDy) | [💻 Quickstart](https://localai.io/basics/getting_started/) | [🖼️ Models](https://models.localai.io/) | [❓FAQ](https://localai.io/faq/)
@@ -149,6 +149,7 @@ For more details, see the [Getting Started guide](https://localai.io/basics/gett
## Latest News
- **April 2026**: [Voice recognition](https://github.com/mudler/LocalAI/pull/9500), [Face recognition, identification & liveness detection](https://github.com/mudler/LocalAI/pull/9480), [Ollama API compatibility](https://github.com/mudler/LocalAI/pull/9284), [Video generation in stable-diffusion.ggml](https://github.com/mudler/LocalAI/pull/9420), [Backend versioning with auto-upgrade](https://github.com/mudler/LocalAI/pull/9315), [Pin models & load-on-demand toggle](https://github.com/mudler/LocalAI/pull/9309), [Universal model importer](https://github.com/mudler/LocalAI/pull/9466), new backends: [sglang](https://github.com/mudler/LocalAI/pull/9359), [ik-llama-cpp](https://github.com/mudler/LocalAI/pull/9326), [TurboQuant](https://github.com/mudler/LocalAI/pull/9355), [sam.cpp](https://github.com/mudler/LocalAI/pull/9288), [Kokoros](https://github.com/mudler/LocalAI/pull/9212), [qwen3tts.cpp](https://github.com/mudler/LocalAI/pull/9316), [tinygrad multimodal](https://github.com/mudler/LocalAI/pull/9364)
- **March 2026**: [Agent management](https://github.com/mudler/LocalAI/pull/8820), [New React UI](https://github.com/mudler/LocalAI/pull/8772), [WebRTC](https://github.com/mudler/LocalAI/pull/8790), [MLX-distributed via P2P and RDMA](https://github.com/mudler/LocalAI/pull/8801), [MCP Apps, MCP Client-side](https://github.com/mudler/LocalAI/pull/8947)
- **February 2026**: [Realtime API for audio-to-audio with tool calling](https://github.com/mudler/LocalAI/pull/6245), [ACE-Step 1.5 support](https://github.com/mudler/LocalAI/pull/8396)
- **January 2026**: **LocalAI 3.10.0** — Anthropic API support, Open Responses API, video & image generation (LTX-2), unified GPU backends, tool streaming, Moonshine, Pocket-TTS. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v3.10.0)
@@ -200,13 +201,14 @@ See the full [Backend & Model Compatibility Table](https://localai.io/model-comp
- [Media & blog posts](https://localai.io/basics/news/#media-blogs-social)
- [Examples](https://github.com/mudler/LocalAI-examples)
## Autonomous Development Team
## Team
LocalAI is helped being maintained by a team of autonomous AI agents led by an AI Scrum Master.
LocalAI is maintained by a small team of humans, together with the wider community of contributors.
- **Live Reports**: [reports.localai.io](http://reports.localai.io)
- **Project Board**: [Agent task tracking](https://github.com/users/mudler/projects/6)
- **Blog Post**: [Learn about the experiment](https://mudler.pm/posts/2026/02/28/a-call-to-open-source-maintainers-stop-babysitting-ai-how-i-built-a-100-local-autonomous-dev-team-to-maintain-localai-and-why-you-should-too/)
- **[Ettore Di Giacinto](https://github.com/mudler)** — original author and project lead
- **[Richard Palethorpe](https://github.com/richiejp)** — maintainer
A huge thank you to everyone who contributes code, reviews PRs, files issues, and helps users in [Discord](https://discord.gg/uJAeKSAGDy) — LocalAI is a community-driven project and wouldn't exist without you. See the full [contributors list](https://github.com/mudler/LocalAI/graphs/contributors).
## Citation
@@ -249,7 +251,7 @@ A special thanks to individual sponsors, a full list is on [GitHub](https://gith
## License
LocalAI is a community-driven project created by [Ettore Di Giacinto](https://github.com/mudler/).
LocalAI is a community-driven project created by [Ettore Di Giacinto](https://github.com/mudler/) and maintained by the [LocalAI team](#team).
MIT - Author Ettore Di Giacinto <mudler@localai.io>

@@ -1,4 +1,6 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
FROM ${BASE_IMAGE} AS builder
ARG BACKEND=rerankers
@@ -14,8 +16,14 @@ ARG TARGETARCH
ARG TARGETVARIANT
ARG GO_VERSION=1.25.4
ARG UBUNTU_VERSION=2404
ARG AMDGPU_TARGETS
ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
RUN apt-get update && \
RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
git ccache \
@@ -147,6 +155,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
hipblas-dev \
hipblaslt-dev \
rocblas-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
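
The Dockerfile hunks above bind-mount `.docker/apt-mirror.sh` and run it with `APT_MIRROR` and `APT_PORTS_MIRROR` before `apt-get update`. That script's contents are not part of this diff, so the following is only a sketch of what such a step could look like, under stated assumptions: the function name `rewrite_apt_sources` is hypothetical, the mirror values are taken to be a scheme plus host (as in the commit message), and the default Ubuntu hosts are assumed to appear as `http://archive.ubuntu.com`, `http://security.ubuntu.com` and `http://ports.ubuntu.com` in the source files. It relies on GNU `sed -i`.

```shell
#!/bin/sh
# Hypothetical sketch only: the real .docker/apt-mirror.sh is not shown here.
# Rewrites the default Ubuntu hosts in the given apt source files.
# With both mirror arguments empty it changes nothing, which matches the
# opt-in default (ARG APT_MIRROR="" / ARG APT_PORTS_MIRROR="").
rewrite_apt_sources() {
    mirror=$1 ports_mirror=$2
    shift 2

    for f in "$@"; do
        [ -f "$f" ] || continue
        if [ -n "$mirror" ]; then
            # Assumption: the mirror also serves the security pocket.
            sed -i \
                -e "s|http://archive.ubuntu.com|$mirror|g" \
                -e "s|http://security.ubuntu.com|$mirror|g" "$f"
        fi
        if [ -n "$ports_mirror" ]; then
            sed -i -e "s|http://ports.ubuntu.com|$ports_mirror|g" "$f"
        fi
    done
}

# Example (file paths are the usual Ubuntu 24.04 locations, assumed here):
# rewrite_apt_sources "$APT_MIRROR" "$APT_PORTS_MIRROR" \
#     /etc/apt/sources.list /etc/apt/sources.list.d/ubuntu.sources
```

The sketch uses `|` as the `sed` delimiter so that slashes in the mirror URL need no escaping; a mirror value containing `|` or `&` would need extra quoting.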

@@ -1,5 +1,7 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
# The grpc target does one thing: it builds and installs GRPC. This is in its own layer so that it can be effectively cached by CI.
@@ -12,12 +14,16 @@ ARG GRPC_VERSION=v1.65.0
ARG CMAKE_FROM_SOURCE=false
# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
ARG CMAKE_VERSION=3.31.10
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
ENV MAKEFLAGS=${GRPC_MAKEFLAGS}
WORKDIR /build
RUN apt-get update && \
RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
apt-get update && \
apt-get install -y --no-install-recommends \
ca-certificates \
build-essential curl libssl-dev \
@@ -71,8 +77,12 @@ ARG TARGETARCH
ARG TARGETVARIANT
ARG GO_VERSION=1.25.4
ARG UBUNTU_VERSION=2404
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
RUN apt-get update && \
RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
ccache git \
@@ -204,6 +214,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
hipblas-dev \
hipblaslt-dev \
rocblas-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \

@@ -1,5 +1,7 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
# The grpc target does one thing: it builds and installs GRPC. This is in its own layer so that it can be effectively cached by CI.
@@ -12,12 +14,16 @@ ARG GRPC_VERSION=v1.65.0
ARG CMAKE_FROM_SOURCE=false
# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
ARG CMAKE_VERSION=3.31.10
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
ENV MAKEFLAGS=${GRPC_MAKEFLAGS}
WORKDIR /build
RUN apt-get update && \
RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
apt-get update && \
apt-get install -y --no-install-recommends \
ca-certificates \
build-essential curl libssl-dev \
@@ -73,8 +79,12 @@ ARG TARGETARCH
ARG TARGETVARIANT
ARG GO_VERSION=1.25.4
ARG UBUNTU_VERSION=2404
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
RUN apt-get update && \
RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
ccache git \
@@ -206,6 +216,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
hipblas-dev \
hipblaslt-dev \
rocblas-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \

@@ -1,4 +1,6 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
FROM ${BASE_IMAGE} AS builder
ARG BACKEND=rerankers
@@ -13,8 +15,12 @@ ENV DEBIAN_FRONTEND=noninteractive
ARG TARGETARCH
ARG TARGETVARIANT
ARG UBUNTU_VERSION=2404
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
RUN apt-get update && \
RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
ccache \
@@ -162,6 +168,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
hipblas-dev \
hipblaslt-dev \
rocblas-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
@@ -202,6 +209,13 @@ COPY scripts/build/package-gpu-libs.sh /package-gpu-libs.sh
ARG FROM_SOURCE=""
ENV FROM_SOURCE=${FROM_SOURCE}
# Cache-buster for the per-backend `make` step. Most Python backends list
# unpinned deps (torch, transformers, vllm, ...), so a warm registry cache
# would otherwise freeze upstream versions indefinitely. CI passes a value
# that rolls weekly so the install layer is rebuilt at most once per week
# and picks up newer wheels from PyPI / nightly indexes.
ARG DEPS_REFRESH=initial
RUN cd /${BACKEND} && PORTABLE_PYTHON=true make
# Package GPU libraries into the backend's lib directory
@@ -216,4 +230,4 @@ RUN if [ -f "/${BACKEND}/package.sh" ]; then \
FROM scratch
ARG BACKEND=rerankers
COPY --from=builder /${BACKEND}/ /
COPY --from=builder /${BACKEND}/ /
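
The `DEPS_REFRESH` comment in the hunk above says CI passes a value that rolls weekly, so the per-backend `make` layer is rebuilt at most once per week. How CI actually computes that value is not shown in this diff. One possible way, sketched here with a hypothetical helper name, is to use the ISO year and week number in UTC.

```shell
#!/bin/sh
# Hypothetical helper: print a cache-buster value that changes once per
# ISO week, e.g. for use as --build-arg DEPS_REFRESH=<value>.
deps_refresh_key() {
    # %G = ISO week-numbering year, %V = ISO week number (01..53).
    # -u pins the result to UTC so all runners agree on the week boundary.
    date -u +%G-W%V
}

# Example (remaining build arguments elided; they are the ones already
# passed by the docker-build-backend macro in the Makefile):
# docker build --build-arg DEPS_REFRESH="$(deps_refresh_key)" ...
```

Using `%G` rather than `%Y` keeps the year and week number consistent around New Year, when the ISO week can belong to the adjacent calendar year.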

@@ -1,12 +1,18 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
FROM ${BASE_IMAGE} AS builder
ARG BACKEND=kokoros
ENV DEBIAN_FRONTEND=noninteractive
ARG TARGETARCH
ARG TARGETVARIANT
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
RUN apt-get update && \
RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
git ccache \

View File

@@ -1,5 +1,7 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
# The grpc target does one thing: it builds and installs GRPC. This is in its own layer so that it can be effectively cached by CI.
@@ -12,12 +14,16 @@ ARG GRPC_VERSION=v1.65.0
ARG CMAKE_FROM_SOURCE=false
# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
ARG CMAKE_VERSION=3.31.10
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
ENV MAKEFLAGS=${GRPC_MAKEFLAGS}
WORKDIR /build
RUN apt-get update && \
RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
apt-get update && \
apt-get install -y --no-install-recommends \
ca-certificates \
build-essential curl libssl-dev \
@@ -71,8 +77,12 @@ ARG TARGETARCH
ARG TARGETVARIANT
ARG GO_VERSION=1.25.4
ARG UBUNTU_VERSION=2404
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
RUN apt-get update && \
RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
ccache git \
@@ -204,6 +214,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
hipblas-dev \
hipblaslt-dev \
rocblas-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \

View File

@@ -24,6 +24,11 @@ service Backend {
rpc TokenizeString(PredictOptions) returns (TokenizationResponse) {}
rpc Status(HealthMessage) returns (StatusResponse) {}
rpc Detect(DetectOptions) returns (DetectResponse) {}
rpc FaceVerify(FaceVerifyRequest) returns (FaceVerifyResponse) {}
rpc FaceAnalyze(FaceAnalyzeRequest) returns (FaceAnalyzeResponse) {}
rpc VoiceVerify(VoiceVerifyRequest) returns (VoiceVerifyResponse) {}
rpc VoiceAnalyze(VoiceAnalyzeRequest) returns (VoiceAnalyzeResponse) {}
rpc VoiceEmbed(VoiceEmbedRequest) returns (VoiceEmbedResponse) {}
rpc StoresSet(StoresSetOptions) returns (Result) {}
rpc StoresDelete(StoresDeleteOptions) returns (Result) {}
@@ -305,6 +310,11 @@ message ModelOptions {
bool Reranking = 71;
repeated string Overrides = 72;
// EngineArgs carries a JSON-encoded map of backend-native engine arguments
// applied verbatim to the backend's engine constructor (e.g. vLLM AsyncEngineArgs).
// Unknown keys produce an error at LoadModel time.
string EngineArgs = 73;
}
message Result {
@@ -475,6 +485,112 @@ message DetectResponse {
repeated Detection Detections = 1;
}
// --- Face recognition messages ---
message FacialArea {
float x = 1;
float y = 2;
float w = 3;
float h = 4;
}
message FaceVerifyRequest {
string img1 = 1; // base64-encoded image
string img2 = 2; // base64-encoded image
float threshold = 3; // cosine-distance threshold; 0 = use backend default
bool anti_spoofing = 4; // run MiniFASNet liveness on each image; failed liveness forces verified=false
}
message FaceVerifyResponse {
bool verified = 1;
float distance = 2; // 1 - cosine_similarity
float threshold = 3;
float confidence = 4; // 0-100
string model = 5; // e.g. "buffalo_l"
FacialArea img1_area = 6;
FacialArea img2_area = 7;
float processing_time_ms = 8;
bool img1_is_real = 9; // anti-spoofing result when enabled
float img1_antispoof_score = 10;
bool img2_is_real = 11;
float img2_antispoof_score = 12;
}
message FaceAnalyzeRequest {
string img = 1; // base64-encoded image
repeated string actions = 2; // subset of ["age","gender","emotion","race"]; empty = all-supported
bool anti_spoofing = 3;
}
message FaceAnalysis {
FacialArea region = 1;
float face_confidence = 2;
float age = 3;
string dominant_gender = 4; // "Man" | "Woman"
map<string, float> gender = 5;
string dominant_emotion = 6; // reserved; empty in MVP
map<string, float> emotion = 7;
string dominant_race = 8; // not populated
map<string, float> race = 9;
bool is_real = 10; // anti-spoofing result when enabled
float antispoof_score = 11;
}
message FaceAnalyzeResponse {
repeated FaceAnalysis faces = 1;
}
// --- Voice (speaker) recognition messages ---
//
// Analogous to the Face* messages above, but for speaker biometrics.
// Audio fields accept a filesystem path (same convention as
// TranscriptRequest.dst). The HTTP layer materialises base64 / URL /
// data-URI inputs to a temp file before calling the gRPC backend.
message VoiceVerifyRequest {
string audio1 = 1; // path to first audio clip
string audio2 = 2; // path to second audio clip
float threshold = 3; // cosine-distance threshold; 0 = use backend default
bool anti_spoofing = 4; // reserved for future AASIST bolt-on
}
message VoiceVerifyResponse {
bool verified = 1;
float distance = 2; // 1 - cosine_similarity
float threshold = 3;
float confidence = 4; // 0-100
string model = 5; // e.g. "speechbrain/spkrec-ecapa-voxceleb"
float processing_time_ms = 6;
}
message VoiceAnalyzeRequest {
string audio = 1; // path to audio clip
repeated string actions = 2; // subset of ["age","gender","emotion"]; empty = all-supported
}
message VoiceAnalysis {
float start = 1; // segment start time in seconds (0 if single-utterance)
float end = 2; // segment end time in seconds
float age = 3;
string dominant_gender = 4;
map<string, float> gender = 5;
string dominant_emotion = 6;
map<string, float> emotion = 7;
}
message VoiceAnalyzeResponse {
repeated VoiceAnalysis segments = 1;
}
message VoiceEmbedRequest {
string audio = 1; // path to audio clip
}
message VoiceEmbedResponse {
repeated float embedding = 1;
string model = 2;
}
message ToolFormatMarkers {
string format_type = 1; // "json_native", "tag_with_json", "tag_with_tagged"
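
Note on the `distance` / `threshold` fields: both `FaceVerifyResponse` and `VoiceVerifyResponse` document `distance` as `1 - cosine_similarity`, and a request `threshold` of 0 as "use backend default". The sketch below shows how those two conventions could combine. The backends' actual comparison is not visible in this diff, so the `<=` comparison, the error handling, and all names here are assumptions:

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// cosineDistance returns 1 - cosine_similarity for two embeddings.
func cosineDistance(a, b []float32) (float64, error) {
	if len(a) == 0 || len(a) != len(b) {
		return 0, fmt.Errorf("embeddings must be non-empty and of equal length (got %d and %d)", len(a), len(b))
	}
	var dot, normA, normB float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		normA += x * x
		normB += y * y
	}
	if normA == 0 || normB == 0 {
		return 0, errors.New("zero-norm embedding")
	}
	return 1 - dot/(math.Sqrt(normA)*math.Sqrt(normB)), nil
}

// verify applies the "0 means backend default" convention from the proto
// comments. backendDefault is a hypothetical per-model value.
func verify(a, b []float32, threshold, backendDefault float64) (bool, float64, error) {
	if threshold == 0 {
		threshold = backendDefault
	}
	d, err := cosineDistance(a, b)
	if err != nil {
		return false, 0, err
	}
	// Assumption: a pair is verified when the distance does not exceed the threshold.
	return d <= threshold, d, nil
}

func main() {
	ok, d, err := verify([]float32{1, 2, 3}, []float32{1, 2, 3}, 0, 0.4)
	fmt.Println(ok, d, err)
}
```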

View File

@@ -1,5 +1,5 @@
IK_LLAMA_VERSION?=8befd92ea5f702494ea9813fe42a52fb015db5fe
IK_LLAMA_VERSION?=a8aecbf15933295af96504f9a693998322185b5c
LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
CMAKE_ARGS?=

View File

@@ -326,7 +326,7 @@ struct llama_client_slot
char buffer[512];
double t_token = t_prompt_processing / num_prompt_tokens_processed;
double n_tokens_second = 1e3 / t_prompt_processing * num_prompt_tokens_processed;
sprintf(buffer, "prompt eval time = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second)",
snprintf(buffer, sizeof(buffer), "prompt eval time = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second)",
t_prompt_processing, num_prompt_tokens_processed,
t_token, n_tokens_second);
LOG_INFO(buffer, {
@@ -340,7 +340,7 @@ struct llama_client_slot
t_token = t_token_generation / n_decoded;
n_tokens_second = 1e3 / t_token_generation * n_decoded;
sprintf(buffer, "generation eval time = %10.2f ms / %5d runs (%8.2f ms per token, %8.2f tokens per second)",
snprintf(buffer, sizeof(buffer), "generation eval time = %10.2f ms / %5d runs (%8.2f ms per token, %8.2f tokens per second)",
t_token_generation, n_decoded,
t_token, n_tokens_second);
LOG_INFO(buffer, {
@@ -352,7 +352,7 @@ struct llama_client_slot
{"n_tokens_second", n_tokens_second},
});
sprintf(buffer, " total time = %10.2f ms", t_prompt_processing + t_token_generation);
snprintf(buffer, sizeof(buffer), " total time = %10.2f ms", t_prompt_processing + t_token_generation);
LOG_INFO(buffer, {
{"slot_id", id},
{"task_id", task_id},
@@ -686,7 +686,16 @@ struct llama_server_context
slot->sparams.mirostat_eta = json_value(data, "mirostat_eta", default_sparams.mirostat_eta);
slot->params.n_keep = json_value(data, "n_keep", slot->params.n_keep);
slot->sparams.seed = json_value(data, "seed", default_sparams.seed);
slot->sparams.grammar = json_value(data, "grammar", default_sparams.grammar);
{
// upstream changed common_params_sampling::grammar from std::string to
// the common_grammar struct (type + grammar). The incoming JSON still
// carries a plain string, so build the user-provided grammar here and
// fall back to the server default when the request omits it.
std::string grammar_str = json_value(data, "grammar", std::string());
slot->sparams.grammar = grammar_str.empty()
? default_sparams.grammar
: common_grammar{COMMON_GRAMMAR_TYPE_USER, std::move(grammar_str)};
}
slot->sparams.n_probs = json_value(data, "n_probs", default_sparams.n_probs);
slot->sparams.min_keep = json_value(data, "min_keep", default_sparams.min_keep);
slot->sparams.grammar_triggers = grammar_triggers;
@@ -1232,7 +1241,7 @@ struct llama_server_context
// {"logit_bias", slot.sparams.logit_bias},
{"n_probs", slot.sparams.n_probs},
{"min_keep", slot.sparams.min_keep},
{"grammar", slot.sparams.grammar},
{"grammar", slot.sparams.grammar.grammar},
{"samplers", samplers}
};
}
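
Note on the grammar change: the new block in `llama_server_context` keeps accepting a plain string from the request JSON, wraps it in the typed struct when present, and otherwise falls back to the server default. The same pattern, sketched in Go for illustration only (the types below are stand-ins for `common_grammar` and `COMMON_GRAMMAR_TYPE_USER`, not bindings to them):

```go
package main

import "fmt"

type grammarType int

const (
	grammarDefault grammarType = iota // stand-in for whatever type the server default carries
	grammarUser                       // stand-in for COMMON_GRAMMAR_TYPE_USER
)

type grammar struct {
	Type grammarType
	Text string
}

// resolveGrammar returns the user-provided grammar when the request carries a
// non-empty "grammar" string, and the server default otherwise.
func resolveGrammar(request map[string]any, serverDefault grammar) grammar {
	text, _ := request["grammar"].(string) // missing key or non-string yields ""
	if text == "" {
		return serverDefault
	}
	return grammar{Type: grammarUser, Text: text}
}

func main() {
	def := grammar{Type: grammarDefault, Text: "root ::= [a-z]+"}
	fmt.Printf("%+v\n", resolveGrammar(map[string]any{}, def))
	fmt.Printf("%+v\n", resolveGrammar(map[string]any{"grammar": "root ::= \"yes\""}, def))
}
```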

View File

@@ -0,0 +1,11 @@
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -2494,7 +2494,7 @@
}
new_data = work.data();
- new_size = ggml_quantize_chunk(new_type, f32_data, new_data, 0, n_elms/cur->ne[0], cur->ne[0], nullptr);
+ new_size = ggml_quantize_chunk(new_type, f32_data, new_data, 0, n_elms/cur->ne[0], cur->ne[0], nullptr, nullptr);
} else {
new_type = cur->type;
new_data = cur->data;

View File

@@ -1,5 +1,5 @@
LLAMA_VERSION?=4f02d4733934179386cbc15b3454be26237940bb
LLAMA_VERSION?=beb42fffa45eded44804a1fd4916146222371581
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
CMAKE_ARGS?=
@@ -34,6 +34,9 @@ else ifeq ($(BUILD_TYPE),hipblas)
export CXX=$(ROCM_HOME)/llvm/bin/clang++
export CC=$(ROCM_HOME)/llvm/bin/clang
AMDGPU_TARGETS?=gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201
ifeq ($(strip $(AMDGPU_TARGETS)),)
$(error AMDGPU_TARGETS is empty; set it to a comma-separated list of gfx targets e.g. gfx1100,gfx1101)
endif
CMAKE_ARGS+=-DGGML_HIP=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=1

View File

@@ -10,6 +10,14 @@
#include "server-task.cpp"
#include "server-queue.cpp"
#include "server-common.cpp"
// server-chat.cpp exists only in llama.cpp after the upstream refactor that
// split OAI/Anthropic/Responses/transcription conversion helpers out of
// server-common.cpp. When present, server-context.cpp and server-task.cpp
// above call into it, so we must pull its definitions into this TU or the
// link fails. __has_include keeps the source compatible with older pins.
#if __has_include("server-chat.cpp")
#include "server-chat.cpp"
#endif
#include "server-context.cpp"
// LocalAI
@@ -434,7 +442,7 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// Draft model for speculative decoding
if (!request->draftmodel().empty()) {
params.speculative.mparams_dft.path = request->draftmodel();
params.speculative.draft.mparams.path = request->draftmodel();
// Default to draft type if a draft model is set but no explicit type
if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
@@ -634,6 +642,21 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
} else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
params.no_op_offload = false;
}
} else if (!strcmp(optname, "split_mode") || !strcmp(optname, "sm")) {
// Accepts: none | layer | row | tensor (the latter requires a llama.cpp build
// that includes ggml-org/llama.cpp#19378, FlashAttention enabled, and KV-cache
// quantization disabled).
if (optval != NULL) {
if (optval_str == "none") {
params.split_mode = LLAMA_SPLIT_MODE_NONE;
} else if (optval_str == "layer") {
params.split_mode = LLAMA_SPLIT_MODE_LAYER;
} else if (optval_str == "row") {
params.split_mode = LLAMA_SPLIT_MODE_ROW;
} else if (optval_str == "tensor") {
params.split_mode = LLAMA_SPLIT_MODE_TENSOR;
}
}
} else if (!strcmp(optname, "kv_unified") || !strcmp(optname, "unified_kv")) {
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
params.kv_unified = true;
@@ -656,39 +679,39 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
}
} else if (!strcmp(optname, "spec_n_max") || !strcmp(optname, "draft_max")) {
if (optval != NULL) {
try { params.speculative.n_max = std::stoi(optval_str); } catch (...) {}
try { params.speculative.draft.n_max = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_n_min") || !strcmp(optname, "draft_min")) {
if (optval != NULL) {
try { params.speculative.n_min = std::stoi(optval_str); } catch (...) {}
try { params.speculative.draft.n_min = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_p_min") || !strcmp(optname, "draft_p_min")) {
if (optval != NULL) {
try { params.speculative.p_min = std::stof(optval_str); } catch (...) {}
try { params.speculative.draft.p_min = std::stof(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_p_split")) {
if (optval != NULL) {
try { params.speculative.p_split = std::stof(optval_str); } catch (...) {}
try { params.speculative.draft.p_split = std::stof(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_size_n") || !strcmp(optname, "ngram_size_n")) {
if (optval != NULL) {
try { params.speculative.ngram_size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
try { params.speculative.ngram_simple.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_size_m") || !strcmp(optname, "ngram_size_m")) {
if (optval != NULL) {
try { params.speculative.ngram_size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
try { params.speculative.ngram_simple.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_min_hits") || !strcmp(optname, "ngram_min_hits")) {
if (optval != NULL) {
try { params.speculative.ngram_min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
try { params.speculative.ngram_simple.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "draft_gpu_layers")) {
if (optval != NULL) {
try { params.speculative.n_gpu_layers = std::stoi(optval_str); } catch (...) {}
try { params.speculative.draft.n_gpu_layers = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "draft_ctx_size")) {
if (optval != NULL) {
try { params.speculative.n_ctx = std::stoi(optval_str); } catch (...) {}
try { params.speculative.draft.n_ctx = std::stoi(optval_str); } catch (...) {}
}
}
}
@@ -910,8 +933,8 @@ public:
if (!params.mmproj.path.empty()) {
error_msg += " (with mmproj: " + params.mmproj.path + ")";
}
if (params.speculative.has_dft() && !params.speculative.mparams_dft.path.empty()) {
error_msg += " (with draft model: " + params.speculative.mparams_dft.path + ")";
if (params.speculative.has_dft() && !params.speculative.draft.mparams.path.empty()) {
error_msg += " (with draft model: " + params.speculative.draft.mparams.path + ")";
}
// Add captured error details if available
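
Note on `split_mode`: as written, the new branch in `params_parse` has no final `else`, so a value outside `none | layer | row | tensor` leaves `params.split_mode` at whatever it was before. A small Go sketch of that behaviour follows. The enum values are illustrative stand-ins for `LLAMA_SPLIT_MODE_*`, and the exact-match comparison assumes `optval_str` is not normalised earlier in the function, which this hunk does not show:

```go
package main

import "fmt"

type splitMode int

const (
	splitNone splitMode = iota
	splitLayer
	splitRow
	splitTensor
)

// applySplitMode mirrors the shape of the C++ branch: both option names are
// accepted, recognised values overwrite the current mode, and anything else
// is ignored so the current mode is kept.
func applySplitMode(current splitMode, optname, optval string) splitMode {
	if optname != "split_mode" && optname != "sm" {
		return current
	}
	switch optval {
	case "none":
		return splitNone
	case "layer":
		return splitLayer
	case "row":
		return splitRow
	case "tensor":
		return splitTensor
	}
	return current
}

func main() {
	mode := splitLayer
	mode = applySplitMode(mode, "sm", "row")
	fmt.Println(mode == splitRow)
	mode = applySplitMode(mode, "split_mode", "diagonal") // unknown value: unchanged
	fmt.Println(mode == splitRow)
}
```

If silently keeping the previous mode is not the intent, a warning in the fall-through case would make typos in model configs easier to spot.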

View File

@@ -1,7 +1,7 @@
# Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
# Auto-bumped nightly by .github/workflows/bump_deps.yaml.
TURBOQUANT_VERSION?=627ebbc6e27727bd4f65422d8aa60b13404993c8
TURBOQUANT_VERSION?=11a241d0db78a68e0a5b99fe6f36de6683100f6a
LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant
CMAKE_ARGS?=

View File

@@ -1,6 +1,6 @@
#!/bin/bash
# Patch the shared backend/cpp/llama-cpp/grpc-server.cpp *copy* used by the
# turboquant build to account for two gaps between upstream and the fork:
# turboquant build to account for the gaps between upstream and the fork:
#
# 1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
# fork-specific `turbo2` / `turbo3` / `turbo4` cache types.
@@ -11,6 +11,14 @@
# "<__media__>", and Go-side tooling falls back to that sentinel when the
# backend does not expose media_marker, so substituting the literal keeps
# behavior identical on the turboquant path.
# 3. Revert the `common_params_speculative` field references to the
# pre-refactor flat layout. Upstream ggml-org/llama.cpp#22397 split the
# struct into nested `draft` / `ngram_simple` / `ngram_mod` / etc. members;
# the turboquant fork branched before that PR and still exposes the flat
# `n_max`, `mparams_dft`, `ngram_size_n`, ... fields. The substitutions
# below map the new nested paths back to the legacy flat names so the
# shared grpc-server.cpp keeps compiling against the fork's common.h.
# Drop this block once the fork rebases past #22397.
#
# We patch the *copy* sitting in turboquant-<flavor>-build/, never the original
# under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps compiling
@@ -77,4 +85,27 @@ else
echo "==> $SRC has no get_media_marker() call, skipping media-marker patch"
fi
if grep -q 'params\.speculative\.draft\.\|params\.speculative\.ngram_simple\.' "$SRC"; then
echo "==> patching $SRC to revert common_params_speculative refs to pre-#22397 flat layout"
# Each substitution is the exact post-refactor path → legacy flat field.
# Order doesn't matter because the source paths are disjoint, but we keep
# the most-specific (mparams.path) first for readability.
sed -E \
-e 's/params\.speculative\.draft\.mparams\.path/params.speculative.mparams_dft.path/g' \
-e 's/params\.speculative\.draft\.n_max/params.speculative.n_max/g' \
-e 's/params\.speculative\.draft\.n_min/params.speculative.n_min/g' \
-e 's/params\.speculative\.draft\.p_min/params.speculative.p_min/g' \
-e 's/params\.speculative\.draft\.p_split/params.speculative.p_split/g' \
-e 's/params\.speculative\.draft\.n_gpu_layers/params.speculative.n_gpu_layers/g' \
-e 's/params\.speculative\.draft\.n_ctx/params.speculative.n_ctx/g' \
-e 's/params\.speculative\.ngram_simple\.size_n/params.speculative.ngram_size_n/g' \
-e 's/params\.speculative\.ngram_simple\.size_m/params.speculative.ngram_size_m/g' \
-e 's/params\.speculative\.ngram_simple\.min_hits/params.speculative.ngram_min_hits/g' \
"$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> speculative field rename OK"
else
echo "==> $SRC has no post-#22397 speculative field refs, skipping spec rename patch"
fi
echo "==> all patches applied"
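
Note on the rename block: the `sed` expressions map each post-refactor nested path back to its legacy flat name. To sanity-check the mapping outside the build, the same table can be expressed with `strings.NewReplacer`. This is only a sketch for checking the substitutions; the pairs are copied from the script above and the function name is made up:

```go
package main

import (
	"fmt"
	"strings"
)

// legacySpecFields holds the same (nested path, flat name) pairs as the sed
// expressions in the patch script.
var legacySpecFields = strings.NewReplacer(
	"params.speculative.draft.mparams.path", "params.speculative.mparams_dft.path",
	"params.speculative.draft.n_max", "params.speculative.n_max",
	"params.speculative.draft.n_min", "params.speculative.n_min",
	"params.speculative.draft.p_min", "params.speculative.p_min",
	"params.speculative.draft.p_split", "params.speculative.p_split",
	"params.speculative.draft.n_gpu_layers", "params.speculative.n_gpu_layers",
	"params.speculative.draft.n_ctx", "params.speculative.n_ctx",
	"params.speculative.ngram_simple.size_n", "params.speculative.ngram_size_n",
	"params.speculative.ngram_simple.size_m", "params.speculative.ngram_size_m",
	"params.speculative.ngram_simple.min_hits", "params.speculative.ngram_min_hits",
)

// toLegacyLayout rewrites post-refactor field references to the flat layout.
func toLegacyLayout(src string) string {
	return legacySpecFields.Replace(src)
}

func main() {
	line := "params.speculative.draft.mparams.path = request->draftmodel();"
	fmt.Println(toLegacyLayout(line))
}
```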

View File

@@ -4,7 +4,6 @@ package main
// It is meant to be used by the main executable that is the server for the specific backend type (falcon, gpt3, etc)
import (
"container/heap"
"errors"
"fmt"
"math"
"slices"
@@ -100,9 +99,16 @@ func sortIntoKeySlicese(keys []*pb.StoresKey) [][]float32 {
}
func (s *Store) Load(opts *pb.ModelOptions) error {
if opts.Model != "" {
return errors.New("not implemented")
}
// local-store is an in-memory vector store with no on-disk artefact to
// load — opts.Model is just a namespace identifier. The old `!= ""` guard
// rejected any non-empty model name with "not implemented", which broke
// callers that pass a namespace to isolate embedding spaces (face vs.
// voice biometrics both go through local-store but need distinct stores
// so ArcFace 512-D and ECAPA-TDNN 192-D don't collide). Namespace
// isolation is already handled upstream: ModelLoader spawns a fresh
// local-store process per (backend, model) tuple, so each namespace is
// its own Store{} instance. Nothing to do here beyond accepting the load.
_ = opts
return nil
}
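
Note on namespace isolation: the comment in `Load` relies on ModelLoader giving each (backend, model) tuple its own local-store process. The sketch below models that idea in a single process with a map keyed by the tuple, purely to illustrate why a 512-D face store and a 192-D voice store do not collide. The registry, the dimension check, and all names are hypothetical and not part of local-store:

```go
package main

import (
	"errors"
	"fmt"
)

// storeKey identifies one isolated store, mirroring the (backend, model) tuple.
type storeKey struct{ backend, model string }

// store is a toy vector store that fixes its dimension on first insert.
type store struct {
	dim     int
	vectors [][]float32
}

func (s *store) add(v []float32) error {
	if len(v) == 0 {
		return errors.New("empty vector")
	}
	if s.dim == 0 {
		s.dim = len(v)
	}
	if len(v) != s.dim {
		return fmt.Errorf("dimension mismatch: store holds %d-D vectors, got %d-D", s.dim, len(v))
	}
	s.vectors = append(s.vectors, v)
	return nil
}

// registry hands out one store per key, creating it on first use.
type registry struct{ stores map[storeKey]*store }

func newRegistry() *registry { return &registry{stores: map[storeKey]*store{}} }

func (r *registry) get(backend, model string) *store {
	k := storeKey{backend, model}
	if s, ok := r.stores[k]; ok {
		return s
	}
	s := &store{}
	r.stores[k] = s
	return s
}

func main() {
	r := newRegistry()
	fmt.Println(r.get("local-store", "faces").add(make([]float32, 512)))
	fmt.Println(r.get("local-store", "voices").add(make([]float32, 192)))
}
```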

View File

@@ -10,7 +10,7 @@ set(SAM3_BUILD_TESTS OFF CACHE BOOL "Disable sam3.cpp tests" FORCE)
add_subdirectory(./sources/sam3.cpp)
add_library(gosam3 MODULE gosam3.cpp)
add_library(gosam3 MODULE cpp/gosam3.cpp)
target_link_libraries(gosam3 PRIVATE sam3 ggml)
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)

View File

@@ -111,7 +111,7 @@ libgosam3-fallback.so: sources/sam3.cpp
SO_TARGET=libgosam3-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgosam3-custom
rm -rfv build*
libgosam3-custom: CMakeLists.txt gosam3.cpp gosam3.h
libgosam3-custom: CMakeLists.txt cpp/gosam3.cpp cpp/gosam3.h
mkdir -p build-$(SO_TARGET) && \
cd build-$(SO_TARGET) && \
cmake .. $(CMAKE_ARGS) && \

backend/go/sherpa-onnx/.gitignore vendored Normal file
View File

@@ -0,0 +1,11 @@
.cache/
sources/
build*/
package/
backend-assets/
sherpa-onnx
*.so
compile_commands.json
sherpa-onnx-whisper-*
vits-ljs/
streaming-zipformer-en/

View File

@@ -0,0 +1,120 @@
CURRENT_DIR=$(abspath ./)
GOCMD=go
ONNX_VERSION?=1.24.4
# v1.12.39 — includes upstream's onnxruntime 1.24.4 bump (#3501). Earlier
# pinned commits only support onnxruntime 1.23.2, which has no CUDA 13
# pre-built tarball, blocking the -gpu-nvidia-cuda-13 build matrix entry.
SHERPA_COMMIT?=7288d15e3e31a7bd589b2ba88828d521e7a6b140
ONNX_ARCH?=x64
ONNX_OS?=linux
ifneq (,$(findstring aarch64,$(shell uname -m)))
ONNX_ARCH=aarch64
endif
ifeq ($(OS),Darwin)
ONNX_OS=osx
ifneq (,$(findstring aarch64,$(shell uname -m)))
ONNX_ARCH=arm64
else ifneq (,$(findstring arm64,$(shell uname -m)))
ONNX_ARCH=arm64
else
ONNX_ARCH=x86_64
endif
endif
# Upstream onnxruntime ships CUDA 12 and CUDA 13 variants under different
# names: -gpu-<ver>.tgz for CUDA 12, -gpu_cuda13-<ver>.tgz for CUDA 13
# (note underscore vs dash). CUDA 13 tarballs only exist from 1.24.x onward.
ifeq ($(BUILD_TYPE),cublas)
SHERPA_GPU=ON
ONNX_PROVIDER=cuda
ifeq ($(CUDA_MAJOR_VERSION),13)
ONNX_VARIANT=-gpu_cuda13
else
ONNX_VARIANT=-gpu
endif
else
ONNX_VARIANT=
SHERPA_GPU=OFF
ONNX_PROVIDER=cpu
endif
JOBS?=$(shell nproc --ignore=1 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
sources/onnxruntime:
mkdir -p sources/onnxruntime
curl -L https://github.com/microsoft/onnxruntime/releases/download/v$(ONNX_VERSION)/onnxruntime-$(ONNX_OS)-$(ONNX_ARCH)$(ONNX_VARIANT)-$(ONNX_VERSION).tgz \
-o sources/onnxruntime/onnxruntime.tgz
cd sources/onnxruntime && tar -xf onnxruntime.tgz --strip-components=1 && rm onnxruntime.tgz
sources/sherpa-onnx: sources/onnxruntime
git clone https://github.com/k2-fsa/sherpa-onnx.git sources/sherpa-onnx
cd sources/sherpa-onnx && git checkout $(SHERPA_COMMIT)
mkdir -p sources/sherpa-onnx/build
# sherpa-onnx's cmake detects a pre-installed onnxruntime via the
# SHERPA_ONNXRUNTIME_{INCLUDE,LIB}_DIR env vars (not via -D flags).
# Point them at our locally-downloaded Microsoft tarball — without
# this, sherpa-onnx falls through to download_onnxruntime() which
# fetches from csukuangfj/onnxruntime-libs. For the GPU 1.24.4
# build that release mirror publishes `-patched.zip` instead of the
# expected `.tgz`, so the download 404s and the build fails.
cd sources/sherpa-onnx/build && \
SHERPA_ONNXRUNTIME_INCLUDE_DIR=$(CURRENT_DIR)/sources/onnxruntime/include \
SHERPA_ONNXRUNTIME_LIB_DIR=$(CURRENT_DIR)/sources/onnxruntime/lib \
cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_FLAGS="-Wno-error=format-security" \
-DCMAKE_CXX_FLAGS="-Wno-error=format-security" \
-DSHERPA_ONNX_ENABLE_GPU=$(SHERPA_GPU) \
-DSHERPA_ONNX_ENABLE_TTS=ON \
-DSHERPA_ONNX_ENABLE_BINARY=OFF \
-DSHERPA_ONNX_ENABLE_PYTHON=OFF \
-DSHERPA_ONNX_ENABLE_TESTS=OFF \
-DSHERPA_ONNX_ENABLE_C_API=ON \
-DBUILD_SHARED_LIBS=ON \
-DSHERPA_ONNX_USE_PRE_INSTALLED_ONNXRUNTIME_IF_AVAILABLE=ON \
..
cd sources/sherpa-onnx/build && make -j$(JOBS)
backend-assets/lib: sources/sherpa-onnx sources/onnxruntime
mkdir -p backend-assets/lib
cp -rfLv sources/onnxruntime/lib/* backend-assets/lib/
cp -rfLv sources/sherpa-onnx/build/lib/*.so* backend-assets/lib/ 2>/dev/null || true
cp -rfLv sources/sherpa-onnx/build/lib/*.dylib backend-assets/lib/ 2>/dev/null || true
# libsherpa-shim wraps sherpa-onnx's nested config structs and TTS
# callback plumbing behind a purego-friendly API: opaque handles plus
# fixed-signature setters/getters/trampoline. Plain C compile — no cgo.
SHIM_EXT=so
ifeq ($(OS),Darwin)
SHIM_EXT=dylib
endif
backend-assets/lib/libsherpa-shim.$(SHIM_EXT): csrc/shim.c csrc/shim.h backend-assets/lib
$(CC) -shared -fPIC -O2 \
-I$(CURRENT_DIR)/sources/sherpa-onnx/sherpa-onnx/c-api \
-o $@ csrc/shim.c \
-L$(CURRENT_DIR)/backend-assets/lib \
-lsherpa-onnx-c-api \
-Wl,-rpath,'$$ORIGIN'
sherpa-onnx: backend-assets/lib backend-assets/lib/libsherpa-shim.$(SHIM_EXT)
CGO_ENABLED=0 $(GOCMD) build \
-ldflags "$(LD_FLAGS) -X main.onnxProvider=$(ONNX_PROVIDER)" \
-tags "$(GO_TAGS)" -o sherpa-onnx ./
package:
bash package.sh
build: sherpa-onnx package
clean:
rm -rf sherpa-onnx sources/ backend-assets/ package/ vits-ljs/ sherpa-onnx-whisper-*/
test: sherpa-onnx
LD_LIBRARY_PATH=$(CURRENT_DIR)/backend-assets/lib \
bash test.sh
.PHONY: build package clean test
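
Note on the onnxruntime tarball name: the `sources/onnxruntime` rule assembles the download URL from `ONNX_OS`, `ONNX_ARCH`, `ONNX_VARIANT` and `ONNX_VERSION`, and the variant depends on `BUILD_TYPE` and `CUDA_MAJOR_VERSION`. The sketch below reproduces that string assembly so the resulting URLs can be checked by eye; it follows the Makefile logic shown above, and the function names are made up:

```go
package main

import "fmt"

// onnxVariant mirrors the BUILD_TYPE / CUDA_MAJOR_VERSION branch: CPU builds
// have no suffix, CUDA 13 uses "-gpu_cuda13", any other cublas build "-gpu".
func onnxVariant(buildType, cudaMajor string) string {
	if buildType != "cublas" {
		return ""
	}
	if cudaMajor == "13" {
		return "-gpu_cuda13"
	}
	return "-gpu"
}

// onnxTarballURL mirrors the curl URL in the sources/onnxruntime rule.
func onnxTarballURL(version, osName, arch, variant string) string {
	return fmt.Sprintf(
		"https://github.com/microsoft/onnxruntime/releases/download/v%s/onnxruntime-%s-%s%s-%s.tgz",
		version, osName, arch, variant, version)
}

func main() {
	for _, cuda := range []string{"12", "13"} {
		fmt.Println(onnxTarballURL("1.24.4", "linux", "x64", onnxVariant("cublas", cuda)))
	}
	fmt.Println(onnxTarballURL("1.24.4", "linux", "aarch64", onnxVariant("", "")))
}
```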

View File

File diff suppressed because it is too large.

View File

@@ -0,0 +1,169 @@
package main
import (
"os"
"path/filepath"
"testing"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func TestSherpaBackend(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "Sherpa-ONNX Backend Suite")
}
// Load libsherpa-shim + libsherpa-onnx-c-api via purego before any spec
// runs — otherwise any Load/TTS/VAD/AudioTranscription call hits a nil
// function pointer. LD_LIBRARY_PATH must contain the directory holding
// both .so files; test.sh sets this.
var _ = BeforeSuite(func() {
Expect(loadSherpaLibs()).To(Succeed())
})
var _ = Describe("Sherpa-ONNX", func() {
Context("lifecycle", func() {
It("is locking (C API is not thread safe)", func() {
Expect((&SherpaBackend{}).Locking()).To(BeTrue())
})
It("errors loading a non-existent model", func() {
tmpDir, err := os.MkdirTemp("", "sherpa-test-nonexistent")
Expect(err).ToNot(HaveOccurred())
defer os.RemoveAll(tmpDir)
err = (&SherpaBackend{}).Load(&pb.ModelOptions{
ModelFile: filepath.Join(tmpDir, "non-existent-model.onnx"),
})
Expect(err).To(HaveOccurred())
})
It("errors loading a non-existent ASR model", func() {
tmpDir, err := os.MkdirTemp("", "sherpa-test-asr")
Expect(err).ToNot(HaveOccurred())
defer os.RemoveAll(tmpDir)
err = (&SherpaBackend{}).Load(&pb.ModelOptions{
ModelFile: filepath.Join(tmpDir, "model.onnx"),
Type: "asr",
})
Expect(err).To(HaveOccurred())
})
It("dispatches Load by Type", func() {
tmpDir, err := os.MkdirTemp("", "sherpa-test-dispatch")
Expect(err).ToNot(HaveOccurred())
defer os.RemoveAll(tmpDir)
modelFile := filepath.Join(tmpDir, "model.onnx")
for _, typ := range []string{"", "asr", "vad"} {
err := (&SherpaBackend{}).Load(&pb.ModelOptions{ModelFile: modelFile, Type: typ})
Expect(err).To(HaveOccurred(), "Type=%q", typ)
}
})
})
Context("method errors without loaded model", func() {
It("rejects TTS", func() {
tmpDir, err := os.MkdirTemp("", "sherpa-test-tts")
Expect(err).ToNot(HaveOccurred())
defer os.RemoveAll(tmpDir)
err = (&SherpaBackend{}).TTS(&pb.TTSRequest{
Text: "should fail — no model loaded",
Dst: filepath.Join(tmpDir, "output.wav"),
})
Expect(err).To(HaveOccurred())
})
It("rejects AudioTranscription", func() {
_, err := (&SherpaBackend{}).AudioTranscription(&pb.TranscriptRequest{
Dst: "/tmp/nonexistent.wav",
})
Expect(err).To(HaveOccurred())
})
It("rejects VAD", func() {
_, err := (&SherpaBackend{}).VAD(&pb.VADRequest{
Audio: []float32{0.1, 0.2, 0.3},
})
Expect(err).To(HaveOccurred())
})
})
Context("type detection", func() {
DescribeTable("isASRType",
func(input string, want bool) {
Expect(isASRType(input)).To(Equal(want))
},
Entry("asr", "asr", true),
Entry("ASR", "ASR", true),
Entry("Asr", "Asr", true),
Entry("transcription", "transcription", true),
Entry("Transcription", "Transcription", true),
Entry("transcribe", "transcribe", true),
Entry("Transcribe", "Transcribe", true),
Entry("tts", "tts", false),
Entry("empty", "", false),
Entry("other", "other", false),
Entry("vad", "vad", false),
)
DescribeTable("isVADType",
func(input string, want bool) {
Expect(isVADType(input)).To(Equal(want))
},
Entry("vad", "vad", true),
Entry("VAD", "VAD", true),
Entry("Vad", "Vad", true),
Entry("asr", "asr", false),
Entry("tts", "tts", false),
Entry("empty", "", false),
Entry("other", "other", false),
)
})
Context("option parsing", func() {
It("parses float options with fallback on bad input", func() {
opts := &pb.ModelOptions{Options: []string{
"vad.threshold=0.3",
"tts.length_scale=1.25",
"bad.number=not-a-float",
}}
Expect(findOptionFloat(opts, "vad.threshold=", 0.5)).To(BeNumerically("~", 0.3, 1e-6))
Expect(findOptionFloat(opts, "tts.length_scale=", 1.0)).To(BeNumerically("~", 1.25, 1e-6))
Expect(findOptionFloat(opts, "missing.key=", 0.7)).To(BeNumerically("~", 0.7, 1e-6))
Expect(findOptionFloat(opts, "bad.number=", 9.9)).To(BeNumerically("~", 9.9, 1e-6))
})
It("parses int options with fallback on bad input", func() {
opts := &pb.ModelOptions{Options: []string{
"asr.sample_rate=22050",
"online.chunk_samples=800",
"bad.int=4.2",
}}
Expect(findOptionInt(opts, "asr.sample_rate=", 16000)).To(Equal(int32(22050)))
Expect(findOptionInt(opts, "online.chunk_samples=", 1600)).To(Equal(int32(800)))
Expect(findOptionInt(opts, "missing.key=", 42)).To(Equal(int32(42)))
Expect(findOptionInt(opts, "bad.int=", 100)).To(Equal(int32(100)))
})
It("parses bool options (0/1, true/false, yes/no, on/off)", func() {
opts := &pb.ModelOptions{Options: []string{
"online.enable_endpoint=0",
"asr.sense_voice.use_itn=True",
"feature.on=yes",
"feature.off=Off",
"feature.bad=maybe",
}}
Expect(findOptionBool(opts, "online.enable_endpoint=", 1)).To(Equal(int32(0)))
Expect(findOptionBool(opts, "asr.sense_voice.use_itn=", 0)).To(Equal(int32(1)))
Expect(findOptionBool(opts, "feature.on=", 0)).To(Equal(int32(1)))
Expect(findOptionBool(opts, "feature.off=", 1)).To(Equal(int32(0)))
Expect(findOptionBool(opts, "feature.bad=", 1)).To(Equal(int32(1)))
Expect(findOptionBool(opts, "missing.key=", 1)).To(Equal(int32(1)))
})
})
})
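
Note on the option helpers: the specs above pin down the observable behaviour of `findOptionFloat`, `findOptionInt` and `findOptionBool` (prefix lookup on `key=` entries, default on a missing key or unparsable value), but their implementation sits in the suppressed file. The sketch below is one implementation consistent with those specs. It is an assumption, not the backend's code, and it takes a plain `[]string` instead of `*pb.ModelOptions` to stay self-contained:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lookup returns the text after prefix for the first option that starts with it.
func lookup(opts []string, prefix string) (string, bool) {
	for _, o := range opts {
		if strings.HasPrefix(o, prefix) {
			return strings.TrimPrefix(o, prefix), true
		}
	}
	return "", false
}

// optionFloat falls back to def when the key is missing or not a float.
func optionFloat(opts []string, prefix string, def float32) float32 {
	raw, ok := lookup(opts, prefix)
	if !ok {
		return def
	}
	v, err := strconv.ParseFloat(raw, 32)
	if err != nil {
		return def
	}
	return float32(v)
}

// optionInt falls back to def when the key is missing or not a base-10 integer.
func optionInt(opts []string, prefix string, def int32) int32 {
	raw, ok := lookup(opts, prefix)
	if !ok {
		return def
	}
	v, err := strconv.ParseInt(raw, 10, 32)
	if err != nil {
		return def
	}
	return int32(v)
}

// optionBool accepts 0/1, true/false, yes/no, on/off (case-insensitive) and
// returns 0 or 1; anything else yields def.
func optionBool(opts []string, prefix string, def int32) int32 {
	raw, ok := lookup(opts, prefix)
	if !ok {
		return def
	}
	switch strings.ToLower(raw) {
	case "1", "true", "yes", "on":
		return 1
	case "0", "false", "no", "off":
		return 0
	}
	return def
}

func main() {
	opts := []string{"vad.threshold=0.3", "asr.sample_rate=22050", "feature.off=Off"}
	fmt.Println(optionFloat(opts, "vad.threshold=", 0.5))
	fmt.Println(optionInt(opts, "asr.sample_rate=", 16000))
	fmt.Println(optionBool(opts, "feature.off=", 1))
}
```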

View File

@@ -0,0 +1,325 @@
#include "shim.h"
#include "c-api.h"
#include <stdlib.h>
#include <string.h>
// Replace the char* field pointed to by `slot` with a strdup of `s`
// (or NULL if s is NULL). Frees any prior value. Silently no-ops when
// strdup fails — the caller will see a Create* failure downstream.
static void shim_set_str(const char **slot, const char *s) {
free((char *)*slot);
*slot = s ? strdup(s) : NULL;
}
// ==================================================================
// VAD config
// ==================================================================
void *sherpa_shim_vad_config_new(void) {
return calloc(1, sizeof(SherpaOnnxVadModelConfig));
}
void sherpa_shim_vad_config_free(void *h) {
if (!h) return;
SherpaOnnxVadModelConfig *c = (SherpaOnnxVadModelConfig *)h;
free((char *)c->silero_vad.model);
free((char *)c->provider);
free(c);
}
void sherpa_shim_vad_config_set_silero_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxVadModelConfig *)h)->silero_vad.model, v);
}
void sherpa_shim_vad_config_set_silero_threshold(void *h, float v) {
((SherpaOnnxVadModelConfig *)h)->silero_vad.threshold = v;
}
void sherpa_shim_vad_config_set_silero_min_silence_duration(void *h, float v) {
((SherpaOnnxVadModelConfig *)h)->silero_vad.min_silence_duration = v;
}
void sherpa_shim_vad_config_set_silero_min_speech_duration(void *h, float v) {
((SherpaOnnxVadModelConfig *)h)->silero_vad.min_speech_duration = v;
}
void sherpa_shim_vad_config_set_silero_window_size(void *h, int32_t v) {
((SherpaOnnxVadModelConfig *)h)->silero_vad.window_size = v;
}
void sherpa_shim_vad_config_set_silero_max_speech_duration(void *h, float v) {
((SherpaOnnxVadModelConfig *)h)->silero_vad.max_speech_duration = v;
}
void sherpa_shim_vad_config_set_sample_rate(void *h, int32_t v) {
((SherpaOnnxVadModelConfig *)h)->sample_rate = v;
}
void sherpa_shim_vad_config_set_num_threads(void *h, int32_t v) {
((SherpaOnnxVadModelConfig *)h)->num_threads = v;
}
void sherpa_shim_vad_config_set_provider(void *h, const char *v) {
shim_set_str(&((SherpaOnnxVadModelConfig *)h)->provider, v);
}
void sherpa_shim_vad_config_set_debug(void *h, int32_t v) {
((SherpaOnnxVadModelConfig *)h)->debug = v;
}
void *sherpa_shim_create_vad(void *h, float buffer_size_seconds) {
return (void *)SherpaOnnxCreateVoiceActivityDetector(
(const SherpaOnnxVadModelConfig *)h, buffer_size_seconds);
}
// ==================================================================
// Offline TTS config (VITS)
// ==================================================================
void *sherpa_shim_tts_config_new(void) {
return calloc(1, sizeof(SherpaOnnxOfflineTtsConfig));
}
void sherpa_shim_tts_config_free(void *h) {
if (!h) return;
SherpaOnnxOfflineTtsConfig *c = (SherpaOnnxOfflineTtsConfig *)h;
free((char *)c->model.vits.model);
free((char *)c->model.vits.tokens);
free((char *)c->model.vits.lexicon);
free((char *)c->model.vits.data_dir);
free((char *)c->model.provider);
free(c);
}
void sherpa_shim_tts_config_set_vits_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.model, v);
}
void sherpa_shim_tts_config_set_vits_tokens(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.tokens, v);
}
void sherpa_shim_tts_config_set_vits_lexicon(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.lexicon, v);
}
void sherpa_shim_tts_config_set_vits_data_dir(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.data_dir, v);
}
void sherpa_shim_tts_config_set_vits_noise_scale(void *h, float v) {
((SherpaOnnxOfflineTtsConfig *)h)->model.vits.noise_scale = v;
}
void sherpa_shim_tts_config_set_vits_noise_scale_w(void *h, float v) {
((SherpaOnnxOfflineTtsConfig *)h)->model.vits.noise_scale_w = v;
}
void sherpa_shim_tts_config_set_vits_length_scale(void *h, float v) {
((SherpaOnnxOfflineTtsConfig *)h)->model.vits.length_scale = v;
}
void sherpa_shim_tts_config_set_num_threads(void *h, int32_t v) {
((SherpaOnnxOfflineTtsConfig *)h)->model.num_threads = v;
}
void sherpa_shim_tts_config_set_debug(void *h, int32_t v) {
((SherpaOnnxOfflineTtsConfig *)h)->model.debug = v;
}
void sherpa_shim_tts_config_set_provider(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.provider, v);
}
void sherpa_shim_tts_config_set_max_num_sentences(void *h, int32_t v) {
((SherpaOnnxOfflineTtsConfig *)h)->max_num_sentences = v;
}
void *sherpa_shim_create_offline_tts(void *h) {
return (void *)SherpaOnnxCreateOfflineTts(
(const SherpaOnnxOfflineTtsConfig *)h);
}
// ==================================================================
// Offline recognizer config
// ==================================================================
void *sherpa_shim_offline_recog_config_new(void) {
return calloc(1, sizeof(SherpaOnnxOfflineRecognizerConfig));
}
void sherpa_shim_offline_recog_config_free(void *h) {
if (!h) return;
SherpaOnnxOfflineRecognizerConfig *c = (SherpaOnnxOfflineRecognizerConfig *)h;
free((char *)c->model_config.provider);
free((char *)c->model_config.tokens);
free((char *)c->model_config.whisper.encoder);
free((char *)c->model_config.whisper.decoder);
free((char *)c->model_config.whisper.language);
free((char *)c->model_config.whisper.task);
free((char *)c->model_config.paraformer.model);
free((char *)c->model_config.sense_voice.model);
free((char *)c->model_config.sense_voice.language);
free((char *)c->model_config.omnilingual.model);
free((char *)c->decoding_method);
free(c);
}
void sherpa_shim_offline_recog_config_set_num_threads(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.num_threads = v;
}
void sherpa_shim_offline_recog_config_set_debug(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.debug = v;
}
void sherpa_shim_offline_recog_config_set_provider(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.provider, v);
}
void sherpa_shim_offline_recog_config_set_tokens(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.tokens, v);
}
void sherpa_shim_offline_recog_config_set_feat_sample_rate(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->feat_config.sample_rate = v;
}
void sherpa_shim_offline_recog_config_set_feat_feature_dim(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->feat_config.feature_dim = v;
}
void sherpa_shim_offline_recog_config_set_decoding_method(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->decoding_method, v);
}
void sherpa_shim_offline_recog_config_set_whisper_encoder(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.encoder, v);
}
void sherpa_shim_offline_recog_config_set_whisper_decoder(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.decoder, v);
}
void sherpa_shim_offline_recog_config_set_whisper_language(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.language, v);
}
void sherpa_shim_offline_recog_config_set_whisper_task(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.task, v);
}
void sherpa_shim_offline_recog_config_set_whisper_tail_paddings(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.tail_paddings = v;
}
void sherpa_shim_offline_recog_config_set_paraformer_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.paraformer.model, v);
}
void sherpa_shim_offline_recog_config_set_sense_voice_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.sense_voice.model, v);
}
void sherpa_shim_offline_recog_config_set_sense_voice_language(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.sense_voice.language, v);
}
void sherpa_shim_offline_recog_config_set_sense_voice_use_itn(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.sense_voice.use_itn = v;
}
void sherpa_shim_offline_recog_config_set_omnilingual_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.omnilingual.model, v);
}
void *sherpa_shim_create_offline_recognizer(void *h) {
return (void *)SherpaOnnxCreateOfflineRecognizer(
(const SherpaOnnxOfflineRecognizerConfig *)h);
}
// ==================================================================
// Online recognizer config
// ==================================================================
void *sherpa_shim_online_recog_config_new(void) {
return calloc(1, sizeof(SherpaOnnxOnlineRecognizerConfig));
}
void sherpa_shim_online_recog_config_free(void *h) {
if (!h) return;
SherpaOnnxOnlineRecognizerConfig *c = (SherpaOnnxOnlineRecognizerConfig *)h;
free((char *)c->model_config.transducer.encoder);
free((char *)c->model_config.transducer.decoder);
free((char *)c->model_config.transducer.joiner);
free((char *)c->model_config.tokens);
free((char *)c->model_config.provider);
free((char *)c->decoding_method);
free(c);
}
void sherpa_shim_online_recog_config_set_transducer_encoder(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.transducer.encoder, v);
}
void sherpa_shim_online_recog_config_set_transducer_decoder(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.transducer.decoder, v);
}
void sherpa_shim_online_recog_config_set_transducer_joiner(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.transducer.joiner, v);
}
void sherpa_shim_online_recog_config_set_tokens(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.tokens, v);
}
void sherpa_shim_online_recog_config_set_num_threads(void *h, int32_t v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.num_threads = v;
}
void sherpa_shim_online_recog_config_set_debug(void *h, int32_t v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.debug = v;
}
void sherpa_shim_online_recog_config_set_provider(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.provider, v);
}
void sherpa_shim_online_recog_config_set_feat_sample_rate(void *h, int32_t v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->feat_config.sample_rate = v;
}
void sherpa_shim_online_recog_config_set_feat_feature_dim(void *h, int32_t v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->feat_config.feature_dim = v;
}
void sherpa_shim_online_recog_config_set_decoding_method(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->decoding_method, v);
}
void sherpa_shim_online_recog_config_set_enable_endpoint(void *h, int32_t v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->enable_endpoint = v;
}
void sherpa_shim_online_recog_config_set_rule1_min_trailing_silence(void *h, float v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->rule1_min_trailing_silence = v;
}
void sherpa_shim_online_recog_config_set_rule2_min_trailing_silence(void *h, float v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->rule2_min_trailing_silence = v;
}
void sherpa_shim_online_recog_config_set_rule3_min_utterance_length(void *h, float v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->rule3_min_utterance_length = v;
}
void *sherpa_shim_create_online_recognizer(void *h) {
return (void *)SherpaOnnxCreateOnlineRecognizer(
(const SherpaOnnxOnlineRecognizerConfig *)h);
}
// ==================================================================
// Result-struct accessors
// ==================================================================
int32_t sherpa_shim_wave_sample_rate(const void *h) {
return ((const SherpaOnnxWave *)h)->sample_rate;
}
int32_t sherpa_shim_wave_num_samples(const void *h) {
return ((const SherpaOnnxWave *)h)->num_samples;
}
const float *sherpa_shim_wave_samples(const void *h) {
return ((const SherpaOnnxWave *)h)->samples;
}
const char *sherpa_shim_offline_result_text(const void *h) {
return ((const SherpaOnnxOfflineRecognizerResult *)h)->text;
}
const char *sherpa_shim_online_result_text(const void *h) {
return ((const SherpaOnnxOnlineRecognizerResult *)h)->text;
}
int32_t sherpa_shim_generated_audio_sample_rate(const void *h) {
return ((const SherpaOnnxGeneratedAudio *)h)->sample_rate;
}
int32_t sherpa_shim_generated_audio_n(const void *h) {
return ((const SherpaOnnxGeneratedAudio *)h)->n;
}
const float *sherpa_shim_generated_audio_samples(const void *h) {
return ((const SherpaOnnxGeneratedAudio *)h)->samples;
}
int32_t sherpa_shim_speech_segment_start(const void *h) {
return ((const SherpaOnnxSpeechSegment *)h)->start;
}
int32_t sherpa_shim_speech_segment_n(const void *h) {
return ((const SherpaOnnxSpeechSegment *)h)->n;
}
// ==================================================================
// TTS streaming callback trampoline
// ==================================================================
void *sherpa_shim_tts_generate_with_callback(
void *tts, const char *text, int32_t sid, float speed,
uintptr_t callback_ptr, uintptr_t user_data) {
SherpaOnnxGeneratedAudioCallbackWithArg cb =
(SherpaOnnxGeneratedAudioCallbackWithArg)callback_ptr;
return (void *)SherpaOnnxOfflineTtsGenerateWithCallbackWithArg(
(const SherpaOnnxOfflineTts *)tts, text, sid, speed, cb,
(void *)user_data);
}
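
The setter/free ownership pattern used throughout this file can be sketched in isolation. This is a minimal illustration, not the shim itself: `DemoConfig`, `demo_strdup`, `demo_set_str` and the `demo_config_*` functions are hypothetical names, and the struct stands in for the much larger sherpa-onnx config structs. It assumes a platform where zero-filled memory reads back as NULL pointers, which holds on mainstream targets.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical config struct standing in for a sherpa-onnx config.
typedef struct {
  const char *model;
  const char *provider;
  int num_threads;
} DemoConfig;

// Portable stand-in for strdup: returns a heap copy of `s`, or NULL on OOM.
static char *demo_strdup(const char *s) {
  size_t n = strlen(s) + 1;
  char *copy = malloc(n);
  if (copy) memcpy(copy, s, n);
  return copy;
}

// Free the previous value, then store a private copy of `s` (or NULL).
static void demo_set_str(const char **slot, const char *s) {
  assert(slot != NULL);
  free((char *)*slot);
  *slot = s ? demo_strdup(s) : NULL;
}

// calloc zero-fills the struct, so every string field starts out NULL
// and can be passed to free() without having been set.
DemoConfig *demo_config_new(void) { return calloc(1, sizeof(DemoConfig)); }

void demo_config_set_model(DemoConfig *c, const char *v) {
  demo_set_str(&c->model, v);
}

void demo_config_set_provider(DemoConfig *c, const char *v) {
  demo_set_str(&c->provider, v);
}

// Walk every owned string, then release the struct itself.
void demo_config_free(DemoConfig *c) {
  if (!c) return;
  free((char *)c->model);
  free((char *)c->provider);
  free(c);
}
```

Because each setter copies its argument, the caller can reuse or release its own buffer as soon as the setter returns, and calling a setter twice on the same field does not leak the first value.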


@@ -0,0 +1,129 @@
#ifndef LOCALAI_SHERPA_ONNX_SHIM_H
#define LOCALAI_SHERPA_ONNX_SHIM_H
#include <stdint.h>
// libsherpa-shim: purego-friendly wrapper around sherpa-onnx's C API.
// Purego can't access C struct fields and can't route C callbacks to Go
// funcs directly. Every function here is a fixed-signature trampoline
// that replaces one field read/write or callback handoff that the Go
// backend would otherwise have to do through cgo.
//
// String lifetime: setters strdup; _free walks every owned string and
// frees it. Callers may discard their input buffers the moment a setter
// returns.
//
// Opaque handles are `void *` in both directions. Nothing here holds a
// reference across calls except config handles (freed via _free) and
// sherpa-allocated results (freed via sherpa's own Destroy* entry
// points, which Go calls through purego pass-through).
#ifdef __cplusplus
extern "C" {
#endif
// --- VAD config -----------------------------------------------------
void *sherpa_shim_vad_config_new(void);
void sherpa_shim_vad_config_free(void *cfg);
void sherpa_shim_vad_config_set_silero_model(void *cfg, const char *path);
void sherpa_shim_vad_config_set_silero_threshold(void *cfg, float v);
void sherpa_shim_vad_config_set_silero_min_silence_duration(void *cfg, float v);
void sherpa_shim_vad_config_set_silero_min_speech_duration(void *cfg, float v);
void sherpa_shim_vad_config_set_silero_window_size(void *cfg, int32_t v);
void sherpa_shim_vad_config_set_silero_max_speech_duration(void *cfg, float v);
void sherpa_shim_vad_config_set_sample_rate(void *cfg, int32_t v);
void sherpa_shim_vad_config_set_num_threads(void *cfg, int32_t v);
void sherpa_shim_vad_config_set_provider(void *cfg, const char *v);
void sherpa_shim_vad_config_set_debug(void *cfg, int32_t v);
void *sherpa_shim_create_vad(void *cfg, float buffer_size_seconds);
// --- Offline TTS config (VITS path — the only TTS family the backend uses) ---
void *sherpa_shim_tts_config_new(void);
void sherpa_shim_tts_config_free(void *cfg);
void sherpa_shim_tts_config_set_vits_model(void *cfg, const char *v);
void sherpa_shim_tts_config_set_vits_tokens(void *cfg, const char *v);
void sherpa_shim_tts_config_set_vits_lexicon(void *cfg, const char *v);
void sherpa_shim_tts_config_set_vits_data_dir(void *cfg, const char *v);
void sherpa_shim_tts_config_set_vits_noise_scale(void *cfg, float v);
void sherpa_shim_tts_config_set_vits_noise_scale_w(void *cfg, float v);
void sherpa_shim_tts_config_set_vits_length_scale(void *cfg, float v);
void sherpa_shim_tts_config_set_num_threads(void *cfg, int32_t v);
void sherpa_shim_tts_config_set_debug(void *cfg, int32_t v);
void sherpa_shim_tts_config_set_provider(void *cfg, const char *v);
void sherpa_shim_tts_config_set_max_num_sentences(void *cfg, int32_t v);
void *sherpa_shim_create_offline_tts(void *cfg);
// --- Offline recognizer config (Whisper / Paraformer / SenseVoice / Omnilingual) ---
void *sherpa_shim_offline_recog_config_new(void);
void sherpa_shim_offline_recog_config_free(void *cfg);
void sherpa_shim_offline_recog_config_set_num_threads(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_debug(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_provider(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_tokens(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_feat_sample_rate(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_feat_feature_dim(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_decoding_method(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_whisper_encoder(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_whisper_decoder(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_whisper_language(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_whisper_task(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_whisper_tail_paddings(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_paraformer_model(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_sense_voice_model(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_sense_voice_language(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_sense_voice_use_itn(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_omnilingual_model(void *cfg, const char *v);
void *sherpa_shim_create_offline_recognizer(void *cfg);
// --- Online recognizer config (streaming zipformer transducer) ---
void *sherpa_shim_online_recog_config_new(void);
void sherpa_shim_online_recog_config_free(void *cfg);
void sherpa_shim_online_recog_config_set_transducer_encoder(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_transducer_decoder(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_transducer_joiner(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_tokens(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_num_threads(void *cfg, int32_t v);
void sherpa_shim_online_recog_config_set_debug(void *cfg, int32_t v);
void sherpa_shim_online_recog_config_set_provider(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_feat_sample_rate(void *cfg, int32_t v);
void sherpa_shim_online_recog_config_set_feat_feature_dim(void *cfg, int32_t v);
void sherpa_shim_online_recog_config_set_decoding_method(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_enable_endpoint(void *cfg, int32_t v);
void sherpa_shim_online_recog_config_set_rule1_min_trailing_silence(void *cfg, float v);
void sherpa_shim_online_recog_config_set_rule2_min_trailing_silence(void *cfg, float v);
void sherpa_shim_online_recog_config_set_rule3_min_utterance_length(void *cfg, float v);
void *sherpa_shim_create_online_recognizer(void *cfg);
// --- Result accessors (sherpa-allocated; caller destroys via sherpa's own Destroy*) ---
int32_t sherpa_shim_wave_sample_rate(const void *wave);
int32_t sherpa_shim_wave_num_samples(const void *wave);
const float *sherpa_shim_wave_samples(const void *wave);
const char *sherpa_shim_offline_result_text(const void *result);
const char *sherpa_shim_online_result_text(const void *result);
int32_t sherpa_shim_generated_audio_sample_rate(const void *audio);
int32_t sherpa_shim_generated_audio_n(const void *audio);
const float *sherpa_shim_generated_audio_samples(const void *audio);
int32_t sherpa_shim_speech_segment_start(const void *seg);
int32_t sherpa_shim_speech_segment_n(const void *seg);
// --- TTS streaming callback trampoline -----------------------------
// Replaces the //export sherpaTtsGoCallback + callbacks.c bridge pattern.
// `callback_ptr` is the C-callable function pointer returned by
// purego.NewCallback. `user_data` is an integer the Go side uses to
// look up its state (sync.Map keyed by uint64).
//
// Returns the sherpa-allocated SherpaOnnxGeneratedAudio. Destroy with
// SherpaOnnxDestroyOfflineTtsGeneratedAudio (callable directly from
// Go via purego).
void *sherpa_shim_tts_generate_with_callback(
void *tts, const char *text, int32_t sid, float speed,
uintptr_t callback_ptr, uintptr_t user_data);
#ifdef __cplusplus
}
#endif
#endif
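
The callback trampoline declared above passes both the function pointer and the user data across the FFI boundary as plain integers. A rough, self-contained sketch of that idea follows. All names are hypothetical (`DemoAudioCallback`, `demo_generate`, `demo_generate_with_callback`, `demo_count_samples`), the producer is a stand-in rather than sherpa-onnx, and the "return 0 to stop" convention is an assumption of the sketch. It also assumes a platform where a function pointer round-trips through `uintptr_t`, as POSIX systems allow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical callback type: receives a chunk of samples plus the
// caller's opaque argument; returning 0 asks the producer to stop.
typedef int32_t (*DemoAudioCallback)(const float *samples, int32_t n, void *arg);

// Hypothetical producer: emits `total` samples in chunks of up to 4 and
// returns how many samples were handed to the callback.
static int32_t demo_generate(int32_t total, DemoAudioCallback cb, void *arg) {
  float chunk[4] = {0.0f, 0.0f, 0.0f, 0.0f};
  int32_t sent = 0;
  while (sent < total) {
    int32_t n = (total - sent < 4) ? (total - sent) : 4;
    sent += n;
    if (cb(chunk, n, arg) == 0) break;
  }
  return sent;
}

// Trampoline: the integers are cast back to a function pointer and a
// void* here, so the foreign caller never handles C pointer types.
int32_t demo_generate_with_callback(int32_t total, uintptr_t callback_ptr,
                                    uintptr_t user_data) {
  assert(callback_ptr != 0);
  DemoAudioCallback cb = (DemoAudioCallback)callback_ptr;
  return demo_generate(total, cb, (void *)user_data);
}

// Example receiver: treats `arg` as an integer key into a small table,
// never as a pointer to dereference.
static int32_t demo_counts[8];

int32_t demo_count_samples(const float *samples, int32_t n, void *arg) {
  (void)samples;
  uintptr_t key = (uintptr_t)arg;
  if (key >= 8) return 0;  // unknown key: ask the producer to stop
  demo_counts[key] += n;
  return 1;
}
```

Keying the state by an integer rather than passing a real pointer means the receiving side can keep its state in a garbage-collected map and only needs to hand a number to C.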


@@ -0,0 +1,23 @@
package main
import (
"flag"
grpc "github.com/mudler/LocalAI/pkg/grpc"
)
var (
addr = flag.String("addr", "localhost:50051", "the address to connect to")
)
func main() {
flag.Parse()
if err := loadSherpaLibs(); err != nil {
panic(err)
}
if err := grpc.StartServer(*addr, &SherpaBackend{}); err != nil {
panic(err)
}
}


@@ -0,0 +1,51 @@
#!/bin/bash
set -e
CURDIR=$(dirname "$(realpath "$0")")
REPO_ROOT="${CURDIR}/../../.."
mkdir -p $CURDIR/package/lib
cp -avf $CURDIR/sherpa-onnx $CURDIR/package/
cp -avf $CURDIR/run.sh $CURDIR/package/
cp -rfLv $CURDIR/backend-assets/lib/* $CURDIR/package/lib/
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ "$(uname -s)" = "Darwin" ]; then
echo "Detected Darwin"
else
echo "Error: Could not detect architecture"
exit 1
fi
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/

backend/go/sherpa-onnx/run.sh Executable file

@@ -0,0 +1,13 @@
#!/bin/bash
set -ex
CURDIR=$(dirname "$(realpath $0)")
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
exec $CURDIR/lib/ld.so $CURDIR/sherpa-onnx "$@"
fi
exec $CURDIR/sherpa-onnx "$@"

backend/go/sherpa-onnx/test.sh Executable file

@@ -0,0 +1,12 @@
#!/bin/bash
# Unit tests for the sherpa-onnx backend. Exercises error-path and
# dispatch logic via SherpaBackend directly (no gRPC). Integration
# coverage (gRPC TTS / streaming ASR / realtime pipeline) lives in
# tests/e2e-backends and tests/e2e and runs against the Docker image.
set -e
CURDIR=$(dirname "$(realpath $0)")
cd "$CURDIR"
PACKAGES=$(go list ./... | grep -v /sources/)
go test -v -timeout 60s $PACKAGES


@@ -4,7 +4,7 @@ set(CMAKE_POSITION_INDEPENDENT_CODE ON)
add_subdirectory(./sources/stablediffusion-ggml.cpp)
-add_library(gosd MODULE gosd.cpp)
+add_library(gosd MODULE cpp/gosd.cpp)
target_link_libraries(gosd PRIVATE stable-diffusion ggml)
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)


@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# stablediffusion.cpp (ggml)
STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
-STABLEDIFFUSION_GGML_VERSION?=7d33d4b2ddeafa672761a5880ec33bdff452504d
+STABLEDIFFUSION_GGML_VERSION?=3d6064b37ef4607917f8acf2ca8c8906d5087413
CMAKE_ARGS+=-DGGML_MAX_NAME=128
@@ -119,7 +119,7 @@ libgosd-fallback.so: sources/stablediffusion-ggml.cpp
SO_TARGET=libgosd-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgosd-custom
rm -rfv build*
-libgosd-custom: CMakeLists.txt gosd.cpp gosd.h
+libgosd-custom: CMakeLists.txt cpp/gosd.cpp cpp/gosd.h
mkdir -p build-$(SO_TARGET) && \
cd build-$(SO_TARGET) && \
cmake .. $(CMAKE_ARGS) && \


@@ -1106,6 +1106,11 @@ static int ffmpeg_mux_raw_to_mp4(sd_image_t* frames, int num_frames, int fps, co
const_cast<char*>("-c:v"), const_cast<char*>("libx264"),
const_cast<char*>("-pix_fmt"), const_cast<char*>("yuv420p"),
const_cast<char*>("-movflags"), const_cast<char*>("+faststart"),
+// Force MP4 container. Distributed LocalAI hands us a staging
+// path (e.g. /staging/localai-output-NNN.tmp) with a non-standard
+// extension; relying on filename suffix makes ffmpeg bail with
+// "Unable to choose an output format".
+const_cast<char*>("-f"), const_cast<char*>("mp4"),
const_cast<char*>(dst),
nullptr
};
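
The effect of this hunk on the argument vector can be sketched as follows. This is an illustration only: `demo_build_output_args` and `demo_forces_mp4` are hypothetical helpers, only the output-side flags shown in the hunk are reproduced, and the input-side arguments of the real command are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Writes the output-side ffmpeg arguments, then `dst`, then a NULL
// terminator into `argv`. Returns the number of non-NULL entries
// written, or 0 when `cap` cannot hold them plus the terminator.
size_t demo_build_output_args(const char *dst, const char **argv, size_t cap) {
  static const char *const fixed[] = {
      "-c:v", "libx264",
      "-pix_fmt", "yuv420p",
      "-movflags", "+faststart",
      "-f", "mp4",  // explicit container: not inferred from dst's suffix
  };
  const size_t n = sizeof(fixed) / sizeof(fixed[0]);
  assert(dst != NULL && argv != NULL);
  if (cap < n + 2) return 0;
  for (size_t i = 0; i < n; i++) argv[i] = fixed[i];
  argv[n] = dst;
  argv[n + 1] = NULL;
  return n + 1;
}

// Returns 1 when a NULL-terminated argv contains "-f" immediately
// followed by "mp4", and 0 otherwise.
int demo_forces_mp4(const char *const *argv) {
  for (size_t i = 0; argv[i] != NULL && argv[i + 1] != NULL; i++) {
    if (strcmp(argv[i], "-f") == 0 && strcmp(argv[i + 1], "mp4") == 0) return 1;
  }
  return 0;
}
```

With the format named explicitly, a destination such as a `.tmp` staging path no longer has to carry a recognizable extension.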


@@ -0,0 +1,71 @@
cmake_minimum_required(VERSION 3.18)
project(govibevoicecpp LANGUAGES C CXX)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
set(VIBEVOICE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/sources/vibevoice.cpp)
# Override upstream's CMAKE_CUDA_ARCHITECTURES before add_subdirectory.
if(NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
set(CMAKE_CUDA_ARCHITECTURES "75-virtual;80-virtual;86-real;89-real")
endif()
# Force-disable upstream tests/examples — we only need libvibevoice.
set(VIBEVOICE_BUILD_TESTS OFF CACHE BOOL "" FORCE)
set(VIBEVOICE_BUILD_EXAMPLES OFF CACHE BOOL "" FORCE)
set(VIBEVOICE_BUILD_SERVER OFF CACHE BOOL "" FORCE)
# vibevoice.cpp's top-level CMakeLists already adds third_party/ggml as a
# subdirectory — no need to add it explicitly here, just include the
# whole project.
add_subdirectory(${VIBEVOICE_DIR} vibevoice EXCLUDE_FROM_ALL)
add_library(govibevoicecpp MODULE cpp/govibevoicecpp.cpp)
# libvibevoice is STATIC; without --whole-archive the linker GCs the
# vv_capi_* symbols (purego dlopens them by name, nothing in our
# translation unit references them). Force the static archive's
# entire contents into the MODULE so dlsym finds vv_capi_load etc.
if(APPLE)
target_link_libraries(govibevoicecpp PRIVATE -Wl,-force_load $<TARGET_FILE:vibevoice>)
elseif(MSVC)
target_link_libraries(govibevoicecpp PRIVATE vibevoice)
set_property(TARGET govibevoicecpp APPEND PROPERTY LINK_FLAGS "/WHOLEARCHIVE:vibevoice")
else()
target_link_libraries(govibevoicecpp PRIVATE
-Wl,--whole-archive vibevoice -Wl,--no-whole-archive)
endif()
target_include_directories(govibevoicecpp PRIVATE ${VIBEVOICE_DIR}/include)
target_include_directories(govibevoicecpp SYSTEM PRIVATE ${VIBEVOICE_DIR}/third_party/ggml/include)
# Link GPU backends if available — vibevoice's own CMake already links
# these to the libvibevoice STATIC library, but we re-link them on the
# MODULE so resolved symbols include all backend kernels.
foreach(backend blas cuda metal vulkan)
if(TARGET ggml-${backend})
target_link_libraries(govibevoicecpp PRIVATE ggml-${backend})
string(TOUPPER ${backend} BACKEND_UPPER)
target_compile_definitions(govibevoicecpp PRIVATE VIBEVOICE_HAVE_${BACKEND_UPPER})
if(backend STREQUAL "cuda")
find_package(CUDAToolkit QUIET)
if(CUDAToolkit_FOUND)
target_link_libraries(govibevoicecpp PRIVATE CUDA::cudart)
endif()
endif()
endif()
endforeach()
if(MSVC)
target_compile_options(govibevoicecpp PRIVATE /W4 /wd4100 /wd4505)
else()
target_compile_options(govibevoicecpp PRIVATE -Wall -Wextra -Wshadow
-Wno-unused-parameter -Wno-unused-function -Wno-sign-conversion)
endif()
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)
target_link_libraries(govibevoicecpp PRIVATE stdc++fs)
endif()
set_property(TARGET govibevoicecpp PROPERTY CXX_STANDARD 17)
set_target_properties(govibevoicecpp PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR})


@@ -0,0 +1,128 @@
CMAKE_ARGS?=
BUILD_TYPE?=
NATIVE?=false
GOCMD?=go
GO_TAGS?=
JOBS?=$(shell nproc --ignore=1)
# vibevoice.cpp version
VIBEVOICE_REPO?=https://github.com/mudler/vibevoice.cpp
VIBEVOICE_CPP_VERSION?=master
SO_TARGET?=libgovibevoicecpp.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
CMAKE_ARGS+=-DVIBEVOICE_BUILD_TESTS=OFF
CMAKE_ARGS+=-DVIBEVOICE_BUILD_EXAMPLES=OFF
ifeq ($(NATIVE),false)
CMAKE_ARGS+=-DGGML_NATIVE=OFF
endif
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS+=-DGGML_CUDA=ON -DVIBEVOICE_GGML_CUDA=ON
else ifeq ($(BUILD_TYPE),openblas)
CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
else ifeq ($(BUILD_TYPE),clblas)
CMAKE_ARGS+=-DGGML_CLBLAST=ON -DCLBlast_DIR=/some/path
else ifeq ($(BUILD_TYPE),hipblas)
CMAKE_ARGS+=-DGGML_HIPBLAS=ON -DVIBEVOICE_GGML_HIPBLAS=ON
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON -DVIBEVOICE_GGML_VULKAN=ON
else ifeq ($(OS),Darwin)
ifneq ($(BUILD_TYPE),metal)
CMAKE_ARGS+=-DGGML_METAL=OFF
else
CMAKE_ARGS+=-DGGML_METAL=ON -DVIBEVOICE_GGML_METAL=ON
CMAKE_ARGS+=-DGGML_METAL_EMBED_LIBRARY=ON
endif
endif
ifeq ($(BUILD_TYPE),sycl_f16)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx \
-DGGML_SYCL_F16=ON
endif
ifeq ($(BUILD_TYPE),sycl_f32)
CMAKE_ARGS+=-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx
endif
sources/vibevoice.cpp:
mkdir -p sources/vibevoice.cpp
cd sources/vibevoice.cpp && \
git init && \
git remote add origin $(VIBEVOICE_REPO) && \
git fetch origin && \
git checkout $(VIBEVOICE_CPP_VERSION) && \
git submodule update --init --recursive --depth 1 --single-branch
# Detect OS
UNAME_S := $(shell uname -s)
# Only build CPU variants on Linux
ifeq ($(UNAME_S),Linux)
VARIANT_TARGETS = libgovibevoicecpp-avx.so libgovibevoicecpp-avx2.so libgovibevoicecpp-avx512.so libgovibevoicecpp-fallback.so
else
# On non-Linux (e.g., Darwin), build only fallback variant
VARIANT_TARGETS = libgovibevoicecpp-fallback.so
endif
vibevoice-cpp: main.go govibevoicecpp.go $(VARIANT_TARGETS)
CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o vibevoice-cpp ./
package: vibevoice-cpp
bash package.sh
build: package
clean: purge
rm -rf libgovibevoicecpp*.so package sources/vibevoice.cpp vibevoice-cpp
purge:
rm -rf build*
# Variants must build sequentially
.NOTPARALLEL:
# Build all variants (Linux only)
ifeq ($(UNAME_S),Linux)
libgovibevoicecpp-avx.so: sources/vibevoice.cpp
$(info ${GREEN}I vibevoice-cpp build info:avx${RESET})
SO_TARGET=libgovibevoicecpp-avx.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgovibevoicecpp-custom
rm -rf build-libgovibevoicecpp-avx.so
libgovibevoicecpp-avx2.so: sources/vibevoice.cpp
$(info ${GREEN}I vibevoice-cpp build info:avx2${RESET})
SO_TARGET=libgovibevoicecpp-avx2.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgovibevoicecpp-custom
rm -rf build-libgovibevoicecpp-avx2.so
libgovibevoicecpp-avx512.so: sources/vibevoice.cpp
$(info ${GREEN}I vibevoice-cpp build info:avx512${RESET})
SO_TARGET=libgovibevoicecpp-avx512.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on -DGGML_BMI2=on" $(MAKE) libgovibevoicecpp-custom
rm -rf build-libgovibevoicecpp-avx512.so
endif
# Build fallback variant (all platforms)
libgovibevoicecpp-fallback.so: sources/vibevoice.cpp
$(info ${GREEN}I vibevoice-cpp build info:fallback${RESET})
SO_TARGET=libgovibevoicecpp-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgovibevoicecpp-custom
rm -rf build-libgovibevoicecpp-fallback.so
libgovibevoicecpp-custom: CMakeLists.txt cpp/govibevoicecpp.cpp cpp/govibevoicecpp.h
mkdir -p build-$(SO_TARGET) && \
cd build-$(SO_TARGET) && \
cmake .. $(CMAKE_ARGS) && \
cmake --build . --config Release -j$(JOBS) --target govibevoicecpp && \
cd .. && \
mv build-$(SO_TARGET)/libgovibevoicecpp.so ./$(SO_TARGET)
test: vibevoice-cpp
@echo "Running vibevoice-cpp tests..."
bash test.sh
@echo "vibevoice-cpp tests completed."
all: vibevoice-cpp package


@@ -0,0 +1,41 @@
// vibevoice.cpp ships its purego-friendly ABI in vibevoice_capi.h.
// This translation unit is intentionally tiny: pulling in the header
// and linking libvibevoice whole-archive in CMake (see CMakeLists.txt)
// keeps the vv_capi_* symbols visible from the produced MODULE library.
//
// We do install a ggml log redirect so backend logs land on the gRPC
// server's stderr — same pattern as backend/go/qwen3-tts-cpp/cpp/.
#include "govibevoicecpp.h"
#include "ggml.h"
#include "ggml-backend.h"
#include <cstdio>
namespace {
void govibevoice_log_cb(enum ggml_log_level level, const char* msg, void* /*ud*/) {
if (!msg) return;
const char* tag = "?????";
switch (level) {
case GGML_LOG_LEVEL_DEBUG: tag = "DEBUG"; break;
case GGML_LOG_LEVEL_INFO: tag = "INFO"; break;
case GGML_LOG_LEVEL_WARN: tag = "WARN"; break;
case GGML_LOG_LEVEL_ERROR: tag = "ERROR"; break;
default: break;
}
std::fprintf(stderr, "[%-5s] %s", tag, msg);
std::fflush(stderr);
}
struct LogInstaller {
LogInstaller() {
ggml_log_set(govibevoice_log_cb, nullptr);
ggml_backend_load_all();
}
};
LogInstaller g_install;
} // namespace

@@ -0,0 +1,7 @@
#pragma once
// Re-exports the vibevoice.cpp flat C ABI so this MODULE library
// resolves the same symbols that purego.RegisterLibFunc looks up by
// name. The actual definitions live in libvibevoice (linked PRIVATE).
#include "vibevoice_capi.h"

@@ -0,0 +1,387 @@
package main
import (
"encoding/json"
"fmt"
"os"
"path/filepath"
"strings"
laudio "github.com/mudler/LocalAI/pkg/audio"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
)
// vibevoice.cpp synthesizes 24 kHz mono 16-bit PCM. Hardcoded - the
// model itself is fixed-rate; if the upstream ever changes this we'll
// pick it up via vv_capi_version().
const vibevoiceSampleRate = uint32(24000)
// purego-bound entry points from libgovibevoicecpp.
var (
CppLoad func(ttsModel, asrModel, tokenizer, voice string, threads int32) int32
CppTTS func(text, voicePath, dstWav string,
nSteps int32, cfgScale float32, maxSpeechFrames int32, seed uint32) int32
CppASR func(srcWav string, outJSON []byte, capacity uint64,
maxNewTokens int32) int32
CppUnload func()
CppVersion func() string
)
// VibevoiceCpp speaks gRPC against vibevoice.cpp's flat C ABI. The
// engine is a single global, so we serialize calls through SingleThread.
type VibevoiceCpp struct {
base.SingleThread
threads int
// modelRoot is the directory we use to resolve relative paths
// from Options[] and per-call overrides (TTSRequest.Voice).
// Source of truth: opts.ModelPath; falls back to the dir of
// the primary ModelFile when ModelPath is empty.
modelRoot string
ttsModel string
asrModel string
tokenizer string
voice string
}
// resolvePath joins a relative path onto `relTo`. The gallery
// convention is that Options[] carry paths relative to the LocalAI
// models dir (opts.ModelPath), so anything not absolute is resolved
// against that root - never CWD. Empty / already-absolute / no-relTo
// inputs pass through unchanged.
func resolvePath(p, relTo string) string {
if p == "" || filepath.IsAbs(p) || relTo == "" {
return p
}
return filepath.Join(relTo, p)
}
// parseOptions reads opts.Options[] and pulls out the per-role
// overrides documented in the gallery entries. Accepts both "key=value"
// (gallery YAML style) and "key:value" (Make-target / env-var style).
func (v *VibevoiceCpp) parseOptions(opts []string, relTo string) string {
role := ""
for _, raw := range opts {
k, val, ok := strings.Cut(raw, "=")
if !ok {
k, val, ok = strings.Cut(raw, ":")
if !ok {
continue
}
}
key := strings.TrimSpace(k)
val = strings.TrimSpace(val)
switch key {
case "type":
role = strings.ToLower(val)
case "tokenizer":
v.tokenizer = resolvePath(val, relTo)
case "voice":
v.voice = resolvePath(val, relTo)
case "tts_model":
v.ttsModel = resolvePath(val, relTo)
case "asr_model":
v.asrModel = resolvePath(val, relTo)
}
}
return role
}
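
Reviewer note: the option parsing above can be exercised in isolation. The sketch below is a minimal standalone approximation, not the backend's code; `splitOption` and `resolve` are hypothetical names, and `/data/models` is an assumed models root.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// splitOption (hypothetical helper) mirrors the two-step cut used by
// parseOptions: try "=" first, then fall back to ":". ok is false when
// neither separator is present.
func splitOption(raw string) (key, val string, ok bool) {
	k, v, found := strings.Cut(raw, "=")
	if !found {
		k, v, found = strings.Cut(raw, ":")
	}
	if !found {
		return "", "", false
	}
	return strings.TrimSpace(k), strings.TrimSpace(v), true
}

// resolve (hypothetical helper) joins relative paths onto root; empty
// or absolute inputs, or an empty root, pass through unchanged.
func resolve(p, root string) string {
	if p == "" || filepath.IsAbs(p) || root == "" {
		return p
	}
	return filepath.Join(root, p)
}

func main() {
	const root = "/data/models" // assumed models root for illustration
	opts := []string{"tokenizer=tokenizer.gguf", "type:asr", "voice=/abs/voice.gguf", "malformed"}
	for _, raw := range opts {
		key, val, ok := splitOption(raw)
		if !ok {
			fmt.Printf("%q skipped\n", raw)
			continue
		}
		if key != "type" {
			val = resolve(val, root)
		}
		fmt.Printf("%s -> %s\n", key, val)
	}
}
```

One consequence of cutting on "=" first: a colon-style option whose value contains "=" (for example `voice:/tmp/a=b.gguf`) is split at the "=" rather than the ":".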
func (v *VibevoiceCpp) Load(opts *pb.ModelOptions) error {
if opts.ModelFile == "" {
return fmt.Errorf("vibevoice-cpp: ModelFile is required")
}
modelFile := opts.ModelFile
if !filepath.IsAbs(modelFile) && opts.ModelPath != "" {
modelFile = filepath.Join(opts.ModelPath, modelFile)
}
// ModelPath is the LocalAI core's models root, propagated over
// gRPC. Use it as the resolution base for Options[] (and later
// for TTSRequest.Voice) so gallery entries can reference paths
// like "tokenizer=tokenizer.gguf" and have them resolved
// against the same root the core used to drop the files.
v.modelRoot = opts.ModelPath
if v.modelRoot == "" {
v.modelRoot = filepath.Dir(modelFile)
}
role := v.parseOptions(opts.Options, v.modelRoot)
// ModelFile fills the "primary" role-slot determined by `type=`
// in Options (defaults to tts). The other slot stays exactly as
// Options set it - so a closed-loop config with ModelFile=tts.gguf
// + Options[asr_model=asr.gguf] resolves correctly to both slots,
// and an explicit `tts_model=` / `asr_model=` always wins over
// ModelFile for its own slot.
primaryIsASR := false
switch role {
case "asr", "transcript", "stt", "speech-to-text":
primaryIsASR = true
}
if primaryIsASR {
if v.asrModel == "" {
v.asrModel = modelFile
}
} else if v.ttsModel == "" {
v.ttsModel = modelFile
}
if v.ttsModel == "" && v.asrModel == "" {
return fmt.Errorf("vibevoice-cpp: no TTS or ASR model resolved from ModelFile=%q + options", opts.ModelFile)
}
if v.tokenizer == "" {
return fmt.Errorf("vibevoice-cpp: tokenizer is required - pass options: [tokenizer=<path>]")
}
threads := int(opts.Threads)
if threads <= 0 {
threads = 4
}
v.threads = threads
fmt.Fprintf(os.Stderr,
"[vibevoice-cpp] Loading: tts=%q asr=%q tokenizer=%q voice=%q threads=%d\n",
v.ttsModel, v.asrModel, v.tokenizer, v.voice, threads)
if rc := CppLoad(v.ttsModel, v.asrModel, v.tokenizer, v.voice, int32(threads)); rc != 0 {
return fmt.Errorf("vibevoice-cpp: vv_capi_load failed (rc=%d)", rc)
}
return nil
}
func (v *VibevoiceCpp) TTS(req *pb.TTSRequest) error {
if v.ttsModel == "" {
return fmt.Errorf("vibevoice-cpp: TTS requested but no realtime model was loaded")
}
text := req.Text
dst := req.Dst
if text == "" || dst == "" {
return fmt.Errorf("vibevoice-cpp: TTS requires both text and dst")
}
// req.Voice may be a bare filename (e.g. "voice-en-Emma.gguf") or an
// absolute path. Resolve via the same modelRoot Load() used for
// Options[] so a swap-voice request mirrors the gallery's layout.
voice := resolvePath(req.Voice, v.modelRoot)
if req.Language != nil && *req.Language != "" {
fmt.Fprintf(os.Stderr,
"[vibevoice-cpp] note: TTSRequest.language=%q ignored - vibevoice picks language from the voice prompt\n",
*req.Language)
}
const (
defaultSteps = 20
defaultMaxFrames = 200
)
defaultCfg := float32(1.3)
if rc := CppTTS(text, voice, dst,
int32(defaultSteps), defaultCfg, int32(defaultMaxFrames), 0); rc != 0 {
return fmt.Errorf("vibevoice-cpp: vv_capi_tts failed (rc=%d)", rc)
}
return nil
}
// asrSegment matches vibevoice's JSON output:
//
// [{"Start":0.0,"End":2.8,"Speaker":0,"Content":"…"}, ...]
type asrSegment struct {
Start float64 `json:"Start"`
End float64 `json:"End"`
Speaker int `json:"Speaker"`
Content string `json:"Content"`
}
// callASR invokes vv_capi_asr with a buffer that grows on demand.
// vv_capi_asr returns: >0 bytes written, 0 no transcript, <0 error or
// -required_size. We honor the resize protocol once before giving up.
func (v *VibevoiceCpp) callASR(srcWav string, maxNewTokens int32) (string, error) {
const startCap = 256 * 1024
buf := make([]byte, startCap)
rc := CppASR(srcWav, buf, uint64(len(buf)), maxNewTokens)
if rc < 0 {
need := -int(rc)
if need > 0 && need < (16<<20) && need > len(buf) {
buf = make([]byte, need+64)
rc = CppASR(srcWav, buf, uint64(len(buf)), maxNewTokens)
}
}
if rc < 0 {
return "", fmt.Errorf("vibevoice-cpp: vv_capi_asr failed (rc=%d)", rc)
}
if rc == 0 {
return "", nil
}
return string(buf[:rc]), nil
}
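
Reviewer note: the grow-once buffer protocol in callASR can be illustrated without the native library. In this sketch `fakeASR` is a hypothetical stand-in for vv_capi_asr (it only copies a payload), and `callWithResize` and `maxGrow` are illustrative names; the 16 MiB ceiling is taken from the code above.

```go
package main

import "fmt"

// fakeASR is a hypothetical stand-in for the native call: it copies
// payload into out and returns the bytes written, or -len(payload)
// when out is too small to hold it.
func fakeASR(payload string, out []byte) int32 {
	if len(payload) > len(out) {
		return -int32(len(payload))
	}
	return int32(copy(out, payload))
}

// maxGrow caps the single retry, matching the 16 MiB bound above.
const maxGrow = 16 << 20

// callWithResize calls once, and retries exactly once when the negative
// return value looks like a size request: larger than the current
// buffer and below maxGrow. Small negative values are treated as
// plain errors and are not retried.
func callWithResize(payload string, startCap int) (string, error) {
	buf := make([]byte, startCap)
	rc := fakeASR(payload, buf)
	if rc < 0 {
		need := -int(rc)
		if need > len(buf) && need < maxGrow {
			buf = make([]byte, need)
			rc = fakeASR(payload, buf)
		}
	}
	if rc < 0 {
		return "", fmt.Errorf("call failed (rc=%d)", rc)
	}
	return string(buf[:rc]), nil
}

func main() {
	out, err := callWithResize("hello world", 4)
	fmt.Println(out, err)
}
```

Because error codes and size requests share the negative range, the `need > len(buf)` check is what distinguishes them; that only works when the starting buffer is larger than the magnitude of any real error code.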
// TTSStream is the streaming counterpart to TTS. vibevoice's C ABI is
// file-only (vv_capi_tts writes a complete WAV), so we synthesize to
// a tempfile, then emit a streaming-WAV header followed by the PCM
// body in chunks. The main reason this exists at all is the gRPC
// server wrapper (pkg/grpc/server.go:TTSStream) blocks on a channel
// that only this method can close - if we leave the default Base
// stub in place, every TTSStream call hangs until the client
// deadline.
func (v *VibevoiceCpp) TTSStream(req *pb.TTSRequest, results chan []byte) error {
defer close(results)
if v.ttsModel == "" {
return fmt.Errorf("vibevoice-cpp: TTSStream requested but no realtime model was loaded")
}
if req.Text == "" {
return fmt.Errorf("vibevoice-cpp: TTSStream requires text")
}
tmp, err := os.CreateTemp("", "vibevoice-cpp-stream-*.wav")
if err != nil {
return fmt.Errorf("vibevoice-cpp: tempfile: %w", err)
}
dst := tmp.Name()
_ = tmp.Close()
defer func() { _ = os.Remove(dst) }()
if err := v.TTS(&pb.TTSRequest{
Text: req.Text,
Voice: req.Voice,
Dst: dst,
Language: req.Language,
}); err != nil {
return err
}
wav, err := os.ReadFile(dst)
if err != nil {
return fmt.Errorf("vibevoice-cpp: read tempfile: %w", err)
}
// Streaming WAV header: declare 0xFFFFFFFF for chunk sizes so HTTP
// clients can start playback before they see the full PCM.
const streamingSize = 0xFFFFFFFF
hdr := laudio.NewWAVHeaderWithRate(streamingSize, vibevoiceSampleRate)
hdr.ChunkSize = streamingSize
hdrBuf := make([]byte, 0, laudio.WAVHeaderSize)
w := newByteWriter(&hdrBuf)
if err := hdr.Write(w); err != nil {
return fmt.Errorf("vibevoice-cpp: write WAV header: %w", err)
}
results <- hdrBuf
// PCM body: send in ~64 KB slices so the client gets multiple
// reply chunks (e2e harness asserts >=2 frames).
pcm := laudio.StripWAVHeader(wav)
const chunkBytes = 64 * 1024
for off := 0; off < len(pcm); off += chunkBytes {
end := off + chunkBytes
if end > len(pcm) {
end = len(pcm)
}
chunk := make([]byte, end-off)
copy(chunk, pcm[off:end])
results <- chunk
}
return nil
}
// byteWriter adapts a *[]byte to io.Writer so we can hand it to
// laudio.WAVHeader.Write without allocating a bytes.Buffer.
type byteWriter struct{ buf *[]byte }
func newByteWriter(b *[]byte) *byteWriter { return &byteWriter{buf: b} }
func (w *byteWriter) Write(p []byte) (int, error) {
*w.buf = append(*w.buf, p...)
return len(p), nil
}
func (v *VibevoiceCpp) AudioTranscription(req *pb.TranscriptRequest) (pb.TranscriptResult, error) {
if v.asrModel == "" {
return pb.TranscriptResult{}, fmt.Errorf("vibevoice-cpp: AudioTranscription requested but no ASR model was loaded")
}
if req.Dst == "" {
return pb.TranscriptResult{}, fmt.Errorf("vibevoice-cpp: TranscriptRequest.dst (audio path) is required")
}
out, err := v.callASR(req.Dst, 0)
if err != nil {
return pb.TranscriptResult{}, err
}
if out == "" {
return pb.TranscriptResult{}, nil
}
var segs []asrSegment
if err := json.Unmarshal([]byte(out), &segs); err != nil {
fmt.Fprintf(os.Stderr,
"[vibevoice-cpp] WARNING: vv_capi_asr returned non-JSON, falling back to single segment: %v\n", err)
return pb.TranscriptResult{
Segments: []*pb.TranscriptSegment{{Id: 0, Text: strings.TrimSpace(out)}},
Text: strings.TrimSpace(out),
}, nil
}
segments := make([]*pb.TranscriptSegment, 0, len(segs))
parts := make([]string, 0, len(segs))
var duration float32
for i, s := range segs {
// LocalAI's whisper backend uses int64 100ns ticks for
// Start/End (seconds * 1e7); follow the same convention so
// consumers can mix vibevoice and whisper transcripts.
segments = append(segments, &pb.TranscriptSegment{
Id: int32(i),
Text: s.Content,
Start: int64(s.Start * 1e7),
End: int64(s.End * 1e7),
Speaker: fmt.Sprintf("%d", s.Speaker),
})
parts = append(parts, strings.TrimSpace(s.Content))
if float32(s.End) > duration {
duration = float32(s.End)
}
}
return pb.TranscriptResult{
Segments: segments,
Text: strings.TrimSpace(strings.Join(parts, " ")),
Duration: duration,
}, nil
}
// AudioTranscriptionStream wraps AudioTranscription so the streaming
// gRPC endpoint (server.go:AudioTranscriptionStream) sees its channel
// close and the client doesn't sit waiting until deadline. vibevoice's
// ASR doesn't expose token-level streaming - vv_capi_asr decodes the
// whole audio and returns a JSON segment list - so we run the offline
// transcription, emit each segment's content as a delta, then close
// with a final_result whose Text equals the concatenated deltas (the
// e2e harness asserts those match).
func (v *VibevoiceCpp) AudioTranscriptionStream(req *pb.TranscriptRequest, results chan *pb.TranscriptStreamResponse) error {
defer close(results)
res, err := v.AudioTranscription(req)
if err != nil {
return err
}
var assembled strings.Builder
for _, seg := range res.Segments {
if seg == nil {
continue
}
txt := strings.TrimSpace(seg.Text)
if txt == "" {
continue
}
delta := txt
if assembled.Len() > 0 {
delta = " " + txt
}
results <- &pb.TranscriptStreamResponse{Delta: delta}
assembled.WriteString(delta)
}
final := pb.TranscriptResult{
Segments: res.Segments,
Duration: res.Duration,
Language: res.Language,
Text: assembled.String(),
}
results <- &pb.TranscriptStreamResponse{FinalResult: &final}
return nil
}
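
Reviewer note: the invariant AudioTranscriptionStream relies on (the concatenated deltas equal the final Text) can be checked on plain strings. This is a minimal sketch under that assumption; `assembleDeltas` is a hypothetical helper that works on segment texts instead of pb.TranscriptSegment values.

```go
package main

import (
	"fmt"
	"strings"
)

// assembleDeltas (hypothetical helper) trims each segment, skips empty
// ones, and prefixes every delta after the first with a single space.
// The returned final string is built from the deltas themselves, so
// joining the deltas always reproduces it.
func assembleDeltas(segments []string) (deltas []string, final string) {
	var assembled strings.Builder
	for _, s := range segments {
		txt := strings.TrimSpace(s)
		if txt == "" {
			continue
		}
		delta := txt
		if assembled.Len() > 0 {
			delta = " " + txt
		}
		deltas = append(deltas, delta)
		assembled.WriteString(delta)
	}
	return deltas, assembled.String()
}

func main() {
	deltas, final := assembleDeltas([]string{" Hello ", "", "world."})
	fmt.Printf("%q %q\n", deltas, final)
	fmt.Println(strings.Join(deltas, "") == final)
}
```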

@@ -0,0 +1,49 @@
package main
// Started internally by LocalAI - one gRPC server per loaded model.
import (
"flag"
"os"
"github.com/ebitengine/purego"
grpc "github.com/mudler/LocalAI/pkg/grpc"
)
var (
addr = flag.String("addr", "localhost:50051", "the address to listen on")
)
type LibFuncs struct {
FuncPtr any
Name string
}
func main() {
libName := os.Getenv("VIBEVOICECPP_LIBRARY")
if libName == "" {
libName = "./libgovibevoicecpp-fallback.so"
}
lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
panic(err)
}
libFuncs := []LibFuncs{
{&CppLoad, "vv_capi_load"},
{&CppTTS, "vv_capi_tts"},
{&CppASR, "vv_capi_asr"},
{&CppUnload, "vv_capi_unload"},
{&CppVersion, "vv_capi_version"},
}
for _, lf := range libFuncs {
purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
}
flag.Parse()
if err := grpc.StartServer(*addr, &VibevoiceCpp{}); err != nil {
panic(err)
}
}

@@ -0,0 +1,58 @@
#!/bin/bash
# Bundle the vibevoice-cpp binary, the per-variant .so files, and the
# runtime libs the binary depends on so the package is self-contained.
# Mirrors backend/go/qwen3-tts-cpp/package.sh.
set -e
CURDIR=$(dirname "$(realpath $0)")
REPO_ROOT="${CURDIR}/../../.."
mkdir -p $CURDIR/package/lib
cp -avf $CURDIR/vibevoice-cpp $CURDIR/package/
cp -fv $CURDIR/libgovibevoicecpp-*.so $CURDIR/package/
cp -fv $CURDIR/run.sh $CURDIR/package/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ $(uname -s) = "Darwin" ]; then
echo "Detected Darwin"
else
echo "Error: Could not detect architecture"
exit 1
fi
# Package GPU libraries based on BUILD_TYPE
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/

backend/go/vibevoice-cpp/run.sh Executable file

@@ -0,0 +1,49 @@
#!/bin/bash
set -ex
CURDIR=$(dirname "$(realpath $0)")
cd /
echo "CPU info:"
if [ "$(uname)" != "Darwin" ]; then
grep -e "model\sname" /proc/cpuinfo | head -1
grep -e "flags" /proc/cpuinfo | head -1
fi
LIBRARY="$CURDIR/libgovibevoicecpp-fallback.so"
if [ "$(uname)" != "Darwin" ]; then
if grep -q -e "\savx\s" /proc/cpuinfo ; then
echo "CPU: AVX found OK"
if [ -e $CURDIR/libgovibevoicecpp-avx.so ]; then
LIBRARY="$CURDIR/libgovibevoicecpp-avx.so"
fi
fi
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 found OK"
if [ -e $CURDIR/libgovibevoicecpp-avx2.so ]; then
LIBRARY="$CURDIR/libgovibevoicecpp-avx2.so"
fi
fi
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
echo "CPU: AVX512F found OK"
if [ -e $CURDIR/libgovibevoicecpp-avx512.so ]; then
LIBRARY="$CURDIR/libgovibevoicecpp-avx512.so"
fi
fi
fi
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
export VIBEVOICECPP_LIBRARY=$LIBRARY
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
echo "Using library: $LIBRARY"
exec $CURDIR/lib/ld.so $CURDIR/vibevoice-cpp "$@"
fi
echo "Using library: $LIBRARY"
exec $CURDIR/vibevoice-cpp "$@"
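
Reviewer note: the selection order in run.sh (fallback first, then avx, avx2, avx512f, with the last match winning only if its .so is present) might be easier to review as a pure function. The sketch below is an approximation in Go; `pickVariant` and the `available` callback are hypothetical, and it takes a flags string where run.sh greps /proc/cpuinfo directly.

```go
package main

import (
	"fmt"
	"strings"
)

// pickVariant (hypothetical helper) walks the candidates from least to
// most capable and keeps the last one whose CPU flag is present and
// whose library is reported as available. With no match it returns
// the fallback build.
func pickVariant(cpuFlags string, available func(lib string) bool) string {
	flags := make(map[string]bool)
	for _, f := range strings.Fields(cpuFlags) {
		flags[f] = true
	}
	choice := "libgovibevoicecpp-fallback.so"
	candidates := []struct{ flag, lib string }{
		{"avx", "libgovibevoicecpp-avx.so"},
		{"avx2", "libgovibevoicecpp-avx2.so"},
		{"avx512f", "libgovibevoicecpp-avx512.so"},
	}
	for _, c := range candidates {
		if flags[c.flag] && available(c.lib) {
			choice = c.lib
		}
	}
	return choice
}

func main() {
	all := func(string) bool { return true }
	fmt.Println(pickVariant("fpu sse avx avx2", all))
}
```

As in the script, a missing variant file does not reset the choice: a CPU with avx512f but only the avx2 library on disk still ends up on the avx2 build.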

@@ -0,0 +1,74 @@
#!/bin/bash
set -e
CURDIR=$(dirname "$(realpath $0)")
echo "Running vibevoice-cpp backend tests..."
# Required env-vars (set automatically when missing):
# VIBEVOICE_MODEL_DIR : directory containing the gguf bundle.
# VIBEVOICE_BINARY : path to the built backend (default ./vibevoice-cpp)
#
# Tests skip when the model bundle is absent and the auto-download
# fails (e.g. no network on the runner) so local devs without HF access
# still get green compile output.
cd "$CURDIR"
if [ -z "$VIBEVOICE_MODEL_DIR" ]; then
export VIBEVOICE_MODEL_DIR="./vibevoice-models"
if [ ! -d "$VIBEVOICE_MODEL_DIR" ]; then
echo "Creating vibevoice-models directory for tests..."
mkdir -p "$VIBEVOICE_MODEL_DIR"
REPO_ID="mudler/vibevoice.cpp-models"
echo "Repository: ${REPO_ID}"
# Q4_K instead of Q8_0 for the ASR model: smaller download
# (10 GB vs 14 GB), fits on ubuntu-latest's free disk after the
# runner image is loaded. The unit/closed-loop test only needs
# decode quality, not Q8_0 precision.
FILES=(
"vibevoice-realtime-0.5B-q8_0.gguf"
"vibevoice-asr-q4_k.gguf"
"tokenizer.gguf"
"voice-en-Carter_man.gguf"
)
BASE_URL="https://huggingface.co/${REPO_ID}/resolve/main"
download_ok=1
for file in "${FILES[@]}"; do
dest="${VIBEVOICE_MODEL_DIR}/${file}"
if [ -f "${dest}" ]; then
echo " [skip] ${file} (already exists)"
else
echo " [download] ${file}..."
if ! curl -fL -o "${dest}" "${BASE_URL}/${file}" --progress-bar; then
echo " [warn] failed to download ${file} - network or HF unavailable"
rm -f "${dest}"
download_ok=0
break
fi
echo " [done] ${file}"
fi
done
if [ "$download_ok" != "1" ]; then
echo "vibevoice-cpp: model bundle unavailable - tests will skip model-dependent cases."
unset VIBEVOICE_MODEL_DIR
fi
fi
fi
# Ensure the per-variant .so the binary will dlopen actually exists -
# without one, every test will hit a Dlopen panic during server start.
if [ ! -f "${CURDIR}/libgovibevoicecpp-fallback.so" ]; then
echo "vibevoice-cpp: libgovibevoicecpp-fallback.so missing - run \`make\` first."
exit 1
fi
go test -v -timeout 900s .
echo "All vibevoice-cpp tests passed."

@@ -0,0 +1,382 @@
package main
import (
"context"
"os"
"os/exec"
"path/filepath"
"regexp"
"strings"
"testing"
"time"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
)
const (
testAddr = "localhost:50098"
startupWait = 5 * time.Second
)
func TestVibevoiceCpp(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "VibeVoice-cpp Backend Suite")
}
// modelDirOrSkip returns the staged model bundle dir, or Skip()s the
// current spec when VIBEVOICE_MODEL_DIR is unset / lacks the gguf
// files we need. Tests that don't depend on a model (Locking, error
// paths) don't call this.
func modelDirOrSkip() string {
dir := os.Getenv("VIBEVOICE_MODEL_DIR")
if dir == "" {
Skip("VIBEVOICE_MODEL_DIR not set, skipping model-dependent specs")
}
if _, err := os.Stat(filepath.Join(dir, "tokenizer.gguf")); os.IsNotExist(err) {
Skip("tokenizer.gguf missing in " + dir)
}
tts, _ := filepath.Glob(filepath.Join(dir, "vibevoice-realtime-*.gguf"))
asr, _ := filepath.Glob(filepath.Join(dir, "vibevoice-asr-*.gguf"))
if len(tts) == 0 && len(asr) == 0 {
Skip("neither realtime TTS nor ASR gguf found in " + dir)
}
return dir
}
// startServer launches the prebuilt backend binary and returns a
// running *exec.Cmd. The Makefile's `test` target builds
// `./vibevoice-cpp` before test.sh runs; if the binary is missing,
// every gRPC spec is skipped with a clear reason.
func startServer() *exec.Cmd {
binary := os.Getenv("VIBEVOICE_BINARY")
if binary == "" {
binary = "./vibevoice-cpp"
}
if _, err := os.Stat(binary); os.IsNotExist(err) {
Skip("backend binary not found at " + binary)
}
cmd := exec.Command(binary, "--addr", testAddr)
cmd.Stdout = os.Stderr
cmd.Stderr = os.Stderr
Expect(cmd.Start()).To(Succeed())
time.Sleep(startupWait)
return cmd
}
func stopServer(cmd *exec.Cmd) {
if cmd == nil || cmd.Process == nil {
return
}
_ = cmd.Process.Kill()
_, _ = cmd.Process.Wait()
}
func dialGRPC() *grpc.ClientConn {
conn, err := grpc.Dial(testAddr,
grpc.WithTransportCredentials(insecure.NewCredentials()),
grpc.WithDefaultCallOptions(
grpc.MaxCallRecvMsgSize(50*1024*1024),
grpc.MaxCallSendMsgSize(50*1024*1024),
),
)
Expect(err).ToNot(HaveOccurred())
return conn
}
var _ = Describe("VibeVoice-cpp", func() {
Context("backend semantics (no purego load needed)", func() {
It("is locking - the engine has process-global state", func() {
Expect((&VibevoiceCpp{}).Locking()).To(BeTrue())
})
It("rejects Load with empty ModelFile", func() {
err := (&VibevoiceCpp{}).Load(&pb.ModelOptions{})
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("ModelFile"))
})
It("rejects TTS without a loaded TTS model", func() {
err := (&VibevoiceCpp{}).TTS(&pb.TTSRequest{
Text: "no model loaded",
Dst: "/tmp/should-not-be-written.wav",
})
Expect(err).To(HaveOccurred())
})
It("rejects AudioTranscription without a loaded ASR model", func() {
_, err := (&VibevoiceCpp{}).AudioTranscription(&pb.TranscriptRequest{
Dst: "/tmp/some.wav",
})
Expect(err).To(HaveOccurred())
})
It("closes the channel and errors on TTSStream without a loaded model", func() {
ch := make(chan []byte, 4)
err := (&VibevoiceCpp{}).TTSStream(&pb.TTSRequest{
Text: "no model loaded",
Dst: "/tmp/should-not-be-written.wav",
}, ch)
Expect(err).To(HaveOccurred())
// The server hangs forever if the channel stays open; this guard
// catches regressions of the e2e DeadlineExceeded we're fixing.
_, ok := <-ch
Expect(ok).To(BeFalse(), "TTSStream must close results channel even on error")
})
// parseOptions + slot fill is the source of the closed-loop CI
// regression where ModelFile=tts.gguf + Options[asr_model=...]
// resulted in a load with empty tts slot. These specs assert
// the slot resolution before we ever call into purego.
Describe("ModelFile slot resolution", func() {
It("fills tts slot from ModelFile when only asr_model is in Options", func() {
v := &VibevoiceCpp{}
v.modelRoot = "/abs/root"
role := v.parseOptions([]string{"asr_model=/abs/root/asr.gguf", "tokenizer=/abs/root/tokenizer.gguf"}, v.modelRoot)
Expect(v.asrModel).To(Equal("/abs/root/asr.gguf"))
Expect(v.ttsModel).To(BeEmpty())
Expect(role).To(BeEmpty())
// Mirror the Load() default-fill block:
if v.ttsModel == "" {
v.ttsModel = "/abs/root/tts.gguf"
}
Expect(v.ttsModel).To(Equal("/abs/root/tts.gguf"))
Expect(v.asrModel).To(Equal("/abs/root/asr.gguf"))
})
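It("fills asr slot from ModelFile when type=asr is set", func() {
v := &VibevoiceCpp{}
v.modelRoot = "/abs/root"
role := v.parseOptions([]string{"type=asr", "tokenizer=/abs/root/tokenizer.gguf"}, v.modelRoot)
Expect(role).To(Equal("asr"))
Expect(v.asrModel).To(BeEmpty())
Expect(v.ttsModel).To(BeEmpty())
// Mirror the Load() default-fill block for the ASR role:
if v.asrModel == "" {
v.asrModel = "/abs/root/asr.gguf"
}
Expect(v.asrModel).To(Equal("/abs/root/asr.gguf"))
Expect(v.ttsModel).To(BeEmpty())
})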
It("respects explicit tts_model override over ModelFile", func() {
v := &VibevoiceCpp{}
v.modelRoot = "/abs/root"
_ = v.parseOptions([]string{"tts_model=/abs/root/alt.gguf"}, v.modelRoot)
Expect(v.ttsModel).To(Equal("/abs/root/alt.gguf"))
})
It("accepts colon-separated options too", func() {
v := &VibevoiceCpp{}
v.modelRoot = "/abs/root"
role := v.parseOptions([]string{"type:asr", "tokenizer:/abs/root/tokenizer.gguf"}, v.modelRoot)
Expect(role).To(Equal("asr"))
Expect(v.tokenizer).To(Equal("/abs/root/tokenizer.gguf"))
})
})
// The gallery flow puts everything under <models_dir>/<entry>/,
// and parameters/options carry paths *relative* to <models_dir>.
// LocalAI core fills opts.ModelPath = <models_dir>; the backend
// must resolve every relative path against that root, never CWD.
Describe("resolvePath (relative-to-modelRoot)", func() {
It("joins relative path onto relTo", func() {
Expect(resolvePath("vibevoice-cpp/tokenizer.gguf", "/data/models")).
To(Equal("/data/models/vibevoice-cpp/tokenizer.gguf"))
})
It("passes absolute paths through unchanged", func() {
Expect(resolvePath("/abs/somewhere/tokenizer.gguf", "/data/models")).
To(Equal("/abs/somewhere/tokenizer.gguf"))
})
It("returns input unchanged when relTo is empty", func() {
Expect(resolvePath("vibevoice-cpp/tokenizer.gguf", "")).
To(Equal("vibevoice-cpp/tokenizer.gguf"))
})
It("returns empty input unchanged", func() {
Expect(resolvePath("", "/data/models")).To(BeEmpty())
})
It("does not consult CWD - bare filenames stay relative to modelRoot", func() {
// Even if the test runs in a directory containing a
// file with this name, the lookup must not fall back
// to CWD. This is the trap the production gallery flow
// would otherwise hit when LocalAI is launched from a
// directory that happens to contain a same-named file.
prev, _ := os.Getwd()
DeferCleanup(func() { _ = os.Chdir(prev) })
tmpCWD, err := os.MkdirTemp("", "vv-cwd-*")
Expect(err).ToNot(HaveOccurred())
DeferCleanup(func() { _ = os.RemoveAll(tmpCWD) })
Expect(os.WriteFile(filepath.Join(tmpCWD, "tokenizer.gguf"),
[]byte("not the real one"), 0o644)).To(Succeed())
Expect(os.Chdir(tmpCWD)).To(Succeed())
got := resolvePath("tokenizer.gguf", "/data/models")
Expect(got).To(Equal("/data/models/tokenizer.gguf"))
})
})
// Round-trip the gallery layout: relative paths in Options +
// an absolute ModelFile (as LocalAI core delivers them) end
// up resolved correctly inside the backend struct.
It("Load resolves relative Options paths against opts.ModelPath", func() {
tmpDir, err := os.MkdirTemp("", "vv-relpath-*")
Expect(err).ToNot(HaveOccurred())
DeferCleanup(func() { _ = os.RemoveAll(tmpDir) })
// Lay out the bundle exactly as the gallery would after install:
// <modelpath>/vibevoice-cpp/{tts,tokenizer,voice}.gguf
subDir := filepath.Join(tmpDir, "vibevoice-cpp")
Expect(os.MkdirAll(subDir, 0o755)).To(Succeed())
tts := filepath.Join(subDir, "vibevoice-realtime-stub.gguf")
tok := filepath.Join(subDir, "tokenizer.gguf")
voice := filepath.Join(subDir, "voice.gguf")
for _, p := range []string{tts, tok, voice} {
Expect(os.WriteFile(p, []byte("stub"), 0o644)).To(Succeed())
}
// Mirror Load()'s pre-purego prefix: parse + slot fill.
v := &VibevoiceCpp{}
modelFile := tts // core delivers this as an abspath already
v.modelRoot = tmpDir
role := v.parseOptions([]string{
"tokenizer=vibevoice-cpp/tokenizer.gguf",
"voice=vibevoice-cpp/voice.gguf",
}, v.modelRoot)
Expect(role).To(BeEmpty())
if v.ttsModel == "" {
v.ttsModel = modelFile
}
Expect(v.ttsModel).To(Equal(tts))
Expect(v.tokenizer).To(Equal(tok))
Expect(v.voice).To(Equal(voice))
Expect(v.asrModel).To(BeEmpty())
})
It("closes the channel and errors on AudioTranscriptionStream without a loaded model", func() {
ch := make(chan *pb.TranscriptStreamResponse, 4)
err := (&VibevoiceCpp{}).AudioTranscriptionStream(&pb.TranscriptRequest{
Dst: "/tmp/some.wav",
}, ch)
Expect(err).To(HaveOccurred())
_, ok := <-ch
Expect(ok).To(BeFalse(), "AudioTranscriptionStream must close results channel even on error")
})
})
Context("gRPC server lifecycle", func() {
var cmd *exec.Cmd
AfterEach(func() {
stopServer(cmd)
cmd = nil
})
It("answers Health checks", func() {
cmd = startServer()
conn := dialGRPC()
defer func() { _ = conn.Close() }()
resp, err := pb.NewBackendClient(conn).Health(context.Background(), &pb.HealthMessage{})
Expect(err).ToNot(HaveOccurred())
Expect(string(resp.Message)).To(Equal("OK"))
})
It("loads the realtime TTS model", func() {
dir := modelDirOrSkip()
tts, _ := filepath.Glob(filepath.Join(dir, "vibevoice-realtime-*.gguf"))
if len(tts) == 0 {
Skip("realtime TTS gguf missing")
}
cmd = startServer()
conn := dialGRPC()
defer func() { _ = conn.Close() }()
// Mirror the gallery contract: ModelFile is whatever LocalAI
// core hands us; ModelPath is the models root; Options[]
// carry paths relative to ModelPath.
resp, err := pb.NewBackendClient(conn).LoadModel(context.Background(), &pb.ModelOptions{
ModelFile: filepath.Base(tts[0]),
ModelPath: dir,
Threads: 4,
Options: []string{"tokenizer=tokenizer.gguf"},
})
Expect(err).ToNot(HaveOccurred())
Expect(resp.Success).To(BeTrue(), "LoadModel msg=%q", resp.Message)
})
It("runs a closed-loop TTS -> ASR with >=80% word recall", func() {
dir := modelDirOrSkip()
tts, _ := filepath.Glob(filepath.Join(dir, "vibevoice-realtime-*.gguf"))
asr, _ := filepath.Glob(filepath.Join(dir, "vibevoice-asr-*.gguf"))
if len(tts) == 0 || len(asr) == 0 {
Skip("closed-loop needs both realtime TTS and ASR ggufs")
}
tmpDir, err := os.MkdirTemp("", "vibevoice-cpp-closedloop-*")
Expect(err).ToNot(HaveOccurred())
DeferCleanup(func() { _ = os.RemoveAll(tmpDir) })
wav := filepath.Join(tmpDir, "say.wav")
cmd = startServer()
conn := dialGRPC()
defer func() { _ = conn.Close() }()
client := pb.NewBackendClient(conn)
// Gallery convention: ModelPath is the models root, every
// path inside Options[] is relative to it.
voiceMatches, _ := filepath.Glob(filepath.Join(dir, "voice-*.gguf"))
loadOpts := &pb.ModelOptions{
ModelFile: filepath.Base(tts[0]),
ModelPath: dir,
Threads: 4,
Options: []string{
"asr_model=" + filepath.Base(asr[0]),
"tokenizer=tokenizer.gguf",
},
}
if len(voiceMatches) > 0 {
loadOpts.Options = append(loadOpts.Options, "voice="+filepath.Base(voiceMatches[0]))
}
loadResp, err := client.LoadModel(context.Background(), loadOpts)
Expect(err).ToNot(HaveOccurred())
Expect(loadResp.Success).To(BeTrue(), "LoadModel msg=%q", loadResp.Message)
srcText := "Hello world this is a test of the synthesis system."
_, err = client.TTS(context.Background(), &pb.TTSRequest{
Text: srcText,
Dst: wav,
})
Expect(err).ToNot(HaveOccurred())
info, err := os.Stat(wav)
Expect(err).ToNot(HaveOccurred())
Expect(info.Size()).To(BeNumerically(">=", 1000),
"TTS produced suspiciously small wav (%d bytes)", info.Size())
resp, err := client.AudioTranscription(context.Background(), &pb.TranscriptRequest{
Dst: wav,
})
Expect(err).ToNot(HaveOccurred())
got := strings.ToLower(resp.Text)
GinkgoWriter.Printf("source : %s\n", srcText)
GinkgoWriter.Printf("transcribed: %s\n", got)
wordRE := regexp.MustCompile(`[a-z]+`)
srcWords := wordRE.FindAllString(strings.ToLower(srcText), -1)
Expect(srcWords).ToNot(BeEmpty())
hits := 0
for _, w := range srcWords {
if strings.Contains(got, w) {
hits++
}
}
recall := float64(hits) / float64(len(srcWords))
GinkgoWriter.Printf("recall: %d/%d = %.2f%%\n", hits, len(srcWords), recall*100)
Expect(recall).To(BeNumerically(">=", 0.80),
"closed-loop recall too low: %d/%d = %.2f%%",
hits, len(srcWords), recall*100)
})
})
})
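
Reviewer note: the recall metric used by the closed-loop spec is small enough to check by hand. The sketch below restates it as a standalone function; `wordRecall` is a hypothetical name and the zero-word guard is an addition for illustration (the spec asserts the word list is non-empty instead).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wordRE extracts lowercase ASCII words, as in the spec above.
var wordRE = regexp.MustCompile(`[a-z]+`)

// wordRecall (hypothetical helper) returns the fraction of source
// words that occur anywhere in the transcript. Matching is by
// substring, so it is lenient: "cat" counts as found in "catalog".
func wordRecall(source, transcript string) float64 {
	words := wordRE.FindAllString(strings.ToLower(source), -1)
	if len(words) == 0 {
		return 0 // guard added for this sketch only
	}
	got := strings.ToLower(transcript)
	hits := 0
	for _, w := range words {
		if strings.Contains(got, w) {
			hits++
		}
	}
	return float64(hits) / float64(len(words))
}

func main() {
	fmt.Println(wordRecall("Hello world", "hello there"))
	fmt.Println(wordRecall("a cat", "catalog"))
}
```

The substring match means short words such as "a" or "is" almost always count as hits, so the 80% threshold is looser than a token-level comparison would be.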

@@ -5,7 +5,7 @@ set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
add_subdirectory(./sources/whisper.cpp)
-add_library(gowhisper MODULE gowhisper.cpp)
+add_library(gowhisper MODULE cpp/gowhisper.cpp)
target_link_libraries(gowhisper PRIVATE whisper ggml)
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# whisper.cpp version
WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
-WHISPER_CPP_VERSION?=166c20b473d5f4d04052e699f992f625ea2a2fdd
+WHISPER_CPP_VERSION?=fc674574ca27cac59a15e5b22a09b9d9ad62aafe
SO_TARGET?=libgowhisper.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
@@ -111,7 +111,7 @@ libgowhisper-fallback.so: sources/whisper.cpp
SO_TARGET=libgowhisper-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgowhisper-custom
rm -rfv build*
libgowhisper-custom: CMakeLists.txt gowhisper.cpp gowhisper.h
libgowhisper-custom: CMakeLists.txt cpp/gowhisper.cpp cpp/gowhisper.h
mkdir -p build-$(SO_TARGET) && \
cd build-$(SO_TARGET) && \
cmake .. $(CMAKE_ARGS) && \

@@ -139,7 +139,10 @@ func (w *Whisper) AudioTranscription(opts *pb.TranscriptRequest) (pb.TranscriptR
// segment start/end conversion factor taken from https://github.com/ggml-org/whisper.cpp/blob/master/examples/cli/cli.cpp#L895
s := CppGetSegmentStart(i) * (10000000)
t := CppGetSegmentEnd(i) * (10000000)
txt := strings.Clone(CppGetSegmentText(i))
// whisper.cpp can emit bytes that aren't valid UTF-8 (e.g. a multibyte
// codepoint split across token boundaries); protobuf string fields
// reject those at marshal time. Scrub before the value escapes cgo.
txt := strings.ToValidUTF8(strings.Clone(CppGetSegmentText(i)), "\uFFFD")
tokens := make([]int32, CppNTokens(i))
if opts.Diarize && CppGetSegmentSpeakerTurnNext(i) {
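
The comment in this hunk describes a multibyte codepoint being split across token boundaries. As an illustration only, here is a Python sketch of the same scrubbing idea using the standard `errors="replace"` decoder; the helper name `scrub_utf8` is hypothetical, and the Go code above uses `strings.ToValidUTF8` instead.

```python
def scrub_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


encoded = "café".encode("utf-8")   # b"caf\xc3\xa9": "é" takes two bytes
head, tail = encoded[:-1], encoded[-1:]

# Each half on its own is invalid UTF-8, as when a codepoint is split
# between two segments.
print(repr(scrub_utf8(head)))
print(repr(scrub_utf8(tail + b"!")))
# The whole sequence decodes cleanly.
print(repr(scrub_utf8(encoded)))
```

One difference to keep in mind: Go's `strings.ToValidUTF8` collapses a run of invalid bytes into a single replacement string, while Python's decoder may emit more than one U+FFFD for a longer run, so the two are equivalent in intent but not always byte-for-byte.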

@@ -168,6 +168,43 @@
nvidia-cuda-13: "cuda13-rfdetr"
nvidia-cuda-12: "cuda12-rfdetr"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-rfdetr"
- &insightface
name: "insightface"
alias: "insightface"
# Upstream insightface library is MIT. The pretrained model packs
# (buffalo_l, buffalo_s, antelopev2) are released for NON-COMMERCIAL
# research use only. The backend image also pre-bakes OpenCV Zoo
# YuNet + SFace (Apache 2.0) for commercial use. Pick the engine
# via model-gallery entries (insightface-buffalo-l / insightface-opencv
# / insightface-buffalo-s) or set `options` in your model YAML.
license: "mixed"
description: |
Face recognition backend powered by `insightface` (ONNX Runtime).
Provides face verification (/v1/face/verify), face analysis
(/v1/face/analyze), face embedding (/v1/embeddings), face
detection (/v1/detection), and 1:N identification
(/v1/face/{register,identify,forget}).
Ships two engines in a single image: one that drives the insightface
model packs (buffalo_l/s/m/sc, antelopev2 — non-commercial research
use only) and one that drives OpenCV Zoo's YuNet + SFace pair
(Apache 2.0 — commercial-safe). Select via `options: ["engine:..."]`
in your model YAML, or install one of the ready-made model-gallery
entries under the `insightface-*` prefix.
The backend image contains only code and Python deps; all model
weights are managed by LocalAI's gallery download mechanism.
urls:
- https://github.com/deepinsight/insightface
- https://github.com/opencv/opencv_zoo
tags:
- face-recognition
- face-verification
- face-embedding
- gpu
- cpu
capabilities:
default: "cpu-insightface"
nvidia: "cuda12-insightface"
nvidia-cuda-12: "cuda12-insightface"
- &sam3cpp
name: "sam3-cpp"
alias: "sam3-cpp"
@@ -226,6 +263,8 @@
amd: "rocm-vllm"
intel: "intel-vllm"
nvidia-cuda-12: "cuda12-vllm"
nvidia-cuda-13: "cuda13-vllm"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm"
cpu: "cpu-vllm"
- &sglang
name: "sglang"
@@ -248,6 +287,7 @@
amd: "rocm-sglang"
intel: "intel-sglang"
nvidia-cuda-12: "cuda12-sglang"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sglang"
cpu: "cpu-sglang"
- &vllm-omni
name: "vllm-omni"
@@ -274,6 +314,8 @@
nvidia: "cuda12-vllm-omni"
amd: "rocm-vllm-omni"
nvidia-cuda-12: "cuda12-vllm-omni"
nvidia-cuda-13: "cuda13-vllm-omni"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm-omni"
- &mlx
name: "mlx"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-mlx"
@@ -530,6 +572,34 @@
nvidia-l4t: "nvidia-l4t-arm64-qwen3-tts-cpp"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-qwen3-tts-cpp"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-qwen3-tts-cpp"
- &vibevoicecpp
name: "vibevoice-cpp"
description: |
vibevoice.cpp C++ backend using GGML. Native C++ port of Microsoft VibeVoice for both
text-to-speech (with voice cloning via voice prompt GGUFs) and long-form ASR with
speaker diarization. Outputs 24kHz mono WAV; ASR returns per-speaker JSON segments.
urls:
- https://github.com/mudler/vibevoice.cpp
tags:
- text-to-speech
- tts
- speech-to-text
- asr
- voice-cloning
- diarization
alias: "vibevoice-cpp"
capabilities:
default: "cpu-vibevoice-cpp"
nvidia: "cuda12-vibevoice-cpp"
nvidia-cuda-13: "cuda13-vibevoice-cpp"
nvidia-cuda-12: "cuda12-vibevoice-cpp"
intel: "intel-sycl-f16-vibevoice-cpp"
metal: "metal-vibevoice-cpp"
amd: "rocm-vibevoice-cpp"
vulkan: "vulkan-vibevoice-cpp"
nvidia-l4t: "nvidia-l4t-arm64-vibevoice-cpp"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-vibevoice-cpp"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vibevoice-cpp"
- &faster-whisper
icon: https://avatars.githubusercontent.com/u/1520500?s=200&v=4
description: |
@@ -587,7 +657,6 @@
alias: "whisperx"
capabilities:
nvidia: "cuda12-whisperx"
amd: "rocm-whisperx"
metal: "metal-whisperx"
default: "cpu-whisperx"
nvidia-cuda-13: "cuda13-whisperx"
@@ -970,6 +1039,23 @@
nvidia: "cuda12-neutts"
amd: "rocm-neutts"
nvidia-cuda-12: "cuda12-neutts"
- &sherpa-onnx
name: "sherpa-onnx"
alias: "sherpa-onnx"
urls:
- https://k2-fsa.github.io/sherpa/onnx/
description: |
Sherpa-ONNX backend for text-to-speech (VITS, Matcha, Kokoro), speech-to-text (Whisper, Paraformer, SenseVoice, Omnilingual ASR CTC), and voice activity detection via ONNX Runtime.
Supports multi-speaker voices, 1600+ language ASR, and GPU acceleration.
tags:
- text-to-speech
- TTS
- speech-to-text
- ASR
capabilities:
default: "cpu-sherpa-onnx"
nvidia: "cuda12-sherpa-onnx"
nvidia-cuda-12: "cuda12-sherpa-onnx"
- !!merge <<: *neutts
name: "neutts-development"
capabilities:
@@ -1555,6 +1641,20 @@
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant
## whisper
- !!merge <<: *whispercpp
name: "whisper-development"
capabilities:
default: "cpu-whisper-development"
nvidia: "cuda12-whisper-development"
intel: "intel-sycl-f16-whisper-development"
metal: "metal-whisper-development"
amd: "rocm-whisper-development"
vulkan: "vulkan-whisper-development"
nvidia-l4t: "nvidia-l4t-arm64-whisper-development"
nvidia-cuda-13: "cuda13-whisper-development"
nvidia-cuda-12: "cuda12-whisper-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-whisper-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-whisper-development"
- !!merge <<: *whispercpp
name: "nvidia-l4t-arm64-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-whisper"
@@ -1761,12 +1861,25 @@
nvidia: "cuda12-vllm-development"
amd: "rocm-vllm-development"
intel: "intel-vllm-development"
nvidia-cuda-12: "cuda12-vllm-development"
nvidia-cuda-13: "cuda13-vllm-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm-development"
cpu: "cpu-vllm-development"
- !!merge <<: *vllm
name: "cuda12-vllm"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vllm"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-vllm
- !!merge <<: *vllm
name: "cuda13-vllm"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-vllm"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-vllm
- !!merge <<: *vllm
name: "cuda13-nvidia-l4t-arm64-vllm"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-vllm"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-vllm
- !!merge <<: *vllm
name: "rocm-vllm"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-vllm"
@@ -1787,6 +1900,16 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-vllm"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-vllm
- !!merge <<: *vllm
name: "cuda13-vllm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-vllm"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-vllm
- !!merge <<: *vllm
name: "cuda13-nvidia-l4t-arm64-vllm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-vllm"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-vllm
- !!merge <<: *vllm
name: "rocm-vllm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-vllm"
@@ -1809,12 +1932,19 @@
nvidia: "cuda12-sglang-development"
amd: "rocm-sglang-development"
intel: "intel-sglang-development"
nvidia-cuda-12: "cuda12-sglang-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sglang-development"
cpu: "cpu-sglang-development"
- !!merge <<: *sglang
name: "cuda12-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sglang"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-sglang
- !!merge <<: *sglang
name: "cuda13-nvidia-l4t-arm64-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-sglang"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-sglang
- !!merge <<: *sglang
name: "rocm-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-sglang"
@@ -1835,6 +1965,11 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sglang"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-sglang
- !!merge <<: *sglang
name: "cuda13-nvidia-l4t-arm64-sglang-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-sglang"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-sglang
- !!merge <<: *sglang
name: "rocm-sglang-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-sglang"
@@ -1857,11 +1992,23 @@
nvidia: "cuda12-vllm-omni-development"
amd: "rocm-vllm-omni-development"
nvidia-cuda-12: "cuda12-vllm-omni-development"
nvidia-cuda-13: "cuda13-vllm-omni-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vllm-omni-development"
- !!merge <<: *vllm-omni
name: "cuda12-vllm-omni"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vllm-omni"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-vllm-omni
- !!merge <<: *vllm-omni
name: "cuda13-vllm-omni"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-vllm-omni"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-vllm-omni
- !!merge <<: *vllm-omni
name: "cuda13-nvidia-l4t-arm64-vllm-omni"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-vllm-omni"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-vllm-omni
- !!merge <<: *vllm-omni
name: "rocm-vllm-omni"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-vllm-omni"
@@ -1872,6 +2019,16 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-vllm-omni"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-vllm-omni
- !!merge <<: *vllm-omni
name: "cuda13-vllm-omni-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-vllm-omni"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-vllm-omni
- !!merge <<: *vllm-omni
name: "cuda13-nvidia-l4t-arm64-vllm-omni-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-vllm-omni"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-vllm-omni
- !!merge <<: *vllm-omni
name: "rocm-vllm-omni-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-vllm-omni"
@@ -2527,6 +2684,107 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-qwen3-tts-cpp
## vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "nvidia-l4t-arm64-vibevoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-vibevoice-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "nvidia-l4t-arm64-vibevoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-vibevoice-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "cuda13-nvidia-l4t-arm64-vibevoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-vibevoice-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "cuda13-nvidia-l4t-arm64-vibevoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-vibevoice-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "cpu-vibevoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-vibevoice-cpp"
mirrors:
- localai/localai-backends:latest-cpu-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "metal-vibevoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-vibevoice-cpp"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "metal-vibevoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-vibevoice-cpp"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "cpu-vibevoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-vibevoice-cpp"
mirrors:
- localai/localai-backends:master-cpu-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "cuda12-vibevoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vibevoice-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "rocm-vibevoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-vibevoice-cpp"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "intel-sycl-f32-vibevoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-vibevoice-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f32-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "intel-sycl-f16-vibevoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-vibevoice-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f16-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "vulkan-vibevoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-vibevoice-cpp"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "vulkan-vibevoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-vibevoice-cpp"
mirrors:
- localai/localai-backends:master-gpu-vulkan-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "cuda12-vibevoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-vibevoice-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "rocm-vibevoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-vibevoice-cpp"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "intel-sycl-f32-vibevoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-vibevoice-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "intel-sycl-f16-vibevoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-vibevoice-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f16-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "cuda13-vibevoice-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-vibevoice-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-vibevoice-cpp
- !!merge <<: *vibevoicecpp
name: "cuda13-vibevoice-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-vibevoice-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-vibevoice-cpp
## kokoro
- !!merge <<: *kokoro
name: "kokoro-development"
@@ -2745,7 +3003,6 @@
name: "whisperx-development"
capabilities:
nvidia: "cuda12-whisperx-development"
amd: "rocm-whisperx-development"
metal: "metal-whisperx-development"
default: "cpu-whisperx-development"
nvidia-cuda-13: "cuda13-whisperx-development"
@@ -2771,16 +3028,6 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-whisperx"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-whisperx
- !!merge <<: *whisperx
name: "rocm-whisperx"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-whisperx"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-whisperx
- !!merge <<: *whisperx
name: "rocm-whisperx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-whisperx"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-whisperx
- !!merge <<: *whisperx
name: "cuda13-whisperx"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-whisperx"
@@ -3721,3 +3968,118 @@
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-llama-cpp-quantization"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-llama-cpp-quantization
# insightface (face recognition) — development and concrete image entries
- !!merge <<: *insightface
name: "insightface-development"
capabilities:
default: "cpu-insightface-development"
nvidia: "cuda12-insightface-development"
nvidia-cuda-12: "cuda12-insightface-development"
- !!merge <<: *insightface
name: "cpu-insightface"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-insightface"
mirrors:
- localai/localai-backends:latest-cpu-insightface
- !!merge <<: *insightface
name: "cuda12-insightface"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-insightface"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-insightface
- !!merge <<: *insightface
name: "cpu-insightface-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-insightface"
mirrors:
- localai/localai-backends:master-cpu-insightface
- !!merge <<: *insightface
name: "cuda12-insightface-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-insightface"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-insightface
# speaker-recognition (voice/speaker biometrics) — Apache-2.0 stack
- &speakerrecognition
name: "speaker-recognition"
alias: "speaker-recognition"
# SpeechBrain is Apache-2.0. WeSpeaker / 3D-Speaker ONNX exports are
# Apache-2.0. The backend itself ships only Python deps — all model
# weights flow through LocalAI's gallery download mechanism (or
# SpeechBrain's built-in HF auto-download at first LoadModel).
license: apache-2.0
description: |
Speaker (voice) recognition backend — the audio analog to
insightface. Wraps SpeechBrain ECAPA-TDNN (default engine, 192-d
embeddings, ~1.9% EER on VoxCeleb) plus an OnnxDirectEngine for
pre-exported WeSpeaker / 3D-Speaker ONNX models.
Exposes speaker verification (/v1/voice/verify), speaker embedding
(/v1/voice/embed), speaker analysis (/v1/voice/analyze), and 1:N
speaker identification (/v1/voice/{register,identify,forget}).
Registrations use LocalAI's built-in vector store — same in-memory
backing the face-recognition registry uses, separate instance.
urls:
- https://speechbrain.github.io/
- https://github.com/wenet-e2e/wespeaker
- https://github.com/modelscope/3D-Speaker
tags:
- voice-recognition
- speaker-verification
- speaker-embedding
- gpu
- cpu
capabilities:
default: "cpu-speaker-recognition"
nvidia: "cuda12-speaker-recognition"
nvidia-cuda-12: "cuda12-speaker-recognition"
- !!merge <<: *speakerrecognition
name: "speaker-recognition-development"
capabilities:
default: "cpu-speaker-recognition-development"
nvidia: "cuda12-speaker-recognition-development"
nvidia-cuda-12: "cuda12-speaker-recognition-development"
- !!merge <<: *speakerrecognition
name: "cpu-speaker-recognition"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-speaker-recognition"
mirrors:
- localai/localai-backends:latest-cpu-speaker-recognition
- !!merge <<: *speakerrecognition
name: "cuda12-speaker-recognition"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-speaker-recognition"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-speaker-recognition
- !!merge <<: *speakerrecognition
name: "cpu-speaker-recognition-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-speaker-recognition"
mirrors:
- localai/localai-backends:master-cpu-speaker-recognition
- !!merge <<: *speakerrecognition
name: "cuda12-speaker-recognition-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-speaker-recognition"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-speaker-recognition
## sherpa-onnx
- !!merge <<: *sherpa-onnx
name: "sherpa-onnx-development"
capabilities:
default: "cpu-sherpa-onnx-development"
nvidia: "cuda12-sherpa-onnx-development"
nvidia-cuda-12: "cuda12-sherpa-onnx-development"
- !!merge <<: *sherpa-onnx
name: "cpu-sherpa-onnx"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-sherpa-onnx"
mirrors:
- localai/localai-backends:latest-cpu-sherpa-onnx
- !!merge <<: *sherpa-onnx
name: "cpu-sherpa-onnx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-sherpa-onnx"
mirrors:
- localai/localai-backends:master-cpu-sherpa-onnx
- !!merge <<: *sherpa-onnx
name: "cuda12-sherpa-onnx"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sherpa-onnx"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-sherpa-onnx
- !!merge <<: *sherpa-onnx
name: "cuda12-sherpa-onnx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sherpa-onnx"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-sherpa-onnx

@@ -1,4 +1,4 @@
grpcio==1.80.0
protobuf
certifi
packaging==24.1
packaging==26.2

@@ -40,7 +40,19 @@ from diffusers import DiffusionPipeline, ControlNetModel
from diffusers import FluxPipeline, FluxTransformer2DModel, AutoencoderKLWan
from diffusers.pipelines.stable_diffusion import safety_checker
from diffusers.utils import load_image, export_to_video
from compel import Compel, ReturnedEmbeddingsType
# TODO: re-enable compel as a hard dependency once it supports transformers >= 5.
# Tracking upstream: https://github.com/damian0815/compel/pull/129
# and https://github.com/damian0815/compel/issues/128
# Until then compel pins transformers ~= 4.25, which forces the pip resolver into
# multi-hour backtracking storms in CI when DEPS_REFRESH rotates the cache.
# Keep the import optional and gate usage on the COMPEL env var (set COMPEL=1 to opt in).
try:
    from compel import Compel, ReturnedEmbeddingsType
    COMPEL_AVAILABLE = True
except ImportError:
    Compel = None
    ReturnedEmbeddingsType = None
    COMPEL_AVAILABLE = False
from optimum.quanto import freeze, qfloat8, quantize
from transformers import T5EncoderModel
from safetensors.torch import load_file
@@ -66,6 +78,9 @@ from diffusers import LTX2VideoTransformer3DModel, GGUFQuantizationConfig
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
COMPEL = os.environ.get("COMPEL", "0") == "1"
if COMPEL and not COMPEL_AVAILABLE:
    print("WARNING: COMPEL is enabled but the compel module is not installed. Install it manually (`pip install compel`) or unset COMPEL. Falling back to standard prompt processing.", file=sys.stderr)
    COMPEL = False
SD_EMBED = os.environ.get("SD_EMBED", "0") == "1"
# Warn if SD_EMBED is enabled but the module is not available
if SD_EMBED and not SD_EMBED_AVAILABLE:
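
The two hunks above combine an optional import with an environment-variable gate. The pattern can be sketched in isolation as follows; `load_optional` and `feature_enabled` are hypothetical helper names introduced for illustration, not functions in `backend.py`.

```python
import importlib
import os
import sys


def load_optional(module_name: str):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def feature_enabled(env_var: str, module) -> bool:
    """True only when env_var is "1" and the optional module was imported."""
    requested = os.environ.get(env_var, "0") == "1"
    if requested and module is None:
        print(
            f"WARNING: {env_var}=1 but the optional module is not installed; "
            "falling back to the default code path.",
            file=sys.stderr,
        )
        return False
    return requested


if __name__ == "__main__":
    compel = load_optional("compel")
    COMPEL = feature_enabled("COMPEL", compel)
    print(f"compel active: {COMPEL}")
```

The point of the gate is that the flag is downgraded once, at startup, so later code can test a single boolean instead of re-checking whether the import succeeded.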

@@ -4,10 +4,15 @@ opencv-python
transformers
torchvision==0.22.1
accelerate
compel
git+https://github.com/xhinker/sd_embed
peft
sentencepiece
torch==2.7.1
optimum-quanto
ftfy
ftfy
# TODO: re-add compel once it supports transformers >= 5.
# Tracking: https://github.com/damian0815/compel/pull/129
# https://github.com/damian0815/compel/issues/128
# compel currently pins transformers~=4.25, which forced pip into multi-hour
# resolver backtracking storms in CI. backend.py imports it lazily and gates
# the COMPEL=1 env var on the import succeeding, so dropping it here is safe.

@@ -4,10 +4,15 @@ opencv-python
transformers
torchvision
accelerate
compel
git+https://github.com/xhinker/sd_embed
peft
sentencepiece
torch
ftfy
optimum-quanto
# TODO: re-add compel once it supports transformers >= 5.
# Tracking: https://github.com/damian0815/compel/pull/129
# https://github.com/damian0815/compel/issues/128
# compel currently pins transformers~=4.25, which forced pip into multi-hour
# resolver backtracking storms in CI. backend.py imports it lazily and gates
# the COMPEL=1 env var on the import succeeding, so dropping it here is safe.

@@ -4,10 +4,15 @@ opencv-python
transformers
torchvision
accelerate
compel
git+https://github.com/xhinker/sd_embed
peft
sentencepiece
torch
ftfy
optimum-quanto
# TODO: re-add compel once it supports transformers >= 5.
# Tracking: https://github.com/damian0815/compel/pull/129
# https://github.com/damian0815/compel/issues/128
# compel currently pins transformers~=4.25, which forced pip into multi-hour
# resolver backtracking storms in CI. backend.py imports it lazily and gates
# the COMPEL=1 env var on the import succeeding, so dropping it here is safe.

@@ -5,8 +5,13 @@ git+https://github.com/huggingface/diffusers
opencv-python
transformers
accelerate
compel
peft
sentencepiece
optimum-quanto
ftfy
ftfy
# TODO: re-add compel once it supports transformers >= 5.
# Tracking: https://github.com/damian0815/compel/pull/129
# https://github.com/damian0815/compel/issues/128
# compel currently pins transformers~=4.25, which forced pip into multi-hour
# resolver backtracking storms in CI. backend.py imports it lazily and gates
# the COMPEL=1 env var on the import succeeding, so dropping it here is safe.

@@ -7,9 +7,14 @@ git+https://github.com/huggingface/diffusers
opencv-python
transformers
accelerate
compel
git+https://github.com/xhinker/sd_embed
peft
sentencepiece
optimum-quanto
ftfy
ftfy
# TODO: re-add compel once it supports transformers >= 5.
# Tracking: https://github.com/damian0815/compel/pull/129
# https://github.com/damian0815/compel/issues/128
# compel currently pins transformers~=4.25, which forced pip into multi-hour
# resolver backtracking storms in CI. backend.py imports it lazily and gates
# the COMPEL=1 env var on the import succeeding, so dropping it here is safe.

@@ -3,10 +3,15 @@ torch
git+https://github.com/huggingface/diffusers
transformers
accelerate
compel
peft
optimum-quanto
numpy<2
sentencepiece
torchvision
ftfy
# TODO: re-add compel once it supports transformers >= 5.
# Tracking: https://github.com/damian0815/compel/pull/129
# https://github.com/damian0815/compel/issues/128
# compel currently pins transformers~=4.25, which forced pip into multi-hour
# resolver backtracking storms in CI. backend.py imports it lazily and gates
# the COMPEL=1 env var on the import succeeding, so dropping it here is safe.

@@ -3,7 +3,6 @@ torch
git+https://github.com/huggingface/diffusers
transformers
accelerate
compel
peft
optimum-quanto
numpy<2
@@ -11,3 +10,9 @@ sentencepiece
torchvision
ftfy
chardet
# TODO: re-add compel once it supports transformers >= 5.
# Tracking: https://github.com/damian0815/compel/pull/129
# https://github.com/damian0815/compel/issues/128
# compel currently pins transformers~=4.25, which forced pip into multi-hour
# resolver backtracking storms in CI. backend.py imports it lazily and gates
# the COMPEL=1 env var on the import succeeding, so dropping it here is safe.

@@ -4,8 +4,13 @@ git+https://github.com/huggingface/diffusers
opencv-python
transformers
accelerate
compel
peft
sentencepiece
optimum-quanto
ftfy
ftfy
# TODO: re-add compel once it supports transformers >= 5.
# Tracking: https://github.com/damian0815/compel/pull/129
# https://github.com/damian0815/compel/issues/128
# compel currently pins transformers~=4.25, which forced pip into multi-hour
# resolver backtracking storms in CI. backend.py imports it lazily and gates
# the COMPEL=1 env var on the import succeeding, so dropping it here is safe.

@@ -0,0 +1,16 @@
.DEFAULT_GOAL := install

.PHONY: install
install:
	bash install.sh

.PHONY: protogen-clean
protogen-clean:
	$(RM) backend_pb2_grpc.py backend_pb2.py

.PHONY: clean
clean: protogen-clean
	rm -rf venv __pycache__

test: install
	bash test.sh

@@ -0,0 +1,67 @@
# insightface backend (LocalAI)
Face recognition backend built on ONNX Runtime. Provides face
verification (1:1), face analysis (age/gender), face detection, face
embedding, and — via LocalAI's built-in vector store — 1:N
identification.
## Engines
This backend ships with **two** interchangeable engines selected via
`LoadModel.Options["engine"]`:
| engine | Implementation | Models | License |
|---|---|---|---|
| `insightface` (default) | `insightface.app.FaceAnalysis` | `buffalo_l`, `buffalo_s`, `antelopev2` | **Non-commercial research use only** |
| `onnx_direct` | OpenCV `FaceDetectorYN` + `FaceRecognizerSF` | OpenCV Zoo YuNet + SFace | Apache 2.0 (commercial-safe) |
Both engines implement the same `FaceEngine` protocol in `engines.py`,
so the gRPC servicer in `backend.py` doesn't need to know which one is
active.
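
To make the two-engine seam concrete, here is a rough sketch of how a shared protocol and a name-based factory could fit together. Only the names `FaceEngine`, `build_engine`, `prepare`, `InsightFaceEngine` and `OnnxDirectEngine` come from this change; the method bodies and the `_ENGINES` registry are assumptions for illustration, since `engines.py` is not shown here.

```python
from typing import Protocol


class FaceEngine(Protocol):
    """Minimal surface the servicer relies on (assumed; the real one is larger)."""

    def prepare(self, options: dict[str, str]) -> None: ...


class InsightFaceEngine:
    def prepare(self, options: dict[str, str]) -> None:
        # Stand-in body: the real engine would load the model pack here.
        self.model_pack = options.get("model_pack", "buffalo_l")


class OnnxDirectEngine:
    def prepare(self, options: dict[str, str]) -> None:
        # Stand-in body: both ONNX paths are documented as required.
        required = ("detector_onnx", "recognizer_onnx")
        missing = [key for key in required if key not in options]
        if missing:
            raise ValueError(f"missing required options: {', '.join(missing)}")
        self.detector_onnx = options["detector_onnx"]
        self.recognizer_onnx = options["recognizer_onnx"]


# Hypothetical registry mapping the `engine` option to an implementation.
_ENGINES = {"insightface": InsightFaceEngine, "onnx_direct": OnnxDirectEngine}


def build_engine(name: str) -> FaceEngine:
    try:
        return _ENGINES[name]()
    except KeyError:
        raise ValueError(f"unknown engine {name!r}") from None


engine = build_engine("insightface")
engine.prepare({})
print(type(engine).__name__, engine.model_pack)
```

With this shape, adding a third engine means one new class and one registry entry, and the gRPC servicer keeps calling `build_engine(...)` followed by `prepare(...)` unchanged.
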
## LoadModel options
Common:
| option | default | description |
|---|---|---|
| `engine` | `insightface` | one of `insightface`, `onnx_direct` |
| `det_size` | `640x640` (insightface), `320x320` (onnx_direct) | detector input size |
| `det_thresh` | `0.5` | detector confidence threshold |
| `verify_threshold` | `0.35` | default cosine distance cutoff for FaceVerify |
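
As a rough illustration of what a cosine distance cutoff means, the sketch below compares two embeddings and applies the `0.35` default. It assumes distance is defined as one minus cosine similarity and that a match is `distance <= threshold`; the backend's exact definition and comparison are not shown in this change, and the function names are hypothetical.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0.0 for identical directions, 1.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def same_person(a: list[float], b: list[float], threshold: float = 0.35) -> bool:
    """Assumed decision rule: accept when the distance is within the cutoff."""
    return cosine_distance(a, b) <= threshold


print(same_person([1.0, 0.0], [0.8, 0.6]))  # distance is about 0.2
print(same_person([1.0, 0.0], [0.0, 1.0]))  # distance is 1.0
```

A lower threshold makes verification stricter (fewer false accepts, more false rejects), which is why the option is exposed per model rather than hard-coded.
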
`insightface` engine:
| option | default | description |
|---|---|---|
| `model_pack` | `buffalo_l` | which insightface pack to load |
`onnx_direct` engine:
| option | default | description |
|---|---|---|
| `detector_onnx` | *(required)* | path to YuNet-compatible ONNX |
| `recognizer_onnx` | *(required)* | path to SFace-compatible ONNX |
## Adding a new model pack
1. If it's an insightface pack (auto-downloadable or manually extracted
into `~/.insightface/models/<name>/`), just add a new gallery entry
in `backend/index.yaml` with `options: ["engine:insightface",
"model_pack:<name>"]`. No code change.
2. If it's an Apache-licensed ONNX pair, add a gallery entry with
`options: ["engine:onnx_direct", "detector_onnx:...",
"recognizer_onnx:..."]`. If the detector or recognizer has a
different input-tensor shape than YuNet/SFace, you may need a new
engine implementation in `engines.py`; the two-engine seam makes
that a self-contained change.
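
The `options` entries in the steps above are plain `key:value` strings. The sketch below shows how such a list could be turned into a dictionary, mirroring the `_parse_options` helper in `backend.py`; `parse_det_size` is a hypothetical addition, and how the backend actually interprets `det_size` is an assumption.

```python
def parse_options(raw: list[str]) -> dict[str, str]:
    """Split each "key:value" entry on the first colon; skip malformed entries."""
    out: dict[str, str] = {}
    for entry in raw:
        if ":" not in entry:
            continue
        key, value = entry.split(":", 1)
        out[key.strip()] = value.strip()
    return out


def parse_det_size(value: str) -> tuple[int, int]:
    """Parse a size such as "640x640" into a pair of integers."""
    first, second = value.lower().split("x", 1)
    return int(first), int(second)


opts = parse_options([
    "engine:onnx_direct",
    "detector_onnx:C:/models/yunet.onnx",  # value keeps its own colon
    "det_size:320x320",
    "not-an-option",
])
print(opts)
print(parse_det_size(opts["det_size"]))
```

Splitting on the first colon only is what allows values that themselves contain colons, such as drive-letter paths or URLs, to pass through intact.
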
## Running tests locally
```bash
make -C backend/python/insightface # install deps + bake models
make -C backend/python/insightface test # run test.py
```
The OpenCV Zoo tests skip gracefully when `/models/opencv/*.onnx` is
absent (e.g. on dev boxes where `install.sh` wasn't run).

@@ -0,0 +1,312 @@
#!/usr/bin/env python3
"""gRPC server for the insightface face recognition backend.
Implements Health / LoadModel / Status plus the face-specific methods:
Embedding, Detect, FaceVerify, FaceAnalyze. The heavy lifting is
delegated to engines.py — this file is just the gRPC plumbing.
"""
import argparse
import base64
import os
import signal
import sys
import time
from concurrent import futures
from io import BytesIO
import backend_pb2
import backend_pb2_grpc
import cv2
import grpc
import numpy as np
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "common"))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "common"))
from grpc_auth import get_auth_interceptors # noqa: E402
from engines import FaceEngine, build_engine # noqa: E402
_ONE_DAY = 60 * 60 * 24
MAX_WORKERS = int(os.environ.get("PYTHON_GRPC_MAX_WORKERS", "1"))
# Default cosine-distance threshold for "same person" on buffalo_l
# ArcFace R50. Clients can override per-request; clients using SFace
# should pass threshold≈0.4 since the distance distribution is wider.
DEFAULT_VERIFY_THRESHOLD = 0.35


def _decode_image(src: str) -> np.ndarray | None:
    """Decode a base64-encoded image into an OpenCV BGR numpy array."""
    if not src:
        return None
    try:
        data = base64.b64decode(src, validate=False)
    except Exception:
        return None
    arr = np.frombuffer(data, dtype=np.uint8)
    if arr.size == 0:
        return None
    img = cv2.imdecode(arr, cv2.IMREAD_COLOR)
    return img


def _parse_options(raw: list[str]) -> dict[str, str]:
    out: dict[str, str] = {}
    for entry in raw:
        if ":" not in entry:
            continue
        k, v = entry.split(":", 1)
        out[k.strip()] = v.strip()
    return out


class BackendServicer(backend_pb2_grpc.BackendServicer):
    def __init__(self) -> None:
        self.engine: FaceEngine | None = None
        self.engine_name: str = ""
        self.model_name: str = ""
        self.verify_threshold: float = DEFAULT_VERIFY_THRESHOLD

    def Health(self, request, context):
        return backend_pb2.Reply(message=bytes("OK", "utf-8"))

    def LoadModel(self, request, context):
        options = _parse_options(list(request.Options))
        # Surface LocalAI's models directory (ModelPath) so engines can
        # anchor relative paths: OnnxDirectEngine's detector_onnx /
        # recognizer_onnx point at gallery-managed files that LocalAI
        # dropped there, and InsightFaceEngine looks for its packs in
        # that same directory alongside every other managed model.
        # Private key to avoid clashing with user-provided options.
        if request.ModelPath:
            options["_model_dir"] = request.ModelPath
        engine_name = options.get("engine", "insightface")
        try:
            # Build into a local first so a failed prepare() does not leave
            # a half-initialized engine that Status would report as READY.
            engine = build_engine(engine_name)
            engine.prepare(options)
        except Exception as err:  # pragma: no cover - exercised via e2e
            return backend_pb2.Result(success=False, message=f"Failed to load face engine: {err}")
        self.engine = engine
        self.engine_name = engine_name
        self.model_name = request.Model or options.get("model_pack", "")
        if "verify_threshold" in options:
            try:
                self.verify_threshold = float(options["verify_threshold"])
            except ValueError:
                pass
        print(f"[insightface] engine={engine_name} model={self.model_name} loaded", file=sys.stderr)
        return backend_pb2.Result(success=True, message="Model loaded successfully")

    def Status(self, request, context):
        state = (
            backend_pb2.StatusResponse.READY
            if self.engine is not None
            else backend_pb2.StatusResponse.UNINITIALIZED
        )
        return backend_pb2.StatusResponse(state=state)

    def Embedding(self, request, context):
        if self.engine is None:
            context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
            context.set_details("face model not loaded")
            return backend_pb2.EmbeddingResult()
        if not request.Images:
            context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
            context.set_details("Embedding requires Images[0] to be a base64 image")
            return backend_pb2.EmbeddingResult()
        img = _decode_image(request.Images[0])
        if img is None:
            context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
            context.set_details("failed to decode image")
            return backend_pb2.EmbeddingResult()
        vec = self.engine.embed(img)
        if vec is None:
            context.set_code(grpc.StatusCode.NOT_FOUND)
            context.set_details("no face detected")
            return backend_pb2.EmbeddingResult()
        return backend_pb2.EmbeddingResult(embeddings=[float(x) for x in vec])

    def Detect(self, request, context):
        if self.engine is None:
            return backend_pb2.DetectResponse()
        img = _decode_image(request.src)
        if img is None:
            return backend_pb2.DetectResponse()
        detections = []
        for d in self.engine.detect(img):
            x1, y1, x2, y2 = d.bbox
            detections.append(
                backend_pb2.Detection(
                    x=float(x1),
                    y=float(y1),
                    width=float(x2 - x1),
                    height=float(y2 - y1),
                    confidence=float(d.score),
                    class_name="face",
                )
            )
        return backend_pb2.DetectResponse(Detections=detections)

    def FaceVerify(self, request, context):
        if self.engine is None:
            context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
            context.set_details("face model not loaded")
            return backend_pb2.FaceVerifyResponse()
        img1 = _decode_image(request.img1)
        img2 = _decode_image(request.img2)
        if img1 is None or img2 is None:
            context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
            context.set_details("failed to decode one or both images")
            return backend_pb2.FaceVerifyResponse()
        threshold = request.threshold if request.threshold > 0 else self.verify_threshold
        start = time.time()
        e1 = self.engine.embed(img1)
        e2 = self.engine.embed(img2)
        if e1 is None or e2 is None:
            context.set_code(grpc.StatusCode.NOT_FOUND)
            context.set_details("no face detected in one or both images")
            return backend_pb2.FaceVerifyResponse()
        # Both engines return L2-normalized vectors, so the dot product
        # is the cosine similarity directly.
        sim = float(np.dot(e1, e2))
        distance = 1.0 - sim
        verified = distance < threshold
        confidence = max(0.0, min(100.0, (1.0 - distance / threshold) * 100.0)) if threshold > 0 else 0.0

        # Detect once per image — region is needed for the response and
        # potentially for the antispoof crop. Returns the highest-score face.
        def _best_detection(img):
            dets = self.engine.detect(img)
            if not dets:
                return None
            return max(dets, key=lambda d: d.score)

        def _region(det) -> backend_pb2.FacialArea:
            if det is None:
                return backend_pb2.FacialArea()
            x1, y1, x2, y2 = det.bbox
            return backend_pb2.FacialArea(x=x1, y=y1, w=x2 - x1, h=y2 - y1)

        det1 = _best_detection(img1)
        det2 = _best_detection(img2)
        img1_is_real = False
        img1_score = 0.0
        img2_is_real = False
        img2_score = 0.0
        if request.anti_spoofing:
            spoof1 = self.engine.antispoof(img1, det1.bbox) if det1 is not None else None
            spoof2 = self.engine.antispoof(img2, det2.bbox) if det2 is not None else None
            if spoof1 is None or spoof2 is None:
                context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
                context.set_details(
                    "anti_spoofing requested but no antispoof model is loaded — "
                    "install `silent-face-antispoofing` or pick a gallery entry "
                    "that bundles MiniFASNet weights"
                )
                return backend_pb2.FaceVerifyResponse()
            img1_is_real, img1_score = spoof1.is_real, spoof1.score
            img2_is_real, img2_score = spoof2.is_real, spoof2.score
            # Failed liveness vetoes verification regardless of similarity.
            if not (img1_is_real and img2_is_real):
                verified = False
        return backend_pb2.FaceVerifyResponse(
            verified=verified,
            distance=float(distance),
            threshold=float(threshold),
            confidence=float(confidence),
            model=self.model_name or self.engine_name,
            img1_area=_region(det1),
            img2_area=_region(det2),
            processing_time_ms=float((time.time() - start) * 1000.0),
            img1_is_real=img1_is_real,
            img1_antispoof_score=float(img1_score),
            img2_is_real=img2_is_real,
            img2_antispoof_score=float(img2_score),
        )

    def FaceAnalyze(self, request, context):
        if self.engine is None:
            context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
            context.set_details("face model not loaded")
            return backend_pb2.FaceAnalyzeResponse()
        img = _decode_image(request.img)
        if img is None:
            context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
            context.set_details("failed to decode image")
            return backend_pb2.FaceAnalyzeResponse()
        faces = []
        for attrs in self.engine.analyze(img):
            x, y, w, h = attrs.region
            fa = backend_pb2.FaceAnalysis(
                region=backend_pb2.FacialArea(x=float(x), y=float(y), w=float(w), h=float(h)),
                face_confidence=float(attrs.face_confidence),
            )
            if attrs.age is not None:
                fa.age = float(attrs.age)
            if attrs.dominant_gender:
                fa.dominant_gender = attrs.dominant_gender
            for k, v in attrs.gender.items():
                fa.gender[k] = float(v)
            if request.anti_spoofing:
                bbox = (float(x), float(y), float(x + w), float(y + h))
                spoof = self.engine.antispoof(img, bbox)
                if spoof is None:
                    context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
                    context.set_details(
                        "anti_spoofing requested but no antispoof model is loaded — "
                        "install `silent-face-antispoofing` or pick a gallery entry "
                        "that bundles MiniFASNet weights"
                    )
                    return backend_pb2.FaceAnalyzeResponse()
                fa.is_real = spoof.is_real
                fa.antispoof_score = float(spoof.score)
            faces.append(fa)
        return backend_pb2.FaceAnalyzeResponse(faces=faces)


def serve(address: str) -> None:
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=MAX_WORKERS),
        options=[
            ("grpc.max_message_length", 50 * 1024 * 1024),
            ("grpc.max_send_message_length", 50 * 1024 * 1024),
            ("grpc.max_receive_message_length", 50 * 1024 * 1024),
        ],
        interceptors=get_auth_interceptors(),
    )
    backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
    server.add_insecure_port(address)
    server.start()
    print("[insightface] Server started. Listening on: " + address, file=sys.stderr)

    def _stop(sig, frame):  # pragma: no cover
        print("[insightface] shutting down")
        server.stop(0)
        sys.exit(0)

    signal.signal(signal.SIGINT, _stop)
    signal.signal(signal.SIGTERM, _stop)
    try:
        while True:
            time.sleep(_ONE_DAY)
    except KeyboardInterrupt:
        server.stop(0)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run the insightface gRPC server.")
    parser.add_argument("--addr", default="localhost:50051", help="The address to bind the server to.")
    args = parser.parse_args()
    print(f"[insightface] startup: {args}", file=sys.stderr)
    serve(args.addr)

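The `FaceVerify` handler in the file above turns two embeddings into a distance, a verdict, and a 0-100 confidence. The arithmetic can be sketched with the standard library only. This is a minimal sketch under stated assumptions: `l2_normalize` and `verify_scores` are hypothetical helper names, plain lists stand in for numpy arrays, and the default threshold is the `DEFAULT_VERIFY_THRESHOLD` value of 0.35 from the source.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (the engines return vectors in this form)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def verify_scores(e1, e2, threshold=0.35):
    """Return (distance, verified, confidence) for two unit-length embeddings."""
    # For unit vectors the dot product is the cosine similarity.
    sim = sum(a * b for a, b in zip(e1, e2))
    distance = 1.0 - sim
    verified = distance < threshold
    # Confidence falls linearly from 100 at distance 0 to 0 at the threshold.
    if threshold > 0:
        confidence = max(0.0, min(100.0, (1.0 - distance / threshold) * 100.0))
    else:
        confidence = 0.0
    return distance, verified, confidence
```

Under this mapping, identical embeddings score close to 100, and anything at or beyond the threshold clamps to 0 regardless of how far past it the distance is.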

@@ -0,0 +1,573 @@
"""Face recognition engine implementations for the LocalAI insightface backend.

Two engines are provided:

  * InsightFaceEngine: drives insightface's model_zoo directly (no
                       FaceAnalysis wrapper). Supports buffalo_l /
                       buffalo_s / antelopev2 model packs with SCRFD
                       detector + ArcFace recognizer + genderage head.
                       NON-COMMERCIAL research use only (upstream
                       license).
  * OnnxDirectEngine:  loads detector + recognizer ONNX files directly
                       through OpenCV (cv2.FaceDetectorYN and
                       cv2.FaceRecognizerSF). Used for OpenCV Zoo models
                       (YuNet + SFace) and any future Apache-licensed
                       model set. analyze() reports detection regions
                       only (no age/gender attributes).

Both engines expose the same interface so the gRPC servicer (backend.py)
can dispatch without knowing which one is active.
"""
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Any, Protocol
import cv2
import numpy as np


@dataclass
class FaceDetection:
    bbox: tuple[float, float, float, float]  # x1, y1, x2, y2
    score: float
    landmarks: np.ndarray | None = None  # 5x2 keypoints when available


@dataclass
class FaceAttributes:
    region: tuple[float, float, float, float]  # x, y, w, h
    face_confidence: float
    age: float | None = None
    dominant_gender: str | None = None
    gender: dict[str, float] = field(default_factory=dict)


@dataclass
class SpoofResult:
    is_real: bool
    score: float  # averaged probability of the "real" class, 0.0-1.0


class FaceEngine(Protocol):
    """Minimal interface every engine must implement."""

    def prepare(self, options: dict[str, str]) -> None: ...
    def detect(self, img: np.ndarray) -> list[FaceDetection]: ...
    def embed(self, img: np.ndarray) -> np.ndarray | None: ...
    def analyze(self, img: np.ndarray) -> list[FaceAttributes]: ...
    # Optional: returns None when no antispoof model is loaded.
    def antispoof(self, img: np.ndarray, bbox: tuple[float, float, float, float]) -> SpoofResult | None: ...


# ─── Antispoofer (Silent-Face MiniFASNet) ──────────────────────────────
class Antispoofer:
    """Liveness detector using the Silent-Face MiniFASNet ensemble.

    Loads up to two ONNX exports (MiniFASNetV2 at scale 2.7 and
    MiniFASNetV1SE at scale 4.0). Both are 80x80 BGR-float32-input
    classifiers with 3 output logits where index 1 = "real". When both
    are loaded, softmax outputs are averaged before argmax — the same
    ensembling the upstream `test.py` does.

    Preprocessing matches yakhyo/face-anti-spoofing's reference impl:
    each model gets its own scale-expanded crop centered on the face
    bbox, resized to 80x80, fed straight as float32 BGR (no /255, no
    mean/std). See `_crop_face` for the bbox math.

    A single model also works (the missing one is simply skipped).
    """

    INPUT_SIZE = (80, 80)  # h, w
    REAL_CLASS_IDX = 1

    def __init__(self) -> None:
        self._sessions: list[tuple[Any, float, str, str]] = []  # (session, scale, input_name, output_name)
        self.threshold: float = 0.5

    def load(self, model_paths: list[tuple[str, float]], threshold: float = 0.5) -> None:
        """Load one or more (path, scale) pairs."""
        import onnxruntime as ort

        providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        for path, scale in model_paths:
            session = ort.InferenceSession(path, providers=providers)
            input_name = session.get_inputs()[0].name
            output_name = session.get_outputs()[0].name
            self._sessions.append((session, float(scale), input_name, output_name))
        self.threshold = float(threshold)

    @property
    def loaded(self) -> bool:
        return bool(self._sessions)

    def _crop_face(self, img: np.ndarray, bbox: tuple[float, float, float, float], scale: float) -> np.ndarray:
        # bbox is (x1, y1, x2, y2) in source-image coordinates.
        src_h, src_w = img.shape[:2]
        x1, y1, x2, y2 = bbox
        box_w = max(1.0, x2 - x1)
        box_h = max(1.0, y2 - y1)
        # Clamp scale so the expanded crop fits inside the source image.
        scale = min((src_h - 1) / box_h, (src_w - 1) / box_w, scale)
        new_w = box_w * scale
        new_h = box_h * scale
        cx = x1 + box_w / 2.0
        cy = y1 + box_h / 2.0
        cx1 = max(0, int(cx - new_w / 2.0))
        cy1 = max(0, int(cy - new_h / 2.0))
        cx2 = min(src_w - 1, int(cx + new_w / 2.0))
        cy2 = min(src_h - 1, int(cy + new_h / 2.0))
        cropped = img[cy1 : cy2 + 1, cx1 : cx2 + 1]
        if cropped.size == 0:
            cropped = img
        out_h, out_w = self.INPUT_SIZE
        return cv2.resize(cropped, (out_w, out_h))

    @staticmethod
    def _softmax(x: np.ndarray) -> np.ndarray:
        e = np.exp(x - np.max(x, axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def predict(self, img: np.ndarray, bbox: tuple[float, float, float, float]) -> SpoofResult:
        if not self._sessions:
            raise RuntimeError("Antispoofer.predict called with no models loaded")
        accum = np.zeros((1, 3), dtype=np.float32)
        for session, scale, input_name, output_name in self._sessions:
            face = self._crop_face(img, bbox, scale).astype(np.float32)
            tensor = np.transpose(face, (2, 0, 1))[np.newaxis, ...]
            logits = session.run([output_name], {input_name: tensor})[0]
            accum += self._softmax(logits)
        accum /= float(len(self._sessions))
        real_prob = float(accum[0, self.REAL_CLASS_IDX])
        is_real = int(np.argmax(accum)) == self.REAL_CLASS_IDX and real_prob >= self.threshold
        return SpoofResult(is_real=is_real, score=real_prob)


def _build_antispoofer(options: dict[str, str], model_dir: str | None) -> Antispoofer | None:
    """Instantiate an Antispoofer from option keys, or return None.

    Recognised options:
      antispoof_v2_onnx   — path/filename of MiniFASNetV2 (scale 2.7)
      antispoof_v1se_onnx — path/filename of MiniFASNetV1SE (scale 4.0)
      antispoof_threshold — real-class probability threshold, default 0.5

    Either or both can be provided. Returns None when neither is set.
    """
    pairs: list[tuple[str, float]] = []
    v2 = options.get("antispoof_v2_onnx", "")
    if v2:
        pairs.append((_resolve_model_path(v2, model_dir=model_dir), 2.7))
    v1se = options.get("antispoof_v1se_onnx", "")
    if v1se:
        pairs.append((_resolve_model_path(v1se, model_dir=model_dir), 4.0))
    if not pairs:
        return None
    threshold = float(options.get("antispoof_threshold", "0.5"))
    spoofer = Antispoofer()
    spoofer.load(pairs, threshold=threshold)
    return spoofer


# ─── InsightFaceEngine ────────────────────────────────────────────────
# Canonical ONNX manifest for each upstream insightface pack (v0.7 release
# at github.com/deepinsight/insightface/releases). LocalAI's gallery extracts
# these zips flat into the models directory, so when multiple packs or other
# backends drop their own ONNX files alongside, the glob-the-directory
# approach picks up foreign files and insightface's model_zoo.get_model()
# raises IndexError trying to index `input_shape[2]` on a tensor that isn't
# shaped like a face model. The manifest lets us pre-filter to only the
# files that actually belong to the requested pack — deterministic, correct
# pack choice, no crashes on neighbour ONNX files.
_KNOWN_PACK_MANIFESTS: dict[str, frozenset[str]] = {
    "buffalo_l": frozenset({
        "det_10g.onnx",
        "w600k_r50.onnx",
        "genderage.onnx",
        "2d106det.onnx",
        "1k3d68.onnx",
    }),
    "buffalo_sc": frozenset({
        "det_500m.onnx",
        "w600k_mbf.onnx",
    }),
}


class InsightFaceEngine:
    """Drives insightface's model_zoo directly — no FaceAnalysis wrapper.

    FaceAnalysis is a thin 50-line orchestration (glob for ONNX files
    in `<root>/models/<name>/`, route each through `model_zoo.get_model`,
    build a `{taskname: model}` dict, then loop per-face at inference).
    We reimplement the same loop here so we can:

      1. Load packs from whatever directory LocalAI's gallery extracted
         them into — flat (buffalo_l/s/sc — ONNX at `<dir>/*.onnx`) or
         nested (buffalo_m/antelopev2 — ONNX at `<dir>/<name>/*.onnx`)
         without needing a specific layout on disk.
      2. Skip insightface's built-in auto-download entirely: weight
         delivery is LocalAI's gallery `files:` job now, checksum-
         verified and cached alongside every other managed model.

    The actual inference classes (RetinaFace, ArcFaceONNX, Attribute,
    Landmark) stay in insightface — we only reimplement the ~50 lines
    of glue around them.
    """

    def __init__(self) -> None:
        self.models: dict[str, Any] = {}
        self.det_model: Any = None
        self.model_pack: str = "buffalo_l"
        self.det_size: tuple[int, int] = (640, 640)
        self.det_thresh: float = 0.5
        self._providers: list[str] = ["CPUExecutionProvider"]
        self._antispoofer: Antispoofer | None = None

    def prepare(self, options: dict[str, str]) -> None:
        import glob
        import os

        from insightface.model_zoo import model_zoo

        self.model_pack = options.get("model_pack", "buffalo_l")
        self.det_size = _parse_det_size(options.get("det_size", "640x640"))
        self.det_thresh = float(options.get("det_thresh", "0.5"))
        self._antispoofer = _build_antispoofer(options, options.get("_model_dir"))
        pack_dir = _locate_insightface_pack(options, self.model_pack)
        if pack_dir is None:
            raise ValueError(
                f"no insightface pack '{self.model_pack}' found — install via "
                f"`local-ai models install insightface-{self.model_pack.replace('_', '-')}`"
            )
        onnx_files = sorted(glob.glob(os.path.join(pack_dir, "*.onnx")))
        # When the pack extracts flat into a shared models directory it
        # mixes with ONNX files from other backends (opencv face engine,
        # MiniFASNet antispoof, WeSpeaker voice embedding, other buffalo
        # packs installed earlier). Feeding those into model_zoo.get_model()
        # blows up inside insightface's router — it assumes a 4-D NCHW
        # input and indexes `input_shape[2]` on tensors that aren't shaped
        # like a face model, raising IndexError. For the upstream packs we
        # know the exact ONNX manifest; scoping to it makes the load
        # deterministic (without it, det_10g.onnx from buffalo_l sorts
        # before det_500m.onnx from buffalo_sc and silently wins).
        manifest = _KNOWN_PACK_MANIFESTS.get(self.model_pack)
        if manifest is not None:
            scoped = [f for f in onnx_files if os.path.basename(f) in manifest]
            if scoped:
                onnx_files = scoped
        if not onnx_files:
            raise ValueError(f"no ONNX files in pack directory: {pack_dir}")
        # CUDAExecutionProvider is picked automatically by onnxruntime-gpu
        # when available; falling back to CPU keeps the CPU-only image
        # working. ctx_id=0 means "first GPU if any, else CPU".
        self._providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        self.models = {}
        skipped: list[tuple[str, str]] = []
        for onnx_file in onnx_files:
            try:
                m = model_zoo.get_model(onnx_file, providers=self._providers)
            except Exception as err:
                # Foreign ONNX (wrong rank/shape, non-insightface model) —
                # older insightface versions raise IndexError / ValueError
                # instead of returning None. Keep loading the rest.
                skipped.append((os.path.basename(onnx_file), str(err)))
                continue
            if m is None:
                skipped.append((os.path.basename(onnx_file), "unknown taskname"))
                continue
            # First occurrence of each taskname wins (matches FaceAnalysis).
            if m.taskname not in self.models:
                self.models[m.taskname] = m
        if skipped:
            import sys

            print(
                f"[insightface] skipped {len(skipped)} non-pack ONNX file(s) in {pack_dir}: "
                + ", ".join(f"{n} ({why})" for n, why in skipped),
                file=sys.stderr,
            )
        if "detection" not in self.models:
            raise ValueError(f"no detector (taskname='detection') found in {pack_dir}")
        self.det_model = self.models["detection"]
        self.det_model.prepare(0, input_size=self.det_size, det_thresh=self.det_thresh)
        for name, m in self.models.items():
            if name != "detection":
                m.prepare(0)

    def _faces(self, img: np.ndarray) -> list[Any]:
        """Run detection + all non-detection models per face."""
        if self.det_model is None:
            return []
        from insightface.app.common import Face

        bboxes, kpss = self.det_model.detect(img, max_num=0)
        if bboxes is None or bboxes.shape[0] == 0:
            return []
        faces: list[Any] = []
        for i in range(bboxes.shape[0]):
            bbox = bboxes[i, 0:4]
            det_score = bboxes[i, 4]
            kps = kpss[i] if kpss is not None else None
            face = Face(bbox=bbox, kps=kps, det_score=det_score)
            for name, m in self.models.items():
                if name == "detection":
                    continue
                m.get(img, face)
            faces.append(face)
        return faces

    def detect(self, img: np.ndarray) -> list[FaceDetection]:
        return [
            FaceDetection(
                bbox=tuple(float(v) for v in f.bbox),
                score=float(f.det_score),
                landmarks=np.array(f.kps) if getattr(f, "kps", None) is not None else None,
            )
            for f in self._faces(img)
        ]

    def embed(self, img: np.ndarray) -> np.ndarray | None:
        faces = self._faces(img)
        if not faces:
            return None
        best = max(faces, key=lambda f: float(f.det_score))
        if getattr(best, "normed_embedding", None) is None:
            return None
        return np.asarray(best.normed_embedding, dtype=np.float32)

    def analyze(self, img: np.ndarray) -> list[FaceAttributes]:
        out: list[FaceAttributes] = []
        for f in self._faces(img):
            x1, y1, x2, y2 = (float(v) for v in f.bbox)
            region = (x1, y1, x2 - x1, y2 - y1)
            attrs = FaceAttributes(region=region, face_confidence=float(f.det_score))
            age = getattr(f, "age", None)
            if age is not None:
                attrs.age = float(age)
            gender = getattr(f, "gender", None)
            if gender is not None:
                # genderage head emits argmax, not probabilities —
                # one-hot dict keeps the API stable.
                attrs.dominant_gender = "Man" if int(gender) == 1 else "Woman"
                attrs.gender = {
                    "Man": 1.0 if int(gender) == 1 else 0.0,
                    "Woman": 0.0 if int(gender) == 1 else 1.0,
                }
            out.append(attrs)
        return out

    def antispoof(self, img: np.ndarray, bbox: tuple[float, float, float, float]) -> SpoofResult | None:
        if self._antispoofer is None or not self._antispoofer.loaded:
            return None
        return self._antispoofer.predict(img, bbox)


# ─── OnnxDirectEngine ─────────────────────────────────────────────────
class OnnxDirectEngine:
    """Loads detector + recognizer ONNX files directly.

    Supports the OpenCV Zoo YuNet + SFace pair out of the box. YuNet
    exposes a C++-level API via cv2.FaceDetectorYN which accepts the
    ONNX file directly; SFace is driven through cv2.FaceRecognizerSF.
    Both are Apache 2.0 licensed.
    """

    def __init__(self) -> None:
        self.detector_path: str = ""
        self.recognizer_path: str = ""
        self.input_size: tuple[int, int] = (320, 320)
        self.det_thresh: float = 0.5
        self._detector: Any = None
        self._recognizer: Any = None
        self._antispoofer: Antispoofer | None = None

    def prepare(self, options: dict[str, str]) -> None:
        raw_det = options.get("detector_onnx", "")
        raw_rec = options.get("recognizer_onnx", "")
        if not raw_det or not raw_rec:
            raise ValueError(
                "onnx_direct engine requires both detector_onnx and recognizer_onnx options"
            )
        model_dir = options.get("_model_dir")
        self.detector_path = _resolve_model_path(raw_det, model_dir=model_dir)
        self.recognizer_path = _resolve_model_path(raw_rec, model_dir=model_dir)
        self.input_size = _parse_det_size(options.get("det_size", "320x320"))
        self.det_thresh = float(options.get("det_thresh", "0.5"))
        self._antispoofer = _build_antispoofer(options, model_dir)
        # YuNet is a fixed-size detector; size is reset per detect() call to
        # match the input frame.
        self._detector = cv2.FaceDetectorYN.create(
            self.detector_path,
            "",
            self.input_size,
            score_threshold=self.det_thresh,
            nms_threshold=0.3,
            top_k=5000,
        )
        self._recognizer = cv2.FaceRecognizerSF.create(self.recognizer_path, "")

    def detect(self, img: np.ndarray) -> list[FaceDetection]:
        if self._detector is None:
            return []
        h, w = img.shape[:2]
        self._detector.setInputSize((w, h))
        retval, faces = self._detector.detect(img)
        if faces is None:
            return []
        out: list[FaceDetection] = []
        for row in faces:
            x, y, fw, fh = float(row[0]), float(row[1]), float(row[2]), float(row[3])
            # Landmarks at columns 4..13 are (lx1,ly1,...,lx5,ly5).
            landmarks = np.array(row[4:14], dtype=np.float32).reshape(5, 2) if len(row) >= 14 else None
            score = float(row[-1])
            out.append(FaceDetection(bbox=(x, y, x + fw, y + fh), score=score, landmarks=landmarks))
        return out

    def embed(self, img: np.ndarray) -> np.ndarray | None:
        if self._detector is None or self._recognizer is None:
            return None
        h, w = img.shape[:2]
        self._detector.setInputSize((w, h))
        retval, faces = self._detector.detect(img)
        if faces is None or len(faces) == 0:
            return None
        # Pick the highest-score face (last column is score).
        best = max(faces, key=lambda r: float(r[-1]))
        aligned = self._recognizer.alignCrop(img, best)
        feat = self._recognizer.feature(aligned)
        vec = np.asarray(feat, dtype=np.float32).flatten()
        # SFace outputs a 128-dim feature; L2-normalize to make dot-product
        # comparable to buffalo_l's already-normed 512-dim embedding.
        norm = float(np.linalg.norm(vec))
        if norm == 0:
            return None
        return vec / norm

    def analyze(self, img: np.ndarray) -> list[FaceAttributes]:
        # OpenCV Zoo does not ship a demographic classifier; report
        # only the face-detection regions so callers can still see
        # how many faces were detected.
        return [
            FaceAttributes(
                region=(
                    d.bbox[0],
                    d.bbox[1],
                    d.bbox[2] - d.bbox[0],
                    d.bbox[3] - d.bbox[1],
                ),
                face_confidence=d.score,
            )
            for d in self.detect(img)
        ]

    def antispoof(self, img: np.ndarray, bbox: tuple[float, float, float, float]) -> SpoofResult | None:
        if self._antispoofer is None or not self._antispoofer.loaded:
            return None
        return self._antispoofer.predict(img, bbox)


# ─── helpers ──────────────────────────────────────────────────────────
def _parse_det_size(raw: str) -> tuple[int, int]:
    raw = raw.strip().lower().replace(" ", "")
    if "x" in raw:
        w, h = raw.split("x", 1)
        return (int(w), int(h))
    n = int(raw)
    return (n, n)


def _locate_insightface_pack(options: dict[str, str], name: str) -> str | None:
    """Find the directory holding the insightface pack's ONNX files.

    LocalAI's gallery `files:` extracts the pack zip straight into the
    models directory. Upstream packs are inconsistent:

      buffalo_l/s/sc: flat zip, ONNX lands at `<models_dir>/*.onnx`
      buffalo_m, antelopev2: wrapped zip, ONNX lands at `<models_dir>/<name>/*.onnx`

    We search, in order:

      1. `<models_dir>/<name>/`: the wrapped-zip layout.
      2. `<models_dir>/models/<name>/`: the layout insightface's
         FaceAnalysis auto-download produces (handy for dev environments
         that still have old `~/.insightface` caches).
      3. `<models_dir>/`: the flat-zip layout directly in the models dir.
      4. When the `root` option is set: `<root>/models/<name>/`,
         `<root>/<name>/`, then `<root>/` (with `~` expanded).

    Returns the first directory whose contents include `*.onnx`, or None
    when no candidate qualifies.
    """
    import glob
    import os

    model_dir = options.get("_model_dir") or ""
    explicit_root = options.get("root")
    candidates: list[str] = []
    if model_dir:
        candidates.append(os.path.join(model_dir, name))
        candidates.append(os.path.join(model_dir, "models", name))
        candidates.append(model_dir)
    if explicit_root:
        expanded = os.path.expanduser(explicit_root)
        candidates.append(os.path.join(expanded, "models", name))
        candidates.append(os.path.join(expanded, name))
        candidates.append(expanded)
    for c in candidates:
        if os.path.isdir(c) and glob.glob(os.path.join(c, "*.onnx")):
            return c
    return None


def _resolve_model_path(path: str, model_dir: str | None = None) -> str:
    """Resolve an ONNX file path across the paths LocalAI might deliver it from.

    Search order:
      1. The path itself if it already resolves (absolute, or relative to CWD).
      2. `model_dir` (typically `os.path.dirname(ModelOptions.ModelFile)`) —
         this is how LocalAI surfaces gallery-managed files. When the gallery
         entry lists `files:`, each one lands under the models directory and
         backends load them via filename anchored by ModelFile.
      3. `<script_dir>/<path-without-leading-slash>` — covers dev layouts
         where someone manually dropped weights inside the backend dir.

    If none hit, return the literal input so cv2/insightface surfaces a
    clearer error naming the actually-attempted path.
    """
    import os

    if os.path.isfile(path):
        return path
    stripped = path.lstrip("/")
    candidates: list[str] = []
    if model_dir:
        candidates.append(os.path.join(model_dir, os.path.basename(path)))
        candidates.append(os.path.join(model_dir, stripped))
    script_dir = os.path.dirname(os.path.abspath(__file__))
    candidates.append(os.path.join(script_dir, stripped))
    for c in candidates:
        if os.path.isfile(c):
            return c
    return path


def build_engine(name: str) -> FaceEngine:
    """Factory for the engine selected by LoadModel options."""
    key = name.strip().lower()
    if key in ("", "insightface"):
        return InsightFaceEngine()
    if key in ("onnx_direct", "onnx-direct", "opencv"):
        return OnnxDirectEngine()
    raise ValueError(f"unknown engine: {name!r}")

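The ensembling rule in `Antispoofer.predict` above (softmax per model, average, then argmax plus a threshold on the "real" class) can be illustrated without onnxruntime. The following is a minimal sketch under stated assumptions: `ensemble_is_real` is a hypothetical helper, and hand-written logit lists stand in for real model outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_is_real(per_model_logits, threshold=0.5, real_idx=1):
    """Average per-model softmax outputs and apply the liveness rule.

    Returns (is_real, real_prob). A face counts as real only when the
    "real" class is the argmax AND its averaged probability reaches
    the threshold.
    """
    if not per_model_logits:
        raise ValueError("at least one model output is required")
    probs = [softmax(logits) for logits in per_model_logits]
    avg = [sum(col) / len(probs) for col in zip(*probs)]
    real_prob = avg[real_idx]
    is_real = avg.index(max(avg)) == real_idx and real_prob >= threshold
    return is_real, real_prob
```

The threshold matters with three classes: "real" can be the argmax with an averaged probability well under 0.5, and in that case the face is still rejected.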

@@ -0,0 +1,28 @@
#!/bin/bash
set -e
# Quote paths so the script also works from a directory containing spaces.
backend_dir=$(dirname "$0")
if [ -d "$backend_dir/common" ]; then
    source "$backend_dir/common/libbackend.sh"
else
    source "$backend_dir/../common/libbackend.sh"
fi
installRequirements
# We deliberately do NOT pre-bake any model weights here. Two reasons:
#
# 1. Weights should follow LocalAI's gallery-managed download flow
# like every other backend. For OpenCV Zoo (YuNet + SFace) the
# gallery entries in gallery/index.yaml list the ONNX files via
# `files:` with URI + SHA-256 — LocalAI fetches them into the
# models directory on `local-ai models install`.
#
# 2. For insightface model packs (buffalo_l, buffalo_s, buffalo_m,
#    buffalo_sc, antelopev2), upstream distributes zip archives
#    only (no individual ONNX URLs). The gallery `files:` entries
#    deliver those archives into the models directory as well;
#    engines.py loads the pack from there at LoadModel and does not
#    use insightface's own auto-download machinery.
#
# Net effect: the backend image ships only Python deps (~150MB CPU).

Some files were not shown because too many files have changed in this diff